Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-27 Thread Vincent Guittot
On 25 March 2017 at 04:48, Joel Fernandes  wrote:
> Hi Vincent,
>
> On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot
>  wrote:
> [..]

> So I'm not really aligned with the description of your problem: the
> PELT metric underestimates the load of the CPU.  PELT is just about
> tracking CFS task utilization, not whole-CPU utilization, and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of the CFS tasks
> but from time spent in something else that is not accounted for in
> the value used by schedutil.

 Quite likely. Indeed, it can really be that the CFS task is preempted
 because of some RT activity generated by the IRQ handler.

 More generally, I've also noticed many suboptimal freq switches when
 RT tasks interleave with CFS ones, because of:
 - relatively long down _and up_ throttling times
 - the way schedutil's flags are tracked and updated
 - the callsites from where we call schedutil updates

 For example it can really happen that we are running at the highest
 OPP because of some RT activity. Then we switch back to a relatively
 low utilization CFS workload and then:
 1. a tick happens which produces a frequency drop
>>>
>>> Any idea why this frequency drop would happen? Say a running CFS task
>>> gets preempted by an RT task; the PELT signal shouldn't drop for the
>>> duration the CFS task is preempted because the task is runnable, so
>>
>> utilization only tracks the running state but not the runnable state.
>> Runnable state is tracked in load_avg.
>
> Thanks. I got it now.
>
> Correct me if I'm wrong but, strictly speaking, utilization for a cfs_rq
> (which drives the frequency for CFS) still tracks the blocked/runnable
> time of tasks, although it's decayed as time moves forward. Only when we
> migrate a cfs task away from the rq is its util_avg contribution removed
> from the rq. But I can see now why running RT can decay this load
> tracking signal.

Yes, you're right.


>
> Regards,
> Joel
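
To make the running/runnable distinction above concrete, here is a toy
model. It is not the kernel's ___update_load_avg(); it is a simplified
standalone sketch (invented names, one decay step per period) of why
util fades while a CFS task sits runnable behind an RT task:

#include <stdio.h>

struct toy_pelt {
	double util_sum;	/* time spent actually running */
	double load_sum;	/* time spent runnable, incl. preempted */
};

/* One PELT period (~1ms): decay both sums, accrue where applicable. */
static void toy_pelt_period(struct toy_pelt *p, int running, int runnable)
{
	const double y = 0.97857206;	/* decay factor, y^32 == 0.5 as in PELT */

	p->util_sum = p->util_sum * y + (running ? 1.0 - y : 0.0);
	p->load_sum = p->load_sum * y + (runnable ? 1.0 - y : 0.0);
}

int main(void)
{
	struct toy_pelt p = { .util_sum = 1.0, .load_sum = 1.0 };
	int t;

	/* CFS task preempted by RT for 100 periods: runnable, not running. */
	for (t = 0; t < 100; t++)
		toy_pelt_period(&p, 0, 1);

	printf("util %.3f, load %.3f\n", p.util_sum, p.load_sum);
	/* prints roughly: util 0.115, load 1.000 */
	return 0;
}

load_sum holds at its fixed point of 1.000 while util_sum halves every
32 periods; that decay is exactly what schedutil then acts on.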


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-24 Thread Joel Fernandes
Hi Vincent,

On Thu, Mar 23, 2017 at 3:08 PM, Vincent Guittot
 wrote:
[..]
>>>
 So I'm not really aligned with the description of your problem: the
 PELT metric underestimates the load of the CPU.  PELT is just about
 tracking CFS task utilization, not whole-CPU utilization, and
 according to your description of the problem (time stolen by irq),
 your problem doesn't come from an underestimation of the CFS tasks
 but from time spent in something else that is not accounted for in
 the value used by schedutil.
>>>
>>> Quite likely. Indeed, it can really be that the CFS task is preempted
>>> because of some RT activity generated by the IRQ handler.
>>>
>>> More generally, I've also noticed many suboptimal freq switches when
>>> RT tasks interleave with CFS ones, because of:
>>> - relatively long down _and up_ throttling times
>>> - the way schedutil's flags are tracked and updated
>>> - the callsites from where we call schedutil updates
>>>
>>> For example it can really happen that we are running at the highest
>>> OPP because of some RT activity. Then we switch back to a relatively
>>> low utilization CFS workload and then:
>>> 1. a tick happens which produces a frequency drop
>>
>> Any idea why this frequency drop would happen? Say a running CFS task
>> gets preempted by an RT task; the PELT signal shouldn't drop for the
>> duration the CFS task is preempted because the task is runnable, so
>
> utilization only tracks the running state but not the runnable state.
> Runnable state is tracked in load_avg.

Thanks. I got it now.

Correct me if I'm wrong but, strictly speaking, utilization for a cfs_rq
(which drives the frequency for CFS) still tracks the blocked/runnable
time of tasks, although it's decayed as time moves forward. Only when we
migrate a cfs task away from the rq is its util_avg contribution removed
from the rq. But I can see now why running RT can decay this load
tracking signal.

Regards,
Joel


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-23 Thread Vincent Guittot
On 23 March 2017 at 00:56, Joel Fernandes  wrote:
> On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
>  wrote:
>> On 20-Mar 09:26, Vincent Guittot wrote:
>>> On 20 March 2017 at 04:57, Viresh Kumar  wrote:
>>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>>> >> From: Rafael J. Wysocki 
>>> >>
>>> >> The PELT metric used by the schedutil governor underestimates the
>>> >> CPU utilization in some cases.  The reason for that may be time spent
>>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>>
>>> Are you sure of the root cause described above (time stolen by irq
>>> handler) or is it just a hypothesis? That would be good to be sure of
>>> the root cause.
>>> Furthermore, IIRC the time spent in irq context is also accounted as
>>> run time for the running cfs task, but not as RT and deadline task
>>> running time.
>>
>> As long as the IRQ processing does not generate a context switch,
>> which happens (eventually) if the top half schedules some deferred
>> work to be executed by a bottom half.
>>
>> Thus, I too would say that all the top half time is accounted in
>> PELT, since the current task is still RUNNABLE/RUNNING.
>
> Sorry if I'm missing something but doesn't this depend on whether you
> have CONFIG_IRQ_TIME_ACCOUNTING enabled?
>
> __update_load_avg uses rq->clock_task for deltas, which I think
> shouldn't account for IRQ time with that config option enabled. So it
> should be quite possible for IRQ time to reduce the PELT signal, right?
>
>>
>>> So I'm not really aligned with the description of your problem: the
>>> PELT metric underestimates the load of the CPU.  PELT is just about
>>> tracking CFS task utilization, not whole-CPU utilization, and
>>> according to your description of the problem (time stolen by irq),
>>> your problem doesn't come from an underestimation of the CFS tasks
>>> but from time spent in something else that is not accounted for in
>>> the value used by schedutil.
>>
>> Quite likely. Indeed, it can really be that the CFS task is preempted
>> because of some RT activity generated by the IRQ handler.
>>
>> More generally, I've also noticed many suboptimal freq switches when
>> RT tasks interleave with CFS ones, because of:
>> - relatively long down _and up_ throttling times
>> - the way schedutil's flags are tracked and updated
>> - the callsites from where we call schedutil updates
>>
>> For example it can really happen that we are running at the highest
>> OPP because of some RT activity. Then we switch back to a relatively
>> low utilization CFS workload and then:
>> 1. a tick happens which produces a frequency drop
>
> Any idea why this frequency drop would happen? Say a running CFS task
> gets preempted by an RT task; the PELT signal shouldn't drop for the
> duration the CFS task is preempted because the task is runnable, so

utilization only tracks the running state but not the runnable state.
Runnable state is tracked in load_avg.

> once the CFS task gets the CPU back, schedutil should still maintain
> the capacity, right?
>
> Regards,
> Joel


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-22 Thread Joel Fernandes
On Mon, Mar 20, 2017 at 5:34 AM, Patrick Bellasi
 wrote:
> On 20-Mar 09:26, Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar  wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki 
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases.  The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause described above (time stolen by irq
>> handler) or is it just a hypothesis? That would be good to be sure of
>> the root cause.
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running cfs task, but not as RT and deadline task
>> running time.
>
> As long as the IRQ processing does not generate a context switch,
> which happens (eventually) if the top half schedules some deferred
> work to be executed by a bottom half.
>
> Thus, I too would say that all the top half time is accounted in
> PELT, since the current task is still RUNNABLE/RUNNING.

Sorry if I'm missing something but doesn't this depend on whether you
have CONFIG_IRQ_TIME_ACCOUNTING enabled?

__update_load_avg uses rq->clock_task for deltas, which I think
shouldn't account for IRQ time with that config option enabled. So it
should be quite possible for IRQ time to reduce the PELT signal, right?
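
For reference, this is roughly the mechanism being pointed at: with
CONFIG_IRQ_TIME_ACCOUNTING enabled, the clock that PELT takes its deltas
from has hard/soft IRQ time subtracted. An abridged sketch of
update_rq_clock_task() from kernel/sched/core.c of this era (paravirt
steal-time handling omitted; not verbatim source):

static void update_rq_clock_task(struct rq *rq, s64 delta)
{
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	s64 irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

	/* the portion charged to clock_task can never exceed the raw delta */
	if (irq_delta > delta)
		irq_delta = delta;

	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;	/* IRQ time never reaches clock_task */
#endif
	rq->clock_task += delta;
	/* ... irq_delta may instead feed sched_rt_avg_update() ... */
}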

>
>> So I'm not really aligned with the description of your problem: the
>> PELT metric underestimates the load of the CPU.  PELT is just about
>> tracking CFS task utilization, not whole-CPU utilization, and
>> according to your description of the problem (time stolen by irq),
>> your problem doesn't come from an underestimation of the CFS tasks
>> but from time spent in something else that is not accounted for in
>> the value used by schedutil.
>
> Quite likely. Indeed, it can really be that the CFS task is preempted
> because of some RT activity generated by the IRQ handler.
>
> More generally, I've also noticed many suboptimal freq switches when
> RT tasks interleave with CFS ones, because of:
> - relatively long down _and up_ throttling times
> - the way schedutil's flags are tracked and updated
> - the callsites from where we call schedutil updates
>
> For example it can really happen that we are running at the highest
> OPP because of some RT activity. Then we switch back to a relatively
> low utilization CFS workload and then:
> 1. a tick happens which produces a frequency drop

Any idea why this frequency drop would happen? Say a running CFS task
gets preempted by an RT task; the PELT signal shouldn't drop for the
duration the CFS task is preempted because the task is runnable, so
once the CFS task gets the CPU back, schedutil should still maintain
the capacity, right?

Regards,
Joel


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Patrick Bellasi
On 20-Mar 14:05, Rafael J. Wysocki wrote:
> On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote:
> > On 20-Mar 13:50, Peter Zijlstra wrote:
> > > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > > > From: Rafael J. Wysocki 
> > > > > > 
> > > > > > The PELT metric used by the schedutil governor underestimates the
> > > > > > CPU utilization in some cases.  The reason for that may be time 
> > > > > > spent
> > > > > > in interrupt handlers and similar which is not accounted for by 
> > > > > > PELT.
> > > > > > 
> > > > > > That can be easily demonstrated by running kernel compilation on
> > > > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > > > register.  Namely, the expected result would be that when all CPUs
> > > > > > were 100% busy, all of them would be requested to run in the maximum
> > > > > > P-state, but observation shows that this clearly isn't the case.
> > > > > > The CPUs run in the maximum P-state for a while and then are
> > > > > > requested to run slower and go back to the maximum P-state after
> > > > > > a while again.  That causes the actual frequency of the processor to
> > > > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > > > which clearly is not desirable.
> > > > > > 
> > > > > > To work around this issue use the observation that, from the
> > > > > > schedutil governor's perspective, CPUs that are never idle should
> > > > > > always run at the maximum frequency and make that happen.
> > > > > > 
> > > > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > > > is about to put the given CPU into an idle state.  Next, make the
> > > > > > schedutil governor look at that counter for the current CPU every
> > > > > > time before it is about to start heavy computations.  If the counter
> > > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > > > the CPU has not been idle for at least that long and the governor
> > > > > > will choose the maximum frequency for it without looking at the PELT
> > > > > > metric at all.
> > > > > 
> > > > > Why the time limit?
> > > > 
> > > > One iteration appeared to be a bit too aggressive, but honestly I think
> > > > I need to check again if this thing is regarded as viable at all.
> > > > 
> > > 
> > > I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
> > > just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> > > RCU come to mind as things that might already use something like that.
> > 
> > Maybe the problem is not going down (e.g. when there are only small
> > CFS tasks it makes perfect sense) but instead not being fast enough
> > on ramping up when a new RT task is activated.
> > 
> > And this boils down to two main points:
> > 1) throttling for up transitions is perhaps only harmful
> > 2) the call sites for schedutil updates are not properly positioned
> >    at specific scheduler decision points.
> > 
> > The proposed patch is adding yet another throttling mechanism, perhaps
> > on top of one which already needs to be improved.
> 
> It is not throttling anything.

It's a kind-of...

-   if (flags & SCHED_CPUFREQ_RT_DL) {
+   if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
            next_f = policy->cpuinfo.max_freq;

This check disregards any signal the scheduler can provide.

With such a policy, a 60% CFS task with a 100ms period will end up
running at the highest OPP for just 10% of its entire activation.
Moreover, when it completes, we are likely to enter an idle period
while still remaining at the highest OPP.
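
Spelled out against the patch's 50ms SUGOV_BUSY_THRESHOLD, a
hypothetical timeline for that 60%/100ms task:

    t =   0ms  task wakes, the CPU stops idling, busy tracking is armed
    t =  50ms  no idle call for 50ms: sugov_cpu_is_busy() turns true and
               the frequency is forced to policy->cpuinfo.max_freq
    t =  60ms  the task finishes its 60ms activation and sleeps; the CPU
               idles with the max-OPP request still in force
    t = 100ms  the next period begins and the cycle repeats

So only the last 10ms of each 100ms period is guaranteed the highest
OPP, and the max-OPP request is what accompanies the CPU into idle.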

IMHO the ultimate goal of schedutil should be to be driven by the
scheduler, which has (or can have) all the required information to
support OPP selection.

If something is not working, well, then we should properly fix the
signals and/or provide (at least) a per-task tunable interface.

Adding a hardcoded threshold is an easy fix but it will ultimately
increase the complexity of the governor.

-- 
#include 

Patrick Bellasi


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Rafael J. Wysocki
On Monday, March 20, 2017 01:06:15 PM Patrick Bellasi wrote:
> On 20-Mar 13:50, Peter Zijlstra wrote:
> > On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki 
> > > > > 
> > > > > The PELT metric used by the schedutil governor underestimates the
> > > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > > 
> > > > > That can be easily demonstrated by running kernel compilation on
> > > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > > register.  Namely, the expected result would be that when all CPUs
> > > > > were 100% busy, all of them would be requested to run in the maximum
> > > > > P-state, but observation shows that this clearly isn't the case.
> > > > > The CPUs run in the maximum P-state for a while and then are
> > > > > requested to run slower and go back to the maximum P-state after
> > > > > a while again.  That causes the actual frequency of the processor to
> > > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > > which clearly is not desirable.
> > > > > 
> > > > > To work around this issue use the observation that, from the
> > > > > schedutil governor's perspective, CPUs that are never idle should
> > > > > always run at the maximum frequency and make that happen.
> > > > > 
> > > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > > is about to put the given CPU into an idle state.  Next, make the
> > > > > schedutil governor look at that counter for the current CPU every
> > > > > time before it is about to start heavy computations.  If the counter
> > > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > > the CPU has not been idle for at least that long and the governor
> > > > > will choose the maximum frequency for it without looking at the PELT
> > > > > metric at all.
> > > > 
> > > > Why the time limit?
> > > 
> > > One iteration appeared to be a bit too aggressive, but honestly I think
> > > I need to check again if this thing is regarded as viable at all.
> > > 
> > 
> > I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
> > just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> > RCU come to mind as things that might already use something like that.
> 
> Maybe the problem is not going down (e.g. when there are only small
> CFS tasks it makes perfect sense) but instead not being fast enough
> on ramping up when a new RT task is activated.
> 
> And this boils down to two main points:
> 1) throttling for up transitions is perhaps only harmful
> 2) the call sites for schedutil updates are not properly positioned
>    at specific scheduler decision points.
> 
> The proposed patch is adding yet another throttling mechanism, perhaps
> on top of one which already needs to be improved.

It is not throttling anything.

Thanks,
Rafael



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Vincent Guittot
On 20 March 2017 at 13:59, Rafael J. Wysocki  wrote:
> On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
>> On 20 March 2017 at 04:57, Viresh Kumar  wrote:
>> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> >> From: Rafael J. Wysocki 
>> >>
>> >> The PELT metric used by the schedutil governor underestimates the
>> >> CPU utilization in some cases.  The reason for that may be time spent
>> >> in interrupt handlers and similar which is not accounted for by PELT.
>>
>> Are you sure of the root cause described above (time stolen by irq
>> handler) or is it just a hypothesis? That would be good to be sure of
>> the root cause.
>
> No, I'm not sure.  That's why I said "may be". :-)
>
>> Furthermore, IIRC the time spent in irq context is also accounted as
>> run time for the running cfs task, but not as RT and deadline task
>> running time.
>
> OK
>
> Anyway, the problem is that we have a 100% busy CPU which quite evidently
> is not reported as 100% busy by the metric we use.


>
> Now, if I was sure about the root cause, I would fix it rather than suggest
> workarounds.
>
>> So I'm not really aligned with the description of your problem: the
>> PELT metric underestimates the load of the CPU.  PELT is just about
>> tracking CFS task utilization, not whole-CPU utilization, and
>> according to your description of the problem (time stolen by irq),
>> your problem doesn't come from an underestimation of the CFS tasks
>> but from time spent in something else that is not accounted for in
>> the value used by schedutil.
>
> That's fair enough, but the assumption was that PELT would be sufficient for
> that.  To me, it clearly isn't, so the assumption was incorrect.

So PELT is just about CFS utilization, and we need metrics for the other
sched classes as well. I know that Juri and other people are working on
the deadline part.

In the current kernel, the only available metric is rt_avg (and
scale_rt_capacity), which gives you the available compute capacity for
CFS and, in some way, the amount of capacity used by RT and deadline.
It would be interesting to see if the amount of capacity used by RT
tasks (I assume that you don't have deadline tasks in your use case)
is similar to the amount of underestimation of CFS.
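
For context, a rough sketch of what scale_rt_capacity() computed in
kernels of this era (abridged and simplified, not verbatim kernel
source; READ_ONCE handling and 64-bit division helpers omitted).
rq->rt_avg is a decaying sum of RT/DL/irq time scaled by capacity, so
"used" lands on the same 0..SCHED_CAPACITY_SCALE (1024) scale:

static unsigned long scale_rt_capacity_sketch(struct rq *rq)
{
	u64 avg = rq->rt_avg;	/* decayed RT/DL (+irq) time, capacity-scaled */
	u64 total = sched_avg_period() + (rq_clock(rq) - rq->age_stamp);
	u64 used = avg / total;	/* capacity consumed outside CFS */

	if (used < SCHED_CAPACITY_SCALE)
		return SCHED_CAPACITY_SCALE - used;	/* what is left for CFS */

	return 1;	/* never report zero capacity */
}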

>
>> That would be good to be sure of what is running during this not
>> accounted time and find a way to account them
>
> Yes, I agree.
>
> I'm not sure if I can carry out full investigation of that any time soon, 
> however.

If you can share traces as Patrick proposed, we could estimate how
much time/capacity is used by rt tasks and whether this can explain
the underestimation.

>
> I sent this mostly to let everybody know that there was a problem and how it
> could be worked around.  That's why it is an RFC.

Ok

>
>>
>> >>
>> >> That can be easily demonstrated by running kernel compilation on
>> >> a Sandy Bridge Intel processor, running turbostat in parallel with
>> >> it and looking at the values written to the MSR_IA32_PERF_CTL
>> >> register.  Namely, the expected result would be that when all CPUs
>> >> were 100% busy, all of them would be requested to run in the maximum
>> >> P-state, but observation shows that this clearly isn't the case.
>> >> The CPUs run in the maximum P-state for a while and then are
>> >> requested to run slower and go back to the maximum P-state after
>> >> a while again.  That causes the actual frequency of the processor to
>> >> visibly oscillate below the sustainable maximum in a jittery fashion
>> >> which clearly is not desirable.
>> >>
>> >> To work around this issue use the observation that, from the
>> >> schedutil governor's perspective, CPUs that are never idle should
>> >> always run at the maximum frequency and make that happen.
>> >>
>> >> To that end, add a counter of idle calls to struct sugov_cpu and
>> >> modify cpuidle_idle_call() to increment that counter every time it
>> >> is about to put the given CPU into an idle state.  Next, make the
>> >> schedutil governor look at that counter for the current CPU every
>> >> time before it is about to start heavy computations.  If the counter
>> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
>> >> the CPU has not been idle for at least that long and the governor
>> >> will choose the maximum frequency for it without looking at the PELT
>> >> metric at all.
>> >
>> > Looks like we are fixing a PELT problem with a schedutil Hack :)
>>
>> I would not say that it's a PELT problem (at least based on the
>> current description) but more that we don't track all activities
>> of the CPU.
>
> I generally agree.  PELT does what it does.
>
> However, using PELT the way we do that in schedutil turns out to be 
> problematic.

Yes, PELT is not enough but just part of the equation.

>
> Thanks,
> Rafael
>


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Rafael J. Wysocki
On Monday, March 20, 2017 09:27:45 AM Viresh Kumar wrote:
> On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases.  The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> > 
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> > 
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, CPUs that are never idle should
> > always run at the maximum frequency and make that happen.
> > 
> > To that end, add a counter of idle calls to struct sugov_cpu and
> > modify cpuidle_idle_call() to increment that counter every time it
> > is about to put the given CPU into an idle state.  Next, make the
> > schedutil governor look at that counter for the current CPU every
> > time before it is about to start heavy computations.  If the counter
> > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > the CPU has not been idle for at least that long and the governor
> > will choose the maximum frequency for it without looking at the PELT
> > metric at all.
> 
> Looks like we are fixing a PELT problem with a schedutil Hack :)

Did you notice the "To work around this issue" phrase above?  This is a
workaround, not a fix.

But it is an optimization too, because avoiding heavy computations if they
are not necessary should be a win in any case.

> > Signed-off-by: Rafael J. Wysocki 
> > ---
> >  include/linux/sched/cpufreq.h|6 ++
> >  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/idle.c  |3 +++
> >  3 files changed, 45 insertions(+), 2 deletions(-)
> > 
> > Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> > ===
> > --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> > +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> > @@ -20,6 +20,7 @@
> >  #include "sched.h"
> >  
> >  #define SUGOV_KTHREAD_PRIORITY 50
> > +#define SUGOV_BUSY_THRESHOLD   (50 * NSEC_PER_MSEC)
> >  
> >  struct sugov_tunables {
> > struct gov_attr_set attr_set;
> > @@ -55,6 +56,9 @@ struct sugov_cpu {
> >  
> > unsigned long iowait_boost;
> > unsigned long iowait_boost_max;
> > +   unsigned long idle_calls;
> > +   unsigned long saved_idle_calls;

The above two could be unsigned int, BTW.

> > +   u64 busy_start;
> > u64 last_update;
> >  
> > /* The fields below are only needed when sharing a policy. */
> > @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
> > sg_cpu->iowait_boost >>= 1;
> >  }
> >  
> > +void cpufreq_schedutil_idle_call(void)
> > +{
> > +   struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> > +
> > +   sg_cpu->idle_calls++;
> > +}
> > +
> > +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> > +{
> > +   if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> > +   sg_cpu->busy_start = 0;
> > +   return false;
> > +   }
> > +
> > +   if (!sg_cpu->busy_start) {
> > +   sg_cpu->busy_start = sg_cpu->last_update;
> > +   return false;
> > +   }
> > +
> > +   return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> > +}
> > +
> > +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> > +{
> > +   if (!sg_cpu->busy_start)
> > +   sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
> 
> Why aren't we doing this in sugov_cpu_is_busy() itself?

No, we aren't.

sugov_cpu_is_busy() may not be called at all sometimes.

> And isn't it possible for idle_calls to get incremented by this time?

No, it isn't.

The idle loop does that, and only after it has disabled interrupts.  If
this code is running, we surely are not in there on the same CPU, are we?

> > +}
> > +
> >  static void sugov_update_single(struct update_util_data *hook, u64 time,
> > unsigned int flags)
> >  {
> > @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
> > if (!sugov_should_update_freq(sg_policy, time))
> > return;
> >  
> > -   if (flags & SCHED_CPUFREQ_RT_DL) {
> > +   if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
> >             next_f = policy->cpuinfo.max_freq;
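
The archive cuts the message off here. For reference, the busy-detection
logic quoted above, reassembled as a standalone, compilable sketch
(plain userspace C types and a hypothetical 10ms governor-update
cadence, purely for illustration; not the kernel code itself):

#include <stdio.h>

#define SUGOV_BUSY_THRESHOLD (50ULL * 1000 * 1000)	/* 50ms in ns */

struct sugov_cpu_sim {
	unsigned long idle_calls;		/* bumped by the idle loop */
	unsigned long saved_idle_calls;
	unsigned long long busy_start;		/* ns, 0 == not armed */
	unsigned long long last_update;		/* ns */
};

static int sugov_cpu_is_busy(struct sugov_cpu_sim *sg_cpu)
{
	/* any idle attempt since the last snapshot resets busy tracking */
	if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
		sg_cpu->busy_start = 0;
		return 0;
	}

	/* first update without an intervening idle: arm the busy window */
	if (!sg_cpu->busy_start) {
		sg_cpu->busy_start = sg_cpu->last_update;
		return 0;
	}

	/* busy only after 50ms without a single idle call */
	return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
}

int main(void)
{
	struct sugov_cpu_sim cpu = { 0 };
	unsigned long long t;

	/* governor update every 10ms; the CPU never attempts to idle */
	for (t = 10; t <= 80; t += 10) {
		cpu.last_update = t * 1000 * 1000;
		printf("t=%2llums busy=%d\n", t, sugov_cpu_is_busy(&cpu));
		if (!cpu.busy_start)	/* mirrors sugov_save_idle_calls() */
			cpu.saved_idle_calls = cpu.idle_calls;
	}
	return 0;
}

With no idle calls at all, busy flips to 1 once the window armed at the
first update has been exceeded by more than 50ms, and any single
increment of idle_calls immediately disarms it.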

Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Rafael J. Wysocki
On Monday, March 20, 2017 01:50:09 PM Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki 
> > > > 
> > > > The PELT metric used by the schedutil governor underestimates the
> > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > 
> > > > That can be easily demonstrated by running kernel compilation on
> > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > register.  Namely, the expected result would be that when all CPUs
> > > > were 100% busy, all of them would be requested to run in the maximum
> > > > P-state, but observation shows that this clearly isn't the case.
> > > > The CPUs run in the maximum P-state for a while and then are
> > > > requested to run slower and go back to the maximum P-state after
> > > > a while again.  That causes the actual frequency of the processor to
> > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > which clearly is not desirable.
> > > > 
> > > > To work around this issue use the observation that, from the
> > > > schedutil governor's perspective, CPUs that are never idle should
> > > > always run at the maximum frequency and make that happen.
> > > > 
> > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > is about to put the given CPU into an idle state.  Next, make the
> > > > schedutil governor look at that counter for the current CPU every
> > > > time before it is about to start heavy computations.  If the counter
> > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > the CPU has not been idle for at least that long and the governor
> > > > will choose the maximum frequency for it without looking at the PELT
> > > > metric at all.
> > > 
> > > Why the time limit?
> > 
> > One iteration appeared to be a bit too aggressive, but honestly I think
> > I need to check again if this thing is regarded as viable at all.
> > 
> 
> I don't hate the idea; if we don't hit idle, we shouldn't shift down.

OK

> I just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> RCU come to mind as things that might already use something like that.

NOHZ does that, but I didn't want this to artificially depend on NOHZ.  That
said, yes, we can use that one too.
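
As a hypothetical sketch only (tick_nohz_get_idle_calls() is an assumed
helper exposing the NOHZ per-CPU count of idle entries, not an API that
exists at this point), reusing that counter would shrink the governor
side to roughly:

static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
{
	unsigned long idle_calls = tick_nohz_get_idle_calls();
	bool ret = idle_calls == sg_cpu->saved_idle_calls;

	sg_cpu->saved_idle_calls = idle_calls;
	return ret;
}

Note this variant also drops the 50 ms threshold: the CPU counts as
busy after a single update period without an idle entry, i.e. the more
aggressive behavior questioned above.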



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Rafael J. Wysocki
On Monday, March 20, 2017 09:26:34 AM Vincent Guittot wrote:
> On 20 March 2017 at 04:57, Viresh Kumar  wrote:
> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki 
> >>
> >> The PELT metric used by the schedutil governor underestimates the
> >> CPU utilization in some cases.  The reason for that may be time spent
> >> in interrupt handlers and similar which is not accounted for by PELT.
> 
> Are you sure of the root cause described above (time stolen by irq
> handler) or is it just a hypothesis? That would be good to be sure of
> the root cause

No, I'm not sure.  That's why I said "may be". :-)

> Furthermore, IIRC the time spent in irq context is also accounted as
> run time for the running cfs task but not RT and deadline task running
> time

OK

Anyway, the problem is that we have a 100% busy CPU which quite evidently
is not reported as 100% busy by the metric we use.

Now, if I was sure about the root cause, I would fix it rather than suggest
workarounds.

> So I'm not really aligned with the description of your problem: PELT
> metric underestimates the load of the CPU.  The PELT is just about
> tracking CFS task utilization but not whole CPU utilization and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of CFS task but from
> time spent in something else but not accounted in the value used by
> schedutil

That's fair enough, but the assumption was that PELT would be sufficient for
that.  To me, it clearly isn't, so the assumption was incorrect.

> That would be good to be sure of what is running during this
> unaccounted time and find a way to account it

Yes, I agree.

I'm not sure if I can carry out a full investigation of that any time soon,
however.

I sent this mostly to let everybody know that there was a problem and how it
could be worked around.  That's why it is an RFC.

> 
> >>
> >> That can be easily demonstrated by running kernel compilation on
> >> a Sandy Bridge Intel processor, running turbostat in parallel with
> >> it and looking at the values written to the MSR_IA32_PERF_CTL
> >> register.  Namely, the expected result would be that when all CPUs
> >> were 100% busy, all of them would be requested to run in the maximum
> >> P-state, but observation shows that this clearly isn't the case.
> >> The CPUs run in the maximum P-state for a while and then are
> >> requested to run slower and go back to the maximum P-state after
> >> a while again.  That causes the actual frequency of the processor to
> >> visibly oscillate below the sustainable maximum in a jittery fashion
> >> which clearly is not desirable.
> >>
> >> To work around this issue use the observation that, from the
> >> schedutil governor's perspective, CPUs that are never idle should
> >> always run at the maximum frequency and make that happen.
> >>
> >> To that end, add a counter of idle calls to struct sugov_cpu and
> >> modify cpuidle_idle_call() to increment that counter every time it
> >> is about to put the given CPU into an idle state.  Next, make the
> >> schedutil governor look at that counter for the current CPU every
> >> time before it is about to start heavy computations.  If the counter
> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> >> the CPU has not been idle for at least that long and the governor
> >> will choose the maximum frequency for it without looking at the PELT
> >> metric at all.
> >
> > Looks like we are fixing a PELT problem with a schedutil Hack :)
> 
> I would not say that it's a PELT problem (at least based on the current
> description) but more that we don't track all
> activities of the CPU

I generally agree.  PELT does what it does.

However, using PELT the way we do that in schedutil turns out to be problematic.

Thanks,
Rafael



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Patrick Bellasi
On 20-Mar 13:50, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> > On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki 
> > > > 
> > > > The PELT metric used by the schedutil governor underestimates the
> > > > CPU utilization in some cases.  The reason for that may be time spent
> > > > in interrupt handlers and similar which is not accounted for by PELT.
> > > > 
> > > > That can be easily demonstrated by running kernel compilation on
> > > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > > register.  Namely, the expected result would be that when all CPUs
> > > > were 100% busy, all of them would be requested to run in the maximum
> > > > P-state, but observation shows that this clearly isn't the case.
> > > > The CPUs run in the maximum P-state for a while and then are
> > > > requested to run slower and go back to the maximum P-state after
> > > > a while again.  That causes the actual frequency of the processor to
> > > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > > which clearly is not desirable.
> > > > 
> > > > To work around this issue use the observation that, from the
> > > > schedutil governor's perspective, CPUs that are never idle should
> > > > always run at the maximum frequency and make that happen.
> > > > 
> > > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > > modify cpuidle_idle_call() to increment that counter every time it
> > > > is about to put the given CPU into an idle state.  Next, make the
> > > > schedutil governor look at that counter for the current CPU every
> > > > time before it is about to start heavy computations.  If the counter
> > > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > > the CPU has not been idle for at least that long and the governor
> > > > will choose the maximum frequency for it without looking at the PELT
> > > > metric at all.
> > > 
> > > Why the time limit?
> > 
> > One iteration appeared to be a bit too aggressive, but honestly I think
> > I need to check again if this thing is regarded as viable at all.
> > 
> 
> I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
> just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
> RCU come to mind as things that might already use something like that.

Maybe the problem is not going down (e.g. when there are only small
CFS tasks it makes perfect sense) but instead not being fast enough
on ramping up when a new RT task is activated.

And this boils down to two main points:
1) throttling for up transitions perhaps is only harmful
2) the call sites for schedutil updates are not properly positioned
   at specific scheduler decision points.

The proposed patch is adding yet another throttling mechanism, perhaps
on top of one which already needs to be improved.

-- 
#include <best/regards.h>

Patrick Bellasi


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Peter Zijlstra
On Mon, Mar 20, 2017 at 01:35:12PM +0100, Rafael J. Wysocki wrote:
> On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> > On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki 
> > > 
> > > The PELT metric used by the schedutil governor underestimates the
> > > CPU utilization in some cases.  The reason for that may be time spent
> > > in interrupt handlers and similar which is not accounted for by PELT.
> > > 
> > > That can be easily demonstrated by running kernel compilation on
> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > register.  Namely, the expected result would be that when all CPUs
> > > were 100% busy, all of them would be requested to run in the maximum
> > > P-state, but observation shows that this clearly isn't the case.
> > > The CPUs run in the maximum P-state for a while and then are
> > > requested to run slower and go back to the maximum P-state after
> > > a while again.  That causes the actual frequency of the processor to
> > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > which clearly is not desirable.
> > > 
> > > To work around this issue use the observation that, from the
> > > schedutil governor's perspective, CPUs that are never idle should
> > > always run at the maximum frequency and make that happen.
> > > 
> > > To that end, add a counter of idle calls to struct sugov_cpu and
> > > modify cpuidle_idle_call() to increment that counter every time it
> > > is about to put the given CPU into an idle state.  Next, make the
> > > schedutil governor look at that counter for the current CPU every
> > > time before it is about to start heavy computations.  If the counter
> > > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > > the CPU has not been idle for at least that long and the governor
> > > will choose the maximum frequency for it without looking at the PELT
> > > metric at all.
> > 
> > Why the time limit?
> 
> One iteration appeared to be a bit too aggressive, but honestly I think
> I need to check again if this thing is regarded as viable at all.
> 

I don't hate the idea; if we don't hit idle, we shouldn't shift down. I
just wonder if we don't already keep an idle-seqcount somewhere; NOHZ and
RCU come to mind as things that might already use something like that.



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Rafael J. Wysocki
On Monday, March 20, 2017 11:38:15 AM Peter Zijlstra wrote:
> On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote:
> 
> > Honestly, if the processor had been capable of doing per-core P-states, that
> > would have been a disaster and there are customers who wouldn't look at
> > schedutil again after being confronted with these numbers.
> 
> This, I feel, is a bit overstated.

Maybe a bit, but I'm not really sure ...

> We have bugs, we fix them.

Of course. :-)



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Rafael J. Wysocki
On Monday, March 20, 2017 11:36:45 AM Peter Zijlstra wrote:
> On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases.  The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> > 
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> > 
> > To work around this issue use the observation that, from the
> > schedutil governor's perspective, CPUs that are never idle should
> > always run at the maximum frequency and make that happen.
> > 
> > To that end, add a counter of idle calls to struct sugov_cpu and
> > modify cpuidle_idle_call() to increment that counter every time it
> > is about to put the given CPU into an idle state.  Next, make the
> > schedutil governor look at that counter for the current CPU every
> > time before it is about to start heavy computations.  If the counter
> > has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> > the CPU has not been idle for at least that long and the governor
> > will choose the maximum frequency for it without looking at the PELT
> > metric at all.
> 
> Why the time limit?

One iteration appeared to be a bit too aggressive, but honestly I think
I need to check again if this thing is regarded as viable at all.



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Patrick Bellasi
On 20-Mar 09:26, Vincent Guittot wrote:
> On 20 March 2017 at 04:57, Viresh Kumar  wrote:
> > On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki 
> >>
> >> The PELT metric used by the schedutil governor underestimates the
> >> CPU utilization in some cases.  The reason for that may be time spent
> >> in interrupt handlers and similar which is not accounted for by PELT.
> 
> Are you sure of the root cause described above (time stolen by irq
> handler) or is it just a hypothesis? That would be good to be sure of
> the root cause
> Furthermore, IIRC the time spent in irq context is also accounted as
> run time for the running cfs task but not RT and deadline task running
> time

As long as the IRQ processing does not generate a context switch,
which happens (eventually) if the top half schedules some deferred
work to be executed by a bottom half.

Thus, I too would say that all the top-half time is accounted in
PELT, since the current task is still RUNNABLE/RUNNING.

> So I'm not really aligned with the description of your problem: PELT
> metric underestimates the load of the CPU.  The PELT is just about
> tracking CFS task utilization but not whole CPU utilization and
> according to your description of the problem (time stolen by irq),
> your problem doesn't come from an underestimation of CFS task but from
> time spent in something else but not accounted in the value used by
> schedutil

Quite likely. Indeed, it can really be that the CFS task is preempted
because of some RT activity generated by the IRQ handler.

More generally, I've also noticed many suboptimal freq switches when
RT tasks interleave with CFS ones, because of:
- relatively long down _and up_ throttling times
- the way schedutil's flags are tracked and updated
- the callsites from where we call schedutil updates

For example it can really happen that we are running at the highest
OPP because of some RT activity. Then we switch back to a relatively
low utilization CFS workload and then:
1. a tick happens which produces a frequency drop
2. right after a new IRQ starts a RT task

Well, because of schedutil's throttling mechanism we can end up
running at the reduced OPP for quite long before (eventually) going
back to the higher OPP... and only if we are lucky enough to get a new
RT task starting outside of a throttling window.

I guess that if we have quite a lot of IRQ->RT activations in a CPU
with a relatively low CFS utilization, this unwanted frequency hopping
is quite likely.

> That would be good to be sure of what is running during this not
> accounted time and find a way to account them

Do we have any numbers to characterize the frequency of IRQ/RT
activations?

We should notice also that, while CFS tasks are preempted by RT ones,
the PELT signal decays. This can contribute to lowering even further
the utilization demand from the CFS class.
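
For scale (illustrative arithmetic, not numbers from a trace): the PELT
sum halves roughly every 32 ms, so a CFS task preempted by RT for t ms
sees its tracked utilization scaled by about 0.5^(t/32).  A standalone
computation (build with: cc decay.c -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Decay of an already-built-up PELT signal while not running. */
	for (int t = 0; t <= 64; t += 16)
		printf("preempted %2d ms -> util scaled by ~%.2f\n",
		       t, pow(0.5, t / 32.0));
	return 0;
}

So a single RT burst of around 32 ms already roughly halves the CFS
utilization contribution seen by schedutil.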

> >> That can be easily demonstrated by running kernel compilation on
> >> a Sandy Bridge Intel processor, running turbostat in parallel with
> >> it and looking at the values written to the MSR_IA32_PERF_CTL
> >> register.  Namely, the expected result would be that when all CPUs
> >> were 100% busy, all of them would be requested to run in the maximum
> >> P-state, but observation shows that this clearly isn't the case.

Can you share a trace with at least sched_switch events enabled?
This can help identify which workloads are there and how they
interact with schedutil.

> >> The CPUs run in the maximum P-state for a while and then are
> >> requested to run slower and go back to the maximum P-state after
> >> a while again.  That causes the actual frequency of the processor to

That's possibly exactly the pattern I've described above.

> >> visibly oscillate below the sustainable maximum in a jittery fashion
> >> which clearly is not desirable.
> >>
> >> To work around this issue use the observation that, from the
> >> schedutil governor's perspective, CPUs that are never idle should
> >> always run at the maximum frequency and make that happen.

Perhaps some of the patches I've posted in this series:
   https://lkml.org/lkml/2017/3/2/385
can help on that side.

The pattern you report seems to me to match quite well the case "b"
described in the cover letter. I've also shared a notebook:
   https://gist.github.com/derkling/d6a21b459a18091b2b058668a550010d
which clearly shows that behavior happening in the first plot,
cell named Out[15].

Did you have a chance to have a look at them?

It would be nice to know if they can provide some benefits to your
use-case.

> >> To that end, add a counter of idle calls to struct sugov_cpu and
> >> modify cpuidle_idle_call() to increment that counter every time it
> >> is about to put the given CPU into an idle state.  Next, make the
> >> schedutil governor look at that counter for the current CPU every
> >> time before it is about to start heavy computations.  If the counter
> >> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> >> the CPU has not been idle for at least that long and the governor
> >> will choose the maximum frequency for it without looking at the PELT
> >> metric at all.


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Peter Zijlstra
On Sun, Mar 19, 2017 at 02:34:32PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
> 
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state.  Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations.  If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Why the time limit?


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Peter Zijlstra
On Sun, Mar 19, 2017 at 10:24:24PM +0100, Rafael J. Wysocki wrote:

> Honestly, if the processor had been capable of doing per-core P-states, that
> would have been a disaster and there are customers who wouldn't look at
> schedutil again after being confronted with these numbers.

This, I feel, is a bit overstated. We have bugs, we fix them.


Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-20 Thread Vincent Guittot
On 20 March 2017 at 04:57, Viresh Kumar  wrote:
> On 19-03-17, 14:34, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki 
>>
>> The PELT metric used by the schedutil governor underestimates the
>> CPU utilization in some cases.  The reason for that may be time spent
>> in interrupt handlers and similar which is not accounted for by PELT.

Are you sure of the root cause described above (time stolen by irq
handler) or is it just a hypothesis? That would be good to be sure of
the root cause
Furthermore, IIRC the time spent in irq context is also accounted as
run time for the running cfs task but not RT and deadline task running
time
So I'm not really aligned with the description of your problem: PELT
metric underestimates the load of the CPU.  The PELT is just about
tracking CFS task utilization but not whole CPU utilization and
according to your description of the problem (time stolen by irq),
your problem doesn't come from an underestimation of CFS task but from
time spent in something else but not accounted in the value used by
schedutil

That would be good to be sure of what is running during this
unaccounted time and find a way to account it


>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register.  Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again.  That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> To work around this issue use the observation that, from the
>> schedutil governor's perspective, CPUs that are never idle should
>> always run at the maximum frequency and make that happen.
>>
>> To that end, add a counter of idle calls to struct sugov_cpu and
>> modify cpuidle_idle_call() to increment that counter every time it
>> is about to put the given CPU into an idle state.  Next, make the
>> schedutil governor look at that counter for the current CPU every
>> time before it is about to start heavy computations.  If the counter
>> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
>> the CPU has not been idle for at least that long and the governor
>> will choose the maximum frequency for it without looking at the PELT
>> metric at all.
>
> Looks like we are fixing a PELT problem with a schedutil Hack :)

I would not say that it's a PELT problem (at least based on the current
description) but more that we don't track all
activities of the CPU

>
>> Signed-off-by: Rafael J. Wysocki 
>> ---
>>  include/linux/sched/cpufreq.h    |    6 ++++++
>>  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
>>  kernel/sched/idle.c              |    3 +++
>>  3 files changed, 45 insertions(+), 2 deletions(-)
>>
>> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
>> ===
>> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
>> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
>> @@ -20,6 +20,7 @@
>>  #include "sched.h"
>>
>>  #define SUGOV_KTHREAD_PRIORITY   50
>> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
>>
>>  struct sugov_tunables {
>>   struct gov_attr_set attr_set;
>> @@ -55,6 +56,9 @@ struct sugov_cpu {
>>
>>   unsigned long iowait_boost;
>>   unsigned long iowait_boost_max;
>> + unsigned long idle_calls;
>> + unsigned long saved_idle_calls;
>> + u64 busy_start;
>>   u64 last_update;
>>
>>   /* The fields below are only needed when sharing a policy. */
>> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
>>   sg_cpu->iowait_boost >>= 1;
>>  }
>>
>> +void cpufreq_schedutil_idle_call(void)
>> +{
> >> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
>> +
>> + sg_cpu->idle_calls++;
>> +}
>> +
>> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>> +{
>> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
>> + sg_cpu->busy_start = 0;
>> + return false;
>> + }
>> +
>> + if (!sg_cpu->busy_start) {
>> + sg_cpu->busy_start = sg_cpu->last_update;
>> + return false;
>> + }
>> +
>> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
>> +}
>> +
>> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
>> +{
>> + if (!sg_cpu->busy_start)
>> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
>
> Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible
> for idle_calls to get incremented by this time?

Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-19 Thread Viresh Kumar
On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
> 
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state.  Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations.  If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Looks like we are fixing a PELT problem with a schedutil Hack :)

> Signed-off-by: Rafael J. Wysocki 
> ---
>  include/linux/sched/cpufreq.h    |    6 ++++++
>  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
>  kernel/sched/idle.c              |    3 +++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -20,6 +20,7 @@
>  #include "sched.h"
>  
>  #define SUGOV_KTHREAD_PRIORITY   50
> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
>  
>  struct sugov_tunables {
>   struct gov_attr_set attr_set;
> @@ -55,6 +56,9 @@ struct sugov_cpu {
>  
>   unsigned long iowait_boost;
>   unsigned long iowait_boost_max;
> + unsigned long idle_calls;
> + unsigned long saved_idle_calls;
> + u64 busy_start;
>   u64 last_update;
>  
>   /* The fields below are only needed when sharing a policy. */
> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
>   sg_cpu->iowait_boost >>= 1;
>  }
>  
> +void cpufreq_schedutil_idle_call(void)
> +{
> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> +
> + sg_cpu->idle_calls++;
> +}
> +
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> + sg_cpu->busy_start = 0;
> + return false;
> + }
> +
> + if (!sg_cpu->busy_start) {
> + sg_cpu->busy_start = sg_cpu->last_update;
> + return false;
> + }
> +
> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> +}
> +
> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> +{
> + if (!sg_cpu->busy_start)
> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;

Why aren't we doing this in sugov_cpu_is_busy() itself ? And isn't it possible
for idle_calls to get incremented by this time?

> +}
> +
>  static void sugov_update_single(struct update_util_data *hook, u64 time,
>   unsigned int flags)
>  {
> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
>   if (!sugov_should_update_freq(sg_policy, time))
>   return;
>  
> - if (flags & SCHED_CPUFREQ_RT_DL) {
> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
>   next_f = policy->cpuinfo.max_freq;
>   } else {
>   sugov_get_util(, );
> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
>   next_f = get_next_freq(sg_policy, util, max);
>   }
>   sugov_update_commit(sg_policy, time, next_f);
> + sugov_save_idle_calls(sg_cpu);
>  }
>  
>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
>   sg_cpu->last_update = time;
>  
>   if (sugov_should_update_freq(sg_policy, time)) {
> - if (flags & SCHED_CPUFREQ_RT_DL)
> + if ((flags & SCHED_CPUFREQ_RT_DL) || 

Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-19 Thread Viresh Kumar
On 19-03-17, 14:34, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> To work around this issue use the observation that, from the
> schedutil governor's perspective, CPUs that are never idle should
> always run at the maximum frequency and make that happen.
> 
> To that end, add a counter of idle calls to struct sugov_cpu and
> modify cpuidle_idle_call() to increment that counter every time it
> is about to put the given CPU into an idle state.  Next, make the
> schedutil governor look at that counter for the current CPU every
> time before it is about to start heavy computations.  If the counter
> has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
> the CPU has not been idle for at least that long and the governor
> will choose the maximum frequency for it without looking at the PELT
> metric at all.

Looks like we are fixing a PELT problem with a schedutil Hack :)

> Signed-off-by: Rafael J. Wysocki 
> ---
>  include/linux/sched/cpufreq.h    |    6 ++
>  kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
>  kernel/sched/idle.c              |    3 +++
>  3 files changed, 45 insertions(+), 2 deletions(-)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -20,6 +20,7 @@
>  #include "sched.h"
>  
>  #define SUGOV_KTHREAD_PRIORITY   50
> +#define SUGOV_BUSY_THRESHOLD (50 * NSEC_PER_MSEC)
>  
>  struct sugov_tunables {
>   struct gov_attr_set attr_set;
> @@ -55,6 +56,9 @@ struct sugov_cpu {
>  
>   unsigned long iowait_boost;
>   unsigned long iowait_boost_max;
> + unsigned long idle_calls;
> + unsigned long saved_idle_calls;
> + u64 busy_start;
>   u64 last_update;
>  
>   /* The fields below are only needed when sharing a policy. */
> @@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
>   sg_cpu->iowait_boost >>= 1;
>  }
>  
> +void cpufreq_schedutil_idle_call(void)
> +{
> + struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
> +
> + sg_cpu->idle_calls++;
> +}
> +
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> + if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
> + sg_cpu->busy_start = 0;
> + return false;
> + }
> +
> + if (!sg_cpu->busy_start) {
> + sg_cpu->busy_start = sg_cpu->last_update;
> + return false;
> + }
> +
> + return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
> +}
> +
> +static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
> +{
> + if (!sg_cpu->busy_start)
> + sg_cpu->saved_idle_calls = sg_cpu->idle_calls;

Why aren't we doing this in sugov_cpu_is_busy() itself? And isn't it possible
for idle_calls to get incremented by this time?

> +}
> +
>  static void sugov_update_single(struct update_util_data *hook, u64 time,
>   unsigned int flags)
>  {
> @@ -207,7 +239,7 @@ static void sugov_update_single(struct u
>   if (!sugov_should_update_freq(sg_policy, time))
>   return;
>  
> - if (flags & SCHED_CPUFREQ_RT_DL) {
> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
>   next_f = policy->cpuinfo.max_freq;
>   } else {
>   sugov_get_util(&util, &max);
> @@ -215,6 +247,7 @@ static void sugov_update_single(struct u
>   next_f = get_next_freq(sg_policy, util, max);
>   }
>   sugov_update_commit(sg_policy, time, next_f);
> + sugov_save_idle_calls(sg_cpu);
>  }
>  
>  static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
> @@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
>   sg_cpu->last_update = time;
>  
>   if (sugov_should_update_freq(sg_policy, time)) {
> - if (flags & SCHED_CPUFREQ_RT_DL)
> + if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))

What about the other CPUs in this policy?
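
To make the first point concrete, the bookkeeping could be folded into
sugov_cpu_is_busy() itself with a single snapshot of the counter, which also
closes the window in which idle_calls could advance between the check and the
later save.  A rough sketch of that consolidation (illustrative only, not code
from the patch):

static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
{
	unsigned long idle_calls = sg_cpu->idle_calls;

	if (idle_calls != sg_cpu->saved_idle_calls) {
		/* The CPU went idle since the last check: restart tracking. */
		sg_cpu->saved_idle_calls = idle_calls;
		sg_cpu->busy_start = 0;
		return false;
	}

	if (!sg_cpu->busy_start) {
		/* First busy observation: remember when the streak began. */
		sg_cpu->busy_start = sg_cpu->last_update;
		return false;
	}

	return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
}

With that, sugov_save_idle_calls() and its two call sites would go away.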

Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-19 Thread Rafael J. Wysocki
On Sunday, March 19, 2017 10:24:24 PM Rafael J. Wysocki wrote:
> On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 
> > The PELT metric used by the schedutil governor underestimates the
> > CPU utilization in some cases.  The reason for that may be time spent
> > in interrupt handlers and similar which is not accounted for by PELT.
> > 
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> 
> In case you are wondering about the actual numbers, attached are two turbostat
> log files from two runs of the same workload, without (before.txt) and with
> (after.txt) the patch applied.
> 
> The workload is essentially "make -j 5" in the kernel source tree and the
> machine has SSD storage and a quad-core Intel Sandy Bridge processor.
> The P-states available for each core are between 8 and 31 (0x1f) corresponding
> to 800 MHz and 3.1 GHz, respectively.  All cores can run sustainably at 2.9 GHz
> at the same time, although that is not a guaranteed sustainable frequency
> (it may be dropped occasionally for thermal reasons, for example).
> 
> The interesting columns are Bzy_MHz (and specifically the rows with "-" under
> CPU that correspond to the entire processor), which is the average frequency
> between iterations based on the numbers read from feedback registers, and
> the rightmost one, which is the values written to the P-state request register
> (the 3rd and 4th hex digits from the right represent the requested P-state).
> 
> The turbostat data collection ran every 2 seconds and I looked at the last 30
> iterations in each case corresponding to about 1 minute of the workload run
> during which all of the cores were around 100% busy.
> 
> Now, if you look at after.txt (the run with the patch applied), you'll notice that
> during those last 30 iterations P-state 31 (0x1f) had been requested on all
> cores pretty much 100% of the time (meaning: as expected in that case) and
> the average processor frequency (computed by taking the average from
> all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to
> drop it from 2.9 GHz occasionally).
> 
> In the before.txt case (without the patch) the average frequency over the last
> 30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch
> applied (on the average).

0.08% of course, sorry.  Still visible, though. :-)

Thanks,
Rafael



Re: [RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-19 Thread Rafael J. Wysocki
On Sunday, March 19, 2017 02:34:32 PM Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The PELT metric used by the schedutil governor underestimates the
> CPU utilization in some cases.  The reason for that may be time spent
> in interrupt handlers and similar which is not accounted for by PELT.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.

In case you are wondering about the actual numbers, attached are two turbostat
log files from two runs of the same workload, without (before.txt) and with
(after.txt) the patch applied.

The workload is essentially "make -j 5" in the kernel source tree and the
machine has SSD storage and a quad-core Intel Sandy Bridge processor.
The P-states available for each core are between 8 and 31 (0x1f) corresponding
to 800 MHz and 3.1 GHz, respectively.  All cores can run sustainably at 2.9 GHz
at the same time, although that is not a guaranteed sustainable frequency
(it may be dropped occasionally for thermal reasons, for example).

The interesting columns are Bzy_MHz (and specifically the rows with "-" under
CPU that correspond to the entire processor), which is the average frequency
between iterations based on the numbers read from feedback registers, and
the rightmost one, which is the values written to the P-state request register
(the 3rd and 4th hex digits from the right represent the requested P-state).
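
In other words, decoding a request from the raw register value is just a shift
and a mask, along these lines (a minimal sketch, msr being a placeholder for
the value read from MSR 0x199):

	/*
	 * Bits 15:8 of MSR_IA32_PERF_CTL hold the requested P-state,
	 * e.g. 0x1f00 -> 0x1f, which is 3.1 GHz on this machine.
	 */
	unsigned int requested_pstate = (msr >> 8) & 0xff;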

The turbostat data collection ran every 2 seconds and I looked at the last 30
iterations in each case corresponding to about 1 minute of the workload run
during which all of the cores were around 100% busy.

Now, if you look at after.txt (the run with the patch applied), you'll notice that
during those last 30 iterations P-state 31 (0x1f) had been requested on all
cores pretty much 100% of the time (meaning: as expected in that case) and
the average processor frequency (computed by taking the average from
all of the 30 "-" rows) was 2899.33 MHz (apparently, the hardware decided to
drop it from 2.9 GHz occasionally).

In the before.txt case (without the patch) the average frequency over the last
30 iterations was 2896.90 MHz which is about 0.8% slower than with the patch
applied (on the average).  That already is quite a measurable difference, but it
would have been much worse if the processor had not coordinated P-states in
hardware (such that if any core requested 31, the processor would pick that
one or close to it for the entire package regardless of the requests from the
other cores).

Namely, if you look at the P-states requested for different cores (during the last
30 iterations of the before.txt run), which essentially is what should be used
according to the governor, the average of them is 27.25 (almost 4 bins lower
than the maximum) and the standard deviation is 6, so it is not like they are a
little off occasionally.  At least some of them are way off most of the time.
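
(Those two figures are plain arithmetic over the per-sample P-state requests
pulled out of the log; a self-contained sketch, with made-up sample values
standing in for the real data:)

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical P-state requests decoded from MSR 0x199 samples. */
	static const double pstate[] = { 31, 20, 31, 24, 18, 31, 26, 31 };
	const size_t n = sizeof(pstate) / sizeof(pstate[0]);
	double sum = 0.0, sq = 0.0, mean;
	size_t i;

	for (i = 0; i < n; i++)
		sum += pstate[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		sq += (pstate[i] - mean) * (pstate[i] - mean);

	/* Population mean and standard deviation, as quoted above. */
	printf("mean %.2f stddev %.2f\n", mean, sqrt(sq / n));
	return 0;
}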

Honestly, if the processor had been capable of doing per-core P-states, that
would have been a disaster and there are customers who wouldn't look at
schedutil again after being confronted with these numbers.

So this is rather serious.

BTW, both intel_pstate in the active mode and ondemand request 0x1f on
all cores for that workload, like in the after.txt case.

Thanks,
Rafael
CPUID(7): No-SGX
 CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
   -      93   11.26     826    2494   0x
   0      61    7.56     808    2494   0x0800
   2      90   10.87     832    2494   0x0800
   1     130   15.79     822    2494   0x0800
   3      90   10.83     837    2494   0x0800
 CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
   -      54    6.63     811    2494   0x
   0      46    5.68     816    2494   0x0800
   2      55    6.87     804    2494   0x0800
   1      79    9.77     806    2494   0x0800
   3      35    4.20     828    2494   0x0800
 CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
   -     305   36.58     835    2494   0x
   0     332   39.69     838    2494   0x0800
   2     307   36.95     833    2494   0x0800
   1     248   29.82     832    2494   0x0800
   3     333   39.87     836    2494   0x0c00
 CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz   MSR 0x199
   -     357   38.08     940    2495

[RFC][PATCH 2/2] cpufreq: schedutil: Force max frequency on busy CPUs

2017-03-19 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The PELT metric used by the schedutil governor underestimates the
CPU utilization in some cases.  The reason for that may be time spent
in interrupt handlers and similar which is not accounted for by PELT.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

To work around this issue use the observation that, from the
schedutil governor's perspective, CPUs that are never idle should
always run at the maximum frequency and make that happen.

To that end, add a counter of idle calls to struct sugov_cpu and
modify cpuidle_idle_call() to increment that counter every time it
is about to put the given CPU into an idle state.  Next, make the
schedutil governor look at that counter for the current CPU every
time before it is about to start heavy computations.  If the counter
has not changed for over SUGOV_BUSY_THRESHOLD time (equal to 50 ms),
the CPU has not been idle for at least that long and the governor
will choose the maximum frequency for it without looking at the PELT
metric at all.

Signed-off-by: Rafael J. Wysocki 
---
 include/linux/sched/cpufreq.h    |    6 ++
 kernel/sched/cpufreq_schedutil.c |   38 ++++++++++++++++++++++++++++++++++++--
 kernel/sched/idle.c              |    3 +++
 3 files changed, 45 insertions(+), 2 deletions(-)

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -20,6 +20,7 @@
 #include "sched.h"
 
 #define SUGOV_KTHREAD_PRIORITY 50
+#define SUGOV_BUSY_THRESHOLD   (50 * NSEC_PER_MSEC)
 
 struct sugov_tunables {
struct gov_attr_set attr_set;
@@ -55,6 +56,9 @@ struct sugov_cpu {
 
unsigned long iowait_boost;
unsigned long iowait_boost_max;
+   unsigned long idle_calls;
+   unsigned long saved_idle_calls;
+   u64 busy_start;
u64 last_update;
 
/* The fields below are only needed when sharing a policy. */
@@ -192,6 +196,34 @@ static void sugov_iowait_boost(struct su
sg_cpu->iowait_boost >>= 1;
 }
 
+void cpufreq_schedutil_idle_call(void)
+{
+   struct sugov_cpu *sg_cpu = this_cpu_ptr(&sugov_cpu);
+
+   sg_cpu->idle_calls++;
+}
+
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+   if (sg_cpu->idle_calls != sg_cpu->saved_idle_calls) {
+   sg_cpu->busy_start = 0;
+   return false;
+   }
+
+   if (!sg_cpu->busy_start) {
+   sg_cpu->busy_start = sg_cpu->last_update;
+   return false;
+   }
+
+   return sg_cpu->last_update - sg_cpu->busy_start > SUGOV_BUSY_THRESHOLD;
+}
+
+static void sugov_save_idle_calls(struct sugov_cpu *sg_cpu)
+{
+   if (!sg_cpu->busy_start)
+   sg_cpu->saved_idle_calls = sg_cpu->idle_calls;
+}
+
 static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int flags)
 {
@@ -207,7 +239,7 @@ static void sugov_update_single(struct u
if (!sugov_should_update_freq(sg_policy, time))
return;
 
-   if (flags & SCHED_CPUFREQ_RT_DL) {
+   if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu)) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -215,6 +247,7 @@ static void sugov_update_single(struct u
next_f = get_next_freq(sg_policy, util, max);
}
sugov_update_commit(sg_policy, time, next_f);
+   sugov_save_idle_calls(sg_cpu);
 }
 
 static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu)
@@ -278,12 +311,13 @@ static void sugov_update_shared(struct u
sg_cpu->last_update = time;
 
if (sugov_should_update_freq(sg_policy, time)) {
-   if (flags & SCHED_CPUFREQ_RT_DL)
+   if ((flags & SCHED_CPUFREQ_RT_DL) || sugov_cpu_is_busy(sg_cpu))
next_f = sg_policy->policy->cpuinfo.max_freq;
else
next_f = sugov_next_freq_shared(sg_cpu);
 
sugov_update_commit(sg_policy, time, next_f);
+   sugov_save_idle_calls(sg_cpu);
}
 
raw_spin_unlock(&sg_policy->update_lock);
Index: linux-pm/kernel/sched/idle.c
===
---
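
The message is truncated here in the archive.  Going by the changelog and the
diffstat (6 insertions in include/linux/sched/cpufreq.h, 3 in
kernel/sched/idle.c), the missing hunks presumably amount to declaring the new
hook and calling it from cpuidle_idle_call(), roughly as follows (a
reconstruction, not the verbatim patch):

	/* include/linux/sched/cpufreq.h */
	#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
	void cpufreq_schedutil_idle_call(void);
	#else
	static inline void cpufreq_schedutil_idle_call(void) {}
	#endif

	/* kernel/sched/idle.c, on entry to cpuidle_idle_call() */
	cpufreq_schedutil_idle_call();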