Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-14 Thread Rafael J. Wysocki
On Wed, Nov 14, 2018 at 7:26 AM Doug Smythies  wrote:
>
> On 2018.11.08 00:00 Rafael J. Wysocki wrote:
> > On Wednesday, November 7, 2018 6:04:12 PM CET Doug Smythies wrote:
> >> On 2018.11.04 08:31 Rafael J. Wysocki wrote:
>
> ...[snip]...
> >> The results are:
> >> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
> >> http://fast.smythies.com/linux-pm/k420/histo_compare.htm
>
> ...[snip]...
>
> >> There are some odd long idle durations with TEOv3 for idle
> >> states 1, 2, and 3 that I'll watch for with v4 testing.
> >
> > That unfortunately is a result of bugs in the v4 (and v2 - v3 too).
> >
> > Namely, it doesn't take the cases when the tick has been stopped already
> > into account correctly.  IOW, all of the data points beyond the tick
> > boundary should go into the "final" peak.
> >
> > I'll send a v5.
>
> With v5 there were no long idle durations for idle states 1, 2, and 3 for
> this same Phoronix dbench test.

That's good news, thank you!
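For reference, the accounting rule described above ("all of the data points
beyond the tick boundary should go into the final peak") amounts to roughly
the following minimal sketch; all names (bins, bin_index, tick_boundary_us)
are invented for illustration and this is not the actual v5 code:

        /*
         * Minimal sketch of the accounting rule described above; the
         * bins[] layout and bin_index() helper are hypothetical.
         */
        static void account_wakeup(unsigned int *bins, int nbins,
                                   unsigned int measured_us,
                                   unsigned int tick_boundary_us)
        {
                if (measured_us >= tick_boundary_us) {
                        bins[nbins - 1]++;      /* fold into the "final" peak */
                        return;
                }
                bins[bin_index(measured_us)]++; /* hypothetical ordinary binning */
        }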


RE: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-13 Thread Doug Smythies
On 2018.11.08 00:00 Rafael J. Wysocki wrote:
> On Wednesday, November 7, 2018 6:04:12 PM CET Doug Smythies wrote:
>> On 2018.11.04 08:31 Rafael J. Wysocki wrote:

...[snip]...
>> The results are:
>> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
>> http://fast.smythies.com/linux-pm/k420/histo_compare.htm

...[snip]...

>> There are some odd long idle durations with TEOv3 for idle
>> states 1, 2, and 3 that I'll watch for with v4 testing.
>
> That unfortunately is a result of bugs in the v4 (and v2 - v3 too).
>
> Namely, it doesn't take the cases when the tick has been stopped already
> into account correctly.  IOW, all of the data points beyond the tick boundary
> should go into the "final" peak.
>
> I'll send a v5.

With v5 there were no long idle durations for idle states 1, 2, and 3 for
this same Phoronix dbench test.

... Doug




RE: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-11 Thread Doug Smythies
On 2018.11.10 13:48 Doug Smythies wrote:
> On 2018.11.07 09:04 Doug Smythies wrote:
>
>> The Phoronix dbench test was run under the option to run all
>> the tests, instead of just one number of clients. This was done
>> with a reference/baseline kernel of 4.20-rc1, and also with this
>> TEO version 3 patch. The tests were also repeated with trace
>> enabled for 5000 seconds. Idle information and processor
>> package power were sampled once per minute in all test runs.
>>
>> The results are:
>> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
>> http://fast.smythies.com/linux-pm/k420/histo_compare.htm
>
> Another observation from the data, for the reference/baseline
> 4.20-rc1 kernel's idle state 0 histogram plots, is that there
> are several (280) long idle durations.
>
> For unknown reasons, these are consistently dominated by
> CPU 5 on my system (264 of the 280, in this case).
>
> No other test that I have tried shows this issue,
> but other tests also tend to set the need_resched
> flag, whereas the dbench test doesn't.
>
> Older kernels also have the issue. I tried: 4.19, 4.18,
> 4.17, 4.16+"V9" idle re-work patch set of the time.
> There is no use going back further, because "V9" was
> to address excessively long durations in shallow idle states.
>
> I have not made progress towards determining the root issue.

For whatever reason, sometimes a softirq_entry occurs and the time
until the matching softirq_exit is considerably longer than usual.
Sometimes there is a bunch of interrupts piled up, and sometimes there
is some delay between the softirq_entry and anything else happening,
which perhaps just means I didn't figure out what else to enable
in the trace to observe it.

Anyway, it seems likely that there is no problem here after all.
 
... Doug




RE: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-10 Thread Doug Smythies
On 2018.11.07 09:04 Doug Smythies wrote:

> The Phoronix dbench test was run under the option to run all
> the tests, instead of just one number of clients. This was done
> with a reference/baseline kernel of 4.20-rc1, and also with this
> TEO version 3 patch. The tests were also repeated with trace
> enabled for 5000 seconds. Idle information and processor
> package power were sampled once per minute in all test runs.
>
> The results are:
> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
> http://fast.smythies.com/linux-pm/k420/histo_compare.htm

Another observation from the data, for the reference/baseline
4.20-rc1 kernel's idle state 0 histogram plots, is that there
are several (280) long idle durations.

For unknown reasons, these are consistently dominated by
CPU 5 on my system (264 of the 280, in this case).

No other test that I have tried shows this issue,
but other tests also tend to set the need_resched
flag, whereas the dbench test doesn't.

Older kernels also have the issue. I tried: 4.19, 4.18,
4.17, 4.16+"V9" idle re-work patch set of the time.
There is no use going back further, because "V9" was
to address excessively long durations in shallow idle states.

I have not made progress towards determining the root issue.

... Doug




Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-08 Thread Rafael J. Wysocki
On Wednesday, November 7, 2018 6:04:12 PM CET Doug Smythies wrote:
> On 2018.11.04 08:31 Rafael J. Wysocki wrote:
> 
> > v2 -> v3:
> > * Simplify the pattern detection code and make it return a value
> > lower than the time to the closest timer if the majority of recent
> > idle intervals are below it regardless of their variance (that should
> > cause it to be slightly more aggressive).
> > * Do not count wakeups from state 0 due to the time limit in poll_idle()
> >   as non-timer.
> >
> > Note: I will be mostly offline tomorrow, so this goes slightly early.
> > I have tested it only very lightly, but it is not so much different from
> > the previous one.
> > 
> > It requires the same additional patches to apply as the previous one too.
> 
> Even though this v3 has now been superseded by v4, I completed some test
> work in progress for v3 anyhow.

That's useful anyway, thanks for doing that!

> The main reason to complete the work, and write up, was because, and for my
> own interest as much as anything, I wanted to specifically test for the
> influence of running trace on the system under test.
> Reference: https://marc.info/?l=linux-kernel&m=154145580925439&w=2
> 
> The Phoronix dbench test was run under the option to run all
> the tests, instead of just one number of clients. This was done
> with a reference/baseline kernel of 4.20-rc1, and also with this
> TEO version 3 patch. The tests were also repeated with trace
> enabled for 5000 seconds. Idle information and processor
> package power were sampled once per minute in all test runs.
> 
> The results are:
> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
> http://fast.smythies.com/linux-pm/k420/histo_compare.htm

Thanks a bunch for these!

> Conclusion: trace has negligible effect, until the system gets
> severely overloaded.
> 
> There are some odd long idle durations with TEOv3 for idle
> states 1, 2, and 3 that I'll watch for with v4 testing.

That unfortunately is a result of bugs in the v4 (and v2 - v3 too).

Namely, it doesn't take the cases when the tick has been stopped already
into account correctly.  IOW, all of the data points beyond the tick boundary
should go into the "final" peak.

I'll send a v5.

> Other information:
> Processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
> The kernels were 1000 Hz.
> Idle latency/residency info:
> STATE: state0  DESC: CPUIDLE CORE POLL IDLE  NAME: POLL  LATENCY: 0    RESIDENCY: 0
> STATE: state1  DESC: MWAIT 0x00              NAME: C1    LATENCY: 2    RESIDENCY: 2
> STATE: state2  DESC: MWAIT 0x01              NAME: C1E   LATENCY: 10   RESIDENCY: 20
> STATE: state3  DESC: MWAIT 0x10              NAME: C3    LATENCY: 80   RESIDENCY: 211
> STATE: state4  DESC: MWAIT 0x20              NAME: C6    LATENCY: 104  RESIDENCY: 345

Thanks,
Rafael



RE: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-07 Thread Doug Smythies
On 2018.11.04 08:31 Rafael J. Wysocki wrote:

> v2 -> v3:
> * Simplify the pattern detection code and make it return a value
>   lower than the time to the closest timer if the majority of recent
>   idle intervals are below it regardless of their variance (that should
>   cause it to be slightly more aggressive).
> * Do not count wakeups from state 0 due to the time limit in poll_idle()
>   as non-timer.
>
> Note: I will be mostly offline tomorrow, so this goes slightly early.
> I have tested it only very lightly, but it is not so much different from
> the previous one.
> 
> It requires the same additional patches to apply as the previous one too.

Even though this v3 has now been superseded by v4, I completed some test
work in progress for v3 anyhow.

The main reason to complete the work, and write up, was because, and for my
own interest as much as anything, I wanted to specifically test for the
influence of running trace on the system under test.
Reference: https://marc.info/?l=linux-kernel&m=154145580925439&w=2

The Phoronix dbench test was run under the option to run all
the tests, instead of just one number of clients. This was done
with a reference/baseline kernel of 4.20-rc1, and also with this
TEO version 3 patch. The tests were also repeated with trace
enabled for 5000 seconds. Idle information and processor
package power were sampled once per minute in all test runs.

The results are:
http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
http://fast.smythies.com/linux-pm/k420/histo_compare.htm

Conclusion: trace has negligible effect, until the system gets
severely overloaded.

There are some odd long idle durations with TEOv3 for idle
states 1, 2, and 3 that I'll watch for with v4 testing.

Other information:
Processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
The kernels were 1000 Hz.
Idle latency/residency info:
STATE: state0  DESC: CPUIDLE CORE POLL IDLE  NAME: POLL  LATENCY: 0    RESIDENCY: 0
STATE: state1  DESC: MWAIT 0x00              NAME: C1    LATENCY: 2    RESIDENCY: 2
STATE: state2  DESC: MWAIT 0x01              NAME: C1E   LATENCY: 10   RESIDENCY: 20
STATE: state3  DESC: MWAIT 0x10              NAME: C3    LATENCY: 80   RESIDENCY: 211
STATE: state4  DESC: MWAIT 0x20              NAME: C6    LATENCY: 104  RESIDENCY: 345
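
For context on how these numbers get used: a cpuidle governor typically picks
the deepest state whose target residency fits the predicted idle time and
whose exit latency fits the allowed wakeup latency. A simplified sketch
follows (not the actual TEO selection code; disabled-state handling is
omitted and the struct is invented to keep it self-contained):

        struct idle_state_info {
                unsigned int target_residency_us;
                unsigned int exit_latency_us;
        };

        /* Pick the deepest state that fits both constraints. */
        static int pick_state(const struct idle_state_info *states, int count,
                              unsigned int predicted_us,
                              unsigned int latency_limit_us)
        {
                int i, idx = 0;

                for (i = 1; i < count; i++) {
                        if (states[i].target_residency_us > predicted_us)
                                break;
                        if (states[i].exit_latency_us > latency_limit_us)
                                break;
                        idx = i;
                }
                return idx;
        }

With the table above and no latency constraint, a predicted idle time of
100 us would select C1E (state2), since C3 needs 211 us of residency to pay off.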

... Doug
 



Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-07 Thread Rafael J. Wysocki
On Wed, Nov 7, 2018 at 9:59 AM Peter Zijlstra  wrote:
>
> On Wed, Nov 07, 2018 at 12:39:31AM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 6, 2018 at 8:51 PM Peter Zijlstra  wrote:
> > >
> > > On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
> > > > On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:
> > >
> > > > > Instead of this detector; why haven't you used the code from
> > > > > kernel/irq/timings.c ?
> > > >
> > > > Because it doesn't help much AFAICS.
> > > >
> > > > Wakeups need not be interrupts in particular
> > >
> > > You're alluding to the MWAIT wakeup through the MONITOR address ?
> >
> > Yes.
>
> Right, those will not be accounted for and will need something else.

Precisely. :-)

> > > > and interrupt patterns that show up when the CPU is busy may not be
> > > > relevant for when it is idle.
> > >
> > > I think that is not always true; consider things like the periodic
> > > interrupt from frame rendering or audio; if there is nothing more going
> > > on in the system than say playing your favourite tune, it gets the
> > > 'need more data soon' interrupt from the audio card, wakes up, does a
> > > little mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
> > > sleep. Same for video playback I assume, the vsync interrupt for buffer
> > > flips is fairly predictable.
> > >
> > > The interrupt predictor we have in kernel/irq/timings.c should be very
> > > accurate in predicting those interrupts.
> >
> > In the above case the interrupts should produce a detectable pattern
> > of wakeups anyway.
>
> Ah, not so. Suppose you have both the audio and video interrupt going at
> a steady rate but different rate, then the combined pattern isn't
> trivial at all.

It isn't trivial, but it will appear as a "pattern" to the pattern
detector in v4 of the patch.

Basically, all of that is about looking for a reason to select a
shallower idle state than indicated by the time till the closest timer
alone.

From that standpoint it makes sense to look at several most recent
wakeups and see if a pattern is forming in there, which is what the
code in v4 of the patch does.  It also makes sense to look at
interrupt patterns, but that is costly, so the overhead needs to be
justified.  The question to me is whether or not there is any
additional value in that given that the most recent wakeups are (and
IMO need to be) taken into consideration anyway.

> > In general, however, I need to be convinced that interrupts that
> > didn't wake up the CPU from idle are relevant for next wakeup
> > prediction.  I see that this may be the case, but to what extent is
> > rather unclear to me and it looks like calling
> > irq_timings_next_event() would add considerable overhead.
>
> How about we add a (debug) knob so that people can play with it for now?
> If it turns out to be useful, we'll learn.

So I'm inclined to try to invoke irq_timings_next_event() in addition
to looking at the most recent wakeups and see how far that can get us.

With some extra instrumentation in place it should be possible to
verify how much value that brings to the table.
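
As a sketch of what that combination might look like (rough illustration
only, not kernel code: predict_from_recent_wakeups() is an invented
placeholder, while irq_timings_next_event() is the entry point in
kernel/irq/timings.c and is only built with CONFIG_IRQ_TIMINGS):

        /* Use whichever source predicts the earlier wakeup. */
        static u64 combined_next_wakeup(u64 now)
        {
                u64 irq_next = irq_timings_next_event(now);
                u64 recent_next = now + predict_from_recent_wakeups();

                return min(irq_next, recent_next);
        }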


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-07 Thread Daniel Lezcano
On 07/11/2018 00:39, Rafael J. Wysocki wrote:
> On Tue, Nov 6, 2018 at 8:51 PM Peter Zijlstra  wrote:
>>
>> On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
>>> On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:
>>
 Instead of this detector; why haven't you used the code from
 kernel/irq/timings.c ?
>>>
>>> Because it doesn't help much AFAICS.
>>>
>>> Wakeups need not be interrupts in particular
>>
>> You're alluding to the MWAIT wakeup through the MONITOR address ?
> 
> Yes.
> 
>>> and interrupt patterns that show up when the CPU is busy may not be
>>> relevant for when it is idle.
>>
>> I think that is not always true; consider things like the periodic
>> interrupt from frame rendering or audio; if there is nothing more going
>> on in the system than say playing your favourite tune, it gets the
>> 'need more data soon' interrupt from the audio card, wakes up, does a little
>> mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
>> sleep. Same for video playback I assume, the vsync interrupt for buffer
>> flips is fairly predictable.
>>
>> The interrupt predictor we have in kernel/irq/timings.c should be very
>> accurate in predicting those interrupts.
> 
> In the above case the interrupts should produce a detectable pattern
> of wakeups anyway.
> 
> In general, however, I need to be convinced that interrupts that
> didn't wake up the CPU from idle are relevant for next wakeup
> prediction.  I see that this may be the case, but to what extent is
> rather unclear to me and it looks like calling
> irq_timings_next_event() would add considerable overhead.

I agree the current irq next event code can add overhead. It actually
works well when interrupts arrive at regular intervals, but with
irregular intervals we need to search for a pattern.

The first version, called 'anova', which I shared with you (Peter and
Rafael) privately, is pretty good at detecting such intervals, but the
algorithm has a computational cost in the self-learning phase. I dropped
that approach as it could be overkill.

I'm working on a different algorithm to detect such patterns which
should be faster.

That said, the pattern detection process shouldn't happen every time
we enter idle, but only when the pattern is invalidated. As long as the
pattern keeps giving successful predictions, we keep it and use its
repeating period to give the next event.

So IMO, the cost to detect the pattern should be compensated by the
computation saving we do if the pattern is found successfully.
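
In other words, something along these lines (a hypothetical sketch of the
caching policy, all names invented; the detector itself is stubbed out):

        #include <stdbool.h>
        #include <stdint.h>

        struct pattern {
                bool valid;
                uint64_t period;
                uint64_t last_event;
        };

        /* Stand-in for the expensive self-learning pass described above. */
        static void detect_pattern(struct pattern *p)
        {
                p->valid = false;       /* stub: a real detector would fit a period */
        }

        /* Cheap path while the cached pattern keeps predicting correctly. */
        static uint64_t predict_next_event(struct pattern *p)
        {
                if (!p->valid)
                        detect_pattern(p);      /* only re-learn after invalidation */
                return p->valid ? p->last_event + p->period : 0;
        }

        /* On wakeup: a prediction that missed by more than 'slack'
         * invalidates the cached pattern, triggering re-detection on
         * the next idle entry. */
        static void validate_pattern(struct pattern *p, uint64_t predicted,
                                     uint64_t actual, uint64_t slack)
        {
                uint64_t diff = actual > predicted ? actual - predicted
                                                   : predicted - actual;

                if (diff > slack)
                        p->valid = false;
        }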

These are all assumptions, hopefully I can provide a patch series and
numbers in January ...



-- 
  Linaro.org │ Open source software for ARM SoCs




Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-07 Thread Peter Zijlstra
On Wed, Nov 07, 2018 at 12:39:31AM +0100, Rafael J. Wysocki wrote:
> On Tue, Nov 6, 2018 at 8:51 PM Peter Zijlstra  wrote:
> >
> > On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
> > > On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  
> > > wrote:
> >
> > > > Instead of this detector; why haven't you used the code from
> > > > kernel/irq/timings.c ?
> > >
> > > Because it doesn't help much AFAICS.
> > >
> > > Wakeups need not be interrupts in particular
> >
> > You're alluding to the MWAIT wakeup through the MONITOR address ?
> 
> Yes.

Right, those will not be accounted for and will need something else.

> > > and interrupt patterns that show up when the CPU is busy may not be
> > > relevant for when it is idle.
> >
> > I think that is not always true; consider things like the periodic
> > interrupt from frame rendering or audio; if there is nothing more going
> > on in the system than say playing your favourite tune, it gets the
> > 'need more data soon' interrupt from the audio card, wakes up, does a little
> > mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
> > sleep. Same for video playback I assume, the vsync interrupt for buffer
> > flips is fairly predictable.
> >
> > The interrupt predictor we have in kernel/irq/timings.c should be very
> > accurate in predicting those interrupts.
> 
> In the above case the interrupts should produce a detectable pattern
> of wakeups anyway.

Ah, not so. Suppose you have both the audio and the video interrupt
going at steady but different rates; then the combined pattern isn't
trivial at all.
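
A quick illustration of why the combined stream is non-trivial (standalone
C, with made-up periods of roughly 2.9 ms for an audio refill interrupt and
16.7 ms for a 60 Hz vsync):

        #include <stdio.h>

        /* Merge two strictly periodic event streams and print the
         * inter-event gaps the CPU would actually see. */
        int main(void)
        {
                const unsigned int audio = 2900, video = 16700; /* periods in us */
                unsigned int a = audio, v = video, prev = 0;
                int i;

                for (i = 0; i < 12; i++) {
                        unsigned int next = a < v ? a : v;

                        printf("wakeup at %6u us, gap %5u us\n", next, next - prev);
                        prev = next;
                        if (a <= v)
                                a += audio;
                        else
                                v += video;
                }
                return 0;
        }

The resulting gap sequence (2900, 2900, ..., 2200, 700, 2900, ...) has no
single short repeating interval, even though each source alone is perfectly
regular.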

> In general, however, I need to be convinced that interrupts that
> didn't wake up the CPU from idle are relevant for next wakeup
> prediction.  I see that this may be the case, but to what extent is
> rather unclear to me and it looks like calling
> irq_timings_next_event() would add considerable overhead.

How about we add a (debug) knob so that people can play with it for now?
If it turns out to be useful, we'll learn.


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 8:51 PM Peter Zijlstra  wrote:
>
> On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:
>
> > > Instead of this detector; why haven't you used the code from
> > > kernel/irq/timings.c ?
> >
> > Because it doesn't help much AFAICS.
> >
> > Wakeups need not be interrupts in particular
>
> You're alluding to the MWAIT wakeup through the MONITOR address ?

Yes.

> > and interrupt patterns that show up when the CPU is busy may not be
> > relevant for when it is idle.
>
> I think that is not always true; consider things like the periodic
> interrupt from frame rendering or audio; if there is nothing more going
> on in the system than say playing your favourite tune, it gets the
> 'need more data soon' interrupt from the audio card, wakes up, does a little
> mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
> sleep. Same for video playback I assume, the vsync interrupt for buffer
> flips is fairly predictable.
>
> The interrupt predictor we have in kernel/irq/timings.c should be very
> accurate in predicting those interrupts.

In the above case the interrupts should produce a detectable pattern
of wakeups anyway.

In general, however, I need to be convinced that interrupts that
didn't wake up the CPU from idle are relevant for next wakeup
prediction.  I see that this may be the case, but to what extent is
rather unclear to me and it looks like calling
irq_timings_next_event() would add considerable overhead.


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Peter Zijlstra
On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
> On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:

> > Instead of this detector; why haven't you used the code from
> > kernel/irq/timings.c ?
> 
> Because it doesn't help much AFAICS.
> 
> Wakeups need not be interrupts in particular 

You're alluding to the MWAIT wakeup through the MONITOR address ?

> and interrupt patterns that show up when the CPU is busy may not be
> relevant for when it is idle.

I think that is not always true; consider things like the periodic
interrupt from frame rendering or audio; if there is nothing more going
on in the system than say playing your favourite tune, it gets the
'need more data soon' interrupt from the audio card, wakes up, does a little
mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
sleep. Same for video playback I assume, the vsync interrupt for buffer
flips is fairly predictable.

The interrupt predictor we have in kernel/irq/timings.c should be very
accurate in predicting those interrupts.




Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:
>
> On Sun, Nov 04, 2018 at 05:31:20PM +0100, Rafael J. Wysocki wrote:
> > + * - If there is a pattern of 5 or more recent non-timer wakeups earlier than
> > + *   the closest timer event, expect one more of them to occur and use the
> > + *   average of the idle duration values corresponding to them to select an
> > + *   idle state for the CPU.
>
>
> > +/**
> > + * teo_idle_duration - Estimate the duration of the upcoming CPU idle time.
> > + * @drv: cpuidle driver containing state data.
> > + * @cpu_data: Governor data for the target CPU.
> > + * @sleep_length_us: Time till the closest timer event in microseconds.
> > + */
> > +unsigned int teo_idle_duration(struct cpuidle_driver *drv,
> > +                               struct teo_cpu *cpu_data,
> > +                               unsigned int sleep_length_us)
> > +{
> > +       u64 range, max_spread, max, sum;
> > +       unsigned int count;
> > +
> > +       /*
> > +        * If the sleep length is below the target residency of idle state 1,
> > +        * the only viable choice is to select the first available (enabled)
> > +        * idle state, so return immediately in that case.
> > +        */
> > +       if (sleep_length_us < drv->states[1].target_residency)
> > +               return sleep_length_us;
> > +
> > +       /*
> > +        * The purpose of this function is to check if there is a pattern of
> > +        * wakeups indicating that it would be better to select a state
> > +        * shallower than the deepest one matching the sleep length or the
> > +        * deepest one at all if the sleep length is long.  Larger idle
> > +        * duration values are beyond the interesting range.
> > +        *
> > +        * Narrowing the range of interesting values down upfront also helps
> > +        * to avoid overflows during the computation below.
> > +        */
> > +       range = drv->states[drv->state_count-1].target_residency;
> > +       range = min_t(u64, sleep_length_us, range + (range >> 2));
> > +
> > +       /*
> > +        * This is the value to compare with the distance between the average
> > +        * and the greatest sample to decide whether or not it is small
> > +        * enough.  Take 10 us as the total cap of it.
> > +        */
> > +       max_spread = max_t(u64, range >> MAX_SPREAD_SHIFT, 10);
> > +
> > +       max = range;
> > +
> > +       do {
> > +               u64 cap = max;
> > +               int i;
> > +
> > +               /*
> > +                * Compute the sum of the saved intervals below the cap.
> > +                * Count them and find the maximum interval below the cap.
> > +                */
> > +               count = 0;
> > +               sum = 0;
> > +               max = 0;
> > +
> > +               for (i = 0; i < INTERVALS; i++) {
> > +                       u64 val = cpu_data->intervals[i];
> > +
> > +                       if (val >= cap)
> > +                               continue;
> > +
> > +                       count++;
> > +                       sum += val;
> > +                       if (max < val)
> > +                               max = val;
> > +               }
> > +
> > +               /*
> > +                * Give up if the total number of interesting samples is too
> > +                * small.
> > +                */
> > +               if (cap == range && count <= INTERVALS / 2)
> > +                       return sleep_length_us;
> > +
> > +               /*
> > +                * If the distance between the max and the average is too
> > +                * large, discard the max and repeat.
> > +                */
> > +       } while (count > 3 && max > max_spread && (max - max_spread) * count > sum);
> > +
> > +       return div64_u64(sum, count);
> > +}
>
> Instead of this detector; why haven't you used the code from
> kernel/irq/timings.c ?

Because it doesn't help much AFAICS.

Wakeups need not be interrupts in particular and interrupt patterns
that show up when the CPU is busy may not be relevant for when it is
idle.
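
For readers who want to play with the detector quoted above, here is a
standalone userspace re-implementation (illustrative only: the idle-state-1
early return is dropped, min_t/max_t/div64_u64 are replaced with plain C,
and the MAX_SPREAD_SHIFT value of 3 is an assumption):

        #include <stdio.h>
        #include <stdint.h>

        #define INTERVALS 8

        static unsigned int estimate(const uint64_t *intervals,
                                     uint64_t sleep_length_us,
                                     uint64_t deepest_residency_us)
        {
                uint64_t range, max_spread, max, sum;
                unsigned int count;

                range = deepest_residency_us + (deepest_residency_us >> 2);
                if (range > sleep_length_us)
                        range = sleep_length_us;

                max_spread = range >> 3;        /* MAX_SPREAD_SHIFT == 3 assumed */
                if (max_spread < 10)
                        max_spread = 10;

                max = range;
                do {
                        uint64_t cap = max;
                        int i;

                        count = 0;
                        sum = 0;
                        max = 0;
                        for (i = 0; i < INTERVALS; i++) {
                                uint64_t val = intervals[i];

                                if (val >= cap)
                                        continue;
                                count++;
                                sum += val;
                                if (max < val)
                                        max = val;
                        }
                        /* Too few samples below the initial cap: no usable pattern. */
                        if (cap == range && count <= INTERVALS / 2)
                                return sleep_length_us;
                        /* Keep discarding the max while it is too far from the mean. */
                } while (count > 3 && max > max_spread && (max - max_spread) * count > sum);

                return (unsigned int)(sum / count);
        }

        int main(void)
        {
                /* Six ~50 us wakeups plus two outliers; the outliers get discarded. */
                uint64_t intervals[INTERVALS] = { 50, 55, 48, 52, 400, 51, 49, 380 };

                printf("estimated idle duration: %u us\n",
                       estimate(intervals, 10000, 345));       /* prints 50 */
                return 0;
        }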


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Peter Zijlstra
On Sun, Nov 04, 2018 at 05:31:20PM +0100, Rafael J. Wysocki wrote:
> + * - If there is a pattern of 5 or more recent non-timer wakeups earlier than
> + *   the closest timer event, expect one more of them to occur and use the
> + *   average of the idle duration values corresponding to them to select an
> + *   idle state for the CPU.


> +/**
> + * teo_idle_duration - Estimate the duration of the upcoming CPU idle time.
> + * @drv: cpuidle driver containing state data.
> + * @cpu_data: Governor data for the target CPU.
> + * @sleep_length_us: Time till the closest timer event in microseconds.
> + */
> +unsigned int teo_idle_duration(struct cpuidle_driver *drv,
> +                               struct teo_cpu *cpu_data,
> +                               unsigned int sleep_length_us)
> +{
> +       u64 range, max_spread, max, sum;
> +       unsigned int count;
> +
> +       /*
> +        * If the sleep length is below the target residency of idle state 1,
> +        * the only viable choice is to select the first available (enabled)
> +        * idle state, so return immediately in that case.
> +        */
> +       if (sleep_length_us < drv->states[1].target_residency)
> +               return sleep_length_us;
> +
> +       /*
> +        * The purpose of this function is to check if there is a pattern of
> +        * wakeups indicating that it would be better to select a state
> +        * shallower than the deepest one matching the sleep length or the
> +        * deepest one at all if the sleep length is long.  Larger idle
> +        * duration values are beyond the interesting range.
> +        *
> +        * Narrowing the range of interesting values down upfront also helps
> +        * to avoid overflows during the computation below.
> +        */
> +       range = drv->states[drv->state_count-1].target_residency;
> +       range = min_t(u64, sleep_length_us, range + (range >> 2));
> +
> +       /*
> +        * This is the value to compare with the distance between the average
> +        * and the greatest sample to decide whether or not it is small
> +        * enough.  Take 10 us as the total cap of it.
> +        */
> +       max_spread = max_t(u64, range >> MAX_SPREAD_SHIFT, 10);
> +
> +       max = range;
> +
> +       do {
> +               u64 cap = max;
> +               int i;
> +
> +               /*
> +                * Compute the sum of the saved intervals below the cap.
> +                * Count them and find the maximum interval below the cap.
> +                */
> +               count = 0;
> +               sum = 0;
> +               max = 0;
> +
> +               for (i = 0; i < INTERVALS; i++) {
> +                       u64 val = cpu_data->intervals[i];
> +
> +                       if (val >= cap)
> +                               continue;
> +
> +                       count++;
> +                       sum += val;
> +                       if (max < val)
> +                               max = val;
> +               }
> +
> +               /*
> +                * Give up if the total number of interesting samples is too
> +                * small.
> +                */
> +               if (cap == range && count <= INTERVALS / 2)
> +                       return sleep_length_us;
> +
> +               /*
> +                * If the distance between the max and the average is too
> +                * large, discard the max and repeat.
> +                */
> +       } while (count > 3 && max > max_spread && (max - max_spread) * count > sum);
> +
> +       return div64_u64(sum, count);
> +}

Instead of this detector; why haven't you used the code from
kernel/irq/timings.c ?


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
On Monday, November 5, 2018 8:32:51 PM CET Giovanni Gherdovich wrote:
> On Sun, 2018-11-04 at 17:31 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 

[cut]

> > +
> > +/**
> > + * teo_update - Update CPU data after wakeup.
> > + * @drv: cpuidle driver containing state data.
> > + * @dev: Target CPU.
> > + */
> > +static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> > +{
> > +   struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
> > +   unsigned int sleep_length_us = ktime_to_us(cpu_data->sleep_length_ns);
> > +   int i, idx_hit = -1, idx_timer = -1;
> > +   unsigned int measured_us;
> > +
> > +   if (cpu_data->time_span_ns == cpu_data->sleep_length_ns) {
> > +   /* One of the safety nets has triggered (most likely). */
> > +   measured_us = sleep_length_us;
> > +   } else {
> > +   measured_us = dev->last_residency;
> > +   i = cpu_data->last_state;
> > +   if (measured_us >= 2 * drv->states[i].exit_latency)
> > +   measured_us -= drv->states[i].exit_latency;
> > +   else
> > +   measured_us /= 2;
> > +   }
> > +
> 
> I haven't read this v3 yet,

The pattern detection code in it has a minor issue that needs addressing, so
I'm going to send a v4 later today (after testing it somewhat).

> so just a little comment on the bit above (which
> is there since v1).

To be precise, this piece is there in the menu governor too and I thought that
it made sense, so I copied it from there. :-)

> When you check for measured_us >= 2 * exit_latency, is that because
> dev->last_residency is composed of an "entry" latency, then the actual
> residency, and finally the exit_latency? I'm asking about the 2x factor
> there.

If measured_us is less than twice the state's exit latency, the difference
between it and the exit latency may be very small, so it is better to take
half of measured_us instead as an upper-bound estimate of the idle duration,
in case there are inaccuracies in the measurement.
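
For example, with an exit latency of 10 us: a last_residency of 25 us gives
measured_us = 25 - 10 = 15 us, whereas a last_residency of 15 us falls below
the 2x threshold and gives measured_us = 15 / 2 = 7 us (integer division).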

> If that succeeds, you proceed to remove the exit_latency from
> measured_us... just once. Given how the condition is formulated, I expected
> measured_us -= 2 * exit_latency there.

The exit latency is subtracted once, because we are interested in the time
between asking the hardware to enter the idle state and the wakeup event.

The exit latency is the time between the wakeup event and the first instruction,
so it shouldn't be counted, roughly speaking.
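
Schematically, under those definitions:

    enter idle                  wakeup event            first instruction
        |<------ idle duration ----->|<---- exit_latency ---->|
        |<------------ dev->last_residency (approx.) -------->|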

> More: you acknowledge, in that snippet of code, that there can be
> dev->last_residency's smaller than twice the exit_latency, i.e. not even the
> time to entry/exit the state. Am I reading this right? Is that because both
> exit_latency and dev->last_residency are only approximations?

Yes, they are just approximations.

> I actually see quite a few of those extra-short residencies in my traces, even
> with dev->last_residency < exit_latency.

Well, they are expected to occur at least occasionally.

Cheers,
Rafael



Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-05 Thread Giovanni Gherdovich
On Sun, 2018-11-04 at 17:31 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The venerable menu governor does some things that are quite
> questionable in my view.  First, it includes timer wakeups in
> the pattern detection data and mixes them up with wakeups from
> other sources which in some cases causes it to expect what
> essentially would be a timer wakeup in a time frame in which
> no timer wakeups are possible (because it knows the time until
> the next timer event and that is later than the expected wakeup
> time).  Second, it uses the extra exit latency limit based on
> the predicted idle duration and depending on the number of tasks
> waiting on I/O, even though those tasks may run on a different
> CPU when they are woken up.  Moreover, the time ranges used by it
> for the sleep length correction factors depend on whether or not
> there are tasks waiting on I/O, which again doesn't imply anything
> in particular, and they are not correlated to the list of available
> idle states in any way whatsoever.  Also, the pattern detection code
> in menu may end up considering values that are too large to matter
> at all, in which cases running it is a waste of time.
> 
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
> 
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.  First, it
> doesn't use "correction factors" for the time till the closest timer,
> but instead it tries to correlate the measured idle duration values
> with the available idle states and use that information to pick up
> the idle state that is most likely to "match" the upcoming CPU idle
> interval.  Second, it doesn't take the number of "I/O waiters" into
> account at all and the pattern detection code in it tries to avoid
> taking timer wakeups into account.  It also only uses idle duration
> values less than the current time till the closest timer (with the
> tick excluded) for that purpose.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> v2 -> v3:
>  * Simplify the pattern detection code and make it return a value
>    lower than the time to the closest timer if the majority of recent
>    idle intervals are below it regardless of their variance (that should
>    cause it to be slightly more aggressive).
>  * Do not count wakeups from state 0 due to the time limit in poll_idle()
>    as non-timer.
> 
> Note: I will be mostly offline tomorrow, so this goes slightly early.
> I have tested it only very lightly, but it is not so much different from
> the previous one.
> 
> It requires the same additional patches to apply as the previous one too.
> 
> ---
>  drivers/cpuidle/Kconfig            |   11
>  drivers/cpuidle/governors/Makefile |    1
>  drivers/cpuidle/governors/teo.c    |  491 +
>  3 files changed, 503 insertions(+)
> 
> Index: linux-pm/drivers/cpuidle/governors/teo.c
> ===
> --- /dev/null
> +++ linux-pm/drivers/cpuidle/governors/teo.c
> @@ -0,0 +1,491 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Timer events oriented CPU idle governor
> + *
> + * Copyright (C) 2018 Intel Corporation
> + * Author: Rafael J. Wysocki 
> + *
> + * The idea of this governor is based on the observation that on many systems
> + * timer events are two or more orders of magnitude more frequent than any
> + * other interrupts, so they are likely to be the most significant source of CPU
> + * wakeups from idle states.  Moreover, information about what happened in the
> + * (relatively recent) past can be used to estimate whether or not the deepest
> + * idle state with target residency within the time to the closest timer is
> + * likely to be suitable for the upcoming idle time of the CPU and, if not, then
> + * which of the shallower idle states to choose.
> + *
> + * Of course, non-timer wakeup sources are more important in some use cases and
> + * they can be covered by detecting patterns among recent idle time intervals
> + * of the CPU.  However, even in that case it is not necessary to take idle
> + * duration values greater than the time till the closest timer into account, as
> + * the patterns that they may belong to produce average values close enough to
> + * the time till the closest timer (sleep length) anyway.
> + *
> + * Thus this governor estimates whether or not the upcoming idle time of the CPU
> + * is likely to be 
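
The "deepest state that fits" strategy the changelog above describes can be
sketched roughly as follows (an illustrative sketch only, not the patch's
actual selection routine; pick_state and duration_us are made-up names):

    #include <linux/cpuidle.h>

    /*
     * Pick the deepest idle state whose target residency does not
     * exceed the expected idle duration.
     */
    static int pick_state(struct cpuidle_driver *drv, unsigned int duration_us)
    {
            int i, idx = 0;

            for (i = 1; i < drv->state_count; i++) {
                    if (drv->states[i].target_residency <= duration_us)
                            idx = i;
            }
            return idx;
    }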
