RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-08 Thread Doug Smythies
On 2018.12.08 02:23 Giovanni Gherdovich wrote:

> sorry for the late reply, this week I was traveling.

No problem. Thanks very much for your very detailed reply,
which obviously took considerable time to write. While
I was making progress on my own, your instructions fill in
some gaps and correct some mistakes I was making.

Eventually (probably in several days) I'll report back with my
test results.


> Some specific remarks you raise:
>
> On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
>> ...
>> My issue is that I do not understand the output or how it
>> might correlate with your tables.
>> 
>> I get, for example:
>> 
>> 31   1 0.13s 0.68s 0.80s  1003894.302 1003779.613
>> 31   1 0.16s 0.64s 0.80s  1008900.053 1008215.336
>> 31   1 0.14s 0.66s 0.80s  1009630.439 1008990.265
>> ...
>> 
>> But I don't know what that means, nor have I been able to find
>> a description anywhere.
>
> I don't recognize this output. I hope the illustration above can clarify how
> MMTests is used.

Due to incompetence on my part, the config file being run for my tests was
always just the default config file from my original
git clone https://github.com/gormanm/mmtests.git
command. So regardless of what I thought I was doing, I was running "pft"
(Page Fault Test).
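
Going forward I'll pass the intended config explicitly, per your
instructions below, e.g.:

  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound SOME-MNEMONIC-NAME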

... Doug




Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-08 Thread Giovanni Gherdovich
Hello Doug,

sorry for the late reply, this week I was traveling.

First off, thank you for trying out MMTests; I admit the documentation is
somewhat incomplete. I'm going to give you an overview of how I run benchmarks
with MMTests and how I print comparisons, hoping this can address your
questions.

In the last report I posted the following two tables, for instance; I'll now
show the commands I used to produce them.

>  * sockperf on loopback over UDP, mode "throughput"
>    * global-dhp__network-sockperf-unbound
>    48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
> 
>                           teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   -------------------------------------------------------------------------------
>   8x-SKYLAKE-UMA          1% worse    1% worse    1% worse    1% worse    10% better
>   80x-BROADWELL-NUMA      3% better   2% better   5% better   3% worse    8% better
>   48x-HASWELL-NUMA        4% better   12% worse   no change   no change   no change
> 
>   SOCKPERF-UDP-THROUGHPUT
>   =======================
>   NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
>   message size.
>   MEASURES: Throughput, in MBits/second
>   HIGHER is better
> 
>   machine: 8x-SKYLAKE-UMA
> 
>                            4.18.0               4.18.0               4.18.0               4.18.0               4.18.0               4.18.0
>                           vanilla                  teo      teo-v2+backport      teo-v3+backport      teo-v5+backport      teo-v6+backport
>   -----------------------------------------------------------------------------------------------------------------------------------------
>   Hmean     14         70.34 (   0.00%)      69.80 *  -0.76%*      69.11 *  -1.75%*      69.49 *  -1.20%*      69.71 *  -0.90%*      77.51 *  10.20%*
>   Hmean     100       499.24 (   0.00%)     494.26 *  -1.00%*     492.74 *  -1.30%*     494.90 *  -0.87%*     497.43 *  -0.36%*     549.93 *  10.15%*
>   Hmean     300      1489.13 (   0.00%)    1472.39 *  -1.12%*    1468.45 *  -1.39%*    1477.74 *  -0.76%*    1478.61 *  -0.71%*    1632.63 *   9.64%*
>   Hmean     500      2469.62 (   0.00%)    2444.41 *  -1.02%*    2434.61 *  -1.42%*    2454.15 *  -0.63%*    2454.76 *  -0.60%*    2698.70 *   9.28%*
>   Hmean     850      4165.12 (   0.00%)    4123.82 *  -0.99%*    4100.37 *  -1.55%*    4111.82 *  -1.28%*    4120.04 *  -1.08%*    4521.11 *   8.55%*

The first table is a panoramic view of all machines, the second is a zoom into
the 8x-SKYLAKE-UMA machine where the overall benchmark score is broken down
into the various message sizes.

The first thing to do is, obviously, to gather data for each kernel. Once the
kernel is installed on the box, as you already figured out, you have to run:

  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound SOME-MNEMONIC-NAME

In my case, what I did was run:

  # build, install and boot 4.18.0-vanilla kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-vanilla

  # build, install and boot 4.18.0-teo kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo

  # build, install and boot 4.18.0-teo-v2+backport kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo-v2+backport

  ...

  # build, install and boot 4.18.0-teo-v6+backport kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo-v6+backport

At this point in the work/log directory I've accumulated all the data I need
for a report. What's important to note here is that a single configuration
file (such as config-global-dhp__network-sockperf-unbound) often runs more than
a single benchmark, according to the value of the MMTESTS variable in that
config. The config we're using has:

  export MMTESTS="sockperf-tcp-throughput sockperf-tcp-under-load sockperf-udp-throughput sockperf-udp-under-load"

which means it's running 4 different flavors of sockperf. The two tables above
are from the "sockperf-udp-throughput" variant.
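(If you only want one of those flavors, a quick local hack is to trim that
list in your copy of the config before running; this is just an edit of the
file, not an official MMTests switch:

  export MMTESTS="sockperf-udp-throughput"
)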

Now that we've run the benchmarks for each kernel (every run takes around 75
minutes on my machines) we're ready to extract some comparison tables.
Exploring the work/log directory shows what we've got:

  $ find . -type d -name sockperf\* | sort 
  ./sockperf-tcp-throughput-4.18.0-teo
  ./sockperf-tcp-throughput-4.18.0-teo-v2+backport
  ./sockperf-tcp-throughput-4.18.0-teo-v3+backport
  ./sockperf-tcp-throughput-4.18.0-teo-v5+backport
  ./sockperf-tcp-throughput-4.18.0-teo-v6+backport
  ./sockperf-tcp-throughput-4.18.0-vanilla
  ./sockperf-tcp-under-load-4.18.0-teo
  ./sockperf-tcp-under-load-4.18.0-teo-v2+backport
  ./sockperf-tcp-under-load-4.18.0-teo-v3+backport
  ./sockperf-tcp-under-load-4.18.0-teo-v5+backport
  

Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-07 Thread Mel Gorman
On Mon, Dec 03, 2018 at 08:23:56AM -0800, Doug Smythies wrote:
> In the README file, I did see that for reporting I am 
> somehow supposed to use compare-kernels.sh, but
> I couldn't figure that out.
> 

cd work/log
../../compare-kernels.sh

> By the way, I am running these tests as a regular user, but
> they seem to want to modify:
> 
> /sys/kernel/mm/transparent_hugepage/enabled
> 

Red herring in this case. Even if transparent hugepages are left at the
default, it still stupidly tries to write to it. An irritating, but
harmless, bug.

-- 
Mel Gorman
SUSE Labs


Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-06 Thread Rafael J. Wysocki
On Thu, Dec 6, 2018 at 12:06 AM Doug Smythies  wrote:
>
> On 2018.12.03 03:48 Rafael J. Wysocki wrote:
>
> >>> There is an additional issue where if idle state 0 is disabled (with the 
> >>> above suggested code patch),
> >>> idle state usage seems to fall to deeper states than idle state 1.
> >>> This is not the expected behaviour.
> >>
> >> No, it isn't.
> >>
> >>> Kernel 4.20-rc3 works as expected.
> >>> I have not figured this issue out yet, in the code.
> >>>
> >>> Example (1 minute per sample. Number of entries/exits per state):
> >>> State 0    State 1    State 2     State 3     State 4     Watts
> >>> 28235143,       83,        26,         17,        837,    64.900
> >>>  5583238,   657079,   5884941,    8498552,   30986831,    62.433  << Transition sample, after idle state 0 disabled
> >>>        0,   793517,   7186099,   10559878,   38485721,    61.900  << ?? should have all gone into Idle state 1
> >>>        0,   795414,   7340703,   10553117,   38513456,    62.050
> >>>        0,   807028,   7288195,   10574113,   38523524,    62.167
> >>>        0,   814983,   7403534,   10575108,   38571228,    62.167
> >>>        0,   838302,   7747127,   10552289,   38556054,    62.183
> >>>  9664999,   544473,   4914512,    6942037,   25295361,    63.633  << Transition sample, after idle state 0 enabled
> >>> 27893504,       96,        40,          9,        912,    66.500
> >>> 26556343,       83,        29,          7,        814,    66.683
> >>> 27929227,       64,        20,         10,        931,    66.683
> >>
> >> I see.
> >>
> >> OK, I'll look into this too, thanks!
> >
> > This is probably an artifact of the fix for the teo_find_shallower_state()
> > issue.
> >
> > Anyway, I'm not able to reproduce this with the teo_find_shallower_state() 
> > issue
> > fixed differently.
>
> I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
> either. Everything is graceful now, as states are disabled:
> (10 seconds per sample. Number of entries/exits per state):
>
> State 0    State 1    State 2     State 3    State 4    Watts
>       0,         6,         4,          1,       414,    3.700
>       2,         4,        30,          3,       578,    3.700  << No load
>  168619,        37,        39,          4,       480,    5.600  << Transition sample
> 4643618,        45,         8,          1,       137,   61.200  << All idle states enabled
> 4736227,        40,         3,          5,       111,   61.800
> 1888417,   4369314,        25,          2,        89,   62.000  << Transition sample
>       0,   7266864,         9,          0,         0,   62.200  << state 0 disabled
>       0,   7193372,         9,          0,         0,   62.700
>       0,   5539898,   1744007,          0,         0,   63.500  << Transition sample
>       0,         0,   8152956,          0,         0,   63.700  << states 0,1 disabled
>       0,         0,   8015151,          0,         0,   63.900
>       0,         0,   4146806,    6349619,         0,   63.000  << Transition sample
>       0,         0,         0,   13252144,         0,   61.600  << states 0,1,2 disabled
>       0,         0,         0,   13258313,         0,   61.800
>       0,         0,         0,   10417428,   1984451,   61.200  << Transition sample
>       0,         0,         0,          0,   9247172,   58.500  << states 0,1,2,3 disabled
>       0,         0,         0,          0,   9242657,   58.500
>       0,         0,         0,          0,   9233749,   58.600
>       0,         0,         0,          0,   9238444,   58.700
>       0,         0,         0,          0,   9236345,   58.600
>
> For reference, this is kernel 4.20-rc5 (with your other proposed patches):
>
> State 0    State 1    State 2     State 3    State 4    Watts
>       0,         4,         8,          6,       426,    3.700
> 1592870,       279,       149,         96,       831,   21.800
> 5071279,       154,        25,          6,       105,   61.200
> 5095090,        78,        21,          1,        86,   61.800
> 5001493,        94,        30,          4,       101,   62.200
>  616019,   5446924,         5,          3,        38,   62.500
>       0,   6249752,         0,          0,         0,   63.300
>       0,   6293671,         0,          0,         0,   63.800
>       0,   3751035,   2529964,          0,         0,   64.100
>       0,         0,   6101167,          0,         0,   64.500
>       0,         0,   6172526,          0,         0,   64.700
>       0,         0,   6163797,          0,         0,   64.900
>       0,         0,   1724841,    9567528,         0,   63.300
>       0,         0,         0,   13349668,  

RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-05 Thread Doug Smythies
On 2018.12.03 03:48 Rafael J. Wysocki wrote:

>>> There is an additional issue where if idle state 0 is disabled (with the 
>>> above suggested code patch),
>>> idle state usage seems to fall to deeper states than idle state 1.
>>> This is not the expected behaviour.
>> 
>> No, it isn't.
>> 
>>> Kernel 4.20-rc3 works as expected.
>>> I have not figured this issue out yet, in the code.
>>>
>>> Example (1 minute per sample. Number of entries/exits per state):
>>> State 0    State 1    State 2     State 3     State 4     Watts
>>> 28235143,       83,        26,         17,        837,    64.900
>>>  5583238,   657079,   5884941,    8498552,   30986831,    62.433  << Transition sample, after idle state 0 disabled
>>>        0,   793517,   7186099,   10559878,   38485721,    61.900  << ?? should have all gone into Idle state 1
>>>        0,   795414,   7340703,   10553117,   38513456,    62.050
>>>        0,   807028,   7288195,   10574113,   38523524,    62.167
>>>        0,   814983,   7403534,   10575108,   38571228,    62.167
>>>        0,   838302,   7747127,   10552289,   38556054,    62.183
>>>  9664999,   544473,   4914512,    6942037,   25295361,    63.633  << Transition sample, after idle state 0 enabled
>>> 27893504,       96,        40,          9,        912,    66.500
>>> 26556343,       83,        29,          7,        814,    66.683
>>> 27929227,       64,        20,         10,        931,    66.683
>> 
>> I see.
>> 
>> OK, I'll look into this too, thanks!
>
> This is probably an artifact of the fix for the teo_find_shallower_state()
> issue.
>
> Anyway, I'm not able to reproduce this with the teo_find_shallower_state() 
> issue
> fixed differently.

I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
either. Everything is graceful now, as states are disabled:
(10 seconds per sample. Number of entries/exits per state):

State 0    State 1    State 2     State 3    State 4    Watts
      0,         6,         4,          1,       414,    3.700
      2,         4,        30,          3,       578,    3.700  << No load
 168619,        37,        39,          4,       480,    5.600  << Transition sample
4643618,        45,         8,          1,       137,   61.200  << All idle states enabled
4736227,        40,         3,          5,       111,   61.800
1888417,   4369314,        25,          2,        89,   62.000  << Transition sample
      0,   7266864,         9,          0,         0,   62.200  << state 0 disabled
      0,   7193372,         9,          0,         0,   62.700
      0,   5539898,   1744007,          0,         0,   63.500  << Transition sample
      0,         0,   8152956,          0,         0,   63.700  << states 0,1 disabled
      0,         0,   8015151,          0,         0,   63.900
      0,         0,   4146806,    6349619,         0,   63.000  << Transition sample
      0,         0,         0,   13252144,         0,   61.600  << states 0,1,2 disabled
      0,         0,         0,   13258313,         0,   61.800
      0,         0,         0,   10417428,   1984451,   61.200  << Transition sample
      0,         0,         0,          0,   9247172,   58.500  << states 0,1,2,3 disabled
      0,         0,         0,          0,   9242657,   58.500
      0,         0,         0,          0,   9233749,   58.600
      0,         0,         0,          0,   9238444,   58.700
      0,         0,         0,          0,   9236345,   58.600
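
(The per-state entry counts in these tables come from the standard cpuidle
sysfs counters; a minimal sketch of one way to sample them, not my actual
logging script:

  for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
      echo "$(cat "$s/name"): $(cat "$s/usage")"
  done
)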

For reference, this is kernel 4.20-rc5 (with your other proposed patches):

State 0    State 1    State 2     State 3    State 4    Watts
      0,         4,         8,          6,       426,    3.700
1592870,       279,       149,         96,       831,   21.800
5071279,       154,        25,          6,       105,   61.200
5095090,        78,        21,          1,        86,   61.800
5001493,        94,        30,          4,       101,   62.200
 616019,   5446924,         5,          3,        38,   62.500
      0,   6249752,         0,          0,         0,   63.300
      0,   6293671,         0,          0,         0,   63.800
      0,   3751035,   2529964,          0,         0,   64.100
      0,         0,   6101167,          0,         0,   64.500
      0,         0,   6172526,          0,         0,   64.700
      0,         0,   6163797,          0,         0,   64.900
      0,         0,   1724841,    9567528,         0,   63.300
      0,         0,         0,   13349668,         0,   62.700
      0,         0,         0,   13360471,         0,   62.700
      0,         0,         0,   13355424,         0,   62.700
      0,         0,         0,    8854491,   3132640,   61.600
      0,

Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Rafael J. Wysocki
On Thursday, November 29, 2018 12:20:07 AM CET Doug Smythies wrote:
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
> 
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>more accurate than the one measured by the core).
> 
> -- above missing -- (see follow-up e-mail from Rafael)
> 
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>(exit latency is less than the target residency for all of the idle states
>known to me anyway, so this doesn't change anything in practice).
> 
> Hi Rafael,
> 
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
> 
> Test 1: Phoronix dbench test, all options: 1, 6, 12, 48, 128, 256 clients.
> 
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
> 
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
> 
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 versus
> teov6 at 0.0641%.
> 
> For throughput, teov6 was 1.4% faster.

This may indicate that teov6 is somewhat too aggressive.

> Test 3: was an attempt to sweep through a preference for
> all idle states.
> 
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
> 
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm
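
For reference, each sweep thread could look something like the minimal
sketch below (hypothetical; Doug's actual test program is not shown here):

  /* sweep.c: 40 threads that each sleep for a fixed number of
   * microseconds in a loop; the step duration is swept externally.
   * Build with: gcc -O2 -pthread sweep.c -o sweep
   */
  #include <pthread.h>
  #include <stdlib.h>
  #include <unistd.h>

  static useconds_t sleep_us;

  static void *sleeper(void *arg)
  {
      for (;;)
          usleep(sleep_us);   /* a 1 us request lands near 50 us */
      return NULL;
  }

  int main(int argc, char **argv)
  {
      pthread_t tid;
      int i;

      sleep_us = (argc > 1) ? atoi(argv[1]) : 1;  /* 1..500, per step */
      for (i = 0; i < 40; i++)
          pthread_create(&tid, NULL, sleeper, NULL);
      pause();    /* run until killed, e.g. one minute per step */
      return 0;
  }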

And, so long as my understanding of the graphs is correct, the results
here indicate that teov6 tends to prefer relatively shallow idle states,
which is good for performance (at least with some workloads), but not
necessarily for energy efficiency.

I will send a v7 of TEO with some changes to make it a bit more
energy-efficient relative to v6.

Thanks,
Rafael



Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Rafael J. Wysocki
On Friday, November 30, 2018 9:51:19 AM CET Rafael J. Wysocki wrote:
> Hi Doug,
> 
> On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies  wrote:
> >
> > Hi Rafael,
> >
> > On 2018.11.23 02:36 Rafael J. Wysocki wrote:
> >
> > ... [snip]...
> >
> > > +/**
> > > + * teo_find_shallower_state - Find shallower idle state matching given duration.
> > > + * @drv: cpuidle driver containing state data.
> > > + * @dev: Target CPU.
> > > + * @state_idx: Index of the capping idle state.
> > > + * @duration_us: Idle duration value to match.
> > > + */
> > > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > > +                                    struct cpuidle_device *dev, int state_idx,
> > > +                                    unsigned int duration_us)
> > > +{
> > > + int i;
> > > +
> > > + for (i = state_idx - 1; i > 0; i--) {
> > > + if (drv->states[i].disabled || dev->states_usage[i].disable)
> > > + continue;
> > > +
> > > + if (drv->states[i].target_residency <= duration_us)
> > > + break;
> > > + }
> > > + return i;
> > > +}
> >
> > I think this subroutine has a problem when idle state 0
> > is disabled.
> 
> You are right, thanks!
> 
> > Perhaps something like this might help:
> >
> > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > index bc1c9a2..5b97639 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> >  }
> >
> >  /**
> > - * teo_find_shallower_state - Find shallower idle state matching given duration.
> > + * teo_find_shallower_state - Find shallower idle state matching given
> > + * duration, if possible.
> >   * @drv: cpuidle driver containing state data.
> >   * @dev: Target CPU.
> >   * @state_idx: Index of the capping idle state.
> > @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
> >  {
> > int i;
> >
> > -   for (i = state_idx - 1; i > 0; i--) {
> > +   for (i = state_idx - 1; i >= 0; i--) {
> > if (drv->states[i].disabled || dev->states_usage[i].disable)
> > continue;
> >
> > if (drv->states[i].target_residency <= duration_us)
> > break;
> > }
> > +   if (i < 0)
> > +   i = state_idx;
> > return i;
> >  }
> 
> I'll do something slightly similar, but equivalent.

I actually ended up fixing it differently, as the above will cause state_idx
to be returned even if some states shallower than state_idx are enabled, but
their target residencies are higher than duration_us.  In that case, though,
it still is more correct to return the shallowest enabled state rather than
state_idx.
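
One way to realize that behavior, as a sketch (not necessarily the exact
code that ends up in v7):

  static int teo_find_shallower_state(struct cpuidle_driver *drv,
                                      struct cpuidle_device *dev, int state_idx,
                                      unsigned int duration_us)
  {
          int i, idx = state_idx;

          for (i = state_idx - 1; i >= 0; i--) {
                  if (drv->states[i].disabled || dev->states_usage[i].disable)
                          continue;

                  /* Track the shallowest enabled state seen so far, so it
                   * is returned when nothing matches duration_us. */
                  idx = i;
                  if (drv->states[i].target_residency <= duration_us)
                          break;
          }
          return idx;
  }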

> >
> > @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > if (max_early_idx >= 0 &&
> > count < cpu_data->states[i].early_hits)
> > count = cpu_data->states[i].early_hits;
> > -
> > continue;
> > }
> >
> > There is an additional issue where if idle state 0 is disabled (with the 
> > above suggested code patch),
> > idle state usage seems to fall to deeper states than idle state 1.
> > This is not the expected behaviour.
> 
> No, it isn't.
> 
> > Kernel 4.20-rc3 works as expected.
> > I have not figured this issue out yet, in the code.
> >
> > Example (1 minute per sample. Number of entries/exits per state):
> > State 0    State 1    State 2     State 3     State 4     Watts
> > 28235143,       83,        26,         17,        837,    64.900
> >  5583238,   657079,   5884941,    8498552,   30986831,    62.433  << Transition sample, after idle state 0 disabled
> >        0,   793517,   7186099,   10559878,   38485721,    61.900  << ?? should have all gone into Idle state 1
> >        0,   795414,   7340703,   10553117,   38513456,    62.050
> >        0,   807028,   7288195,   10574113,   38523524,    62.167
> >        0,   814983,   7403534,   10575108,   38571228,    62.167
> >        0,   838302,   7747127,   10552289,   38556054,    62.183
> >  9664999,   544473,   4914512,    6942037,   25295361,    63.633  << Transition sample, after idle state 0 enabled
> > 27893504,       96,        40,          9,        912,    66.500
> > 26556343,       83,        29,          7,        814,    66.683
> > 27929227,       64,        20,         10,        931,    66.683
> 
> I see.
> 
> OK, I'll look into this too, thanks!

This is probably an artifact of the fix for the teo_find_shallower_state()
issue.

Anyway, I'm not able to reproduce this with the teo_find_shallower_state() issue
fixed differently.

Thanks,
Rafael



Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Rafael J. Wysocki
On Saturday, December 1, 2018 3:18:24 PM CET Giovanni Gherdovich wrote:
> On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 

[cut]

> > 
> > [snip]
> 
> [NOTE: the tables in this message are quite wide. If this doesn't get to you
> properly formatted you can read a copy of this message at the URL
> https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ]
> 
> All performance concerns manifested in v5 are wiped out by v6. Not only v6
> improves over v5, but is even better than the baseline (menu) in most
> cases. The optimizations in v6 paid off!

This is very encouraging, thank you!

> The overview of the analysis for v5, from the message
> https://lore.kernel.org/lkml/1541877001.17878.5.ca...@suse.cz , was:
> 
> > The quick summary is:
> > 
> > ---> sockperf on loopback over UDP, mode "throughput":
> >  this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is 
> > completely
> >  recovered in v3 and v5. Good stuff.
> > 
> > ---> dbench on xfs:
> >  this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
> >  regression. Slight improvement. What's really hurting here is the 
> > single
> >  client scenario.
> > 
> > ---> netperf-udp on loopback:
> >  had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
> >  happens in v5.
> > 
> > ---> tbench on loopback:
> >  was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 
> > 12%
> >  regression. As in dbench, it's at low number of clients that the 
> > results
> >  are worst. Note that this machine is different from the one that has 
> > the
> >  dbench regression.
> 
> now the situation is overturned:
> 
> ---> sockperf on loopback over UDP, mode "throughput":
>  No new problems from 48x-HASWELL-NUMA, which stays put at the level of
>  the baseline. OTOH 80x-BROADWELL-NUMA and 8x-SKYLAKE-UMA improve over the
>  baseline by 8% and 10% respectively.

Good.

> ---> dbench on xfs:
>  48x-HASWELL-NUMA rebounds from the previous 10% degradation and it's now
>  at 0, i.e. the baseline level. The 1-client case, responsible for the
>  previous overall degradation (I average results from different number of
>  clients), went from -40% to -20% and is compensated in my table by
>  improvements with 4, 8, 16 and 32 clients (table below).
> 
> ---> netperf-udp on loopback:
>  8x-SKYLAKE-UMA now shows a 9% improvement over baseline.
>  80x-BROADWELL-NUMA, previously similar to baseline, now improves 7%.

Good.

> ---> tbench on loopback:
>  Impressive change of color for 8x-SKYLAKE-UMA, from 12% regression in v5
>  to 7% improvement in v6. The problematic 1- and 2-clients cases went from
>  -25% and -33% to +13% and +10% respectively.

Awesome. :-)

> Details below.
> 
> Runs are compared against v4.18 with the Menu governor. I know v4.18 is a
> little old now but that's where I measured my baseline. My machine pool didn't
> change:
> 
> * single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
> * two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
> * two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)
> 

[cut]

> 
> 
> PREVIOUSLY REGRESSING BENCHMARKS: OVERVIEW
> ==========================================
> 
> * sockperf on loopback over UDP, mode "throughput"
>     * global-dhp__network-sockperf-unbound
>     48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
> 
>                           teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   -------------------------------------------------------------------------------
>   8x-SKYLAKE-UMA          1% worse    1% worse    1% worse    1% worse    10% better
>   80x-BROADWELL-NUMA      3% better   2% better   5% better   3% worse    8% better
>   48x-HASWELL-NUMA        4% better   12% worse   no change   no change   no change
> 
> * dbench on xfs
>     * global-dhp__io-dbench4-async-xfs
>     48x-HASWELL-NUMA is fixed wrt v5 and earlier versions.
> 
>                           teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   -------------------------------------------------------------------------------
>   8x-SKYLAKE-UMA          3% better   4% better   6% better   4% better   5% better
>   80x-BROADWELL-NUMA      no change   no change   1% worse    3% worse    2% better
>   48x-HASWELL-NUMA        6% worse    16% worse   8% worse    10% worse   no change
> 
> * netperf on loopback over UDP
>     * global-dhp__network-netperf-unbound
>     8x-SKYLAKE-UMA fixed.
> 
>                           teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   -------------------------------------------------------------------------------
>   8x-SKYLAKE-UMA          no change   6% worse    4% worse    6% worse    9% better
>   80x-BROADWELL-NUMA      1% worse    4% worse    no change   no change   7% better

RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Doug Smythies
Hi Giovanni,

Perhaps I should go off-list for this, not sure.

I had the thought that I should be able to get results similar to your
"8x-SKYLAKE-UMA" results on my test computer, an i7-2600K, or that at
least it was worth trying, just to see. I couldn't find the same or a
similar test on Phoronix, and my attempts to do something similar, for
example with iperf, didn't show differences between the baseline kernel
and one with the teov6 patch.

So I tried the test set you referenced [1]:

On 2018.12.01 06:18 Giovanni Gherdovich wrote:
...
> * netperf on loopback over TCP
>* global-dhp__network-netperf-unbound

I assume this means that I am supposed to do:

cp config-global-dhp__network-netperf-unbound config

from the configs directory. Anyway that config file
looks correct. Then:

./run-mmtests.sh --no-monitor 3.0-nomonitor

...

> * sockperf on loopback over UDP, mode "throughput"
>* global-dhp__network-sockperf-unbound

Similarly (from the appropriate directories): 

cp config-global-dhp__network-sockperf-unbound config
./run-mmtests.sh --no-monitor 3.0-nomonitor

My issue is that I do not understand the output or how it
might correlate with your tables.

I get, for example:

   31   1 0.13s 0.68s 0.80s  1003894.302 1003779.613
   31   1 0.16s 0.64s 0.80s  1008900.053 1008215.336
   31   1 0.14s 0.66s 0.80s  1009630.439 1008990.265
...

But I don't know what that means, nor have I been able to find
a description anywhere.

In the README file, I did see that for reporting I am 
somehow supposed to use compare-kernels.sh, but
I couldn't figure that out.

By the way, I am running these tests as a regular user, but
they seem to want to modify:

/sys/kernel/mm/transparent_hugepage/enabled

which requires root privilege. I don't really want to mess
with that stuff for these tests.
 
> [1] https://github.com/gormanm/mmtests

Can you help me to produce meaningful results to compare with
your results?

... Doug




Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-01 Thread Giovanni Gherdovich
On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The venerable menu governor does some things that are quite
> questionable in my view.
> 
> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (because it knows
> the time until the next timer event and that is later than the
> expected wakeup time).
> 
> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up.  Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.
> 
> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.
> 
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
> 
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.
> 
> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.
> 
> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account.  It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>    more accurate than the one measured by the core).
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>    values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>    (exit latency is less than the target residency for all of the idle states
>    known to me anyway, so this doesn't change anything in practice).
> 
> v4 -> v5:
>  * Avoid using shallow idle states when the tick has been stopped already.
> 
> v3 -> v4:
>  * Make the pattern detection avoid returning too early if the minimum
>    sample is too far from the average.
>  * Reformat the changelog (as requested by Peter).
> 
> v2 -> v3:
>  * Simplify the pattern detection code and make it return a value
>    lower than the time to the closest timer if the majority of recent
>    idle intervals are below it regardless of their variance (that should
>    cause it to be slightly more aggressive).
>  * Do not count wakeups from state 0 due to the time limit in poll_idle()
>    as non-timer.
> 
> [snip]

[NOTE: the tables in this message are quite wide. If this doesn't get to you
properly formatted you can read a copy of this message at the URL
https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ]

All performance concerns manifested in v5 are wiped out by v6. Not only v6
improves over v5, but is even better than the baseline (menu) in most
cases. The optimizations in v6 paid off!

The overview of the analysis for v5, from the message
https://lore.kernel.org/lkml/1541877001.17878.5.ca...@suse.cz , was:

> The quick summary is:
> 
> ---> sockperf on loopback over UDP, mode "throughput":
>  this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is completely
>  recovered in v3 and v5. Good stuff.
> 
> ---> dbench on xfs:
>  this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
>  regression. Slight improvement. What's really hurting here is the single
>  client scenario.
> 
> ---> netperf-udp on loopback:
>  had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
>  happens in v5.
> 
> ---> tbench on loopback:
>  was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 

Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-30 Thread Rafael J. Wysocki
Hi Doug,

On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies  wrote:
>
> Hi Rafael,
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> ... [snip]...
>
> > +/**
> > + * teo_find_shallower_state - Find shallower idle state matching given duration.
> > + * @drv: cpuidle driver containing state data.
> > + * @dev: Target CPU.
> > + * @state_idx: Index of the capping idle state.
> > + * @duration_us: Idle duration value to match.
> > + */
> > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > +				    struct cpuidle_device *dev, int state_idx,
> > +				    unsigned int duration_us)
> > +{
> > +	int i;
> > +
> > +	for (i = state_idx - 1; i > 0; i--) {
> > +		if (drv->states[i].disabled || dev->states_usage[i].disable)
> > +			continue;
> > +
> > +		if (drv->states[i].target_residency <= duration_us)
> > +			break;
> > +	}
> > +	return i;
> > +}
>
> I think this subroutine has a problem when idle state 0
> is disabled.

You are right, thanks!

> Perhaps something like this might help:
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index bc1c9a2..5b97639 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  }
>
>  /**
> - * teo_find_shallower_state - Find shallower idle state matching given duration.
> + * teo_find_shallower_state - Find shallower idle state matching given
> + * duration, if possible.
>   * @drv: cpuidle driver containing state data.
>   * @dev: Target CPU.
>   * @state_idx: Index of the capping idle state.
> @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
>  {
>  	int i;
> 
> -	for (i = state_idx - 1; i > 0; i--) {
> +	for (i = state_idx - 1; i >= 0; i--) {
>  		if (drv->states[i].disabled || dev->states_usage[i].disable)
>  			continue;
> 
>  		if (drv->states[i].target_residency <= duration_us)
>  			break;
>  	}
> +	if (i < 0)
> +		i = state_idx;
>  	return i;
>  }

I'll do something slightly similar, but equivalent.
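
For reference, one "slightly similar, but equivalent" shape this could
take (a sketch only, the final code may differ) is to track the
shallowest enabled candidate in state_idx itself, so the function always
returns an enabled state without a separate post-loop check:

static int teo_find_shallower_state(struct cpuidle_driver *drv,
				    struct cpuidle_device *dev, int state_idx,
				    unsigned int duration_us)
{
	int i;

	for (i = state_idx - 1; i >= 0; i--) {
		if (drv->states[i].disabled || dev->states_usage[i].disable)
			continue;

		/* Remember the shallowest enabled state seen so far. */
		state_idx = i;
		if (drv->states[i].target_residency <= duration_us)
			break;
	}
	return state_idx;
}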

>
> @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>  			if (max_early_idx >= 0 &&
>  			    count < cpu_data->states[i].early_hits)
>  				count = cpu_data->states[i].early_hits;
> -
>  			continue;
>  		}
>
> There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
> idle state usage seems to fall to deeper states than idle state 1.
> This is not the expected behaviour.

No, it isn't.

> Kernel 4.20-rc3 works as expected.
> I have not figured this issue out yet, in the code.
>
> Example (1 minute per sample. Number of entries/exits per state):
>   State 0   State 1   State 2    State 3     State 4   Watts
>  28235143,       83,       26,        17,        837,  64.900
>   5583238,   657079,  5884941,   8498552,   30986831,  62.433  << Transition sample, after idle state 0 disabled
>         0,   793517,  7186099,  10559878,   38485721,  61.900  << ?? should have all gone into Idle state 1
>         0,   795414,  7340703,  10553117,   38513456,  62.050
>         0,   807028,  7288195,  10574113,   38523524,  62.167
>         0,   814983,  7403534,  10575108,   38571228,  62.167
>         0,   838302,  7747127,  10552289,   38556054,  62.183
>   9664999,   544473,  4914512,   6942037,   25295361,  63.633  << Transition sample, after idle state 0 enabled
>  27893504,       96,       40,         9,        912,  66.500
>  26556343,       83,       29,         7,        814,  66.683
>  27929227,       64,       20,        10,        931,  66.683

I see.

OK, I'll look into this too, thanks!


RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-29 Thread Doug Smythies
Hi Rafael,

On 2018.11.23 02:36 Rafael J. Wysocki wrote:

... [snip]...

> +/**
> + * teo_find_shallower_state - Find shallower idle state matching given duration.
> + * @drv: cpuidle driver containing state data.
> + * @dev: Target CPU.
> + * @state_idx: Index of the capping idle state.
> + * @duration_us: Idle duration value to match.
> + */
> +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> +				    struct cpuidle_device *dev, int state_idx,
> +				    unsigned int duration_us)
> +{
> +	int i;
> +
> +	for (i = state_idx - 1; i > 0; i--) {
> +		if (drv->states[i].disabled || dev->states_usage[i].disable)
> +			continue;
> +
> +		if (drv->states[i].target_residency <= duration_us)
> +			break;
> +	}
> +	return i;
> +}

I think this subroutine has a problem when idle state 0
is disabled.

Perhaps something like this might help:

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index bc1c9a2..5b97639 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 }

 /**
- * teo_find_shallower_state - Find shallower idle state matching given duration.
+ * teo_find_shallower_state - Find shallower idle state matching given
+ * duration, if possible.
  * @drv: cpuidle driver containing state data.
  * @dev: Target CPU.
  * @state_idx: Index of the capping idle state.
@@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
 {
 	int i;
 
-	for (i = state_idx - 1; i > 0; i--) {
+	for (i = state_idx - 1; i >= 0; i--) {
 		if (drv->states[i].disabled || dev->states_usage[i].disable)
 			continue;
 
 		if (drv->states[i].target_residency <= duration_us)
 			break;
 	}
+	if (i < 0)
+		i = state_idx;
 	return i;
 }

@@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 			if (max_early_idx >= 0 &&
 			    count < cpu_data->states[i].early_hits)
 				count = cpu_data->states[i].early_hits;
-
 			continue;
 		}

There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
idle state usage seems to fall to deeper states than idle state 1.
This is not the expected behaviour.
Kernel 4.20-rc3 works as expected.
I have not figured this issue out yet, in the code.
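
(For context, per-state entry counts like the ones below can be read
from /sys/devices/system/cpu/cpuN/cpuidle/stateM/usage. A minimal sketch
of such a sampler, with the CPU and state counts hard-coded for this
machine, and only illustrative of the method:

#include <stdio.h>
#include <unistd.h>

#define NR_CPUS 8
#define NR_STATES 5

/* Sum the cpuidle usage counter of one state across all CPUs. */
static unsigned long long state_usage(int state)
{
	unsigned long long total = 0, v;
	char path[96];
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpuidle/state%d/usage",
			 cpu, state);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fscanf(f, "%llu", &v) == 1)
			total += v;
		fclose(f);
	}
	return total;
}

int main(void)
{
	unsigned long long prev[NR_STATES] = { 0 }, cur;
	int s;

	for (;;) {
		/* First pass prints absolute counts, then per-minute deltas. */
		for (s = 0; s < NR_STATES; s++) {
			cur = state_usage(s);
			printf("%11llu%s", cur - prev[s],
			       s < NR_STATES - 1 ? ", " : "\n");
			prev[s] = cur;
		}
		sleep(60);	/* one sample per minute, as in the table */
	}
}

)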

Example (1 minute per sample. Number of entries/exits per state):
  State 0   State 1   State 2    State 3     State 4   Watts
 28235143,       83,       26,        17,        837,  64.900
  5583238,   657079,  5884941,   8498552,   30986831,  62.433  << Transition sample, after idle state 0 disabled
        0,   793517,  7186099,  10559878,   38485721,  61.900  << ?? should have all gone into Idle state 1
        0,   795414,  7340703,  10553117,   38513456,  62.050
        0,   807028,  7288195,  10574113,   38523524,  62.167
        0,   814983,  7403534,  10575108,   38571228,  62.167
        0,   838302,  7747127,  10552289,   38556054,  62.183
  9664999,   544473,  4914512,   6942037,   25295361,  63.633  << Transition sample, after idle state 0 enabled
 27893504,       96,       40,         9,        912,  66.500
 26556343,       83,       29,         7,        814,  66.683
 27929227,       64,       20,        10,        931,  66.683
 
... Doug




Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-29 Thread Rafael J. Wysocki
Hi Doug,

On Thu, Nov 29, 2018 at 12:20 AM Doug Smythies  wrote:
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>    more accurate than the one measured by the core).
>
> -- above missing-- (see follow up e-mail from Rafael)
>
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>    values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>    (exit latency is less than the target residency for all of the idle states
>    known to me anyway, so this doesn't change anything in practice).
>
> Hi Rafael,
>
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
>
> Test 1: Phoronix dbench test, all options: 1, 6, 12, 48, 128, 256 clients.
>
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
>
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
>
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 versus
> teov6 at 0.0641%.
>
> For throughput, teov6 was 1.4% faster.
>
> Test 3: was an attempt to sweep through a preference for
> all idle states.
>
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
>
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm
>
> I might try to get some histogram information at a later date.

Thank you for the results, much appreciated!


RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-28 Thread Doug Smythies
On 2018.11.23 02:36 Rafael J. Wysocki wrote:

v5 -> v6:
 * Avoid applying poll_time_limit to non-polling idle states by mistake.
 * Use idle duration measured by the governor for everything (as it likely is
   more accurate than the one measured by the core).

-- above missing-- (see follow up e-mail from Rafael)

 * Rename SPIKE to PULSE.
 * Do not run pattern detection upfront.  Instead, use recent idle duration
   values to refine the state selection after finding a candidate idle state.
 * Do not use the expected idle duration as an extra latency constraint
   (exit latency is less than the target residency for all of the idle states
   known to me anyway, so this doesn't change anything in practice).

Hi Rafael,

I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
reference kernel.

Test 1: Phoronix dbench test, all options: 1, 6, 12, 48, 128, 256 clients.

Note: because it uses the disk, the dbench test is somewhat non-repeatable.
However, if particular attention is paid to not doing anything else with
the disk between tests, then it seems to be repeatable to within about 6%.

Anyway no significant difference observed between kernel 4.20-rc3 and the
same with the teov6 patch.

Test 2: Pipe test, non cross core. (And idle state 0 test, really)
I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
Thus, pretty much only idle state 0 was ever used.
Processor package power was similar for both kernels.
teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
-rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
There was a difference in percentage time spent in idle state 0,
with kernel 4.20-rc3 spending 0.2441% in idle state 0 versus
teov6 at 0.0641%.

For throughput, teov6 was 1.4% faster.
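
(A bare-bones sketch of a same-core pipe ping-pong of this kind follows;
it is an illustration only, not the actual test program, and the
cpu0/cpu4 sibling pairing is an assumption about this machine's
topology:

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	int ab[2], ba[2];
	long i;
	char c = 0;

	pipe(ab);
	pipe(ba);

	if (fork() == 0) {
		pin_to_cpu(4);			/* assumed HT sibling of cpu0 */
		while (read(ab[0], &c, 1) == 1)	/* echo until parent exits */
			write(ba[1], &c, 1);
		return 0;
	}

	pin_to_cpu(0);
	for (i = 0; i < 10000000; i++) {	/* time this loop for throughput */
		write(ab[1], &c, 1);
		read(ba[0], &c, 1);
	}
	return 0;
}

)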

Test 3: was an attempt to sweep through a preference for
all idle states.

40 threads were launched with nothing to do except sleep
for a variable duration of 1 to 500 uSec, each step was
run for 1 minute. With 1 minute idle before the test and a few
minutes idle after, the total test duration was about 505 minutes.
Recall that when one asks for a short sleep of 1 uSec, they actually
get about 50 uSec, due to overheads. So I use 40 threads in an attempt
to get the average time between wakeup events per CPU down somewhat.
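
(Again just a sketch of the sweep idea, not the actual test code, with
usleep() standing in for whatever sleep primitive was used and the
thread count and stepping as described above:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define NR_THREADS 40

static atomic_uint sleep_us = 1;
static atomic_bool done;

static void *sleeper(void *arg)
{
	(void)arg;
	while (!atomic_load(&done))
		usleep(atomic_load(&sleep_us));
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	unsigned int us;
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, sleeper, NULL);

	for (us = 1; us <= 500; us++) {	/* one step per minute */
		atomic_store(&sleep_us, us);
		sleep(60);
	}

	atomic_store(&done, true);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

)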

The results are here:
http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm

I might try to get some histogram information at a later date.

... Doug




Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-23 Thread Rafael J. Wysocki
On Friday, November 23, 2018 11:35:38 AM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The venerable menu governor does some things that are quite
> questionable in my view.
> 
> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (because it knows
> the time until the next timer event and that is later than the
> expected wakeup time).
> 
> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up.  Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.
> 
> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.
> 
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
> 
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.
> 
> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.
> 
> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account.  It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>    more accurate than the one measured by the core).

This particular change is actually missing, sorry about that.  It is not
essential, however, so the v6 should be good enough as is for evaluation
and review purposes.

>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>    values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>    (exit latency is less than the target residency for all of the idle states
>    known to me anyway, so this doesn't change anything in practice).

Thanks,
Rafael



[RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-23 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The venerable menu governor does some things that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (because it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up.  Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account.  It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.
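
To illustrate, the "match" step amounts to something like the sketch
below (an illustration only, not the governor code itself; expected_us
stands for whatever idle duration estimate has been formed from the
data above):

#include <linux/cpuidle.h>

/*
 * Illustration only: pick the deepest enabled state whose target
 * residency fits within the expected idle duration (residencies
 * ascend with state index).
 */
static int pick_matching_state(struct cpuidle_driver *drv,
			       struct cpuidle_device *dev,
			       unsigned int expected_us)
{
	int i, idx = 0;

	for (i = 1; i < drv->state_count; i++) {
		if (drv->states[i].disabled || dev->states_usage[i].disable)
			continue;

		if (drv->states[i].target_residency > expected_us)
			break;

		idx = i;
	}
	return idx;
}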

Signed-off-by: Rafael J. Wysocki 
---

v5 -> v6:
 * Avoid applying poll_time_limit to non-polling idle states by mistake.
 * Use idle duration measured by the governor for everything (as it likely is
   more accurate than the one measured by the core).
 * Rename SPIKE to PULSE.
 * Do not run pattern detection upfront.  Instead, use recent idle duration
   values to refine the state selection after finding a candidate idle state.
 * Do not use the expected idle duration as an extra latency constraint
   (exit latency is less than the target residency for all of the idle states
   known to me anyway, so this doesn't change anything in practice).

v4 -> v5:
 * Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
 * Make the pattern detection avoid returning too early if the minimum
   sample is too far from the average.
 * Reformat the changelog (as requested by Peter).

v2 -> v3:
 * Simplify the pattern detection code and make it return a value
   lower than the time to the closest timer if the majority of recent
   idle intervals are below it regardless of their variance (that should
   cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle()
   as non-timer.

---
 drivers/cpuidle/Kconfig            |   11 
 drivers/cpuidle/governors/Makefile |    1 
 drivers/cpuidle/governors/teo.c    |  450 +
 3 files changed, 462 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki 
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of 
CPU
+ * wakeups from idle states.  Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, 
then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by taking a few most recent idle time intervals of the
+ * CPU into account.  However, even in that case it is 
