RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On 2018.12.08 02:23 Giovanni Gherdovich wrote:

> sorry for the late reply, this week I was traveling.

No Problem.

Thanks very much for your very detailed reply, which obviously
took considerable time to write. While I was making progress,
your instructions really fill in some gaps and mistakes I was making.
Eventually (probably several days) I'll report back my test results.

> Some specific remarks you raise:
>
> On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
>> ...
>> My issue is that I do not understand the output or how it
>> might correlate with your tables.
>>
>> I get, for example:
>>
>> 31 1 0.13s 0.68s 0.80s 1003894.302 1003779.613
>> 31 1 0.16s 0.64s 0.80s 1008900.053 1008215.336
>> 31 1 0.14s 0.66s 0.80s 1009630.439 1008990.265
>> ...
>>
>> But I don't know what that means, nor have I been able to find
>> a description anywhere.
>
> I don't recognize this output. I hope the illustration above can clarify how
> MMTests is used.

Due to incompetence on my part, the config file being run for my tests was
always just the default config file from my original
git clone https://github.com/gormanm/mmtests.git
command. So regardless of what I thought I was doing, I was
running "pft" (Page Fault Test).

... Doug
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
Hello Doug,

sorry for the late reply, this week I was traveling.

First off, thank you for trying out MMTests; I admit the documentation is
somewhat incomplete. I'm going to give you an overview of how I run benchmarks
with MMTests and how I print comparisons, hoping this can address your
questions.

In the last report I posted the following two tables, for instance; I'll now
show the commands I used to produce them.

> * sockperf on loopback over UDP, mode "throughput"
>     * global-dhp__network-sockperf-unbound
>     48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
>
>                       teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   ------------------------------------------------------------------------------
>   8x-SKYLAKE-UMA      1% worse    1% worse    1% worse    1% worse    10% better
>   80x-BROADWELL-NUMA  3% better   2% better   5% better   3% worse    8% better
>   48x-HASWELL-NUMA    4% better   12% worse   no change   no change   no change
>
> SOCKPERF-UDP-THROUGHPUT
> =======================
> NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
>     message size.
> MEASURES: Throughput, in MBits/second
> HIGHER is better
>
> machine: 8x-SKYLAKE-UMA
>
>                        4.18.0              4.18.0              4.18.0              4.18.0              4.18.0              4.18.0
>                       vanilla                 teo     teo-v2+backport     teo-v3+backport     teo-v5+backport     teo-v6+backport
> ---------------------------------------------------------------------------------------------------------------------------------
> Hmean 14      70.34 (  0.00%)     69.80 * -0.76%*     69.11 * -1.75%*     69.49 * -1.20%*     69.71 * -0.90%*     77.51 * 10.20%*
> Hmean 100    499.24 (  0.00%)    494.26 * -1.00%*    492.74 * -1.30%*    494.90 * -0.87%*    497.43 * -0.36%*    549.93 * 10.15%*
> Hmean 300   1489.13 (  0.00%)   1472.39 * -1.12%*   1468.45 * -1.39%*   1477.74 * -0.76%*   1478.61 * -0.71%*   1632.63 *  9.64%*
> Hmean 500   2469.62 (  0.00%)   2444.41 * -1.02%*   2434.61 * -1.42%*   2454.15 * -0.63%*   2454.76 * -0.60%*   2698.70 *  9.28%*
> Hmean 850   4165.12 (  0.00%)   4123.82 * -0.99%*   4100.37 * -1.55%*   4111.82 * -1.28%*   4120.04 * -1.08%*   4521.11 *  8.55%*

The first table is a panoramic view of all machines, the second is a zoom into
the 8x-SKYLAKE-UMA machine where the overall benchmark score is broken down
into the various message sizes.
The first thing to do is, obviously, to gather data for each kernel. Once the
kernel is installed on the box, as you already figured out, you have to run:

    ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound SOME-MNEMONIC-NAME

In my case, what I did is to run:

    # build, install and boot 4.18.0-vanilla kernel
    ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-vanilla

    # build, install and boot 4.18.0-teo kernel
    ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo

    # build, install and boot 4.18.0-teo-v2+backport kernel
    ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo-v2+backport

    ...

    # build, install and boot 4.18.0-teo-v6+backport kernel
    ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound 4.18.0-teo-v6+backport

At this point in the work/log directory I've accumulated all the data I need
for a report. What's important to note here is that a single configuration
file (such as config-global-dhp__network-sockperf-unbound) often runs more
than a single benchmark, according to the value of the MMTESTS variable in
that config. The config we're using has:

    export MMTESTS="sockperf-tcp-throughput sockperf-tcp-under-load sockperf-udp-throughput sockperf-udp-under-load"

which means it's running 4 different flavors of sockperf. The two tables above
are from the "sockperf-udp-throughput" variant.

Now that we've run the benchmarks for each kernel (every run takes around 75
minutes on my machines) we're ready to extract some comparison tables.

Exploring the work/log directory shows what we've got:

    $ find . -type d -name sockperf\* | sort
    ./sockperf-tcp-throughput-4.18.0-teo
    ./sockperf-tcp-throughput-4.18.0-teo-v2+backport
    ./sockperf-tcp-throughput-4.18.0-teo-v3+backport
    ./sockperf-tcp-throughput-4.18.0-teo-v5+backport
    ./sockperf-tcp-throughput-4.18.0-teo-v6+backport
    ./sockperf-tcp-throughput-4.18.0-vanilla
    ./sockperf-tcp-under-load-4.18.0-teo
    ./sockperf-tcp-under-load-4.18.0-teo-v2+backport
    ./sockperf-tcp-under-load-4.18.0-teo-v3+backport
    ./sockperf-tcp-under-load-4.18.0-teo-v5+backport
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Mon, Dec 03, 2018 at 08:23:56AM -0800, Doug Smythies wrote:
> In the README file, I did see that for reporting I am
> somehow supposed to use compare-kernels.sh, but
> I couldn't figure that out.
>

cd work/log
../../compare-kernels.sh

> By the way, I am running these tests as a regular user, but
> they seem to want to modify:
>
> /sys/kernel/mm/transparent_hugepage/enabled
>

Red herring in this case. Even if transparent hugepages are left as the
default, it still tries to write it stupidly. An irritating, but harmless
bug.

--
Mel Gorman
SUSE Labs
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Thu, Dec 6, 2018 at 12:06 AM Doug Smythies wrote:
>
> On 2018.12.03 03:48 Rafael J. Wysocki wrote:
>
> >>> There is an additional issue where if idle state 0 is disabled (with the
> >>> above suggested code patch),
> >>> idle state usage seems to fall to deeper states than idle state 1.
> >>> This is not the expected behaviour.
> >>
> >> No, it isn't.
> >>
> >>> Kernel 4.20-rc3 works as expected.
> >>> I have not figured this issue out yet, in the code.
> >>>
> >>> Example (1 minute per sample. Number of entries/exits per state):
> >>>  State 0   State 1   State 2   State 3   State 4   Watts
> >>> 28235143,       83,       26,       17,      837, 64.900
> >>>  5583238,   657079,  5884941,  8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
> >>>        0,   793517,  7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
> >>>        0,   795414,  7340703, 10553117, 38513456, 62.050
> >>>        0,   807028,  7288195, 10574113, 38523524, 62.167
> >>>        0,   814983,  7403534, 10575108, 38571228, 62.167
> >>>        0,   838302,  7747127, 10552289, 38556054, 62.183
> >>>  9664999,   544473,  4914512,  6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
> >>> 27893504,       96,       40,        9,      912, 66.500
> >>> 26556343,       83,       29,        7,      814, 66.683
> >>> 27929227,       64,       20,       10,      931, 66.683
> >>
> >> I see.
> >>
> >> OK, I'll look into this too, thanks!
> >
> > This probably is the artifact of the fix for the teo_find_shallower_state()
> > issue.
> >
> > Anyway, I'm not able to reproduce this with the teo_find_shallower_state()
> > issue fixed differently.
>
> I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
> either. Everything is graceful now, as states are disabled:
> (10 seconds per sample. Number of entries/exits per state):
>
>  State 0   State 1   State 2   State 3   State 4   Watts
>        0,        6,        4,        1,      414,  3.700
>        2,        4,       30,        3,      578,  3.700 << No load
>   168619,       37,       39,        4,      480,  5.600 << Transition sample
>  4643618,       45,        8,        1,      137, 61.200 << All idle states enabled
>  4736227,       40,        3,        5,      111, 61.800
>  1888417,  4369314,       25,        2,       89, 62.000 << Transition sample
>        0,  7266864,        9,        0,        0, 62.200 << state 0 disabled
>        0,  7193372,        9,        0,        0, 62.700
>        0,  5539898,  1744007,        0,        0, 63.500 << Transition sample
>        0,        0,  8152956,        0,        0, 63.700 << states 0,1 disabled
>        0,        0,  8015151,        0,        0, 63.900
>        0,        0,  4146806,  6349619,        0, 63.000 << Transition sample
>        0,        0,        0, 13252144,        0, 61.600 << states 0,1,2 disabled
>        0,        0,        0, 13258313,        0, 61.800
>        0,        0,        0, 10417428,  1984451, 61.200 << Transition sample
>        0,        0,        0,        0,  9247172, 58.500 << states 0,1,2,3 disabled
>        0,        0,        0,        0,  9242657, 58.500
>        0,        0,        0,        0,  9233749, 58.600
>        0,        0,        0,        0,  9238444, 58.700
>        0,        0,        0,        0,  9236345, 58.600
>
> For reference, this is kernel 4.20-rc5 (with your other proposed patches):
>
>  State 0   State 1   State 2   State 3   State 4   Watts
>        0,        4,        8,        6,      426,  3.700
>  1592870,      279,      149,       96,      831, 21.800
>  5071279,      154,       25,        6,      105, 61.200
>  5095090,       78,       21,        1,       86, 61.800
>  5001493,       94,       30,        4,      101, 62.200
>   616019,  5446924,        5,        3,       38, 62.500
>        0,  6249752,        0,        0,        0, 63.300
>        0,  6293671,        0,        0,        0, 63.800
>        0,  3751035,  2529964,        0,        0, 64.100
>        0,        0,  6101167,        0,        0, 64.500
>        0,        0,  6172526,        0,        0, 64.700
>        0,        0,  6163797,        0,        0, 64.900
>        0,        0,  1724841,  9567528,        0, 63.300
>        0,        0,        0, 13349668,
RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On 2018.12.03 03:48 Rafael J. Wysocki wrote:

>>> There is an additional issue where if idle state 0 is disabled (with the
>>> above suggested code patch),
>>> idle state usage seems to fall to deeper states than idle state 1.
>>> This is not the expected behaviour.
>>
>> No, it isn't.
>>
>>> Kernel 4.20-rc3 works as expected.
>>> I have not figured this issue out yet, in the code.
>>>
>>> Example (1 minute per sample. Number of entries/exits per state):
>>>  State 0   State 1   State 2   State 3   State 4   Watts
>>> 28235143,       83,       26,       17,      837, 64.900
>>>  5583238,   657079,  5884941,  8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
>>>        0,   793517,  7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
>>>        0,   795414,  7340703, 10553117, 38513456, 62.050
>>>        0,   807028,  7288195, 10574113, 38523524, 62.167
>>>        0,   814983,  7403534, 10575108, 38571228, 62.167
>>>        0,   838302,  7747127, 10552289, 38556054, 62.183
>>>  9664999,   544473,  4914512,  6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
>>> 27893504,       96,       40,        9,      912, 66.500
>>> 26556343,       83,       29,        7,      814, 66.683
>>> 27929227,       64,       20,       10,      931, 66.683
>>
>> I see.
>>
>> OK, I'll look into this too, thanks!
>
> This probably is the artifact of the fix for the teo_find_shallower_state()
> issue.
>
> Anyway, I'm not able to reproduce this with the teo_find_shallower_state()
> issue fixed differently.

I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
either. Everything is graceful now, as states are disabled:
(10 seconds per sample. Number of entries/exits per state):

 State 0   State 1   State 2   State 3   State 4   Watts
       0,        6,        4,        1,      414,  3.700
       2,        4,       30,        3,      578,  3.700 << No load
  168619,       37,       39,        4,      480,  5.600 << Transition sample
 4643618,       45,        8,        1,      137, 61.200 << All idle states enabled
 4736227,       40,        3,        5,      111, 61.800
 1888417,  4369314,       25,        2,       89, 62.000 << Transition sample
       0,  7266864,        9,        0,        0, 62.200 << state 0 disabled
       0,  7193372,        9,        0,        0, 62.700
       0,  5539898,  1744007,        0,        0, 63.500 << Transition sample
       0,        0,  8152956,        0,        0, 63.700 << states 0,1 disabled
       0,        0,  8015151,        0,        0, 63.900
       0,        0,  4146806,  6349619,        0, 63.000 << Transition sample
       0,        0,        0, 13252144,        0, 61.600 << states 0,1,2 disabled
       0,        0,        0, 13258313,        0, 61.800
       0,        0,        0, 10417428,  1984451, 61.200 << Transition sample
       0,        0,        0,        0,  9247172, 58.500 << states 0,1,2,3 disabled
       0,        0,        0,        0,  9242657, 58.500
       0,        0,        0,        0,  9233749, 58.600
       0,        0,        0,        0,  9238444, 58.700
       0,        0,        0,        0,  9236345, 58.600

For reference, this is kernel 4.20-rc5 (with your other proposed patches):

 State 0   State 1   State 2   State 3   State 4   Watts
       0,        4,        8,        6,      426,  3.700
 1592870,      279,      149,       96,      831, 21.800
 5071279,      154,       25,        6,      105, 61.200
 5095090,       78,       21,        1,       86, 61.800
 5001493,       94,       30,        4,      101, 62.200
  616019,  5446924,        5,        3,       38, 62.500
       0,  6249752,        0,        0,        0, 63.300
       0,  6293671,        0,        0,        0, 63.800
       0,  3751035,  2529964,        0,        0, 64.100
       0,        0,  6101167,        0,        0, 64.500
       0,        0,  6172526,        0,        0, 64.700
       0,        0,  6163797,        0,        0, 64.900
       0,        0,  1724841,  9567528,        0, 63.300
       0,        0,        0, 13349668,        0, 62.700
       0,        0,        0, 13360471,        0, 62.700
       0,        0,        0, 13355424,        0, 62.700
       0,        0,        0,  8854491,  3132640, 61.600
       0,
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Thursday, November 29, 2018 12:20:07 AM CET Doug Smythies wrote:
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>    more accurate than the one measured by the core).
>
> -- above missing-- (see follow up e-mail from Rafael)
>
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront. Instead, use recent idle duration
>    values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>    (exit latency is less than the target residency for all of the idle states
>    known to me anyway, so this doesn't change anything in practice).
>
> Hi Rafael,
>
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
>
> Test 1: Phoronix dbench test, all options: 1, 6, 12, 48, 128, 256 clients.
>
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
>
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
>
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 versus
> teov6 at 0.0641%.
>
> For throughput, teov6 was 1.4% faster.

This may indicate that teov6 is somewhat too aggressive.

> Test 3: was an attempt to sweep through a preference for
> all idle states.
>
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
>
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm

And, so long as my understanding of the graphs is correct, the results here
indicate that teov6 tends to prefer relatively shallow idle states which is
good for performance (at least with some workloads), but not necessarily for
energy-efficiency.

I will send a v7 of TEO with some changes to make it a bit more
energy-efficient with respect to the v6.

Thanks,
Rafael
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Friday, November 30, 2018 9:51:19 AM CET Rafael J. Wysocki wrote:
> Hi Doug,
>
> On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies wrote:
> >
> > Hi Rafael,
> >
> > On 2018.11.23 02:36 Rafael J. Wysocki wrote:
> >
> > ... [snip]...
> >
> > > +/**
> > > + * teo_find_shallower_state - Find shallower idle state matching given duration.
> > > + * @drv: cpuidle driver containing state data.
> > > + * @dev: Target CPU.
> > > + * @state_idx: Index of the capping idle state.
> > > + * @duration_us: Idle duration value to match.
> > > + */
> > > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > > +                                    struct cpuidle_device *dev, int state_idx,
> > > +                                    unsigned int duration_us)
> > > +{
> > > +       int i;
> > > +
> > > +       for (i = state_idx - 1; i > 0; i--) {
> > > +               if (drv->states[i].disabled || dev->states_usage[i].disable)
> > > +                       continue;
> > > +
> > > +               if (drv->states[i].target_residency <= duration_us)
> > > +                       break;
> > > +       }
> > > +       return i;
> > > +}
> >
> > I think this subroutine has a problem when idle state 0
> > is disabled.
>
> You are right, thanks!
>
> > Perhaps something like this might help:
> >
> > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > index bc1c9a2..5b97639 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> >  }
> >
> >  /**
> > - * teo_find_shallower_state - Find shallower idle state matching given duration.
> > + * teo_find_shallower_state - Find shallower idle state matching given
> > + *                            duration, if possible.
> >   * @drv: cpuidle driver containing state data.
> >   * @dev: Target CPU.
> >   * @state_idx: Index of the capping idle state.
> > @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
> >  {
> >         int i;
> >
> > -       for (i = state_idx - 1; i > 0; i--) {
> > +       for (i = state_idx - 1; i >= 0; i--) {
> >                 if (drv->states[i].disabled || dev->states_usage[i].disable)
> >                         continue;
> >
> >                 if (drv->states[i].target_residency <= duration_us)
> >                         break;
> >         }
> > +       if (i < 0)
> > +               i = state_idx;
> >         return i;
> >  }
>
> I'll do something slightly similar, but equivalent.

I actually ended up fixing it differently, as the above will cause state_idx
to be returned even if some states shallower than state_idx are enabled, but
their target residencies are higher than duration_us. In that case, though,
it still is more correct to return the shallowest enabled state rather than
state_idx.

> >
> > @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> >                         if (max_early_idx >= 0 &&
> >                             count < cpu_data->states[i].early_hits)
> >                                 count = cpu_data->states[i].early_hits;
> > -
> >                         continue;
> >                 }
> >
> > There is an additional issue where if idle state 0 is disabled (with the
> > above suggested code patch),
> > idle state usage seems to fall to deeper states than idle state 1.
> > This is not the expected behaviour.
>
> No, it isn't.
>
> > Kernel 4.20-rc3 works as expected.
> > I have not figured this issue out yet, in the code.
> >
> > Example (1 minute per sample. Number of entries/exits per state):
> >     State 0    State 1    State 2    State 3    State 4   Watts
> >    28235143,        83,        26,        17,       837,  64.900
> >     5583238,    657079,   5884941,   8498552,  30986831,  62.433 << Transition sample, after idle state 0 disabled
> >           0,    793517,   7186099,  10559878,  38485721,  61.900 << ?? should have all gone into Idle state 1
> >           0,    795414,   7340703,  10553117,  38513456,  62.050
> >           0,    807028,   7288195,  10574113,  38523524,  62.167
> >           0,    814983,   7403534,  10575108,  38571228,  62.167
> >           0,    838302,   7747127,  10552289,  38556054,  62.183
> >     9664999,    544473,   4914512,   6942037,  25295361,  63.633 << Transition sample, after idle state 0 enabled
> >    27893504,        96,        40,         9,       912,  66.500
> >    26556343,        83,        29,         7,       814,  66.683
> >    27929227,        64,        20,        10,       931,  66.683
>
> I see.
>
> OK, I'll look into this too, thanks!

This probably is the artifact of the fix for the teo_find_shallower_state()
issue.

Anyway, I'm not able to reproduce this with the teo_find_shallower_state() issue
fixed differently.

Thanks,
Rafael
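The behaviour described in the reply above (prefer a shallower enabled state whose target residency fits the duration, otherwise fall back to the shallowest enabled state, and only return state_idx when nothing shallower is enabled) can be sketched as follows. This is a minimal illustration with simplified stand-in types: `struct sketch_state` and `sketch_find_shallower_state()` are hypothetical names, the two kernel disable flags are folded into one field, and this is not claimed to be the code that went into the next version of the patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-state data in the patch. */
struct sketch_state {
	bool disabled;                 /* disabled by the driver or by the user */
	unsigned int target_residency; /* in microseconds */
};

/*
 * Walk from the state just below state_idx down to state 0. Every enabled
 * state seen becomes the new candidate, so if no state matches duration_us
 * the result is the shallowest enabled state. If no shallower state is
 * enabled at all, state_idx itself is returned unchanged.
 */
int sketch_find_shallower_state(const struct sketch_state *states,
				int state_idx, unsigned int duration_us)
{
	int i;

	for (i = state_idx - 1; i >= 0; i--) {
		if (states[i].disabled)
			continue;

		state_idx = i;
		if (states[i].target_residency <= duration_us)
			break;
	}
	return state_idx;
}
```

Compared with the patch suggested in the quoted text, the difference is only in the no-match case: the suggested patch returns the original state_idx there, while this sketch returns the shallowest enabled state.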
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Saturday, December 1, 2018 3:18:24 PM CET Giovanni Gherdovich wrote: > On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > [cut] > > > > [snip] > > [NOTE: the tables in this message are quite wide. If this doesn't get to you > properly formatted you can read a copy of this message at the URL > https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ] > > All performance concerns manifested in v5 are wiped out by v6. Not only v6 > improves over v5, but is even better than the baseline (menu) in most > cases. The optimizations in v6 paid off! This is very encouraging, thank you! > The overview of the analysis for v5, from the message > https://lore.kernel.org/lkml/1541877001.17878.5.ca...@suse.cz , was: > > > The quick summary is: > > > > ---> sockperf on loopback over UDP, mode "throughput": > > this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is > > completely > > recovered in v3 and v5. Good stuff. > > > > ---> dbench on xfs: > > this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10% > > regression. Slight improvement. What's really hurting here is the > > single > > client scenario. > > > > ---> netperf-udp on loopback: > > had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what > > happens in v5. > > > > ---> tbench on loopback: > > was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a > > 12% > > regression. As in dbench, it's at low number of clients that the > > results > > are worst. Note that this machine is different from the one that has > > the > > dbench regression. > > now the situation is overturned: > > ---> sockperf on loopback over UDP, mode "throughput": > No new problems from 48x-HASWELL-NUMA, which stays put at the level of > the baseline. OTOH 80x-BROADWELL-NUMA and 8x-SKYLAKE-UMA improve over the > baseline of 8% and 10% respectively. Good. 
> ---> dbench on xfs:
>      48x-HASWELL-NUMA rebounds from the previous 10% degradation and it's now
>      at 0, i.e. the baseline level. The 1-client case, responsible for the
>      previous overall degradation (I average results from different number of
>      clients), went from -40% to -20% and is compensated in my table by
>      improvements with 4, 8, 16 and 32 clients (table below).
>
> ---> netperf-udp on loopback:
>      8x-SKYLAKE-UMA now shows a 9% improvement over baseline.
>      80x-BROADWELL-NUMA, previously similar to baseline, now improves 7%.

Good.

> ---> tbench on loopback:
>      Impressive change of color for 8x-SKYLAKE-UMA, from 12% regression in v5
>      to 7% improvement in v6. The problematic 1- and 2-clients cases went from
>      -25% and -33% to +13% and +10% respectively.

Awesome. :-)

> Details below.
>
> Runs are compared against v4.18 with the Menu governor. I know v4.18 is a
> little old now but that's where I measured my baseline. My machine pool didn't
> change:
>
> * single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
> * two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
> * two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)
>

[cut]

>
>
> PREVIOUSLY REGRESSING BENCHMARKS: OVERVIEW
> ==========================================
>
> * sockperf on loopback over UDP, mode "throughput"
>     * global-dhp__network-sockperf-unbound
>     48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
>
>                         teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   ----------------------------------------------------------------------------
>   8x-SKYLAKE-UMA        1% worse    1% worse    1% worse    1% worse    10% better
>   80x-BROADWELL-NUMA    3% better   2% better   5% better   3% worse    8% better
>   48x-HASWELL-NUMA      4% better   12% worse   no change   no change   no change
>
> * dbench on xfs
>     * global-dhp__io-dbench4-async-xfs
>     48x-HASWELL-NUMA is fixed wrt v5 and earlier versions.
>
>                         teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   ----------------------------------------------------------------------------
>   8x-SKYLAKE-UMA        3% better   4% better   6% better   4% better   5% better
>   80x-BROADWELL-NUMA    no change   no change   1% worse    3% worse    2% better
>   48x-HASWELL-NUMA      6% worse    16% worse   8% worse    10% worse   no change
>
> * netperf on loopback over UDP
>     * global-dhp__network-netperf-unbound
>     8x-SKYLAKE-UMA fixed.
>
>                         teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   ----------------------------------------------------------------------------
>   8x-SKYLAKE-UMA        no change   6% worse    4% worse    6% worse    9% better
>   80x-BROADWELL-NUMA    1% worse    4% worse    no change   no change   7%
RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
Hi Giovanni, Perhaps I should go off-list for this, not sure. I had the thought that I should be able to get similar results as your "8x-SKYLAKE-UMA" on my test computer, i7-2600K. Or that at least it was worth trying, just to see. I couldn't find the same or similar test on Phoronix, and my attempts to do similar, for example, with iperf, didn't show differences between the baseline kernel and one with the teov6 patch. So I tried the test set you referenced [1]: On 2018.12.01 06:18 Giovanni Gherdovich wrote: ... > * netperf on loopback over TCP >* global-dhp__network-netperf-unbound I assume this means that I am supposed to do: cp config-global-dhp__network-netperf-unbound config from the configs directory. Anyway that config file looks correct. Then: ./run-mmtests.sh --no-monitor 3.0-nomonitor ... > * sockperf on loopback over UDP, mode "throughput" >* global-dhp__network-sockperf-unbound Similarly (from the appropriate directories): cp config-global-dhp__network-sockperf-unbound config ./run-mmtests.sh --no-monitor 3.0-nomonitor My issue is that I do not understand the output or how it might correlate with your tables. I get, for example: 31 1 0.13s 0.68s 0.80s 1003894.302 1003779.613 31 1 0.16s 0.64s 0.80s 1008900.053 1008215.336 31 1 0.14s 0.66s 0.80s 1009630.439 1008990.265 ... But I don't know what that means, nor have I been able to find a description anywhere. In the README file, I did see that for reporting I am somehow supposed to use compare-kernels.sh, but I couldn't figure that out. By the way, I am running these tests as a regular user, but they seem to want to modify: /sys/kernel/mm/transparent_hugepage/enabled which requires root privilege. I don't really want to mess with that stuff for these tests. > [1] https://github.com/gormanm/mmtests Can you help me to produce meaningful results to compare with your results? ... Doug
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > The venerable menu governor does some things that are quite > questionable in my view. > > First, it includes timer wakeups in the pattern detection data and > mixes them up with wakeups from other sources which in some cases > causes it to expect what essentially would be a timer wakeup in a > time frame in which no timer wakeups are possible (because it knows > the time until the next timer event and that is later than the > expected wakeup time). > > Second, it uses the extra exit latency limit based on the predicted > idle duration and depending on the number of tasks waiting on I/O, > even though those tasks may run on a different CPU when they are > woken up. Moreover, the time ranges used by it for the sleep length > correction factors depend on whether or not there are tasks waiting > on I/O, which again doesn't imply anything in particular, and they > are not correlated to the list of available idle states in any way > whatever. > > Also, the pattern detection code in menu may end up considering > values that are too large to matter at all, in which cases running > it is a waste of time. > > A major rework of the menu governor would be required to address > these issues and the performance of at least some workloads (tuned > specifically to the current behavior of the menu governor) is likely > to suffer from that. It is thus better to introduce an entirely new > governor without them and let everybody use the governor that works > better with their actual workloads. > > The new governor introduced here, the timer events oriented (TEO) > governor, uses the same basic strategy as menu: it always tries to > find the deepest idle state that can be used in the given conditions. > However, it applies a different approach to that problem. 
> > First, it doesn't use "correction factors" for the time till the > closest timer, but instead it tries to correlate the measured idle > duration values with the available idle states and use that > information to pick up the idle state that is most likely to "match" > the upcoming CPU idle interval. > > Second, it doesn't take the number of "I/O waiters" into account at > all and the pattern detection code in it avoids taking timer wakeups > into account. It also only uses idle duration values less than the > current time till the closest timer (with the tick excluded) for that > purpose. > > Signed-off-by: Rafael J. Wysocki > --- > > v5 -> v6: > * Avoid applying poll_time_limit to non-polling idle states by mistake. > * Use idle duration measured by the governor for everything (as it likely is >more accurate than the one measured by the core). > * Rename SPIKE to PULSE. > * Do not run pattern detection upfront. Instead, use recent idle duration >values to refine the state selection after finding a candidate idle state. > * Do not use the expected idle duration as an extra latency constraint >(exit latency is less than the target residency for all of the idle states >known to me anyway, so this doesn't change anything in practice). > > v4 -> v5: > * Avoid using shallow idle states when the tick has been stopped already. > > v3 -> v4: > * Make the pattern detection avoid returning too early if the minimum >sample is too far from the average. > * Reformat the changelog (as requested by Peter). > > v2 -> v3: > * Simplify the pattern detection code and make it return a value > lower than the time to the closest timer if the majority of recent > idle intervals are below it regardless of their variance (that should > cause it to be slightly more aggressive). > * Do not count wakeups from state 0 due to the time limit in poll_idle() >as non-timer. > > [snip] [NOTE: the tables in this message are quite wide. 
If this doesn't get to you properly formatted you can read a copy of this message at the URL https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ] All performance concerns manifested in v5 are wiped out by v6. Not only v6 improves over v5, but is even better than the baseline (menu) in most cases. The optimizations in v6 paid off! The overview of the analysis for v5, from the message https://lore.kernel.org/lkml/1541877001.17878.5.ca...@suse.cz , was: > The quick summary is: > > ---> sockperf on loopback over UDP, mode "throughput": > this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is completely > recovered in v3 and v5. Good stuff. > > ---> dbench on xfs: > this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10% > regression. Slight improvement. What's really hurting here is the single > client scenario. > > ---> netperf-udp on loopback: > had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what > happens in v5. > > ---> tbench on loopback: > was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
Hi Doug, On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies wrote: > > Hi Rafael, > > On 2018.11.23 02:36 Rafael J. Wysocki wrote: > > ... [snip]... > > > +/** > > + * teo_find_shallower_state - Find shallower idle state matching given > > duration. > > + * @drv: cpuidle driver containing state data. > > + * @dev: Target CPU. > > + * @state_idx: Index of the capping idle state. > > + * @duration_us: Idle duration value to match. > > + */ > > +static int teo_find_shallower_state(struct cpuidle_driver *drv, > > + struct cpuidle_device *dev, int state_idx, > > + unsigned int duration_us) > > +{ > > + int i; > > + > > + for (i = state_idx - 1; i > 0; i--) { > > + if (drv->states[i].disabled || dev->states_usage[i].disable) > > + continue; > > + > > + if (drv->states[i].target_residency <= duration_us) > > + break; > > + } > > + return i; > > +} > > I think this subroutine has a problem when idle state 0 > is disabled. You are right, thanks! > Perhaps something like this might help: > > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c > index bc1c9a2..5b97639 100644 > --- a/drivers/cpuidle/governors/teo.c > +++ b/drivers/cpuidle/governors/teo.c > @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct > cpuidle_device *dev) > } > > /** > - * teo_find_shallower_state - Find shallower idle state matching given > duration. > + * teo_find_shallower_state - Find shallower idle state matching given > + * duration, if possible. > * @drv: cpuidle driver containing state data. > * @dev: Target CPU. > * @state_idx: Index of the capping idle state. 
> @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct > cpuidle_driver *drv, > { > int i; > > - for (i = state_idx - 1; i > 0; i--) { > + for (i = state_idx - 1; i >= 0; i--) { > if (drv->states[i].disabled || dev->states_usage[i].disable) > continue; > > if (drv->states[i].target_residency <= duration_us) > break; > } > + if (i < 0) > + i = state_idx; > return i; > } I'll do something slightly similar, but equivalent. > > @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct > cpuidle_device *dev, > if (max_early_idx >= 0 && > count < cpu_data->states[i].early_hits) > count = cpu_data->states[i].early_hits; > - > continue; > } > > There is an additional issue where if idle state 0 is disabled (with the > above suggested code patch), > idle state usage seems to fall to deeper states than idle state 1. > This is not the expected behaviour. No, it isn't. > Kernel 4.20-rc3 works as expected. > I have not figured this issue out yet, in the code. > > Example (1 minute per sample. Number of entries/exits per state): > State 0 State 1 State 2 State 3 State 4Watts >28235143, 83, 26, 17,837, 64.900 > 5583238, 657079,5884941,8498552, 30986831, 62.433 << > Transition sample, after idle state 0 disabled > 0, 793517,7186099, 10559878, 38485721, 61.900 << ?? > should have all gone into Idle state 1 > 0, 795414,7340703, 10553117, 38513456, 62.050 > 0, 807028,7288195, 10574113, 38523524, 62.167 > 0, 814983,7403534, 10575108, 38571228, 62.167 > 0, 838302,7747127, 10552289, 38556054, 62.183 > 9664999, 544473,4914512,6942037, 25295361, 63.633 << > Transition sample, after idle state 0 enabled >27893504, 96, 40, 9,912, 66.500 >26556343, 83, 29, 7,814, 66.683 >27929227, 64, 20, 10,931, 66.683 I see. OK, I'll look into this too, thanks!
RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
Hi Rafael,

On 2018.11.23 02:36 Rafael J. Wysocki wrote:

... [snip]...

> +/**
> + * teo_find_shallower_state - Find shallower idle state matching given duration.
> + * @drv: cpuidle driver containing state data.
> + * @dev: Target CPU.
> + * @state_idx: Index of the capping idle state.
> + * @duration_us: Idle duration value to match.
> + */
> +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> +				    struct cpuidle_device *dev, int state_idx,
> +				    unsigned int duration_us)
> +{
> +	int i;
> +
> +	for (i = state_idx - 1; i > 0; i--) {
> +		if (drv->states[i].disabled || dev->states_usage[i].disable)
> +			continue;
> +
> +		if (drv->states[i].target_residency <= duration_us)
> +			break;
> +	}
> +	return i;
> +}

I think this subroutine has a problem when idle state 0
is disabled.

Perhaps something like this might help:

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index bc1c9a2..5b97639 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 }

 /**
- * teo_find_shallower_state - Find shallower idle state matching given duration.
+ * teo_find_shallower_state - Find shallower idle state matching given
+ * duration, if possible.
  * @drv: cpuidle driver containing state data.
  * @dev: Target CPU.
  * @state_idx: Index of the capping idle state.
@@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
 {
 	int i;

-	for (i = state_idx - 1; i > 0; i--) {
+	for (i = state_idx - 1; i >= 0; i--) {
 		if (drv->states[i].disabled || dev->states_usage[i].disable)
 			continue;

 		if (drv->states[i].target_residency <= duration_us)
 			break;
 	}
+	if (i < 0)
+		i = state_idx;
 	return i;
 }

@@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 			if (max_early_idx >= 0 &&
 			    count < cpu_data->states[i].early_hits)
 				count = cpu_data->states[i].early_hits;
-
 			continue;
 		}

There is an additional issue where if idle state 0 is disabled (with the
above suggested code patch),
idle state usage seems to fall to deeper states than idle state 1.
This is not the expected behaviour.
Kernel 4.20-rc3 works as expected.
I have not figured this issue out yet, in the code.

Example (1 minute per sample. Number of entries/exits per state):

  State 0   State 1   State 2   State 3   State 4    Watts
 28235143,       83,       26,       17,      837,  64.900
  5583238,   657079,  5884941,  8498552, 30986831,  62.433 << Transition sample, after idle state 0 disabled
        0,   793517,  7186099, 10559878, 38485721,  61.900 << ?? should have all gone into Idle state 1
        0,   795414,  7340703, 10553117, 38513456,  62.050
        0,   807028,  7288195, 10574113, 38523524,  62.167
        0,   814983,  7403534, 10575108, 38571228,  62.167
        0,   838302,  7747127, 10552289, 38556054,  62.183
  9664999,   544473,  4914512,  6942037, 25295361,  63.633 << Transition sample, after idle state 0 enabled
 27893504,       96,       40,        9,      912,  66.500
 26556343,       83,       29,        7,      814,  66.683
 27929227,       64,       20,       10,      931,  66.683

... Doug
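[Editorial sketch, not part of the original message.] The loop-bound problem discussed above can be reproduced outside the kernel. The following is a minimal user-space model, assuming a simplified per-state record: struct toy_state, find_shallower_v6(), find_shallower_fixed() and check_examples() are hypothetical names made up for illustration, and the single "disabled" flag stands in for both drv->states[i].disabled and dev->states_usage[i].disable. Only the loop bounds and the fallback mirror the quoted code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-state data the governor consults. */
struct toy_state {
	bool disabled;
	unsigned int target_residency_us;
};

/* Loop as posted in v6: it stops at i > 0, so index 0 is never checked
 * and is returned even when state 0 is disabled. */
static int find_shallower_v6(const struct toy_state *s, int state_idx,
			     unsigned int duration_us)
{
	int i;

	for (i = state_idx - 1; i > 0; i--) {
		if (s[i].disabled)
			continue;

		if (s[i].target_residency_us <= duration_us)
			break;
	}
	return i;
}

/* Loop with the suggested change: state 0 is checked like any other
 * state, and the capping state is returned when nothing shallower fits. */
static int find_shallower_fixed(const struct toy_state *s, int state_idx,
				unsigned int duration_us)
{
	int i;

	for (i = state_idx - 1; i >= 0; i--) {
		if (s[i].disabled)
			continue;

		if (s[i].target_residency_us <= duration_us)
			break;
	}
	if (i < 0)
		i = state_idx;
	return i;
}

/* Small self-check: state 0 is disabled, states 1 and 2 are enabled. */
void check_examples(void)
{
	const struct toy_state s[] = {
		{ true, 0 }, { false, 20 }, { false, 200 },
	};

	/* 5 us fits no enabled state below index 2. */
	assert(find_shallower_v6(s, 2, 5) == 0);    /* disabled state 0 */
	assert(find_shallower_fixed(s, 2, 5) == 2); /* falls back to cap */
	/* 50 us fits state 1 in the fixed variant. */
	assert(find_shallower_fixed(s, 2, 50) == 1);
}
```

The model says nothing about the second issue reported above (usage drifting to states deeper than state 1 once state 0 is disabled); that behaviour comes from the selection code, which is not modelled here.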
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
Hi Doug,

On Thu, Nov 29, 2018 at 12:20 AM Doug Smythies wrote:
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> v5 -> v6:
> * Avoid applying poll_time_limit to non-polling idle states by mistake.
> * Use idle duration measured by the governor for everything (as it likely is
>   more accurate than the one measured by the core).
>
> -- above missing-- (see follow up e-mail from Rafael)
>
> * Rename SPIKE to PULSE.
> * Do not run pattern detection upfront. Instead, use recent idle duration
>   values to refine the state selection after finding a candidate idle state.
> * Do not use the expected idle duration as an extra latency constraint
>   (exit latency is less than the target residency for all of the idle states
>   known to me anyway, so this doesn't change anything in practice).
>
> Hi Rafael,
>
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
>
> Test 1: Phoronix dbench test, all options: 1, 6, 12, 48, 128, 256 clients.
>
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
>
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
>
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 versus
> teov6 at 0.0641%.
>
> For throughput, teov6 was 1.4% faster.
>
> Test 3: was an attempt to sweep through a preference for
> all idle states.
>
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
>
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm
>
> I might try to get some histogram information at a later date.

Thank you for the results, much appreciated!
RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On 2018.11.23 02:36 Rafael J. Wysocki wrote:

v5 -> v6:
* Avoid applying poll_time_limit to non-polling idle states by mistake.
* Use idle duration measured by the governor for everything (as it likely is
  more accurate than the one measured by the core).

-- above missing-- (see follow up e-mail from Rafael)

* Rename SPIKE to PULSE.
* Do not run pattern detection upfront. Instead, use recent idle duration
  values to refine the state selection after finding a candidate idle state.
* Do not use the expected idle duration as an extra latency constraint
  (exit latency is less than the target residency for all of the idle states
  known to me anyway, so this doesn't change anything in practice).

Hi Rafael,

I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
reference kernel.

Test 1: Phoronix dbench test, all options: 1, 6, 12, 48, 128, 256 clients.

Note: because it uses the disk, the dbench test is somewhat non-repeatable.
However, if particular attention is paid to not doing anything else with
the disk between tests, then it seems to be repeatable to within about 6%.

Anyway no significant difference observed between kernel 4.20-rc3 and the
same with the teov6 patch.

Test 2: Pipe test, non cross core. (And idle state 0 test, really)
I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
Thus, pretty much only idle state 0 was ever used.
Processor package power was similar for both kernels.
teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
-rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
There was a difference in percentage time spent in idle state 0,
with kernel 4.20-rc3 spending 0.2441% in idle state 0 versus
teov6 at 0.0641%.

For throughput, teov6 was 1.4% faster.

Test 3: was an attempt to sweep through a preference for
all idle states.

40 threads were launched with nothing to do except sleep
for a variable duration of 1 to 500 uSec, each step was
run for 1 minute. With 1 minute idle before the test and a few
minutes idle after, the total test duration was about 505 minutes.
Recall that when one asks for a short sleep of 1 uSec, they actually
get about 50 uSec, due to overheads. So I use 40 threads in an attempt
to get the average time between wakeup events per CPU down somewhat.

The results are here:
http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm

I might try to get some histogram information at a later date.

... Doug
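[Editorial sketch, not part of the original message.] The remark that a 1 uSec sleep request really takes much longer can be checked with a few lines of user-space C. This is not the test program used for the sweep above, which is not shown in the thread; measure_sleep_ns() and average_sleep_ns() are hypothetical helper names, and the choice of nanosleep() with CLOCK_MONOTONIC is an assumption. The overshoot observed depends on the machine, the kernel configuration and the timer slack, so no particular figure is claimed here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Return the elapsed monotonic time, in nanoseconds, of one nanosleep()
 * request of request_ns (must be below one second). If a signal interrupts
 * the sleep, the result may be shorter than requested. */
int64_t measure_sleep_ns(long request_ns)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = request_ns };
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	nanosleep(&req, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (int64_t)(t1.tv_nsec - t0.tv_nsec);
}

/* Average the elapsed time over several identical requests. */
int64_t average_sleep_ns(long request_ns, int samples)
{
	int64_t total = 0;
	int i;

	for (i = 0; i < samples; i++)
		total += measure_sleep_ns(request_ns);

	return total / samples;
}
```

Calling average_sleep_ns(1000, 1000) and comparing the result with the 1000 ns requested gives the per-request overhead on the machine at hand, which is the reason given above for running many threads to shorten the average interval between wakeups per CPU.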
Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On Friday, November 23, 2018 11:35:38 AM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki
>
> The venerable menu governor does some things that are quite
> questionable in my view.
>
> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (because it knows
> the time until the next timer event and that is later than the
> expected wakeup time).
>
> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up. Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.
>
> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.
>
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that. It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
>
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.
>
> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.
>
> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account. It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.
>
> Signed-off-by: Rafael J. Wysocki
> ---
>
> v5 -> v6:
> * Avoid applying poll_time_limit to non-polling idle states by mistake.
> * Use idle duration measured by the governor for everything (as it likely is
>   more accurate than the one measured by the core).

This particular change is actually missing, sorry about that.

It is not essential, however, so the v6 should be good enough as is for
evaluation and review purposes.

> * Rename SPIKE to PULSE.
> * Do not run pattern detection upfront. Instead, use recent idle duration
>   values to refine the state selection after finding a candidate idle state.
> * Do not use the expected idle duration as an extra latency constraint
>   (exit latency is less than the target residency for all of the idle states
>   known to me anyway, so this doesn't change anything in practice).

Thanks,
Rafael
[RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
From: Rafael J. Wysocki

The venerable menu governor does some things that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (because it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up. Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that. It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account. It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.

Signed-off-by: Rafael J. Wysocki
---

v5 -> v6:
* Avoid applying poll_time_limit to non-polling idle states by mistake.
* Use idle duration measured by the governor for everything (as it likely is
  more accurate than the one measured by the core).
* Rename SPIKE to PULSE.
* Do not run pattern detection upfront. Instead, use recent idle duration
  values to refine the state selection after finding a candidate idle state.
* Do not use the expected idle duration as an extra latency constraint
  (exit latency is less than the target residency for all of the idle states
  known to me anyway, so this doesn't change anything in practice).

v4 -> v5:
* Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
* Make the pattern detection avoid returning too early if the minimum
  sample is too far from the average.
* Reformat the changelog (as requested by Peter).

v2 -> v3:
* Simplify the pattern detection code and make it return a value
  lower than the time to the closest timer if the majority of recent
  idle intervals are below it regardless of their variance (that should
  cause it to be slightly more aggressive).
* Do not count wakeups from state 0 due to the time limit in poll_idle()
  as non-timer.

---
 drivers/cpuidle/Kconfig            |   11
 drivers/cpuidle/governors/Makefile |    1
 drivers/cpuidle/governors/teo.c    |  450 +
 3 files changed, 462 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of CPU
+ * wakeups from idle states. Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by taking a few most recent idle time intervals of the
+ * CPU into account. However, even in that case it is
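[Editorial sketch, not part of the original message.] The starting point named in the header comment above, "the deepest idle state with target residency within the time to the closest timer", can be illustrated with a small user-space model. This is not the TEO selection code: struct toy_idle_state and deepest_fitting_state() are hypothetical names, the latency limit parameter is an assumption added for illustration, and the refinement from recent idle intervals that the governor performs afterwards is left out entirely. States are assumed to be ordered from shallowest (index 0) to deepest.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the idle state parameters a driver exposes. */
struct toy_idle_state {
	bool disabled;
	unsigned int target_residency_us;
	unsigned int exit_latency_us;
};

/* Return the index of the deepest enabled state whose target residency
 * fits within time_to_timer_us and whose exit latency does not exceed
 * latency_limit_us, or -1 if no state qualifies. */
int deepest_fitting_state(const struct toy_idle_state *states, int count,
			  unsigned int time_to_timer_us,
			  unsigned int latency_limit_us)
{
	int best = -1;
	int i;

	for (i = 0; i < count; i++) {
		if (states[i].disabled)
			continue;

		if (states[i].target_residency_us > time_to_timer_us ||
		    states[i].exit_latency_us > latency_limit_us)
			continue;

		/* Later indices are deeper, so keep the last match. */
		best = i;
	}
	return best;
}
```

In this model a wakeup that arrives well before the closest timer makes the chosen state too deep, which is the case the description above addresses by correlating measured idle durations with the available states.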
[RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
From: Rafael J. Wysocki The venerable menu governor does some thigns that are quite questionable in my view. First, it includes timer wakeups in the pattern detection data and mixes them up with wakeups from other sources which in some cases causes it to expect what essentially would be a timer wakeup in a time frame in which no timer wakeups are possible (becuase it knows the time until the next timer event and that is later than the expected wakeup time). Second, it uses the extra exit latency limit based on the predicted idle duration and depending on the number of tasks waiting on I/O, even though those tasks may run on a different CPU when they are woken up. Moreover, the time ranges used by it for the sleep length correction factors depend on whether or not there are tasks waiting on I/O, which again doesn't imply anything in particular, and they are not correlated to the list of available idle states in any way whatever. Also, the pattern detection code in menu may end up considering values that are too large to matter at all, in which cases running it is a waste of time. A major rework of the menu governor would be required to address these issues and the performance of at least some workloads (tuned specifically to the current behavior of the menu governor) is likely to suffer from that. It is thus better to introduce an entirely new governor without them and let everybody use the governor that works better with their actual workloads. The new governor introduced here, the timer events oriented (TEO) governor, uses the same basic strategy as menu: it always tries to find the deepest idle state that can be used in the given conditions. However, it applies a different approach to that problem. 
First, it doesn't use "correction factors" for the time till the closest timer, but instead it tries to correlate the measured idle duration values with the available idle states and use that information to pick up the idle state that is most likely to "match" the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at all and the pattern detection code in it avoids taking timer wakeups into account. It also only uses idle duration values less than the current time till the closest timer (with the tick excluded) for that purpose.

Signed-off-by: Rafael J. Wysocki
---

v5 -> v6:
 * Avoid applying poll_time_limit to non-polling idle states by mistake.
 * Use idle duration measured by the governor for everything (as it likely is more accurate than the one measured by the core).
 * Rename SPIKE to PULSE.
 * Do not run pattern detection upfront. Instead, use recent idle duration values to refine the state selection after finding a candidate idle state.
 * Do not use the expected idle duration as an extra latency constraint (exit latency is less than the target residency for all of the idle states known to me anyway, so this doesn't change anything in practice).

v4 -> v5:
 * Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
 * Make the pattern detection avoid returning too early if the minimum sample is too far from the average.
 * Reformat the changelog (as requested by Peter).

v2 -> v3:
 * Simplify the pattern detection code and make it return a value lower than the time to the closest timer if the majority of recent idle intervals are below it regardless of their variance (that should cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle() as non-timer.
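
For readers who want the selection strategy in concrete terms, here is a minimal user-space sketch of it. It is not taken from teo.c: the struct, the function names (sketch_deepest_state, sketch_refine_state) and the plain "majority plus average" refinement rule are all assumptions made for illustration. It only tries to show the two ideas stated above: pick the deepest state whose target residency fits within the time to the closest timer, then refine that choice using only recent idle durations shorter than that time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified state description (not struct cpuidle_state). */
struct sketch_idle_state {
	unsigned int target_residency_us;
	unsigned int exit_latency_us;
};

/*
 * Return the index of the deepest state whose target residency fits within
 * duration_us and whose exit latency does not exceed latency_limit_us.
 * States are assumed to be sorted from shallowest to deepest; state 0 is
 * the fallback.
 */
int sketch_deepest_state(const struct sketch_idle_state *states, size_t count,
			 unsigned int duration_us,
			 unsigned int latency_limit_us)
{
	int idx = 0;
	size_t i;

	assert(states && count > 0);

	for (i = 0; i < count; i++) {
		if (states[i].target_residency_us > duration_us)
			break;
		if (states[i].exit_latency_us > latency_limit_us)
			break;
		idx = (int)i;
	}
	return idx;
}

/*
 * Assumed refinement rule (illustration only): consider just the recent
 * idle durations shorter than the time to the closest timer.  If they are
 * the majority of the samples, select a state for their average instead
 * of the full sleep length.
 */
int sketch_refine_state(const struct sketch_idle_state *states, size_t count,
			const unsigned int *recent_us, size_t nr_recent,
			unsigned int sleep_length_us,
			unsigned int latency_limit_us)
{
	unsigned long long sum = 0;
	size_t below = 0, i;

	for (i = 0; i < nr_recent; i++) {
		if (recent_us[i] < sleep_length_us) {
			sum += recent_us[i];
			below++;
		}
	}

	/* A strict majority implies below >= 1, so the division is safe. */
	if (below * 2 > nr_recent)
		sleep_length_us = (unsigned int)(sum / below);

	return sketch_deepest_state(states, count, sleep_length_us,
				    latency_limit_us);
}
```

With four states whose target residencies are 0, 20, 200 and 1000 us and a 2000 us sleep length, the first function alone selects the deepest state, while a history dominated by wakeups of roughly 40 us makes the refinement step fall back to the second-shallowest one. The actual patch tracks more information than this (for instance it separates timer wakeups from other wakeups), so treat the sketch as a reading aid only.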
---
 drivers/cpuidle/Kconfig            |   11
 drivers/cpuidle/governors/Makefile |    1
 drivers/cpuidle/governors/teo.c    |  450 +
 3 files changed, 462 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of CPU
+ * wakeups from idle states. Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by taking a few most recent idle time intervals of the
+ * CPU into account. However, even in that case it is