Re: [RFC v4 0/2] CPU-Idle latency selftest framework
Hi Pratik,

V4 seems fine. Thank you.

On Mon, Apr 12, 2021 at 12:43 AM Pratik Rajesh Sampat wrote:
>
> Changelog v3-->v4
> Based on review comments by Doug Smythies,
> 1. Parsing the thread_siblings_list for CPU topology information to
>    correctly identify the cores the test should run on in
>    default(quick) mode.
> 2. The source CPU to source CPU interaction in the IPI test will always
>    result in a lower latency and cause a bias in the average, hence
>    avoid adding the latency to be averaged for same cpu IPIs. The
>    latency will still be displayed in the detailed logs.
>
> RFC v3: https://lkml.org/lkml/2021/4/4/31

Example output for an Intel i5-10600K, HWP active, performance mode.
System very idle:

$ sudo ./cpuidle.sh -v -i
Inserting /lib/modules/5.12.0-rc7-prs-v4/kernel/drivers/cpuidle/test-cpuidle_latency.ko module
--IPI Latency Test---
Baseline Avg IPI latency(ns): 686
Observed Avg IPI latency(ns) - State 0: 468
Observed Avg IPI latency(ns) - State 1: 956
Observed Avg IPI latency(ns) - State 2: 17936
Observed Avg IPI latency(ns) - State 3: 17968
--Timeout Latency Test--
Baseline Avg timeout diff(ns): 445
Observed Avg timeout diff(ns) - State 0: 377
Observed Avg timeout diff(ns) - State 1: 630
Observed Avg timeout diff(ns) - State 2: 322812
Observed Avg timeout diff(ns) - State 3: 306067
Removing /lib/modules/5.12.0-rc7-prs-v4/kernel/drivers/cpuidle/test-cpuidle_latency.ko module
Full Output logged at: cpuidle.log

$ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/residency
/sys/devices/system/cpu/cpu7/cpuidle/state0/residency:0
/sys/devices/system/cpu/cpu7/cpuidle/state1/residency:1
/sys/devices/system/cpu/cpu7/cpuidle/state2/residency:360
/sys/devices/system/cpu/cpu7/cpuidle/state3/residency:3102

$ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/latency
/sys/devices/system/cpu/cpu7/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu7/cpuidle/state1/latency:1
/sys/devices/system/cpu/cpu7/cpuidle/state2/latency:120
/sys/devices/system/cpu/cpu7/cpuidle/state3/latency:1034

... Doug
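The grep output above is all the per-state data the test consumes from sysfs. A minimal Python sketch (illustrative only; the selftest itself is a shell script) that gathers the same values for one CPU. The sysfs_root parameter is my addition, purely so the function can be exercised against a fake tree:

```python
from pathlib import Path

def cpuidle_states(cpu, sysfs_root="/sys"):
    """Collect each idle state's exit latency and target residency
    (both in microseconds) for one CPU, as grep'd from sysfs above."""
    base = Path(sysfs_root) / f"devices/system/cpu/cpu{cpu}/cpuidle"
    states = []
    # Sort numerically so state10 does not land before state2.
    for d in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        states.append({
            "state": d.name,
            "latency_us": int((d / "latency").read_text()),
            "residency_us": int((d / "residency").read_text()),
        })
    return states
```

On the i5-10600K output above this would return four states for cpu7, with latencies 0, 1, 120 and 1034 us.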
Re: [RFC v3 0/2] CPU-Idle latency selftest framework
On Fri, Apr 9, 2021 at 12:43 AM Pratik Sampat wrote:
> On 09/04/21 10:53 am, Doug Smythies wrote:
> > I tried V3 on a Intel i5-10600K processor with 6 cores and 12 CPUs.
> > The core to cpu mappings are:
> > core 0 has cpus 0 and 6
> > core 1 has cpus 1 and 7
> > core 2 has cpus 2 and 8
> > core 3 has cpus 3 and 9
> > core 4 has cpus 4 and 10
> > core 5 has cpus 5 and 11
> >
> > By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
> > Wouldn't it make more sense to test each core once?
>
> Ideally it would be better to run on all the CPUs, however on larger systems
> that I'm testing on, with hundreds of cores and a high thread count, the
> execution time increases while not particularly bringing any additional
> information to the table.
>
> That is why it made sense to run on only one of the threads of each core, to
> make the experiment faster while preserving accuracy.
>
> To handle various thread topologies it may be worthwhile to parse
> /sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
> use this information to run only once per physical core, rather than
> assuming the topology.
>
> What are your thoughts on a mechanism like this?

Yes, seems like a good solution.

... Doug
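The mechanism Pratik proposes can be sketched as follows (illustrative Python, not the selftest's shell implementation; the helper names are mine): expand each CPU's thread_siblings_list and keep only the lowest-numbered CPU of every sibling group, giving one CPU per physical core.

```python
def parse_cpu_list(s):
    """Expand a sysfs CPU list such as '0,6' or '0-3,8' into CPU numbers."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered CPU of each thread-sibling group, so the
    test runs once per physical core regardless of thread topology.
    sibling_lists maps a CPU number to the contents of its
    topology/thread_siblings_list file."""
    chosen = set()
    for siblings in sibling_lists.values():
        chosen.add(min(parse_cpu_list(siblings)))
    return sorted(chosen)
```

On the i5-10600K topology listed above (cpu0 and cpu6 sharing core 0, and so on), this selects CPUs 0 through 5, one per core.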
Re: [RFC v3 0/2] CPU-Idle latency selftest framework
Hi Pratik,

I tried V3 on a Intel i5-10600K processor with 6 cores and 12 CPUs.
The core to cpu mappings are:
core 0 has cpus 0 and 6
core 1 has cpus 1 and 7
core 2 has cpus 2 and 8
core 3 has cpus 3 and 9
core 4 has cpus 4 and 10
core 5 has cpus 5 and 11

By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
Wouldn't it make more sense to test each core once?

With the source CPU always 0, I think the results from the destination
CPUs 0 and 6, on core 0, bias the results, at least in the deeper idle
states. They don't make much difference in the shallow states. Myself,
I wouldn't include them in the results.

Example, where I used the -v option for all CPUs:

--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          101
0         1          790
0         2          609
0         3          595
0         4          737
0         5          759
0         6          780
0         7          741
0         8          574
0         9          681
0         10         527
0         11         552
Baseline Avg IPI latency(ns): 620     suggest 656 here
---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          471
0         2          420
0         3          462
0         4          454
0         5          468
0         6          453
0         7          473
0         8          380
0         9          483
0         10         492
0         11         454
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 423     suggest 456 here
---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          112
0         1          866
0         2          663
0         3          851
0         4          1090
0         5          1314
0         6          1941
0         7          1458
0         8          687
0         9          802
0         10         1041
0         11         1284
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1009     suggest 1006 here
---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          75
0         1          16362
0         2          16785
0         3          19650
0         4          17356
0         5          17606
0         6          2217
0         7          17958
0         8          17332
0         9          16615
0         10         17382
0         11         17423
Expected IPI latency(ns): 12
Observed Avg IPI latency(ns) - State 2: 14730     suggest 17447 here
---Enabling state: 3---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          103
0         1          17416
0         2          17961
0         3          16651
0         4          17867
0         5          17726
0         6          2178
0         7          16620
0         8          20951
0         9          16567
0         10         17131
0         11         17563
Expected IPI latency(ns): 1034000
Observed Avg IPI latency(ns) - State 3: 14894     suggest 17645 here

Hope this helps.

... Doug
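The exclusion Doug suggests is easy to state: drop destinations that share a core with the source CPU before averaging. A small sketch (function name is mine), fed with the state-0 numbers from the table above; with CPUs 0 and 6 excluded it reproduces the suggested 456.

```python
from statistics import mean

def avg_ipi_latency(samples, skip_dests):
    """Average IPI latency over (dest_cpu, latency_ns) pairs, excluding
    destinations that share a core with the source CPU, since those
    same-core IPIs are biased low as described above."""
    kept = [lat for dest, lat in samples if dest not in skip_dests]
    return round(mean(kept))

# State-0 numbers from the table above; on this i5-10600K, CPUs 0 and 6
# share core 0 with the source CPU.
state0 = [(0, 76), (1, 471), (2, 420), (3, 462), (4, 454), (5, 468),
          (6, 453), (7, 473), (8, 380), (9, 483), (10, 492), (11, 454)]
```

Calling avg_ipi_latency(state0, {0, 6}) gives 456, matching the suggested value (rounded rather than truncated).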
Re: [RFC v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Pratik,

On Thu, Apr 1, 2021 at 4:45 AM Pratik Rajesh Sampat wrote:
> ...
> To run this test specifically:
> $ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

I have not become any smarter than I was with version 1, and still
assumed that the "$" meant regular user. Please put it as "#" or
separate the two steps, compile and run.

>
> There are a few optinal arguments too that the script can take

optional

Suggest to also specifically mention how to run without re-compile:

# ./cpuidle.sh -v

Note also that the test still leaves all idle states disabled when done.

> [-h ]
> [-i ]
> [-m ]
> [-o ]
> [-v (run on all cpus)]
> Default Output location in: tools/testing/selftest/cpuidle/cpuidle.log

...

> +cpu_is_online()
> +{
> +	cpu=$1
> +	if [ ! -f "/sys/devices/system/cpu/cpu$cpu/online" ]; then
> +		echo 0

Incorrect. Should be:

> +		echo 1

... Doug
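The quoted fix inverts the fallback: a CPU with no "online" control file (the boot CPU on many systems) cannot be offlined and is therefore always online, so the helper must report 1, not 0. The same logic, sketched in Python for clarity (the selftest helper itself is shell; the sysfs_root parameter is mine, for testability):

```python
import os

def cpu_is_online(cpu, sysfs_root="/sys"):
    """Return whether a CPU is online.  A CPU without an 'online' control
    file cannot be offlined, so it must be treated as online -- this is
    the 'echo 1' correction above; the original 'echo 0' wrongly reported
    such a CPU as offline."""
    path = os.path.join(sysfs_root, f"devices/system/cpu/cpu{cpu}/online")
    if not os.path.exists(path):
        return True
    with open(path) as f:
        return f.read().strip() == "1"
```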
Re: [3/3,v3] tools/power turbostat: Enable accumulate RAPL display
Just resending to previously missed people who should also test this.
(See other e-mail: Re: turbostat: Fix Pkg Power on Zen)

On Sat, Mar 13, 2021 at 5:49 AM youling 257 wrote:
>
> test this patch, turbostat can work.
>
> 2021-03-12 21:41 GMT+08:00, Chen Yu :
> > Hi Youling, Bas, and Bingsong,
> > On Wed, Mar 10, 2021 at 04:03:31PM -0800, Doug Smythies wrote:
> >> Hi Yu,
> >>
> >> I am just resending your e-mail, adjusting the "To:" list to
> >> include the 3 others that have submitted similar patches.
> >>
> >> ... Doug
> >>
> > Could you please help check if the following combined patch works?
> >
> > Thanks,
> > Chenyu
> >
> >
> > From 00e0622b1b693a5c7dc343aeb3aa51614a9e125e Mon Sep 17 00:00:00 2001
> > From: Bas Nieuwenhuizen
> > Date: Fri, 12 Mar 2021 21:27:40 +0800
> > Subject: [PATCH] tools/power/turbostat: Fix turbostat for AMD Zen CPUs
> >
> > It was reported that on a Zen+ system turbostat started exiting,
> > which was tracked down to the MSR_PKG_ENERGY_STAT read failing because
> > offset_to_idx wasn't returning a non-negative index.
> >
> > This patch combines the modifications from Bingsong Si and
> > Bas Nieuwenhuizen and adds the MSR to the index system as an
> > alternative for MSR_PKG_ENERGY_STATUS.
> >
> > Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL display")
> > Reported-by: youling257
> > Co-developed-by: Bingsong Si
> > Signed-off-by: Chen Yu
> > ---
> >  tools/power/x86/turbostat/turbostat.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > index a7c4f0772e53..a7c965734fdf 100644
> > --- a/tools/power/x86/turbostat/turbostat.c
> > +++ b/tools/power/x86/turbostat/turbostat.c
> > @@ -297,7 +297,10 @@ int idx_to_offset(int idx)
> >
> >  	switch (idx) {
> >  	case IDX_PKG_ENERGY:
> > -		offset = MSR_PKG_ENERGY_STATUS;
> > +		if (do_rapl & RAPL_AMD_F17H)
> > +			offset = MSR_PKG_ENERGY_STAT;
> > +		else
> > +			offset = MSR_PKG_ENERGY_STATUS;
> >  		break;
> >  	case IDX_DRAM_ENERGY:
> >  		offset = MSR_DRAM_ENERGY_STATUS;
> > @@ -326,6 +329,7 @@ int offset_to_idx(int offset)
> >
> >  	switch (offset) {
> >  	case MSR_PKG_ENERGY_STATUS:
> > +	case MSR_PKG_ENERGY_STAT:
> >  		idx = IDX_PKG_ENERGY;
> >  		break;
> >  	case MSR_DRAM_ENERGY_STATUS:
> > @@ -353,7 +357,7 @@ int idx_valid(int idx)
> >  {
> >  	switch (idx) {
> >  	case IDX_PKG_ENERGY:
> > -		return do_rapl & RAPL_PKG;
> > +		return do_rapl & (RAPL_PKG | RAPL_AMD_F17H);
> >  	case IDX_DRAM_ENERGY:
> >  		return do_rapl & RAPL_DRAM;
> >  	case IDX_PP0_ENERGY:
> > --
> > 2.25.1
> >
> >
Re: turbostat: Fix Pkg Power on Zen
On Wed, Mar 24, 2021 at 5:38 AM Salvatore Bonaccorso wrote:
> On Mon, Mar 15, 2021 at 10:54:24PM +0100, Christian Kastner wrote:
> > On 01.02.21 10:01, Kurt Garloff wrote:
> > > Issue persists on Ryzen in 5.11-rc6:
> > > kvmadmin@KurtSrv2018(//):~ [0]$ sudo
> > > /casa/src/linux-stable/tools/power/x86/turbostat/turbostat

... deleted stuff that doesn't display properly ...

> > I was seeing the same issue (no stats, program just exits with 243), and
> > Kurt's simple patch resolved it for me.
>
> Does Kurt's patch seem good to you and can it be applied, or is there
> anything missing?

There were multiple patch submissions. Chen Yu kindly merged them into
one, which was put out for testing a couple of weeks ago.
Try it and report back. I'll forward it in a moment.
Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
On Wed, Mar 17, 2021 at 11:44 PM Pratik Sampat wrote:
>
> Hi Doug,
> Thanks for trying these patches out.
>
> On 18/03/21 2:30 am, Doug Smythies wrote:
> > Hi Pratik,
> >
> > It just so happens that I have been trying Artem's version this last
> > week, so I tried yours.
> >
> > On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat wrote:
> > ...
...
> > Other notes:
> >
> > No idle state for CPU 0 ever gets disabled.
> > I assume this is because CPU 0 can never be offline,
> > so that bit of code (Disable all stop states) doesn't find its state.
> > By the way, processor = Intel i5-9600K
>
> I had tried these patches on an IBM POWER 9 processor and disabling
> CPU0's idle state works there. However, it does make sense for some
> processors to treat CPU 0 differently.
> Maybe I could write in a case: if idle state disabling fails for a CPU,
> then we just skip it?

I didn't try it, I just did a hack so I could continue for this reply.

> > The system is left with all idle states disabled, well not for CPU 0
> > as per the above comment. The suggestion is to restore them,
> > otherwise my processor hogs 42 watts instead of 2.
> >
> > My results are highly variable per test.
>
> Question: Do you notice high variability with the IPI test, the Timer
> test, or both?

The IPI test has less variability than the Timer test.

> I can think of two reasons for high run to run variance:
>
> 1. If you observe variance in timer tests, then I believe there could be
> a mechanism of "C-state pre-wake" on some Intel machines at play here,
> which can pre-wake a CPU from an idle state when timers are armed. I'm
> not sure if the Intel platform that you're running on does that or not.
>
> Artem had described this behavior to me a while ago and I think his wult
> page describes this behavior in more detail:
> https://intel.github.io/wult/#c-state-pre-wake

Yes, I have reviewed all the references. And yes, I think my processors
have the pre-wake stuff. I do not have the proper hardware to do the
Artem pre-wake workaround method, but might buy it in future.

> 2. I have noticed variability in results when there are kernel
> book-keeping or jitter tasks scheduled from time to time on an otherwise
> idle core.
> In the full per-CPU logs at tools/testing/selftests/cpuidle/cpuidle.log
> can you spot any obvious outliers per CPU state?

Yes. I'll just paste in an example cpuidle.log file having used the -v
option below, along with my hack job diff.

doug@s19:~/temp-k-git/linux/tools/testing/selftests/cpuidle$ cat cpuidle.log.v3-1
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          140
0         1          632
0         2          675
0         3          671
0         4          675
0         5          767
0         6          653
0         7          826
0         8          819
0         9          615
0         10         758
0         11         758
Baseline Avg IPI latency(ns): 665
---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          484
0         2          494
0         3          539
0         4          498
0         5          491
0         6          474
0         7          434
0         8          544
0         9          476
0         10         447
0         11         467
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 452
---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          72
0         1          1081
0         2          821
0         3          1486
0         4          1022
0         5          960
0         6          1634
0         7          933
0         8          1032
0         9          1046
0         10         1430
0         11         1338
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1071
---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          264
0         1          30836
0         2          30562
0         3          30748
0         4          35286
0         5          30978
0         6          1952
0         7          36066
0         8          30670
0         9          30605
0         10         30635
0         11         35423
Expected IPI latency(ns): 12
Observed Avg IPI latency(ns) - State 2: 27002
---Enabling state: 3---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          71
0         1          30853
0         2          32095
0         3          32661
0         4          30230
0         5          34348
0         6          2012
0         7          30816
0         8          30908
0         9          31130
0
Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Pratik,

It just so happens that I have been trying Artem's version this last
week, so I tried yours.

On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat wrote:
> ...
> To run this test specifically:
> $ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

While I suppose it should have been obvious, I interpreted the "$"
sign to mean I could run as a regular user, which I can not.

> There are a few optional arguments too that the script can take
> [-h ]
> [-m ]
> [-o ]
> [-v (run on all cpus)]
> Default Output location in: tools/testing/cpuidle/cpuidle.log

Isn't it: tools/testing/selftests/cpuidle/cpuidle.log ?
At least, that is where my file was.

Other notes:

No idle state for CPU 0 ever gets disabled.
I assume this is because CPU 0 can never be offline,
so that bit of code (Disable all stop states) doesn't find its state.
By the way, processor = Intel i5-9600K

The system is left with all idle states disabled, well not for CPU 0
as per the above comment. The suggestion is to restore them,
otherwise my processor hogs 42 watts instead of 2.

My results are highly variable per test.

My system is very idle:
Example (from turbostat at 6 seconds sample rate):

Busy%  Bzy_MHz  IRQ  PkgTmp  PkgWatt  RAMWatt
0.03   4600     153  28      2.03     1.89
0.01   4600     103  29      2.03     1.89
0.05   4600     115  29      2.08     1.89
0.01   4600     95   28      2.09     1.89
0.03   4600     114  28      2.11     1.89
0.01   4600     107  29      2.07     1.89
0.02   4600     102  29      2.11     1.89
...

... Doug
Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
On Fri, Mar 12, 2021 at 2:16 PM Len Brown wrote:
>
> Doug,
> The offset works for control.
>
> However, it is erroneous to use it for reporting of the actual
> temperature, like I did in turbostat.

Agreed. I have been running with a correction for that for a while,
and as discussed on Rui's thread.

But this bit mask correction patch is still needed, isn't it? For this:

cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x1a64100d (90 C) (100 default - 10 offset)

which should be this:

cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x1a64100d (74 C) (100 default - 26 offset)

But yes, I do now see the field size is only 4 bits for some parts.

... Doug

> Thus, I'm going to revert the patch that added its use in turbostat
> for the Temperature column.
>
> thanks,
> -Len
>
> On Fri, Mar 12, 2021 at 1:26 AM Doug Smythies wrote:
> >
> > Hi Len,
> >
> > thank you for your reply.
> >
> > On Thu, Mar 11, 2021 at 3:19 PM Len Brown wrote:
> > >
> > > Thanks for the close read, Doug.
> > >
> > > This field size actually varies from system to system,
> > > but the reality is that the offset is never that big, and so the
> > > smaller mask is sufficient.
> >
> > Disagree.
> >
> > I want to use an offset of 26.
> >
> > > Finally, this may all be moot, because there is discussion that using
> > > the offset this way is simply erroneous.
> >
> > Disagree.
> > It works great.
> > As far as I know/recall I was the only person that responded to Rui's thread
> > "thermal/intel: introduce tcc cooling driver" [1]
> > And, I spent quite a bit of time doing so.
> > However, I agree the response seems different between the two systems
> > under test, Rui's and mine.
> >
> > [1] https://marc.info/?l=linux-pm=161070345329806=2
> >
> > > stay tuned.
> >
> > O.K.
> >
> > ... Doug
> >
> > > -Len
> > >
> > > On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies wrote:
> > > >
> > > > The TCC offset mask is incorrect, resulting in
> > > > incorrect target temperature calculations, if
> > > > the offset is big enough to exceed the mask size.
> > > >
> > > > Signed-off-by: Doug Smythies
> > > > ---
> > > >  tools/power/x86/turbostat/turbostat.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > > > index 389ea5209a83..d7acdd4d16c4 100644
> > > > --- a/tools/power/x86/turbostat/turbostat.c
> > > > +++ b/tools/power/x86/turbostat/turbostat.c
> > > > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()
> > > >
> > > >  	target_c = (msr >> 16) & 0xFF;
> > > >
> > > > -	offset_c = (msr >> 24) & 0xF;
> > > > +	offset_c = (msr >> 24) & 0x3F;
> > > >
> > > >  	tcc = target_c - offset_c;
> > > >
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Len Brown, Intel Open Source Technology Center
>
> --
> Len Brown, Intel Open Source Technology Center
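For reference, the field layout under discussion can be sketched like this (illustrative Python; the 6-bit width is the assumption Doug's patch makes, and as Len notes the actual field width varies between parts). Decoding the MSR value 0x1a64100d quoted above shows why the 4-bit mask misreports a 26 C offset as 10 C:

```python
def tcc_activation_temp(msr, offset_bits=6):
    """Decode MSR_IA32_TEMPERATURE_TARGET: the TCC activation target is
    in bits 23:16, and the TCC offset field starts at bit 24.  Doug's
    patch widens the offset mask from 4 bits (0xF) to 6 bits (0x3F)."""
    target_c = (msr >> 16) & 0xFF
    offset_c = (msr >> 24) & ((1 << offset_bits) - 1)
    return target_c - offset_c
```

With the value from the thread, tcc_activation_temp(0x1A64100D) returns 74 (100 - 26), while the old 4-bit mask (offset_bits=4) truncates the 26 C offset to 10 and returns 90, exactly the wrong output quoted above.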
Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
Hi Len,

Thank you for your reply.

On Thu, Mar 11, 2021 at 3:19 PM Len Brown wrote:
>
> Thanks for the close read, Doug.
>
> This field size actually varies from system to system,
> but the reality is that the offset is never that big, and so the
> smaller mask is sufficient.

Disagree.

I want to use an offset of 26.

> Finally, this may all be moot, because there is discussion that using
> the offset this way is simply erroneous.

Disagree.
It works great.
As far as I know/recall I was the only person that responded to Rui's thread
"thermal/intel: introduce tcc cooling driver" [1]
And, I spent quite a bit of time doing so.
However, I agree the response seems different between the two systems
under test, Rui's and mine.

[1] https://marc.info/?l=linux-pm=161070345329806=2

> stay tuned.

O.K.

... Doug

> -Len
>
> On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies wrote:
> >
> > The TCC offset mask is incorrect, resulting in
> > incorrect target temperature calculations, if
> > the offset is big enough to exceed the mask size.
> >
> > Signed-off-by: Doug Smythies
> > ---
> >  tools/power/x86/turbostat/turbostat.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > index 389ea5209a83..d7acdd4d16c4 100644
> > --- a/tools/power/x86/turbostat/turbostat.c
> > +++ b/tools/power/x86/turbostat/turbostat.c
> > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()
> >
> >  	target_c = (msr >> 16) & 0xFF;
> >
> > -	offset_c = (msr >> 24) & 0xF;
> > +	offset_c = (msr >> 24) & 0x3F;
> >
> >  	tcc = target_c - offset_c;
> >
> > --
> > 2.25.1
>
> --
> Len Brown, Intel Open Source Technology Center
Re: [3/3,v3] tools/power turbostat: Enable accumulate RAPL display
Hi Yu,

I am just resending your e-mail, adjusting the "To:" list to
include the 3 others that have submitted similar patches.

... Doug

On Mon, Mar 8, 2021 at 8:11 AM Chen Yu wrote:
>
> Hi,
> On Mon, Mar 08, 2021 at 07:37:07AM -0800, Doug Smythies wrote:
> > On Mon, Mar 8, 2021 at 5:50 AM youling257 wrote:
> > >
> > > this cause turbostat not work on amd cpu.
> > >
> > > root@localhost:~# /turbostat
> > > turbostat version 20.09.30 - Len Brown
> > > CPUID(0): AuthenticAMD 0xd CPUID levels; 0x801f xlevels; family:model:stepping 0x17:18:1 (23:24:1)
> >
> > There are already two fixes for this in the queue.
> > https://marc.info/?l=linux-pm=161382097503925=2
> > https://marc.info/?l=linux-pm=161141701219263=2
>
> Thanks for reporting and pointing this out. I assume these two patches
> are both fixing the same issue? It looks like these two patches should
> be merged into one:
> 1. Bingsong's patch accesses MSR_PKG_ENERGY_STAT only when RAPL_AMD_F17H
>    is set, which is consistent with the original context.
> 2. Bas Nieuwenhuizen's patch also fixes the issue in idx_valid()
>    in case RAPL_PKG was not set but with RAPL_AMD_F17H set.
>
> thanks,
> Chenyu
Re: [3/3,v3] tools/power turbostat: Enable accumulate RAPL display
On Mon, Mar 8, 2021 at 5:50 AM youling257 wrote:
>
> this cause turbostat not work on amd cpu.
>
> root@localhost:~# /turbostat
> turbostat version 20.09.30 - Len Brown
> CPUID(0): AuthenticAMD 0xd CPUID levels; 0x801f xlevels; family:model:stepping 0x17:18:1 (23:24:1)

There are already two fixes for this in the queue.
https://marc.info/?l=linux-pm=161382097503925=2
https://marc.info/?l=linux-pm=161141701219263=2
[PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
The TCC offset mask is incorrect, resulting in
incorrect target temperature calculations, if
the offset is big enough to exceed the mask size.

Signed-off-by: Doug Smythies
---
 tools/power/x86/turbostat/turbostat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 389ea5209a83..d7acdd4d16c4 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()

 	target_c = (msr >> 16) & 0xFF;

-	offset_c = (msr >> 24) & 0xF;
+	offset_c = (msr >> 24) & 0x3F;

 	tcc = target_c - offset_c;

--
2.25.1
RE: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor
On 2020.12.14 12:02 Rafael J. Wysocki wrote:
> Hi,

Hi Rafael,

V2 test results below are new; other results are partially re-stated.
For readers that do not want to read on: I didn't find anything
different than with the other versions. This was more just due diligence.

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc (or rjw): Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
acpi (or acpi-cpufreq): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + V1 patch set, HWP enabled; intel_cpu-freq; schedutil
v2: Kernel 5.10-rc7 + V2 patch set, HWP enabled; intel_cpu-freq; schedutil

Fixed work packet, fixed period, periodic workflow, load sweep up/down:

load work/sleep frequency: 73 Hertz:
hwp: Average: 12.00822 watts
rjw: Average: 10.18089 watts
no-hwp: Average: 10.21947 watts
acpi-cpufreq: Average: 9.06585 watts
patch: Average: 10.26060 watts
v2: Average: 10.50444

load work/sleep frequency: 113 Hertz:
hwp: Average: 12.01056
rjw: Average: 10.12303
no-hwp: Average: 10.08228
acpi-cpufreq: Average: 9.02215
patch: Average: 10.27055
v2: Average: 10.31097

load work/sleep frequency: 211 Hertz:
hwp: Average: 12.16067
rjw: Average: 10.24413
no-hwp: Average: 10.12463
acpi-cpufreq: Average: 9.19175
patch: Average: 10.33000
v2: Average: 10.39811

load work/sleep frequency: 347 Hertz:
hwp: Average: 12.34169
rjw: Average: 10.79980
no-hwp: Average: 10.57296
acpi-cpufreq: Average: 9.84709
patch: Average: 10.67029
v2: Average: 10.93143

load work/sleep frequency: 401 Hertz:
hwp: Average: 12.42562
rjw: Average: 11.12465
no-hwp: Average: 11.24203
acpi-cpufreq: Average: 10.78670
patch: Average: 10.94514
v2: Average: 11.50324

Serialized single threaded via PIDs per second method,
a.k.a. fixed work packet, variable period:

Results:

Execution times (seconds. Less is better):

no-hwp:
performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.5
schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:
schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000
patch: Samples: 276 ; Average: 14.57029 ; Stand Deviation: 0.89771 ; Maximum: 16.04000 ; Minimum: 11.68000
rfc: Samples: 271 ; Average: 14.86037 ; Stand Deviation: 0.84164 ; Maximum: 16.04000 ; Minimum: 12.21000
v2: Samples: 274 ; Average: 14.67978 ; Stand Deviation: 1.03378 ; Maximum: 16.07000 ; Minimum: 11.43000

Power (watts. More indicates higher CPU frequency and better performance. Sample time = 1 second.):

no-hwp:
performance: Samples: 4000 ; Average: 25.41355 ; Stand Deviation: 0.22156 ; Maximum: 26.01996 ; Minimum: 24.08807
schedutil: Samples: 4000 ; Average: 12.58863 ; Stand Deviation: 5.48600 ; Maximum: 25.50934 ; Minimum: 7.54559
acpi: Samples: 4000 ; Average: 9.57924 ; Stand Deviation: 5.41157 ; Maximum: 25.06366 ; Minimum: 5.51129

hwp:
schedutil: Samples: 4000 ; Average: 25.24245 ; Stand Deviation: 0.19539 ; Maximum: 25.93671 ; Minimum: 24.14746
patch: Samples: 4000 ; Average: 11.07225 ; Stand Deviation: 5.63142 ; Maximum: 24.99493 ; Minimum: 3.67548
rfc: Samples: 4000 ; Average: 10.35842 ; Stand Deviation: 4.77915 ; Maximum: 24.95953 ; Minimum: 7.26202
v2: Samples: 4000 ; Average: 10.98284 ; Stand Deviation: 5.48859 ; Maximum: 25.76331 ; Minimum: 7.53790
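For readers wanting to reproduce the summary lines above from their own samples, this is the reporting shape used throughout these results (a sketch only; Doug's actual tooling is not shown in the thread, and I assume "Stand Deviation" means the sample standard deviation):

```python
from statistics import mean, stdev

def summarize(samples):
    """Format one result line in the style used in the tables above."""
    return ("Samples: %d ; Average: %.5f ; Stand Deviation: %.5f ; "
            "Maximum: %.5f ; Minimum: %.5f"
            % (len(samples), mean(samples), stdev(samples),
               max(samples), min(samples)))
```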
RE: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor
On 2020.12.08 11:15 Doug wrote:
> On 2020.12.08 09:14 Rafael J. Wysocki wrote:
> > On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich wrote:
> >> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote:
> >> I'd like to test this patch, as I have concerns on how it performs against the
> >> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED
> >> untouched).
> >
> > The performance will be worse in some cases, which is to be expected,
>
> Agreed. More further below.
>
> > but the point here is to counter the problem that in some cases the
> > frequency is excessive now.
>
> Agreed.

...

> The "hwp" case is misleading. It is really bouncing around at about 0.2 hertz
> and can't seem to make up its mind.
> If nothing else, all the other versions
> are a little more deterministic in their response.

Hmmm... I think that statement is backwards, for this particular
workflow. Read on.

> >> I'll get results within a week. Do you mind holding back the merge until
> >> then?
>
> On my system that git "make test" is broken, so I can not do the high PIDs
> per second test that way.
> My own PIDs per second test also has issues on this computer.
> I am not sure when I'll get these types of tests working, but will try for
> within a week.

Giovanni: Previously you have reported run to run variability with the
git "make test" serialized single threaded workflow, with a new PID per
task. Previously, I had attributed the variability to disk I/O. For the
schedutil governor, there is also a significant variability in execution
time within and around a frequency of 0.2 Hertz. Generally, it would
tend to get averaged out over the long execution time of the "make test".

I have completely re-done everything for my version (no disk I/O) of the
PIDs per second test.

Preamble:

As mentioned previously, the issue with attempting to acquire useful
debugging information with the schedutil governor is that it is always
so oscillatory. Sample processor package energy at too high a frequency
and it just looks like noise; sample at too low a frequency and any
useful information is filtered out. To make a long story shorter, a
sample frequency of about 1 hertz, or a sample period of about 1 second,
was revealing. Event periods of between 5 and 35 seconds are also
supported by listening to the fan speed modulation on my test server.
A lot of intel_pstate_tracer.py utility trace data was acquired and
reviewed, but I am unable to make any sense out of it.

Data is presented using the teo idle governor and a 1000 hertz kernel,
but be aware that the menu governor and a 250 Hertz kernel were also
tried, with the results not written up. Data is presented for one set of
conditions only, but be aware that all of the PIDs per second space was
searched for any specific anomalies. It is the same everywhere (tested
~40 - ~3500 PIDs per second; obviously, at some low enough PIDs per
second the differences will diminish) and the operating point of focus
for further dwell type testing was not cherry picked.

Operating point for the rest of the work:

Fixed work packet per task (PID):
~1780 PIDs per second, performance governor.
~ 308 PIDs per second, powersave governor.

Test computer: i5-9600K

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
acpi: Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + this patch set, HWP enabled; intel_cpu-freq; schedutil

Results:

Execution times (seconds. Less is better):

no-hwp:
conservative: Samples: 371 ; Average: 10.84296 ; Stand Deviation: 0.07160 ; Maximum: 11.09000 ; Minimum: 10.65000
ondemand: Samples: 342 ; Average: 11.76442 ; Stand Deviation: 0.02640 ; Maximum: 11.85000 ; Minimum: 11.71000
performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.5
powersave: Samples: 68 ; Average: 59.93897 ; Stand Deviation: 0.11697 ; Maximum: 60.19000 ; Minimum: 59.67000
schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:
conservative: Samples: 380 ; Average: 10.58518 ; Stand Deviation: 0.01562 ; Maximum: 10.64000 ; Minimum: 10.54000
ondemand: Samples: 380 ; Average: 10.58395 ; Stand Deviation: 0.01509 ; Maximum: 10.62000 ; Minimum: 10.54000
performance: Samples: 381 ; Average: 10.57871 ; Stand Deviation: 0.01870 ; Maximum: 10.63000 ; Minimum: 10.53000
powersave: Samples: 67 ; Average: 60.37075 ; Stand Deviation: 0.20095 ; Maximum: 60.81000 ; Minimum: 59.92000
schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000

patch: Samples: 276 ; Average: 14.57029 ; Stand
RE: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor
On 2020.12.08 09:14 Rafael J. Wysocki wrote: > On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich > wrote: >> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote: >>> Hi, >>> >>> This is based on the RFC posted a few days ago: >>> >>> https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/ >>> >>> The majority of the original cover letter still applies, so let me quote it >>> here: >>> [...] >> >> Hello Rafael, >> >> I'd like to test this patch, as I have concerns on how it performs against >> the >> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED >> untouched). > >The performance will be worse in some cases, which is to be expected, Agreed. More further below. >but the point here is to counter the problem that in some cases the >frequency is excessive now. Agreed. I retested with this new version with my load sweep test, and my results were the pretty similar to previously reported, [1]. If anything, there might be a perceptible difference between the RFC version and this version as a function of work/sleep frequency. 73 Hertz: RFC was 0.8% less power 113 Hertz: RFC was 1.5% less power 211 Hertz: RFC was 0.8% less power 347 Hertz: RFC was 1.2% more power 401 Hertz: RFC was 1.6% more power Of course, let us not lose site of the original gains that were in the 13 to 17% range. Now, for the "worse in some cases, which is to be expected" part: At least on my system, it is most evident for some of the pipe type tests, where the schedutil governor has never really known what to do. 
This patch set seems to add enough of a downward bias that this version of the schedutil governor now behaves much like the other versions. Here is an example (no forced CPU affinity, 2 pipes):
Legend:
hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq; schedutil (always)
rjw: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpufreq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpufreq; schedutil
acpi-cpufreq (or just acpi): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + the non-RFC 4-patch version of this patch set, HWP enabled; intel_cpufreq; schedutil
hwp: 3.295 uSec/loop (steady); average power: 62.78 Watts (steady); freq: 4.60GHz (steady).
rjw: 3.6 uSec/loop (noisy) (9.3% worse); average power: 44.13 Watts (noisy); freq: ?.??GHz (noisy).
no-hwp: 3.35 uSec/loop (noisy); average power: 59.17 Watts (noisy); freq: ?.??GHz (much less noisy).
acpi-cpufreq: 3.4 uSec/loop (noisy); average power: 56.93 Watts (noisy); freq: ?.??GHz (less noisy).
patch: 3.6 uSec/loop (noisy) (9.3% worse); average power: 43.36 Watts (noisy); freq: ?.??GHz (noisy).
One could argue that this patch set might be driving the frequency down a little too hard in this case, but this is one of those difficult rotating-through-the-CPUs cases anyhow. As a side note, all other governors (well, not powersave) pin the CPU frequency at max (4.6 GHz). For my version of the "Alex" pipe test (a token is passed around and around via 6 named pipes, with a bit of work to do per token stop) I get (more power = better performance):
hwp: average power: 17.16 Watts (very noisy)
rjw: average power: 3.18 (noisy)
no-hwp: average power: 2.45 (less noisy)
acpi-cpufreq: average power: 2.46 (less noisy)
patch: average power: 2.47 (less noisy)
The "hwp" case is misleading. It is really bouncing around at about 0.2 hertz and can't seem to make up its mind. If nothing else, all the other versions are a little more deterministic in their response.
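A minimal stand-in for this kind of pipe ping-pong test, bouncing a one-byte token between a parent and a forked child over two anonymous pipes and reporting uSec/loop, might look like the sketch below. The real tests use named pipes, multiple processes, and a work packet per token stop; this only illustrates the round-trip timing idea:

```python
import os
import time

def pipe_pingpong(loops=1000):
    """Bounce a one-byte token parent -> child -> parent over two pipes
    and return the average round-trip time in microseconds per loop."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo the token back until EOF
        os.close(p2c_w)
        os.close(c2p_r)
        while os.read(p2c_r, 1):
            os.write(c2p_w, b"x")
        os._exit(0)
    os.close(p2c_r)
    os.close(c2p_w)
    t0 = time.perf_counter()
    for _ in range(loops):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - t0
    os.close(p2c_w)            # EOF stops the child
    os.waitpid(pid, 0)
    return elapsed / loops * 1e6   # uSec per round trip

print(f"{pipe_pingpong():.3f} uSec/loop")
```

This is exactly the kind of workload where the task keeps migrating between CPUs and is never busy long enough on one of them for utilization-driven governors to settle, which is why the schedutil numbers above are so noisy.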
> >> I'll get results within a week. Do you mind holding back the merge until >> then? On my system that git "make test" is broken, so I can not do the high PIDs per second test that way. My own PIDs per second test also has issues on this computer. I am not sure when I'll get these types of tests working, but will try for within a week. ... Doug References: [1] https://marc.info/?l=linux-kernel&m=160692480907696&w=2 My test results graphs (updated): Note: I have to code the web site, or I get hammered by bots. Note: it is .com only because it was less expensive than .org
73 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su73/
113 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su113/
211 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su211/
347 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su347/
401 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su401/
Note: The below graphs are mislabelled, because I made hackish use of a tool dedicated to graphing, and HTML wrapping, the load sweep test. The test conditions are steady state operation, no variable changes. pipe-test (5 seconds per
RE: [RFC][PATCH 1/2] cpufreq: Add special-purpose fast-switching callback for drivers
Hi Rafael, On 2020.11.30 10:37 Rafael J. Wysocki wrote: > First off, some cpufreq drivers (eg. intel_pstate) can pass hints > beyond the current target frequency to the hardware and there are no > provisions for doing that in the cpufreq framework. In particular, > today the driver has to assume that it should allow the frequency to Forgot the important "not": today the driver has to assume that it should not allow the frequency to > fall below the one requested by the governor (or the required capacity > may not be provided) which may not be the case and which may lead to > excessive energy usage in some scenarios. > > Second, the hints passed by these drivers to the hardware neeed not s/neeed/need ... O.K. this is good. The problem with my basic CPU frequency versus load test with the schedutil governor is that it is always so oscillatory it is pretty much not possible to conclude anything. So I re-worked the test to look at Processor Package Power load instead. In a previous e-mail [1] I had reported the power differences for one periodic load at one frequency, as an (apparently cherry-picked) example. Quoted: > schedutil governor: > acpi-cpufreq: good > intel_cpufreq hwp: bad< Now good, with this patch set. > intel_cpufreq no hwp: good > ... > periodic workflow at 347 hertz. > ~36% load at 4.60 GHz (where hwp operates) > ~55% load at 3.2 GHz (where no hwp operates) > > intel_cpufreq hwp: 9.6 processor package watts. 45.8 watts on the mains to > the computer. > intel_cpufreq no hwp: ~6 processor package watts. ~41 watts on the mains to > the computer.
(noisy) So this time, I only have power/energy data, and a relatively easy way to compress all 12,000 samples into some concise summary is to simply find the average power for the entire experiment:
Legend:
hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq; schedutil (always)
rjw: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpufreq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpufreq; schedutil
acpi-cpufreq: Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
load work/sleep frequency: 73 Hertz:
hwp: Average: 12.00822 watts
rjw: Average: 10.18089 watts
no-hwp: Average: 10.21947 watts
acpi-cpufreq: Average: 9.06585 watts
load work/sleep frequency: 113 Hertz:
hwp: Average: 12.01056
rjw: Average: 10.12303
no-hwp: Average: 10.08228
acpi-cpufreq: Average: 9.02215
load work/sleep frequency: 211 Hertz:
hwp: Average: 12.16067
rjw: Average: 10.24413
no-hwp: Average: 10.12463
acpi-cpufreq: Average: 9.19175
load work/sleep frequency: 347 Hertz:
hwp: Average: 12.34169
rjw: Average: 10.79980
no-hwp: Average: 10.57296
acpi-cpufreq: Average: 9.84709
load work/sleep frequency: 401 Hertz:
hwp: Average: 12.42562
rjw: Average: 11.12465
no-hwp: Average: 11.24203
acpi-cpufreq: Average: 10.78670
[1] https://marc.info/?l=linux-pm&m=159769839401767&w=2
My test results graphs: Note: I have to code the web site, or I get hammered by bots. Note: it is .com only because it was less expensive than .org
73 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su73/
113 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su113/
211 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su211/
347 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su347/
401 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su401/
... Doug
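The average-power figures in these reports come from sampling processor package energy over the run. On Intel hardware one standard way to do that from userspace is the kernel's RAPL powercap sysfs interface. The sketch below is hedged: the powercap path shown is the standard kernel location for the package-0 domain, but whether these particular measurements used RAPL or external metering is not stated, and counter wraparound handling is omitted for brevity:

```python
import time

RAPL_PKG = "/sys/class/powercap/intel-rapl:0"  # package-0 RAPL domain

def watts_from(delta_uj, delta_s):
    """Convert an energy delta in microjoules over delta_s seconds to watts."""
    return delta_uj / 1e6 / delta_s

def average_package_watts(seconds=10.0):
    """Average processor package power over an interval, read from the
    RAPL energy counter.  A long-running version must also handle the
    counter wrapping at max_energy_range_uj, which is skipped here."""
    def energy_uj():
        with open(f"{RAPL_PKG}/energy_uj") as f:
            return int(f.read())
    e0, t0 = energy_uj(), time.perf_counter()
    time.sleep(seconds)
    e1, t1 = energy_uj(), time.perf_counter()
    return watts_from(e1 - e0, t1 - t0)

# Usage (needs root or readable powercap files, Intel CPU):
#   average_package_watts(10.0)
```

Sampling this periodically over the whole experiment and averaging the samples gives summaries like the per-frequency "Average: ... watts" lines above.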
RE: [PATCH v3 0/4] cpufreq: intel_pstate: Handle powersave governor correctly in the passive mode with HWP
On 2020.11.10 09:22 Rafael J. Wysocki wrote: > On Monday, November 9, 2020 5:49:49 PM CET Rafael J. Wysocki wrote: >> >> Even after the changes made very recently, the handling of the powersave >> governor is not exactly as expected when intel_pstate operates in the >> "passive" mode with HWP enabled. >> >> Namely, in that case HWP is not limited to the policy min frequency, but it >> can scale the frequency up to the policy max limit and it cannot be >> constrained >> currently, because there are no provisions for that in the framework. >> >> To address that, patches [1-3/4] add a new governor flag to indicate that >> this >> governor wants the target frequency to be set to the exact value passed to >> the >> driver, if possible, and change the powersave and performance governors to >> have >> that flag set. >> >> The last patch makes intel_pstate take that flag into account when >> programming >> the HWP Request MSR. > > The v3 simply uses different names for the new governor flags. Thank you. I tested v2, with positive results, as reported for v1. I do not have time to re-test v3. My input is to also default this flag to be set for the userspace and ondemand governors. userspace: I tested with and without this flag set, and the flag is needed if the user expects the scaling_setspeed to be enforced. Disclaimer: I don't normally actually use the userspace governor. ondemand: from my tests, the ondemand response more closely mimics acpi-ondemand with the flag set. Power consumption has been better for the limited testing done. However, it is also a function of work/sleep frequency for periodic workflows and a function of INTEL_CPUFREQ_TRANSITION_DELAY_HWP. I am saying that my ability to support the suggestion to default to setting the flag is a little weak. ... Doug
RE: [PATCH 1/2] cpufreq: Introduce target min and max frequency hints
Hi Rafael: Thank you for this patch set. I can not get the patch to apply. I was trying on top of 5.10-rc2, and have been unable to determine what other patches might need to be applied first. On 2020.11.05 10:24 Rafael J. Wysocki wrote: ... > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpufreq/cpufreq.c | 3 +++ > drivers/cpufreq/cpufreq_performance.c | 4 > drivers/cpufreq/cpufreq_powersave.c | 4 > include/linux/cpufreq.h | 16 I do not understand why this part says to look for 16 differences, but I can only find 2. > 4 files changed, 27 insertions(+) > > Index: linux-pm/include/linux/cpufreq.h > === > --- linux-pm.orig/include/linux/cpufreq.h > +++ linux-pm/include/linux/cpufreq.h > @@ -63,6 +63,8 @@ struct cpufreq_policy { > > unsigned int min; /* in kHz */ > unsigned int max; /* in kHz */ > + unsigned int target_min; /* in kHz */ > + unsigned int target_max; /* in kHz */ > unsigned int cur; /* in kHz, only needed if cpufreq > * governors are used */ > unsigned int suspend_freq; /* freq to set during suspend */ > Index: linux-pm/drivers/cpufreq/cpufreq.c ... Anyway, I edited the patch, deleting the include/linux/cpufreq.h part, then it applied, as did patch 2 of 2. I edited include/linux/cpufreq.h manually. Issues with the powersave governor reported in [1] and [2] are fixed. Relevant part quoted and updated below: > In early September Doug wrote: >> powersave governor: >> acpi-cpufreq: good >> intel_cpufreq hwp: bad Now good, with this patch set. >> intel_cpufreq no hwp: good ... > For the powersave governor, this is what we have now: > > intel_cpufreq hwp == intel_pstate hwp > intel_cpufreq no hwp == acpi-cpufreq == always minimum freq > intel_pstate no hwp ~= acpi-cpufreq/ondemand ... > My expectation was/is: > > intel_cpufreq hwp == intel_cpufreq no hwp == acpi-cpufreq == always minimum > freq And this is what we now have, with this patch set. > intel_pstate no hwp ~= acpi-cpufreq/ondemand > intel_pstate hwp == Unique. Say, extremely coarse version of ondemand.
[1] https://marc.info/?l=linux-pm&m=159769839401767&w=2 [2] https://marc.info/?l=linux-pm&m=159943780220923&w=2 ... Doug
RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, On 2020.08.17 14:06 Doug Smythies wrote: > On 2020.08.06 05:04 Rafael J. Wysocki wrote: > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit (HWP floor) to the > > P-state value given by the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other, at least when the schedutil governor > > is in use, and update the intel_pstate documentation accordingly. > ... > > powersave governor: > acpi-cpufreq: good > intel_cpufreq hwp: bad > intel_cpufreq no hwp: good It occurs to me that my expectations as to what is meant by "powersave" might not agree with yours. For the powersave governor, this is what we have now:
intel_cpufreq hwp == intel_pstate hwp
intel_cpufreq no hwp == acpi-cpufreq == always minimum freq
intel_pstate no hwp ~= acpi-cpufreq/ondemand
Is that your understanding/intention? My expectation was/is:
intel_cpufreq hwp == intel_cpufreq no hwp == acpi-cpufreq == always minimum freq
intel_pstate no hwp ~= acpi-cpufreq/ondemand
intel_pstate hwp == Unique. Say, an extremely coarse version of ondemand.
... Doug
RE: [PATCH v2 0/5] cpufreq: intel_pstate: Address some HWP-related oddities
Hi Srinivas, Thanks for your reply. On 2020.08.25 08:12 Srinivas Pandruvada wrote: > On Mon, 2020-08-24 at 18:00 -0700, Doug Smythies wrote: > > I think there is a disconnect between your written > > description of what is going on and your supporting MSR reads. > > > I reproduced again. > I see the copy paste individual at the first place swapped. Yes, and that had me confused, initially. > I pasted the full output by direct copy - paste from the screen. > > But the issues are still there. Agreed. I didn't try your offline/online of CPU 1 part previously, but did now, and get the same results as you. I did not know that "rdmsr -a 0x774" lists stuff in the order that CPUs were last brought on-line. I had assumed the list was in CPU order. Weird. My example (nothing new here, just me catching up. The offline/online order was cpu1, then cpu3, then cpu2):
root@s18:/sys/devices/system/cpu# grep . cpu*/cpufreq/energy_performance_preference
cpu0/cpufreq/energy_performance_preference:balance_performance
cpu1/cpufreq/energy_performance_preference:127
cpu2/cpufreq/energy_performance_preference:125
cpu3/cpufreq/energy_performance_preference:126
cpu4/cpufreq/energy_performance_preference:balance_performance
cpu5/cpufreq/energy_performance_preference:balance_performance
root@s18:/sys/devices/system/cpu# rdmsr -p 0 0x774
80002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 1 0x774
7f002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 2 0x774
7d002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 3 0x774
7e002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 4 0x774
80002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 5 0x774
80002e2e
root@s18:/sys/devices/system/cpu# rdmsr -a 0x774
80002e2e
80002e2e
80002e2e
7f002e2e
7e002e2e
7d002e2e
... Doug
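The raw rdmsr values above follow the IA32_HWP_REQUEST field layout from the Intel SDM, so they are easy to decode by hand or in a few lines of Python: cpu1's 7f002e2e is min 46, max 46, desired 0, EPP 127, matching the sysfs energy_performance_preference readback of 127. A minimal decoder (the upper fields such as activity window and package control are omitted here):

```python
def decode_hwp_request(raw):
    """Split an IA32_HWP_REQUEST (MSR 0x774) value into its low fields,
    per the Intel SDM bit layout."""
    return {
        "min":      raw        & 0xFF,  # bits  7:0  minimum performance
        "max":     (raw >> 8)  & 0xFF,  # bits 15:8  maximum performance
        "desired": (raw >> 16) & 0xFF,  # bits 23:16 desired performance
        "epp":     (raw >> 24) & 0xFF,  # bits 31:24 energy/perf preference
    }

# cpu1's value above, after "echo 127 > energy_performance_preference":
print(decode_hwp_request(0x7F002E2E))
```

The same decode explains the post-resume 8000ff01 values seen later in this thread: min 1, max 255, EPP 128, i.e. the limits reset to hardware defaults.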
RE: [PATCH v2 0/5] cpufreq: intel_pstate: Address some HWP-related oddities
Hi Srinivas, I think there is a disconnect between your written description of what is going on and your supporting MSR reads. On 2020.08.24 16:56 Srinivas Pandruvada wrote: > On Mon, 2020-08-24 at 19:39 +0200, Rafael J. Wysocki wrote: > > Hi All, > > > > The v2 is here to address feedback from Doug and one issue found by > > me. > > > > The purpose of this series is to address some peculiarities related > > to > > taking CPUs offline/online and switching between different operation > > modes with HWP enabled that have become visible after allowing the > > driver to work in the passive mode with HWP enabled in 5.9-rc1 (and > > one that was there earlier, but can be addressed easily after the > > changes madein 5.9-rc1). > > > > Please refer to the patch changelogs for details. > > > > For easier testing/review, the series is available from the git > > branch at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \ > > intel_pstate-testing > > > > Applied these patches to 5.9-rc2 So did I, and the issues I reported the other day are fine now. I did try a few of the things you were doing. > > - After s3 limits got messed up. > # cat /sys/power/mem_sleep > s2idle [deep] > > - In the dmesg unchecked MSR for HWP register > > 1. > Before test > > sudo rdmsr -a 0x774 > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0d > 7f002b0c ?? This looks like the MSR read for further below, and for CPU 19 instead of 1. > cd /sys/devices/system/cpu/intel_pstate/ > [root@otcpl-perf-test-skx-i9 intel_pstate]# grep . * > hwp_dynamic_boost:0 > max_perf_pct:100 > min_perf_pct:27 > no_turbo:0 > num_pstates:32 > status:active > turbo_pct:32 > > cd ../cpu1/cpufreq/ > [root@otcpl-perf-test-skx-i9 cpufreq]# grep . 
* > affected_cpus:1 > base_frequency:330 > cpuinfo_max_freq:430 > cpuinfo_min_freq:120 > cpuinfo_transition_latency:0 > energy_performance_available_preferences:default performance > balance_performance balance_power power > energy_performance_preference:balance_performance > related_cpus:1 > scaling_available_governors:performance powersave > scaling_cur_freq:120 > scaling_driver:intel_pstate > scaling_governor:powersave > scaling_max_freq:430 > scaling_min_freq:120 > scaling_setspeed: > > > 2. Now change the EPP > > # echo 127 > energy_performance_preference > sudo rdmsr -a 0x774 > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0d This looks like the original MSR read. > > Good here > > 3. Offline/online good > > [root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > > /sys/devices/system/cpu/cpu1/online > [root@otcpl-perf-test-skx-i9 cpufreq]# echo ` > > /sys/devices/system/cpu/cpu1/online > > echo ` > /sys/devices/system/cpu/cpu1/online ^C > [root@otcpl-perf-test-skx-i9 cpufreq]# echo 1 > > /sys/devices/system/cpu/cpu1/online > > sudo rdmsr -a 0x774 > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0d > 7f002b0c O.K. > > Good. Online restored the setting > > 4. Now S3 > > rtcwake -m mem -s 10 Cool command. I did not know about it. I tried it. > > All limits are now messed up > > sudo rdmsr -a 0x774 > 80002b0c > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > Yes, I got the same: # /home/doug/c/msr-decoder (edited) 6.) 
0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E08 : 8000FF01 : 8000FF01 : 8000FF01 : 8000FF01 : 8000FF01 : min:8 :1 :1 :1 :1 :1 : max: 46 : 255 : 255 : 255 : 255 : 255 : des:0 :0 :0 :0 :0 :0 : epp: 128 : 128 : 128 : 128 : 128 : 128 : act:0 :0 :0 :0 :0 :0 : > 5. Now switch to passive > Again bad, some CPU max/min is 0 > > sudo rdmsr -a 0x774 > 80002b0d > 7f002b0f Hmmm... Now seems to be CPU 1 > 80002b0c > 80002d0e > 80002b0c > 80002b0d > 80002b0f > 80002b2b > 80002b0c > 80002d1d > 8000 > 80002b0c > 80002b0c > 8000 > 8000 > 8000 > 8000 > 8000 > 8000 > 8000 MSR 774 was good for me, but in general my decoder was having troubles. 0x774: IA32_HWP_REQUEST:CPU 0-5 : sh: 0: getcwd() failed: No such file or directory raw: 80002E2E : 7F002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : min: 46 : 46 : 46 : 46 : 46 :
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.21 10:16 Rafael J. Wysocki wrote: ... > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > may turn out to be either too high or too low for the general use, which is > one > reason why getting as much testing coverage as possible is key here. > > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > please do so and let me know the conclusions. ... Hi Rafael, Since my v7 reply the other day about results for 347 Hz work/sleep frequency periodic workflows [1], I have been testing at a 73 Hertz work/sleep frequency. And am now testing this:
-#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 1
This shows promise for the schedutil governor, which no longer drives prematurely into the turbo range, consuming excessive amounts of energy as compared to passive without HWP. The ondemand governor is still very bad. The only way I have found to get it to behave is to force max=min=target_pstate, in addition to the above. It will be quite some time before I have well-organized feedback. In the meantime, if anyone has suggestions for an optimal INTEL_CPUFREQ_TRANSITION_DELAY_HWP value, I would be grateful. Otherwise I'll attempt to find it via tests. ... Doug [1] https://marc.info/?l=linux-pm&m=159769839401767&w=2
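The periodic work/sleep loads used throughout these tests (73 to 401 hertz) can be approximated with a simple spin-then-sleep loop. This is only a rough stand-in: the actual tests fix the work packet per period, so the duty cycle (and hence the reported percent load) varies with the CPU frequency the governor picks, whereas this sketch fixes the duty cycle directly:

```python
import time

def periodic_load(hz, duty, seconds):
    """Spin for duty * period, then sleep the remainder, repeated at hz.
    Returns the number of completed work/sleep cycles."""
    period = 1.0 / hz
    busy = period * duty
    end = time.perf_counter() + seconds
    cycles = 0
    while time.perf_counter() < end:
        t = time.perf_counter()
        while time.perf_counter() - t < busy:
            pass                                  # burn CPU for the work phase
        time.sleep(max(0.0, period - (time.perf_counter() - t)))
        cycles += 1
    return cycles

# e.g. roughly a 36% load at 347 work/sleep cycles per second for 2 seconds:
# periodic_load(347, 0.36, 2)
```

Loads like this are what expose the governor/HWP interaction: at higher work/sleep frequencies the governor has less time per period to converge on a sensible P-state before the next sleep.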
RE: [PATCH 3/4] cpufreq: intel_pstate: Add ->offline and ->online callbacks
Hi Rafael, Just annoying typo type feedback. On 2020.08.20 09:38 Rafael J. Wysocki wrote: > > From: "Rafael J. Wysocki" > > Add ->offline and ->online driver callbacksto to do the cleanup "to to" and suggest this: Add ->offline and ->online driver callbacks to cleanup > before taking a CPU offline and to restore its working configuration > when it goes back online, respectively, to avoid invoking the ->init > callback on every CPU online which is quite a bit of unnecessary > overhead. > > Define ->offline and ->online so that they can be used in the > passive as well as in the active mode and because ->offline will passive mode > do the majority of ->stop_cpu work, the passive mode does not > need that callback any more, so drop it. > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpufreq/intel_pstate.c | 38 -- > 1 file changed, 32 insertions(+), 6 deletions(-) > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 3d18934fa975..aca0587b176f 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -2297,28 +2297,51 @@ static int intel_pstate_verify_policy(struct > cpufreq_policy_data *policy) > return 0; > } > > -static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy) > +static int intel_pstate_cpu_offline(struct cpufreq_policy *policy) > { > + pr_debug("CPU %d going offline\n", policy->cpu); > + > + intel_pstate_exit_perf_limits(policy); > + > + /* > + * If the CPU is an SMT thread and it goes offline with the performance > + * settings different from the minimum, it will prevent its sibling > + * from getting to lower performance levels, so force the minimum > + * performance on CPU offline to prevent that form happening. 
form/from > + */ > if (hwp_active) > intel_pstate_hwp_force_min_perf(policy->cpu); > else > intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]); > + > + return 0; > +} > + > +static int intel_pstate_cpu_online(struct cpufreq_policy *policy) > +{ > + pr_debug("CPU %d going online\n", policy->cpu); > + > + intel_pstate_init_acpi_perf_limits(policy); > + > + if (hwp_active) > + wrmsrl_on_cpu(policy->cpu, MSR_HWP_REQUEST, > + all_cpu_data[policy->cpu]->hwp_req_cached); > + > + return 0; > } > > static void intel_pstate_stop_cpu(struct cpufreq_policy *policy) > { > - pr_debug("CPU %d exiting\n", policy->cpu); > + pr_debug("CPU %d stopping\n", policy->cpu); > > intel_pstate_clear_update_util_hook(policy->cpu); > if (hwp_active) > intel_pstate_hwp_save_state(policy); > - > - intel_cpufreq_stop_cpu(policy); > } > > static int intel_pstate_cpu_exit(struct cpufreq_policy *policy) > { > - intel_pstate_exit_perf_limits(policy); > + pr_debug("CPU %d exiting\n", policy->cpu); > > policy->fast_switch_possible = false; > > @@ -2398,6 +2421,8 @@ static struct cpufreq_driver intel_pstate = { > .init = intel_pstate_cpu_init, > .exit = intel_pstate_cpu_exit, > .stop_cpu = intel_pstate_stop_cpu, > + .offline= intel_pstate_cpu_offline, > + .online = intel_pstate_cpu_online, > .update_limits = intel_pstate_update_limits, > .name = "intel_pstate", > }; > @@ -2652,7 +2677,8 @@ static struct cpufreq_driver intel_cpufreq = { > .fast_switch= intel_cpufreq_fast_switch, > .init = intel_cpufreq_cpu_init, > .exit = intel_cpufreq_cpu_exit, > - .stop_cpu = intel_cpufreq_stop_cpu, > + .offline= intel_pstate_cpu_offline, > + .online = intel_pstate_cpu_online, > .update_limits = intel_pstate_update_limits, > .name = "intel_cpufreq", > }; > -- > 2.26.2 > >
RE: [PATCH 0/4] cpufreq: intel_pstate: Address some HWP-related oddities
Hi Rafael, On 2020.08.20 09:35 Rafael J. Wysocki wrote: > > The purpose of this series is to address some peculiarities related to > taking CPUs offline/online and switching between different operation > modes with HWP enabled that have become visible after allowing the > driver to work in the passive mode with HWP enabled in 5.9-rc1 (and > one that was there earlier, but can be addressed easily after the > changes made in 5.9-rc1). > > Please refer to the patch changelogs for details. > > For easier testing/review, the series is available from the git branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \ > intel_pstate-testing Thanks for that. There still seems to be a problem with EPP getting messed up. I have not been able to find the exact spot in the code. One problem is that EPP can end up as 0, and thereafter stays at 0. In sysfs terms, it ends up as "performance" and thereafter stays as "performance". Meanwhile I never modified it, and it started as "balance_performance". It happens when changing from active to passive if the governor is performance. If the governor is not performance things work as expected. Another problem is that EPP will end up as 128 when changing from passive to active. This erroneous condition is cleared by changing the governor to powersave and back to performance. It also doesn't occur the first time after boot, when booting to intel_cpufreq/performance/HWP. (confused yet?) The sysfs value is O.K. during this. Supporting data: Example 1: Grub: GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 msr.allow_writes=on cpuidle.governor=teo" So I boot to intel_pstate/performance/HWP: # /home/doug/c/msr-decoder (always edited for only relevant parts) 6.)
0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : epp:0 :0 :0 :0 :0 :0 : # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_pstate # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference balance_performance # echo passive > /sys/devices/system/cpu/intel_pstate/status Note: the following results are incorrect: # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ondemand # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E08 : 2E08 : 2E08 : 2E0B : 2E13 : 2E08 : epp:0 :0 :0 :0 :0 :0 : # echo active > /sys/devices/system/cpu/intel_pstate/status # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor powersave # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor powersave # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_pstate Example 2: Grub: GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 intel_pstate=passive msr.allow_writes=on cpuidle.governor=teo" So I boot to intel_cpufreq/performance/HWP: # /home/doug/c/msr-decoder 6.) 
0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : epp: 128 : 128 : 128 : 128 : 128 : 128 : # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_cpufreq # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference balance_performance # echo active > /sys/devices/system/cpu/intel_pstate/status # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference balance_performance # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_pstate # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : epp:0 :0 :0 :0 :0 :0 : # echo passive > /sys/devices/system/cpu/intel_pstate/status # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : epp: 128 : 128 : 128 : 128 : 128 : 128 : # echo active > /sys/devices/system/cpu/intel_pstate/status Note: the following results are incorrect: # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E2E : 80002E2E : 80002E2E : 80002E2E :
RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.06 05:04 Rafael J. Wysocki wrote: > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit (HWP floor) to the > P-state value given by the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other, at least when the schedutil governor > is in use, and update the intel_pstate documentation accordingly. ... Hi Rafael, You may or may not recall, I mentioned my further feedback would be delayed, as I wanted to work on reducing the labour content of my most basic CPU frequency scaler test. I have tested kernel 5.9-rc1 for pretty much every intel_pstate variant and governor, and also the acpi-cpufreq driver. Other than changing governors, changes were only made via grub command line options and re-boot. EPP or EPB were never modified, they were always whatever default. performance governor: (left mostly blank, on purpose.) acpi-cpufreq: intel_cpufreq hwp: good intel_cpufreq no hwp: intel_pstate hwp: intel_pstate no hwp: ondemand governor: acpi-cpufreq: good intel_cpufreq hwp: bad intel_cpufreq no hwp: good conservative governor: acpi-cpufreq: good intel_cpufreq hwp: good intel_cpufreq no hwp: good schedutil governor: acpi-cpufreq: good intel_cpufreq hwp: bad intel_cpufreq no hwp: good powersave governor: acpi-cpufreq: good intel_cpufreq hwp: bad intel_cpufreq no hwp: good active-powersave governor: intel_pstate hwp: ? not smooth, suffers from the broken HWP issue. intel_pstate no hwp: good. Intel_pstate hwp, idle state 2 disabled: Better but still worse for power. Now, we don't actually care about CPU frequency, we care about power: ondemand governor: periodic workflow at 347 hertz. ~58% load at 4.60 GHz (where hwp operates) ~76% load at 3.5 GHz (where no hwp operates) intel_cpufreq hwp: 14.3 processor package watts. 51.5 watts on the mains to the computer. intel_cpufreq no hwp: 9.1 processor package watts. 
45.5 watts on the mains to the computer. schedutil governor: periodic workflow at 347 hertz. ~36% load at 4.60 GHz (where hwp operates) ~55% load at 3.2 GHz (where no hwp operates) intel_cpufreq hwp: 9.6 processor package watts. 45.8 watts on the mains to the computer. intel_cpufreq no hwp: ~6 processor package watts. ~41 watts on the mains to the computer. (noisy) powersave governor: periodic workflow at 347 hertz. ~39.8% load at 2.00 GHz (where hwp operates) ~92.5% load at 0.8 GHz (where no hwp operates) intel_cpufreq hwp: 2.6 processor package watts. 38 watts on the mains to the computer. intel_cpufreq no hwp: 1.9 processor package watts. 36 watts on the mains to the computer. active-powersave governor: periodic workflow at 347 hertz. ~58% load at 4.60 GHz (where hwp operates) ~72% load at 3.88 GHz (where no hwp operates) intel_pstate hwp: 14.2 processor package watts. 52 watts on the mains to the computer. intel_pstate no hwp: 10.1 processor package watts. 48 watts on the mains to the computer. Link to web page with much of this same content which, in turn, links to various graphs. Coded, to avoid the barrage of bots: double u double u double u dot smythies dot com /~doug/linux/s18/hwp/v7/ ... Doug
RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.05 09:56 Rafael J. Wysocki wrote: > v6 -> v7: >* Cosmetic changes in store_energy_performance_prefernce() to reduce the > LoC number and make it a bit easier to read. No intentional functional > impact. ?? V7 is identical to V6. Diff: $ diff hwppassive-v6-2-2.patch hwppassive-v7-2-2.patch 2c2 < Sent: August 4, 2020 8:11 AM --- > Sent: August 5, 2020 9:56 AM 5c5 < Subject: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled --- > Subject: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP > enabled 76a77,81 > > v6 -> v7: >* Cosmetic changes in store_energy_performance_prefernce() to reduce the > LoC number and make it a bit easier to read. No intentional functional > impact. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.03 10:09 Rafael J. Wysocki wrote: > On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote: > > On 2020.07.19 04:43 Rafael J. Wysocki wrote: > > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote: > > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote: > > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies > > > > > wrote: > > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > > > >> > > > > > >> > From: Rafael J. Wysocki > > > > >> ... > > > > >> > Since the passive mode hasn't worked with HWP at all, and it is > > > > >> > not going to > > > > >> > the default for HWP systems anyway, I don't see any drawbacks > > > > >> > related to making > > > > >> > this change, so I would consider this as 5.9 material unless there > > > > >> > are any > > > > >> > serious objections. > > > > >> > > > > >> Good point. > > > > > > > > Actually, for those users that default to passive mode upon boot, > > > > this would mean they would find themselves using this. > > > > Also, it isn't obvious, from the typical "what driver and what governor" > > > > inquiry. > > > > > > So the change in behavior is that after this patch > > > intel_pstate=passive doesn't imply no_hwp any more. > > > > > > That's a very minor difference though and I'm not aware of any adverse > > > effects it can cause on HWP systems anyway. > > > > My point was, that it will now default to something where > > testing has not been completed. > > > > > The "what governor" is straightforward in the passive mode: that's > > > whatever cpufreq governor has been selected. > > > > I think you might have missed my point. > > From the normal methods of inquiry one does not know > > if HWP is being used or not. Why? Because with > > or without HWP one gets the same answers under: > > > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > > Yes, but this is also the case in the active mode, isn't it? Yes, fair enough. 
But we aren't changing what it means by default between kernel 5.8 and 5.9-rc1. ... Doug
RE: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, I was just writing you about V5 when this V6 came. On 2020.08.04 08:11 Rafael J. Wysocki wrote: ... > This is on top of the material already in the mainline. Oh, should have read that part better, but did get there in the end. ... > v5 -> v6: >* Fix the problem with the EPP setting via sysfs not working with the > performance and powersave governors by stopping and restarting the > governor around the sysfs-based EPP updates in the passive mode. >* Because of that, use the epp_cached field just for avoiding the above > if the new EPP value for the given CPU is the same as the old one. >* Export cpufreq_start/stop_governor() from the core (for the above). EPP is still not right. I am not messing with it at all, just observing via my msr-decoder. I booted without any intel_pstate related directives for the kernel command line. The below is as expected (performance gov.): # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 8845 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 8843 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 109252E : high 46 : guaranteed 37 : efficient 9 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : min: 46 : 46 : 46 : 46 : 46 : 46 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp:0 :0 :0 :0 :0 :0 : act:0 :0 :0 :0 :0 :0 : 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 and then switched to passive mode later. EPP is not as expected. Expect 0 (performance mode): # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 
0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 8844 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 8842 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 108252E : high 46 : guaranteed 37 : efficient 8 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E : min: 46 : 46 : 46 : 46 : 46 : 46 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 255 : 255 : 255 : 255 : 255 : 255 : act:0 :0 :0 :0 :0 :0 : 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 Then switched to ondemand governor, and put 100% load on 2 CPUs. EPP is not as expected, which I don't actually know what to expect, but assume 128: # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 883B 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882B 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: FF002E09 : FF002E0C : FF002E2E : FF002E08 : FF002E2E : FF002E18 : min:9 : 12 : 46 :8 : 46 : 24 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 255 : 255 : 255 : 255 : 255 : 255 : act:0 :0 :0 :0 :0 :0 : 7.) 
0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 For what it's worth, Kernel: 78b39581ed85 (HEAD -> dtemp) cpufreq: intel_pstate: Implement passive mode with HWP enabled c0842fbc1b18 (origin/master, origin/HEAD, master) random32: move the pseudo-random 32-bit definitions to prandom.h 2baa85d6927d Merge tag 'acpi-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm 04084978003c Merge
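The raw IA32_HWP_REQUEST values quoted above (0x2E2E with EPP 0 versus 0xFF002E2E with EPP 255) can be sanity-checked without the msr-decoder tool; the field split is just a few shifts. A minimal sketch (not Doug's actual decoder), with the field layout taken from the Intel SDM description of MSR 0x774:

```python
def decode_hwp_request(raw):
    """Split a raw IA32_HWP_REQUEST (MSR 0x774) value into its fields.

    Field layout per the Intel SDM: bits 7:0 minimum performance,
    15:8 maximum performance, 23:16 desired performance, 31:24 EPP,
    41:32 activity window, bit 42 package control.
    """
    return {
        "min": raw & 0xFF,
        "max": (raw >> 8) & 0xFF,
        "des": (raw >> 16) & 0xFF,
        "epp": (raw >> 24) & 0xFF,
        "act": (raw >> 32) & 0x3FF,   # activity window
        "pkg": (raw >> 42) & 0x1,     # package control bit
    }

# Values observed above: performance governor before and after the glitch.
print(decode_hwp_request(0x2E2E))      # epp 0, min/max pinned to 46 (0x2E)
print(decode_hwp_request(0xFF002E2E))  # epp 255, i.e. "power" not "performance"
```

This makes the reported problem easy to state: only the top byte (EPP) is wrong after the mode switch; the min and max performance bytes stay at the expected 0x2E.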
RE: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Srinivas, Thanks for your help. I was missing several needed patches. On 2020.08.02 11:39 Srinivas Pandruvada wrote: > On Sun, 2020-08-02 at 07:00 -0700, Doug Smythies wrote: > > On 2020.08.01 09:40 Srinivas Pandruvada wrote: > > > > On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote: > > > > > On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki > > > > > wrote: > > > > > > This really is a v2 of this patch: > > > > > > > > > > > > https://patchwork.kernel.org/patch/11663271/ > > > > > > > > > > > > with an extra preceding cleanup patch to avoid making > > > > > > unrelated > > > > > > changes in the > > > > > > [2/2]. > > > I applied this series along with > > > [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active > > > mode > > > on 5.8 latest master (On top of raw epp patchset). > > > > Would you be kind enough to provide a "git log --oneline" output > > of what you did. > > 69dd9b2b11cd (HEAD -> 5-9-devel) cpufreq: intel_pstate: Implement > passive mode with HWP enabled > 63efaa01b06a cpufreq: intel_pstate: Fix EPP setting via sysfs in active > mode > e11e0a2edf83 cpufreq: intel_pstate: Rearrange the storing of new EPP > values > 93c3fd6a315c cpufreq: intel_pstate: Avoid enabling HWP if EPP is not > supported > 7cef1dd371c3 cpufreq: intel_pstate: Clean up aperf_mperf_shift > description > a3248d8d3a11 cpufreq: intel_pstate: Supply struct attribute description > for get_aperf_mperf_shift() > f52b6b075b07 cpufreq: intel_pstate: Fix static checker warning for epp > variable > 4a59d6be0774 cpufreq: intel_pstate: Allow raw energy performance > preference value > 7b34b5acdcc6 cpufreq: intel_pstate: Allow enable/disable energy > efficiency > ac3a0c847296 (origin/master, origin/HEAD, master) Merge > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net > > > > I have been trying unsuccessfully to apply the patches, > > so somewhere I obviously missed something. 
> > > > > When intel_pstate=passive from kernel command line then it is fine, > > > no > > > crash. But switch dynamically, crashed: > > > > I'll try to repeat, if I can get an actual kernel. I could not repeat your crash. I tried booting with and without intel_pstate=passive on the kernel command line and then switching back and forth thereafter. However, I do confirm EPP is messed up. But not min and max from MSR 0x774, they behave as expected, based on quick testing only. Since you mentioned: >>> Don't have a recipe to reproduce this. Maybe I simply didn't hit whatever. ... Doug Useless additional stuff: # cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-5.8.0-rc7-dssp root=UUID=0ac356c1-caa9-4c2e-8229-4408bd998dbd ro ipv6.disable=1 consoleblank=450 intel_pstate=passive cpuidle_sysfs_switch cpuidle.governor=teo Went "active" then "passive" and set ondemand governor. 2 X 100% CPU loads: # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 883C 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882D 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: FF002E0A : FF002E2E : FF002E2E : FF002E08 : FF002E18 : FF002E08 : min: 10 : 46 : 46 :8 : 24 :8 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 255 : 255 : 255 : 255 : 255 : 255 : act:0 :0 :0 :0 :0 :0 : 7.) 
0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 Kernel: d72c8472dbd5 (HEAD -> k58rc7-d3) cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode c2f4869fbc27 cpufreq: intel_pstate: Implement passive mode with HWP enabled 85219968fab9 cpufreq: intel_pstate: Rearrange the storing of new EPP values 5c09a1a38106 cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported 9f29c81fe0b3 cpufreq: intel_pstate: Clean up aperf_mperf_shift description 2a863c241495 cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift() 4180d8413037 cpufreq: intel_pstate: Fix static checker warning for epp variable 7cd50e86a9e6 cpufreq: intel_pstate: Allow raw energy performance preference value 56dce9a1081e cpufreq: intel_pstate: Allow enable/disable energy efficiency 92ed30191993 (tag: v5.8-rc7) Linux 5.8-rc7
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, On 2020.07.19 04:43 Rafael J. Wysocki wrote: > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote: > > On 2020.07.16 05:08 Rafael J. Wysocki wrote: > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies > > > wrote: > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > >> > > > >> > From: Rafael J. Wysocki > > >> ... > > >> > Since the passive mode hasn't worked with HWP at all, and it is not > > >> > going to > > >> > the default for HWP systems anyway, I don't see any drawbacks related > > >> > to making > > >> > this change, so I would consider this as 5.9 material unless there are > > >> > any > > >> > serious objections. > > >> > > >> Good point. > > > > Actually, for those users that default to passive mode upon boot, > > this would mean they would find themselves using this. > > Also, it isn't obvious, from the typical "what driver and what governor" > > inquiry. > > So the change in behavior is that after this patch > intel_pstate=passive doesn't imply no_hwp any more. > > That's a very minor difference though and I'm not aware of any adverse > effects it can cause on HWP systems anyway. My point was, that it will now default to something where testing has not been completed. > The "what governor" is straightforward in the passive mode: that's > whatever cpufreq governor has been selected. I think you might have missed my point. From the normal methods of inquiry one does not know if HWP is being used or not. Why? Because with or without HWP one gets the same answers under: /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ... Doug
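Since scaling_driver and scaling_governor look the same either way, one way to answer the "is HWP in use" question from user space is to read IA32_PM_ENABLE (MSR 0x770) directly, e.g. with rdmsr from msr-tools, as the msr-decoder dumps in these threads do; bit 0 is the HWP enable bit. A trivial sketch of the check (the raw value would come from /dev/cpu/*/msr and needs root; here it is passed in):

```python
def hwp_enabled(pm_enable_raw):
    # IA32_PM_ENABLE (MSR 0x770) bit 0: HWP enable.
    # Note this bit is write-once; it stays set until reset.
    return bool(pm_enable_raw & 0x1)

print(hwp_enabled(0x1))  # True, matching "IA32_PM_ENABLE: 1 : HWP enable"
```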
RE: [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.01 16:41 Srinivas Pandruvada wrote: > On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit (HWP floor) to the > > P-state value given by the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other, at least when the schedutil governor > > is in use, and update the intel_pstate documentation accordingly. > > > > Among other things, this allows utilization clamps to be taken > > into account, at least to a certain extent, when intel_pstate is > > in use and makes it more likely that sufficient capacity for > > deadline tasks will be provided. > > > > After this change, the resulting behavior of an HWP system with > > intel_pstate in the passive mode should be close to the behavior > > of the analogous non-HWP system with intel_pstate in the passive > > mode, except that in the frequency range below the base frequency > > (ie. the frequency retured by the base_frequency cpufreq attribute > > in sysfs on HWP systems) the HWP algorithm is allowed to go above > > the floor P-state set by intel_pstate with or without hardware > > coordination of P-states among CPUs in the same package. > > > Do you mean HWP.req.min will be below base_freq (unless user overrides > it)? No. > With busy workload I see HWP req.min = HWP req.max. > The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27) > When I monitor MSR 0x774 (HWP_REQ), I see > 0x80002727 Yes, that is what I expect to see. > > Normally msr 0x774 > 0x80002704 That would be "active" mode and the powersave governor, correct?. And yes that is what I expect for your processor. 
For mine, load or no load, decoded: 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : min:8 :8 :8 :8 :8 :8 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 128 : 128 : 128 : 128 : 128 : 128 : act:0 :0 :0 :0 :0 :0 : This thread is about passive mode, and myself, I do not expect the last byte to be 4 (8 for mine) under load. ... Doug
RE: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.01 09:40 Srinivas Pandruvada wrote: >> On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote: >>> On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote: This really is a v2 of this patch: https://patchwork.kernel.org/patch/11663271/ with an extra preceding cleanup patch to avoid making unrelated changes in the [2/2]. >>> > I applied this series along with > [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode > on 5.8 latest master (On top of raw epp patchset). Hi Srinivas, Would you be kind enough to provide a "git log --oneline" output of what you did. I have been trying unsuccessfully to apply the patches, so somewhere I obviously missed something. > When intel_pstate=passive from kernel command line then it is fine, no > crash. But switch dynamically, crashed: I'll try to repeat, if I can get an actual kernel. > Attached crash.txt. I may need to try your linux-pm tree. I also tried the linux-pm tree, same. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, Thank you for your reply. On 2020.07.16 05:08 Rafael J. Wysocki wrote: > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote: >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: >> > >> > From: Rafael J. Wysocki >> ... >> > Since the passive mode hasn't worked with HWP at all, and it is not going >> > to >> > the default for HWP systems anyway, I don't see any drawbacks related to >> > making >> > this change, so I would consider this as 5.9 material unless there are any >> > serious objections. >> >> Good point. Actually, for those users that default to passive mode upon boot, this would mean they would find themselves using this. Also, it isn't obvious, from the typical "what driver and what governor" inquiry. >> Some of the tests I do involve labour intensive post processing of data. >> I want to automate some of that work, and it will take time. >> We might be into the 5.9-rc series before I have detailed feedback. >> >> However, so far: >> >> Inverse impulse response test [1]: >> >> High level test, i5-9600K, HWP-passive (this patch), ondemand: >> 3101 tests. 0 failures. (GOOD) >> >> From [1], re-stated: >> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. >> > (HWP-active - powersave) >> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 >> > failures. >> >> My version of that cool Alexander named pipe test [2] serialized workflow: >> >> HWP-passive (this patch), performance: PASS. >> >> From [2], re-stated, and also re-tested. >> HWP-disabled passive - performance: FAIL. > > But I'm not quite sure how this is related to this patch? It isn't. The point being that it is different. But yes, that failure is because of our other discussion [3]. > > This test would still fail without the patch if the kernel was started > with intel_pstate=passive in the kernel command line, wouldn't it. Yes. > > > Although, I believe the issue to be EPB management, [3]. > > Well, that's kind of unexpected. 
> > If this really is the case, then it looks like the effect of the EPB > setting on the processor is different depending on whether or not HWP > is enabled. > >> And yes, I did see the reply to [3] that came earlier, >> And have now re-done the test, with the referenced patch added. >> It still is FAIL. I reply to the [3] thread, eventually. >> >> [1] https://marc.info/?l=linux-pm&m=159354421400342&w=2 >> [2] https://marc.info/?l=linux-pm&m=159155067328641&w=2 >> [3] https://marc.info/?l=linux-pm&m=159438804230744&w=2 >> >> Kernel: >> >> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP >> if EPP is not supported >> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled >> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference >> value >> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency >> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line >> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux >> 5.8-rc5 >> >> Rules for this work: >> >> . never use x86_energy_perf_policy. >> . For HWP disabled: never change from active to passive or vice versa, but >> rather do it via boot. >> . after boot always check and reset the various power limit log bits that >> are set. >> . never compile the kernel (well, until after any tests), which will set >> those bits again. >> . never run prime95 high heat torture test, which will set those bits again. >> . try to never do anything else that will set those bits again. >> >> To be clear, I do allow changing governors within the context of the above >> rules. > > Thanks for the feedback!
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki ... > Since the passive mode hasn't worked with HWP at all, and it is not going to > the default for HWP systems anyway, I don't see any drawbacks related to > making > this change, so I would consider this as 5.9 material unless there are any > serious objections. Good point. Some of the tests I do involve labour intensive post processing of data. I want to automate some of that work, and it will take time. We might be into the 5.9-rc series before I have detailed feedback. However, so far: Inverse impulse response test [1]: High level test, i5-9600K, HWP-passive (this patch), ondemand: 3101 tests. 0 failures. (GOOD) From [1], re-stated: > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active > - powersave) > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures. My version of that cool Alexander named pipe test [2] serialized workflow: HWP-passive (this patch), performance: PASS. From [2], re-stated, and also re-tested. HWP-disabled passive - performance: FAIL. Although, I believe the issue to be EPB management, [3]. And yes, I did see the reply to [3] that came earlier, And have now re-done the test, with the referenced patch added. It still is FAIL. I reply to the [3] thread, eventually.
[1] https://marc.info/?l=linux-pm&m=159354421400342&w=2 [2] https://marc.info/?l=linux-pm&m=159155067328641&w=2 [3] https://marc.info/?l=linux-pm&m=159438804230744&w=2 Kernel: b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference value bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5 Rules for this work: . never use x86_energy_perf_policy. . For HWP disabled: never change from active to passive or vice versa, but rather do it via boot. . after boot always check and reset the various power limit log bits that are set. . never compile the kernel (well, until after any tests), which will set those bits again. . never run prime95 high heat torture test, which will set those bits again. . try to never do anything else that will set those bits again. To be clear, I do allow changing governors within the context of the above rules. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Fix active mode setting from command line
Hi Rafael,

Thank you.

On 2020.07.13 06:59 Rafael J. Wysocki wrote:
>
> From: Rafael J. Wysocki
>
> If intel_pstate starts in the passive mode by default (that happens
> when the processor in the system doesn't support HWP), passing
> intel_pstate=active in the kernel command line doesn't work, so
> fix that.
>
> Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> Reported-by: Doug Smythies
> Signed-off-by: Rafael J. Wysocki

Acked-by: Doug Smythies

> ---
> drivers/cpufreq/intel_pstate.c | 8 ++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -2534,7 +2534,7 @@ static struct cpufreq_driver intel_cpufr
>  	.name		= "intel_cpufreq",
>  };
>
> -static struct cpufreq_driver *default_driver = &intel_pstate;
> +static struct cpufreq_driver *default_driver;
>
>  static void intel_pstate_driver_cleanup(void)
>  {
> @@ -2828,6 +2828,7 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> +			default_driver = &intel_pstate;
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2845,7 +2846,8 @@ static int __init intel_pstate_init(void
>  		return -ENODEV;
>  	}
>  	/* Without HWP start in the passive mode. */
> -	default_driver = &intel_cpufreq;
> +	if (!default_driver)
> +		default_driver = &intel_cpufreq;
>
>  hwp_cpu_matched:
>  	/*
> @@ -2899,6 +2901,8 @@ static int __init intel_pstate_setup(cha
>
>  	if (!strcmp(str, "disable")) {
>  		no_load = 1;
> +	} else if (!strcmp(str, "active")) {
> +		default_driver = &intel_pstate;
>  	} else if (!strcmp(str, "passive")) {
>  		default_driver = &intel_cpufreq;
>  		no_hwp = 1;
>
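For readers skimming the thread, the effect of the patch above can be sketched as a standalone decision function. This is an illustrative model only, not the kernel code: the strings stand in for &intel_pstate / &intel_cpufreq, and hwp_active stands in for the CPUID/HWP match in intel_pstate_init():

```python
def pick_driver(cmdline_arg, hwp_active):
    """Model of default-driver selection after the fix."""
    default_driver = None

    # intel_pstate_setup(): the command line now chooses a default
    # for "active" too, not only for "passive".
    if cmdline_arg == "active":
        default_driver = "intel_pstate"
    elif cmdline_arg == "passive":
        default_driver = "intel_cpufreq"

    # intel_pstate_init(): HWP systems still default to active mode...
    if hwp_active and default_driver is None:
        default_driver = "intel_pstate"

    # ...and only when nothing was chosen do we fall back to passive.
    if default_driver is None:
        default_driver = "intel_cpufreq"
    return default_driver

# The bug being fixed: on a no-HWP system, intel_pstate=active
# previously still ended up with the passive-mode driver.
print(pick_driver("active", False))  # intel_pstate
print(pick_driver(None, False))      # intel_cpufreq
```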
RE: [PATCH 2/2] cpufreq: intel_pstate: Use passive mode by default without HWP
Hi Rafael, As you may or may not recall, I am attempting to untangle and separate multiple compounding issues around the intel_pstate driver and HWP (or not). Until everything is figured out, I am using the following rules: . never use x86_energy_perf_policy. . For HWP disabled: never change from active to passive or vice versa, but rather do it via boot. . after boot always check and reset the various power limit log bits that are set. . never compile the kernel (well, until after any tests), which will set those bits again. . never run prime95 high heat torture test, which will set those bits again. . try to never do anything else that will set those bits again. On 2020.03.28 05:58 Rafael J. Wysocki wrote: > > From: "Rafael J. Wysocki" > > After recent changes allowing scale-invariant utilization to be > used on x86, the schedutil governor on top of intel_pstate in the > passive mode should be on par with (or better than) the active mode > "powersave" algorithm of intel_pstate on systems in which > hardware-managed P-states (HWP) are not used, so it should not be > necessary to use the internal scaling algorithm in those cases. > > Accordingly, modify intel_pstate to start in the passive mode by > default if the processor at hand does not support HWP or if the driver > is requested to avoid using HWP through the kernel command line. > > Among other things, that will allow utilization clamps and the > support for RT/DL tasks in the schedutil governor to be utilized on > systems in which intel_pstate is used. > > Signed-off-by: Rafael J.
Wysocki > --- > Documentation/admin-guide/pm/intel_pstate.rst | 32 > --- > drivers/cpufreq/intel_pstate.c| 3 ++- > 2 files changed, 21 insertions(+), 14 deletions(-) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin- > guide/pm/intel_pstate.rst > index ad392f3aee06..39d80bc29ccd 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -62,9 +62,10 @@ on the capabilities of the processor. > Active Mode > --- > > -This is the default operation mode of ``intel_pstate``. If it works in this > -mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all > ``CPUFreq`` > -policies contains the string "intel_pstate". > +This is the default operation mode of ``intel_pstate`` for processors with > +hardware-managed P-states (HWP) support. If it works in this mode, the > +``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies > +contains the string "intel_pstate". > > In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` > and > provides its own scaling algorithms for P-state selection. Those algorithms > @@ -138,12 +139,13 @@ internal P-state selection logic to be less > performance-focused. > Active Mode Without HWP > ~~~ > > -This is the default operation mode for processors that do not support the HWP > -feature. It also is used by default with the ``intel_pstate=no_hwp`` > argument > -in the kernel command line. However, in this mode ``intel_pstate`` may > refuse > -to work with the given processor if it does not recognize it. [Note that > -``intel_pstate`` will never refuse to work with any processor with the HWP > -feature enabled.] > +This operation mode is optional for processors that do not support the HWP > +feature or when the ``intel_pstate=no_hwp`` argument is passed to the kernel > in > +the command line. 
The active mode is used in those cases if the > +``intel_pstate=active`` argument is passed to the kernel in the command line. ??? I can not see anywhere in the code where the kernel command line argument "intel_pstate=active" is dealt with. > +In this mode ``intel_pstate`` may refuse to work with processors that are not > +recognized by it. [Note that ``intel_pstate`` will never refuse to work with > +any processor with the HWP feature enabled.] > > In this mode ``intel_pstate`` registers utilization update callbacks with the > CPU scheduler in order to run a P-state selection algorithm, either > @@ -188,10 +190,14 @@ is not set. > Passive Mode > > > -This mode is used if the ``intel_pstate=passive`` argument is passed to the > -kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting > too). > -Like in the active mode without HWP support, in this mode ``intel_pstate`` > may > -refuse to work with the given processor if it does not recognize it. > +This is the default operation mode of ``intel_pstate`` for processors without > +hardware-managed P-states (HWP) support. It is always used if the > +``intel_pstate=passive`` argument is passed to the kernel in the command line > +regardless of whether or not the given processor supports HWP. [Note that > the > +``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is > used > +without
RE: [PATCH v4 2/2] cpufreq: intel_pstate: Allow raw energy performance preference value
Hi Srinivas, Thanks for all your work on this. I have fallen behind, and not sure when I can catch up. However... On 2020.06.26 11:34 Srinivas Pandruvada wrote: > Similarly on battery the default "balance_performance" mode can be > aggressive in power consumption. But picking up the next choice > "balance power" results in too much loss of performance, which results in > bad user experience in use cases like "Google Hangout". It was observed > that some value between these two EPP is optimal. There is a possibility that one of the issues I have been ranting about could be a contributing factor to things like this. (I don't know if it actually is.) One way to compensate is to lower EPP. I am going to send a new e-mail in a minute about it. Please consider the possibility that some of these EPP adjustments might just be programming around the issue. ... Doug
RE: [PATCH 1/2] cpufreq: intel_pstate: Allow enable/disable energy efficiency
Hi Srinivas, I saw your V3. I do not understand your reluctance to use arch/x86/include/asm/msr-index.h as the place to define anything MSR related. Please explain. One more comment about 1/3 of the way down below. ... Doug On 2020.06.23 08:53 Doug Smythies wrote: > On 2020.06.22 22:13 Srinivas Pandruvada wrote: > > > By default intel_pstate driver disables energy efficiency by setting > > MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. > > This CPU model is also shared by Coffee Lake desktop CPUs. This allows > > these systems to reach maximum possible frequency. But this adds power > > penalty, which some customers don't want. They want some way to enable/ > > disable dynamically. > > > > So, add an additional attribute "energy_efficiency_enable" under > > /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows > > to read and write bit 19 ("Disable Energy Efficiency Optimization") in > > the MSR IA32_POWER_CTL. > > > > This attribute is present in both HWP and non-HWP mode as this has an > > effect in both modes. Refer to Intel Software Developer's manual for > > details. The scope of this bit is package wide. > > I do and always have. However these manuals are 1000's of pages, > are updated often, and it can be difficult to find the correct page > for the correct processor. So it is great that, in general, the same > mnemonics are used in the code as the manuals. 
> > > > > Suggested-by: Len Brown > > Signed-off-by: Srinivas Pandruvada > > --- > > Documentation/admin-guide/pm/intel_pstate.rst | 7 +++ > > drivers/cpufreq/intel_pstate.c| 49 ++- > > 2 files changed, 54 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > > b/Documentation/admin- > > guide/pm/intel_pstate.rst > > index 39d80bc29ccd..939bfdc53f4f 100644 > > --- a/Documentation/admin-guide/pm/intel_pstate.rst > > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > > @@ -431,6 +431,13 @@ argument is passed to the kernel in the command line. > > supported in the current configuration, writes to this attribute will > > fail with an appropriate error. > > > > +``energy_efficiency_enable`` > > + This attribute is only present on platforms, which has CPUs matching > > + Kaby Lake desktop CPU model. By default "energy_efficiency" is disabled > > So, why not mention Coffee Lake also, as you did above? And I still think you need to add "Coffee Lake" here also. > > > + on these CPU models in HWP mode by this driver. Enabling energy > > + efficiency may limit maximum operating frequency in both HWP and non > > + HWP mode. 
> > + > > Interpretation of Policy Attributes > > --- > > > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > > index 8e23a698ce04..1cf6d06f2314 100644 > > --- a/drivers/cpufreq/intel_pstate.c > > +++ b/drivers/cpufreq/intel_pstate.c > > @@ -1218,6 +1218,44 @@ static ssize_t store_hwp_dynamic_boost(struct > > kobject *a, > > return count; > > } > > > > +#define MSR_IA32_POWER_CTL_BIT_EE 19 > > (same comment as the other day, for another patch) In my opinion and for > consistency, this bit should be defined in > > arch/x86/include/asm/msr-index.h > > like so (I defined the other bits also): > > diff --git a/arch/x86/include/asm/msr-index.h > b/arch/x86/include/asm/msr-index.h > index e8370e64a155..1a6084067f18 100644 > --- a/arch/x86/include/asm/msr-index.h > +++ b/arch/x86/include/asm/msr-index.h > @@ -255,6 +255,12 @@ > > #define MSR_IA32_POWER_CTL 0x01fc > > +/* POWER_CTL bits (some are model specific): */ > + > +#define POWER_CTL_C1E 1 > +#define POWER_CTL_EEO 19 > +#define POWER_CTL_RHO 20 > + > #define MSR_IA32_MC0_CTL 0x0400 > #define MSR_IA32_MC0_STATUS0x0401 > #define MSR_IA32_MC0_ADDR 0x0402 > > There is another almost identical file at: > > tools/arch/x86/include/asm/msr-index.h > > and I am not sure which one is the master, but > the former contains stuff that the later does not. > > I have defined the bits names in a consistent way > as observed elsewhere in the file. i.e. drop the > "MSR_IA32_" prefix and add the bit. > > Now, tying this back to my earlier comment about the > manuals: > The file "tools/arch/x86/include/asm/msr-index.h"
RE: [PATCH v2 2/2] cpufreq: intel_pstate: Allow raw energy performance preference value
Hi Srinivas, I have immediate need for this. I have been using a tool I wrote myself for this which I can now retire. (it wasn't very good anyway). Yours remembers for each governor, and is way better. Thanks. On 2020.06.23 11:27 Srinivas Pandruvada wrote: > Currently using attribute "energy_performance_preference", user space can > write one of the four per-defined preference string. These preference > strings gets mapped to a hard-coded Energy-Performance Preference (EPP) or > Energy-Performance Bias (EPB) knob. > > These four values supposed to cover broad spectrum of use cases, but they > are not uniformly distributed in the range. Suggest: These four values are supposed to cover broad spectrum of use cases, but are not uniformly distributed in the range. > There are number of cases, > where this is not enough. For example: > > Suppose user wants more performance when connected to AC. Instead of using > default "balance performance", the "performance" setting can be used. This > changes EPP value from 0x80 to 0x00. But setting EPP to 0, results in > electrical and thermal issues on some platforms. > This results in CPU to do > aggressive throttling, which causes drop in performance. Suggest: This results in aggressive throttling, which causes a drop in performance. And: Tough. I consider "performance mode" as sacrosanct, and have always expected these to behave identically and at max CPU freq: intel_pstate no-hwp / performance intel_cpufreq no-hwp / performance (a.k.a. passive) acpi_cpufreq / performance intel_pstate hwp / performance intel_cpufreq hwp / performance (in future) as was always the case on my i7-2600K (no hwp) based computer and is not the case on my i5-9600K (hwp capable) computer. > But some value > between 0x80 and 0x00 results in better performance. But that value can't > be fixed as the power curve is not linear. In some cases just changing EPP > from 0x80 to 0x75 is enough to get significant performance gain. 
> > Similarly on battery EPP 0x80 can be very aggressive in power consumption. > But picking up the next choice "balance power" results in too much loss > of performance, which cause bad user experience in use case like "Google > Hangout". It was observed that some value between these two EPP is > optimal. > > This change allows fine grain EPP tuning for platform like Chromebooks. > Here based on the product and use cases, different EPP values can be set. > This change is similar to the change done for: > /sys/devices/system/cpu/cpu*/power/energy_perf_bias > where user has choice to write a predefined string or raw value. > > The change itself is trivial. When user preference doesn't match > predefined string preferences and value is an unsigned integer and in > range, use that value for EPP. When the EPP feature is not prsent ^^ s/prsent/present > writing raw value is not supported. > > Suggested-by: Len Brown > Signed-off-by: Srinivas Pandruvada > --- > Documentation/admin-guide/pm/intel_pstate.rst | 6 ++- > drivers/cpufreq/intel_pstate.c| 50 +++ > 2 files changed, 45 insertions(+), 11 deletions(-) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin- > guide/pm/intel_pstate.rst > index 939bfdc53f4f..5e209926e0ed 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -561,7 +561,11 @@ somewhere between the two extremes: > Strings written to the ``energy_performance_preference`` attribute are > internally translated to integer values written to the processor's > Energy-Performance Preference (EPP) knob (if supported) or its > -Energy-Performance Bias (EPB) knob. > +Energy-Performance Bias (EPB) knob. It is also possible to write a positive > +integer value between 0 to 255, if the EPP feature is present. If the EPP > +feature is not present, writing integer value to this attribute is not > +supported. 
In this case, user can use > + "/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface. > > [Note that tasks may by migrated from one CPU to another by the scheduler's > load-balancing algorithm and if different energy vs performance hints are > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 1cf6d06f2314..d8f195c7a428 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -602,11 +602,12 @@ static const unsigned int epp_values[] = { > HWP_EPP_POWERSAVE > }; > > -static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data) > +static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data, int > *raw_epp) > { > s16 epp; > int index = -EINVAL; > > + *raw_epp = 0; > epp = intel_pstate_get_epp(cpu_data, 0); > if (epp < 0) > return epp; > @@ -614,12 +615,14 @@ static int
RE: [PATCH v2 0/2] cpufreq: Specify the default governor on command line
Hi Quentin, Thanks for your quick reply. On 2020.06.23 11:05 Quentin Perret wrote: > Hi Doug, > > On Tuesday 23 Jun 2020 at 10:54:33 (-0700), Doug Smythies wrote: > > Hi Quentin, > > > > Because I am lazy and sometimes do not want to recompile > > the distro source, I have a need/desire for this. > > Good to know I'm not the only one ;-) > > > Tested these two grub command lines: > > > > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=300 > > intel_pstate=disable > cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" > > > > And > > > > #GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 > > intel_pstate=passive > cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" > > > > And all worked as expected. I use Ubuntu as my distro, and also had to > > disable a startup script that > switches to "ondemand", or similar, after 1 minute. > > Good, thanks for giving it a try. > > > As a side note (separate subject, but is one reason I tried it): > > My i5-9600K based computer seems to hit a power limit during boot > > approximately 3 seconds after > kernel selection on grub. > > This had no effect on that issue (even when selecting powersave governor). > > Interesting ... Could you confirm that compiling with powersave as > default doesn't fix the issue either? No, it doesn't (good idea for a test though). However, the big mains spike is also gone. So, I no longer know why those power limit log bits are always set after boot. > > Other question, when does the intel_pstate driver start on your device? > Before or after that 3 seconds boot time? Before, if I understand correctly (from dmesg): [0.468969] intel_pstate: Intel P-state driver initializing I'll attach a couple of annotated mains power graphs. (which will likely get stripped from the on-list version of this e-mail). Currently, I am drowning in stuff that doesn't work, and will put this aside for now. 
I'll revive this as a new thread or a bugzilla eventually. I also tried booting with turbo disabled, no difference. Thanks for this patch set. ... Doug
RE: [PATCH v2 0/2] cpufreq: Specify the default governor on command line
On 2020.06.23 07:22 Quentin Perret wrote: > > This series enables users of prebuilt kernels (e.g. distro kernels) to > specify their CPUfreq governor of choice using the kernel command line, > instead of having to wait for the system to fully boot to userspace to > switch using the sysfs interface. This is helpful for 2 reasons: > 1. users get to choose the governor that runs during the actual boot; > 2. it simplifies the userspace boot procedure a bit (one less thing to > worry about). > > To enable this, the first patch moves all governor init calls to > core_initcall, to make sure they are registered by the time the drivers > probe. This should be relatively low impact as registering a governor > is a simple procedure (it gets added to a llist), and all governors > already load at core_initcall anyway when they're set as the default > in Kconfig. This also allows to clean-up the governors' init/exit code, > and reduces boilerplate. > > The second patch introduces the new command line parameter, inspired by > its cpuidle counterpart. More details can be found in the respective > patch headers. 
> > Changes in v2: > - added Viresh's ack to patch 01 > - moved the assignment of 'default_governor' in patch 02 to the governor >registration path instead of the driver registration (Viresh) > > Thanks, > Quentin > > Quentin Perret (2): > cpufreq: Register governors at core_initcall > cpufreq: Specify default governor on command line > > .../admin-guide/kernel-parameters.txt | 5 > Documentation/admin-guide/pm/cpufreq.rst | 6 ++--- > .../platforms/cell/cpufreq_spudemand.c| 26 ++- > drivers/cpufreq/cpufreq.c | 23 > drivers/cpufreq/cpufreq_conservative.c| 22 > drivers/cpufreq/cpufreq_ondemand.c| 24 + > drivers/cpufreq/cpufreq_performance.c | 14 ++ > drivers/cpufreq/cpufreq_powersave.c | 18 +++-- > drivers/cpufreq/cpufreq_userspace.c | 18 +++-- > include/linux/cpufreq.h | 14 ++ > kernel/sched/cpufreq_schedutil.c | 6 + > 11 files changed, 62 insertions(+), 114 deletions(-) > > -- > 2.27.0.111.gc72c7da667-goog Hi Quentin, Because I am lazy and sometimes do not want to recompile the distro source, I have a need/desire for this. Tested these two grub command lines: GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=300 intel_pstate=disable cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" And #GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 intel_pstate=passive cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" And all worked as expected. I use Ubuntu as my distro, and also had to disable a startup script that switches to "ondemand", or similar, after 1 minute. As a side note (separate subject, but is one reason I tried it): My i5-9600K based computer seems to hit a power limit during boot approximately 3 seconds after kernel selection on grub. This had no effect on that issue (even when selecting powersave governor). ... Doug
RE: [PATCH 1/2] cpufreq: intel_pstate: Allow enable/disable energy efficiency
On 2020.06.22 22:13 Srinivas Pandruvada wrote: > By default intel_pstate driver disables energy efficiency by setting > MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. > This CPU model is also shared by Coffee Lake desktop CPUs. This allows > these systems to reach maximum possible frequency. But this adds power > penalty, which some customers don't want. They want some way to enable/ > disable dynamically. > > So, add an additional attribute "energy_efficiency_enable" under > /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows > to read and write bit 19 ("Disable Energy Efficiency Optimization") in > the MSR IA32_POWER_CTL. > > This attribute is present in both HWP and non-HWP mode as this has an > effect in both modes. Refer to Intel Software Developer's manual for > details. The scope of this bit is package wide. I do and always have. However these manuals are 1000's of pages, are updated often, and it can be difficult to find the correct page for the correct processor. So it is great that, in general, the same mnemonics are used in the code as the manuals. > > Suggested-by: Len Brown > Signed-off-by: Srinivas Pandruvada > --- > Documentation/admin-guide/pm/intel_pstate.rst | 7 +++ > drivers/cpufreq/intel_pstate.c| 49 ++- > 2 files changed, 54 insertions(+), 2 deletions(-) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin- > guide/pm/intel_pstate.rst > index 39d80bc29ccd..939bfdc53f4f 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -431,6 +431,13 @@ argument is passed to the kernel in the command line. > supported in the current configuration, writes to this attribute will > fail with an appropriate error. > > +``energy_efficiency_enable`` > + This attribute is only present on platforms, which has CPUs matching > + Kaby Lake desktop CPU model. 
By default "energy_efficiency" is disabled So, why not mention Coffee Lake also, as you did above? > + on these CPU models in HWP mode by this driver. Enabling energy > + efficiency may limit maximum operating frequency in both HWP and non > + HWP mode. > + > Interpretation of Policy Attributes > --- > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 8e23a698ce04..1cf6d06f2314 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -1218,6 +1218,44 @@ static ssize_t store_hwp_dynamic_boost(struct kobject > *a, > return count; > } > > +#define MSR_IA32_POWER_CTL_BIT_EE 19 (same comment as the other day, for another patch) In my opinion and for consistency, this bit should be defined in arch/x86/include/asm/msr-index.h like so (I defined the other bits also): diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index e8370e64a155..1a6084067f18 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -255,6 +255,12 @@ #define MSR_IA32_POWER_CTL 0x01fc +/* POWER_CTL bits (some are model specific): */ + +#define POWER_CTL_C1E 1 +#define POWER_CTL_EEO 19 +#define POWER_CTL_RHO 20 + #define MSR_IA32_MC0_CTL 0x0400 #define MSR_IA32_MC0_STATUS 0x0401 #define MSR_IA32_MC0_ADDR 0x0402 There is another almost identical file at: tools/arch/x86/include/asm/msr-index.h and I am not sure which one is the master, but the former contains stuff that the latter does not. I have defined the bit names in a consistent way as observed elsewhere in the file. i.e. drop the "MSR_IA32_" prefix and add the bit. Now, tying this back to my earlier comment about the manuals: The file "tools/arch/x86/include/asm/msr-index.h" is an excellent gateway reference for those 1000's of pages of software reference manuals. 
As a user that uses them all the time, but typically only digs deep into those manuals once every year or two, I find tremendous value added via the msr-index.h file. > + > +static ssize_t show_energy_efficiency_enable(struct kobject *kobj, > + struct kobj_attribute *attr, > + char *buf) > +{ > + u64 power_ctl; > + int enable; > + > + rdmsrl(MSR_IA32_POWER_CTL, power_ctl); > + enable = (power_ctl & BIT(MSR_IA32_POWER_CTL_BIT_EE)) >> > MSR_IA32_POWER_CTL_BIT_EE; > + return sprintf(buf, "%d\n", !enable); > +} > + > +static ssize_t store_energy_efficiency_enable(struct kobject *a, > + struct kobj_attribute *b, > + const char *buf, size_t count) > +{ > + u64 power_ctl; > + u32 input; > + int ret; > + > + ret = kstrtouint(buf, 10, ); > + if (ret) > +
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.21 10:16 Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it translate the target frequency supplied by the cpufreq > governor in use into an EPP value to be written to the HWP request > MSR (high frequencies are mapped to low EPP values that mean more > performance-oriented HWP operation) as a hint for the HWP algorithm > in the processor, so as to prevent it and the CPU scheduler from > working against each other at least when the schedutil governor is > in use. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a prototype not intended for production use (based on linux-next). > > Please test it if you can (on HWP systems, of course) and let me know the > results. > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > may turn out to be either too high or too low for the general use, which is > one > reason why getting as much testing coverage as possible is key here. > > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > please do so and let me know the conclusions. > > Cheers, > Rafael To anyone trying this patch: You will need to monitor EPP (Energy Performance Preference) carefully. It changes as a function of passive/active, and if you booted active or passive or no-hwp and changed later. Originally, I was not specifically monitoring EPP, or paths taken since boot towards driver, intel_pstate or intel_cpufreq, and governor, and will now have to set aside test results. @Rafael: I am still having problems with my test computer and HWP. However, I can observe the energy saving potential of this "passive-yet-active HWP mode". At this point, I am actually trying to make my newer test computer simply behave and do what it is told with respect to CPU frequency scaling, because even acpi-cpufreq misbehaves for performance governor under some conditions [1]. 
[1] https://marc.info/?l=linux-pm=159155067328641=2 To my way of thinking: 1.) it is imperative that we be able to decouple the governor servo from the processor servo. At a minimum this is needed for system testing, debugging and reference baselines. At a maximum users could, perhaps, decide for themselves. Myself, I would prefer "passive" to mean "do what you have been told", and that is now what I am testing. 2.) I have always thought, indeed relied on, performance mode as being more than a hint. For my older i7-2600K it never disobeyed orders, except for the most minuscule of workloads. This newer i5-9600K seems to have a mind of its own which I would like to be able to turn off, yet still be able to use intel_pstate trace with schedutil. Recall last week I said > moving forward the typical CPU frequency scaling > configuration for my test system will be: > > driver: intel-cpufreq, forced at boot. > governor: schedutil > hwp: forced off at boot. The problem is that baseline references are still needed and performance mode is unreliable. Maybe other stuff also, I simply don't know at this point. Example of EPP changing (no need to read on) (from fresh boot): Current EPP: root@s18:/home/doug# rdmsr --bitfield 31:24 -u -a 0x774 128 128 128 128 128 128 root@s18:/home/doug# grep . 
/sys/devices/system/cpu/cpu3/cpufreq/* /sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/base_frequency:370 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:460 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:80 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:0 /sys/devices/system/cpu/cpu3/cpufreq/energy_performance_available_preferences:default performance balance_performance balance_power power /sys/devices/system/cpu/cpu3/cpufreq/energy_performance_preference:balance_performance /sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:performance powersave /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800102 /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:intel_pstate /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:powersave /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:460 /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:80 /sys/devices/system/cpu/cpu3/cpufreq/scaling_setspeed: Now, switch to passive mode: echo passive > /sys/devices/system/cpu/intel_pstate/status And observe EPP: root@s18:/home/doug# rdmsr --bitfield 31:24 -u -a 0x774 255 255 255 255 255 255 root@s18:/home/doug# grep . /sys/devices/system/cpu/cpu3/cpufreq/* /sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:460 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:80 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:2 /sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:conservative ondemand userspace
RE: [PATCH] cpufreq: intel_pstate: Add additional OOB enabling bit
On 2020.06.11 10:49 Srinivas Pandruvada wrote: > Add additional bit for OOB (Out of band) enabling of P-states. In this > case intel_pstate shouldn't load. Currently, only "BIT(8) == 1" of the > MSR MSR_MISC_PWR_MGMT is considered as OOB. Also add "BIT(18) == 1" as > OOB condition. Shouldn't those bits be defined in these files: arch/x86/include/asm/msr-index.h and tools/arch/x86/include/asm/msr-index.h ? By the way, I couldn't find those bits defined in Intel docs that I have. > > Signed-off-by: Srinivas Pandruvada > --- > drivers/cpufreq/intel_pstate.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 8e23a698ce04..f21761443c90 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -2686,8 +2686,8 @@ static bool __init > intel_pstate_platform_pwr_mgmt_exists(void) > id = x86_match_cpu(intel_pstate_cpu_oob_ids); > if (id) { > rdmsrl(MSR_MISC_PWR_MGMT, misc_pwr); > - if (misc_pwr & (1 << 8)) { > - pr_debug("Bit 8 in the MISC_PWR_MGMT MSR set\n"); > + if ((misc_pwr & BIT(8)) || (misc_pwr & BIT(18))) { And then those bit definitions used above. > + pr_debug("Bit 8 or 18 in the MISC_PWR_MGMT MSR set\n"); And then some insight also printed with the debug message. At least say "Out of Band". > return true; > } > } > -- > 2.24.1
RE: schedutil issue with serial workloads
On 2020.06.05 Rafael J. Wysocki wrote: > On 6/4/2020 11:29 PM, Alexander Monakov wrote: > > Hello, > > Hi, > > Let's make more people see your report. > > +Peter, Giovanni, Quentin, Juri, Valentin, Vincent, Doug, and linux-pm. > >> this is a question/bugreport about behavior of schedutil on serial workloads >> such as rsync, or './configure', or 'make install'. These workloads are >> such that there's no single task that takes a substantial portion of CPU >> time, but at any moment there's at least one runnable task, and overall >> the workload is compute-bound. To run the workload efficiently, cpufreq >> governor should select a high frequency. >> >> Assume the system is idle except for the workload in question. >> >> Sadly, schedutil will select the lowest frequency, unless the workload is >> confined to one core with taskset (in which case it will select the >> highest frequency, correctly though somewhat paradoxically). > > That's because the CPU utilization generated by the workload on all CPUs > is small. > > Confining it to one CPU causes the utilization of this one to grow and > so schedutil selects a higher frequency for it. > >> This sounds like it should be a known problem, but I couldn't find any >> mention of it in the documentation. Yes, this issue is very well known, and has been discussed on this list several times, going back many years (and I likely missed some of the discussions). In recent years Giovanni's git "make test" has been the "go-to" example for this. From that test, which has run-to-run variability due to disk I/O, I made a test that varies PIDs per second versus time. Giovanni's recent work on frequency invariance made a huge difference for the schedutil response to this type of serialized workflow. For my part of it: I only ever focused on a new-PID-per-work-packet serialized workflow; since my last testing on this subject in January, I fell behind with system issues and infrastructure updates. 
Your workflow example is fascinating and rather revealing. I will make use of it moving forward. Thank you. Yes, schedutil basically responds poorly as it did for PIDs/second based workflow before frequency invariance, but...(digression follows)... Typically, I merely set the performance governor whenever I know I will be doing serialized workflow, or whenever I just want the job done the fastest (i.e. kernel compile). If I use performance mode (hwp disabled, either active or passive, doesn't matter), then I cannot get the CPU frequency to max, even if I set: $ grep . /sys/devices/system/cpu/intel_pstate/m??_perf_pct /sys/devices/system/cpu/intel_pstate/max_perf_pct:100 /sys/devices/system/cpu/intel_pstate/min_perf_pct:100 I have to increase EPB all the way to 1 to get to max CPU frequency. There also is extreme hysteresis, as I have to go back to 9 for the frequency to drop again. The above was an i5-9600K. My much older i7-2600K works fine with the default EPB of 6. I had not previously realized there was so much difference between processors and EPB. I don't have time to dig deeper right now, but will in future. >> I was able to replicate the effect with a pair of 'ping-pong' programs >> that get a token, burn some cycles to simulate work, and pass the token. >> Thus, each program has 50% CPU utilization. To repeat my test: >> >> gcc -O2 pingpong.c -o pingpong >> mkfifo ping >> mkfifo pong >> taskset -c 0 ./pingpong 100 < ping > pong & >> taskset -c 1 ./pingpong 100 < pong > ping & >> echo > ping >> >> #include <stdio.h> >> #include <unistd.h> >> int main(int argc, char *argv[]) >> { >> unsigned i, n; >> sscanf(argv[1], "%u", &n); >> for (;;) { >> char c; >> read(0, &c, 1); >> for (i = n; i; i--) >> asm("" :: "r"(i)); >> write(1, &c, 1); >> } >> } >> >> Alexander It was not obvious to me what the approximate work/sleep frequency would be for your workflow. For my version of it I made the loop time slower on purpose, since I could merely adjust "N" to compensate. 
I measured 100 hertz work/sleep frequency per CPU, but my pipeline is 6 instead of 2. Just for the record, this is what I did: doug@s18:~/c$ cat pingpong.c #include <stdio.h> #include <unistd.h> int main(int argc, char *argv[]) { unsigned i, n, k; sscanf(argv[1], "%u", &n); while(1) { char c; read(0, &c, 1); for (i = n; i; i--){ k = i; k = k++; } write(1, &c, 1); } } Compiled with: cc pingpong.c -o pingpong and run with (on purpose, I did not force CPU affinity, as I wanted schedutil to decide (when it was the governor, at least)): #! /bin/dash # # ping-pong-test Smythies 2019.06.06 # serialized workflow, but same PID. # from Alexander, but modified. # # because I always forget from last time killall pingpong rm --force pong1 rm --force pong2 rm --force pong3 rm --force pong4 rm
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
where HWP appears to behave, the no-hwp power > use is much better. > > O.K., so now, do a couple of more turbostat samples: > > intel_pstate/powersave hwp idle state 2 disabled: > > Overruns: 3 > Ave. work percent: 66.647895 > Processor package power: ~16.8 watts. > Average CPU frequency: 4.6 gigahertz > > intel_pstate/powersave hwp idle state 3 disabled: > > Overruns: 22 > Ave. work percent: 66.647895 > Processor package power: ~16.2 watts. > Average CPU frequency: 4.6 gigahertz > > To prevent all the bots that burden my site, the link is coded: > [1] double u double u double u dot smythies dot com > /~doug/linux/s18/hwp/index.html > > ... Doug > > > -Original Message- > > From: Rafael J. Wysocki [mailto:r...@rjwysocki.net] > > Sent: May 26, 2020 11:21 AM > > To: Linux PM; Doug Smythies > > Cc: LKML; Len Brown; Srinivas Pandruvada; Peter Zijlstra; Giovanni > > Gherdovich; Francisco Jerez > > Subject: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with > > HWP enabled > > > > From: Rafael J. Wysocki > > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit to 75% of the P-state > > value corresponding to the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other at least when the schedutil governor > > is in use. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > > > This is a replacement for https://patchwork.kernel.org/patch/11563615/ that > > uses the HWP floor (minimum performance limit) as the feedback to the HWP > > algorithm (instead of the EPP). > > > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP is still 5000 and the previous > > comments > > still apply to it. > > > > In addition to that, the 75% fraction used in intel_cpufreq_adjust_hwp() > > can be > > adjusted too, but I would like to use a value with a power-of-2 denominator > > for > > that (so the next candidate would be 7/8). 
> > > > Everyone who can do that is kindly requested to test this and let me know > > the outcome.
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
Hi Rafael, As you well know, I often test first and ask questions and review code later. I think I should have questioned this first. To the best of my ability/availability, I am committed to follow up on the hwp issue raised on the other branch of this thread. However, moving forward the typical CPU frequency scaling configuration for my test system will be: driver: intel-cpufreq, forced at boot. governor: schedutil hwp: forced off at boot. On 2020.05.26 11:21 Rafael J. Wysocki wrote: > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit to 75% of the P-state > value corresponding to the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other at least when the schedutil governor > is in use. I think we need to define what "passive" mode is. I have always interpreted it to mean "I would like this pstate please. It has been requested by some higher level servo". The name intel_cpufreq makes sense. I have always interpreted "active" to mean "I would like the intel_pstate CPU frequency driver to decide what pstate I need". As mentioned on the other branch of this thread, I don't have a stable test baseline, but the servos are still fighting each other with this version of the patch. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a replacement for https://patchwork.kernel.org/patch/11563615/ that > uses the HWP floor (minimum performance limit) as the feedback to the HWP > algorithm (instead of the EPP). > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP is still 5000 and the previous comments > still apply to it. > > In addition to that, the 75% fraction used in intel_cpufreq_adjust_hwp() can > be > adjusted too, but I would like to use a value with a power-of-2 denominator > for > that (so the next candidate would be 7/8). 
The issue here is that the lag of the CPU frequency is not a constant, but rather a function of the task work/sleep timing versus whatever else is going on. One has to allow for the worst case. From thousands of seconds of intel_pstate trace data, that limit needs to be about 3% (31/32).

Disclaimer: Done with no-hwp, active/powersave. The results might not be transferable to hwp enabled.

>
> Everyone who can do that is kindly requested to test this and let me know
> the outcome.
>
> Of course, the documentation still needs to be updated. Also, the EPP can be
> handled in analogy with the active mode now, but that part can be added in a
> separate patch on top of this one.
>
> Thanks!
>
> ---
>  drivers/cpufreq/intel_pstate.c |  119 ++---
>  1 file changed, 88 insertions(+), 31 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>
>  #ifdef CONFIG_ACPI
> @@ -2175,7 +2176,10 @@ static int intel_pstate_verify_policy(st
>
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2183,12 +2187,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2318,13 +2320,58 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>
> +static void intel_cpufreq_update_hwp_request(struct cpudata *cpu, u32 min_perf)
> +{
> +	u64 value, prev;
> +
> +	rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &prev);
> +	value = prev;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(min_perf);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the HWP min
> +	 * field in it, so opportunistically update the max too if needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	if (value != prev)
> +
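The field manipulation in the hunk above (mask out a byte-wide field, then OR in the new value, and only write the MSR back if it changed) can be mimicked in shell arithmetic; the register and perf values below are invented, and on real hardware the result would go back via wrmsr:

```shell
# MSR_HWP_REQUEST layout (Intel SDM): bits 7:0 = minimum performance,
# bits 15:8 = maximum performance.
prev=$(( 0x80002E08 ))   # made-up current register value
min_perf=$(( 0x1F ))
max_perf=$(( 0x2E ))

value=$prev
value=$(( (value & ~0xFF)   | min_perf ))         # replace HWP_MIN_PERF field
value=$(( (value & ~0xFF00) | (max_perf << 8) ))  # replace HWP_MAX_PERF field

if [ "$value" -ne "$prev" ]; then
    printf 'write back needed: 0x%X\n' "$value"
fi
```

With the made-up values this prints `write back needed: 0x80002E1F`; if min and max already matched, nothing would be written.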
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
On 2020.05.31 12:29 Srinivas Pandruvada wrote: > On Sun, 2020-05-31 at 11:59 -0700, Srinivas Pandruvada wrote: >> On Sun, 2020-05-31 at 11:06 -0700, Doug Smythies wrote: >> > Hi Srinivas, >> > >> > Thanks you for your quick reply. >> > >> > On 2020.05.31 09:54 Srinivas Pandruvada wrote >> > > On Sun, 2020-05-31 at 09:39 -0700, Doug Smythies wrote: >> > > > Event begins at 17.456 seconds elapsed time. >> > > > Previous event was about 107 milliseconds ago. >> > > > >> > > > Old min ; new min ; freq GHz; load % ; duration mS >> > > > 27 ; 28 ; 4.60; 68.17 ; 10.226 >> > > > 28 ; 26 ; 4.53; 57.47 ; 10.005 >> > > >> > > Seems you hit power/thermal limit >> > >> > No. >> > >> > I am nowhere near any power limit at all. >> > I have meticulously configured and tested the thermal management of >> > this computer. >> > I never ever hit a thermal limit and have TDP set such that the >> > processor >> > temperature never exceeds about 75 degrees centigrade. >> > >> > There should never be throttling involved in these experiments. >> > I can achieve throttling when compiling the kernel and with >> > torture test mode on the mprime test (other CPU stressors, >> > including my own, are not as good at generating heat as >> > mprime). >> > >> > This system can run indefinitely at 99.9 watts processor package >> > power. >> > Example (turbostat, steady state, CPU freq throttled to 4.04 GHz): >> > >> > doug@s18:~$ sudo ~/turbostat --Summary --quiet --show >> > Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 12 >> > Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt >> > 100.21 404572231 66 99.93 0.00 >> > 100.21 404372239 65 99.92 0.00 >> > >> > > Is this some Lenovo system? >> > >> > No. The web page version of my original e-mail has >> > a link to the test computer hardware profile. >> > >> > The motherboard is ASUS PRIME Z390-P. >> > >> >> OK, this seems a desktop system. >> >> > > If you disable HWP you don't see that? >> > >> > Correct. 
>> > > What is the value of
>> > > cat /sys/bus/pci/devices/\:00\:04.0/tcc_offset_degree_celsius
>> >
>> > ? "No such file or directory"
>> >
>> > > cat /sys/class/powercap/intel-rapl-mmio/intel-rapl-
>> > > mmio:0/constraint_0_power_limit_uw
>>
>> You may not have
>> CONFIG_INT340X_THERMAL=y

I have: CONFIG_INT340X_THERMAL=m

>> What is
>> #rdmsr 0x1a2

From the turbostat startup spew of stuff (also in a link on the web page version of the e-mail):

MSR_IA32_TEMPERATURE_TARGET: 0x0064100d (100 C)

Or manually now:

root@s18:/home/doug/prime95# rdmsr 0x1a2
64100d

>> Try changing energy_perf_bias and see if it helps here.

Yes, that is a test I meant to do, and should have done. No, it doesn't help.

> Also if MSR 0x1FC bit 19 is 0, change to 1.

Ya, I have always found documentation on 0x1FC somewhat lacking. Anyway, the bit is already 1. Is the bit EE_TURBO_DISABLE? Anyway, I tried that bit as 0, and it didn't help.

> Thanks,
> Srinivas

... Doug
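The rdmsr reading above decodes to the same 100 C that turbostat reports; bits 23:16 of MSR_IA32_TEMPERATURE_TARGET (0x1a2) hold the TCC activation temperature:

```shell
# Value read above with: rdmsr 0x1a2
msr=$(( 0x64100d ))

# Bits 23:16 = TCC activation temperature in degrees C.
tjmax=$(( (msr >> 16) & 0xFF ))
echo "TCC activation target: ${tjmax} C"
```

This prints `TCC activation target: 100 C` (0x64 = 100).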
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
Hi Srinivas,

Thank you for your quick reply.

On 2020.05.31 09:54 Srinivas Pandruvada wrote
> On Sun, 2020-05-31 at 09:39 -0700, Doug Smythies wrote:
>> Event begins at 17.456 seconds elapsed time.
>> Previous event was about 107 milliseconds ago.
>>
>> Old min ; new min ; freq GHz; load % ; duration mS
>>  27 ; 28 ; 4.60 ; 68.17 ; 10.226
>>  28 ; 26 ; 4.53 ; 57.47 ; 10.005
>
> Seems you hit power/thermal limit

No.

I am nowhere near any power limit at all. I have meticulously configured and tested the thermal management of this computer. I never ever hit a thermal limit and have TDP set such that the processor temperature never exceeds about 75 degrees centigrade.

There should never be throttling involved in these experiments. I can achieve throttling when compiling the kernel and with torture test mode on the mprime test (other CPU stressors, including my own, are not as good at generating heat as mprime).

This system can run indefinitely at 99.9 watts processor package power. Example (turbostat, steady state, CPU freq throttled to 4.04 GHz):

doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 12
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
100.21  4045    72231   66      99.93   0.00
100.21  4043    72239   65      99.92   0.00

> Is this some Lenovo system?

No. The web page version of my original e-mail has a link to the test computer hardware profile.

The motherboard is ASUS PRIME Z390-P.

> If you disable HWP you don't see that?

Correct.

> What is the value of
> cat /sys/bus/pci/devices/\:00\:04.0/tcc_offset_degree_celsius

? "No such file or directory"

> cat /sys/class/powercap/intel-rapl-mmio/intel-rapl-
> mmio:0/constraint_0_power_limit_uw

? "No such file or directory"

> You may want to run
> Try running dptfxtract once.

No, I am not going to. I am not running thermald. Eventually I will, as a backup in case of cooling failure, so as not to hit the processor limit shutdown. I just haven't done it yet.
> > Then try to get again > > cat /sys/bus/pci/devices/\:00\:04.0/tcc_offset_degree_celsius > cat /sys/class/powercap/intel-rapl-mmio/intel-rapl- > mmio:0/constraint_0_power_limit_uw > > > Thanks, > Srinivas
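Had the constraint_0_power_limit_uw file existed, its reading would be in microwatts; a quick conversion for comparison against turbostat's PkgWatt column (the reading below is made up):

```shell
# Hypothetical constraint_0_power_limit_uw reading, in microwatts.
uw=99000000

# Convert to watts with two decimal places for comparison with PkgWatt.
printf '%d.%02d W\n' $(( uw / 1000000 )) $(( (uw % 1000000) / 10000 ))
```

With the made-up value this prints `99.00 W`, close to the 99.9 W steady-state package power reported above.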
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
Correction:

On 2020.05.31 09:39 Doug Smythies wrote:
> The overruns and use of idle state 0 are exactly correlated.

Should have been "idle state 2":

The overruns and use of idle state 2 are exactly correlated.
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
e CPU loaded with new BIOS, and earlier tests NOT re-done). Is the processor using the latest microcode? Currently 0xcc. Can not figure out if there is anything newer. Leaving out the details, but all the tests and results are available, a mess but available, the summary is: With HWP enabled, if idle state 2 is used, there is a probability that the CPU frequency might unexpectedly drop significantly. If the processor does this by itself, or by being told to via sources outside of the intel_pstate CPU frequency driver, I don't know. The load sweep test was run at 6 seconds per step during increasing load and 3 seconds per step decreasing (by mistake, if you must know), while monitoring the idle statistics. The test was done in a hurry, so many above/below statistics are 0%, due to insufficient sample size. The overruns and use of idle state 0 are exactly correlated. There are a lot of graphs on the idle statistics web page, but the idle state 2 usage correlates exactly with undesired low CPU frequency and overruns. Side note: Even in the areas where HWP appears to behave, the no-hwp power use is much better. O.K., so now, do a couple of more turbostat samples: intel_pstate/powersave hwp idle state 2 disabled: Overruns: 3 Ave. work percent: 66.647895 Processor package power: ~16.8 watts. Average CPU frequency: 4.6 gigahertz intel_pstate/powersave hwp idle state 3 disabled: Overruns: 22 Ave. work percent: 66.647895 Processor package power: ~16.2 watts. Average CPU frequency: 4.6 gigahertz To prevent all the bots that burden my site, the link is coded: [1] double u double u double u dot smythies dot com /~doug/linux/s18/hwp/index.html ... Doug > -Original Message- > From: Rafael J. 
Wysocki [mailto:r...@rjwysocki.net] > Sent: May 26, 2020 11:21 AM > To: Linux PM; Doug Smythies > Cc: LKML; Len Brown; Srinivas Pandruvada; Peter Zijlstra; Giovanni > Gherdovich; Francisco Jerez > Subject: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP > enabled > > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit to 75% of the P-state > value corresponding to the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other at least when the schedutil governor > is in use. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a replacement for https://patchwork.kernel.org/patch/11563615/ that > uses the HWP floor (minimum performance limit) as the feedback to the HWP > algorithm (instead of the EPP). > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP is still 5000 and the previous comments > still apply to it. > > In addition to that, the 75% fraction used in intel_cpufreq_adjust_hwp() can > be > adjusted too, but I would like to use a value with a power-of-2 denominator > for > that (so the next candidate would be 7/8). > > Everyone who can do that is kindly requested to test this and let me know > the outcome.
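Disabling an individual idle state, as in the idle state 2 / idle state 3 comparisons earlier in this message, is done per CPU through sysfs; a sketch (root required, and harmless where the state nodes are absent or not writable):

```shell
# Request that idle state 2 be disabled on every CPU.
for f in /sys/devices/system/cpu/cpu*/cpuidle/state2/disable; do
    [ -w "$f" ] && echo 1 > "$f"
done
echo "state2 disable requested"
```

Writing 0 back to the same files re-enables the state.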
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.26 01:19 Rafael J. Wysocki wrote:
> On Mon, May 25, 2020 at 10:57 PM Francisco Jerez wrote:
> > "Rafael J. Wysocki" writes:
> > > On Mon, May 25, 2020 at 3:39 AM Francisco Jerez
> >
> > Why not HWP_MIN_PERF? That would leave the HWP quite some room for
> > maneuvering (the whole HWP_MIN_PERF-HWP_MAX_PERF P-state range, it's not
> > like P-state selection would be entirely driven by the kernel), while
> > avoiding every failure scenario I was describing in my previous reply.

I have re-done my tests. The problem that I observe seems specific to hwp itself and not this patch and its use in passive mode. I see the exact same thing with intel_pstate/powersave. [1] detail A.

Test: still simple single threaded load sweep, at 347 hertz work/sleep frequency.

What do I see? Unexpected frequency drops at around 70% load. Example, from trace. First, the thing has been going for awhile at 4.8 GHz:

Old epp ; new epp ; freq GHz ; load % ; duration mS
 80 ; 82 ; 4.57   ; 61.94  ; 20.001
 82 ; 80 ; 4.57   ; 62.47  ; 40.003
 80 ; 44 ; 3.73   ; 68.63  ; 62.009  What? Why freq down? Why long duration?
 44 ;  0 ; 1.96   ; 100.23 ; 19.995  Even lower freq. load overruns.
  0 ; 73 ; 4.56   ; 82.93  ; 40.07   O.K. recovered, but damage done.
 73 ; 46 ; 2.36   ; 79.19  ; 20.94   <<< now things oscillate a little.
 46 ;  0 ; 1.9884 ; 100.24 ; 20.99
  0 ; 75 ; 4.5624 ; 82.1   ; 41.002  <<< Event ends.

Next event in 487 milliseconds.

Observation: Events are often, but not always, preceded by a longer than normal duration. However, long durations are also not out of the ordinary in passive mode. And yes, the above trace was with DELAY_HWP 20,000, but I do have trace examples with it at 5,000. This was just a particularly good example.

Observation (from looking at a lot of trace data): There are phase delays between the two systems, intel_cpufreq and hwp, and sometimes they seem to oscillate a little and fight each other. There may be some problematic work/sleep frequencies where the oscillation builds into a full blown resonance.

Why does hwp drop the frequency? This system is otherwise fairly idle, so maybe because the pll drops down during the non work periods. Maybe HWP thinks the system is idle and drops the frequency. I can eliminate the overruns by disabling deep idle states such that the PLL vote is never relinquished, but it's not a fair test.

Note that the above response can be "tuned". If we take the conversion algorithm from target frequency to EPP and introduce an offset, the above can be improved. At what cost? More sluggishness, for a large positive offset. So, the overruns just move from the steady state side of the task to when the task starts. I did not find if there is a "sweet spot" between offset and system response, and I do not think there is value added in trying.

Note: With original settings, I rarely observe a problem with the step function response to a new task.

> Actually, I have been thinking about the HWP min as an alternative
> that may be worth evaluating.
>
> However, I would rather set the HWP min to something like 80% of the
> cpufreq request.

Yes, this is a good idea and should not suffer from the two servo systems fighting each other. I got 0 overruns, versus 2240 overruns with no min limitation (100 second test).

As for INTEL_CPUFREQ_TRANSITION_DELAY_HWP, I'll probably use 10 milliseconds moving forward, because that is what I am most familiar with from years ago work on this driver. But, I did not observe any issue with 5 milliseconds.

[1] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-but-active-powersave.png

Other replaced graphs:
http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-ondemand.png
http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-schedutil.png

... Doug
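The "80% of the cpufreq request" floor discussed above, sketched with a made-up P-state value:

```shell
# Hypothetical requested P-state from the cpufreq governor.
request=40

# 80% floor (4/5), integer math as the driver would use.
hwp_min=$(( request * 4 / 5 ))
echo "request ${request} -> HWP floor ${hwp_min}"
```

This prints `request 40 -> HWP floor 32`: the HWP algorithm can still roam between the floor and HWP max, but cannot drop far below the governor's request.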
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
Hi all,

The INTEL_CPUFREQ_TRANSITION_DELAY_HWP = 20000 test results from this e-mail were incorrect. The test and graphs are being re-done.

On 2020.05.25 08:30 Doug Smythies wrote:
>
> Legend - intel_pstate - powersave graph [2].
>
> What? Why is there such a graph, unrelated to this patch?
> Well, because there is a not yet understood effect.
>
> p_powe_stock : intel_pstate, powersave, stock kernel (5.7-rc6), hwp disabled.
> p_powe_hwp : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 5000.
> p_powe_hwp2 : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 20000.
>
> Conclusion: ??
>
> Note: That I merely made a stupid mistake is a real possibility.

Yes, that was it. However, all DELAY_HWP 20000 tests were bad, not just this one.

... Doug
RE: [RFC 0/1] Alternate history mechanism for the TEO governor
On 2020.05.21 04:09 Pratik Sampat wrote: > On 17/05/20 11:41 pm, Doug Smythies wrote: > > On 2020.05.11 Pratik Rajesh Sampat wrote: > >> First RFC posting:https://lkml.org/lkml/2020/2/22/27 > > Summary: > > > > On that thread I wrote: > > > >> I have done a couple of other tests with this patch set, > >> but nothing to report yet, as the differences have been > >> minor so far. > > > > I tried your tests, or as close as I could find, and still > > do not notice much difference. > > That is quite unfortunate. At least it doesn't seem to regress. Yes, while I have not been able to demonstrate improvement, I have not found any regression. > > Nevertheless, as Rafael suggested aging is crucial, this patch doesn't age > weights. I do have a version with aging but I had a lot of run to run variance > so I had refrained from posting that. > I'm tweaking around the logic for aging as well as distribution of weights, > hopefully that may help. O.K. I am putting this testing aside for now. I like the set of tests, as they really show the differences between menu and teo governors well. > >> > >> Sleeping Ebizzy > >> --- > >> Program to generate workloads resembling web server workloads. > >> The benchmark is customized to allow for a sleep interval -i > > I found a Phoronix ebizzy, but without the customization, > > which I suspect is important to demonstrate your potential > > improvement. > > > > Could you send me yours to try? > > Sure thing, sleeping ebizzy is hosted here: > https://github.com/pratiksampat/sleeping-ebizzy > > > > > ebizzy (records per second, more is better) > > > > teo wtteo menu > > 132344 132228 99.91% 130926 98.93% O.K. yours is way different than what I was using. Anyway, results still are not very different between teo and wtteo. 
Some tests are showing a little difference between above/below statistics [1]

[1] http://www.smythies.com/~doug/linux/idle/wtteo/ebizzy-interval/2_below.png

By the way, and likely not relevant, your sleeping-ebizzy test seems extremely sensitive to the interval and number of threads. It is not clear to me what settings I should use to try to re-create your results. [2] is an interesting graph of records per second versus intervals versus threads.

[2] http://www.smythies.com/~doug/linux/idle/wtteo/doug08/sleeping-ebizzy-records-intervals-threads.png
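A sweep over intervals and thread counts, of the kind behind the records-versus-intervals-versus-threads graph above, could be driven by a small harness like this; the ebizzy option letters here are assumptions, not verified against the sleeping-ebizzy fork:

```shell
# Hypothetical sweep harness; prints the commands it would run rather
# than running them, since the flag names are assumed.
for t in 2 4 8; do            # thread counts
    for i in 100 200 400; do  # sleep intervals
        echo "would run: ebizzy -t ${t} -i ${i} -S 10"
    done
done
```

Nine combinations are printed, one per line.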
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.21 10:16 Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it translate the target frequency supplied by the cpufreq > governor in use into an EPP value to be written to the HWP request > MSR (high frequencies are mapped to low EPP values that mean more > performance-oriented HWP operation) as a hint for the HWP algorithm > in the processor, so as to prevent it and the CPU scheduler from > working against each other at least when the schedutil governor is > in use. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a prototype not intended for production use (based on linux-next). > > Please test it if you can (on HWP systems, of course) and let me know the > results. > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > may turn out to be either too high or too low for the general use, which is > one > reason why getting as much testing coverage as possible is key here. > > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > please do so and let me know the conclusions. Hi all, It is way too early for me to reply to this, as my tests are incomplete. However, I saw the other e-mails and will chime in now. Note before: I have been having difficulties with this test computer, results may be suspect. The default INTEL_CPUFREQ_TRANSITION_DELAY_HWP is definitely too short. I do not know what an optimal value is, but I have made it the same as INTEL_CPUFREQ_TRANSITION_LATENCY for now, and because it makes for great test comparisons. I have only done the simplest of tests so far, single thread, forced CPU affinity load ramp up/down. Load: fixed work packet per work period, i.e. the amount of work to do is independent of CPU frequency. 347 hertz work / sleep frequency. Why? No reason. Because it is what I used last time. To reveal any hysteresis (i.e. 
with conservative governor) load ramps up from none to 100% and then back down to none at 3 seconds per step (step size is uncalibrated). Processor: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz Modes: intel_cpufreq (i_) intel_pstate (p_) acpi-cpufreq (a_) Kernel: b955f2f4a425 (HEAD -> k57-hwp) cpufreq: intel_pstate: Work in passive mode with HWP enabled faf93d5c9da1 cpufreq: intel_pstate: Use passive mode by default without HWP b9bbe6ed63b2 (tag: v5.7-rc6) Linux 5.7-rc6 Graph notes: The ramp up and ramp down are folded back on the graphs. These tests were launched manually in two terminals, so the timing was not exact. I'll fix this moving forward. Legend - ondemand graph [1] (minor annotation added): i_onde_stock : intel-cpufreq, ondemand, stock kernel (5.7-rc6), hwp disabled. i_onde_hwp : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 5000. i_onde_hwp2 : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 2. a_onde: acpi-cpufreq, stock kernel (5.7-rc6), intel_pstate disabled. Conclusion: DELAY_HWP 2 matches the pre-patch response almost exactly. DELAY_HWP 5000 goes too far (my opinion), but does look promising. Legend - intel_pstate - powersave graph [2]. What? Why is there such a graph, unrelated to this patch? Well, because there is a not yet understood effect. p_powe_stock : intel_pstate, powersave, stock kernel (5.7-rc6), hwp disabled. P_powe_hwp : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 5000. P_powe_hwp2 : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 2. Conclusion: ?? Note: That I merely made a stupid mistake is a real possibility. Repeat test pending. Legend - schedutil graph [3]: i_sche_stock : intel-cpufreq, ondemand, stock kernel (5.7-rc6), hwp disabled. i_sche_hwp : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 5000. i_sche_hwp2 : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 2. a_sche: acpi-cpufreq, stock kernel (5.7-rc6), intel_pstate disabled. Conclusion: ?? 
I don't know; I have always had schedutil issues on this computer. I expected something more like the typical result from my older test server (i7-2600K) [4]:

[1] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-ondemand.png
[2] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-but-active-powersave.png
[3] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-schedutil.png
[4] http://www.smythies.com/~doug/linux/intel_pstate/cpu-freq-scalers/fixed-work-packet-folded-schedutil.png

>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000

For now, while trying to prove nothing is busted, suggest 20000.

> +/**
> + * intel_cpufreq_adjust_hwp_request - Adjust the HWP reuqest register.

request.

> +static void intel_cpufreq_adjust_hwp_request(struct cpudata *cpu, s64 max_freq,
> +					     unsigned int target_freq)
> +{
> +	s64 epp_fp = div_fp(255 * (max_freq - target_freq),
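The div_fp() expression quoted above maps the target frequency linearly onto the 0-255 EPP range (0 = most performance-oriented); the same mapping in plain integer arithmetic, with made-up frequencies:

```shell
# Hypothetical maximum frequency, kHz.
max_freq=4800000

for target in 4800000 3600000 1200000; do
    # High target -> small EPP (performance); low target -> large EPP.
    epp=$(( 255 * (max_freq - target) / max_freq ))
    echo "target ${target} kHz -> EPP ${epp}"
done
```

With these values: a full-speed request maps to EPP 0, 3.6 GHz to EPP 63, and 1.2 GHz to EPP 191.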
RE: [RFC 0/1] Alternate history mechanism for the TEO governor
On 2020.05.11 Pratik Rajesh Sampat wrote:
>
> First RFC posting: https://lkml.org/lkml/2020/2/22/27

Summary:

On that thread I wrote:
> I have done a couple of other tests with this patch set,
> but nothing to report yet, as the differences have been
> minor so far.

I tried your tests, or as close as I could find, and still do not notice much difference.

For detail, but likely little added value, read on:

Kernel: 5.7-rc4:
"teo": unmodified kernel.
"wtteo": with this patch added.
"menu": the menu idle governor, for comparison.

CPU frequency scaling driver: intel-cpufreq
CPU frequency scaling governor: schedutil
CPU idle driver: intel_idle

...

> Benchmarks:
> Schbench
>
> Benchmarks scheduler wakeup latencies
>
> 1. latency 99th percentile - usec

I found a Phoronix schbench test. It defaults to 99.9th percentile.

schbench (usec, 99.9th Latency Percentile, less is better)(8 workers)

threads     teo    wtteo             menu
  2       14197    14194  99.98%    14467 101.90%
  4       46733    46490  99.48%    46554  99.62%
  6       57306    58291 101.72%    57754 100.78%
  8       81408    80768  99.21%    81715 100.38%
 16      157286   156570  99.54%   156621  99.58%
 32      314573   310784  98.80%   315802 100.39%

Powers and other idle statistics were similar. [1]

> 2. Power - watts
> Machine - IBM Power 9
>
> Latency and Power - Normalized
> | Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
> |       2 |          100 |           101.3 |         85.29 |
> |       4 |          100 |          105.06 |        113.63 |
> |       8 |          100 |           92.32 |         90.36 |
> |      16 |          100 |            99.1 |         92.43 |
>
> Accuracy
>
> Vanilla TEO Governor - Prediction distribution %
> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
> |       2 | 6.12 | 1.08 | 1.76 | 20.41 | 9.2   | 28.74 | 22.51   |
> |       4 | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67   |
> |       8 | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95   |
> |      16 | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34   |
>
> | Threads | OS 1 | OS 2 | OS 3 |
> |       2 | 1.77 | 1.27 | 7.14 |
> |       4 | 1.8  | 1.31 | 6.71 |
> |       8 | 0.65 | 0.72 | 3.2  |
> |      16 | 0.63 | 1.71 | 3.68 |
>
> Weighted TEO Governor - Prediction distribution %
> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
> |       2 | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13   |
> |       4 | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01   |
> |       8 | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9    |
> |      16 | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92   |
>
> | Threads | OS 1 | OS 2 | OS 3 |
> |       2 | 0.39 | 0.42 | 2.31 |
> |       4 | 0.45 | 0.51 | 2.89 |
> |       8 | 0.53 | 0.66 | 3.06 |
> |      16 | 0.97 | 1.9  | 3.05 |
>
> Sleeping Ebizzy
> ---------------
> Program to generate workloads resembling web server workloads.
> The benchmark is customized to allow for a sleep interval -i I found a Phoronix ebizzy, but without the customization, which I suspect is important to demonstrate your potential improvement. Could you
RE: [PATCH 0/4] cpuidle: teo: Fix issues related to disabled idle states
On 2019.10.10 Rafael J. Wysocki wrote: > There are a few issues related to the handling of disabled idle states in the > TEO (Timer-Events-Oriented) cpuidle governor which are addressed by this > series. > > The application of the entire series is exactly equivalent to the testing > patch > at https://lore.kernel.org/lkml/3490479.2dnHFFeJIp@kreacher/ , but IMO it is > cleaner to split the changes into smaller patches which also allows them to > be explained more accurately. Hi, I have re-tested and continued testing using this 4 patch set. Summary: So far, everything is fine. Some, but not all, detail: Reference kernel: 5.4-rc2 "stock" Test kernel: 5.4-rc2 + this 4 patch set "rjw-4" Test 1: Where I simply coded, to automate, my best use case example: This is an idle state 4 disabled test. The max entries column is entries and exits for idle state 0 over the test sample interval (15 seconds in this case). stock kernel: idle-doug01 : begin ... Per CPU PASS/FAIL : Totals: fail rate: max entries: 0 1 2 3 4 5 6 7 : : (percent):: . . F . . . F . : 8 : 25. : 69939 . . F . . . F . : 16 : 25. : 69938 . . F . . . F . : 24 : 25. : 70126 ... . . . . . . F F : 1928 : 30.8610 : 66530 . . . . . . F F : 1936 : 30.8368 : 68451 . . . . . . F F : 1944 : 30.8128 : 68643 . . . . . . F F : 1952 : 30.7889 : 68645 . . . . . . F F : 1960 : 30.7653 : 68152 . . . . . . F F : 1968 : 30.7419 : 67145 . . . . . . F F : 1976 : 30.7186 : 67349 . . . . . . F F : 1984 : 30.6956 : 68481 . . . . . . F F : 1992 : 30.6727 : 67394 . . . . . . F F : 2000 : 30.6500 : 68645 ^C --- SIGINT (^C) detected. Terminate gracefully, saving the sample data... . . . . . . F F : 2008 : 30.6275 : 29010 Summary: Total Tests: 2008 : Total Fails:615 : Fail rate (percent): 30.6275 : Per CPU: CPU00: 0, CPU01: 0, CPU02: 12, CPU03: 0, CPU04:152, CPU05: 43, CPU06:197, CPU07:211, idle-doug01 : end ... rjw-4 kernel: idle-doug01 : begin ... Per CPU PASS/FAIL : Totals: fail rate: max entries: 0 1 2 3 4 5 6 7 : : (percent):: . . . . 
. . . . : 8 : 0. : 3 . . . . . . . . : 16 : 0. : 7 . . . . . . . . : 24 : 0. : 10 . . . . . . . . : 32 : 0. : 10 . . . . . . . . : 40 : 0. : 23 . . . . . . . . : 48 : 0. : 23 . . . . . . . . : 56 : 0. : 1 . . . . . . . . : 64 : 0. : 1 . . . . . . . . : 72 : 0. : 0 . . . . . . . . : 80 : 0. : 0 . . . . . . . . : 88 : 0. : 1 . . . . . . . . : 96 : 0. : 1 . . . . . . . . :104 : 0. : 0 . . . . . . . . :112 : 0. : 0 . . . . . . . . :120 : 0. : 1 . . . . . . . . :128 : 0. : 2 . . . . . . . . :136 : 0. : 2 . . . . . . . . :144 : 0. : 3 . . . . . . . . :152 : 0. : 3 . . . . . . . . :160 : 0. : 4 . . . . . . . . :168 : 0. : 4 . . . . . . . . :176 : 0. : 6 . . . . . . . . :184 : 0. : 3 . . . . . . . . :192 : 0. : 4 ... . . . . . . . . : 11480 : 0. : 5 . . . . . . . . : 11488 : 0. : 4 . . . . . . . . : 11496 : 0. : 6 . . . . . . . . : 11504 : 0. : 8 . . . . . . . . : 11512 : 0. : 5 . . . . . . . . : 11520 : 0. : 4 . . . . . . . . : 11528 : 0. : 4 . . . . . . . . : 11536 : 0. : 3 . . . . . . . . : 11544 : 0. : 2 . . . . . . . . : 11552 : 0. : 5 . . . . . . . . : 11560 : 0. : 3 . . . . . . . . : 11568 : 0. : 3 ^C --- SIGINT (^C) detected. Terminate gracefully, saving the sample data... . . . . . . . . : 11576 : 0. : 1 Summary: Total Tests: 11576 : Total Fails: 0 : Fail rate (percent): 0. : Per CPU: CPU00: 0, CPU01: 0, CPU02: 0, CPU03: 0, CPU04: 0, CPU05: 0, CPU06: 0, CPU07: 0, idle-doug01 : end ... Test 2: Have a look at all idle state enabled/disabled combinations. This test is not very good, and only looks at processor package power. Particularly for idle state 1, see Note 1 below. But it's better than nothing. stock kernel: idle-disable-enable : begin ... Idle State: Per test PASS/FAIL : Power (Watts): 4 3 2 1 0 : : Expected : Max diff : 0 0 0 0 0 : PASS PASS PASS PASS PASS PASS PASS PASS : 3.7 :0.2 0 0 0 0 1 : PASS PASS PASS PASS PASS PASS PASS PASS : 3.7 :0.2 0 0 0 1 0 : PASS PASS PASS PASS PASS PASS PASS PASS : 3.7 :
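Enumerating every enable/disable combination for idle states 0-4, as in the combination test above, is a 32-case bitmask walk; the columns below come out in the same "4 3 2 1 0" order as the test's header:

```shell
# Print each of the 32 enable/disable combinations for idle states 0-4
# (1 = disabled); only the first three rows are shown here.
for mask in $(seq 0 31); do
    for s in 4 3 2 1 0; do
        printf '%d ' $(( (mask >> s) & 1 ))
    done
    printf '\n'
done | head -3
```

The first rows are `0 0 0 0 0`, `0 0 0 0 1`, `0 0 0 1 0`, matching the case listing in the output above.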
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.10.09 06:37 Rafael J. Wysocki wrote: > On Wednesday, October 9, 2019 1:19:51 AM CEST Rafael J. Wysocki wrote: >> On Tuesday, October 8, 2019 12:49:01 PM CEST Rafael J. Wysocki wrote: >>> On Tue, Oct 8, 2019 at 11:51 AM Rafael J. Wysocki wrote: >>>> On Tue, Oct 8, 2019 at 8:20 AM Doug Smythies wrote: >>>>> O.K. Thanks for your quick reply, and insight. >>>>> >>>>> I think long durations always need to be counted, but currently if >>>>> the deepest idle state is disabled, they are not. ... >>>> AFAICS, adding early_hits to count is not a mistake if there are still >>>> enabled states deeper than the current one. >>> >>> And the mistake appears to be that the "hits" and "misses" metrics >>> aren't handled in analogy with the "early_hits" one when the current >>> state is disabled. I only know how to exploit and test the "hits" and "misses" path that should use the deepest available idle state upon transition to an idle system. Even so, the test has a low probability of failing, and so needs to be run many times. I do not know how to demonstrate and/or test any "early_hits" path to confirm that an issue exists or that it is fixed. >>> >>> Let me try to cut a patch to address that. >> >> Appended below, not tested. Reference as: rjw1 >> >> It is meant to address two problems, one of which is that the "hits" and >> "misses" metrics of disabled states need to be taken into account too in >> some cases, and the other is an issue with the handling of "early hits" >> which may lead to suboptimal state selection if some states are disabled. > > Well, it still misses a couple of points. > > First, disable states that are too deep should not be taken into consideration > at all. > > Second, the "hits" and "misses" metrics of disabled states need to be used for > idle duration ranges corresponding to them regardless of whether or not the > "hits" value is greater than the "misses" one. 
>
> Updated patch is below (still not tested), but it tries to do too much in one
> go, so I need to split it into a series of smaller changes.

Thanks for your continued look at this.

Reference as: rjw2

Test 1, hack job statistical test (old tests re-stated):

Kernel        tests   fail rate
5.4-rc1        6616   13.45%
5.3            2376    4.50%
5.3-teov7     12136    0.00%  <<< teo.c reverted and teov7 put in its place.
5.4-rc1-ds    11168    0.00%  <<< [old] ds proposed patch (> 7 hours test time)
5.4-rc1-ds12   4224    0.00%  <<< [old] new ds proposed patch
5.4-rc2-rjw1  11280    0.00%
5.4-rc2-rjw2    640    0.00%  <<< Will be run again, for longer.

Test 2: I also looked at every possible enable/disable idle combination, and they all seemed O.K.

No other tests have been run yet.

System:
Processor: i7-2600K
Deepest idle state: 4 (C6)

... Doug
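The fail-rate column in the tables above is just fails/tests; for example, a hypothetical 890 failures out of the 6616-test 5.4-rc1 run reproduces the reported 13.45% (the fail count is back-computed from the rate, not taken from the logs):

```shell
tests=6616
fails=890   # hypothetical, back-computed from the reported 13.45%
awk -v f="$fails" -v t="$tests" 'BEGIN { printf "%.2f%%\n", f * 100 / t }'
```

This prints `13.45%`.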
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.10.06 08:34 Rafael J. Wysocki wrote: > On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies wrote: >> On 2019.10.01 02:32 Rafael J. Wysocki wrote: >>> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies wrote: >>>> On 2019.09.26 09:32 Doug Smythies wrote: >>>> >>>>> If the deepest idle state is disabled, the system >>>>> can become somewhat unstable, with anywhere between no problem >>>>> at all, to the occasional temporary jump using a lot more >>>>> power for a few seconds, to a permanent jump using a lot more >>>>> power continuously. I have been unable to isolate the exact >>>>> test load conditions under which this will occur. However, >>>>> temporarily disabling and then enabling other idle states >>>>> seems to make for a somewhat repeatable test. It is important >>>>> to note that the issue occurs with only ever disabling the deepest >>>>> idle state, just not reliably. >>>>> >>>>> I want to know how you want to proceed before I do a bunch of >>>>> regression testing. >>>> >> I do not think I stated it clearly before: The problem here is that some CPUs >> seem to get stuck in idle state 0, and when they do power consumption spikes, >> often by several hundred % and often indefinitely. > > That indeed has not been clear to me, thanks for the clarification! > >> I made a hack job automated test: >> Kernel tests fail rate >> 5.4-rc16616 13.45% >> 5.3 23764.50% >> 5.3-teov7 121360.00% <<< teo.c reverted and teov7 put in >> its place. >> 5.4-rc1-ds 111680.00% <<< [old] proposed patch (> 7 hours test >> time) 5.4-rc1-ds12 4224 0.005 <<< new proposed patch >> >> [old] Proposed patch (on top of kernel 5.4-rc1): [deleted] > This change may cause the deepest state to be selected even if its > "hits" metric is less than the "misses" one AFAICS, in which case the > max_early_index state should be selected instead. > > It looks like the max_early_index computation is broken when the > deepest state is disabled. O.K. Thanks for your quick reply, and insight. 
I think long durations always need to be counted, but currently if the deepest
idle state is disabled, they are not. How about this? (test results added
above; more tests pending if this might be a path forward.)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index b5a0e49..a970d2c 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -155,10 +155,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
 
-		if (drv->states[i].target_residency <= sleep_length_us) {
-			idx_timer = i;
-			if (drv->states[i].target_residency <= measured_us)
-				idx_hit = i;
+		if (!(drv->states[i].disabled || dev->states_usage[i].disable)) {
+			if (drv->states[i].target_residency <= sleep_length_us) {
+				idx_timer = i;
+				if (drv->states[i].target_residency <= measured_us)
+					idx_hit = i;
+			}
 		}
 	}
 
@@ -256,39 +258,25 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		struct cpuidle_state *s = &drv->states[i];
 		struct cpuidle_state_usage *su = &dev->states_usage[i];
 
-		if (s->disabled || su->disable) {
-			/*
-			 * If the "early hits" metric of a disabled state is
-			 * greater than the current maximum, it should be taken
-			 * into account, because it would be a mistake to select
-			 * a deeper state with lower "early hits" metric. The
-			 * index cannot be changed to point to it, however, so
-			 * just increase the max count alone and let the index
-			 * still point to a shallower idle state.
-			 */
-			if (max_early_idx >= 0 &&
-			    count < cpu_data->states[i].early_hits)
-				count = cpu_data->states[i].early_hits;
-
-			continue;
-		}
-
 		if (idx < 0)
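The intent of the teo_update() hunk above can be sketched as a toy model in
Python (the function name, the dict layout, and the residency values are
invented for illustration; this is not the kernel code): with the change, a
disabled state can no longer claim the "timer"/"hit" indices, so a long sleep
is attributed to the deepest *enabled* state.

```python
def update_indices(states, sleep_length_us, measured_us):
    """Toy model of the patched loop: find the deepest state whose target
    residency matches the sleep length (idx_timer) and the measured idle
    duration (idx_hit), skipping disabled states entirely."""
    idx_timer = idx_hit = -1
    for i, s in enumerate(states):
        if s["disabled"]:
            continue  # the proposed change: disabled states no longer match
        if s["target_residency"] <= sleep_length_us:
            idx_timer = i
            if s["target_residency"] <= measured_us:
                idx_hit = i
    return idx_timer, idx_hit

# Hypothetical 4-state table with the deepest state disabled:
states = [
    {"target_residency": 0,    "disabled": False},
    {"target_residency": 1,    "disabled": False},
    {"target_residency": 360,  "disabled": False},
    {"target_residency": 3102, "disabled": True},
]
# A long sleep is now attributed to state 2, not the disabled state 3:
print(update_indices(states, sleep_length_us=5000, measured_us=5000))  # (2, 2)
```

Without the `continue`, the same call would return (3, 3), crediting the
disabled state and skewing the "hits"/"misses" bookkeeping.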
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies wrote:
>> On 2019.09.26 09:32 Doug Smythies wrote:
>>
>>> If the deepest idle state is disabled, the system
>>> can become somewhat unstable, with anywhere between no problem
>>> at all, to the occasional temporary jump using a lot more
>>> power for a few seconds, to a permanent jump using a lot more
>>> power continuously. I have been unable to isolate the exact
>>> test load conditions under which this will occur. However,
>>> temporarily disabling and then enabling other idle states
>>> seems to make for a somewhat repeatable test. It is important
>>> to note that the issue occurs with only ever disabling the deepest
>>> idle state, just not reliably.
>>>
>>> I want to know how you want to proceed before I do a bunch of
>>> regression testing.
>>
>> I did some regression testing anyhow, more to create and debug
>> a methodology than anything else.
>>
>>> On 2018.12.11 03:50 Rafael J. Wysocki wrote:
>>>
>>>> v7 -> v8:
>>>> * Apply the selection rules to the idle deepest state as well as to
>>>>   the shallower ones (the deepest idle state was treated differently
>>>>   before by mistake).
>>>> * Subtract 1/2 of the exit latency from the measured idle duration
>>>>   in teo_update() (instead of subtracting the entire exit latency).
>>>>   This makes the idle state selection be slightly more performance-
>>>>   oriented.
>>>
>>> I have isolated the issue to a subset of the v7 to v8 changes, however
>>> it was not the exit latency changes.
>>>
>>> The partial revert to V7 changes I made were (on top of 5.3):
>>
>> The further testing showed a problem or two with my partial teo-v7 reversion
>> (I call it teo-v12) under slightly different testing conditions.

Correction: There was no problem with my partial reversion kernel (a.k.a.
teo-v12). The problem was confusion over which kernel I was actually running
for whatever test.
>>
>> I also have a 5.3 based kernel with the current teo reverted and the entire
>> teo-v7 put in its place. I have yet to find an idle state disabled related
>> issue with this kernel.
>>
>> I'll come back to this thread at a later date with better details and test
>> results.
>
> Thanks for this work!
>
> Please also note that there is a teo patch in 5.4-rc1 that may make a
> difference in principle.

Yes, actually this saga started from somewhere between kernel 5.3 and 5.4-rc1,
and did include those teo patches, which actually significantly increase the
probability of the issue occurring.

When the deepest idle state is disabled, and the all-states search loop exits
normally, it might incorrectly re-select a previous idle state that was
earlier deemed not worthy of the check. This was introduced between teo
development versions 7 and 8. The fix is to move the code back inside the
loop. (I'll submit a patch in a day or two.)

I do not think I stated it clearly before: The problem here is that some CPUs
seem to get stuck in idle state 0, and when they do power consumption spikes,
often by several hundred % and often indefinitely.

I made a hack job automated test:

Kernel      tests  fail rate
5.4-rc1     6616   13.45%
5.3         2376   4.50%
5.3-teov7   12136  0.00%   <<< teo.c reverted and teov7 put in its place.
5.4-rc1-ds  11168  0.00%   <<< proposed patch (> 7 hours test time)

Proposed patch (on top of kernel 5.4-rc1):

doug@s15:~/temp-k-git/linux$ git diff
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index b5a0e49..0502aa9 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -276,8 +276,22 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		if (idx < 0)
 			idx = i; /* first enabled state */
 
-		if (s->target_residency > duration_us)
+		if (s->target_residency > duration_us) {
+			/*
+			 * If the "hits" metric of the idle state matching the sleep length is
+			 * greater than its "misses" metric, that is the one to use. Otherwise,
+			 * it is more likely that one of the shallower states will match the
+			 * idle duration observed after wakeup, so take the one with the maximum
+			 * "early hits" metric, but if that cannot be determined, just use the
+			 * state se
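The "stuck in idle state 0" symptom described above can be checked for by
sampling the per-state usage counters twice (on a live system these come from
/sys/devices/system/cpu/cpuN/cpuidle/stateM/usage) and seeing whether all new
idle entries landed in state 0. A minimal sketch; the helper name and the
snapshot representation are invented:

```python
def stuck_in_state0(before, after):
    """Given two {state_index: usage_count} snapshots for one CPU, report
    whether essentially every new idle entry landed in state 0."""
    deltas = {s: after[s] - before[s] for s in before}
    total = sum(deltas.values())
    return total > 0 and deltas.get(0, 0) == total

# Hypothetical snapshots taken a few seconds apart; only state 0 advanced:
before = {0: 100, 1: 50, 2: 10, 3: 5}
after  = {0: 9100, 1: 50, 2: 10, 3: 5}
print(stuck_in_state0(before, after))  # True
```

A hack-job automated test like the one in the table above can loop over such
snapshots per CPU and count a failure whenever this predicate holds for more
than a few consecutive samples.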
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.09.26 09:32 Doug Smythies wrote:

> If the deepest idle state is disabled, the system
> can become somewhat unstable, with anywhere between no problem
> at all, to the occasional temporary jump using a lot more
> power for a few seconds, to a permanent jump using a lot more
> power continuously. I have been unable to isolate the exact
> test load conditions under which this will occur. However,
> temporarily disabling and then enabling other idle states
> seems to make for a somewhat repeatable test. It is important
> to note that the issue occurs with only ever disabling the deepest
> idle state, just not reliably.
>
> I want to know how you want to proceed before I do a bunch of
> regression testing.

I did some regression testing anyhow, more to create and debug a methodology
than anything else.

> On 2018.12.11 03:50 Rafael J. Wysocki wrote:
>
>> v7 -> v8:
>> * Apply the selection rules to the idle deepest state as well as to
>>   the shallower ones (the deepest idle state was treated differently
>>   before by mistake).
>> * Subtract 1/2 of the exit latency from the measured idle duration
>>   in teo_update() (instead of subtracting the entire exit latency).
>>   This makes the idle state selection be slightly more performance-
>>   oriented.
>
> I have isolated the issue to a subset of the v7 to v8 changes, however
> it was not the exit latency changes.
>
> The partial revert to V7 changes I made were (on top of 5.3):

The further testing showed a problem or two with my partial teo-v7 reversion
(I call it teo-v12) under slightly different testing conditions.

I also have a 5.3 based kernel with the current teo reverted and the entire
teo-v7 put in its place. I have yet to find an idle state disabled related
issue with this kernel.

I'll come back to this thread at a later date with better details and test
results.

... Doug
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
All,

Recall the extensive tests and work done between 2018.10.11 and 2019.01.16 on
Rafael's teo governor, versions 1 through 11, noting that versions 9, 10, and
11 did not contain functional changes.

Hi Rafael,

If the deepest idle state is disabled, the system can become somewhat
unstable, ranging from no problem at all, to an occasional temporary jump
using a lot more power for a few seconds, to a permanent jump using a lot more
power continuously. I have been unable to isolate the exact test load
conditions under which this will occur. However, temporarily disabling and
then enabling other idle states seems to make for a somewhat repeatable test.
It is important to note that the issue occurs with only ever disabling the
deepest idle state, just not reliably.

I want to know how you want to proceed before I do a bunch of regression
testing.

On 2018.12.11 03:50 Rafael J. Wysocki wrote:

> v7 -> v8:
> * Apply the selection rules to the idle deepest state as well as to
>   the shallower ones (the deepest idle state was treated differently
>   before by mistake).
> * Subtract 1/2 of the exit latency from the measured idle duration
>   in teo_update() (instead of subtracting the entire exit latency).
>   This makes the idle state selection be slightly more performance-
>   oriented.

I have isolated the issue to a subset of the v7 to v8 changes, however it was
not the exit latency changes.
The partial revert to V7 changes I made were (on top of 5.3):

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 7d05efd..d2892bb 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -283,9 +283,28 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		if (idx < 0)
 			idx = i; /* first enabled state */
 
-		if (s->target_residency > duration_us)
-			break;
+		if (s->target_residency > duration_us) {
+			/*
+			 * If the "hits" metric of the state matching the sleep
+			 * length is greater than its "misses" metric, that is
+			 * the one to use.
+			 */
+			if (cpu_data->states[idx].hits > cpu_data->states[idx].misses)
+				break;
+			/*
+			 * It is more likely that one of the shallower states
+			 * will match the idle duration measured after wakeup,
+			 * so take the one with the maximum "early hits" metric,
+			 * but if that cannot be determined, just use the state
+			 * selected so far.
+			 */
+			if (max_early_idx >= 0) {
+				idx = max_early_idx;
+				duration_us = drv->states[idx].target_residency;
+			}
+			break;
+		}
 		if (s->exit_latency > latency_req) {
 			/*
 			 * If we break out of the loop for latency reasons, use
@@ -294,7 +313,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 			 * as long as that target residency is low enough.
 			 */
 			duration_us = drv->states[idx].target_residency;
-			goto refine;
+			break;
 		}
 
 		idx = i;
@@ -307,21 +326,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		}
 	}
 
-	/*
-	 * If the "hits" metric of the idle state matching the sleep length is
-	 * greater than its "misses" metric, that is the one to use. Otherwise,
-	 * it is more likely that one of the shallower states will match the
-	 * idle duration observed after wakeup, so take the one with the maximum
-	 * "early hits" metric, but if that cannot be determined, just use the
-	 * state selected so far.
-	 */
-	if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
-	    max_early_idx >= 0) {
-		idx = max_early_idx;
-		duration_us = drv->states[idx].target_residency;
-	}
-
-refine:
 	if (idx < 0) {
 		idx = 0; /* No states enabled. Must use 0. */
 	} else if (idx > 0) {

Test results (note: I use my own monitoring utility, because it feeds my
graphing/html utility, but have used turbostat here in case others want to
repeat the test):

Processor: i7-2600K
Deepest idle state: 4 (C6)

The system is mostly idle for these tests.

turbostat command used:

sudo turbostat --Summary --quiet --hide IRQ,Avg_MHz,SMI,GFXMHz,TSC_MHz,GFXWatt,CorWatt,CPU%c1,CPU%c7,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,Pkg%pc6,C1,C1E,,C1%,C1E%,C3%,C6%
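The selection rule the partial revert restores can be modelled in a few lines
of Python (a toy model of the logic in the diff, with invented names and
metric values; not the kernel code): after the candidate state is chosen,
"hits" versus "misses" decides whether to keep it or fall back to the state
with the best "early hits" metric.

```python
def refine_choice(idx, hits, misses, max_early_idx):
    """Toy model of the restored v7 rule: keep idx when its "hits" beat its
    "misses"; otherwise fall back to the state with the maximum "early hits"
    metric, if one was recorded."""
    if hits[idx] > misses[idx]:
        return idx
    if max_early_idx >= 0:
        return max_early_idx
    return idx

# The sleep-length match (state 3) has more misses than hits, so a shallower
# "early hits" candidate (state 1) is chosen instead:
print(refine_choice(3, hits=[5, 9, 2, 4], misses=[1, 1, 1, 7], max_early_idx=1))  # 1
```

The point of contention between v7 and v8 is *where* this check runs: inside
the state-search loop (v7, as restored above) versus once after the loop (v8),
which behaves differently when the loop exits normally with the deepest state
disabled.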
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
On 2019.09.24 01:06 Mel Gorman wrote:
> On Thu, Sep 19, 2019 at 07:42:29AM -0700, Doug Smythies wrote:
>> On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>>> On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
>>> [...]
>
> Hence, I think this patchset should be considered on its own merits.

Agree.

> I think the patch is fine and should be merged with the main caveat being
> that some CPU families may need to use a different calculation to account
> for turbo boost which is a per-arch and per-cpu-family decision.

Agree.

> What, if anything, should change in this patchset before it can be merged?

Nothing, and apologies for the tangential discussion.

> Is that an acked-by?

Absolutely, if I am worthy of ack'ing then:

Acked-by: Doug Smythies

... Doug
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
> On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]
>> The problem with the test is its run to run variability, which was from
>> all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>>
>> While the challenges with this work flow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>>
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> in one fully loaded CPU at the highest CPU frequency.
>>
>> I do not know if such is practical, and I didn't re-visit the issue.
>
> You're absolutely right, pinning a serialized, fork-intensive workload such as
> gitsource gives you as good of a performance as you can get, because it takes
> the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high average
fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always be practical, as you mention
> 2. it's an adversary, worst-case sort of test for some scheduler code paths

Agree.
>> For reference against which all other results are compared
>> is the forced CPU affinity test run. i.e.:
>>
>> taskset -c 3 test_script.
>>
>> Mode      Governor      degradation  Power      Bzy_MHz
>> Reference perf 1 CPU    1.00         reference  3798
>> -         performance   1.2          6% worse   3618
>> passive   ondemand      2.3
>> active    powersave     2.6
>> passive   schedutil     2.7                     1600
>> passive   schedutil-4C  1.68                    2515
>>
>> Where degradation ratio is the time to execute / the reference time for
>> the same conditions. The test runs over a wide range of processes per
>> second, and the worst ratio has been selected for the above table.
>> I have yet to write up this experiment, but the graphs that will
>> eventually be used are at [4] and [5] (same data presented two
>> different ways).
>
> Your table is interesting; I'd say that the one to beat there (from the
> schedutil point of view) is intel_pstate(active)/performance. I'm slightly
> surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around but it's
> also true that the latter lost some grip on iowait_boost in one of the recent
> dev cycles.

?? intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave,
not worse, over the entire range of PIDs (forks) per second, and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA: 0.49
>> 8x-SKYLAKE-UMA: 0.55
>> 48x-HASWELL-NUMA: 0.69
>
> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multiple tens of percentage points of
> advantage.

Agreed. This frequency invariant addition is great. However, if schedutil is
"out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also?
It performs just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somehow more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
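The "worst ratio over the sweep" selection discussed here amounts to a simple
maximum over per-operating-point ratios. A sketch with invented example
numbers (the real data is the forks-per-second sweep behind the table and
graphs referenced above):

```python
def worst_degradation(times_by_rate, reference_by_rate):
    """Degradation ratio = execution time / reference (pinned) time at the
    same forks-per-second operating point; the reported figure is the worst
    ratio over the whole sweep. All numbers here are hypothetical."""
    ratios = {rate: times_by_rate[rate] / reference_by_rate[rate]
              for rate in times_by_rate}
    return max(ratios.values())

times = {10: 11.0, 100: 15.0, 400: 27.0}        # seconds, invented
reference = {10: 10.0, 100: 10.0, 400: 10.0}    # pinned-run times, invented
print(worst_degradation(times, reference))      # 2.7
```

This is why single-number summaries from such sweeps can differ between
machines (0.69 vs 0.49): the operating point where the worst case occurs moves
with core count and turbo behaviour.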
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
On 2019.09.11 08:28 Doug Smythies wrote:
> Hi Giovanni,
>
> Thank you for the great detail and test results you provided.
>
> On 2019.09.08.07:42 Giovanni Gherdovich wrote:
>
> ... [snip]...
>
>> The test we call "gitsource" (running the git unit test suite, a long-running
>> single-threaded shell script) appears rather spectacular in this table (gains
>> of 30-50% depending on the machine). It is to be noted, however, that
>> gitsource has no adjustable parameters (such as the number of jobs in
>> kernbench, which we average over in order to get a single-number summary
>> score) and is exactly the kind of low-parallelism workload that benefits the
>> most from this patch. When looking at the detailed tables of kernbench or
>> tbench4, at low process or client counts one can see similar numbers.
>
> I think the "gitsource" test is the one I learned about here two years
> ago, [1]. It is an extremely good (best I know of) example of single
> threaded, high PID consumption (about 400 / second average, my computer
> [3]), performance issues on a multi CPU computer. I.E., this:
>
> Dountil the list of tasks is finished:
>    Start the next task in the list of stuff to do.
> Enduntil
>
> The problem with the test is its run to run variability, which was from
> all the disk I/O, as far as I could determine.

I forgot: also some memory caching. I always toss out the first test, then do
it 5 more times. If I do not do much stuff with my hard disk in between tests,
it is repeatable enough.

I did the "make test" method and, presenting the numbers your way, got that 4C
took 0.69 times as long as the unpatched schedutil. Your numbers were same or
better (copied below, lower is better):

80x-BROADWELL-NUMA: 0.49
8x-SKYLAKE-UMA: 0.55
48x-HASWELL-NUMA: 0.69

> At the time,
> I studied this to death [2], and made a more repeatable test, without
> any disk I/O.
>
> While the challenges with this work flow have tended to be focused
> on the CPU frequency scaling driver, I have always considered
> the root issue here to be a scheduling issue. Excerpt from my notes
> [2]:
>
>> The issue is that performance is much much better if the system is
>> forced to use only 1 CPU rather than relying on the defaults where
>> the CPU scheduler decides what to do.
>> The scheduler seems to not realize that the current CPU has just
>> become free, and assigns the new task to a new CPU. Thus the load
>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>> It would be better if somehow the scheduler knew that the current
>> active CPU was now able to take on the new task, overall resulting
>> in one fully loaded CPU at the highest CPU frequency.
>
> I do not know if such is practical, and I didn't re-visit the issue.
>
> Anyway these are my results:
>
> Kernel: 5.3-rc8 and + these patches
> Processor: i7-2600K
>
> This is important, at least for the performance governor numbers:
>
> cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
> 35 * 100.0 = 3500.0 MHz max turbo 4 active cores
> 36 * 100.0 = 3600.0 MHz max turbo 3 active cores
> 37 * 100.0 = 3700.0 MHz max turbo 2 active cores
> 38 * 100.0 = 3800.0 MHz max turbo 1 active cores
>
> For reference against which all other results are compared
> is the forced CPU affinity test run. i.e.:
>
> taskset -c 3 test_script.
>
> Mode      Governor      degradation  Power      Bzy_MHz
> Reference perf 1 CPU    1.00         reference  3798
> -         performance   1.2          6% worse   3618
> passive   ondemand      2.3
> active    powersave     2.6
> passive   schedutil     2.7                     1600
> passive   schedutil-4C  1.68                    2515
>
> Where degradation ratio is the time to execute / the reference time for
> the same conditions. The test runs over a wide range of processes per
> second, and the worst ratio has been selected for the above table.
> I have yet to write up this experiment, but the graphs that will
> eventually be used are at [4] and [5] (same data presented two
> different ways).

The experiment write-up is at [6]; however, I wanted more data from the lower
tasks per second region, and so I re-did it, [7]. In the limit, as sequential
tasks per second goes to 0, the differences should diminish, and I wanted to
clearly observe this. Excerpt:

> Conclusion: the schedutil governor improves from the worst
> governor to (mostly) second only to the performance governor
> for unforced CPU affinity execution.
> The energy for the performance cases is worth more detail, as it
> is
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Hi Giovanni,

Thank you for the great detail and test results you provided.

On 2019.09.08.07:42 Giovanni Gherdovich wrote:

... [snip]...

> The test we call "gitsource" (running the git unit test suite, a long-running
> single-threaded shell script) appears rather spectacular in this table (gains
> of 30-50% depending on the machine). It is to be noted, however, that
> gitsource has no adjustable parameters (such as the number of jobs in
> kernbench, which we average over in order to get a single-number summary
> score) and is exactly the kind of low-parallelism workload that benefits the
> most from this patch. When looking at the detailed tables of kernbench or
> tbench4, at low process or client counts one can see similar numbers.

I think the "gitsource" test is the one I learned about here two years ago,
[1]. It is an extremely good (best I know of) example of single threaded, high
PID consumption (about 400 / second average, my computer [3]), performance
issues on a multi CPU computer. I.E., this:

Dountil the list of tasks is finished:
   Start the next task in the list of stuff to do.
   Wait for it to finish
Enduntil

The problem with the test is its run to run variability, which was from all
the disk I/O, as far as I could determine. At the time, I studied this to
death [2], and made a more repeatable test, without any disk I/O.

While the challenges with this work flow have tended to be focused on the CPU
frequency scaling driver, I have always considered the root issue here to be a
scheduling issue. Excerpt from my notes [2]:

> The issue is that performance is much much better if the system is
> forced to use only 1 CPU rather than relying on the defaults where
> the CPU scheduler decides what to do.
> The scheduler seems to not realize that the current CPU has just
> become free, and assigns the new task to a new CPU. Thus the load
> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> It would be better if somehow the scheduler knew that the current
> active CPU was now able to take on the new task, overall resulting
> in one fully loaded CPU at the highest CPU frequency.

I do not know if such is practical, and I didn't re-visit the issue.

Anyway these are my results:

Kernel: 5.3-rc8 and + these patches
Processor: i7-2600K

This is important, at least for the performance governor numbers:

cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores

For reference against which all other results are compared is the forced CPU
affinity test run. i.e.:

taskset -c 3 test_script.

Mode      Governor      degradation  Power      Bzy_MHz
Reference perf 1 CPU    1.00         reference  3798
-         performance   1.2          6% worse   3618
passive   ondemand      2.3
active    powersave     2.6
passive   schedutil     2.7                     1600
passive   schedutil-4C  1.68                    2515

Where degradation ratio is the time to execute / the reference time for the
same conditions. The test runs over a wide range of processes per second, and
the worst ratio has been selected for the above table. I have yet to write up
this experiment, but the graphs that will eventually be used are at [4] and
[5] (same data presented two different ways).
The energy for the performance cases is worth more detail, as it is being
wasted with CPUs waking up and going to sleep, and can be observed in the IRQ
column of turbostat output:

$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60
Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  GFXWatt
12.52  3798     81407   49      22.17    0.12   <<< Forced to CPU 3 only
12.52  3798     81139   51      22.18    0.12
12.52  3798     81036   51      22.20    0.12
11.43  3704     267644  48      21.16    0.12   <<< Change over
12.56  3618     490994  48      23.43    0.12   <<< Let the scheduler decide
12.56  3620     491336  47      23.50    0.12
12.56  3619     491607  47      23.50    0.12
12.56  3619     491512  48      23.52    0.12
12.56  3619     490806  47      23.51    0.12
12.56  3618     491356  49      23.48    0.12
12.56  3618     491035  48      23.51    0.12
12.56  3618     491121  48      23.46    0.12

Note also the busy megahertz column, where other active cores (constantly
waking and sleeping as we rotate through which CPUs are used) are limiting the
highest frequency.

... Doug

[1] https://marc.info/?l=linux-kernel=149181369622980=2
[2] http://www.smythies.com/~doug/linux/single-threaded/index.html
[3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png
[4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png
[5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png
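The PkgWatt figures in the turbostat output above can be reduced to a single
overhead number: the pinned runs sit near 22.2 W while the scheduler-decides
runs sit near 23.5 W. A quick back-of-the-envelope using those values:

```python
# Average package power for the pinned (taskset to CPU 3) samples:
pinned = (22.17 + 22.18 + 22.20) / 3
# Average package power once the scheduler spreads the work:
spread = (23.43 + 23.50 + 23.50 + 23.52 + 23.51 + 23.48 + 23.51 + 23.46) / 8
# Relative extra package power for the same work rate:
print(f"{100 * (spread - pinned) / pinned:.1f}% more package power")  # prints 5.9% more package power
```

That is roughly the same ~6% penalty the earlier table attributes to the
unpinned performance-governor case, consistent with the waste coming from
cores constantly waking and sleeping.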
RE: [PATCH V4 2/2] cpufreq: intel_pstate: Implement QoS supported freq constraints
On 2019.08.08 19:16 Viresh Kumar wrote:
> On 08-08-19, 09:25, Doug Smythies wrote:
>> On 2019.08.07 00:06 Viresh Kumar wrote:
>> Tested-by: Doug Smythies
>> Thermald seems to now be working O.K. for all the governors.
>
> Thanks for testing Doug.
>
>> I do note that if one sets
>> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
>> it seems to override subsequent attempts via
>> /sys/devices/system/cpu/intel_pstate/max_perf_pct.
>> Myself, I find this confusing.
>>
>> So the question becomes which one is the "master"?
>
> No one is master, cpufreq takes all the requests for frequency
> constraints and tries to set the value based on aggregation of all. So
> for max frequency, the lowest value wins and is shown up in sysfs.
>
> So, everything looks okay to me.

O.K. While I understand the explanations, I still struggle with this scenario:

doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
50    <<< Note: 50% = 1.9 GHz in my system
doug@s15:~/temp$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy2/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy3/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy5/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy6/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy7/scaling_max_freq:1900000

At this point I am not certain what I'll get if I try to set max_perf_pct to
100%, nor do I know how to find out with a user command. So, I'll try it:

doug@s15:~/temp$ echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100    <<< Note: 100% = 3.8 GHz in my system
doug@s15:~/temp$ grep .
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy2/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy3/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy5/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy6/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy7/scaling_max_freq:2200000

I guess I had set it sometime earlier, forgot, and then didn't get 3.8 GHz as
I had expected via max_perf_pct.

... Doug
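The aggregation rule Viresh describes ("for max frequency, the lowest value
wins") can be modelled in a few lines. This is a sketch of the policy, not the
kernel's freq-QoS code; `effective_limits` and the request representation are
invented for illustration:

```python
def effective_limits(requests):
    """Aggregate per-requester frequency constraints the way described
    above: the effective max is the lowest of all "max" requests, and the
    effective min is the highest of all "min" requests (values in kHz)."""
    mins = [v for kind, v in requests if kind == "min"]
    maxs = [v for kind, v in requests if kind == "max"]
    return (max(mins, default=0), min(maxs, default=None))

# A later request asking for 3.8 GHz loses to an earlier 2.2 GHz request,
# matching the puzzling sysfs session above:
requests = [("max", 2200000), ("max", 3800000), ("min", 1600000)]
print(effective_limits(requests))  # (1600000, 2200000)
```

This also explains why sysfs shows 2200000 after max_perf_pct is raised to
100%: the old scaling_max_freq request is still outstanding and still the
lowest, so it still wins.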
RE: [PATCH V5 2/2] cpufreq: intel_pstate: Implement QoS supported freq constraints
On 2019.08.08 19:23 Viresh Kumar wrote:
> ---
> V4->V5:
> - dev_pm_qos_update_request() can return 1 in case of success, handle
>   that.

O.K. Thanks, that fixes the "Fail" messages I was getting with V4.

... Doug
RE: [PATCH V4 2/2] cpufreq: intel_pstate: Implement QoS supported freq constraints
On 2019.08.07 00:06 Viresh Kumar wrote:

Thanks for your work on this.

> Intel pstate driver exposes min_perf_pct and max_perf_pct sysfs files,
> which can be used to force a limit on the min/max P state of the driver.
> Though these files eventually control the min/max frequencies that the
> CPUs will run at, they don't make a change to policy->min/max values.
>
> When the values of these files are changed (in passive mode of the
> driver), it leads to calling ->limits() callback of the cpufreq
> governors, like schedutil. On a call to it the governors shall
> forcefully update the frequency to come within the limits. Since the
> limits, i.e. policy->min/max, aren't updated by the driver, the
> governors fail to get the target freq within limit and sometimes abort
> the update believing that the frequency is already set to the target
> value.
>
> This patch implements the QoS supported frequency constraints to update
> policy->min/max values whenever min_perf_pct or max_perf_pct files are
> updated. This is only done for the passive mode as of now, as the driver
> is already working fine in active mode.
>
> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> Reported-by: Doug Smythies
> Signed-off-by: Viresh Kumar

Tested-by: Doug Smythies

Thermald seems to now be working O.K. for all the governors.

I do note that if one sets
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
it seems to override subsequent attempts via
/sys/devices/system/cpu/intel_pstate/max_perf_pct.
Myself, I find this confusing.

So the question becomes which one is the "master"?

Example:

# for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
2200000
2200000
2200000
2200000
2200000
2200000
2200000
2200000
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
...
(Note: 50% = 1900000)
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
1900000
1900000
1900000
1900000
1900000
1900000
1900000
1900000
root@s15:/home/doug/temp# echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
...
(Note: 100% = 3800000, and my expectation is 3.8 GHz below)
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
2200000
2200000
2200000
2200000
2200000
2200000
2200000
2200000

Similarly for the minimum side of things:

root@s15:/home/doug/temp# for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq; do echo "3200000" > $file; done
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
3200000
3200000
3200000
3200000
3200000
3200000
3200000
3200000
root@s15:/home/doug/temp# echo 42 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
42
...
(note 42% = 1600000 = processor minimum, and that is my expectation below.)
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
3200000
3200000
3200000
3200000
3200000
3200000
3200000
3200000

I thought these minimum anomalies would cause problems for thermald, but for
whatever reason, it seems to work properly.

> ---
> V3->V4:
> - Reimplemented the solution using QoS constraints instead of
>   resolve_freq() callback.
> 
>  drivers/cpufreq/intel_pstate.c | 120 +++--
>  1 file changed, 116 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cc27d4c59dca..e9fbd6c36822 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -24,6 +24,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
> 
>  #include
> @@ -1085,6 +1086,47 @@ static ssize_t store_no_turbo(struct kobject *a, struct kobj_attribute *b,
>  	return count;
>  }
> 
> +static struct cpufreq_driver intel_pstate;
> +
> +static void update_qos_request(enum dev_pm_qos_req_type type)
> +{
> +	int max_state, turbo_max, freq, i, perf_pct;
> +	struct dev_pm_qos_request *req;
> +	struct cpufreq_policy *policy;
> +
> +	for_each_possible_cpu(i) {
> +		struct cpudata *cpu = all_cpu_data[i];
> +
> +		policy = cpufreq_cpu_get(i);
> +		if (!policy)
> +			continue;
> +
> +		req = policy->driver_data;
> +		cpufreq_cpu_put(policy);
> +
> +		if (!req)
> +			continue;
> +
> +		if (hwp_active)
> +			intel_pstate_get_hwp_max(i, &turbo_max, &max_state);
> +		else
> +
RE: [PATCH V3 2/2] cpufreq: intel_pstate: Implement ->resolve_freq()
On 2019.08.02 02:28 Rafael J. Wysocki wrote: > On Friday, August 2, 2019 11:17:55 AM CEST Rafael J. Wysocki wrote: >> On Fri, Aug 2, 2019 at 7:44 AM Viresh Kumar wrote: >>> >>> Intel pstate driver exposes min_perf_pct and max_perf_pct sysfs files, >>> which can be used to force a limit on the min/max P state of the driver. >>> Though these files eventually control the min/max frequencies that the >>> CPUs will run at, they don't make a change to policy->min/max values. >> >> That's correct. >> >>> When the values of these files are changed (in passive mode of the >>> driver), it leads to calling ->limits() callback of the cpufreq >>> governors, like schedutil. On a call to it the governors shall >>> forcefully update the frequency to come within the limits. >> >> OK, so the problem is that it is a bug to invoke the governor's ->limits() >> callback without updating policy->min/max, because that's what >> "limits" mean to the governors. >> >> Fair enough. > > AFAICS this can be addressed by adding PM QoS freq limits requests of each > CPU to > intel_pstate in the passive mode such that changing min_perf_pct or > max_perf_pct > will cause these requests to be updated. All governors for the intel_cpufreq (intel_pstate in passive mode) CPU frequency scaling driver are broken with respect to this issue, not just the schedutil governor. My initial escalation had been focused on acpi-cpufreq/schedutil and intel_cpufreq/schedutil, as they were both broken, and both fixed by my initially submitted reversion. What can I say, I missed that other intel_cpufreq governors were also involved. I tested all of them: conservative ondemand userspace powersave performance schedutil Note that no other governor uses resolve_freq(). ... Doug
RE: [PATCH V3 1/2] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.08.02 02:12 Rafael J. Wysocki wrote: > On Fri, Aug 2, 2019 at 7:44 AM Viresh Kumar wrote: >> >> To avoid reducing the frequency of a CPU prematurely, we skip reducing >> the frequency if the CPU had been busy recently. >> >> This should not be done when the limits of the policy are changed, for >> example due to thermal throttling. We should always get the frequency >> within the new limits as soon as possible. >> >> Trying to fix this by using only one flag, i.e. need_freq_update, can >> lead to a race condition where the flag gets cleared without forcing us >> to change the frequency at least once. And so this patch introduces >> another flag to avoid that race condition. >> >> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") >> Cc: v4.18+ # v4.18+ >> Reported-by: Doug Smythies >> Signed-off-by: Viresh Kumar >> --- >> V2->V3: >> - Updated commit log. >> >> V1->V2: >> - Fixed the race condition using a different flag. >> >> @Doug: I haven't changed the code since you last tested these. Your >> Tested-by tag can be useful while applying the patches. Thanks. Tested-by: Doug Smythies For acpi-cpufreq/schedutil only (which we already know). I tested including Rafael's suggested change. I.E. 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 592ff72..ae3ec77 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -441,7 +441,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long util, max;
 	unsigned int next_f;
-	bool busy = false;
+	bool busy;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -452,8 +452,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 		return;
 
 	/* Limits may have changed, don't skip frequency update */
-	if (!sg_policy->need_freq_update)
-		busy = sugov_cpu_is_busy(sg_cpu);
+	busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;

... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.31 23:17 Viresh Kumar wrote: > On 31-07-19, 17:20, Doug Smythies wrote: >> Summary: >> >> The old way, using UINT_MAX had two purposes: first, >> as a "need to do a frequency update" flag; but also second, to >> force any subsequent old/new frequency comparison to NOT be "the same, >> so why bother actually updating" (see: sugov_update_next_freq). All >> patches so far have been dealing with the flag, but only partially >> the comparisons. In a busy system, and when schedutil.c doesn't actually >> know the currently set system limits, the new frequency is dominated by >> values the same as the old frequency. So, when sugov_fast_switch calls >> sugov_update_next_freq, false is usually returned. > > And finally we know "Why" :) > > Good work Doug. Thanks for taking it to the end. > >> However, if we move the resetting of the flag and add another condition >> to the "no need to actually update" decision, then perhaps this patch >> version 1 will be O.K. It seems to be. (see way later in this e-mail). > >> With all this new knowledge, how about going back to >> version 1 of this patch, and then adding this: >> >> diff --git a/kernel/sched/cpufreq_schedutil.c >> b/kernel/sched/cpufreq_schedutil.c >> index 808d32b..f9156db 100644 >> --- a/kernel/sched/cpufreq_schedutil.c >> +++ b/kernel/sched/cpufreq_schedutil.c >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct >> sugov_policy *sg_policy, u64 time) >> static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, >>unsigned int next_freq) >> { >> - if (sg_policy->next_freq == next_freq) >> + /* >> +* Always force an update if the flag is set, regardless. >> +* In some implementations (intel_cpufreq) the frequency is clamped >> +* further downstream, and might not actually be different here. 
>> +*/ >> + if (sg_policy->next_freq == next_freq && >> !sg_policy->need_freq_update) >> return false; > > This is not correct because this is an optimization we have in place > to make things more efficient. And it was working by luck earlier and > my patch broke it for good :) Disagree. All I did was use a flag where it used to be set to UNIT_MAX, to basically implement the same thing. > Things need to get a bit more synchronized and something like this may > help (completely untested): > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index cc27d4c59dca..2d84361fbebc 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -2314,6 +2314,18 @@ static int intel_cpufreq_target(struct cpufreq_policy > *policy, >return 0; > } > > +static unsigned int intel_cpufreq_resolve_freq(struct cpufreq_policy *policy, > + unsigned int target_freq) > +{ > + struct cpudata *cpu = all_cpu_data[policy->cpu]; > + int target_pstate; > + > + target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling); > + target_pstate = intel_pstate_prepare_request(cpu, target_pstate); > + > + return target_pstate * cpu->pstate.scaling; > +} > + > static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy, > unsigned int target_freq) > { > @@ -2350,6 +2362,7 @@ static struct cpufreq_driver intel_cpufreq = { > .verify = intel_cpufreq_verify_policy, > .target = intel_cpufreq_target, > .fast_switch= intel_cpufreq_fast_switch, > + .resolve_freq = intel_cpufreq_resolve_freq, > .init = intel_cpufreq_cpu_init, > .exit = intel_pstate_cpu_exit, > .stop_cpu = intel_cpufreq_stop_cpu, > > -8<- > > Please try this with my patch 2. O.K. > We need patch 2 instead of 1 because > of another race condition Rafael noticed. Disagree. Notice that my modifications to your patch1 addresses that condition by moving the clearing of "need_freq_update" to sometime later. 
> > cpufreq_schedutil calls driver specific resolve_freq() to find the new > target frequency and this is where the limits should get applied IMO. Oh! I didn't know. But yes, that makes sense. > > Rafael can help with reviewing this diff but it would be great if you > can give this a try Doug. Anyway, I added the above code (I am calling it patch3) to patch2, as you asked, and it d
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
Hi Viresh, Summary: The old way, using UINT_MAX had two purposes: first, as a "need to do a frequency update" flag; but also second, to force any subsequent old/new frequency comparison to NOT be "the same, so why bother actually updating" (see: sugov_update_next_freq). All patches so far have been dealing with the flag, but only partially the comparisons. In a busy system, and when schedutil.c doesn't actually know the currently set system limits, the new frequency is dominated by values the same as the old frequency. So, when sugov_fast_switch calls sugov_update_next_freq, false is usually returned. However, if we move the resetting of the flag and add another condition to the "no need to actually update" decision, then perhaps this patch version 1 will be O.K. It seems to be. (see way later in this e-mail). On 2019.07.29 01:38 Rafael J. Wysocki wrote: > On Mon, Jul 29, 2019 at 10:32 AM Viresh Kumar wrote: >> On 29-07-19, 00:55, Doug Smythies wrote: >>> On 2019.07.25 23:58 Viresh Kumar wrote: ...[snip]... >>>> Now, the frequency never gets down and so gets set to the maximum >>>> possible after a bit. >>>> >>>> - Then I did: >>>> >>>> echo > >>>> /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq >>>> >>>> Without my patch applied: >>>>The print never gets printed and so frequency doesn't go down. >>>> >>>> With my patch applied: >>>>The print gets printed immediately from sugov_work() and so >>>>the frequency reduces. >>>> >>>> Can you try with this diff along with my Patch2 ? I suspect there may >>>> be something wrong with the intel_cpufreq driver as the patch fixes >>>> the only path we have in the schedutil governor which takes busyness >>>> of a CPU into account. >>> >>> With this diff along with your patch2 There is never a print message >>> from sugov_work. There are from sugov_fast_switch. >> >> Which is okay. sugov_work won't get hit in your case as I explained >> above. O.K., I finally understand. 
>>> Note that for the intel_cpufreq CPU scaling driver and the schedutil >>> governor I adjust the maximum clock frequency this way: >>> >>> echo > /sys/devices/system/cpu/intel_pstate/max_perf_pct >> >> This should eventually call sugov_limits() in schedutil governor, this >> can be easily checked with another print message. >> >>> I also applied the pr_info messages to the reverted kernel, and re-did >>> my tests (where everything works as expected). There is never a print >>> message from sugov_work. There are from sugov_fast_switch. >> >> that's fine. >> >>> Notes: >>> >>> I do not know if: >>> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq >>> /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq >>> Need to be accurate when using the intel_pstate driver in passive mode. >>> They are not. >>> The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0 >>> suggests that they might need to be representative. >>> I wonder if something similar to that commit is needed >>> for other global changes, such as max_perf_pct and min_perf_pct? >> >> We are already calling intel_pstate_update_policies() in that case, so >> it should be fine I believe. I now believe that lack of synchronization between /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq and /sys/devices/system/cpu/intel_pstate/max_perf_pct is the root issue here. The UINT_MAX next freq flag was also used to force a change because the limits are then forced to be checked and enforced. This e-mail continues with the assumption that this lack of synchronization is O.K., because that is the way it is now. However, if you want to have them synchronized then Viresh's patch1 will work fine afterwards. >> >>> intel_cpufreq/ondemand doesn't work properly on the reverted kernel. >> >> reverted kernel ? The patch you reverted was only for schedutil and it >> shouldn't have anything to do with ondemand. Agreed. This is on hold until I have time to look into it. 
>>> (just discovered, not investigated) >>> I don't know about other governors. >> >> When you do: >> >> echo > /sys/devices/system/cpu/intel_pstate/max_perf_pct >> >> How soon does the print from sugov_fast_switch() gets printed ? >> Immediately ? Check with both the kernels, with my patch and with the >> reverted patch. I don't really know how long. So I added a message so I could ge
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.31 Viresh Kumar wrote: > To avoid reducing the frequency of a CPU prematurely, we skip reducing > the frequency if the CPU had been busy recently. > > This should not be done when the limits of the policy are changed, for > example due to thermal throttling. We should always get the frequency > within the new limits as soon as possible. > > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") > Cc: v4.18+ # v4.18+ > Reported-by: Doug Smythies > Signed-off-by: Viresh Kumar > --- > @Doug: Can you please provide your Tested-by for this commit, as it > already fixed the issue around acpi-cpufreq driver. > > We will continue to see what's wrong with intel-pstate though. Please give me a few more hours. I'll reply to another thread with new information at that time. My recommendation will be to scrap this "patch2" and go back to "patch1" [1], with a couple of modifications. The logic of patch1 is sound. Teaser: it is working for intel_cpufreq/schedutil, but I have yet to test acpi-cpufreq/schedutil. [1] https://marc.info/?l=linux-pm=156377832225470=2 ... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.25 23:58 Viresh Kumar wrote: > On 25-07-19, 08:20, Doug Smythies wrote: >> I tried the patch ("patch2"). It did not fix the issue. >> >> To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil >> governor: >> >> Test: Does a busy system respond to maximum CPU clock frequency reduction? >> >> stock, unaltered: No. >> revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes >> viresh patch: No. >> fast_switch edit: No. >> viresh patch2: No. > > Hmm, so I tried to reproduce your setup on my ARM board. > - booted only with CPU0 so I hit the sugov_update_single() routine > - And applied below diff to make CPU look permanently busy: > > -8<- >diff --git a/kernel/sched/cpufreq_schedutil.c >b/kernel/sched/cpufreq_schedutil.c > index 2f382b0959e5..afb47490e5dc 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy > *sg_policy, u64 time, > if (!sugov_update_next_freq(sg_policy, time, next_freq)) >return; > > + pr_info("%s: %d: %u\n", __func__, __LINE__, freq); ?? there is no "freq" variable here, and so this doesn't compile. 
However this works:

+	pr_info("%s: %d: %u\n", __func__, __LINE__, next_freq);

>  	next_freq = cpufreq_driver_fast_switch(policy, next_freq);
>  	if (!next_freq)
>  		return;
> 
> @@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>  #ifdef CONFIG_NO_HZ_COMMON
>  static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>  {
> -	unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> -	bool ret = idle_calls == sg_cpu->saved_idle_calls;
> -
> -	sg_cpu->saved_idle_calls = idle_calls;
> -	return ret;
> +	return true;
>  }
>  #else
> -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
>  #endif /* CONFIG_NO_HZ_COMMON */
> 
>  /*
> @@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
>  	sg_policy->work_in_progress = false;
>  	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> 
> +	pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
>  	mutex_lock(&sg_policy->work_lock);
>  	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>  	mutex_unlock(&sg_policy->work_lock);
> 
> -8<-
> 
> Now, the frequency never gets down and so gets set to the maximum
> possible after a bit.
> 
> - Then I did:
> 
> echo > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
> 
> Without my patch applied:
>    The print never gets printed and so frequency doesn't go down.
> 
> With my patch applied:
>    The print gets printed immediately from sugov_work() and so
>    the frequency reduces.
> 
> Can you try with this diff along with my Patch2 ? I suspect there may
> be something wrong with the intel_cpufreq driver as the patch fixes
> the only path we have in the schedutil governor which takes busyness
> of a CPU into account.

With this diff along with your patch2, there is never a print message
from sugov_work. There are from sugov_fast_switch.
Note that for the intel_cpufreq CPU scaling driver and the schedutil governor I adjust the maximum clock frequency this way: echo > /sys/devices/system/cpu/intel_pstate/max_perf_pct I also applied the pr_info messages to the reverted kernel, and re-did my tests (where everything works as expected). There is never a print message from sugov_work. There are from sugov_fast_switch. Notes: I do not know if: /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq Need to be accurate when using the intel_pstate driver in passive mode. They are not. The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0 suggests that they might need to be representative. I wonder if something similar to that commit is needed for other global changes, such as max_perf_pct and min_perf_pct? intel_cpufreq/ondemand doesn't work properly on the reverted kernel. (just discovered, not investigated) I don't know about other governors. ... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
Hi, I am having trouble keeping up. Here is what I have so far: On 2019.07.24 04:43 Viresh Kumar wrote: > On 23-07-19, 12:27, Rafael J. Wysocki wrote: >> On Tue, Jul 23, 2019 at 11:15 AM Viresh Kumar >> wrote: >>> Though there is one difference between intel_cpufreq and acpi_cpufreq, >>> intel_cpufreq has fast_switch_possible=true and so it uses slightly >>> different path in schedutil. I tried to look from that perspective as >>> well but couldn't find anything wrong. >> >> acpi-cpufreq should use fast switching on the Doug's system too. > > Ah okay. > >>> If you still find intel_cpufreq to be broken, even with this patch, It is. >>> please set fast_switch_possible=false instead of true in >>> __intel_pstate_cpu_init() and try tests again. That shall make it very >>> much similar to acpi-cpufreq driver. >> >> I wonder if this helps. It does not help. >> Even so, we want fast switching to be used by intel_cpufreq. > > With both using fast switching it shouldn't make any difference. >> Anyway, it looks like the change reverted by the Doug's patch >> introduced a race condition that had not been present before. Namely, >> need_freq_update is cleared in get_next_freq() when it is set _or_ >> when the new freq is different from the cached one, so in the latter >> case if it happens to be set by sugov_limits() after evaluating >> sugov_should_update_freq() (which returned 'true' for timing reasons), >> that update will be lost now. [Previously the update would not be >> lost, because the clearing of need_freq_update depended only on its >> current value.] Where it matters is that in the "need_freq_update set" >> case, the "premature frequency reduction avoidance" should not be >> applied (as you noticed and hence the $subject patch). >> >> However, even with the $subject patch, need_freq_update may still be >> set by sugov_limits() after the check added by it and then cleared by >> get_next_freq(), so it doesn't really eliminate the problem. 
>> IMO eliminating would require invalidating next_freq this way or
>> another when need_freq_update is set in sugov_should_update_freq(),
>> which was done before commit ecd2884291261e3fddbc7651ee11a20d596bb514.
>
> Hmm, so to avoid locking in fast path we need two variable group to
> protect against this kind of issues. I still don't want to override
> next_freq with a special meaning as it can cause hidden bugs, we have
> seen that earlier.
>
> What about something like this then ?

I tried the patch ("patch2"). It did not fix the issue.

To summarize, all kernel 5.2 based, all intel_cpufreq driver and
schedutil governor:

Test: Does a busy system respond to maximum CPU clock frequency reduction?

stock, unaltered: No.
revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes.
viresh patch: No.
fast_switch edit: No.
viresh patch2: No.

References (and procedures used):
https://marc.info/?l=linux-pm=156346478429147=2
https://marc.info/?l=linux-kernel=156343125319461=2

... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.21 23:52 Viresh Kumar wrote:

> To avoid reducing the frequency of a CPU prematurely, we skip reducing
> the frequency if the CPU had been busy recently.
>
> This should not be done when the limits of the policy are changed, for
> example due to thermal throttling. We should always get the frequency
> within limits as soon as possible.
>
> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> Cc: v4.18+ # v4.18+
> Reported-by: Doug Smythies
> Signed-off-by: Viresh Kumar
> ---
> @Doug: Please try this patch, it must fix the issue you reported.

It fixes the driver = acpi-cpufreq ; governor = schedutil test case.
It does not fix the driver = intel_cpufreq ; governor = schedutil test case.

I have checked my results twice, but will check again in a day or two.

... Doug

> 
>  kernel/sched/cpufreq_schedutil.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 636ca6f88c8e..b53c4f02b0f1 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -447,7 +447,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>  	unsigned long util, max;
>  	unsigned int next_f;
> -	bool busy;
> +	bool busy = false;
> 
>  	sugov_iowait_boost(sg_cpu, time, flags);
>  	sg_cpu->last_update = time;
> @@ -457,7 +457,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
> 
> -	busy = sugov_cpu_is_busy(sg_cpu);
> +	/* Limits may have changed, don't skip frequency update */
> +	if (!sg_policy->need_freq_update)
> +		busy = sugov_cpu_is_busy(sg_cpu);
> 
>  	util = sugov_get_util(sg_cpu);
>  	max = sg_cpu->max;
> --
> 2.21.0.rc0.269.g1a574e7a288b
RE: [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
On 2019.07.18 03:28 Viresh Kumar wrote: > On 17-07-19, 23:26, Doug Smythies wrote: >> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514. >> >> The commit caused a regression whereby reducing the maximum >> CPU clock frequency is ineffective while busy, and the CPU >> clock remains unchanged. Once the system has experienced >> some idle time, the new limit is implemented. > > Can you explain why this patch caused that issue ? I am sorry but I couldn't > understand it from your email. How are we trying to reduce the frequency? Is > clk_set_rate() getting called with that finally and not working ? The patch eliminates the "flag", UNIT_MAX, and it's related comment, that was used to indicate if it was a limit change that causes the subsequent execution of sugov_update_single. As for "clk_set_rate()", I don't know. I bisected the kernel and got here. I also looked at reverting B7EAF1AAB9F8, but it seemed to have some purpose which I don't know of more important than this one or not. I'll reinsert the first test below with more detail: On 2019.07.17 23:27 Doug wrote: > Driver: intel_cpufreq ; Governor: Schedutil > Kernel 5.2: $ uname -a Linux s15 5.2.0-stock #608 SMP PREEMPT Mon Jul 8 08:21:56 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor schedutil schedutil schedutil schedutil schedutil schedutil schedutil schedutil > Test 1: No thermal throttling involved (I only ever use thermal, if anything): doug@s15:~$ sudo systemctl status thermald.service . thermald.service - Thermal Daemon Service Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled) Active: inactive (dead) > - load the system (some sort of CPU stress test). Use whatever you want. 
I use my own, only because I always have and it prints something every
so often which gives an indication of actual clock speed.

doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3000
[2] 3001
[3] 3002
[4] 3003
[5] 3004
[6] 3005
[7] 3006
[8] 3007
[9] 3008

Watch everything with turbostat:

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.03    1600    269     25      3.69
0.07    1600    390     25      3.72
0.06    1600    368     25      3.72
0.06    1600    343     25      3.71
0.04    1600    269     26      3.70
74.72   3474    30230   46      43.95   <<< Add load
100.00  3500    40228   50      58.27
100.00  3500    40196   52      58.59
100.00  3500    40215   55      58.91
100.00  3500    40211   56      59.12
100.00  3500    40291   58      59.33   <<< Try to set 60% max
100.00  3500    40278   59      59.55
100.00  3500    40591   61      59.73
100.00  3500    40279   61      60.10
100.00  3500    40292   63      60.35
100.00  3500    40401   64      60.85
99.99   3500    40352   65      61.12
100.00  3500    40230   67      61.32
100.00  3500    40228   67      61.52
100.00  3500    40400   68      61.60   <<< Try to set 42% max
100.00  3500    40279   69      61.74
100.00  3500    40258   68      61.85
100.00  3500    40226   70      62.01
100.00  3500    40220   70      62.10
100.00  3500    40211   71      62.25

> - adjust downwards the maximum clock frequency
> echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
60

... wait awhile ...

doug@s15:~/temp$ echo 42 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
42
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
42

> - observe the Bzy_MHz does not change on the turbostat output.

See annotated turbostat output above.

> - reduce the system load, such that there is some idle time.
> - observe the Bzy_MHz now does change.
> - increase the load again.
> - observe the CPU frequency is now pinned to the new limit.
Go back to 60% first:

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60

killall test1

... wait awhile ...

doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3040
[2] 3041
[3] 3042
[4] 3043
[5] 3044
[6] 3045
[7] 3046
[8] 3047
[9] 3048

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.00  3500    40289   80      64.32
100.00
[PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.

The commit caused a regression whereby reducing the maximum CPU clock
frequency is ineffective while busy, and the CPU clock remains
unchanged. Once the system has experienced some idle time, the new
limit is implemented. A consequence is that any thermal throttling
monitoring and control based on max freq limits fails to respond in a
timely manner, if at all, to a thermal temperature trip on a busy
system.

Please review the original e-mail threads, as this attempted reversion
would then re-open some of those questions.

One possible alternative to this reversion would be to revert B7EAF1AAB9F8:

Author: Rafael J. Wysocki
Date: Wed Mar 22 00:08:50 2017 +0100

    cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

References:
https://marc.info/?t=15257618932=1=2
https://marc.info/?t=15258621902=1=2
https://marc.info/?t=15260716984=1=2
https://marc.info/?t=14901381392=1=2

---

Test Data:

Typical commands:

watch -n 5 cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq

Driver: intel_cpufreq ; Governor: Schedutil

Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test).
- adjust downwards the maximum clock frequency:
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try thermal throttling:
- By default thermald will use pstate control first. On my system a
  lower thermal trip point is used, because it doesn't get hot enough.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald adjusting
  the max_perf_pct but nothing happens in response.
- Once the lowest limit of max_perf_pct is reached, observe thermald
  fall through to intel-powerclamp, and kidle inject starts to be used.
  Therefore the "busy" condition (see the code) is no longer met, and
  the CPU frequency suddenly goes to minimum.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one option,
  thereby excluding help from the fall-through partner experienced in
  test 2.
- Observations here tracked test 2 until the lower limit of
  max_perf_pct is reached. The CPUs remain at maximum frequency until
  the load is removed. Processor temperature continues to climb.

Driver: intel_cpufreq ; Governor: Schedutil

Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency:
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try thermal throttling:
- By default thermald will use pstate control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald adjusting
  the max_perf_pct and observe the actual frequency adjust in response.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one option,
  adjusting max_perf_pct.
- Observations here tracked test 2. (Test 2 never fell through to
  intel_powerclamp and kidle inject.)
Driver: acpi-cpufreq ; Governor: Schedutil

Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency:
  for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "220" > $file; done
- observe that Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe that Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald adjusting the kidle_inj amounts.
- If the maximum kidle_inj is not enough, then observe thermald start to limit scaling_max_freq; here, however, the actual frequency response is immediate. Why? Because of the kidle inject, the system is no longer fully "busy", and so that condition is not met.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include
RE: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems
On 2019.07.03 12:12 Doug Smythies wrote: > On 2019.07.03 08:16 Daniel Lezcano wrote: >> On 03/07/2019 16:23, Doug Smythies wrote: >>> On 2019.06.20 04:58 Daniel Lezcano wrote: > ... >>> Anyway, I did a bunch of tests and such, but have deleted >>> most from this e-mail, because it's just noise. I'll >>> include just one set: >>> >>> For a work load that would normally result in a lot of use >>> of shallow idle states (single core pipe-test * 2 cores). >> >> Can you share the tests and the command lines? > > Yes, give me a few days to repeat the tests and write > it up properly. I am leaving town in an hour and for a day. O.K. I re-did the tests and made a new web page with, I think, everything I used. ... >> The governor can be better by selecting the shallow states, the >> scheduler has to interact with the governor to give clues about the >> load, that is identified and will be the next step. >> >> Is it possible to check with the schedutil governor instead? > > Oh, I already have some data, just didn't include it before: > > Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; > Processor package power: 40.4 watts; 4.9 uSec/loop > > Idle governor, mobile; CPU frequency scaling: intel-cpufreq/schedutil; > Processor package power: 12.7 watts; 19.7 uSec/loop > > Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; > Idle states 0-3 disabled (note: Idle state 4 is the deepest on my system) > Processor package power: 36.9 watts; 8.3 uSec/loop > In my notes I wrote: "Huh?? I do not understand this result, as I had > expected more similar to the mobile governor". But I did not investigate. The reason for the big difference was/is that with the "mobile" governor the CPU frequency never scales up for this test. It can be sluggish to scale up with the teo and the menu governors, but always eventually does. I also tried the acpi-cpufreq driver with similar results. > Anyway, the schedutil test is the one I'll repeat and write up better. 
New summary (similar to old):

governor          usec/loop  watts
mobile            19.8       12.67
teo                4.87      40.28
menu               4.85      40.25
teo-idle-0-3-dis   8.30      36.85

Graphs, details and source codes:
http://www.smythies.com/~doug/linux/idle/mobile/index.html

... Doug
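One way to read the summary table above is to combine its two columns into a derived metric: energy per unit of work (watts x usec/loop = microjoules per loop). The governor names and numbers below are taken from the table; the uJ/loop metric itself is an illustration added here, not a figure from the original tests:

```python
# Energy per pipe-test loop, derived from the summary table:
# watts * usec/loop = microjoules per loop iteration.
results = {
    # governor:         (usec_per_loop, watts)
    "mobile":           (19.8, 12.67),
    "teo":              (4.87, 40.28),
    "menu":             (4.85, 40.25),
    "teo-idle-0-3-dis": (8.30, 36.85),
}

for gov, (usec, watts) in results.items():
    uj_per_loop = usec * watts  # W * us = uJ
    print(f"{gov:18s} {uj_per_loop:7.1f} uJ/loop")
```

Worked out this way, the mobile governor spends roughly 251 uJ per loop versus about 196 uJ for teo or menu, despite drawing far fewer watts: the lower power does not translate into lower energy for this fixed amount of work.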
RE: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems
On 2019.07.03 08:16 Daniel Lezcano wrote: > On 03/07/2019 16:23, Doug Smythies wrote: >> On 2019.06.20 04:58 Daniel Lezcano wrote: ... >> Anyway, I did a bunch of tests and such, but have deleted >> most from this e-mail, because it's just noise. I'll >> include just one set: >> >> For a work load that would normally result in a lot of use >> of shallow idle states (single core pipe-test * 2 cores). > > Can you share the tests and the command lines? Yes, give me a few days to repeat the tests and write it up properly. I am leaving town in an hour and for a day. It'll be similar to this: http://www.smythies.com/~doug/linux/idle/teo8/pipe/index.html parent page (which I will do a better version): http://www.smythies.com/~doug/linux/idle/teo8/index.html ... >> I got (all kernel 5.2-rc5 + this patch): >> >> Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 40.4 watts; 4.9 uSec/loop >> >> Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 34 watts; 5.2 uSec/loop >> >> Idle governor, mobile; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 25.9 watts; 11.1 uSec/loop >> >> Idle governor, menu; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 34.2 watts; 5.23 uSec/loop >> >> Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; >> Maximum CPU frequency limited to 73% to match mobile energy. >> Processor package power: 25.4 watts; 6.4 uSec/loop > > Ok that's interesting. Thanks for the values. > > The governor can be better by selecting the shallow states, the > scheduler has to interact with the governor to give clues about the > load, that is identified and will be the next step. > > Is it possible to check with the schedutil governor instead? 
Oh, I already have some data, just didn't include it before: Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; Processor package power: 40.4 watts; 4.9 uSec/loop Idle governor, mobile; CPU frequency scaling: intel-cpufreq/schedutil; Processor package power: 12.7 watts; 19.7 uSec/loop Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; Idle states 0-3 disabled (note: Idle state 4 is the deepest on my system) Processor package power: 36.9 watts; 8.3 uSec/loop In my notes I wrote: "Huh?? I do not understand this result, as I had expected more similar to the mobile governor". But I did not investigate. Anyway, the schedutil test is the one I'll repeat and write up better. ... Doug
RE: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems
Hi Daniel,

I tried your "mobile" governor, albeit not on a mobile device.

On 2019.06.20 04:58 Daniel Lezcano wrote:
...
> The mobile governor is a new governor targeting embedded systems
> running on battery where the energy saving has a higher priority than
> servers or desktops. This governor aims to save energy as much as
> possible but with a performance degradation tolerance.
>
> In this way, we can optimize the governor for specific mobile workload
> and more generally embedded systems without impacting other platforms.

I just wanted to observe the lower energy, accepting performance degradation. My workloads may have been inappropriate.
...
> +
> +#define EMA_ALPHA_VAL		64
> +#define EMA_ALPHA_SHIFT		7
> +#define MAX_RESCHED_INTERVAL_MS	100
> +
> +static DEFINE_PER_CPU(struct mobile_device, mobile_devices);
> +
> +static int mobile_ema_new(s64 value, s64 ema_old)
> +{
> +	if (likely(ema_old))
> +		return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >>
> +				  EMA_ALPHA_SHIFT);
> +	return value;
> +}

Do you have any information as to why these numbers? Without any background, the filter seems overly new value weighted to me. It is an infinite impulse response type filter, currently at:

output = 0.5 * old + 0.5 * new

I tried, but didn't get anything conclusive:

output = 0.875 * old + 0.125 * new

I did it this way:

#define EMA_ALPHA_VAL		7
#define EMA_ALPHA_SHIFT		3
#define MAX_RESCHED_INTERVAL_MS	100

static DEFINE_PER_CPU(struct mobile_device, mobile_devices);

static int mobile_ema_new(s64 value, s64 ema_old)
{
	if (likely(ema_old))
		return ((ema_old * EMA_ALPHA_VAL) + value) >> EMA_ALPHA_SHIFT;
	return value;
}

...
> +	/*
> +	 * Sum all the residencies in order to compute the total
> +	 * duration of the idle task.
> +	 */
> +	residency = dev->last_residency - s->exit_latency;

What about when the CPU comes out of the idle state before it even gets fully into it?
Under such conditions it seems to hold much too hard at idle states that are too deep, to the point where energy goes up while performance goes down. Anyway, I did a bunch of tests and such, but have deleted most from this e-mail, because it's just noise. I'll include just one set: For a work load that would normally result in a lot of use of shallow idle states (single core pipe-test * 2 cores). I got (all kernel 5.2-rc5 + this patch): Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 40.4 watts; 4.9 uSec/loop Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 34 watts; 5.2 uSec/loop Idle governor, mobile; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 25.9 watts; 11.1 uSec/loop Idle governor, menu; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 34.2 watts; 5.23 uSec/loop Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; Maximum CPU frequency limited to 73% to match mobile energy. Processor package power: 25.4 watts; 6.4 uSec/loop ... Doug
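Going back to the EMA filter question above: the two weightings can be compared with a quick userspace sketch. This is Python rather than kernel C, the function names mirror mobile_ema_new from the patch, and the step-response numbers are purely illustrative (Python's arithmetic right shift on negative values floors, matching the usual gcc behaviour for signed shifts):

```python
EMA_ALPHA_VAL = 64
EMA_ALPHA_SHIFT = 7

def ema_patch(value, ema_old):
    """The patch's filter: output = 0.5 * old + 0.5 * new (64 >> 7)."""
    if ema_old:
        return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >> EMA_ALPHA_SHIFT)
    return value

def ema_heavier(value, ema_old):
    """The variant I tried: output = 0.875 * old + 0.125 * new (7 >> 3)."""
    if ema_old:
        return ((ema_old * 7) + value) >> 3
    return value

# Step response: the measured residency drops from 1000 to 0.
old_p = old_h = 1000
for step in range(4):
    old_p = ema_patch(0, old_p)
    old_h = ema_heavier(0, old_h)
    print(step, old_p, old_h)
```

After four zero samples the patch's filter has decayed 1000 -> 62 while the heavier-history variant is still at 585, which is the "overly new value weighted" difference in concrete numbers.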
RE: NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time)
On 2019.07.01 08:34 Alan Jenkins wrote:
> Hi

Hi,

> I tried running a simple test:
>
> dd if=testfile iflag=direct bs=1M of=/dev/null
>
> With my default settings, `vmstat 10` shows something like 85% idle time
> to 15% iowait time. I have 4 CPUs, so this is much less than one CPU
> worth of iowait time.
>
> If I boot with "nohz=off", I see idle time fall to 75% or below, and
> iowait rise to about 25%, equivalent to one CPU. That is what I had
> originally expected.
>
> (I can also see my expected numbers, if I disable *all* C-states and
> force polling using `pm_qos_resume_latency_us` in sysfs).
>
> The numbers above are from a kernel somewhere around v5.2-rc5. I saw
> the "wrong" results on some previous kernels as well. I just now
> realized the link to NO_HZ_IDLE.[1]
>
> [1] https://unix.stackexchange.com/questions/517757/my-basic-assumption-about-system-iowait-does-not-hold/527836#527836
>
> I did not find any information about this high level of inaccuracy. Can
> anyone explain, is this behaviour expected?

I'm not commenting on expected behaviour or not, just that it is inconsistent.

> I found several patches that mentioned "iowait" and NO_HZ_IDLE. But if
> they described this problem, it was not clear to me.
>
> I thought this might also be affecting the "IO pressure" values from the
> new "pressure stall information"... but I am too confused already, so I
> am only asking about iowait at the moment :-).

Using your workload, I confirm inconsistent behaviour for /proc/stat (which vmstat uses) between kernels 4.15, 4.16, and 4.17:

4.15 does what you expect, no matter idle states enabled or disabled.
4.16 doesn't do what you expect regardless (although a little erratic).
4.17 and later do what you expect with only idle state 0 enabled, and don't otherwise.
Actual test data, vmstat (/proc/stat) (8 CPUs, 12.5% = 1 CPU):

Kernel  idle/iowait %  Idle states >= 1
4.15    88/12          enabled
4.15    88/12          disabled
4.16    99/1           enabled
4.16    99/1           disabled
4.17    98/2           enabled
4.17    88/12          disabled

Note 1: I never booted with "nohz=off" because the tick never turns off for idle state 0, which is good enough for testing.

Note 2: Myself, I don't use /proc/stat for idle time statistics. I use:
/sys/devices/system/cpu/cpu*/cpuidle/state*/time
And they seem to always be consistent at the higher idle percentage number.

Unless someone has some insight, the next step is kernel bisection, once between kernel 4.15 and 4.16, then again between 4.16 and 4.17. The second bisection might go faster with knowledge gained from the first.

Alan: Can you do kernel bisection? I can only do it starting maybe Friday.

... Doug
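For reference, the idle/iowait percentages in the table above are derived the way vmstat derives them: take two snapshots of the "cpu" line in /proc/stat and divide the per-field deltas by the total delta. The sketch below shows that arithmetic; the two sample lines are invented for illustration, not captured from the test machine:

```python
# Fields in a /proc/stat "cpu" line (units of USER_HZ ticks):
# user nice system idle iowait irq softirq steal guest guest_nice
def parse_cpu_line(line):
    return [int(x) for x in line.split()[1:]]

# Hypothetical samples taken ~10 s apart on an idle-ish machine:
before = parse_cpu_line("cpu 1000 0 500 88000 12000 0 100 0 0 0")
after  = parse_cpu_line("cpu 1100 0 560 96800 13200 0 110 0 0 0")

delta = [a - b for a, b in zip(after, before)]
total = sum(delta)
idle_pct   = 100.0 * delta[3] / total   # 4th field: idle
iowait_pct = 100.0 * delta[4] / total   # 5th field: iowait
print(f"idle {idle_pct:.1f}%  iowait {iowait_pct:.1f}%")
```

The inconsistency discussed in this thread is in how the kernel attributes ticks to the idle versus iowait fields while the tick is stopped, not in this userspace arithmetic.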
RE: 5.2-rc2: low framerate in flightgear, cpu not running at full speed, thermal related?
On 2019.06.13 01:53 Rafael J. Wysocki wrote:
> I personally doubt that any thermal throttling is involved here.

In earlier e-mails on this thread, Pavel showed his core and package temperatures as 97 and 98 degrees. If thermal throttling is not involved, it should be. The description of the observed CPU frequencies also matched my experiences with thermal throttling for the intel-pstate/powersave example. Myself, I can not determine if throttling is involved for the acpi-cpufreq/ondemand case just from the clock frequencies, because, at least on my system, it uses the kidle_inject method instead of the pstate method.

I continued doing experiments, enabling thermald (normally disabled on my system) and forcing thermal throttling on my test server. My system never gets hot enough, so I used a low trip point of 55 degrees. The intel_pstate/powersave and intel_cpufreq/ondemand combinations, which both used the pstate method, outperformed acpi-cpufreq/ondemand by 30% in a constant thermal throttling mode. This seems the opposite of Pavel's results, if indeed his system is thermal throttling. (I can write these experiments up in more detail if desired.)

On 2019.06.13 Pavel Machek wrote:
> But it seems that your assumptions are incorrect for my workload.

Fair enough. I don't have the appropriate hardware for this, and am trying to guess at a similar workflow for testing. Perhaps I am just muddying things here rather than helping.

> flightgear is single-threaded, and in my configuration saturates the
> CPU, because it would like to achieve higher framerate than my system
> is capable of.

Are you sure? Use turbostat and observe.
Example 1: intel_cpufreq/ondemand, kernel 5.2-rc3, two 100% loads:

doug@s15:~$ sudo turbostat --quiet --hide IRQ,Avg_MHz,SMI,\
> GFXMHz,TSC_MHz,CorWatt,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,\
> Pkg%pc6,POLL,C1,C1E,C3,C6,C1%,C1E%,C3%,C6% \
> --interval 5
Core CPU  Busy%  Bzy_MHz POLL%  CPU%c1 CPU%c3 CPU%c6 CPU%c7 PkgTmp PkgWatt GFXWatt
-    -    25.24  2780    0.00   25.70  0.01   49.05  0.00   56     23.82   0.12
0    0     1.62  2781    0.00    1.90  0.02   96.46  0.00   56     23.82   0.12
0    4     0.05  2833    0.00    3.47
1    1   100.00  2780    0.00    0.00  0.00   0.00   0.00
1    5     0.03  2893    0.00   99.97
2    2     0.03  2906    0.00   99.97  0.00   0.00   0.00
2    6   100.00  2780    0.00    0.00
3    3     0.07  2797    0.00    0.18  0.01   99.74  0.00
3    7     0.10  2834    0.00    0.14

Example 2: acpi_cpufreq/ondemand, kernel 5.2-rc3, two 100% loads:

doug@s15:~$ sudo turbostat --quiet --hide IRQ,Avg_MHz,SMI,\
> GFXMHz,TSC_MHz,CorWatt,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,\
> Pkg%pc6,POLL,C1,C1E,C3,C6,C1%,C1E%,C3%,C6% \
> --interval 5
Core CPU  Busy%  Bzy_MHz POLL%  CPU%c1 CPU%c3 CPU%c6 CPU%c7 PkgTmp PkgWatt GFXWatt
-    -    18.06  3069    0.00   18.45  0.01   63.47  0.00   55     21.50   0.12
0    0    15.55  3049    0.00    1.31  0.02   83.12  0.00   55     21.50   0.12
0    4     0.64  2651    0.00   16.22
1    1    62.43  3075    0.00    8.15  0.00   29.43  0.00
1    5     7.71  3068    0.00   62.81
2    2    50.56  3091    0.00    0.78  0.00   48.66  0.00
2    6     0.44  2346    0.00   50.89
3    3     2.30  2901    0.00    5.00  0.01   92.69  0.00
3    7     4.79  3016    0.00    2.49

With the additional C6 % coming from the kidle_inj tasks.

Observation 1: The loaded CPUs migrate much more often in this scenario.
Observation 2: While the package watts are different by over 2 watts, the long term (done over many hours) average was the same for all methods.

... Doug
RE: 5.2-rc2: low framerate in flightgear, cpu not running at full speed, thermal related?
On 2019.06.12 14:25 Rafael J. Wysocki wrote:
> On Wed, Jun 12, 2019 at 4:45 AM Doug Smythies wrote:
>>
>> So, currently there seems to be 3 issues in this thread
>> (and I am guessing a little, without definitive data):
>>
>> 1.) On your system Kernel 5.4-rc2 (or 4) defaults to the intel_pstate CPU frequency
>> scaling driver and the powersave governor, but kernel 4.6 defaults to the
>> acpi-cpufreq CPU frequency scaling driver and the ondemand governor.
>
> Which means that intel_pstate works in the active mode by default and
> so it uses its internal governor.

Not sure what you mean by "internal governor"? If you meant HWP (Hardware P-state), Pavel's processor doesn't have it. If you meant the active powersave governor code within the driver, then agreed.

> That governor is more performance-oriented than ondemand and it very
> well may cause more power to be allocated for the processor - at the
> expense of the GPU.

O.K. I mainly use servers and so have no experience with possible GPU versus CPU tradeoffs. However, I did re-do my tests measuring energy instead of CPU frequency, and found very little difference between acpi-cpufreq/ondemand versus intel_pstate/powersave as a function of single threaded load. Actually, I did the test twice, once at a 20 hertz work/sleep frequency and also at a 67 hertz work/sleep frequency. (Of course, Pavel's processor might well have a different curve, but it is a similar vintage to mine: i5-2520M versus i7-2600K.)

The worst difference was approximately 1.1 extra processor package watts (an extra 5.5%) in the 80% to 85% single threaded load range at the 67 hertz work/sleep frequency, for the intel-pstate/powersave driver/governor.

What am I saying? For a fixed amount of work to do per work/sleep cycle (i.e. maybe per video frame related type work), while the CPU frequency versus load curves might differ, the resulting processor energy curve differs much less. (i.e.
the extra power for higher CPU frequency is for less time, because it gets the job done faster.)

So, myself, I don't yet understand why only the one method would have hit thermal throttling, but not the other (if indeed it doesn't). Other differences between kernel 4.6 and 5.2-rc? might explain it. I did all my tests on kernel 5.2-rc3, except that one example from kernel 4.4 in my earlier reply, so that there were no variables other than CPU scaling driver and governor changes.

> The lower-than-expected frame rate may result from that, in principle.
> One way to mitigate that might be to use intel_pstate in the passive
> mode (pass intel_pstate=passive to the kernel in the command line)
> along with either ondemand or schedutil as the governor.

The CPU frequency versus load curves for those two governors are very similar for both the acpi_cpufreq and intel_cpufreq (which is the intel_pstate driver in passive mode) drivers.

Just for information: the CPU frequency versus single threaded load curves for the conservative governor are quite different between the two drivers. (Tests done in February; perhaps I should re-do them and also look at energy at the same time, or instead of CPU frequency.)

... Doug
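The "extra power for less time" point above can be put into numbers with a toy model: a fixed amount of work per work/sleep period, run either slower at lower package power or faster (then idle) at higher power. All the figures below are invented for illustration, not measurements:

```python
# Toy model: fixed work per period, either slow-and-steady or
# race-to-idle.  Power numbers are made up for the sketch.
WORK_CYCLES = 1.0e9   # CPU cycles of work per work/sleep period
PERIOD_S    = 0.5     # seconds per work/sleep period

def package_energy(freq_hz, busy_watts, idle_watts=4.0):
    busy_s = WORK_CYCLES / freq_hz
    idle_s = PERIOD_S - busy_s
    return busy_s * busy_watts + idle_s * idle_watts

slow = package_energy(2.5e9, 20.0)  # lower freq, lower power, busy longer
fast = package_energy(3.5e9, 30.0)  # higher freq, higher power, idles sooner
print(f"slow {slow:.2f} J/period  fast {fast:.2f} J/period")
```

With these made-up numbers the busy power differs by 50% (20 W versus 30 W) but the energy per period differs by only about 12% (8.4 J versus 9.4 J), which is the shape of the effect observed in the tests: the frequency-versus-load curves differ more than the energy curves do.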
RE: 5.2-rc2: low framerate in flightgear, cpu not running at full speed, thermal related?
Hi,

So, currently there seems to be 3 issues in this thread (and I am guessing a little, without definitive data):

1.) On your system Kernel 5.4-rc2 (or 4) defaults to the intel_pstate CPU frequency scaling driver and the powersave governor, but kernel 4.6 defaults to the acpi-cpufreq CPU frequency scaling driver and the ondemand governor. Suggest to check the related kernel configuration parameters:

$ grep -A 10 -i "CPU frequency scaling drivers" .config
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
CONFIG_X86_PCC_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_AMD_FREQ_SENSITIVITY=m
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_P4_CLOCKMOD=m

2.) With kernel 5.2-rc2 there is thermal throttling (thermald?), but for kernel 4.6 it either doesn't hit the throttling threshold or is not enabled. I don't have input here. However, note that the CPU frequency versus load graphs are quite different between acpi-cpufreq/ondemand and intel-pstate/powersave, with higher CPU frequencies at lower loads for the acpi-cpufreq driver and higher CPU frequencies at higher loads for the intel_pstate driver. The crossover point is about 35% load. So, the two driver/governor combinations might well end up at different operating points in terms of CPU frequencies. I do not have energy data for the CPU frequency versus load tests, which is really what matters.

3.) The older kernel is still using the older acpi-cpufreq stuff, where some CPU frequency registers show what was asked for and not what the system is actually giving. See below.

On 2019.06.09 04:18 Pavel Machek wrote:
> When I start flightgear, I get framerates around 20 fps and cpu at
> 3GHz:
>
> pavel@duo:~/bt$ cat /proc/cpuinfo | grep MHz
> cpu MHz : 3027.471
> cpu MHz : 2981.863
> cpu MHz : 2958.352
> cpu MHz : 2864.001
> pavel@duo:~/bt$
>
> (Ok, fgfs is really only running at single core, so why do both cores
> run at 3GHz?)
There is only 1 PLL (Phase Locked Loop) master clock generator in your processor. Basically, the CPU with the highest frequency demand wins. What is not shown in your data is that the not-busy core is also probably in a deep idle state for most of the time, consuming little or no energy.

Below is an example from my processor (i7-2600K) using turbostat (note C6 is my deepest idle state, and only it is shown for simplicity):

Core CPU  Busy%  Bzy_MHz CPU%c6 PkgTmp PkgWatt
-    -    12.52  3799    74.88  54     23.87
0    0     0.05  3664    99.89  54     23.87
0    4     0.00  3695
1    1     0.07  3676    99.87
1    5     0.00  3667
2    2     0.09  3669    99.78
2    6     0.00  3666
3    3     0.04  3786     0.00
3    7    99.92  3800

Observe the busy core spending no time in C6, but the other 3 spend most of the time there. However, for the little time that they are active, CPU 7 is the one demanding the high CPU frequency.

Anticipated question: "Then why aren't all the CPU frequencies exactly the same?" My processor's max turbo frequency is a function of how many cores are active, and so actually varies under peak demand. Excerpt from turbostat, not in quiet mode:

35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores

> The CPU is Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz . I guess it means
> it should be able to sustain both cores running at 2.5GHz?

No, that is more a function of your hardware's ability (or lack of) to move the waste heat away.

> Situation is very different with v4.6 distro based kernel.
>
> CPU MHz is only getting values round to 100MHz. It does not go above
> 2.5GHz, but it does not go below 2.5GHz under the load, either.

Yes, it is going above 2.5 GHz; see below.

> v4.6:
> cpu MHz : 2501.000
> cpu MHz : 2501.000
> cpu MHz : 2501.000
> cpu MHz : 2501.000

That seems to be the old acpi-cpufreq information, which is telling you what it's asking for and not what it actually is.
And the "01" means it is asking for turbo range frequencies. Example from my system, while running the above CPU 7 busy test: Conditions: Old kernel (4.4), acpi-cpufreq driver, ondemand gov. doug@s15:~$ grep MHz /proc/cpuinfo cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 3401.000 CPUs 0-6 have very little load and are asking for the minimum available frequency. CPU 7 is asking for turbo. The turbostat data tells a different story. Conditions: New kernel (5.2-rc3), acpi-cpufreq driver, ondemand gov. doug@s15:~$ grep MHz /proc/cpuinfo cpu MHz : 3611.623 cpu MHz :
RE: [PATCH v2] cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
On 2019.02.07 03:51 Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > The current iowait boosting mechanism in intel_pstate_update_util() > is quite aggressive, as it goes to the maximum P-state right away, > and may cause excessive amounts of energy to be used, which is not > desirable and arguably isn't necessary too. > > Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost > more energy efficient") that reworked the analogous iowait boost > mechanism in the schedutil governor and make the iowait boosting > in intel_pstate_update_util() work along the same lines. > > Signed-off-by: Rafael J. Wysocki > --- > > -> v2: > * Follow the Doug's suggestion and drop the immediate jump to > max P-state if boost is max. The code is simpler this way and > the perf impact should not be noticeable on average. Hi Rafael, Something has broken on my incoming e-mail sorting stuff, and I missed this one (and some others). This V2 is not actually what I was proposing. I was O.K. with the immediate jump, but I didn't want the set_pstate step by-passed if it was already at max because that would also by-pass the trace sample, if it was enabled. Anyway, this V2 seems O.K. to me. I tested it compared to V1 and, as you mentioned, wasn't able to detect any energy consumption or performance differences. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
On 2019.02.05 04:04 Rafael J. Wysocki wrote: > On Friday, February 1, 2019 5:54:37 PM CET Doug Smythies wrote: >> On 2019.01.30 16:05 Rafael J. Wysocki wrote: >> >>> From: Rafael J. Wysocki >>> >>> The current iowait boosting mechanism in intel_pstate_update_util() >>> is quite aggressive, as it goes to the maximum P-state right away, >>> and may cause excessive amounts of energy to be used, which is not >>> desirable and arguably isn't necessary too. >>> >>> Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost >>> more energy efficient") that reworked the analogous iowait boost >>> mechanism in the schedutil governor and make the iowait boosting >>> in intel_pstate_update_util() work along the same lines. >>> >>> Signed-off-by: Rafael J. Wysocki >>> --- >>> drivers/cpufreq/intel_pstate.c | 46 >>> + >>> 1 file changed, 29 insertions(+), 17 deletions(-) >>> >>> Index: linux-pm/drivers/cpufreq/intel_pstate.c >>> === >>> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c >>> +++ linux-pm/drivers/cpufreq/intel_pstate.c >>> @@ -50,6 +50,8 @@ >>> #define int_tofp(X) ((int64_t)(X) << FRAC_BITS) >>> #define fp_toint(X) ((X) >> FRAC_BITS) >>> >>> +#define ONE_EIGHTH_FP ((int64_t)1 << (FRAC_BITS - 3)) >>> + >>> #define EXT_BITS 6 >>> #define EXT_FRAC_BITS (EXT_BITS + FRAC_BITS) >>> #define fp_ext_toint(X) ((X) >> EXT_FRAC_BITS) >>> @@ -1678,17 +1680,14 @@ static inline int32_t get_avg_pstate(str >>> static inline int32_t get_target_pstate(struct cpudata *cpu) >>> { >>> struct sample *sample = >sample; >>> - int32_t busy_frac, boost; >>> + int32_t busy_frac; >>> int target, avg_pstate; >>> >>> busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift, >>>sample->tsc); >>> >>> - boost = cpu->iowait_boost; >>> - cpu->iowait_boost >>= 1; >>> - >>> - if (busy_frac < boost) >>> - busy_frac = boost; >>> + if (busy_frac < cpu->iowait_boost) >>> + busy_frac = cpu->iowait_boost; >>> >>> sample->busy_scaled = busy_frac * 100; >>> >>> @@ -1767,22 +1766,35 @@ static void 
intel_pstate_update_util(str >>> if (smp_processor_id() != cpu->cpu) >>> return; >>> >>> + delta_ns = time - cpu->last_update; >>> if (flags & SCHED_CPUFREQ_IOWAIT) { >>> - cpu->iowait_boost = int_tofp(1); >>> - cpu->last_update = time; >>> - /* >>> -* The last time the busy was 100% so P-state was max anyway >>> -* so avoid overhead of computation. >>> -*/ >>> - if (fp_toint(cpu->sample.busy_scaled) == 100) >>> - return; >>> - >>> - goto set_pstate; >>> + /* Start over if the CPU may have been idle. */ >>> + if (delta_ns > TICK_NSEC) { >>> + cpu->iowait_boost = ONE_EIGHTH_FP; >>> + } else if (cpu->iowait_boost) { >>> + cpu->iowait_boost <<= 1; >>> + if (cpu->iowait_boost >= int_tofp(1)) { >>> + cpu->iowait_boost = int_tofp(1); >>> + cpu->last_update = time; >>> + /* >>> +* The last time the busy was 100% so P-state >>> +* was max anyway, so avoid the overhead of >>> +* computation. >>> +*/ >>> + if (fp_toint(cpu->sample.busy_scaled) == 100) >>> + return; >> >> Hi Rafael, >> >> By exiting here, the trace, if enabled, is also bypassed. >> We want the trace sample. > > Fair enough, but the return is there regardless of this patch. > > Maybe it should be fixed separately? O.K. >> Also, there is a generic: >> "If the target ptstate is the same as before, then don't set it" >> later on. >> Suggest to delete this test and exit condition. (I see that this early >> exit was done before also.) > > Well, exactly. > > It is not unreasonable to boost the frequency right away for an IO-waiter > without waiting for the next sample time IMO. I agree, but am just saying that it should include a trace sample, otherwise it is difficult to understand what happened. By the way, I forgot to mention before, I tried the patch and it does prevent CPU frequency spikes to maximum every few seconds in a very idle system. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
On 2019.01.30 16:05 Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > The current iowait boosting mechanism in intel_pstate_update_util() > is quite aggressive, as it goes to the maximum P-state right away, > and may cause excessive amounts of energy to be used, which is not > desirable and arguably isn't necessary too. > > Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost > more energy efficient") that reworked the analogous iowait boost > mechanism in the schedutil governor and make the iowait boosting > in intel_pstate_update_util() work along the same lines. > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpufreq/intel_pstate.c | 46 > + > 1 file changed, 29 insertions(+), 17 deletions(-) > > Index: linux-pm/drivers/cpufreq/intel_pstate.c > === > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > +++ linux-pm/drivers/cpufreq/intel_pstate.c > @@ -50,6 +50,8 @@ > #define int_tofp(X) ((int64_t)(X) << FRAC_BITS) > #define fp_toint(X) ((X) >> FRAC_BITS) > > +#define ONE_EIGHTH_FP ((int64_t)1 << (FRAC_BITS - 3)) > + > #define EXT_BITS 6 > #define EXT_FRAC_BITS (EXT_BITS + FRAC_BITS) > #define fp_ext_toint(X) ((X) >> EXT_FRAC_BITS) > @@ -1678,17 +1680,14 @@ static inline int32_t get_avg_pstate(str > static inline int32_t get_target_pstate(struct cpudata *cpu) > { > struct sample *sample = >sample; > - int32_t busy_frac, boost; > + int32_t busy_frac; > int target, avg_pstate; > > busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift, > sample->tsc); > > - boost = cpu->iowait_boost; > - cpu->iowait_boost >>= 1; > - > - if (busy_frac < boost) > - busy_frac = boost; > + if (busy_frac < cpu->iowait_boost) > + busy_frac = cpu->iowait_boost; > > sample->busy_scaled = busy_frac * 100; > > @@ -1767,22 +1766,35 @@ static void intel_pstate_update_util(str > if (smp_processor_id() != cpu->cpu) > return; > > + delta_ns = time - cpu->last_update; > if (flags & SCHED_CPUFREQ_IOWAIT) { > - cpu->iowait_boost = int_tofp(1); > - cpu->last_update = time; > 
> -		/*
> -		 * The last time the busy was 100% so P-state was max anyway
> -		 * so avoid overhead of computation.
> -		 */
> -		if (fp_toint(cpu->sample.busy_scaled) == 100)
> -			return;
> -
> -		goto set_pstate;
> +		/* Start over if the CPU may have been idle. */
> +		if (delta_ns > TICK_NSEC) {
> +			cpu->iowait_boost = ONE_EIGHTH_FP;
> +		} else if (cpu->iowait_boost) {
> +			cpu->iowait_boost <<= 1;
> +			if (cpu->iowait_boost >= int_tofp(1)) {
> +				cpu->iowait_boost = int_tofp(1);
> +				cpu->last_update = time;
> +				/*
> +				 * The last time the busy was 100% so P-state
> +				 * was max anyway, so avoid the overhead of
> +				 * computation.
> +				 */
> +				if (fp_toint(cpu->sample.busy_scaled) == 100)
> +					return;

Hi Rafael,

By exiting here, the trace, if enabled, is also bypassed. We want the trace sample.

Also, there is a generic "if the target pstate is the same as before, then don't set it" later on. Suggest to delete this test and exit condition. (I see that this early exit was done before also.)

> +
> +				goto set_pstate;
> +			}
> +		} else {
> +			cpu->iowait_boost = ONE_EIGHTH_FP;
> +		}
>  	} else if (cpu->iowait_boost) {
>  		/* Clear iowait_boost if the CPU may have been idle. */
> -		delta_ns = time - cpu->last_update;
>  		if (delta_ns > TICK_NSEC)
>  			cpu->iowait_boost = 0;
> +		else
> +			cpu->iowait_boost >>= 1;
>  	}
>  	cpu->last_update = time;
>  	delta_ns = time - cpu->sample.time;
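The boost ladder this patch introduces can be modeled in a few lines of userspace Python. FRAC_BITS is 8 in intel_pstate.c, so int_tofp(1) == 256 and ONE_EIGHTH_FP == 32; the TICK_NSEC value here assumes an HZ=250 kernel, and the whole thing is a sketch of the patch's logic, not the kernel code itself:

```python
# Userspace model of the reworked iowait boost: start at 1/8, double on
# each IO-wait wakeup (capped at 1), halve on non-IO-wait updates, and
# reset after a tick's worth of possible idle time.
FRAC_BITS = 8
ONE = 1 << FRAC_BITS                  # int_tofp(1)
ONE_EIGHTH_FP = 1 << (FRAC_BITS - 3)  # 32
TICK_NSEC = 4_000_000                 # 4 ms tick, assuming HZ=250

def update_boost(boost, delta_ns, iowait_flag):
    if iowait_flag:
        if delta_ns > TICK_NSEC:      # CPU may have been idle: start over
            return ONE_EIGHTH_FP
        if boost:
            return min(boost << 1, ONE)  # double, capped at full boost
        return ONE_EIGHTH_FP
    if boost:
        if delta_ns > TICK_NSEC:      # clear after a possible idle period
            return 0
        return boost >> 1             # otherwise decay by halving
    return 0

# Four back-to-back IO-wait wakeups climb 1/8 -> 1/4 -> 1/2 -> 1:
boost = 0
for _ in range(4):
    boost = update_boost(boost, 100_000, iowait_flag=True)
    print(boost / ONE)
```

This is the "less aggressive" part of the rework: a single IO-wait wakeup now only buys a 1/8 busy-fraction floor instead of an immediate jump to the maximum P-state.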
[PATCH] cpuidle: poll_state: Fix default time limit.
The default time is declared in units of microseconds, but is used as nanoseconds, resulting in significant accounting errors for idle state 0 time when all idle states deeper than 0 are disabled. Under these unusual conditions, we don't really care about the poll time limit anyhow.

Signed-off-by: Doug Smythies
---
 drivers/cpuidle/poll_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index b17d153..23a1b27 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -21,7 +21,7 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
 		unsigned int loop_count = 0;
-		u64 limit = TICK_USEC;
+		u64 limit = TICK_NSEC;
 		int i;
 
 		for (i = 1; i < drv->state_count; i++) {
-- 
2.7.4
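The magnitude of the unit mix-up fixed above is easy to see with a small sketch. HZ=250 is an assumed example configuration, and TICK_NSEC is approximated as a plain division (the kernel's definition rounds, which does not change the point):

```python
# TICK_USEC misread as nanoseconds: the poll limit ends up ~1000x too
# short.  HZ=250 is an assumed example config.
HZ = 250
USEC_PER_SEC = 1_000_000
NSEC_PER_SEC = 1_000_000_000

TICK_USEC = USEC_PER_SEC // HZ   # one tick in microseconds: 4000
TICK_NSEC = NSEC_PER_SEC // HZ   # one tick in nanoseconds: 4000000

# Interpreted as a nanosecond limit, the buggy TICK_USEC value cut
# polling off after 4 us instead of the intended one-tick 4 ms.
print(TICK_NSEC // TICK_USEC)
```

So with all deeper states disabled, poll_idle() bailed out of polling three orders of magnitude early, which is where the idle state 0 accounting errors came from.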
RE: [PATCH] cpuidle: New timer events oriented governor for tickless systems
Hi Rafael,

Just typos, not already pointed out by Quentin:

On 2018.12.18 03:09 Rafael J. Wysocki wrote:

> The venerable menu governor does some thigns that are quite

s/thigns/things

> time frame in which no timer wakeups are possible (becuase it knows

s/becuase/because

> +obtain the the *sleep length*, which is the time until the closest timer event

s/the the/the

> +increased when when the given idle state "matches" the sleep length only and the

s/when when/when

... Doug
[PATCH] tools/power/x86/intel_pstate_tracer: Fix non root execution for post processing a trace file.
This script is supposed to be allowed to run with regular user
privileges if a previously captured trace is being post processed.

Commit fbe313884d7ddd73ce457473cbdf3763f5b1d3da
(tools/power/x86/intel_pstate_tracer: Free the trace buffer memory)
introduced a bug that breaks that option.

Commit 35459105deb26430653a7299a86bc66fb4dd5773
(tools/power/x86/intel_pstate_tracer: Add optional setting of trace
buffer memory allocation) moved the code but kept the bug.

This patch fixes the issue.

Signed-off-by: Doug Smythies
---
 tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py b/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py
index 84e2b64..2fa3c57 100755
--- a/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py
+++ b/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py
@@ -585,9 +585,9 @@ current_max_cpu = 0
 
 read_trace_data(filename)
 
-clear_trace_file()
-# Free the memory
 if interval:
+    clear_trace_file()
+    # Free the memory
     free_trace_buffer()
 
 if graph_data_present == False:
-- 
2.7.4
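The shape of the fix is a simple guard pattern: only touch the root-owned trace infrastructure when this invocation captured a trace itself ("interval" set), so pure post processing works without root. A minimal sketch (the callable names are illustrative stand-ins, not the tracer's actual helpers):

```python
def finish(interval, clear_trace_file, free_trace_buffer):
    # Tear down tracing only if this run captured the trace itself.
    if interval:
        clear_trace_file()
        free_trace_buffer()
    # else: post-processing a previously saved file; nothing to clean up.

calls = []
# Post-processing mode: no interval given, no root-only operations attempted.
finish(None, lambda: calls.append("clear"), lambda: calls.append("free"))
print(calls)  # []
# Capture mode: interval given, cleanup runs.
finish(10, lambda: calls.append("clear"), lambda: calls.append("free"))
print(calls)  # ['clear', 'free']
```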
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2018.12.11 03:50 Rafael J. Wysocki wrote:

...[snip]...

> With this version of the TEO governor I'm observing slight, but consistent
> performance improvements in some different benchmarks on a few different
> systems with respect to menu

Same here.

> and the corresponding power draw differences
> appear to be in the noise. The idle power doesn't seem to change either.

Same here.

> Also the "above" and "below" metrics (introduced by a separate patch under
> discussion) indicate changes in the "good" direction.

My graphs now include the "above" and "below" metrics. In particular see
the Idle State 1 "above" (was too deep) graphs in the links below.
However, performance is up and power about the same, so O.K.

> Overall, I like this one, so I may consider dropping the RFC/RFT tag from
> the next submission. :-)
>
> v7 -> v8:
>  * Apply the selection rules to the idle deepest state as well as to
>    the shallower ones (the deepest idle state was treated differently
>    before by mistake).
>  * Subtract 1/2 of the exit latency from the measured idle duration
>    in teo_update() (instead of subtracting the entire exit latency).
>    This makes the idle state selection be slightly more performance-
>    oriented.

I cherry picked a couple of the mmtests that Giovanni was doing.

Test kernels:

"stock" : kernel 4.20-rc5 + a couple of rjw patches. Specifically:
  2a110ed cpuidle: poll_state: Disregard disable idle states
  8f09875 cpuidle: Add 'above' and 'below' idle state metrics
  d6851a5 Documentation: admin-guide: PM: Add cpuidle document
  2595646 Linux 4.20-rc5
"teov6" : above + teov6 patch
"teov7" : above + teov7 patch
"teov8" : above + teov8 patch

1.)
mmtests - netperf-unbound test (UDP):

                           4.20-rc5              4.20-rc5              4.20-rc5              4.20-rc5
                              stock                  teo6                  teo7                  teo8
Hmean     send-64       129.64 (  0.00%)      132.45 *  2.17%*      130.55 *  0.71%*      132.87 *  2.49%*
Hmean     send-128      259.53 (  0.00%)      264.90 *  2.07%*      261.61 *  0.80%*      264.94 *  2.09%*
Hmean     send-256      515.24 (  0.00%)      525.41 *  1.97%*      517.41 *  0.42%*      524.88 *  1.87%*
Hmean     send-1024    2041.01 (  0.00%)     2079.22 *  1.87%*     2045.03 *  0.20%*     2077.25 *  1.78%*
Hmean     send-2048    3980.04 (  0.00%)     4071.09 *  2.29%*     4041.15 *  1.54%*     4057.09 *  1.94%*
Hmean     send-3312    6321.90 (  0.00%)     6460.23 *  2.19%*     6395.71 *  1.17%*     6409.09 *  1.38%*
Hmean     send-4096    7695.18 (  0.00%)     7882.81 *  2.44%*     7813.72 *  1.54%*     7832.77 *  1.79%*
Hmean     send-8192   13920.53 (  0.00%)    14146.47 *  1.62%*    13986.72 *  0.48%*    14079.07 *  1.14%*
Hmean     send-16384  24714.99 (  0.00%)    25225.95 *  2.07%*    24896.10 *  0.73%*    25131.52 *  1.69%*
Hmean     recv-64       129.64 (  0.00%)      132.45 *  2.17%*      130.55 *  0.71%*      132.87 *  2.49%*
Hmean     recv-128      259.53 (  0.00%)      264.90 *  2.07%*      261.61 *  0.80%*      264.94 *  2.09%*
Hmean     recv-256      515.24 (  0.00%)      525.41 *  1.97%*      517.41 *  0.42%*      524.88 *  1.87%*
Hmean     recv-1024    2041.01 (  0.00%)     2079.22 *  1.87%*     2045.03 *  0.20%*     2077.25 *  1.78%*
Hmean     recv-2048    3980.04 (  0.00%)     4071.09 *  2.29%*     4041.15 *  1.54%*     4057.09 *  1.94%*
Hmean     recv-3312    6321.88 (  0.00%)     6460.23 *  2.19%*     6395.71 *  1.17%*     6409.09 *  1.38%*
Hmean     recv-4096    7695.15 (  0.00%)     7882.81 *  2.44%*     7813.72 *  1.54%*     7832.75 *  1.79%*
Hmean     recv-8192   13920.52 (  0.00%)    14146.43 *  1.62%*    13986.72 *  0.48%*    14079.07 *  1.14%*
Hmean     recv-16384  24714.99 (  0.00%)    25225.90 *  2.07%*    24896.07 *  0.73%*    25131.49 *  1.69%*

Graphs: http://www.smythies.com/~doug/linux/idle/teo8/net-pref-udp-unbound/index.html
(note: slow upload speed from my server)

2.)
mmtests - sockperf-udp-throughput test:

                    4.20-rc5            4.20-rc5            4.20-rc5            4.20-rc5
                       stock                teo6                teo7                teo8
Hmean     14       24.57 (  0.00%)     25.91 *  5.46%*     25.99 *  5.78%*     25.73 *  4.75%*
Hmean     100     175.37 (  0.00%)    185.09 *  5.54%*    185.89 *  6.00%*    184.48 *  5.19%*
Hmean     300     523.81 (  0.00%)    553.47 *  5.66%*    554.70 *  5.90%*    550.16 *  5.03%*
Hmean     500     870.08 (  0.00%)    918.88 *  5.61%*    924.33 *  6.24%*    914.53 *  5.11%*
Hmean     850    1449.44 (  0.00%)   1530.84 *  5.62%*   1535.40 *  5.93%*   1522.53 *  5.04%*

Graphs:
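The "Hmean" rows in the tables above are harmonic means across test iterations, reported with a percentage change relative to the baseline ("stock") column. A minimal sketch of that computation, using made-up per-iteration throughput samples (not numbers from the runs above):

```python
def hmean(xs):
    """Harmonic mean, as used for mmtests' Hmean rows."""
    return len(xs) / sum(1.0 / x for x in xs)

# Illustrative per-iteration throughputs for a baseline and a candidate:
baseline = hmean([129.0, 130.0, 130.0])
candidate = hmean([132.0, 133.0, 132.5])

# The tables report the candidate's percentage change vs. the baseline:
pct_change = 100.0 * (candidate - baseline) / baseline
print("%.2f%%" % pct_change)
```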
RE: intel_pstate: Lowest frequency not reached with Intel i7-6700
On 2018.12.13 04:42 Paul Menzel wrote:
> On 12/13/18 11:39, Rafael J. Wysocki wrote:
>> On Thu, Dec 13, 2018 at 10:54 AM Paul Menzel wrote:
>>> On 12/13/18 00:06, Doug Smythies wrote:
>>>> On 2018.12.12 13:40 Paul Menzel wrote:
>>>>
>>>>> Using *powersave* as P-state selection algorithm, on an idle system
>>>>
>>>> Define "idle system".
>>>> If your computer is running a GUI, or is even a server without a GUI
>>>> but with many services running, then "idle" really isn't.
>>>> Below is from my test server, with many services disabled, so
>>>> "idle" really is quite "idle":
>>>>
>>>> doug@s15:~/temp$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 15
>>>> Busy%   Bzy_MHz PkgTmp  PkgWatt
>>>> 0.01    1608    27      3.71
>>>> 0.01    1619    27      3.71
>>>> 0.01    1600    28      3.71
>>>> 0.01    1600    28      3.70
>>>>
>>>> Note that p state 16 (1600 MHz) is the minimum for my older i7-2600k
>>>> processor.
>>>
>>> The thing is, on an Intel Kaby Lake laptop with Ubuntu 18.10 and GNOME
>>> running, it goes down to the lowest listed frequency.
>
> Checking the numbers again, I was mistaken. The lowest possible frequency
> of the Intel Kaby Lake i7-7500U in that laptop is 400 MHz, and it is
> going down to 600 MHz. Busy% from turbostat is 0.3 to 0.4.
>
>> Kaby Lake has hardware-managed P-states (HWP) which is a different mechanism.
>
> Isn’t HWP also available for the 6th generation?
>
> $ dmesg | grep intel_pstate
> [    2.092456] intel_pstate: Intel P-state driver initializing
> [    2.094820] intel_pstate: HWP enabled

Yes, you have HWP. I do not have HWP. I don't think HWP or not is
particularly relevant to this discussion.

>>>>> Shouldn’t it go down until 800 MHz?
>>>>
>>>> We would need some actual busy information, turbostat is the
>>>> recommended tool, to know for sure.
>>>
>>> Here you go.
>>>
>>> ```
>>> tools/power/x86/turbostat> sudo ./turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 15
>>> Busy%   Bzy_MHz PkgTmp  PkgWatt
>>> 3.59    1167    31      1.68
>>> 3.21    903     31      1.34
>>> 3.21    906     31      1.34
>>> 3.27    901     31      1.35
>>> 8.23    2715    30      2.32  ← stopping GDM (systemctl stop gdm)
>>> 2.95    915     30      1.18
>>> 2.91    906     30      1.18
>>> 2.92    903     30      1.17
>>> 2.90    900     29      1.17
>>> 2.89    903     29      1.18
>>> 2.91    903     30      1.18
>>> 2.89    903     29      1.18
>>> 2.89    900     29      1.18
>>> 2.90    903     30      1.18
>>> 2.90    903     29      1.17
>>> 2.90    903     29      1.17
>>> 2.90    900     29      1.16
>>> 2.90    903     29      1.14
>>> 2.90    903     28      1.11
>>> 2.90    903     29      1.10
>>> 2.91    900     29      1.16
>>> 2.91    903     29      1.14
>>> 2.90    903     29      1.12
>>> 2.90    903     29      1.16
>>> 2.90    900     28      1.17
>>> 2.92    903     29      1.16
>>> 2.90    903     29      1.16
>>> 2.90    903     29      1.16
>>> ```
>>>
>>> 800 MHz should be enough to keep GDM running, shouldn’t it?
>>
>> Well, depending.
>>
>>> Otherwise only SSH was running.
>>
>> There obviously is something that causes it to stay at 900 MHz.

Agreed. Clearly some more stuff was using some CPU.

The resulting requested CPU frequency is also highly dependent on the
frequency of the load. Example:

doug@s15:~/csv-plot$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 60
Busy%   Bzy_MHz PkgTmp  PkgWatt
0.08    2567    27      3.77  <<< Note the high active CPU frequency with virtually no load.
0.01    1600    26      3.69
0.01    1600    27      3.70
0.83    1600    30      4.18
1.80    1600    30      4.59  <<< This load is at 435 Hertz work / sleep frequency
1.80    1600    32      4.59
0.74    3033    27      4.49  <<< This load is at 1 Hertz work / sleep frequency
0.76    3380    27      4.63
0.76    3384    28      4.62
0.38    3356    28      4.16
0.01    1615    28      3.71
0.01    1609    27      3.70

> It’s not obvious to me, but you have more experience. It’d expect
> to at least one core(?) to go down to 800 MHz and not all to stay
> at 900 MHz.

There is only PLL (Phase Locked Loop). The
RE: intel_pstate: Lowest frequency not reached with Intel i7-6700
On 2018.12.12 13:40 Paul Menzel wrote:

> Using *powersave* as P-state selection algorithm, on an idle system

Define "idle system".
If your computer is running a GUI, or is even a server without a GUI
but with many services running, then "idle" really isn't.
Below is from my test server, with many services disabled, so
"idle" really is quite "idle":

doug@s15:~/temp$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 15
Busy%   Bzy_MHz PkgTmp  PkgWatt
0.01    1608    27      3.71
0.01    1619    27      3.71
0.01    1600    28      3.71
0.01    1600    28      3.70

Note that p state 16 (1600 MHz) is the minimum for my older i7-2600k
processor.

> Shouldn’t it go down until 800 MHz?

We would need some actual busy information, turbostat is the
recommended tool, to know for sure.

... Doug
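Turbostat's `--Summary` output, as requested with `--show` above, is plain whitespace-separated columns, so it is easy to post-process. A small hedged sketch parsing captured text into per-interval records (the sample text is from the reply above; the parsing approach is illustrative, not part of turbostat itself):

```python
sample = """\
Busy%   Bzy_MHz PkgTmp  PkgWatt
0.01    1608    27      3.71
0.01    1619    27      3.71
0.01    1600    28      3.71
0.01    1600    28      3.70"""

# Split the header into column names, then zip each data line with them.
header, *lines = sample.splitlines()
names = header.split()
rows = [dict(zip(names, map(float, line.split()))) for line in lines]

print(rows[0]["Bzy_MHz"])  # 1608.0
```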
RE: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On 2018.12.10 02:52 Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra wrote:
>>> Would not a tracepoint be better?; then there is no overhead in the
>>> normal case where nobody gives a crap about these here numbers.
>>
>> There is an existing tracepoint that in principle could be used to
>> produce this information, but it is such a major PITA in practice that
>> nobody does that. Guess why. :-)
>
> Sounds like you need to ship a convenient script or something :-)

For the histogram plots of idle durations that I sometimes provide,
trace is used. It is more work to do it the trace way. Very often, when
the rate of idle state entries/exits is high, turning on trace
influences the system under test significantly. Also, even if I allocate
the majority of my memory to the trace buffer (i.e. 15 of my 16
gigabytes), I can only acquire a few minutes of trace data before
filling the buffer. Some of my tests run for hours, and these new
counters provide a way to acquire potentially useful (I don't have
enough experience with them yet to know how useful) information, while
having no influence on the system under test, because I only take
samples once per minute, or sometimes 4 times per minute.

>> Also, the "usage" and "time" counters are there in sysfs, so why not
>> these two?

I agree; how are these two counters any different?

In about a week or so, I'll have some test data comparing 4.20-rc5 with
teov6 and teov7, along with the idle data (graphs) that I usually
provide, and also these new counters.

... Doug
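The low-overhead sampling approach described above can be sketched as a small script that reads the per-state cpuidle counters from sysfs periodically instead of running trace. The sysfs paths follow the standard cpuidle layout; the pattern argument is there only so the function can be exercised outside a live system:

```python
import glob

def read_idle_counters(pattern="/sys/devices/system/cpu/cpu*/cpuidle/state*/usage"):
    """Return {path: count} for every matching counter file."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            counts[path] = int(f.read())
    return counts

# Sampling once per minute, as in the tests described above, would be:
#   while True:
#       snapshot = read_idle_counters()
#       ... log or diff the snapshot ...
#       time.sleep(60)
```

The same pattern works for the `time`, and (with the patch under discussion) `above`/`below` files; diffing two snapshots gives per-interval counts with no effect on the system under test.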
RE: [PATCH] cpuidle: Add 'high' and 'low' idle state metrics
On 2018.12.06 01:09 Rafael J. Wysocki wrote:
> On Thu, Dec 6, 2018 at 12:08 AM Doug Smythies wrote:
>> On 2018.12.03 04:32 Rafael J. Wysocki wrote:
>>
>>> Add two new metrics for CPU idle states, "high" and "low", to count
>>> the number of times the given state had been asked for (or entered
>>> from the kernel's perspective), but the observed idle duration turned
>>> out to be too high or too low for it (respectively).
>>
>> I wonder about the "high" "low" terminology here.
>
> I took these names, because they are concise and simple. I could use
> "below" and "above" respectively I guess. What about these?

I see you already sent a new patch with these names. Yes, I like them
better myself. I am going to try to add these counts to my next sets
of graphs.

... Doug
RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On 2018.12.08 02:23 Giovanni Gherdovich wrote:

> sorry for the late reply, this week I was traveling.

No problem. Thanks very much for your very detailed reply, which
obviously took considerable time to write. While I was making progress,
your instructions really fill in some gaps and mistakes I was making.
Eventually (probably several days) I'll report back my test results.

> Some specific remarks you raise:
>
> On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
>> ...
>> My issue is that I do not understand the output or how it
>> might correlate with your tables.
>>
>> I get, for example:
>>
>>    31  1  0.13s  0.68s  0.80s  1003894.302  1003779.613
>>    31  1  0.16s  0.64s  0.80s  1008900.053  1008215.336
>>    31  1  0.14s  0.66s  0.80s  1009630.439  1008990.265
>> ...
>>
>> But I don't know what that means, nor have I been able to find
>> a description anywhere.
>
> I don't recognize this output. I hope the illustration above can
> clarify how MMTests is used.

Due to incompetence on my part, the config file being run for my tests
was always just the default config file from my original
git clone https://github.com/gormanm/mmtests.git command. So regardless
of what I thought I was doing, I was running "pft" (Page Fault Test).

... Doug