Re: [RFC v4 0/2] CPU-Idle latency selftest framework
Hi Pratik,

V4 seems fine. Thank you.

On Mon, Apr 12, 2021 at 12:43 AM Pratik Rajesh Sampat wrote:
>
> Changelog v3-->v4
> Based on review comments by Doug Smythies,
> 1. Parsing the thread_siblings_list for CPU topology information to
>    correctly identify the cores the test should run on in
>    default(quick) mode.
> 2. The source CPU to source CPU interaction in the IPI test will always
>    result in a lower latency and cause a bias in the average, hence
>    avoid adding the latency to be averaged for same cpu IPIs. The
>    latency will still be displayed in the detailed logs.
>
> RFC v3: https://lkml.org/lkml/2021/4/4/31

Example output for an Intel i5-10600K, HWP active, performance mode.
System very idle:

$ sudo ./cpuidle.sh -v -i
Inserting /lib/modules/5.12.0-rc7-prs-v4/kernel/drivers/cpuidle/test-cpuidle_latency.ko module
--IPI Latency Test---
Baseline Avg IPI latency(ns): 686
Observed Avg IPI latency(ns) - State 0: 468
Observed Avg IPI latency(ns) - State 1: 956
Observed Avg IPI latency(ns) - State 2: 17936
Observed Avg IPI latency(ns) - State 3: 17968
--Timeout Latency Test--
Baseline Avg timeout diff(ns): 445
Observed Avg timeout diff(ns) - State 0: 377
Observed Avg timeout diff(ns) - State 1: 630
Observed Avg timeout diff(ns) - State 2: 322812
Observed Avg timeout diff(ns) - State 3: 306067
Removing /lib/modules/5.12.0-rc7-prs-v4/kernel/drivers/cpuidle/test-cpuidle_latency.ko module
Full Output logged at: cpuidle.log

$ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/residency
/sys/devices/system/cpu/cpu7/cpuidle/state0/residency:0
/sys/devices/system/cpu/cpu7/cpuidle/state1/residency:1
/sys/devices/system/cpu/cpu7/cpuidle/state2/residency:360
/sys/devices/system/cpu/cpu7/cpuidle/state3/residency:3102

$ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/latency
/sys/devices/system/cpu/cpu7/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu7/cpuidle/state1/latency:1
/sys/devices/system/cpu/cpu7/cpuidle/state2/latency:120
/sys/devices/system/cpu/cpu7/cpuidle/state3/latency:1034

... Doug
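The grep output above is all the per-state data the test consumes from sysfs. A minimal Python sketch (illustrative only; the selftest itself is a shell script) that gathers the same values for one CPU. The sysfs_root parameter is my addition, purely so the function can be exercised against a fake tree:

```python
from pathlib import Path

def cpuidle_states(cpu, sysfs_root="/sys"):
    """Collect each idle state's exit latency and target residency
    (both in microseconds) for one CPU, as grep'd from sysfs above."""
    base = Path(sysfs_root) / f"devices/system/cpu/cpu{cpu}/cpuidle"
    states = []
    # Sort numerically so state10 does not land before state2.
    for d in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        states.append({
            "state": d.name,
            "latency_us": int((d / "latency").read_text()),
            "residency_us": int((d / "residency").read_text()),
        })
    return states
```

On the i5-10600K output above this would return four states for cpu7, with latencies 0, 1, 120 and 1034 us.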
Re: [RFC v3 0/2] CPU-Idle latency selftest framework
On Fri, Apr 9, 2021 at 12:43 AM Pratik Sampat wrote:
> On 09/04/21 10:53 am, Doug Smythies wrote:
> > I tried V3 on a Intel i5-10600K processor with 6 cores and 12 CPUs.
> > The core to cpu mappings are:
> > core 0 has cpus 0 and 6
> > core 1 has cpus 1 and 7
> > core 2 has cpus 2 and 8
> > core 3 has cpus 3 and 9
> > core 4 has cpus 4 and 10
> > core 5 has cpus 5 and 11
> >
> > By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
> > Wouldn't it make more sense to test each core once?
>
> Ideally it would be better to run on all the CPUs, however on larger systems
> that I'm testing on, with hundreds of cores and a high thread count, the
> execution time increases while not particularly bringing any additional
> information to the table.
>
> That is why it made sense to run on only one of the threads of each core, to
> make the experiment faster while preserving accuracy.
>
> To handle various thread topologies it may be worthwhile to parse
> /sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
> use this information to run only once per physical core, rather than
> assuming the topology.
>
> What are your thoughts on a mechanism like this?

Yes, seems like a good solution.

... Doug
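The mechanism Pratik proposes can be sketched as follows (illustrative Python, not the selftest's shell implementation; the helper names are mine): expand each CPU's thread_siblings_list and keep only the lowest-numbered CPU of every sibling group, giving one CPU per physical core.

```python
def parse_cpu_list(s):
    """Expand a sysfs CPU list such as '0,6' or '0-3,8' into CPU numbers."""
    cpus = []
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def one_cpu_per_core(sibling_lists):
    """Keep the lowest-numbered CPU of each thread-sibling group, so the
    test runs once per physical core regardless of thread topology.
    sibling_lists maps a CPU number to the contents of its
    topology/thread_siblings_list file."""
    chosen = set()
    for siblings in sibling_lists.values():
        chosen.add(min(parse_cpu_list(siblings)))
    return sorted(chosen)
```

On the i5-10600K topology listed above (cpu0 and cpu6 sharing core 0, and so on), this selects CPUs 0 through 5, one per core.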
Re: [RFC v3 0/2] CPU-Idle latency selftest framework
Hi Pratik,

I tried V3 on a Intel i5-10600K processor with 6 cores and 12 CPUs.
The core to cpu mappings are:
core 0 has cpus 0 and 6
core 1 has cpus 1 and 7
core 2 has cpus 2 and 8
core 3 has cpus 3 and 9
core 4 has cpus 4 and 10
core 5 has cpus 5 and 11

By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
Wouldn't it make more sense to test each core once?

With the source CPU always 0, I think the results from the destination
CPUs 0 and 6, on core 0, bias the results, at least in the deeper idle
states. They don't make much difference in the shallow states. Myself,
I wouldn't include them in the results.

Example, where I used the -v option for all CPUs:

--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          101
0         1          790
0         2          609
0         3          595
0         4          737
0         5          759
0         6          780
0         7          741
0         8          574
0         9          681
0         10         527
0         11         552
Baseline Avg IPI latency(ns): 620     suggest 656 here
---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          471
0         2          420
0         3          462
0         4          454
0         5          468
0         6          453
0         7          473
0         8          380
0         9          483
0         10         492
0         11         454
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 423     suggest 456 here
---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          112
0         1          866
0         2          663
0         3          851
0         4          1090
0         5          1314
0         6          1941
0         7          1458
0         8          687
0         9          802
0         10         1041
0         11         1284
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1009     suggest 1006 here
---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          75
0         1          16362
0         2          16785
0         3          19650
0         4          17356
0         5          17606
0         6          2217
0         7          17958
0         8          17332
0         9          16615
0         10         17382
0         11         17423
Expected IPI latency(ns): 12
Observed Avg IPI latency(ns) - State 2: 14730     suggest 17447 here
---Enabling state: 3---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          103
0         1          17416
0         2          17961
0         3          16651
0         4          17867
0         5          17726
0         6          2178
0         7          16620
0         8          20951
0         9          16567
0         10         17131
0         11         17563
Expected IPI latency(ns): 1034000
Observed Avg IPI latency(ns) - State 3: 14894     suggest 17645 here

Hope this helps.

... Doug
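The exclusion Doug suggests is easy to state: drop destinations that share a core with the source CPU before averaging. A small sketch (function name is mine), fed with the state-0 numbers from the table above; with CPUs 0 and 6 excluded it reproduces the suggested 456.

```python
from statistics import mean

def avg_ipi_latency(samples, skip_dests):
    """Average IPI latency over (dest_cpu, latency_ns) pairs, excluding
    destinations that share a core with the source CPU, since those
    same-core IPIs are biased low as described above."""
    kept = [lat for dest, lat in samples if dest not in skip_dests]
    return round(mean(kept))

# State-0 numbers from the table above; on this i5-10600K, CPUs 0 and 6
# share core 0 with the source CPU.
state0 = [(0, 76), (1, 471), (2, 420), (3, 462), (4, 454), (5, 468),
          (6, 453), (7, 473), (8, 380), (9, 483), (10, 492), (11, 454)]
```

Calling avg_ipi_latency(state0, {0, 6}) gives 456, matching the suggested value (rounded rather than truncated).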
Re: [RFC v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Pratik,

On Thu, Apr 1, 2021 at 4:45 AM Pratik Rajesh Sampat wrote:
> ...
> To run this test specifically:
> $ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

I have not become any smarter than I was with version 1, and still
assumed that the "$" meant regular user. Please put it as "#" or
separate the two steps, compile and run.

>
> There are a few optinal arguments too that the script can take

optional

Suggest to also specifically mention how to run without re-compile:

# ./cpuidle.sh -v

Note also that the test still leaves all idle states disabled when done.

> [-h ]
> [-i ]
> [-m ]
> [-o ]
> [-v (run on all cpus)]
> Default Output location in: tools/testing/selftest/cpuidle/cpuidle.log

...

> +cpu_is_online()
> +{
> +	cpu=$1
> +	if [ ! -f "/sys/devices/system/cpu/cpu$cpu/online" ]; then
> +		echo 0

Incorrect. Should be:

> +		echo 1

... Doug
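The quoted fix inverts the fallback: a CPU with no "online" control file (the boot CPU on many systems) cannot be offlined and is therefore always online, so the helper must report 1, not 0. The same logic, sketched in Python for clarity (the selftest helper itself is shell; the sysfs_root parameter is mine, for testability):

```python
import os

def cpu_is_online(cpu, sysfs_root="/sys"):
    """Return whether a CPU is online.  A CPU without an 'online' control
    file cannot be offlined, so it must be treated as online -- this is
    the 'echo 1' correction above; the original 'echo 0' wrongly reported
    such a CPU as offline."""
    path = os.path.join(sysfs_root, f"devices/system/cpu/cpu{cpu}/online")
    if not os.path.exists(path):
        return True
    with open(path) as f:
        return f.read().strip() == "1"
```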
Re: [3/3,v3] tools/power turbostat: Enable accumulate RAPL display
Just resending to previously missed people who should also test this.
(See other e-mail: Re: turbostat: Fix Pkg Power on Zen)

On Sat, Mar 13, 2021 at 5:49 AM youling 257 wrote:
>
> test this patch, turbostat can work.
>
> 2021-03-12 21:41 GMT+08:00, Chen Yu :
> > Hi Youling, Bas, and Bingsong,
> > On Wed, Mar 10, 2021 at 04:03:31PM -0800, Doug Smythies wrote:
> >> Hi Yu,
> >>
> >> I am just resending your e-mail, adjusting the "To:" list to
> >> include the 3 others that have submitted similar patches.
> >>
> >> ... Doug
> >>
> > Could you please help check if the following combined patch works?
> >
> > Thanks,
> > Chenyu
> >
> >
> > From 00e0622b1b693a5c7dc343aeb3aa51614a9e125e Mon Sep 17 00:00:00 2001
> > From: Bas Nieuwenhuizen
> > Date: Fri, 12 Mar 2021 21:27:40 +0800
> > Subject: [PATCH] tools/power/turbostat: Fix turbostat for AMD Zen CPUs
> >
> > It was reported that on a Zen+ system turbostat started exiting,
> > which was tracked down to the MSR_PKG_ENERGY_STAT read failing because
> > offset_to_idx wasn't returning a non-negative index.
> >
> > This patch combines the modifications from Bingsong Si and
> > Bas Nieuwenhuizen and adds the MSR to the index system as an
> > alternative for MSR_PKG_ENERGY_STATUS.
> >
> > Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL display")
> > Reported-by: youling257
> > Co-developed-by: Bingsong Si
> > Signed-off-by: Chen Yu
> > ---
> >  tools/power/x86/turbostat/turbostat.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > index a7c4f0772e53..a7c965734fdf 100644
> > --- a/tools/power/x86/turbostat/turbostat.c
> > +++ b/tools/power/x86/turbostat/turbostat.c
> > @@ -297,7 +297,10 @@ int idx_to_offset(int idx)
> >
> >  	switch (idx) {
> >  	case IDX_PKG_ENERGY:
> > -		offset = MSR_PKG_ENERGY_STATUS;
> > +		if (do_rapl & RAPL_AMD_F17H)
> > +			offset = MSR_PKG_ENERGY_STAT;
> > +		else
> > +			offset = MSR_PKG_ENERGY_STATUS;
> >  		break;
> >  	case IDX_DRAM_ENERGY:
> >  		offset = MSR_DRAM_ENERGY_STATUS;
> > @@ -326,6 +329,7 @@ int offset_to_idx(int offset)
> >
> >  	switch (offset) {
> >  	case MSR_PKG_ENERGY_STATUS:
> > +	case MSR_PKG_ENERGY_STAT:
> >  		idx = IDX_PKG_ENERGY;
> >  		break;
> >  	case MSR_DRAM_ENERGY_STATUS:
> > @@ -353,7 +357,7 @@ int idx_valid(int idx)
> >  {
> >  	switch (idx) {
> >  	case IDX_PKG_ENERGY:
> > -		return do_rapl & RAPL_PKG;
> > +		return do_rapl & (RAPL_PKG | RAPL_AMD_F17H);
> >  	case IDX_DRAM_ENERGY:
> >  		return do_rapl & RAPL_DRAM;
> >  	case IDX_PP0_ENERGY:
> > --
> > 2.25.1
> >
> >
Re: turbostat: Fix Pkg Power on Zen
On Wed, Mar 24, 2021 at 5:38 AM Salvatore Bonaccorso wrote:
> On Mon, Mar 15, 2021 at 10:54:24PM +0100, Christian Kastner wrote:
> > On 01.02.21 10:01, Kurt Garloff wrote:
> > > Issue persists on Ryzen in 5.11-rc6:
> > > kvmadmin@KurtSrv2018(//):~ [0]$ sudo
> > > /casa/src/linux-stable/tools/power/x86/turbostat/turbostat

... deleted stuff that doesn't display properly ...

> > I was seeing the same issue (no stats, program just exits with 243), and
> > Kurt's simple patch resolved it for me.
>
> Does Kurt's patch seem good to you and can it be applied, or is there
> anything missing?

There were multiple patch submissions. Chen Yu kindly merged them into
one, which was put out for testing a couple of weeks ago.
Try it and report back. I'll forward it in a moment.
Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
On Wed, Mar 17, 2021 at 11:44 PM Pratik Sampat wrote:
>
> Hi Doug,
> Thanks for trying these patches out.
>
> On 18/03/21 2:30 am, Doug Smythies wrote:
> > Hi Pratik,
> >
> > It just so happens that I have been trying Artem's version this last
> > week, so I tried yours.
> >
> > On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat wrote:
> > ...
...
> > Other notes:
> >
> > No idle state for CPU 0 ever gets disabled.
> > I assume this is because CPU 0 can never be offline,
> > so that bit of code (Disable all stop states) doesn't find its state.
> > By the way, processor = Intel i5-9600K
>
> I had tried these patches on an IBM POWER 9 processor and disabling
> CPU0's idle state works there. However, it does make sense for some
> processors to treat CPU 0 differently.
> Maybe I could write in a case: if idle state disabling fails for a CPU,
> then we just skip it?

I didn't try it, I just did a hack so I could continue for this reply.

> > The system is left with all idle states disabled, well not for CPU 0
> > as per the above comment. The suggestion is to restore them,
> > otherwise my processor hogs 42 watts instead of 2.
> >
> > My results are highly variable per test.
>
> Question: Do you notice high variability with the IPI test, the Timer
> test, or both?

The IPI test has less variability than the Timer test.

> I can think of two reasons for high run to run variance:
>
> 1. If you observe variance in timer tests, then I believe there could be
> a mechanism of "C-state pre-wake" on some Intel machines at play here,
> which can pre-wake a CPU from an idle state when timers are armed. I'm
> not sure if the Intel platform that you're running on does that or not.
>
> Artem had described this behavior to me a while ago and I think his wult
> page describes this behavior in more detail:
> https://intel.github.io/wult/#c-state-pre-wake

Yes, I have reviewed all the references. And yes, I think my processors
have the pre-wake stuff. I do not have the proper hardware to do the
Artem pre-wake workaround method, but might buy it in future.

> 2. I have noticed variability in results when there are kernel
> book-keeping or jitter tasks scheduled from time to time on an otherwise
> idle core.
> In the full per-CPU logs at tools/testing/selftests/cpuidle/cpuidle.log
> can you spot any obvious outliers per CPU state?

Yes. I'll just paste in an example cpuidle.log file having used the -v
option below, along with my hack job diff.

doug@s19:~/temp-k-git/linux/tools/testing/selftests/cpuidle$ cat cpuidle.log.v3-1
--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          140
0         1          632
0         2          675
0         3          671
0         4          675
0         5          767
0         6          653
0         7          826
0         8          819
0         9          615
0         10         758
0         11         758
Baseline Avg IPI latency(ns): 665
---Enabling state: 0---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          76
0         1          484
0         2          494
0         3          539
0         4          498
0         5          491
0         6          474
0         7          434
0         8          544
0         9          476
0         10         447
0         11         467
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 452
---Enabling state: 1---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          72
0         1          1081
0         2          821
0         3          1486
0         4          1022
0         5          960
0         6          1634
0         7          933
0         8          1032
0         9          1046
0         10         1430
0         11         1338
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1071
---Enabling state: 2---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          264
0         1          30836
0         2          30562
0         3          30748
0         4          35286
0         5          30978
0         6          1952
0         7          36066
0         8          30670
0         9          30605
0         10         30635
0         11         35423
Expected IPI latency(ns): 12
Observed Avg IPI latency(ns) - State 2: 27002
---Enabling state: 3---
SRC_CPU   DEST_CPU   IPI_Latency(ns)
0         0          71
0         1          30853
0         2          32095
0         3          32661
0         4          30230
0         5          34348
0         6          2012
0         7          30816
0         8          30908
0         9          31130
0
Re: [RFC 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Pratik,

It just so happens that I have been trying Artem's version this last
week, so I tried yours.

On Mon, Mar 15, 2021 at 4:49 AM Pratik Rajesh Sampat wrote:
> ...
> To run this test specifically:
> $ make -C tools/testing/selftests TARGETS="cpuidle" run_tests

While I suppose it should have been obvious, I interpreted the "$"
sign to mean I could run as a regular user, which I can not.

> There are a few optional arguments too that the script can take
> [-h ]
> [-m ]
> [-o ]
> [-v (run on all cpus)]
> Default Output location in: tools/testing/cpuidle/cpuidle.log

Isn't it: tools/testing/selftests/cpuidle/cpuidle.log ?
At least, that is where my file was.

Other notes:

No idle state for CPU 0 ever gets disabled.
I assume this is because CPU 0 can never be offline,
so that bit of code (Disable all stop states) doesn't find its state.
By the way, processor = Intel i5-9600K

The system is left with all idle states disabled, well not for CPU 0
as per the above comment. The suggestion is to restore them,
otherwise my processor hogs 42 watts instead of 2.

My results are highly variable per test.

My system is very idle:
Example (from turbostat at 6 seconds sample rate):

Busy%  Bzy_MHz  IRQ  PkgTmp  PkgWatt  RAMWatt
0.03   4600     153  28      2.03     1.89
0.01   4600     103  29      2.03     1.89
0.05   4600     115  29      2.08     1.89
0.01   4600     95   28      2.09     1.89
0.03   4600     114  28      2.11     1.89
0.01   4600     107  29      2.07     1.89
0.02   4600     102  29      2.11     1.89
...

... Doug
Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
On Fri, Mar 12, 2021 at 2:16 PM Len Brown wrote:
>
> Doug,
> The offset works for control.
>
> However, it is erroneous to use it for reporting of the actual
> temperature, like I did in turbostat.

Agreed. I have been running with a correction for that for a while,
and as discussed on Rui's thread.

But this bit mask correction patch is still needed, isn't it? For this:

cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x1a64100d (90 C) (100 default - 10 offset)

which should be this:

cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x1a64100d (74 C) (100 default - 26 offset)

But yes, I do now see the field size is only 4 bits for some parts.

... Doug

> Thus, I'm going to revert the patch that added its use in turbostat
> for the Temperature column.
>
> thanks,
> -Len
>
> On Fri, Mar 12, 2021 at 1:26 AM Doug Smythies wrote:
> >
> > Hi Len,
> >
> > thank you for your reply.
> >
> > On Thu, Mar 11, 2021 at 3:19 PM Len Brown wrote:
> > >
> > > Thanks for the close read, Doug.
> > >
> > > This field size actually varies from system to system,
> > > but the reality is that the offset is never that big, and so the
> > > smaller mask is sufficient.
> >
> > Disagree.
> >
> > I want to use an offset of 26.
> >
> > > Finally, this may all be moot, because there is discussion that using
> > > the offset this way is simply erroneous.
> >
> > Disagree.
> > It works great.
> > As far as I know/recall I was the only person that responded to Rui's thread
> > "thermal/intel: introduce tcc cooling driver" [1]
> > And, I spent quite a bit of time doing so.
> > However, I agree the response seems different between the two systems
> > under test, Rui's and mine.
> >
> > [1] https://marc.info/?l=linux-pm=161070345329806=2
> >
> > > stay tuned.
> >
> > O.K.
> >
> > ... Doug
> >
> > > -Len
> > >
> > > On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies wrote:
> > > >
> > > > The TCC offset mask is incorrect, resulting in
> > > > incorrect target temperature calculations, if
> > > > the offset is big enough to exceed the mask size.
> > > >
> > > > Signed-off-by: Doug Smythies
> > > > ---
> > > >  tools/power/x86/turbostat/turbostat.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > > > index 389ea5209a83..d7acdd4d16c4 100644
> > > > --- a/tools/power/x86/turbostat/turbostat.c
> > > > +++ b/tools/power/x86/turbostat/turbostat.c
> > > > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()
> > > >
> > > >  	target_c = (msr >> 16) & 0xFF;
> > > >
> > > > -	offset_c = (msr >> 24) & 0xF;
> > > > +	offset_c = (msr >> 24) & 0x3F;
> > > >
> > > >  	tcc = target_c - offset_c;
> > > >
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Len Brown, Intel Open Source Technology Center
>
> --
> Len Brown, Intel Open Source Technology Center
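For reference, the field layout under discussion can be sketched like this (illustrative Python; the 6-bit width is the assumption Doug's patch makes, and as Len notes the actual field width varies between parts). Decoding the MSR value 0x1a64100d quoted above shows why the 4-bit mask misreports a 26 C offset as 10 C:

```python
def tcc_activation_temp(msr, offset_bits=6):
    """Decode MSR_IA32_TEMPERATURE_TARGET: the TCC activation target is
    in bits 23:16, and the TCC offset field starts at bit 24.  Doug's
    patch widens the offset mask from 4 bits (0xF) to 6 bits (0x3F)."""
    target_c = (msr >> 16) & 0xFF
    offset_c = (msr >> 24) & ((1 << offset_bits) - 1)
    return target_c - offset_c
```

With the value from the thread, tcc_activation_temp(0x1A64100D) returns 74 (100 - 26), while the old 4-bit mask (offset_bits=4) truncates the 26 C offset to 10 and returns 90, exactly the wrong output quoted above.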
Re: [PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
Hi Len,

Thank you for your reply.

On Thu, Mar 11, 2021 at 3:19 PM Len Brown wrote:
>
> Thanks for the close read, Doug.
>
> This field size actually varies from system to system,
> but the reality is that the offset is never that big, and so the
> smaller mask is sufficient.

Disagree.

I want to use an offset of 26.

> Finally, this may all be moot, because there is discussion that using
> the offset this way is simply erroneous.

Disagree.
It works great.
As far as I know/recall I was the only person that responded to Rui's thread
"thermal/intel: introduce tcc cooling driver" [1]
And, I spent quite a bit of time doing so.
However, I agree the response seems different between the two systems
under test, Rui's and mine.

[1] https://marc.info/?l=linux-pm=161070345329806=2

> stay tuned.

O.K.

... Doug

> -Len
>
> On Sat, Jan 16, 2021 at 12:07 PM Doug Smythies wrote:
> >
> > The TCC offset mask is incorrect, resulting in
> > incorrect target temperature calculations, if
> > the offset is big enough to exceed the mask size.
> >
> > Signed-off-by: Doug Smythies
> > ---
> >  tools/power/x86/turbostat/turbostat.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
> > index 389ea5209a83..d7acdd4d16c4 100644
> > --- a/tools/power/x86/turbostat/turbostat.c
> > +++ b/tools/power/x86/turbostat/turbostat.c
> > @@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()
> >
> >  	target_c = (msr >> 16) & 0xFF;
> >
> > -	offset_c = (msr >> 24) & 0xF;
> > +	offset_c = (msr >> 24) & 0x3F;
> >
> >  	tcc = target_c - offset_c;
> >
> > --
> > 2.25.1
>
> --
> Len Brown, Intel Open Source Technology Center
Re: [3/3,v3] tools/power turbostat: Enable accumulate RAPL display
Hi Yu,

I am just resending your e-mail, adjusting the "To:" list to
include the 3 others that have submitted similar patches.

... Doug

On Mon, Mar 8, 2021 at 8:11 AM Chen Yu wrote:
>
> Hi,
> On Mon, Mar 08, 2021 at 07:37:07AM -0800, Doug Smythies wrote:
> > On Mon, Mar 8, 2021 at 5:50 AM youling257 wrote:
> > >
> > > this cause turbostat not work on amd cpu.
> > >
> > > root@localhost:~# /turbostat
> > > turbostat version 20.09.30 - Len Brown
> > > CPUID(0): AuthenticAMD 0xd CPUID levels; 0x801f xlevels; family:model:stepping 0x17:18:1 (23:24:1)
> >
> > There are already two fixes for this in the queue.
> > https://marc.info/?l=linux-pm=161382097503925=2
> > https://marc.info/?l=linux-pm=161141701219263=2
>
> Thanks for reporting and pointing this out. I assume these two patches
> are both fixing the same issue? It looks like these two patches should
> be merged into one:
> 1. Bingsong's patch accesses MSR_PKG_ENERGY_STAT only when RAPL_AMD_F17H
>    is set, which is consistent with the original context.
> 2. Bas Nieuwenhuizen's patch also fixes the issue in idx_valid()
>    in case RAPL_PKG was not set but with RAPL_AMD_F17H set.
>
> thanks,
> Chenyu
Re: [3/3,v3] tools/power turbostat: Enable accumulate RAPL display
On Mon, Mar 8, 2021 at 5:50 AM youling257 wrote:
>
> this cause turbostat not work on amd cpu.
>
> root@localhost:~# /turbostat
> turbostat version 20.09.30 - Len Brown
> CPUID(0): AuthenticAMD 0xd CPUID levels; 0x801f xlevels; family:model:stepping 0x17:18:1 (23:24:1)

There are already two fixes for this in the queue.
https://marc.info/?l=linux-pm=161382097503925=2
https://marc.info/?l=linux-pm=161141701219263=2
[PATCH] tools/power/x86/turbostat: Fix TCC offset bit mask
The TCC offset mask is incorrect, resulting in
incorrect target temperature calculations, if
the offset is big enough to exceed the mask size.

Signed-off-by: Doug Smythies
---
 tools/power/x86/turbostat/turbostat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 389ea5209a83..d7acdd4d16c4 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -4823,7 +4823,7 @@ int read_tcc_activation_temp()

 	target_c = (msr >> 16) & 0xFF;

-	offset_c = (msr >> 24) & 0xF;
+	offset_c = (msr >> 24) & 0x3F;

 	tcc = target_c - offset_c;

--
2.25.1
RE: [PATCH v2 0/3] cpufreq: Allow drivers to receive more information from the governor
On 2020.12.14 12:02 Rafael J. Wysocki wrote:
> Hi,

Hi Rafael,

V2 test results below are new; other results are partially re-stated.
For readers that do not want to read on: I didn't find anything
different than with the other versions. This was more just due diligence.

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc (or rjw): Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
acpi (or acpi-cpufreq): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + V1 patch set, HWP enabled; intel_cpu-freq; schedutil
v2: Kernel 5.10-rc7 + V2 patch set, HWP enabled; intel_cpu-freq; schedutil

Fixed work packet, fixed period, periodic workflow, load sweep up/down:

load work/sleep frequency: 73 Hertz:
hwp: Average: 12.00822 watts
rjw: Average: 10.18089 watts
no-hwp: Average: 10.21947 watts
acpi-cpufreq: Average: 9.06585 watts
patch: Average: 10.26060 watts
v2: Average: 10.50444

load work/sleep frequency: 113 Hertz:
hwp: Average: 12.01056
rjw: Average: 10.12303
no-hwp: Average: 10.08228
acpi-cpufreq: Average: 9.02215
patch: Average: 10.27055
v2: Average: 10.31097

load work/sleep frequency: 211 Hertz:
hwp: Average: 12.16067
rjw: Average: 10.24413
no-hwp: Average: 10.12463
acpi-cpufreq: Average: 9.19175
patch: Average: 10.33000
v2: Average: 10.39811

load work/sleep frequency: 347 Hertz:
hwp: Average: 12.34169
rjw: Average: 10.79980
no-hwp: Average: 10.57296
acpi-cpufreq: Average: 9.84709
patch: Average: 10.67029
v2: Average: 10.93143

load work/sleep frequency: 401 Hertz:
hwp: Average: 12.42562
rjw: Average: 11.12465
no-hwp: Average: 11.24203
acpi-cpufreq: Average: 10.78670
patch: Average: 10.94514
v2: Average: 11.50324

Serialized single threaded via PIDs per second method,
a.k.a. fixed work packet, variable period:

Results:

Execution times (seconds. Less is better):

no-hwp:
performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.5
schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:
schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000
patch: Samples: 276 ; Average: 14.57029 ; Stand Deviation: 0.89771 ; Maximum: 16.04000 ; Minimum: 11.68000
rfc: Samples: 271 ; Average: 14.86037 ; Stand Deviation: 0.84164 ; Maximum: 16.04000 ; Minimum: 12.21000
v2: Samples: 274 ; Average: 14.67978 ; Stand Deviation: 1.03378 ; Maximum: 16.07000 ; Minimum: 11.43000

Power (watts. More indicates higher CPU frequency and better performance. Sample time = 1 second.):

no-hwp:
performance: Samples: 4000 ; Average: 25.41355 ; Stand Deviation: 0.22156 ; Maximum: 26.01996 ; Minimum: 24.08807
schedutil: Samples: 4000 ; Average: 12.58863 ; Stand Deviation: 5.48600 ; Maximum: 25.50934 ; Minimum: 7.54559
acpi: Samples: 4000 ; Average: 9.57924 ; Stand Deviation: 5.41157 ; Maximum: 25.06366 ; Minimum: 5.51129

hwp:
schedutil: Samples: 4000 ; Average: 25.24245 ; Stand Deviation: 0.19539 ; Maximum: 25.93671 ; Minimum: 24.14746
patch: Samples: 4000 ; Average: 11.07225 ; Stand Deviation: 5.63142 ; Maximum: 24.99493 ; Minimum: 3.67548
rfc: Samples: 4000 ; Average: 10.35842 ; Stand Deviation: 4.77915 ; Maximum: 24.95953 ; Minimum: 7.26202
v2: Samples: 4000 ; Average: 10.98284 ; Stand Deviation: 5.48859 ; Maximum: 25.76331 ; Minimum: 7.53790
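For readers wanting to reproduce the summary lines above from their own samples, this is the reporting shape used throughout these results (a sketch only; Doug's actual tooling is not shown in the thread, and I assume "Stand Deviation" means the sample standard deviation):

```python
from statistics import mean, stdev

def summarize(samples):
    """Format one result line in the style used in the tables above."""
    return ("Samples: %d ; Average: %.5f ; Stand Deviation: %.5f ; "
            "Maximum: %.5f ; Minimum: %.5f"
            % (len(samples), mean(samples), stdev(samples),
               max(samples), min(samples)))
```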
RE: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor
On 2020.12.08 11:15 Doug wrote:
> On 2020.12.08 09:14 Rafael J. Wysocki wrote:
> > On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich wrote:
> >> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote:
> >> I'd like to test this patch, as I have concerns on how it performs against the
> >> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED
> >> untouched).
> >
> > The performance will be worse in some cases, which is to be expected,
>
> Agreed. More further below.
>
> > but the point here is to counter the problem that in some cases the
> > frequency is excessive now.
>
> Agreed.

...

> The "hwp" case is misleading. It is really bouncing around at about 0.2 hertz
> and can't seem to make up its mind.
> If nothing else, all the other versions
> are a little more deterministic in their response.

Hmmm... I think that statement is backwards, for this particular
workflow. Read on.

> >> I'll get results within a week. Do you mind holding back the merge until
> >> then?
>
> On my system that git "make test" is broken, so I can not do the high PIDs
> per second test that way.
> My own PIDs per second test also has issues on this computer.
> I am not sure when I'll get these types of tests working, but will try for
> within a week.

Giovanni: Previously you have reported run to run variability with the
git "make test" serialized single threaded workflow, with a new PID per
task. Previously, I had attributed the variability to disk I/O. For the
schedutil governor, there is also a significant variability in execution
time within and around a frequency of 0.2 Hertz. Generally, it would
tend to get averaged out over the long execution time of the "make test".

I have completely re-done everything for my version (no disk I/O) of the
PIDs per second test.

Preamble:

As mentioned previously, the issue with attempting to acquire useful
debugging information with the schedutil governor is that it is always
so oscillatory. Sample processor package energy at too high a frequency
and it just looks like noise; sample at too low a frequency and any
useful information is filtered out. To make a long story shorter, a
sample frequency of about 1 hertz, or a sample period of about 1 second,
was revealing. Event periods of between 5 and 35 seconds are also
supported by listening to the fan speed modulation on my test server.
A lot of intel_pstate_tracer.py utility trace data was acquired and
reviewed, but I am unable to make any sense out of it.

Data is presented using the teo idle governor and a 1000 hertz kernel,
but be aware that the menu governor and a 250 Hertz kernel were also
tried, with the results not written up. Data is presented for one set of
conditions only, but be aware that all of the PIDs per second space was
searched for any specific anomalies. It is the same everywhere (tested
~40 - ~3500 PIDs per second; obviously, at some low enough PIDs per
second the differences will diminish) and the operating point of focus
for further dwell type testing was not cherry picked.

Operating point for the rest of the work:

Fixed work packet per task (PID):
~1780 PIDs per second, performance governor.
~ 308 PIDs per second, powersave governor.

Test computer: i5-9600K

Legend:

hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq
rfc: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpu-freq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpu-freq
acpi: Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + this patch set, HWP enabled; intel_cpu-freq; schedutil

Results:

Execution times (seconds. Less is better):

no-hwp:
conservative: Samples: 371 ; Average: 10.84296 ; Stand Deviation: 0.07160 ; Maximum: 11.09000 ; Minimum: 10.65000
ondemand: Samples: 342 ; Average: 11.76442 ; Stand Deviation: 0.02640 ; Maximum: 11.85000 ; Minimum: 11.71000
performance: Samples: 382 ; Average: 10.54450 ; Stand Deviation: 0.01564 ; Maximum: 10.61000 ; Minimum: 10.5
powersave: Samples: 68 ; Average: 59.93897 ; Stand Deviation: 0.11697 ; Maximum: 60.19000 ; Minimum: 59.67000
schedutil: Samples: 293 ; Average: 13.73416 ; Stand Deviation: 0.73395 ; Maximum: 15.46000 ; Minimum: 11.68000
acpi: Samples: 253 ; Average: 15.94889 ; Stand Deviation: 1.28219 ; Maximum: 18.66000 ; Minimum: 12.04000

hwp:
conservative: Samples: 380 ; Average: 10.58518 ; Stand Deviation: 0.01562 ; Maximum: 10.64000 ; Minimum: 10.54000
ondemand: Samples: 380 ; Average: 10.58395 ; Stand Deviation: 0.01509 ; Maximum: 10.62000 ; Minimum: 10.54000
performance: Samples: 381 ; Average: 10.57871 ; Stand Deviation: 0.01870 ; Maximum: 10.63000 ; Minimum: 10.53000
powersave: Samples: 67 ; Average: 60.37075 ; Stand Deviation: 0.20095 ; Maximum: 60.81000 ; Minimum: 59.92000
schedutil: Samples: 380 ; Average: 10.58287 ; Stand Deviation: 0.01864 ; Maximum: 10.64000 ; Minimum: 10.54000

patch: Samples: 276 ; Average: 14.57029 ; Stand
RE: [PATCH v1 0/4] cpufreq: Allow drivers to receive more information from the governor
On 2020.12.08 09:14 Rafael J. Wysocki wrote: > On Tue, Dec 8, 2020 at 5:31 PM Giovanni Gherdovich > wrote: >> On Mon, 2020-12-07 at 17:25 +0100, Rafael J. Wysocki wrote: >>> Hi, >>> >>> This is based on the RFC posted a few days ago: >>> >>> https://lore.kernel.org/linux-pm/1817571.2o5Kk4Ohv2@kreacher/ >>> >>> The majority of the original cover letter still applies, so let me quote it >>> here: >>> [...] >> >> Hello Rafael, >> >> I'd like to test this patch, as I have concerns on how it performs against >> the >> current intel_pstate passive + HWP + schedutil (which leaves HWP.REQ.DESIRED >> untouched). > >The performance will be worse in some cases, which is to be expected, Agreed. More further below. >but the point here is to counter the problem that in some cases the >frequency is excessive now. Agreed. I retested with this new version with my load sweep test, and my results were the pretty similar to previously reported, [1]. If anything, there might be a perceptible difference between the RFC version and this version as a function of work/sleep frequency. 73 Hertz: RFC was 0.8% less power 113 Hertz: RFC was 1.5% less power 211 Hertz: RFC was 0.8% less power 347 Hertz: RFC was 1.2% more power 401 Hertz: RFC was 1.6% more power Of course, let us not lose site of the original gains that were in the 13 to 17% range. Now, for the "worse in some cases, which is to be expected" part: At least on my system, it is most evident for some of the pipe type tests, where the schedutil governor has never really known what to do. 
This patch set seems to add enough of a downward bias that this version of the schedutil governor now behaves much like the other versions. Here is an example (no forced CPU affinity, 2 pipes):
Legend:
hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq; schedutil (always)
rjw: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpufreq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpufreq; schedutil
acpi-cpufreq (or just acpi): Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
patch: Kernel 5.10-rc7 + the non-RFC 4-patch version of this patch set, HWP enabled; intel_cpufreq; schedutil
hwp: 3.295 uSec/loop (steady); average power: 62.78 Watts (steady); freq: 4.60GHz (steady).
rjw: 3.6 uSec/loop (noisy) (9.3% worse); average power: 44.13 Watts (noisy); freq: ?.??GHz (noisy).
no-hwp: 3.35 uSec/loop (noisy); average power: 59.17 Watts (noisy); freq: ?.??GHz (much less noisy).
acpi-cpufreq: 3.4 uSec/loop (noisy); average power: 56.93 Watts (noisy); freq: ?.??GHz (less noisy).
patch: 3.6 uSec/loop (noisy) (9.3% worse); average power: 43.36 Watts (noisy); freq: ?.??GHz (noisy).
One could argue that this patch set might be driving the frequency down a little too hard in this case, but this is one of those difficult rotating-through-the-CPUs cases anyhow. As a side note, all other governors (well, not powersave) pin the CPU frequency at max (4.6 GHz). For my version of the "Alex" pipe test (a token is passed around and around via 6 named pipes, with a bit of work to do per token stop) I get (more power = better performance):
hwp: average power: 17.16 Watts (very noisy)
rjw: average power: 3.18 (noisy)
no-hwp: average power: 2.45 (less noisy)
acpi-cpufreq: average power: 2.46 (less noisy)
patch: average power: 2.47 (less noisy)
The "hwp" case is misleading. It is really bouncing around at about 0.2 hertz and can't seem to make up its mind. If nothing else, all the other versions are a little more deterministic in their response.
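A minimal stand-in for this kind of pipe ping-pong test, bouncing a one-byte token between a parent and a forked child over two anonymous pipes and reporting uSec/loop, might look like the sketch below. The real tests use named pipes, multiple processes, and a work packet per token stop; this only illustrates the round-trip timing idea:

```python
import os
import time

def pipe_pingpong(loops=1000):
    """Bounce a one-byte token parent -> child -> parent over two pipes
    and return the average round-trip time in microseconds per loop."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo the token back until EOF
        os.close(p2c_w)
        os.close(c2p_r)
        while os.read(p2c_r, 1):
            os.write(c2p_w, b"x")
        os._exit(0)
    os.close(p2c_r)
    os.close(c2p_w)
    t0 = time.perf_counter()
    for _ in range(loops):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - t0
    os.close(p2c_w)            # EOF stops the child
    os.waitpid(pid, 0)
    return elapsed / loops * 1e6   # uSec per round trip

print(f"{pipe_pingpong():.3f} uSec/loop")
```

This is exactly the kind of workload where the task keeps migrating between CPUs and is never busy long enough on one of them for utilization-driven governors to settle, which is why the schedutil numbers above are so noisy.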
> >> I'll get results within a week. Do you mind holding back the merge until >> then? On my system that git "make test" is broken, so I can not do the high PIDs per second test that way. My own PIDs per second test also has issues on this computer. I am not sure when I'll get these types of tests working, but will try for within a week. ... Doug References: [1] https://marc.info/?l=linux-kernel&m=160692480907696&w=2 My test results graphs (updated): Note: I have to code the web site, or I get hammered by bots. Note: it is .com only because it was less expensive than .org
73 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su73/
113 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su113/
211 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su211/
347 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su347/
401 Hertz (10 samples per second): Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su401/
Note: The below graphs are mislabelled, because I made hackish use of a tool dedicated to graphing, and HTML wrapping, the load sweep test. The test conditions are steady state operation, no variable changes. pipe-test (5 seconds per
RE: [RFC][PATCH 1/2] cpufreq: Add special-purpose fast-switching callback for drivers
Hi Rafael, On 2020.11.30 10:37 Rafael J. Wysocki wrote: > First off, some cpufreq drivers (eg. intel_pstate) can pass hints > beyond the current target frequency to the hardware and there are no > provisions for doing that in the cpufreq framework. In particular, > today the driver has to assume that it should allow the frequency to Forgot the important "not": today the driver has to assume that it should not allow the frequency to > fall below the one requested by the governor (or the required capacity > may not be provided) which may not be the case and which may lead to > excessive energy usage in some scenarios. > > Second, the hints passed by these drivers to the hardware neeed not s/neeed/need ... O.K. this is good. The problem with my basic CPU frequency versus load test with the schedutil governor is that it is always so oscillatory it is pretty much not possible to conclude anything. So I re-worked the test to look at Processor Package Power load instead. In a previous e-mail [1] I had reported the power differences for one periodic load at one frequency, as an (apparently cherry-picked) example. Quoted: > schedutil governor: > acpi-cpufreq: good > intel_cpufreq hwp: bad< Now good, with this patch set. > intel_cpufreq no hwp: good > ... > periodic workflow at 347 hertz. > ~36% load at 4.60 GHz (where hwp operates) > ~55% load at 3.2 GHz (where no hwp operates) > > intel_cpufreq hwp: 9.6 processor package watts. 45.8 watts on the mains to > the computer. > intel_cpufreq no hwp: ~6 processor package watts. ~41 watts on the mains to > the computer.
(noisy) So this time, I only have power/energy data, and a relatively easy way to compress all 12,000 samples into some concise summary is to simply find the average power for the entire experiment:
Legend:
hwp: Kernel 5.10-rc6, HWP enabled; intel_cpufreq; schedutil (always)
rjw: Kernel 5.10-rc6 + this patch set, HWP enabled; intel_cpufreq; schedutil
no-hwp: Kernel 5.10-rc6, HWP disabled; intel_cpufreq; schedutil
acpi-cpufreq: Kernel 5.10-rc6, HWP disabled; acpi-cpufreq; schedutil
load work/sleep frequency: 73 Hertz:
hwp: Average: 12.00822 watts
rjw: Average: 10.18089 watts
no-hwp: Average: 10.21947 watts
acpi-cpufreq: Average: 9.06585 watts
load work/sleep frequency: 113 Hertz:
hwp: Average: 12.01056
rjw: Average: 10.12303
no-hwp: Average: 10.08228
acpi-cpufreq: Average: 9.02215
load work/sleep frequency: 211 Hertz:
hwp: Average: 12.16067
rjw: Average: 10.24413
no-hwp: Average: 10.12463
acpi-cpufreq: Average: 9.19175
load work/sleep frequency: 347 Hertz:
hwp: Average: 12.34169
rjw: Average: 10.79980
no-hwp: Average: 10.57296
acpi-cpufreq: Average: 9.84709
load work/sleep frequency: 401 Hertz:
hwp: Average: 12.42562
rjw: Average: 11.12465
no-hwp: Average: 11.24203
acpi-cpufreq: Average: 10.78670
[1] https://marc.info/?l=linux-pm&m=159769839401767&w=2
My test results graphs: Note: I have to code the web site, or I get hammered by bots. Note: it is .com only because it was less expensive than .org
73 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su73/
113 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su113/
211 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su211/
347 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su347/
401 Hertz: Double u double u double u dot smythies dot .com/~doug/linux/s18/hwp/k510-rc6/su401/
... Doug
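The average-power figures in these reports come from sampling processor package energy over the run. On Intel hardware one standard way to do that from userspace is the kernel's RAPL powercap sysfs interface. The sketch below is hedged: the powercap path shown is the standard kernel location for the package-0 domain, but whether these particular measurements used RAPL or external metering is not stated, and counter wraparound handling is omitted for brevity:

```python
import time

RAPL_PKG = "/sys/class/powercap/intel-rapl:0"  # package-0 RAPL domain

def watts_from(delta_uj, delta_s):
    """Convert an energy delta in microjoules over delta_s seconds to watts."""
    return delta_uj / 1e6 / delta_s

def average_package_watts(seconds=10.0):
    """Average processor package power over an interval, read from the
    RAPL energy counter.  A long-running version must also handle the
    counter wrapping at max_energy_range_uj, which is skipped here."""
    def energy_uj():
        with open(f"{RAPL_PKG}/energy_uj") as f:
            return int(f.read())
    e0, t0 = energy_uj(), time.perf_counter()
    time.sleep(seconds)
    e1, t1 = energy_uj(), time.perf_counter()
    return watts_from(e1 - e0, t1 - t0)

# Usage (needs root or readable powercap files, Intel CPU):
#   average_package_watts(10.0)
```

Sampling this periodically over the whole experiment and averaging the samples gives summaries like the per-frequency "Average: ... watts" lines above.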
RE: [PATCH v3 0/4] cpufreq: intel_pstate: Handle powersave governor correctly in the passive mode with HWP
On 2020.11.10 09:22 Rafael J. Wysocki wrote: > On Monday, November 9, 2020 5:49:49 PM CET Rafael J. Wysocki wrote: >> >> Even after the changes made very recently, the handling of the powersave >> governor is not exactly as expected when intel_pstate operates in the >> "passive" mode with HWP enabled. >> >> Namely, in that case HWP is not limited to the policy min frequency, but it >> can scale the frequency up to the policy max limit and it cannot be >> constrained >> currently, because there are no provisions for that in the framework. >> >> To address that, patches [1-3/4] add a new governor flag to indicate that >> this >> governor wants the target frequency to be set to the exact value passed to >> the >> driver, if possible, and change the powersave and performance governors to >> have >> that flag set. >> >> The last patch makes intel_pstate take that flag into account when >> programming >> the HWP Request MSR. > > The v3 simply uses different names for the new governor flags. Thank you. I tested v2, with positive results, as reported for v1. I do not have time to re-test v3. My input is to also default this flag to be set for the userspace and ondemand governors. userspace: I tested with and without this flag set, and the flag is needed if the user expects the scaling_setspeed to be enforced. Disclaimer: I don't normally actually use the userspace governor. ondemand: from my tests, the ondemand response more closely mimics acpi-ondemand with the flag set. Power consumption has been better for the limited testing done. However, it is also a function of work/sleep frequency for periodic workflows and a function of INTEL_CPUFREQ_TRANSITION_DELAY_HWP. I am saying that my ability to support the suggestion to default to setting the flag is a little weak. ... Doug
RE: [PATCH 1/2] cpufreq: Introduce target min and max frequency hints
Hi Rafael: Thank you for this patch set. I can not get the patch to apply. I was trying on top of 5.10-rc2, and have been unable to determine what other patches might need to be applied first. On 2020.11.05 10:24 Rafael J. Wysocki wrote: ... > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpufreq/cpufreq.c | 3 +++ > drivers/cpufreq/cpufreq_performance.c | 4 > drivers/cpufreq/cpufreq_powersave.c | 4 > include/linux/cpufreq.h | 16 I do not understand why this part says to look for 16 differences, but I can only find 2. > 4 files changed, 27 insertions(+) > > Index: linux-pm/include/linux/cpufreq.h > === > --- linux-pm.orig/include/linux/cpufreq.h > +++ linux-pm/include/linux/cpufreq.h > @@ -63,6 +63,8 @@ struct cpufreq_policy { > > unsigned int min; /* in kHz */ > unsigned int max; /* in kHz */ > + unsigned int target_min; /* in kHz */ > + unsigned int target_max; /* in kHz */ > unsigned int cur; /* in kHz, only needed if cpufreq > * governors are used */ > unsigned int suspend_freq; /* freq to set during suspend */ > Index: linux-pm/drivers/cpufreq/cpufreq.c ... Anyway, I edited the patch, deleting the include/linux/cpufreq.h part, then it applied, as did patch 2 of 2. I edited include/linux/cpufreq.h manually. Issues with the powersave governor reported in [1] and [2] are fixed. Relevant part quoted and updated below: > In early September Doug wrote: >> powersave governor: >> acpi-cpufreq: good >> intel_cpufreq hwp: bad Now good, with this patch set. >> intel_cpufreq no hwp: good ... > For the powersave governor, this is what we have now: > > intel_cpufreq hwp == intel_pstate hwp > intel_cpufreq no hwp == acpi-cpufreq == always minimum freq > intel_pstate no hwp ~= acpi-cpufreq/ondemand ... > My expectation was/is: > > intel_cpufreq hwp == intel_cpufreq no hwp == acpi-cpufreq == always minimum > freq And this is what we now have, with this patch set. > intel_pstate no hwp ~= acpi-cpufreq/ondemand > intel_pstate hwp == Unique. Say, extremely coarse version of ondemand.
[1] https://marc.info/?l=linux-pm&m=159769839401767&w=2 [2] https://marc.info/?l=linux-pm&m=159943780220923&w=2 ... Doug
RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, On 2020.08.17 14:06 Doug Smythies wrote: > On 2020.08.06 05:04 Rafael J. Wysocki wrote: > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit (HWP floor) to the > > P-state value given by the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other, at least when the schedutil governor > > is in use, and update the intel_pstate documentation accordingly. > ... > > powersave governor: > acpi-cpufreq: good > intel_cpufreq hwp: bad > intel_cpufreq no hwp: good It occurs to me that my expectations as to what is meant by "powersave" might not agree with yours. For the powersave governor, this is what we have now:
intel_cpufreq hwp == intel_pstate hwp
intel_cpufreq no hwp == acpi-cpufreq == always minimum freq
intel_pstate no hwp ~= acpi-cpufreq/ondemand
Is that your understanding/intention? My expectation was/is:
intel_cpufreq hwp == intel_cpufreq no hwp == acpi-cpufreq == always minimum freq
intel_pstate no hwp ~= acpi-cpufreq/ondemand
intel_pstate hwp == Unique. Say, an extremely coarse version of ondemand.
... Doug
RE: [PATCH v2 0/5] cpufreq: intel_pstate: Address some HWP-related oddities
Hi Srinivas, Thanks for your reply. On 2020.08.25 08:12 Srinivas Pandruvada wrote: > On Mon, 2020-08-24 at 18:00 -0700, Doug Smythies wrote: > > I think there is a disconnect between your written > > description of what is going on and your supporting MSR reads. > > > I reproduced again. > I see the copy paste individual at the first place swapped. Yes, and that had me confused, initially. > I pasted the full output by direct copy - paste from the screen. > > But the issues are still there. Agreed. I didn't try your offline/online of CPU 1 part previously, but did now, and get the same results as you. I did not know that "rdmsr -a 0x774" lists stuff in the order that CPUs were last brought on-line. I had assumed the list was in CPU order. Weird. My example (nothing new here, just me catching up. The offline/online order was cpu1, then cpu3, then cpu2):
root@s18:/sys/devices/system/cpu# grep . cpu*/cpufreq/energy_performance_preference
cpu0/cpufreq/energy_performance_preference:balance_performance
cpu1/cpufreq/energy_performance_preference:127
cpu2/cpufreq/energy_performance_preference:125
cpu3/cpufreq/energy_performance_preference:126
cpu4/cpufreq/energy_performance_preference:balance_performance
cpu5/cpufreq/energy_performance_preference:balance_performance
root@s18:/sys/devices/system/cpu# rdmsr -p 0 0x774
80002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 1 0x774
7f002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 2 0x774
7d002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 3 0x774
7e002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 4 0x774
80002e2e
root@s18:/sys/devices/system/cpu# rdmsr -p 5 0x774
80002e2e
root@s18:/sys/devices/system/cpu# rdmsr -a 0x774
80002e2e
80002e2e
80002e2e
7f002e2e
7e002e2e
7d002e2e
... Doug
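The raw rdmsr values above follow the IA32_HWP_REQUEST field layout from the Intel SDM, so they are easy to decode by hand or in a few lines of Python: cpu1's 7f002e2e is min 46, max 46, desired 0, EPP 127, matching the sysfs energy_performance_preference readback of 127. A minimal decoder (the upper fields such as activity window and package control are omitted here):

```python
def decode_hwp_request(raw):
    """Split an IA32_HWP_REQUEST (MSR 0x774) value into its low fields,
    per the Intel SDM bit layout."""
    return {
        "min":      raw        & 0xFF,  # bits  7:0  minimum performance
        "max":     (raw >> 8)  & 0xFF,  # bits 15:8  maximum performance
        "desired": (raw >> 16) & 0xFF,  # bits 23:16 desired performance
        "epp":     (raw >> 24) & 0xFF,  # bits 31:24 energy/perf preference
    }

# cpu1's value above, after "echo 127 > energy_performance_preference":
print(decode_hwp_request(0x7F002E2E))
```

The same decode explains the post-resume 8000ff01 values seen later in this thread: min 1, max 255, EPP 128, i.e. the limits reset to hardware defaults.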
RE: [PATCH v2 0/5] cpufreq: intel_pstate: Address some HWP-related oddities
Hi Srinivas, I think there is a disconnect between your written description of what is going on and your supporting MSR reads. On 2020.08.24 16:56 Srinivas Pandruvada wrote: > On Mon, 2020-08-24 at 19:39 +0200, Rafael J. Wysocki wrote: > > Hi All, > > > > The v2 is here to address feedback from Doug and one issue found by > > me. > > > > The purpose of this series is to address some peculiarities related > > to > > taking CPUs offline/online and switching between different operation > > modes with HWP enabled that have become visible after allowing the > > driver to work in the passive mode with HWP enabled in 5.9-rc1 (and > > one that was there earlier, but can be addressed easily after the > > changes madein 5.9-rc1). > > > > Please refer to the patch changelogs for details. > > > > For easier testing/review, the series is available from the git > > branch at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \ > > intel_pstate-testing > > > > Applied these patches to 5.9-rc2 So did I, and the issues I reported the other day are fine now. I did try a few of the things you were doing. > > - After s3 limits got messed up. > # cat /sys/power/mem_sleep > s2idle [deep] > > - In the dmesg unchecked MSR for HWP register > > 1. > Before test > > sudo rdmsr -a 0x774 > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0d > 7f002b0c ?? This looks like the MSR read for further below, and for CPU 19 instead of 1. > cd /sys/devices/system/cpu/intel_pstate/ > [root@otcpl-perf-test-skx-i9 intel_pstate]# grep . * > hwp_dynamic_boost:0 > max_perf_pct:100 > min_perf_pct:27 > no_turbo:0 > num_pstates:32 > status:active > turbo_pct:32 > > cd ../cpu1/cpufreq/ > [root@otcpl-perf-test-skx-i9 cpufreq]# grep . 
* > affected_cpus:1 > base_frequency:330 > cpuinfo_max_freq:430 > cpuinfo_min_freq:120 > cpuinfo_transition_latency:0 > energy_performance_available_preferences:default performance > balance_performance balance_power power > energy_performance_preference:balance_performance > related_cpus:1 > scaling_available_governors:performance powersave > scaling_cur_freq:120 > scaling_driver:intel_pstate > scaling_governor:powersave > scaling_max_freq:430 > scaling_min_freq:120 > scaling_setspeed: > > > 2. Now change the EPP > > # echo 127 > energy_performance_preference > sudo rdmsr -a 0x774 > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0d This looks like the original MSR read. > > Good here > > 3. Offline/online good > > [root@otcpl-perf-test-skx-i9 cpufreq]# echo 0 > > /sys/devices/system/cpu/cpu1/online > [root@otcpl-perf-test-skx-i9 cpufreq]# echo ` > > /sys/devices/system/cpu/cpu1/online > > echo ` > /sys/devices/system/cpu/cpu1/online ^C > [root@otcpl-perf-test-skx-i9 cpufreq]# echo 1 > > /sys/devices/system/cpu/cpu1/online > > sudo rdmsr -a 0x774 > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002d0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002b0c > 80002d0d > 7f002b0c O.K. > > Good. Online restored the setting > > 4. Now S3 > > rtcwake -m mem -s 10 Cool command. I did not know about it. I tried it. > > All limits are now messed up > > sudo rdmsr -a 0x774 > 80002b0c > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > 8000ff00 > Yes, I got the same: # /home/doug/c/msr-decoder (edited) 6.) 
0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E08 : 8000FF01 : 8000FF01 : 8000FF01 : 8000FF01 : 8000FF01 : min:8 :1 :1 :1 :1 :1 : max: 46 : 255 : 255 : 255 : 255 : 255 : des:0 :0 :0 :0 :0 :0 : epp: 128 : 128 : 128 : 128 : 128 : 128 : act:0 :0 :0 :0 :0 :0 : > 5. Now switch to passive > Again bad, some CPU max/min is 0 > > sudo rdmsr -a 0x774 > 80002b0d > 7f002b0f Hmmm... Now seems to be CPU 1 > 80002b0c > 80002d0e > 80002b0c > 80002b0d > 80002b0f > 80002b2b > 80002b0c > 80002d1d > 8000 > 80002b0c > 80002b0c > 8000 > 8000 > 8000 > 8000 > 8000 > 8000 > 8000 MSR 774 was good for me, but in general my decoder was having troubles. 0x774: IA32_HWP_REQUEST:CPU 0-5 : sh: 0: getcwd() failed: No such file or directory raw: 80002E2E : 7F002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : min: 46 : 46 : 46 : 46 : 46 :
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.21 10:16 Rafael J. Wysocki wrote: ... > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > may turn out to be either too high or too low for the general use, which is > one > reason why getting as much testing coverage as possible is key here. > > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > please do so and let me know the conclusions. ... Hi Rafael, Since my v7 reply the other day about results for 347 Hz work/sleep frequency periodic workflows [1], I have been testing at a 73 Hertz work/sleep frequency. And am now testing this:
-#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 1
This shows promise for the schedutil governor, which no longer drives prematurely into the turbo range, consuming excessive amounts of energy as compared to passive without HWP. The ondemand governor is still very bad. The only way I have found to get it to behave is to force max=min=target_pstate, in addition to the above. It will be quite some time before I have well-organized feedback. In the meantime, if anyone has suggestions for an optimal INTEL_CPUFREQ_TRANSITION_DELAY_HWP value, I would be grateful. Otherwise I'll attempt to find it via tests. ... Doug [1] https://marc.info/?l=linux-pm&m=159769839401767&w=2
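The periodic work/sleep loads used throughout these tests (73 to 401 hertz) can be approximated with a simple spin-then-sleep loop. This is only a rough stand-in: the actual tests fix the work packet per period, so the duty cycle (and hence the reported percent load) varies with the CPU frequency the governor picks, whereas this sketch fixes the duty cycle directly:

```python
import time

def periodic_load(hz, duty, seconds):
    """Spin for duty * period, then sleep the remainder, repeated at hz.
    Returns the number of completed work/sleep cycles."""
    period = 1.0 / hz
    busy = period * duty
    end = time.perf_counter() + seconds
    cycles = 0
    while time.perf_counter() < end:
        t = time.perf_counter()
        while time.perf_counter() - t < busy:
            pass                                  # burn CPU for the work phase
        time.sleep(max(0.0, period - (time.perf_counter() - t)))
        cycles += 1
    return cycles

# e.g. roughly a 36% load at 347 work/sleep cycles per second for 2 seconds:
# periodic_load(347, 0.36, 2)
```

Loads like this are what expose the governor/HWP interaction: at higher work/sleep frequencies the governor has less time per period to converge on a sensible P-state before the next sleep.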
RE: [PATCH 3/4] cpufreq: intel_pstate: Add ->offline and ->online callbacks
Hi Rafael, Just annoying typo type feedback. On 2020.08.20 09:38 Rafael J. Wysocki wrote: > > From: "Rafael J. Wysocki" > > Add ->offline and ->online driver callbacksto to do the cleanup "to to" and suggest this: Add ->offline and ->online driver callbacks to cleanup > before taking a CPU offline and to restore its working configuration > when it goes back online, respectively, to avoid invoking the ->init > callback on every CPU online which is quite a bit of unnecessary > overhead. > > Define ->offline and ->online so that they can be used in the > passive as well as in the active mode and because ->offline will passive mode > do the majority of ->stop_cpu work, the passive mode does not > need that callback any more, so drop it. > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpufreq/intel_pstate.c | 38 -- > 1 file changed, 32 insertions(+), 6 deletions(-) > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 3d18934fa975..aca0587b176f 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -2297,28 +2297,51 @@ static int intel_pstate_verify_policy(struct > cpufreq_policy_data *policy) > return 0; > } > > -static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy) > +static int intel_pstate_cpu_offline(struct cpufreq_policy *policy) > { > + pr_debug("CPU %d going offline\n", policy->cpu); > + > + intel_pstate_exit_perf_limits(policy); > + > + /* > + * If the CPU is an SMT thread and it goes offline with the performance > + * settings different from the minimum, it will prevent its sibling > + * from getting to lower performance levels, so force the minimum > + * performance on CPU offline to prevent that form happening. 
form/from > + */ > if (hwp_active) > intel_pstate_hwp_force_min_perf(policy->cpu); > else > intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]); > + > + return 0; > +} > + > +static int intel_pstate_cpu_online(struct cpufreq_policy *policy) > +{ > + pr_debug("CPU %d going online\n", policy->cpu); > + > + intel_pstate_init_acpi_perf_limits(policy); > + > + if (hwp_active) > + wrmsrl_on_cpu(policy->cpu, MSR_HWP_REQUEST, > + all_cpu_data[policy->cpu]->hwp_req_cached); > + > + return 0; > } > > static void intel_pstate_stop_cpu(struct cpufreq_policy *policy) > { > - pr_debug("CPU %d exiting\n", policy->cpu); > + pr_debug("CPU %d stopping\n", policy->cpu); > > intel_pstate_clear_update_util_hook(policy->cpu); > if (hwp_active) > intel_pstate_hwp_save_state(policy); > - > - intel_cpufreq_stop_cpu(policy); > } > > static int intel_pstate_cpu_exit(struct cpufreq_policy *policy) > { > - intel_pstate_exit_perf_limits(policy); > + pr_debug("CPU %d exiting\n", policy->cpu); > > policy->fast_switch_possible = false; > > @@ -2398,6 +2421,8 @@ static struct cpufreq_driver intel_pstate = { > .init = intel_pstate_cpu_init, > .exit = intel_pstate_cpu_exit, > .stop_cpu = intel_pstate_stop_cpu, > + .offline= intel_pstate_cpu_offline, > + .online = intel_pstate_cpu_online, > .update_limits = intel_pstate_update_limits, > .name = "intel_pstate", > }; > @@ -2652,7 +2677,8 @@ static struct cpufreq_driver intel_cpufreq = { > .fast_switch= intel_cpufreq_fast_switch, > .init = intel_cpufreq_cpu_init, > .exit = intel_cpufreq_cpu_exit, > - .stop_cpu = intel_cpufreq_stop_cpu, > + .offline= intel_pstate_cpu_offline, > + .online = intel_pstate_cpu_online, > .update_limits = intel_pstate_update_limits, > .name = "intel_cpufreq", > }; > -- > 2.26.2 > >
RE: [PATCH 0/4] cpufreq: intel_pstate: Address some HWP-related oddities
Hi Rafael, On 2020.08.20 09:35 Rafael J. Wysocki wrote: > > The purpose of this series is to address some peculiarities related to > taking CPUs offline/online and switching between different operation > modes with HWP enabled that have become visible after allowing the > driver to work in the passive mode with HWP enabled in 5.9-rc1 (and > one that was there earlier, but can be addressed easily after the > changes made in 5.9-rc1). > > Please refer to the patch changelogs for details. > > For easier testing/review, the series is available from the git branch at: > > git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \ > intel_pstate-testing Thanks for that. There still seems to be a problem with EPP getting messed up. I have not been able to find the exact spot in the code. One problem is that EPP can end up as 0, and thereafter stays at 0. In sysfs terms, it ends up as "performance" and thereafter stays as "performance". Meanwhile I never modified it, and it started as "balance_performance". It happens when changing from active to passive if the governor is performance. If the governor is not performance things work as expected. Another problem is that EPP will end up as 128 when changing from passive to active. This erroneous condition is cleared by changing the governor to powersave and back to performance. It also doesn't occur the first time after boot, when booting to intel_cpufreq/performance/HWP. (confused yet?) The sysfs value is O.K. during this. Supporting data: Example 1: Grub: GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 msr.allow_writes=on cpuidle.governor=teo" So I boot to intel_pstate/performance/HWP: # /home/doug/c/msr-decoder (always edited for only relevant parts) 6.)
0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : epp:0 :0 :0 :0 :0 :0 : # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_pstate # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference balance_performance # echo passive > /sys/devices/system/cpu/intel_pstate/status Note: the following results are incorrect: # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # echo "ondemand" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ondemand # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E08 : 2E08 : 2E08 : 2E0B : 2E13 : 2E08 : epp:0 :0 :0 :0 :0 :0 : # echo active > /sys/devices/system/cpu/intel_pstate/status # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # echo "powersave" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor powersave # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference performance # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor powersave # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_pstate Example 2: Grub: GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 intel_pstate=passive msr.allow_writes=on cpuidle.governor=teo" So I boot to intel_cpufreq/performance/HWP: # /home/doug/c/msr-decoder 6.) 
0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : epp: 128 : 128 : 128 : 128 : 128 : 128 : # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_cpufreq # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference balance_performance # echo active > /sys/devices/system/cpu/intel_pstate/status # cat /sys/devices/system/cpu/cpufreq/policy3/energy_performance_preference balance_performance # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor performance # cat /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver intel_pstate # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : epp:0 :0 :0 :0 :0 :0 : # echo passive > /sys/devices/system/cpu/intel_pstate/status # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : 80002E2E : epp: 128 : 128 : 128 : 128 : 128 : 128 : # echo active > /sys/devices/system/cpu/intel_pstate/status Note: the following results are incorrect: # /home/doug/c/msr-decoder 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E2E : 80002E2E : 80002E2E : 80002E2E :
RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.06 05:04 Rafael J. Wysocki wrote: > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit (HWP floor) to the > P-state value given by the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other, at least when the schedutil governor > is in use, and update the intel_pstate documentation accordingly. ... Hi Rafael, You may or may not recall, I mentioned my further feedback would be delayed, as I wanted to work on reducing the labour content of my most basic CPU frequency scaler test. I have tested kernel 5.9-rc1 for pretty much every intel_pstate variant and governor, and also the acpi-cpufreq driver. Other than changing governors, changes were only made via grub command line options and re-boot. EPP or EPB were never modified, they were always whatever default. performance governor: (left mostly blank, on purpose.) acpi-cpufreq: intel_cpufreq hwp: good intel_cpufreq no hwp: intel_pstate hwp: intel_pstate no hwp: ondemand governor: acpi-cpufreq: good intel_cpufreq hwp: bad intel_cpufreq no hwp: good conservative governor: acpi-cpufreq: good intel_cpufreq hwp: good intel_cpufreq no hwp: good schedutil governor: acpi-cpufreq: good intel_cpufreq hwp: bad intel_cpufreq no hwp: good powersave governor: acpi-cpufreq: good intel_cpufreq hwp: bad intel_cpufreq no hwp: good active-powersave governor: intel_pstate hwp: ? not smooth, suffers from the broken HWP issue. intel_pstate no hwp: good. Intel_pstate hwp, idle state 2 disabled: Better but still worse for power. Now, we don't actually care about CPU frequency, we care about power: ondemand governor: periodic workflow at 347 hertz. ~58% load at 4.60 GHz (where hwp operates) ~76% load at 3.5 GHz (where no hwp operates) intel_cpufreq hwp: 14.3 processor package watts. 51.5 watts on the mains to the computer. intel_cpufreq no hwp: 9.1 processor package watts. 
45.5 watts on the mains to the computer. schedutil governor: periodic workflow at 347 hertz. ~36% load at 4.60 GHz (where hwp operates) ~55% load at 3.2 GHz (where no hwp operates) intel_cpufreq hwp: 9.6 processor package watts. 45.8 watts on the mains to the computer. intel_cpufreq no hwp: ~6 processor package watts. ~41 watts on the mains to the computer. (noisy) powersave governor: periodic workflow at 347 hertz. ~39.8% load at 2.00 GHz (where hwp operates) ~92.5% load at 0.8 GHz (where no hwp operates) intel_cpufreq hwp: 2.6 processor package watts. 38 watts on the mains to the computer. intel_cpufreq no hwp: 1.9 processor package watts. 36 watts on the mains to the computer. active-powersave governor: periodic workflow at 347 hertz. ~58% load at 4.60 GHz (where hwp operates) ~72% load at 3.88 GHz (where no hwp operates) intel_pstate hwp: 14.2 processor package watts. 52 watts on the mains to the computer. intel_pstate no hwp: 10.1 processor package watts. 48 watts on the mains to the computer. Link to web page with much of this same content which, in turn, links to various graphs. Coded, to avoid the barrage of bots: double u double u double u dot smythies dot com /~doug/linux/s18/hwp/v7/ ... Doug
RE: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.05 09:56 Rafael J. Wysocki wrote: > v6 -> v7: >* Cosmetic changes in store_energy_performance_prefernce() to reduce the > LoC number and make it a bit easier to read. No intentional functional > impact. ?? V7 is identical to V6. Diff: $ diff hwppassive-v6-2-2.patch hwppassive-v7-2-2.patch 2c2 < Sent: August 4, 2020 8:11 AM --- > Sent: August 5, 2020 9:56 AM 5c5 < Subject: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled --- > Subject: [PATCH v7] cpufreq: intel_pstate: Implement passive mode with HWP > enabled 76a77,81 > > v6 -> v7: >* Cosmetic changes in store_energy_performance_prefernce() to reduce the > LoC number and make it a bit easier to read. No intentional functional > impact. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.03 10:09 Rafael J. Wysocki wrote: > On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote: > > On 2020.07.19 04:43 Rafael J. Wysocki wrote: > > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote: > > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote: > > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies > > > > > wrote: > > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > > > >> > > > > > >> > From: Rafael J. Wysocki > > > > >> ... > > > > >> > Since the passive mode hasn't worked with HWP at all, and it is > > > > >> > not going to > > > > >> > the default for HWP systems anyway, I don't see any drawbacks > > > > >> > related to making > > > > >> > this change, so I would consider this as 5.9 material unless there > > > > >> > are any > > > > >> > serious objections. > > > > >> > > > > >> Good point. > > > > > > > > Actually, for those users that default to passive mode upon boot, > > > > this would mean they would find themselves using this. > > > > Also, it isn't obvious, from the typical "what driver and what governor" > > > > inquiry. > > > > > > So the change in behavior is that after this patch > > > intel_pstate=passive doesn't imply no_hwp any more. > > > > > > That's a very minor difference though and I'm not aware of any adverse > > > effects it can cause on HWP systems anyway. > > > > My point was, that it will now default to something where > > testing has not been completed. > > > > > The "what governor" is straightforward in the passive mode: that's > > > whatever cpufreq governor has been selected. > > > > I think you might have missed my point. > > From the normal methods of inquiry one does not know > > if HWP is being used or not. Why? Because with > > or without HWP one gets the same answers under: > > > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor > > Yes, but this is also the case in the active mode, isn't it? Yes, fair enough. 
But we aren't changing what it means by default between kernel 5.8 and 5.9-rc1. ... Doug
RE: [PATCH v6] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, I was just writing you about V5 when this V6 came. On 2020.08.04 08:11 Rafael J. Wysocki wrote: ... > This is on top of the material already in the mainline. Oh, should have read that part better, but did get there in the end. ... > v5 -> v6: >* Fix the problem with the EPP setting via sysfs not working with the > performance and powersave governors by stopping and restarting the > governor around the sysfs-based EPP updates in the passive mode. >* Because of that, use the epp_cached field just for avoiding the above > if the new EPP value for the given CPU is the same as the old one. >* Export cpufreq_start/stop_governor() from the core (for the above). EPP is still not right. I am not messing with it at all, just observing via my msr-decoder. I booted without any intel_pstate related directives for the kernel command line. The below is as expected (performance gov.): # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 8845 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 8843 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 109252E : high 46 : guaranteed 37 : efficient 9 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : 2E2E : min: 46 : 46 : 46 : 46 : 46 : 46 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp:0 :0 :0 :0 :0 :0 : act:0 :0 :0 :0 :0 :0 : 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 and then switched to passive mode later. EPP is not as expected. Expect 0 (performance mode): # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 
0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 8844 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 8842 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 108252E : high 46 : guaranteed 37 : efficient 8 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E : FF002E2E : min: 46 : 46 : 46 : 46 : 46 : 46 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 255 : 255 : 255 : 255 : 255 : 255 : act:0 :0 :0 :0 :0 :0 : 7.) 0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 Then switched to ondemand governor, and put 100% load on 2 CPUs. EPP is not as expected, which I don't actually know what to expect, but assume 128: # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 883B 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882B 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: FF002E09 : FF002E0C : FF002E2E : FF002E08 : FF002E2E : FF002E18 : min:9 : 12 : 46 :8 : 46 : 24 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 255 : 255 : 255 : 255 : 255 : 255 : act:0 :0 :0 :0 :0 :0 : 7.) 
0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 For what it's worth, Kernel: 78b39581ed85 (HEAD -> dtemp) cpufreq: intel_pstate: Implement passive mode with HWP enabled c0842fbc1b18 (origin/master, origin/HEAD, master) random32: move the pseudo-random 32-bit definitions to prandom.h 2baa85d6927d Merge tag 'acpi-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm 04084978003c Merge
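The raw IA32_HWP_REQUEST values quoted above (0x2E2E with EPP 0 versus 0xFF002E2E with EPP 255) can be sanity-checked without the msr-decoder tool; the field split is just a few shifts. A minimal sketch (not Doug's actual decoder), with the field layout taken from the Intel SDM description of MSR 0x774:

```python
def decode_hwp_request(raw):
    """Split a raw IA32_HWP_REQUEST (MSR 0x774) value into its fields.

    Field layout per the Intel SDM: bits 7:0 minimum performance,
    15:8 maximum performance, 23:16 desired performance, 31:24 EPP,
    41:32 activity window, bit 42 package control.
    """
    return {
        "min": raw & 0xFF,
        "max": (raw >> 8) & 0xFF,
        "des": (raw >> 16) & 0xFF,
        "epp": (raw >> 24) & 0xFF,
        "act": (raw >> 32) & 0x3FF,   # activity window
        "pkg": (raw >> 42) & 0x1,     # package control bit
    }

# Values observed above: performance governor before and after the glitch.
print(decode_hwp_request(0x2E2E))      # epp 0, min/max pinned to 46 (0x2E)
print(decode_hwp_request(0xFF002E2E))  # epp 255, i.e. "power" not "performance"
```

This makes the reported problem easy to state: only the top byte (EPP) is wrong after the mode switch; the min and max performance bytes stay at the expected 0x2E.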
RE: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Srinivas, Thanks for your help. I was missing several needed patches. On 2020.08.02 11:39 Srinivas Pandruvada wrote: > On Sun, 2020-08-02 at 07:00 -0700, Doug Smythies wrote: > > On 2020.08.01 09:40 Srinivas Pandruvada wrote: > > > > On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote: > > > > > On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki > > > > > wrote: > > > > > > This really is a v2 of this patch: > > > > > > > > > > > > https://patchwork.kernel.org/patch/11663271/ > > > > > > > > > > > > with an extra preceding cleanup patch to avoid making > > > > > > unrelated > > > > > > changes in the > > > > > > [2/2]. > > > I applied this series along with > > > [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active > > > mode > > > on 5.8 latest master (On top of raw epp patchset). > > > > Would you be kind enough to provide a "git log --oneline" output > > of what you did. > > 69dd9b2b11cd (HEAD -> 5-9-devel) cpufreq: intel_pstate: Implement > passive mode with HWP enabled > 63efaa01b06a cpufreq: intel_pstate: Fix EPP setting via sysfs in active > mode > e11e0a2edf83 cpufreq: intel_pstate: Rearrange the storing of new EPP > values > 93c3fd6a315c cpufreq: intel_pstate: Avoid enabling HWP if EPP is not > supported > 7cef1dd371c3 cpufreq: intel_pstate: Clean up aperf_mperf_shift > description > a3248d8d3a11 cpufreq: intel_pstate: Supply struct attribute description > for get_aperf_mperf_shift() > f52b6b075b07 cpufreq: intel_pstate: Fix static checker warning for epp > variable > 4a59d6be0774 cpufreq: intel_pstate: Allow raw energy performance > preference value > 7b34b5acdcc6 cpufreq: intel_pstate: Allow enable/disable energy > efficiency > ac3a0c847296 (origin/master, origin/HEAD, master) Merge > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net > > > > I have been trying unsuccessfully to apply the patches, > > so somewhere I obviously missed something. 
> > > > > When intel_pstate=passive from kernel command line then it is fine, > > > no > > > crash. But switch dynamically, crashed: > > > > I'll try to repeat, if I can get an actual kernel. I could not repeat your crash. I tried booting with and without intel_pstate=passive on the kernel command line and then switching back and forth thereafter. However, I do confirm EPP is messed up. But not min and max from MSR 0x774, they behave as expected, based on quick testing only. Since you mentioned: >>> Don't have a recipe to reproduce this. Maybe I simply didn't hit whatever. ... Doug Useless additional stuff: # cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-5.8.0-rc7-dssp root=UUID=0ac356c1-caa9-4c2e-8229-4408bd998dbd ro ipv6.disable=1 consoleblank=450 intel_pstate=passive cpuidle_sysfs_switch cpuidle.governor=teo Went "active" then "passive" and set ondemand governor. 2 X 100% CPU loads: # /home/doug/c/msr-decoder How many CPUs?: 6 8.) 0x198: IA32_PERF_STATUS : CPU 0-5 : 46 : 46 : 46 : 46 : 46 : 46 : B.) 0x770: IA32_PM_ENABLE: 1 : HWP enable 1.) 0x19C: IA32_THERM_STATUS: 883C 2.) 0x1AA: MSR_MISC_PWR_MGMT: 401CC0 EIST enabled Coordination enabled OOB Bit 8 reset OOB Bit 18 reset 3.) 0x1B1: IA32_PACKAGE_THERM_STATUS: 882D 4.) 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 0 A.) 0x1FC: MSR_POWER_CTL: 3C005D : C1E disable : EEO disable : RHO disable 5.) 0x771: IA32_HWP_CAPABILITIES (performance): 10B252E : high 46 : guaranteed 37 : efficient 11 : lowest 1 6.) 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: FF002E0A : FF002E2E : FF002E2E : FF002E08 : FF002E18 : FF002E08 : min: 10 : 46 : 46 :8 : 24 :8 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 255 : 255 : 255 : 255 : 255 : 255 : act:0 :0 :0 :0 :0 :0 : 7.) 
0x777: IA32_HWP_STATUS: 4 : high 4 : guaranteed 0 : efficient 0 : lowest 0 Kernel: d72c8472dbd5 (HEAD -> k58rc7-d3) cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode c2f4869fbc27 cpufreq: intel_pstate: Implement passive mode with HWP enabled 85219968fab9 cpufreq: intel_pstate: Rearrange the storing of new EPP values 5c09a1a38106 cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported 9f29c81fe0b3 cpufreq: intel_pstate: Clean up aperf_mperf_shift description 2a863c241495 cpufreq: intel_pstate: Supply struct attribute description for get_aperf_mperf_shift() 4180d8413037 cpufreq: intel_pstate: Fix static checker warning for epp variable 7cd50e86a9e6 cpufreq: intel_pstate: Allow raw energy performance preference value 56dce9a1081e cpufreq: intel_pstate: Allow enable/disable energy efficiency 92ed30191993 (tag: v5.8-rc7) Linux 5.8-rc7
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, On 2020.07.19 04:43 Rafael J. Wysocki wrote: > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote: > > On 2020.07.16 05:08 Rafael J. Wysocki wrote: > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies > > > wrote: > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > >> > > > >> > From: Rafael J. Wysocki > > >> ... > > >> > Since the passive mode hasn't worked with HWP at all, and it is not > > >> > going to > > >> > the default for HWP systems anyway, I don't see any drawbacks related > > >> > to making > > >> > this change, so I would consider this as 5.9 material unless there are > > >> > any > > >> > serious objections. > > >> > > >> Good point. > > > > Actually, for those users that default to passive mode upon boot, > > this would mean they would find themselves using this. > > Also, it isn't obvious, from the typical "what driver and what governor" > > inquiry. > > So the change in behavior is that after this patch > intel_pstate=passive doesn't imply no_hwp any more. > > That's a very minor difference though and I'm not aware of any adverse > effects it can cause on HWP systems anyway. My point was, that it will now default to something where testing has not been completed. > The "what governor" is straightforward in the passive mode: that's > whatever cpufreq governor has been selected. I think you might have missed my point. From the normal methods of inquiry one does not know if HWP is being used or not. Why? Because with or without HWP one gets the same answers under: /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ... Doug
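Since scaling_driver and scaling_governor look the same either way, one way to answer the "is HWP in use" question from user space is to read IA32_PM_ENABLE (MSR 0x770) directly, e.g. with rdmsr from msr-tools, as the msr-decoder dumps in these threads do; bit 0 is the HWP enable bit. A trivial sketch of the check (the raw value would come from /dev/cpu/*/msr and needs root; here it is passed in):

```python
def hwp_enabled(pm_enable_raw):
    # IA32_PM_ENABLE (MSR 0x770) bit 0: HWP enable.
    # Note this bit is write-once; it stays set until reset.
    return bool(pm_enable_raw & 0x1)

print(hwp_enabled(0x1))  # True, matching "IA32_PM_ENABLE: 1 : HWP enable"
```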
RE: [PATCH v4 2/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.01 16:41 Srinivas Pandruvada wrote: > On Tue, 2020-07-28 at 17:13 +0200, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit (HWP floor) to the > > P-state value given by the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other, at least when the schedutil governor > > is in use, and update the intel_pstate documentation accordingly. > > > > Among other things, this allows utilization clamps to be taken > > into account, at least to a certain extent, when intel_pstate is > > in use and makes it more likely that sufficient capacity for > > deadline tasks will be provided. > > > > After this change, the resulting behavior of an HWP system with > > intel_pstate in the passive mode should be close to the behavior > > of the analogous non-HWP system with intel_pstate in the passive > > mode, except that in the frequency range below the base frequency > > (ie. the frequency retured by the base_frequency cpufreq attribute > > in sysfs on HWP systems) the HWP algorithm is allowed to go above > > the floor P-state set by intel_pstate with or without hardware > > coordination of P-states among CPUs in the same package. > > > Do you mean HWP.req.min will be below base_freq (unless user overrides > it)? No. > With busy workload I see HWP req.min = HWP req.max. > The base freq: 1.3GHz (ratio 0x0d), MAX 1C turbo: 3.9GHz (ratio: 0x27) > When I monitor MSR 0x774 (HWP_REQ), I see > 0x80002727 Yes, that is what I expect to see. > > Normally msr 0x774 > 0x80002704 That would be "active" mode and the powersave governor, correct?. And yes that is what I expect for your processor. 
For mine, load or no load, decoded: 0x774: IA32_HWP_REQUEST:CPU 0-5 : raw: 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : 80002E08 : min:8 :8 :8 :8 :8 :8 : max: 46 : 46 : 46 : 46 : 46 : 46 : des:0 :0 :0 :0 :0 :0 : epp: 128 : 128 : 128 : 128 : 128 : 128 : act:0 :0 :0 :0 :0 :0 : This thread is about passive mode, and myself, I do not expect the last byte to be 4 (8 for mine) under load. ... Doug
RE: [PATCH v4 0/2] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.01 09:40 Srinivas Pandruvada wrote: >> On Monday, July 27, 2020 5:13:40 PM CEST Rafael J. Wysocki wrote: >>> On Thursday, July 16, 2020 7:37:04 PM CEST Rafael J. Wysocki wrote: This really is a v2 of this patch: https://patchwork.kernel.org/patch/11663271/ with an extra preceding cleanup patch to avoid making unrelated changes in the [2/2]. >>> > I applied this series along with > [PATCH] cpufreq: intel_pstate: Fix EPP setting via sysfs in active mode > on 5.8 latest master (On top of raw epp patchset). Hi Srinivas, Would you be kind enough to provide a "git log --oneline" output of what you did. I have been trying unsuccessfully to apply the patches, so somewhere I obviously missed something. > When intel_pstate=passive from kernel command line then it is fine, no > crash. But switch dynamically, crashed: I'll try to repeat, if I can get an actual kernel. > Attached crash.txt. I may need to try your linux-pm tree. I also tried the linux-pm tree, same. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, Thank you for your reply. On 2020.07.16 05:08 Rafael J. Wysocki wrote: > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote: >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: >> > >> > From: Rafael J. Wysocki >> ... >> > Since the passive mode hasn't worked with HWP at all, and it is not going >> > to >> > the default for HWP systems anyway, I don't see any drawbacks related to >> > making >> > this change, so I would consider this as 5.9 material unless there are any >> > serious objections. >> >> Good point. Actually, for those users that default to passive mode upon boot, this would mean they would find themselves using this. Also, it isn't obvious, from the typical "what driver and what governor" inquiry. >> Some of the tests I do involve labour intensive post processing of data. >> I want to automate some of that work, and it will take time. >> We might be into the 5.9-rc series before I have detailed feedback. >> >> However, so far: >> >> Inverse impulse response test [1]: >> >> High level test, i5-9600K, HWP-passive (this patch), ondemand: >> 3101 tests. 0 failures. (GOOD) >> >> From [1], re-stated: >> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. >> > (HWP-active - powersave) >> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 >> > failures. >> >> My version of that cool Alexander named pipe test [2] serialized workflow: >> >> HWP-passive (this patch), performance: PASS. >> >> From [2], re-stated, and also re-tested. >> HWP-disabled passive - performance: FAIL. > > But I'm not quite sure how this is related to this patch? It isn't. The point being that it is different. But yes, that failure is because of our other discussion [3]. > > This test would still fail without the patch if the kernel was started > with intel_pstate=passive in the kernel command line, wouldn't it. Yes. > > > Although, I believe the issue to be EPB management, [3]. > > Well, that's kind of unexpected. 
> > If this really is the case, then it looks like the effect of the EPB > setting on the processor is different depending on whether or not HWP > is enabled. > >> And yes, I did see the reply to [3] that came earlier, >> And have now re-done the test, with the referenced patch added. >> It still is FAIL. I reply to the [3] thread, eventually. >> >> [1] https://marc.info/?l=linux-pm&m=159354421400342&w=2 >> [2] https://marc.info/?l=linux-pm&m=159155067328641&w=2 >> [3] https://marc.info/?l=linux-pm&m=159438804230744&w=2 >> >> Kernel: >> >> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP >> if EPP is not supported >> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled >> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference >> value >> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency >> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line >> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux >> 5.8-rc5 >> >> Rules for this work: >> >> . never use x86_energy_perf_policy. >> . For HWP disabled: never change from active to passive or vice versa, but >> rather do it via boot. >> . after boot always check and reset the various power limit log bits that >> are set. >> . never compile the kernel (well, until after any tests), which will set >> those bits again. >> . never run prime95 high heat torture test, which will set those bits again. >> . try to never do anything else that will set those bits again. >> >> To be clear, I do allow changing governors within the context of the above >> rules. > > Thanks for the feedback!
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki ... > Since the passive mode hasn't worked with HWP at all, and it is not going to > the default for HWP systems anyway, I don't see any drawbacks related to > making > this change, so I would consider this as 5.9 material unless there are any > serious objections. Good point. Some of the tests I do involve labour intensive post processing of data. I want to automate some of that work, and it will take time. We might be into the 5.9-rc series before I have detailed feedback. However, so far: Inverse impulse response test [1]: High level test, i5-9600K, HWP-passive (this patch), ondemand: 3101 tests. 0 failures. (GOOD) From [1], re-stated: > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active > - powersave) > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures. My version of that cool Alexander named pipe test [2] serialized workflow: HWP-passive (this patch), performance: PASS. From [2], re-stated, and also re-tested. HWP-disabled passive - performance: FAIL. Although, I believe the issue to be EPB management, [3]. And yes, I did see the reply to [3] that came earlier, And have now re-done the test, with the referenced patch added. It still is FAIL. I reply to the [3] thread, eventually.
[1] https://marc.info/?l=linux-pm&m=159354421400342&w=2 [2] https://marc.info/?l=linux-pm&m=159155067328641&w=2 [3] https://marc.info/?l=linux-pm&m=159438804230744&w=2 Kernel: b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference value bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5 Rules for this work: . never use x86_energy_perf_policy. . For HWP disabled: never change from active to passive or vice versa, but rather do it via boot. . after boot always check and reset the various power limit log bits that are set. . never compile the kernel (well, until after any tests), which will set those bits again. . never run prime95 high heat torture test, which will set those bits again. . try to never do anything else that will set those bits again. To be clear, I do allow changing governors within the context of the above rules. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Fix active mode setting from command line
Hi Rafael,

Thank you.

On 2020.07.13 06:59 Rafael J. Wysocki wrote:
>
> From: Rafael J. Wysocki
>
> If intel_pstate starts in the passive mode by default (that happens
> when the processor in the system doesn't support HWP), passing
> intel_pstate=active in the kernel command line doesn't work, so
> fix that.
>
> Fixes: 33aa46f252c7 ("cpufreq: intel_pstate: Use passive mode by default without HWP")
> Reported-by: Doug Smythies
> Signed-off-by: Rafael J. Wysocki

Acked-by: Doug Smythies

> ---
> drivers/cpufreq/intel_pstate.c | 8 ++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -2534,7 +2534,7 @@ static struct cpufreq_driver intel_cpufr
>  	.name		= "intel_cpufreq",
>  };
>
> -static struct cpufreq_driver *default_driver = &intel_pstate;
> +static struct cpufreq_driver *default_driver;
>
>  static void intel_pstate_driver_cleanup(void)
>  {
> @@ -2828,6 +2828,7 @@ static int __init intel_pstate_init(void
>  			hwp_active++;
>  			hwp_mode_bdw = id->driver_data;
>  			intel_pstate.attr = hwp_cpufreq_attrs;
> +			default_driver = &intel_pstate;
>  			goto hwp_cpu_matched;
>  		}
>  	} else {
> @@ -2845,7 +2846,8 @@ static int __init intel_pstate_init(void
>  		return -ENODEV;
>  	}
>  	/* Without HWP start in the passive mode. */
> -	default_driver = &intel_cpufreq;
> +	if (!default_driver)
> +		default_driver = &intel_cpufreq;
>
>  hwp_cpu_matched:
>  	/*
> @@ -2899,6 +2901,8 @@ static int __init intel_pstate_setup(cha
>
>  	if (!strcmp(str, "disable")) {
>  		no_load = 1;
> +	} else if (!strcmp(str, "active")) {
> +		default_driver = &intel_pstate;
>  	} else if (!strcmp(str, "passive")) {
>  		default_driver = &intel_cpufreq;
>  		no_hwp = 1;
>
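For readers skimming the thread, the effect of the patch above can be sketched as a standalone decision function. This is an illustrative model only, not the kernel code: the strings stand in for &intel_pstate / &intel_cpufreq, and hwp_active stands in for the CPUID/HWP match in intel_pstate_init():

```python
def pick_driver(cmdline_arg, hwp_active):
    """Model of default-driver selection after the fix."""
    default_driver = None

    # intel_pstate_setup(): the command line now chooses a default
    # for "active" too, not only for "passive".
    if cmdline_arg == "active":
        default_driver = "intel_pstate"
    elif cmdline_arg == "passive":
        default_driver = "intel_cpufreq"

    # intel_pstate_init(): HWP systems still default to active mode...
    if hwp_active and default_driver is None:
        default_driver = "intel_pstate"

    # ...and only when nothing was chosen do we fall back to passive.
    if default_driver is None:
        default_driver = "intel_cpufreq"
    return default_driver

# The bug being fixed: on a no-HWP system, intel_pstate=active
# previously still ended up with the passive-mode driver.
print(pick_driver("active", False))  # intel_pstate
print(pick_driver(None, False))      # intel_cpufreq
```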
RE: [PATCH 2/2] cpufreq: intel_pstate: Use passive mode by default without HWP
Hi Rafael, As you may or may not recall, I am attempting to untangle and separate multiple compounding issues around the intel_pstate driver and HWP (or not). Until everything is figured out, I am using the following rules: . never use x86_energy_perf_policy. . For HWP disabled: never change from active to passive or vice versa, but rather do it via boot. . after boot always check and reset the various power limit log bits that are set. . never compile the kernel (well, until after any tests), which will set those bits again. . never run prime95 high heat torture test, which will set those bits again. . try to never do anything else that will set those bits again. On 2020.03.28 05:58 Rafael J. Wysocki wrote: > > From: "Rafael J. Wysocki" > > After recent changes allowing scale-invariant utilization to be > used on x86, the schedutil governor on top of intel_pstate in the > passive mode should be on par with (or better than) the active mode > "powersave" algorithm of intel_pstate on systems in which > hardware-managed P-states (HWP) are not used, so it should not be > necessary to use the internal scaling algorithm in those cases. > > Accordingly, modify intel_pstate to start in the passive mode by > default if the processor at hand does not support HWP or if the driver > is requested to avoid using HWP through the kernel command line. > > Among other things, that will allow utilization clamps and the > support for RT/DL tasks in the schedutil governor to be utilized on > systems in which intel_pstate is used. > > Signed-off-by: Rafael J.
Wysocki > --- > Documentation/admin-guide/pm/intel_pstate.rst | 32 > --- > drivers/cpufreq/intel_pstate.c| 3 ++- > 2 files changed, 21 insertions(+), 14 deletions(-) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin- > guide/pm/intel_pstate.rst > index ad392f3aee06..39d80bc29ccd 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -62,9 +62,10 @@ on the capabilities of the processor. > Active Mode > --- > > -This is the default operation mode of ``intel_pstate``. If it works in this > -mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all > ``CPUFreq`` > -policies contains the string "intel_pstate". > +This is the default operation mode of ``intel_pstate`` for processors with > +hardware-managed P-states (HWP) support. If it works in this mode, the > +``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` policies > +contains the string "intel_pstate". > > In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` > and > provides its own scaling algorithms for P-state selection. Those algorithms > @@ -138,12 +139,13 @@ internal P-state selection logic to be less > performance-focused. > Active Mode Without HWP > ~~~ > > -This is the default operation mode for processors that do not support the HWP > -feature. It also is used by default with the ``intel_pstate=no_hwp`` > argument > -in the kernel command line. However, in this mode ``intel_pstate`` may > refuse > -to work with the given processor if it does not recognize it. [Note that > -``intel_pstate`` will never refuse to work with any processor with the HWP > -feature enabled.] > +This operation mode is optional for processors that do not support the HWP > +feature or when the ``intel_pstate=no_hwp`` argument is passed to the kernel > in > +the command line. 
The active mode is used in those cases if the > +``intel_pstate=active`` argument is passed to the kernel in the command line. ??? I can not see anywhere in the code where the kernel command line argument "intel_pstate=active" is dealt with. > +In this mode ``intel_pstate`` may refuse to work with processors that are not > +recognized by it. [Note that ``intel_pstate`` will never refuse to work with > +any processor with the HWP feature enabled.] > > In this mode ``intel_pstate`` registers utilization update callbacks with the > CPU scheduler in order to run a P-state selection algorithm, either > @@ -188,10 +190,14 @@ is not set. > Passive Mode > > > -This mode is used if the ``intel_pstate=passive`` argument is passed to the > -kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting > too). > -Like in the active mode without HWP support, in this mode ``intel_pstate`` > may > -refuse to work with the given processor if it does not recognize it. > +This is the default operation mode of ``intel_pstate`` for processors without > +hardware-managed P-states (HWP) support. It is always used if the > +``intel_pstate=passive`` argument is passed to the kernel in the command line > +regardless of whether or not the given processor supports HWP. [Note that > the > +``intel_pstate=no_hwp`` setting implies ``intel_pstate=passive`` if it is > used > +without
RE: [PATCH v4 2/2] cpufreq: intel_pstate: Allow raw energy performance preference value
Hi Srinivas, Thanks for all your work on this. I have fallen behind, and not sure when I can catch up. However... On 2020.06.26 11:34 Srinivas Pandruvada wrote: > Similarly on battery the default "balance_performance" mode can be > aggressive in power consumption. But picking up the next choice > "balance power" results in too much loss of performance, which results in > bad user experience in use cases like "Google Hangout". It was observed > that some value between these two EPP is optimal. There is a possibility that one of the issues I have been ranting about could be a contributing factor to things like this. (I don't know if it actually is.) One way to compensate is to lower EPP. I am going to send a new e-mail in a minute about it. Please consider the possibility that some of these EPP adjustments might just be programming around the issue. ... Doug
RE: [PATCH 1/2] cpufreq: intel_pstate: Allow enable/disable energy efficiency
Hi Srinivas, I saw your V3. I do not understand your reluctance to use arch/x86/include/asm/msr-index.h as the place to define anything MSR related. Please explain. One more comment about 1/3 of the way down below. ... Doug On 2020.06.23 08:53 Doug Smythies wrote: > On 2020.06.22 22:13 Srinivas Pandruvada wrote: > > > By default intel_pstate driver disables energy efficiency by setting > > MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. > > This CPU model is also shared by Coffee Lake desktop CPUs. This allows > > these systems to reach maximum possible frequency. But this adds power > > penalty, which some customers don't want. They want some way to enable/ > > disable dynamically. > > > > So, add an additional attribute "energy_efficiency_enable" under > > /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows > > to read and write bit 19 ("Disable Energy Efficiency Optimization") in > > the MSR IA32_POWER_CTL. > > > > This attribute is present in both HWP and non-HWP mode as this has an > > effect in both modes. Refer to Intel Software Developer's manual for > > details. The scope of this bit is package wide. > > I do and always have. However these manuals are 1000's of pages, > are updated often, and it can be difficult to find the correct page > for the correct processor. So it is great that, in general, the same > mnemonics are used in the code as the manuals. 
> > > > > Suggested-by: Len Brown > > Signed-off-by: Srinivas Pandruvada > > --- > > Documentation/admin-guide/pm/intel_pstate.rst | 7 +++ > > drivers/cpufreq/intel_pstate.c| 49 ++- > > 2 files changed, 54 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > > b/Documentation/admin- > > guide/pm/intel_pstate.rst > > index 39d80bc29ccd..939bfdc53f4f 100644 > > --- a/Documentation/admin-guide/pm/intel_pstate.rst > > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > > @@ -431,6 +431,13 @@ argument is passed to the kernel in the command line. > > supported in the current configuration, writes to this attribute will > > fail with an appropriate error. > > > > +``energy_efficiency_enable`` > > + This attribute is only present on platforms, which has CPUs matching > > + Kaby Lake desktop CPU model. By default "energy_efficiency" is disabled > > So, why not mention Coffee Lake also, as you did above? And I still think you need to add "Coffee Lake" here also. > > > + on these CPU models in HWP mode by this driver. Enabling energy > > + efficiency may limit maximum operating frequency in both HWP and non > > + HWP mode. 
> > + > > Interpretation of Policy Attributes > > --- > > > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > > index 8e23a698ce04..1cf6d06f2314 100644 > > --- a/drivers/cpufreq/intel_pstate.c > > +++ b/drivers/cpufreq/intel_pstate.c > > @@ -1218,6 +1218,44 @@ static ssize_t store_hwp_dynamic_boost(struct > > kobject *a, > > return count; > > } > > > > +#define MSR_IA32_POWER_CTL_BIT_EE 19 > > (same comment as the other day, for another patch) In my opinion and for > consistency, this bit should be defined in > > arch/x86/include/asm/msr-index.h > > like so (I defined the other bits also): > > diff --git a/arch/x86/include/asm/msr-index.h > b/arch/x86/include/asm/msr-index.h > index e8370e64a155..1a6084067f18 100644 > --- a/arch/x86/include/asm/msr-index.h > +++ b/arch/x86/include/asm/msr-index.h > @@ -255,6 +255,12 @@ > > #define MSR_IA32_POWER_CTL 0x01fc > > +/* POWER_CTL bits (some are model specific): */ > + > +#define POWER_CTL_C1E 1 > +#define POWER_CTL_EEO 19 > +#define POWER_CTL_RHO 20 > + > #define MSR_IA32_MC0_CTL 0x0400 > #define MSR_IA32_MC0_STATUS0x0401 > #define MSR_IA32_MC0_ADDR 0x0402 > > There is another almost identical file at: > > tools/arch/x86/include/asm/msr-index.h > > and I am not sure which one is the master, but > the former contains stuff that the later does not. > > I have defined the bits names in a consistent way > as observed elsewhere in the file. i.e. drop the > "MSR_IA32_" prefix and add the bit. > > Now, tying this back to my earlier comment about the > manuals: > The file "tools/arch/x86/include/asm/msr-index.h"
RE: [PATCH v2 2/2] cpufreq: intel_pstate: Allow raw energy performance preference value
Hi Srinivas, I have immediate need for this. I have been using a tool I wrote myself for this which I can now retire. (it wasn't very good anyway). Yours remembers for each governor, and is way better. Thanks. On 2020.06.23 11:27 Srinivas Pandruvada wrote: > Currently using attribute "energy_performance_preference", user space can > write one of the four per-defined preference string. These preference > strings gets mapped to a hard-coded Energy-Performance Preference (EPP) or > Energy-Performance Bias (EPB) knob. > > These four values supposed to cover broad spectrum of use cases, but they > are not uniformly distributed in the range. Suggest: These four values are supposed to cover broad spectrum of use cases, but are not uniformly distributed in the range. > There are number of cases, > where this is not enough. For example: > > Suppose user wants more performance when connected to AC. Instead of using > default "balance performance", the "performance" setting can be used. This > changes EPP value from 0x80 to 0x00. But setting EPP to 0, results in > electrical and thermal issues on some platforms. > This results in CPU to do > aggressive throttling, which causes drop in performance. Suggest: This results in aggressive throttling, which causes a drop in performance. And: Tough. I consider "performance mode" as sacrosanct, and have always expected these to behave identically and at max CPU freq: intel_pstate no-hwp / performance intel_cpufreq no-hwp / performance (a.k.a. passive) acpi_cpufreq / performance intel_pstate hwp / performance intel_cpufreq hwp / performance (in future) as was always the case on my i7-2600K (no hwp) based computer and is not the case on my i5-9600K (hwp capable) computer. > But some value > between 0x80 and 0x00 results in better performance. But that value can't > be fixed as the power curve is not linear. In some cases just changing EPP > from 0x80 to 0x75 is enough to get significant performance gain. 
> > Similarly on battery EPP 0x80 can be very aggressive in power consumption. > But picking up the next choice "balance power" results in too much loss > of performance, which cause bad user experience in use case like "Google > Hangout". It was observed that some value between these two EPP is > optimal. > > This change allows fine grain EPP tuning for platform like Chromebooks. > Here based on the product and use cases, different EPP values can be set. > This change is similar to the change done for: > /sys/devices/system/cpu/cpu*/power/energy_perf_bias > where user has choice to write a predefined string or raw value. > > The change itself is trivial. When user preference doesn't match > predefined string preferences and value is an unsigned integer and in > range, use that value for EPP. When the EPP feature is not prsent ^^ s/prsent/present > writing raw value is not supported. > > Suggested-by: Len Brown > Signed-off-by: Srinivas Pandruvada > --- > Documentation/admin-guide/pm/intel_pstate.rst | 6 ++- > drivers/cpufreq/intel_pstate.c| 50 +++ > 2 files changed, 45 insertions(+), 11 deletions(-) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin- > guide/pm/intel_pstate.rst > index 939bfdc53f4f..5e209926e0ed 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -561,7 +561,11 @@ somewhere between the two extremes: > Strings written to the ``energy_performance_preference`` attribute are > internally translated to integer values written to the processor's > Energy-Performance Preference (EPP) knob (if supported) or its > -Energy-Performance Bias (EPB) knob. > +Energy-Performance Bias (EPB) knob. It is also possible to write a positive > +integer value between 0 to 255, if the EPP feature is present. If the EPP > +feature is not present, writing integer value to this attribute is not > +supported. 
In this case, user can use > + "/sys/devices/system/cpu/cpu*/power/energy_perf_bias" interface. > > [Note that tasks may by migrated from one CPU to another by the scheduler's > load-balancing algorithm and if different energy vs performance hints are > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 1cf6d06f2314..d8f195c7a428 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -602,11 +602,12 @@ static const unsigned int epp_values[] = { > HWP_EPP_POWERSAVE > }; > > -static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data) > +static int intel_pstate_get_energy_pref_index(struct cpudata *cpu_data, int > *raw_epp) > { > s16 epp; > int index = -EINVAL; > > + *raw_epp = 0; > epp = intel_pstate_get_epp(cpu_data, 0); > if (epp < 0) > return epp; > @@ -614,12 +615,14 @@ static int
RE: [PATCH v2 0/2] cpufreq: Specify the default governor on command line
Hi Quentin, Thanks for your quick reply. On 2020.06.23 11:05 Quentin Perret wrote: > Hi Doug, > > On Tuesday 23 Jun 2020 at 10:54:33 (-0700), Doug Smythies wrote: > > Hi Quentin, > > > > Because I am lazy and sometimes do not want to recompile > > the distro source, I have a need/desire for this. > > Good to know I'm not the only one ;-) > > > Tested these two grub command lines: > > > > GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=300 > > intel_pstate=disable > cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" > > > > And > > > > #GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 > > intel_pstate=passive > cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" > > > > And all worked as expected. I use Ubuntu as my distro, and also had to > > disable a startup script that > switches to "ondemand", or similar, after 1 minute. > > Good, thanks for giving it a try. > > > As a side note (separate subject, but is one reason I tried it): > > My i5-9600K based computer seems to hit a power limit during boot > > approximately 3 seconds after > kernel selection on grub. > > This had no effect on that issue (even when selecting powersave governor). > > Interesting ... Could you confirm that compiling with powersave as > default doesn't fix the issue either? No, it doesn't (good idea for a test though). However, the big mains spike is also gone. So, I no longer know why those power limit log bits are always set after boot. > > Other question, when does the intel_pstate driver start on your device? > Before or after that 3 seconds boot time? Before, if I understand correctly (from dmesg): [0.468969] intel_pstate: Intel P-state driver initializing I'll attach a couple of annotated mains power graphs. (which will likely get stripped from the on-list version of this e-mail). Currently, I am drowning in stuff that doesn't work, and will put this aside for now. 
I'll revive this as a new thread or a bugzilla eventually. I also tried booting with turbo disabled, no difference. Thanks for this patch set. ... Doug
RE: [PATCH v2 0/2] cpufreq: Specify the default governor on command line
On 2020.06.23 07:22 Quentin Perret wrote: > > This series enables users of prebuilt kernels (e.g. distro kernels) to > specify their CPUfreq governor of choice using the kernel command line, > instead of having to wait for the system to fully boot to userspace to > switch using the sysfs interface. This is helpful for 2 reasons: > 1. users get to choose the governor that runs during the actual boot; > 2. it simplifies the userspace boot procedure a bit (one less thing to > worry about). > > To enable this, the first patch moves all governor init calls to > core_initcall, to make sure they are registered by the time the drivers > probe. This should be relatively low impact as registering a governor > is a simple procedure (it gets added to a llist), and all governors > already load at core_initcall anyway when they're set as the default > in Kconfig. This also allows to clean-up the governors' init/exit code, > and reduces boilerplate. > > The second patch introduces the new command line parameter, inspired by > its cpuidle counterpart. More details can be found in the respective > patch headers. 
> > Changes in v2: > - added Viresh's ack to patch 01 > - moved the assignment of 'default_governor' in patch 02 to the governor >registration path instead of the driver registration (Viresh) > > Thanks, > Quentin > > Quentin Perret (2): > cpufreq: Register governors at core_initcall > cpufreq: Specify default governor on command line > > .../admin-guide/kernel-parameters.txt | 5 > Documentation/admin-guide/pm/cpufreq.rst | 6 ++--- > .../platforms/cell/cpufreq_spudemand.c| 26 ++- > drivers/cpufreq/cpufreq.c | 23 > drivers/cpufreq/cpufreq_conservative.c| 22 > drivers/cpufreq/cpufreq_ondemand.c| 24 + > drivers/cpufreq/cpufreq_performance.c | 14 ++ > drivers/cpufreq/cpufreq_powersave.c | 18 +++-- > drivers/cpufreq/cpufreq_userspace.c | 18 +++-- > include/linux/cpufreq.h | 14 ++ > kernel/sched/cpufreq_schedutil.c | 6 + > 11 files changed, 62 insertions(+), 114 deletions(-) > > -- > 2.27.0.111.gc72c7da667-goog Hi Quentin, Because I am lazy and sometimes do not want to recompile the distro source, I have a need/desire for this. Tested these two grub command lines: GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=300 intel_pstate=disable cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" And #GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1 consoleblank=450 intel_pstate=passive cpufreq.default_governor=schedutil cpuidle_sysfs_switch cpuidle.governor=teo" And all worked as expected. I use Ubuntu as my distro, and also had to disable a startup script that switches to "ondemand", or similar, after 1 minute. As a side note (separate subject, but is one reason I tried it): My i5-9600K based computer seems to hit a power limit during boot approximately 3 seconds after kernel selection on grub. This had no effect on that issue (even when selecting powersave governor). ... Doug
RE: [PATCH 1/2] cpufreq: intel_pstate: Allow enable/disable energy efficiency
On 2020.06.22 22:13 Srinivas Pandruvada wrote: > By default intel_pstate driver disables energy efficiency by setting > MSR_IA32_POWER_CTL bit 19 for Kaby Lake desktop CPU model in HWP mode. > This CPU model is also shared by Coffee Lake desktop CPUs. This allows > these systems to reach maximum possible frequency. But this adds power > penalty, which some customers don't want. They want some way to enable/ > disable dynamically. > > So, add an additional attribute "energy_efficiency_enable" under > /sys/devices/system/cpu/intel_pstate/ for these CPU models. This allows > to read and write bit 19 ("Disable Energy Efficiency Optimization") in > the MSR IA32_POWER_CTL. > > This attribute is present in both HWP and non-HWP mode as this has an > effect in both modes. Refer to Intel Software Developer's manual for > details. The scope of this bit is package wide. I do and always have. However these manuals are 1000's of pages, are updated often, and it can be difficult to find the correct page for the correct processor. So it is great that, in general, the same mnemonics are used in the code as the manuals. > > Suggested-by: Len Brown > Signed-off-by: Srinivas Pandruvada > --- > Documentation/admin-guide/pm/intel_pstate.rst | 7 +++ > drivers/cpufreq/intel_pstate.c| 49 ++- > 2 files changed, 54 insertions(+), 2 deletions(-) > > diff --git a/Documentation/admin-guide/pm/intel_pstate.rst > b/Documentation/admin- > guide/pm/intel_pstate.rst > index 39d80bc29ccd..939bfdc53f4f 100644 > --- a/Documentation/admin-guide/pm/intel_pstate.rst > +++ b/Documentation/admin-guide/pm/intel_pstate.rst > @@ -431,6 +431,13 @@ argument is passed to the kernel in the command line. > supported in the current configuration, writes to this attribute will > fail with an appropriate error. > > +``energy_efficiency_enable`` > + This attribute is only present on platforms, which has CPUs matching > + Kaby Lake desktop CPU model. 
By default "energy_efficiency" is disabled So, why not mention Coffee Lake also, as you did above? > + on these CPU models in HWP mode by this driver. Enabling energy > + efficiency may limit maximum operating frequency in both HWP and non > + HWP mode. > + > Interpretation of Policy Attributes > --- > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 8e23a698ce04..1cf6d06f2314 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -1218,6 +1218,44 @@ static ssize_t store_hwp_dynamic_boost(struct kobject > *a, > return count; > } > > +#define MSR_IA32_POWER_CTL_BIT_EE 19 (same comment as the other day, for another patch) In my opinion and for consistency, this bit should be defined in arch/x86/include/asm/msr-index.h like so (I defined the other bits also): diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index e8370e64a155..1a6084067f18 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -255,6 +255,12 @@ #define MSR_IA32_POWER_CTL 0x01fc +/* POWER_CTL bits (some are model specific): */ + +#define POWER_CTL_C1E 1 +#define POWER_CTL_EEO 19 +#define POWER_CTL_RHO 20 + #define MSR_IA32_MC0_CTL 0x0400 #define MSR_IA32_MC0_STATUS 0x0401 #define MSR_IA32_MC0_ADDR 0x0402 There is another almost identical file at: tools/arch/x86/include/asm/msr-index.h and I am not sure which one is the master, but the former contains stuff that the latter does not. I have defined the bit names in a consistent way as observed elsewhere in the file. i.e. drop the "MSR_IA32_" prefix and add the bit. Now, tying this back to my earlier comment about the manuals: The file "tools/arch/x86/include/asm/msr-index.h" is an excellent gateway reference for those 1000's of pages of software reference manuals. 
As a user that uses them all the time, but typically only digs deep into those manuals once every year or two, I find tremendous value added via the msr-index.h file. > + > +static ssize_t show_energy_efficiency_enable(struct kobject *kobj, > + struct kobj_attribute *attr, > + char *buf) > +{ > + u64 power_ctl; > + int enable; > + > + rdmsrl(MSR_IA32_POWER_CTL, power_ctl); > + enable = (power_ctl & BIT(MSR_IA32_POWER_CTL_BIT_EE)) >> > MSR_IA32_POWER_CTL_BIT_EE; > + return sprintf(buf, "%d\n", !enable); > +} > + > +static ssize_t store_energy_efficiency_enable(struct kobject *a, > + struct kobj_attribute *b, > + const char *buf, size_t count) > +{ > + u64 power_ctl; > + u32 input; > + int ret; > + > + ret = kstrtouint(buf, 10, ); > + if (ret) > +
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.21 10:16 Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it translate the target frequency supplied by the cpufreq > governor in use into an EPP value to be written to the HWP request > MSR (high frequencies are mapped to low EPP values that mean more > performance-oriented HWP operation) as a hint for the HWP algorithm > in the processor, so as to prevent it and the CPU scheduler from > working against each other at least when the schedutil governor is > in use. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a prototype not intended for production use (based on linux-next). > > Please test it if you can (on HWP systems, of course) and let me know the > results. > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > may turn out to be either too high or too low for the general use, which is > one > reason why getting as much testing coverage as possible is key here. > > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > please do so and let me know the conclusions. > > Cheers, > Rafael To anyone trying this patch: You will need to monitor EPP (Energy Performance Preference) carefully. It changes as a function of passive/active, and if you booted active or passive or no-hwp and changed later. Originally, I was not specifically monitoring EPP, or paths taken since boot towards driver, intel_pstate or intel_cpufreq, and governor, and will now have to set aside test results. @Rafael: I am still having problems with my test computer and HWP. However, I can observe the energy saving potential of this "passive-yet-active HWP mode". At this point, I am actually trying to make my newer test computer simply behave and do what it is told with respect to CPU frequency scaling, because even acpi-cpufreq misbehaves for performance governor under some conditions [1]. 
[1] https://marc.info/?l=linux-pm=159155067328641=2 To my way of thinking: 1.) it is imperative that we be able to decouple the governor servo from the processor servo. At a minimum this is needed for system testing, debugging and reference baselines. At a maximum users could, perhaps, decide for themselves. Myself, I would prefer "passive" to mean "do what you have been told", and that is now what I am testing. 2.) I have always thought, indeed relied on, performance mode as being more than a hint. For my older i7-2600K it never disobeyed orders, except for the most minuscule of workloads. This newer i5-9600K seems to have a mind of its own which I would like to be able to turn off, yet still be able to use intel_pstate trace with schedutil. Recall last week I said > moving forward the typical CPU frequency scaling > configuration for my test system will be: > > driver: intel-cpufreq, forced at boot. > governor: schedutil > hwp: forced off at boot. The problem is that baseline references are still needed and performance mode is unreliable. Maybe other stuff also, I simply don't know at this point. Example of EPP changing (no need to read on) (from fresh boot): Current EPP: root@s18:/home/doug# rdmsr --bitfield 31:24 -u -a 0x774 128 128 128 128 128 128 root@s18:/home/doug# grep . 
/sys/devices/system/cpu/cpu3/cpufreq/* /sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/base_frequency:370 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:460 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:80 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:0 /sys/devices/system/cpu/cpu3/cpufreq/energy_performance_available_preferences:default performance balance_performance balance_power power /sys/devices/system/cpu/cpu3/cpufreq/energy_performance_preference:balance_performance /sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:performance powersave /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800102 /sys/devices/system/cpu/cpu3/cpufreq/scaling_driver:intel_pstate /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:powersave /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq:460 /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq:80 /sys/devices/system/cpu/cpu3/cpufreq/scaling_setspeed: Now, switch to passive mode: echo passive > /sys/devices/system/cpu/intel_pstate/status And observe EPP: root@s18:/home/doug# rdmsr --bitfield 31:24 -u -a 0x774 255 255 255 255 255 255 root@s18:/home/doug# grep . /sys/devices/system/cpu/cpu3/cpufreq/* /sys/devices/system/cpu/cpu3/cpufreq/affected_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq:460 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_min_freq:80 /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_transition_latency:2 /sys/devices/system/cpu/cpu3/cpufreq/related_cpus:3 /sys/devices/system/cpu/cpu3/cpufreq/scaling_available_governors:conservative ondemand userspace
RE: [PATCH] cpufreq: intel_pstate: Add additional OOB enabling bit
On 2020.06.11 10:49 Srinivas Pandruvada wrote: > Add additional bit for OOB (Out of band) enabling of P-states. In this > case intel_pstate shouldn't load. Currently, only "BIT(8) == 1" of the > MSR MSR_MISC_PWR_MGMT is considered as OOB. Also add "BIT(18) == 1" as > OOB condition. Shouldn't those bits be defined in these files: arch/x86/include/asm/msr-index.h and tools/arch/x86/include/asm/msr-index.h ? By the way, I couldn't find those bits defined in Intel docs that I have. > > Signed-off-by: Srinivas Pandruvada > --- > drivers/cpufreq/intel_pstate.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index 8e23a698ce04..f21761443c90 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -2686,8 +2686,8 @@ static bool __init > intel_pstate_platform_pwr_mgmt_exists(void) > id = x86_match_cpu(intel_pstate_cpu_oob_ids); > if (id) { > rdmsrl(MSR_MISC_PWR_MGMT, misc_pwr); > - if (misc_pwr & (1 << 8)) { > - pr_debug("Bit 8 in the MISC_PWR_MGMT MSR set\n"); > + if ((misc_pwr & BIT(8)) || (misc_pwr & BIT(18))) { And then those bit definitions used above. > + pr_debug("Bit 8 or 18 in the MISC_PWR_MGMT MSR set\n"); And then some insight also printed with the debug message. At least say "Out of Band". > return true; > } > } > -- > 2.24.1
RE: schedutil issue with serial workloads
On 2020.06.05 Rafael J. Wysocki wrote: > On 6/4/2020 11:29 PM, Alexander Monakov wrote: > > Hello, > > Hi, > > Let's make more people see your report. > > +Peter, Giovanni, Quentin, Juri, Valentin, Vincent, Doug, and linux-pm. > >> this is a question/bugreport about behavior of schedutil on serial workloads >> such as rsync, or './configure', or 'make install'. These workloads are >> such that there's no single task that takes a substantial portion of CPU >> time, but at any moment there's at least one runnable task, and overall >> the workload is compute-bound. To run the workload efficiently, cpufreq >> governor should select a high frequency. >> >> Assume the system is idle except for the workload in question. >> >> Sadly, schedutil will select the lowest frequency, unless the workload is >> confined to one core with taskset (in which case it will select the >> highest frequency, correctly though somewhat paradoxically). > > That's because the CPU utilization generated by the workload on all CPUs > is small. > > Confining it to one CPU causes the utilization of this one to grow and > so schedutil selects a higher frequency for it. > >> This sounds like it should be a known problem, but I couldn't find any >> mention of it in the documentation. Yes, this issue is very well known, and has been discussed on this list several times, going back many years (and I likely missed some of the discussions). In recent years Giovanni's git "make test" has been the "go-to" example for this. From that test, which has run-to-run variability due to disk I/O, I made a test that varies PIDs per second versus time. Giovanni's recent work on frequency invariance made a huge difference for the schedutil response to this type of serialized workflow. For my part of it: I only ever focused on a new-PID-per-work-packet serialized workflow; since my last testing on this subject in January, I fell behind with system issues and infrastructure updates. 
Your workflow example is fascinating and rather revealing. I will make use of it moving forward. Thank you. Yes, schedutil basically responds poorly as it did for PIDs/second based workflow before frequency invariance, but...(digression follows)... Typically, I merely set the performance governor whenever I know I will be doing serialized workflow, or whenever I just want the job done the fastest (i.e. kernel compile). If I use performance mode (hwp disabled, either active or passive, doesn't matter), then I cannot get the CPU frequency to max, even if I set: $ grep . /sys/devices/system/cpu/intel_pstate/m??_perf_pct /sys/devices/system/cpu/intel_pstate/max_perf_pct:100 /sys/devices/system/cpu/intel_pstate/min_perf_pct:100 I have to increase EPB all the way to 1 to get to max CPU frequency. There also is extreme hysteresis, as I have to go back to 9 for the frequency to drop again. The above was an i5-9600K. My much older i7-2600K works fine with the default EPB of 6. I had not previously realized there was so much difference between processors and EPB. I don't have time to dig deeper right now, but will in future. >> I was able to replicate the effect with a pair of 'ping-pong' programs >> that get a token, burn some cycles to simulate work, and pass the token. >> Thus, each program has 50% CPU utilization. To repeat my test: >> >> gcc -O2 pingpong.c -o pingpong >> mkfifo ping >> mkfifo pong >> taskset -c 0 ./pingpong 100 < ping > pong & >> taskset -c 1 ./pingpong 100 < pong > ping & >> echo > ping >> >> #include <stdio.h> >> #include <unistd.h> >> int main(int argc, char *argv[]) >> { >> unsigned i, n; >> sscanf(argv[1], "%u", &n); >> for (;;) { >> char c; >> read(0, &c, 1); >> for (i = n; i; i--) >> asm("" :: "r"(i)); >> write(1, &c, 1); >> } >> } >> >> Alexander It was not obvious to me what the approximate work/sleep frequency would be for your workflow. For my version of it I made the loop time slower on purpose, since I could merely adjust "N" to compensate. 
I measured 100 hertz work/sleep frequency per CPU, but my pipeline is 6 instead of 2. Just for the record, this is what I did: doug@s18:~/c$ cat pingpong.c #include <stdio.h> #include <unistd.h> int main(int argc, char *argv[]) { unsigned i, n, k; sscanf(argv[1], "%u", &n); while(1) { char c; read(0, &c, 1); for (i = n; i; i--){ k = i; k = k++; } write(1, &c, 1); } } Compiled with: cc pingpong.c -o pingpong and run with (on purpose, I did not force CPU affinity, as I wanted schedutil to decide (when it was the governor, at least)): #! /bin/dash # # ping-pong-test Smythies 2019.06.06 # serialized workflow, but same PID. # from Alexander, but modified. # # because I always forget from last time killall pingpong rm --force pong1 rm --force pong2 rm --force pong3 rm --force pong4 rm
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
where HWP appears to behave, the no-hwp power > use is much better. > > O.K., so now, do a couple of more turbostat samples: > > intel_pstate/powersave hwp idle state 2 disabled: > > Overruns: 3 > Ave. work percent: 66.647895 > Processor package power: ~16.8 watts. > Average CPU frequency: 4.6 gigahertz > > intel_pstate/powersave hwp idle state 3 disabled: > > Overruns: 22 > Ave. work percent: 66.647895 > Processor package power: ~16.2 watts. > Average CPU frequency: 4.6 gigahertz > > To prevent all the bots that burden my site, the link is coded: > [1] double u double u double u dot smythies dot com > /~doug/linux/s18/hwp/index.html > > ... Doug > > > -Original Message- > > From: Rafael J. Wysocki [mailto:r...@rjwysocki.net] > > Sent: May 26, 2020 11:21 AM > > To: Linux PM; Doug Smythies > > Cc: LKML; Len Brown; Srinivas Pandruvada; Peter Zijlstra; Giovanni > > Gherdovich; Francisco Jerez > > Subject: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with > > HWP enabled > > > > From: Rafael J. Wysocki > > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit to 75% of the P-state > > value corresponding to the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other at least when the schedutil governor > > is in use. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > > > This is a replacement for https://patchwork.kernel.org/patch/11563615/ that > > uses the HWP floor (minimum performance limit) as the feedback to the HWP > > algorithm (instead of the EPP). > > > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP is still 5000 and the previous > > comments > > still apply to it. > > > > In addition to that, the 75% fraction used in intel_cpufreq_adjust_hwp() > > can be > > adjusted too, but I would like to use a value with a power-of-2 denominator > > for > > that (so the next candidate would be 7/8). 
> > > > Everyone who can do that is kindly requested to test this and let me know > > the outcome.
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
Hi Rafael, As you well know, I often test first and ask questions and review code later. I think I should have questioned this first. To the best of my ability/availability, I am committed to follow up on the hwp issue raised on the other branch of this thread. However, moving forward the typical CPU frequency scaling configuration for my test system will be: driver: intel-cpufreq, forced at boot. governor: schedutil hwp: forced off at boot. On 2020.05.26 11:21 Rafael J. Wysocki wrote: > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit to 75% of the P-state > value corresponding to the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other at least when the schedutil governor > is in use. I think we need to define what "passive" mode is. I have always interpreted it to mean "I would like this pstate please. It has been requested by some higher level servo". The name intel_cpufreq makes sense. I have always interpreted "active" to mean "I would like the intel_pstate CPU frequency driver to decide what pstate I need". As mentioned on the other branch of this thread, I don't have a stable test baseline, but the servos are still fighting each other with this version of the patch. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a replacement for https://patchwork.kernel.org/patch/11563615/ that > uses the HWP floor (minimum performance limit) as the feedback to the HWP > algorithm (instead of the EPP). > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP is still 5000 and the previous comments > still apply to it. > > In addition to that, the 75% fraction used in intel_cpufreq_adjust_hwp() can > be > adjusted too, but I would like to use a value with a power-of-2 denominator > for > that (so the next candidate would be 7/8). 
The issue here is that the lag of the CPU frequency is not a constant, but rather a function of the task work/sleep timing versus whatever else is going on. One has to allow for the worst case. From thousands of seconds of intel_pstate trace data, that limit needs to be about 3% (31/32).

Disclaimer: Done with no-hwp, active/powersave. The results might not be transferable to hwp enabled.

>
> Everyone who can do that is kindly requested to test this and let me know
> the outcome.
>
> Of course, the documentation still needs to be updated. Also, the EPP can be
> handled in analogy with the active mode now, but that part can be added in a
> separate patch on top of this one.
>
> Thanks!
>
> ---
>  drivers/cpufreq/intel_pstate.c |  119 ++---
>  1 file changed, 88 insertions(+), 31 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY		500
>
>  #ifdef CONFIG_ACPI
> @@ -2175,7 +2176,10 @@ static int intel_pstate_verify_policy(st
>
>  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>  {
> -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
> +	if (hwp_active)
> +		intel_pstate_hwp_force_min_perf(policy->cpu);
> +	else
> +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>  }
>
>  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
> @@ -2183,12 +2187,10 @@ static void intel_pstate_stop_cpu(struct
>  	pr_debug("CPU %d exiting\n", policy->cpu);
>
>  	intel_pstate_clear_update_util_hook(policy->cpu);
> -	if (hwp_active) {
> +	if (hwp_active)
>  		intel_pstate_hwp_save_state(policy);
> -		intel_pstate_hwp_force_min_perf(policy->cpu);
> -	} else {
> -		intel_cpufreq_stop_cpu(policy);
> -	}
> +
> +	intel_cpufreq_stop_cpu(policy);
>  }
>
>  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
> @@ -2318,13 +2320,58 @@ static void intel_cpufreq_trace(struct c
>  		fp_toint(cpu->iowait_boost * 100));
>  }
>
> +static void intel_cpufreq_update_hwp_request(struct cpudata *cpu, u32 min_perf)
> +{
> +	u64 value, prev;
> +
> +	rdmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, &prev);
> +	value = prev;
> +
> +	value &= ~HWP_MIN_PERF(~0L);
> +	value |= HWP_MIN_PERF(min_perf);
> +
> +	/*
> +	 * The entire MSR needs to be updated in order to update the HWP min
> +	 * field in it, so opportunistically update the max too if needed.
> +	 */
> +	value &= ~HWP_MAX_PERF(~0L);
> +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
> +
> +	if (value != prev)
> +
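The field manipulation in the hunk above (mask out a byte-wide field, then OR in the new value, and only write the MSR back if it changed) can be mimicked in shell arithmetic; the register and perf values below are invented, and on real hardware the result would go back via wrmsr:

```shell
# MSR_HWP_REQUEST layout (Intel SDM): bits 7:0 = minimum performance,
# bits 15:8 = maximum performance.
prev=$(( 0x80002E08 ))   # made-up current register value
min_perf=$(( 0x1F ))
max_perf=$(( 0x2E ))

value=$prev
value=$(( (value & ~0xFF)   | min_perf ))         # replace HWP_MIN_PERF field
value=$(( (value & ~0xFF00) | (max_perf << 8) ))  # replace HWP_MAX_PERF field

if [ "$value" -ne "$prev" ]; then
    printf 'write back needed: 0x%X\n' "$value"
fi
```

With the made-up values this prints `write back needed: 0x80002E1F`; if min and max already matched, nothing would be written.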
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
On 2020.05.31 12:29 Srinivas Pandruvada wrote: > On Sun, 2020-05-31 at 11:59 -0700, Srinivas Pandruvada wrote: >> On Sun, 2020-05-31 at 11:06 -0700, Doug Smythies wrote: >> > Hi Srinivas, >> > >> > Thanks you for your quick reply. >> > >> > On 2020.05.31 09:54 Srinivas Pandruvada wrote >> > > On Sun, 2020-05-31 at 09:39 -0700, Doug Smythies wrote: >> > > > Event begins at 17.456 seconds elapsed time. >> > > > Previous event was about 107 milliseconds ago. >> > > > >> > > > Old min ; new min ; freq GHz; load % ; duration mS >> > > > 27 ; 28 ; 4.60; 68.17 ; 10.226 >> > > > 28 ; 26 ; 4.53; 57.47 ; 10.005 >> > > >> > > Seems you hit power/thermal limit >> > >> > No. >> > >> > I am nowhere near any power limit at all. >> > I have meticulously configured and tested the thermal management of >> > this computer. >> > I never ever hit a thermal limit and have TDP set such that the >> > processor >> > temperature never exceeds about 75 degrees centigrade. >> > >> > There should never be throttling involved in these experiments. >> > I can achieve throttling when compiling the kernel and with >> > torture test mode on the mprime test (other CPU stressors, >> > including my own, are not as good at generating heat as >> > mprime). >> > >> > This system can run indefinitely at 99.9 watts processor package >> > power. >> > Example (turbostat, steady state, CPU freq throttled to 4.04 GHz): >> > >> > doug@s18:~$ sudo ~/turbostat --Summary --quiet --show >> > Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 12 >> > Busy% Bzy_MHz IRQ PkgTmp PkgWatt GFXWatt >> > 100.21 404572231 66 99.93 0.00 >> > 100.21 404372239 65 99.92 0.00 >> > >> > > Is this some Lenovo system? >> > >> > No. The web page version of my original e-mail has >> > a link to the test computer hardware profile. >> > >> > The motherboard is ASUS PRIME Z390-P. >> > >> >> OK, this seems a desktop system. >> >> > > If you disable HWP you don't see that? >> > >> > Correct. 
>> > > What is the value of
>> > > cat /sys/bus/pci/devices/\:00\:04.0/tcc_offset_degree_celsius
>> >
>> > ? "No such file or directory"
>> >
>> > > cat /sys/class/powercap/intel-rapl-mmio/intel-rapl-
>> > > mmio:0/constraint_0_power_limit_uw
>>
>> You may not have
>> CONFIG_INT340X_THERMAL=y

I have: CONFIG_INT340X_THERMAL=m

>> What is
>> #rdmsr 0x1a2

From the turbostat startup spew of stuff (also in a link on the web page version of the e-mail):

MSR_IA32_TEMPERATURE_TARGET: 0x0064100d (100 C)

Or manually now:

root@s18:/home/doug/prime95# rdmsr 0x1a2
64100d

>> Try changing energy_perf_bias and see if it helps here.

Yes, that is a test I meant to do, and should have done. No, it doesn't help.

> Also if MSR 0x1FC bit 19 is 0, change to 1.

Ya, I have always found documentation on 0x1FC somewhat lacking. Anyway, the bit is already 1. Is the bit EE_TURBO_DISABLE? Anyway, I tried that bit as 0, and it didn't help.

> Thanks,
> Srinivas

... Doug
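The rdmsr reading above decodes to the same 100 C that turbostat reports; bits 23:16 of MSR_IA32_TEMPERATURE_TARGET (0x1a2) hold the TCC activation temperature:

```shell
# Value read above with: rdmsr 0x1a2
msr=$(( 0x64100d ))

# Bits 23:16 = TCC activation temperature in degrees C.
tjmax=$(( (msr >> 16) & 0xFF ))
echo "TCC activation target: ${tjmax} C"
```

This prints `TCC activation target: 100 C` (0x64 = 100).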
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
Hi Srinivas,

Thank you for your quick reply.

On 2020.05.31 09:54 Srinivas Pandruvada wrote
> On Sun, 2020-05-31 at 09:39 -0700, Doug Smythies wrote:
>> Event begins at 17.456 seconds elapsed time.
>> Previous event was about 107 milliseconds ago.
>>
>> Old min ; new min ; freq GHz; load % ; duration mS
>>  27 ; 28 ; 4.60 ; 68.17 ; 10.226
>>  28 ; 26 ; 4.53 ; 57.47 ; 10.005
>
> Seems you hit power/thermal limit

No.

I am nowhere near any power limit at all. I have meticulously configured and tested the thermal management of this computer. I never ever hit a thermal limit and have TDP set such that the processor temperature never exceeds about 75 degrees centigrade.

There should never be throttling involved in these experiments. I can achieve throttling when compiling the kernel and with torture test mode on the mprime test (other CPU stressors, including my own, are not as good at generating heat as mprime).

This system can run indefinitely at 99.9 watts processor package power. Example (turbostat, steady state, CPU freq throttled to 4.04 GHz):

doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 12
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
100.21  4045    72231   66      99.93   0.00
100.21  4043    72239   65      99.92   0.00

> Is this some Lenovo system?

No. The web page version of my original e-mail has a link to the test computer hardware profile.

The motherboard is ASUS PRIME Z390-P.

> If you disable HWP you don't see that?

Correct.

> What is the value of
> cat /sys/bus/pci/devices/\:00\:04.0/tcc_offset_degree_celsius

? "No such file or directory"

> cat /sys/class/powercap/intel-rapl-mmio/intel-rapl-
> mmio:0/constraint_0_power_limit_uw

? "No such file or directory"

> You may want to run
> Try running dptfxtract once.

No, I am not going to. I am not running thermald. Eventually I will, as a backup in case of cooling failure, so as not to hit the processor limit shutdown. I just haven't done it yet.
> > Then try to get again > > cat /sys/bus/pci/devices/\:00\:04.0/tcc_offset_degree_celsius > cat /sys/class/powercap/intel-rapl-mmio/intel-rapl- > mmio:0/constraint_0_power_limit_uw > > > Thanks, > Srinivas
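Had the constraint_0_power_limit_uw file existed, its reading would be in microwatts; a quick conversion for comparison against turbostat's PkgWatt column (the reading below is made up):

```shell
# Hypothetical constraint_0_power_limit_uw reading, in microwatts.
uw=99000000

# Convert to watts with two decimal places for comparison with PkgWatt.
printf '%d.%02d W\n' $(( uw / 1000000 )) $(( (uw % 1000000) / 10000 ))
```

With the made-up value this prints `99.00 W`, close to the 99.9 W steady-state package power reported above.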
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
Correction:

On 2020.05.31 09:39 Doug Smythies wrote:
> The overruns and use of idle state 0 are exactly correlated.

Should have been "idle state 2":

The overruns and use of idle state 2 are exactly correlated.
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP enabled
e CPU loaded with new BIOS, and earlier tests NOT re-done). Is the processor using the latest microcode? Currently 0xcc. Can not figure out if there is anything newer. Leaving out the details, but all the tests and results are available, a mess but available, the summary is: With HWP enabled, if idle state 2 is used, there is a probability that the CPU frequency might unexpectedly drop significantly. If the processor does this by itself, or by being told to via sources outside of the intel_pstate CPU frequency driver, I don't know. The load sweep test was run at 6 seconds per step during increasing load and 3 seconds per step decreasing (by mistake, if you must know), while monitoring the idle statistics. The test was done in a hurry, so many above/below statistics are 0%, due to insufficient sample size. The overruns and use of idle state 0 are exactly correlated. There are a lot of graphs on the idle statistics web page, but the idle state 2 usage correlates exactly with undesired low CPU frequency and overruns. Side note: Even in the areas where HWP appears to behave, the no-hwp power use is much better. O.K., so now, do a couple of more turbostat samples: intel_pstate/powersave hwp idle state 2 disabled: Overruns: 3 Ave. work percent: 66.647895 Processor package power: ~16.8 watts. Average CPU frequency: 4.6 gigahertz intel_pstate/powersave hwp idle state 3 disabled: Overruns: 22 Ave. work percent: 66.647895 Processor package power: ~16.2 watts. Average CPU frequency: 4.6 gigahertz To prevent all the bots that burden my site, the link is coded: [1] double u double u double u dot smythies dot com /~doug/linux/s18/hwp/index.html ... Doug > -Original Message- > From: Rafael J. 
Wysocki [mailto:r...@rjwysocki.net] > Sent: May 26, 2020 11:21 AM > To: Linux PM; Doug Smythies > Cc: LKML; Len Brown; Srinivas Pandruvada; Peter Zijlstra; Giovanni > Gherdovich; Francisco Jerez > Subject: [RFC/RFT][PATCH] cpufreq: intel_pstate: Accept passive mode with HWP > enabled > > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit to 75% of the P-state > value corresponding to the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other at least when the schedutil governor > is in use. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a replacement for https://patchwork.kernel.org/patch/11563615/ that > uses the HWP floor (minimum performance limit) as the feedback to the HWP > algorithm (instead of the EPP). > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP is still 5000 and the previous comments > still apply to it. > > In addition to that, the 75% fraction used in intel_cpufreq_adjust_hwp() can > be > adjusted too, but I would like to use a value with a power-of-2 denominator > for > that (so the next candidate would be 7/8). > > Everyone who can do that is kindly requested to test this and let me know > the outcome.
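Disabling an individual idle state, as in the idle state 2 / idle state 3 comparisons earlier in this message, is done per CPU through sysfs; a sketch (root required, and harmless where the state nodes are absent or not writable):

```shell
# Request that idle state 2 be disabled on every CPU.
for f in /sys/devices/system/cpu/cpu*/cpuidle/state2/disable; do
    [ -w "$f" ] && echo 1 > "$f"
done
echo "state2 disable requested"
```

Writing 0 back to the same files re-enables the state.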
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.26 01:19 Rafael J. Wysocki wrote:
> On Mon, May 25, 2020 at 10:57 PM Francisco Jerez wrote:
> > "Rafael J. Wysocki" writes:
> > > On Mon, May 25, 2020 at 3:39 AM Francisco Jerez
> >
> > Why not HWP_MIN_PERF? That would leave the HWP quite some room for
> > maneuvering (the whole HWP_MIN_PERF-HWP_MAX_PERF P-state range, it's not
> > like P-state selection would be entirely driven by the kernel), while
> > avoiding every failure scenario I was describing in my previous reply.

I have re-done my tests. The problem that I observe seems specific to hwp itself and not this patch and its use in passive mode. I see the exact same thing with intel_pstate/powersave. [1] detail A.

Test: still simple single threaded load sweep, at 347 hertz work/sleep frequency.

What do I see? Unexpected frequency drops at around 70% load. Example, from trace. First, the thing has been going for awhile at 4.8 GHz:

Old epp ; new epp ; freq GHz ; load % ; duration mS
 80 ; 82 ; 4.57   ; 61.94  ; 20.001
 82 ; 80 ; 4.57   ; 62.47  ; 40.003
 80 ; 44 ; 3.73   ; 68.63  ; 62.009  What? Why freq down? Why long duration?
 44 ;  0 ; 1.96   ; 100.23 ; 19.995  Even lower freq. load overruns.
  0 ; 73 ; 4.56   ; 82.93  ; 40.07   O.K. recovered, but damage done.
 73 ; 46 ; 2.36   ; 79.19  ; 20.94   <<< now things oscillate a little.
 46 ;  0 ; 1.9884 ; 100.24 ; 20.99
  0 ; 75 ; 4.5624 ; 82.1   ; 41.002  <<< Event ends.

Next event in 487 milliseconds.

Observation: Events are often, but not always, preceded by a longer than normal duration. However, long durations are also not out of the ordinary in passive mode. And yes, the above trace was with DELAY_HWP 20,000, but I do have trace examples with it at 5,000. This was just a particularly good example.

Observation (from looking at a lot of trace data): There are phase delays between the two systems, intel_cpufreq and hwp, and sometimes they seem to oscillate a little and fight each other. There may be some problematic work/sleep frequencies where the oscillation builds into a full blown resonance.

Why does hwp drop the frequency? This system is otherwise fairly idle, so maybe because the pll drops down during the non work periods. Maybe HWP thinks the system is idle and drops the frequency. I can eliminate the overruns by disabling deep idle states such that the PLL vote is never relinquished, but it's not a fair test.

Note that the above response can be "tuned". If we take the conversion algorithm from target frequency to EPP and introduce an offset, the above can be improved. At what cost? More sluggishness, for a large positive offset. So, the overruns just move from the steady state side of the task to when the task starts. I did not find if there is a "sweet spot" between offset and system response, and I do not think there is value added in trying.

Note: With original settings, I rarely observe a problem with the step function response to a new task.

> Actually, I have been thinking about the HWP min as an alternative
> that may be worth evaluating.
>
> However, I would rather set the HWP min to something like 80% of the
> cpufreq request.

Yes, this is a good idea and should not suffer from the two servo systems fighting each other. I got 0 overruns, versus 2240 overruns with no min limitation (100 second test).

As for INTEL_CPUFREQ_TRANSITION_DELAY_HWP, I'll probably use 10 milliseconds moving forward, because that is what I am most familiar with from years ago work on this driver. But, I did not observe any issue with 5 milliseconds.

[1] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-but-active-powersave.png

Other replaced graphs:
http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-ondemand.png
http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-schedutil.png

... Doug
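The "80% of the cpufreq request" floor discussed above, sketched with a made-up P-state value:

```shell
# Hypothetical requested P-state from the cpufreq governor.
request=40

# 80% floor (4/5), integer math as the driver would use.
hwp_min=$(( request * 4 / 5 ))
echo "request ${request} -> HWP floor ${hwp_min}"
```

This prints `request 40 -> HWP floor 32`: the HWP algorithm can still roam between the floor and HWP max, but cannot drop far below the governor's request.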
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
Hi all,

The INTEL_CPUFREQ_TRANSITION_DELAY_HWP = 20000 test results from this e-mail were incorrect. The test and graphs are being re-done.

On 2020.05.25 08:30 Doug Smythies wrote:
>
> Legend - intel_pstate - powersave graph [2].
>
> What? Why is there such a graph, unrelated to this patch?
> Well, because there is a not yet understood effect.
>
> p_powe_stock : intel_pstate, powersave, stock kernel (5.7-rc6), hwp disabled.
> p_powe_hwp : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 5000.
> p_powe_hwp2 : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 20000.
>
> Conclusion: ??
>
> Note: That I merely made a stupid mistake is a real possibility.

Yes, that was it. However, all DELAY_HWP 20000 tests were bad, not just this one.

... Doug
RE: [RFC 0/1] Alternate history mechanism for the TEO governor
On 2020.05.21 04:09 Pratik Sampat wrote: > On 17/05/20 11:41 pm, Doug Smythies wrote: > > On 2020.05.11 Pratik Rajesh Sampat wrote: > >> First RFC posting:https://lkml.org/lkml/2020/2/22/27 > > Summary: > > > > On that thread I wrote: > > > >> I have done a couple of other tests with this patch set, > >> but nothing to report yet, as the differences have been > >> minor so far. > > > > I tried your tests, or as close as I could find, and still > > do not notice much difference. > > That is quite unfortunate. At least it doesn't seem to regress. Yes, while I have not been able to demonstrate improvement, I have not found any regression. > > Nevertheless, as Rafael suggested aging is crucial, this patch doesn't age > weights. I do have a version with aging but I had a lot of run to run variance > so I had refrained from posting that. > I'm tweaking around the logic for aging as well as distribution of weights, > hopefully that may help. O.K. I am putting this testing aside for now. I like the set of tests, as they really show the differences between menu and teo governors well. > >> > >> Sleeping Ebizzy > >> --- > >> Program to generate workloads resembling web server workloads. > >> The benchmark is customized to allow for a sleep interval -i > > I found a Phoronix ebizzy, but without the customization, > > which I suspect is important to demonstrate your potential > > improvement. > > > > Could you send me yours to try? > > Sure thing, sleeping ebizzy is hosted here: > https://github.com/pratiksampat/sleeping-ebizzy > > > > > ebizzy (records per second, more is better) > > > > teo wtteo menu > > 132344 132228 99.91% 130926 98.93% O.K. yours is way different than what I was using. Anyway, results still are not very different between teo and wtteo. 
Some tests are showing a little difference between above/below statistics [1]

[1] http://www.smythies.com/~doug/linux/idle/wtteo/ebizzy-interval/2_below.png

By the way, and likely not relevant, your sleeping-ebizzy test seems extremely sensitive to the interval and number of threads. It is not clear to me what settings I should use to try to re-create your results. [2] is an interesting graph of records per second versus intervals versus threads.

[2] http://www.smythies.com/~doug/linux/idle/wtteo/doug08/sleeping-ebizzy-records-intervals-threads.png
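A sweep over intervals and thread counts, of the kind behind the records-versus-intervals-versus-threads graph above, could be driven by a small harness like this; the ebizzy option letters here are assumptions, not verified against the sleeping-ebizzy fork:

```shell
# Hypothetical sweep harness; prints the commands it would run rather
# than running them, since the flag names are assumed.
for t in 2 4 8; do            # thread counts
    for i in 100 200 400; do  # sleep intervals
        echo "would run: ebizzy -t ${t} -i ${i} -S 10"
    done
done
```

Nine combinations are printed, one per line.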
RE: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled
On 2020.05.21 10:16 Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it translate the target frequency supplied by the cpufreq > governor in use into an EPP value to be written to the HWP request > MSR (high frequencies are mapped to low EPP values that mean more > performance-oriented HWP operation) as a hint for the HWP algorithm > in the processor, so as to prevent it and the CPU scheduler from > working against each other at least when the schedutil governor is > in use. > > Signed-off-by: Rafael J. Wysocki > --- > > This is a prototype not intended for production use (based on linux-next). > > Please test it if you can (on HWP systems, of course) and let me know the > results. > > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > may turn out to be either too high or too low for the general use, which is > one > reason why getting as much testing coverage as possible is key here. > > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > please do so and let me know the conclusions. Hi all, It is way too early for me to reply to this, as my tests are incomplete. However, I saw the other e-mails and will chime in now. Note before: I have been having difficulties with this test computer, results may be suspect. The default INTEL_CPUFREQ_TRANSITION_DELAY_HWP is definitely too short. I do not know what an optimal value is, but I have made it the same as INTEL_CPUFREQ_TRANSITION_LATENCY for now, and because it makes for great test comparisons. I have only done the simplest of tests so far, single thread, forced CPU affinity load ramp up/down. Load: fixed work packet per work period, i.e. the amount of work to do is independent of CPU frequency. 347 hertz work / sleep frequency. Why? No reason. Because it is what I used last time. To reveal any hysteresis (i.e. 
with conservative governor) load ramps up from none to 100% and then back down to none at 3 seconds per step (step size is uncalibrated). Processor: Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz Modes: intel_cpufreq (i_) intel_pstate (p_) acpi-cpufreq (a_) Kernel: b955f2f4a425 (HEAD -> k57-hwp) cpufreq: intel_pstate: Work in passive mode with HWP enabled faf93d5c9da1 cpufreq: intel_pstate: Use passive mode by default without HWP b9bbe6ed63b2 (tag: v5.7-rc6) Linux 5.7-rc6 Graph notes: The ramp up and ramp down are folded back on the graphs. These tests were launched manually in two terminals, so the timing was not exact. I'll fix this moving forward. Legend - ondemand graph [1] (minor annotation added): i_onde_stock : intel-cpufreq, ondemand, stock kernel (5.7-rc6), hwp disabled. i_onde_hwp : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 5000. i_onde_hwp2 : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 2. a_onde: acpi-cpufreq, stock kernel (5.7-rc6), intel_pstate disabled. Conclusion: DELAY_HWP 2 matches the pre-patch response almost exactly. DELAY_HWP 5000 goes too far (my opinion), but does look promising. Legend - intel_pstate - powersave graph [2]. What? Why is there such a graph, unrelated to this patch? Well, because there is a not yet understood effect. p_powe_stock : intel_pstate, powersave, stock kernel (5.7-rc6), hwp disabled. P_powe_hwp : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 5000. P_powe_hwp2 : intel_pstate, powersave, patched kernel (5.7-rc6), DELAY_HWP 2. Conclusion: ?? Note: That I merely made a stupid mistake is a real possibility. Repeat test pending. Legend - schedutil graph [3]: i_sche_stock : intel-cpufreq, ondemand, stock kernel (5.7-rc6), hwp disabled. i_sche_hwp : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 5000. i_sche_hwp2 : intel-cpufreq, ondemand, patched kernel (5.7-rc6), DELAY_HWP 2. a_sche: acpi-cpufreq, stock kernel (5.7-rc6), intel_pstate disabled. Conclusion: ?? 
I don't know; I have always had schedutil issues on this computer. I expected something more like the typical result from my older test server (i7-2600K) [4]:

[1] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-ondemand.png
[2] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-but-active-powersave.png
[3] http://www.smythies.com/~doug/linux/intel_pstate/passive-hwp/passive-hwp-schedutil.png
[4] http://www.smythies.com/~doug/linux/intel_pstate/cpu-freq-scalers/fixed-work-packet-folded-schedutil.png

>  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000

For now, while trying to prove nothing is busted, suggest 20000.

> +/**
> + * intel_cpufreq_adjust_hwp_request - Adjust the HWP reuqest register.

request.

> +static void intel_cpufreq_adjust_hwp_request(struct cpudata *cpu, s64 max_freq,
> +					     unsigned int target_freq)
> +{
> +	s64 epp_fp = div_fp(255 * (max_freq - target_freq),
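The div_fp() expression quoted above maps the target frequency linearly onto the 0-255 EPP range (0 = most performance-oriented); the same mapping in plain integer arithmetic, with made-up frequencies:

```shell
# Hypothetical maximum frequency, kHz.
max_freq=4800000

for target in 4800000 3600000 1200000; do
    # High target -> small EPP (performance); low target -> large EPP.
    epp=$(( 255 * (max_freq - target) / max_freq ))
    echo "target ${target} kHz -> EPP ${epp}"
done
```

With these values: a full-speed request maps to EPP 0, 3.6 GHz to EPP 63, and 1.2 GHz to EPP 191.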
RE: [RFC 0/1] Alternate history mechanism for the TEO governor
On 2020.05.11 Pratik Rajesh Sampat wrote:
>
> First RFC posting: https://lkml.org/lkml/2020/2/22/27

Summary:

On that thread I wrote:
> I have done a couple of other tests with this patch set,
> but nothing to report yet, as the differences have been
> minor so far.

I tried your tests, or as close as I could find, and still do not notice much difference.

For detail, but likely little added value, read on:

Kernel: 5.7-rc4:
"teo": unmodified kernel.
"wtteo": with this patch added.
"menu": the menu idle governor, for comparison.

CPU frequency scaling driver: intel-cpufreq
CPU frequency scaling governor: schedutil
CPU idle driver: intel_idle

...

> Benchmarks:
> Schbench
>
> Benchmarks scheduler wakeup latencies
>
> 1. latency 99th percentile - usec

I found a Phoronix schbench test. It defaults to 99.9th percentile.

schbench (usec, 99.9th Latency Percentile, less is better)(8 workers)

threads     teo    wtteo             menu
  2       14197    14194  99.98%    14467 101.90%
  4       46733    46490  99.48%    46554  99.62%
  6       57306    58291 101.72%    57754 100.78%
  8       81408    80768  99.21%    81715 100.38%
 16      157286   156570  99.54%   156621  99.58%
 32      314573   310784  98.80%   315802 100.39%

Powers and other idle statistics were similar. [1]

> 2. Power - watts
> Machine - IBM Power 9
>
> Latency and Power - Normalized
> | Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
> |       2 |          100 |           101.3 |         85.29 |
> |       4 |          100 |          105.06 |        113.63 |
> |       8 |          100 |           92.32 |         90.36 |
> |      16 |          100 |            99.1 |         92.43 |
>
> Accuracy
>
> Vanilla TEO Governor - Prediction distribution %
> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
> |       2 | 6.12 | 1.08 | 1.76 | 20.41 | 9.2   | 28.74 | 22.51   |
> |       4 | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67   |
> |       8 | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95   |
> |      16 | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34   |
>
> | Threads | OS 1 | OS 2 | OS 3 |
> |       2 | 1.77 | 1.27 | 7.14 |
> |       4 | 1.8  | 1.31 | 6.71 |
> |       8 | 0.65 | 0.72 | 3.2  |
> |      16 | 0.63 | 1.71 | 3.68 |
>
> Weighted TEO Governor - Prediction distribution %
> | Threads | US 1 | US 2 | US 3 | US 4  | US 5  | US 6  | Correct |
> |       2 | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13   |
> |       4 | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01   |
> |       8 | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9    |
> |      16 | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92   |
>
> | Threads | OS 1 | OS 2 | OS 3 |
> |       2 | 0.39 | 0.42 | 2.31 |
> |       4 | 0.45 | 0.51 | 2.89 |
> |       8 | 0.53 | 0.66 | 3.06 |
> |      16 | 0.97 | 1.9  | 3.05 |
>
> Sleeping Ebizzy
> ---------------
> Program to generate workloads resembling web server workloads.
> The benchmark is customized to allow for a sleep interval -i I found a Phoronix ebizzy, but without the customization, which I suspect is important to demonstrate your potential improvement. Could you
RE: [PATCH 0/4] cpuidle: teo: Fix issues related to disabled idle states
On 2019.10.10 Rafael J. Wysocki wrote: > There are a few issues related to the handling of disabled idle states in the > TEO (Timer-Events-Oriented) cpuidle governor which are addressed by this > series. > > The application of the entire series is exactly equivalent to the testing > patch > at https://lore.kernel.org/lkml/3490479.2dnHFFeJIp@kreacher/ , but IMO it is > cleaner to split the changes into smaller patches which also allows them to > be explained more accurately. Hi, I have re-tested and continued testing using this 4 patch set. Summary: So far, everything is fine. Some, but not all, detail: Reference kernel: 5.4-rc2 "stock" Test kernel: 5.4-rc2 + this 4 patch set "rjw-4" Test 1: Where I simply coded, to automate, my best use case example: This is an idle state 4 disabled test. The max entries column is entries and exits for idle state 0 over the test sample interval (15 seconds in this case). stock kernel: idle-doug01 : begin ... Per CPU PASS/FAIL : Totals: fail rate: max entries: 0 1 2 3 4 5 6 7 : : (percent):: . . F . . . F . : 8 : 25. : 69939 . . F . . . F . : 16 : 25. : 69938 . . F . . . F . : 24 : 25. : 70126 ... . . . . . . F F : 1928 : 30.8610 : 66530 . . . . . . F F : 1936 : 30.8368 : 68451 . . . . . . F F : 1944 : 30.8128 : 68643 . . . . . . F F : 1952 : 30.7889 : 68645 . . . . . . F F : 1960 : 30.7653 : 68152 . . . . . . F F : 1968 : 30.7419 : 67145 . . . . . . F F : 1976 : 30.7186 : 67349 . . . . . . F F : 1984 : 30.6956 : 68481 . . . . . . F F : 1992 : 30.6727 : 67394 . . . . . . F F : 2000 : 30.6500 : 68645 ^C --- SIGINT (^C) detected. Terminate gracefully, saving the sample data... . . . . . . F F : 2008 : 30.6275 : 29010 Summary: Total Tests: 2008 : Total Fails:615 : Fail rate (percent): 30.6275 : Per CPU: CPU00: 0, CPU01: 0, CPU02: 12, CPU03: 0, CPU04:152, CPU05: 43, CPU06:197, CPU07:211, idle-doug01 : end ... rjw-4 kernel: idle-doug01 : begin ... Per CPU PASS/FAIL : Totals: fail rate: max entries: 0 1 2 3 4 5 6 7 : : (percent):: . . . . 
. . . . : 8 : 0. : 3 . . . . . . . . : 16 : 0. : 7 . . . . . . . . : 24 : 0. : 10 . . . . . . . . : 32 : 0. : 10 . . . . . . . . : 40 : 0. : 23 . . . . . . . . : 48 : 0. : 23 . . . . . . . . : 56 : 0. : 1 . . . . . . . . : 64 : 0. : 1 . . . . . . . . : 72 : 0. : 0 . . . . . . . . : 80 : 0. : 0 . . . . . . . . : 88 : 0. : 1 . . . . . . . . : 96 : 0. : 1 . . . . . . . . :104 : 0. : 0 . . . . . . . . :112 : 0. : 0 . . . . . . . . :120 : 0. : 1 . . . . . . . . :128 : 0. : 2 . . . . . . . . :136 : 0. : 2 . . . . . . . . :144 : 0. : 3 . . . . . . . . :152 : 0. : 3 . . . . . . . . :160 : 0. : 4 . . . . . . . . :168 : 0. : 4 . . . . . . . . :176 : 0. : 6 . . . . . . . . :184 : 0. : 3 . . . . . . . . :192 : 0. : 4 ... . . . . . . . . : 11480 : 0. : 5 . . . . . . . . : 11488 : 0. : 4 . . . . . . . . : 11496 : 0. : 6 . . . . . . . . : 11504 : 0. : 8 . . . . . . . . : 11512 : 0. : 5 . . . . . . . . : 11520 : 0. : 4 . . . . . . . . : 11528 : 0. : 4 . . . . . . . . : 11536 : 0. : 3 . . . . . . . . : 11544 : 0. : 2 . . . . . . . . : 11552 : 0. : 5 . . . . . . . . : 11560 : 0. : 3 . . . . . . . . : 11568 : 0. : 3 ^C --- SIGINT (^C) detected. Terminate gracefully, saving the sample data... . . . . . . . . : 11576 : 0. : 1 Summary: Total Tests: 11576 : Total Fails: 0 : Fail rate (percent): 0. : Per CPU: CPU00: 0, CPU01: 0, CPU02: 0, CPU03: 0, CPU04: 0, CPU05: 0, CPU06: 0, CPU07: 0, idle-doug01 : end ... Test 2: Have a look at all idle state enabled/disabled combinations. This test is not very good, and only looks at processor package power. Particularly for idle state 1, see Note 1 below. But it's better than nothing. stock kernel: idle-disable-enable : begin ... Idle State: Per test PASS/FAIL : Power (Watts): 4 3 2 1 0 : : Expected : Max diff : 0 0 0 0 0 : PASS PASS PASS PASS PASS PASS PASS PASS : 3.7 :0.2 0 0 0 0 1 : PASS PASS PASS PASS PASS PASS PASS PASS : 3.7 :0.2 0 0 0 1 0 : PASS PASS PASS PASS PASS PASS PASS PASS : 3.7 :
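Enumerating every enable/disable combination for idle states 0-4, as in the combination test above, is a 32-case bitmask walk; the columns below come out in the same "4 3 2 1 0" order as the test's header:

```shell
# Print each of the 32 enable/disable combinations for idle states 0-4
# (1 = disabled); only the first three rows are shown here.
for mask in $(seq 0 31); do
    for s in 4 3 2 1 0; do
        printf '%d ' $(( (mask >> s) & 1 ))
    done
    printf '\n'
done | head -3
```

The first rows are `0 0 0 0 0`, `0 0 0 0 1`, `0 0 0 1 0`, matching the case listing in the output above.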
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.10.09 06:37 Rafael J. Wysocki wrote: > On Wednesday, October 9, 2019 1:19:51 AM CEST Rafael J. Wysocki wrote: >> On Tuesday, October 8, 2019 12:49:01 PM CEST Rafael J. Wysocki wrote: >>> On Tue, Oct 8, 2019 at 11:51 AM Rafael J. Wysocki wrote: >>>> On Tue, Oct 8, 2019 at 8:20 AM Doug Smythies wrote: >>>>> O.K. Thanks for your quick reply, and insight. >>>>> >>>>> I think long durations always need to be counted, but currently if >>>>> the deepest idle state is disabled, they are not. ... >>>> AFAICS, adding early_hits to count is not a mistake if there are still >>>> enabled states deeper than the current one. >>> >>> And the mistake appears to be that the "hits" and "misses" metrics >>> aren't handled in analogy with the "early_hits" one when the current >>> state is disabled. I only know how to exploit and test the "hits" and "misses" path that should use the deepest available idle state upon transition to an idle system. Even so, the test has a low probability of failing, and so needs to be run many times. I do not know how to demonstrate and/or test any "early_hits" path to confirm that an issue exists or that it is fixed. >>> >>> Let me try to cut a patch to address that. >> >> Appended below, not tested. Reference as: rjw1 >> >> It is meant to address two problems, one of which is that the "hits" and >> "misses" metrics of disabled states need to be taken into account too in >> some cases, and the other is an issue with the handling of "early hits" >> which may lead to suboptimal state selection if some states are disabled. > > Well, it still misses a couple of points. > > First, disable states that are too deep should not be taken into consideration > at all. > > Second, the "hits" and "misses" metrics of disabled states need to be used for > idle duration ranges corresponding to them regardless of whether or not the > "hits" value is greater than the "misses" one. 
>
> Updated patch is below (still not tested), but it tries to do too much in one
> go, so I need to split it into a series of smaller changes.

Thanks for your continued look at this.

Reference as: rjw2

Test 1, hack job statistical test (old tests re-stated):

Kernel        tests   fail rate
5.4-rc1        6616   13.45%
5.3            2376    4.50%
5.3-teov7     12136    0.00%  <<< teo.c reverted and teov7 put in its place.
5.4-rc1-ds    11168    0.00%  <<< [old] ds proposed patch (> 7 hours test time)
5.4-rc1-ds12   4224    0.00%  <<< [old] new ds proposed patch
5.4-rc2-rjw1  11280    0.00%
5.4-rc2-rjw2    640    0.00%  <<< Will be run again, for longer.

Test 2: I also looked at every possible enable/disable idle combination, and they all seemed O.K.

No other tests have been run yet.

System:
Processor: i7-2600K
Deepest idle state: 4 (C6)

... Doug
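The fail-rate column in the tables above is just fails/tests; for example, a hypothetical 890 failures out of the 6616-test 5.4-rc1 run reproduces the reported 13.45% (the fail count is back-computed from the rate, not taken from the logs):

```shell
tests=6616
fails=890   # hypothetical, back-computed from the reported 13.45%
awk -v f="$fails" -v t="$tests" 'BEGIN { printf "%.2f%%\n", f * 100 / t }'
```

This prints `13.45%`.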
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.10.06 08:34 Rafael J. Wysocki wrote: > On Sun, Oct 6, 2019 at 4:46 PM Doug Smythies wrote: >> On 2019.10.01 02:32 Rafael J. Wysocki wrote: >>> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies wrote: >>>> On 2019.09.26 09:32 Doug Smythies wrote: >>>> >>>>> If the deepest idle state is disabled, the system >>>>> can become somewhat unstable, with anywhere between no problem >>>>> at all, to the occasional temporary jump using a lot more >>>>> power for a few seconds, to a permanent jump using a lot more >>>>> power continuously. I have been unable to isolate the exact >>>>> test load conditions under which this will occur. However, >>>>> temporarily disabling and then enabling other idle states >>>>> seems to make for a somewhat repeatable test. It is important >>>>> to note that the issue occurs with only ever disabling the deepest >>>>> idle state, just not reliably. >>>>> >>>>> I want to know how you want to proceed before I do a bunch of >>>>> regression testing. >>>> >> I do not think I stated it clearly before: The problem here is that some CPUs >> seem to get stuck in idle state 0, and when they do power consumption spikes, >> often by several hundred % and often indefinitely. > > That indeed has not been clear to me, thanks for the clarification! > >> I made a hack job automated test: >> Kernel tests fail rate >> 5.4-rc16616 13.45% >> 5.3 23764.50% >> 5.3-teov7 121360.00% <<< teo.c reverted and teov7 put in >> its place. >> 5.4-rc1-ds 111680.00% <<< [old] proposed patch (> 7 hours test >> time) 5.4-rc1-ds12 4224 0.005 <<< new proposed patch >> >> [old] Proposed patch (on top of kernel 5.4-rc1): [deleted] > This change may cause the deepest state to be selected even if its > "hits" metric is less than the "misses" one AFAICS, in which case the > max_early_index state should be selected instead. > > It looks like the max_early_index computation is broken when the > deepest state is disabled. O.K. Thanks for your quick reply, and insight. 
I think long durations always need to be counted, but currently if the deepest
idle state is disabled, they are not. How about this? (test results added
above; more tests pending if this might be a path forward.)

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index b5a0e49..a970d2c 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -155,10 +155,12 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
 
-		if (drv->states[i].target_residency <= sleep_length_us) {
-			idx_timer = i;
-			if (drv->states[i].target_residency <= measured_us)
-				idx_hit = i;
+		if (!(drv->states[i].disabled || dev->states_usage[i].disable)) {
+			if (drv->states[i].target_residency <= sleep_length_us) {
+				idx_timer = i;
+				if (drv->states[i].target_residency <= measured_us)
+					idx_hit = i;
+			}
 		}
 	}
 
@@ -256,39 +258,25 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		struct cpuidle_state *s = &drv->states[i];
 		struct cpuidle_state_usage *su = &dev->states_usage[i];
 
-		if (s->disabled || su->disable) {
-			/*
-			 * If the "early hits" metric of a disabled state is
-			 * greater than the current maximum, it should be taken
-			 * into account, because it would be a mistake to select
-			 * a deeper state with lower "early hits" metric. The
-			 * index cannot be changed to point to it, however, so
-			 * just increase the max count alone and let the index
-			 * still point to a shallower idle state.
-			 */
-			if (max_early_idx >= 0 &&
-			    count < cpu_data->states[i].early_hits)
-				count = cpu_data->states[i].early_hits;
-
-			continue;
-		}
-
 		if (idx < 0)
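The intent of the teo_update() hunk above can be sketched as a toy model in
Python (the function name, the dict layout, and the residency values are
invented for illustration; this is not the kernel code): with the change, a
disabled state can no longer claim the "timer"/"hit" indices, so a long sleep
is attributed to the deepest *enabled* state.

```python
def update_indices(states, sleep_length_us, measured_us):
    """Toy model of the patched loop: find the deepest state whose target
    residency matches the sleep length (idx_timer) and the measured idle
    duration (idx_hit), skipping disabled states entirely."""
    idx_timer = idx_hit = -1
    for i, s in enumerate(states):
        if s["disabled"]:
            continue  # the proposed change: disabled states no longer match
        if s["target_residency"] <= sleep_length_us:
            idx_timer = i
            if s["target_residency"] <= measured_us:
                idx_hit = i
    return idx_timer, idx_hit

# Hypothetical 4-state table with the deepest state disabled:
states = [
    {"target_residency": 0,    "disabled": False},
    {"target_residency": 1,    "disabled": False},
    {"target_residency": 360,  "disabled": False},
    {"target_residency": 3102, "disabled": True},
]
# A long sleep is now attributed to state 2, not the disabled state 3:
print(update_indices(states, sleep_length_us=5000, measured_us=5000))  # (2, 2)
```

Without the `continue`, the same call would return (3, 3), crediting the
disabled state and skewing the "hits"/"misses" bookkeeping.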
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.10.01 02:32 Rafael J. Wysocki wrote:
> On Sun, Sep 29, 2019 at 6:05 PM Doug Smythies wrote:
>> On 2019.09.26 09:32 Doug Smythies wrote:
>>
>>> If the deepest idle state is disabled, the system
>>> can become somewhat unstable, with anywhere between no problem
>>> at all, to the occasional temporary jump using a lot more
>>> power for a few seconds, to a permanent jump using a lot more
>>> power continuously. I have been unable to isolate the exact
>>> test load conditions under which this will occur. However,
>>> temporarily disabling and then enabling other idle states
>>> seems to make for a somewhat repeatable test. It is important
>>> to note that the issue occurs with only ever disabling the deepest
>>> idle state, just not reliably.
>>>
>>> I want to know how you want to proceed before I do a bunch of
>>> regression testing.
>>
>> I did some regression testing anyhow, more to create and debug
>> a methodology than anything else.
>>
>>> On 2018.12.11 03:50 Rafael J. Wysocki wrote:
>>>
>>>> v7 -> v8:
>>>> * Apply the selection rules to the idle deepest state as well as to
>>>>   the shallower ones (the deepest idle state was treated differently
>>>>   before by mistake).
>>>> * Subtract 1/2 of the exit latency from the measured idle duration
>>>>   in teo_update() (instead of subtracting the entire exit latency).
>>>>   This makes the idle state selection be slightly more performance-
>>>>   oriented.
>>>
>>> I have isolated the issue to a subset of the v7 to v8 changes, however
>>> it was not the exit latency changes.
>>>
>>> The partial revert to V7 changes I made were (on top of 5.3):
>>
>> The further testing showed a problem or two with my partial teo-v7 reversion
>> (I call it teo-v12) under slightly different testing conditions.

Correction: There was no problem with my partial reversion kernel (a.k.a.
teo-v12). The problem was confusion over which kernel I was actually running
for whatever test.
>>
>> I also have a 5.3 based kernel with the current teo reverted and the entire
>> teo-v7 put in its place. I have yet to find an idle state disabled related
>> issue with this kernel.
>>
>> I'll come back to this thread at a later date with better details and test
>> results.
>
> Thanks for this work!
>
> Please also note that there is a teo patch in 5.4-rc1 that may make a
> difference in principle.

Yes, actually this saga started from somewhere between kernel 5.3 and 5.4-rc1,
and did include those teo patches, which actually significantly increase the
probability of the issue occurring.

When the deepest idle state is disabled, and the all-states search loop exits
normally, it might incorrectly re-select a previous idle state that was
earlier deemed not worthy of the check. This was introduced between teo
development versions 7 and 8. The fix is to move the code back inside the
loop. (I'll submit a patch in a day or two.)

I do not think I stated it clearly before: The problem here is that some CPUs
seem to get stuck in idle state 0, and when they do power consumption spikes,
often by several hundred % and often indefinitely.

I made a hack job automated test:

Kernel      tests  fail rate
5.4-rc1     6616   13.45%
5.3         2376   4.50%
5.3-teov7   12136  0.00%   <<< teo.c reverted and teov7 put in its place.
5.4-rc1-ds  11168  0.00%   <<< proposed patch (> 7 hours test time)

Proposed patch (on top of kernel 5.4-rc1):

doug@s15:~/temp-k-git/linux$ git diff
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index b5a0e49..0502aa9 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -276,8 +276,22 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		if (idx < 0)
 			idx = i; /* first enabled state */
 
-		if (s->target_residency > duration_us)
+		if (s->target_residency > duration_us) {
+			/*
+			 * If the "hits" metric of the idle state matching the sleep length is
+			 * greater than its "misses" metric, that is the one to use. Otherwise,
+			 * it is more likely that one of the shallower states will match the
+			 * idle duration observed after wakeup, so take the one with the maximum
+			 * "early hits" metric, but if that cannot be determined, just use the
+			 * state se
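The "stuck in idle state 0" symptom described above can be checked for by
sampling the per-state usage counters twice (on a live system these come from
/sys/devices/system/cpu/cpuN/cpuidle/stateM/usage) and seeing whether all new
idle entries landed in state 0. A minimal sketch; the helper name and the
snapshot representation are invented:

```python
def stuck_in_state0(before, after):
    """Given two {state_index: usage_count} snapshots for one CPU, report
    whether essentially every new idle entry landed in state 0."""
    deltas = {s: after[s] - before[s] for s in before}
    total = sum(deltas.values())
    return total > 0 and deltas.get(0, 0) == total

# Hypothetical snapshots taken a few seconds apart; only state 0 advanced:
before = {0: 100, 1: 50, 2: 10, 3: 5}
after  = {0: 9100, 1: 50, 2: 10, 3: 5}
print(stuck_in_state0(before, after))  # True
```

A hack-job automated test like the one in the table above can loop over such
snapshots per CPU and count a failure whenever this predicate holds for more
than a few consecutive samples.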
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2019.09.26 09:32 Doug Smythies wrote:

> If the deepest idle state is disabled, the system
> can become somewhat unstable, with anywhere between no problem
> at all, to the occasional temporary jump using a lot more
> power for a few seconds, to a permanent jump using a lot more
> power continuously. I have been unable to isolate the exact
> test load conditions under which this will occur. However,
> temporarily disabling and then enabling other idle states
> seems to make for a somewhat repeatable test. It is important
> to note that the issue occurs with only ever disabling the deepest
> idle state, just not reliably.
>
> I want to know how you want to proceed before I do a bunch of
> regression testing.

I did some regression testing anyhow, more to create and debug a methodology
than anything else.

> On 2018.12.11 03:50 Rafael J. Wysocki wrote:
>
>> v7 -> v8:
>> * Apply the selection rules to the idle deepest state as well as to
>>   the shallower ones (the deepest idle state was treated differently
>>   before by mistake).
>> * Subtract 1/2 of the exit latency from the measured idle duration
>>   in teo_update() (instead of subtracting the entire exit latency).
>>   This makes the idle state selection be slightly more performance-
>>   oriented.
>
> I have isolated the issue to a subset of the v7 to v8 changes, however
> it was not the exit latency changes.
>
> The partial revert to V7 changes I made were (on top of 5.3):

The further testing showed a problem or two with my partial teo-v7 reversion
(I call it teo-v12) under slightly different testing conditions.

I also have a 5.3 based kernel with the current teo reverted and the entire
teo-v7 put in its place. I have yet to find an idle state disabled related
issue with this kernel.

I'll come back to this thread at a later date with better details and test
results.

... Doug
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
All,

Recall the extensive tests and work done between 2018.10.11 and 2019.01.16 on
Rafael's teo governor, versions 1 through 11, noting that versions 9, 10, and
11 did not contain functional changes.

Hi Rafael,

If the deepest idle state is disabled, the system can become somewhat
unstable, ranging from no problem at all, to an occasional temporary jump
using a lot more power for a few seconds, to a permanent jump using a lot more
power continuously. I have been unable to isolate the exact test load
conditions under which this will occur. However, temporarily disabling and
then enabling other idle states seems to make for a somewhat repeatable test.
It is important to note that the issue occurs with only ever disabling the
deepest idle state, just not reliably.

I want to know how you want to proceed before I do a bunch of regression
testing.

On 2018.12.11 03:50 Rafael J. Wysocki wrote:

> v7 -> v8:
> * Apply the selection rules to the idle deepest state as well as to
>   the shallower ones (the deepest idle state was treated differently
>   before by mistake).
> * Subtract 1/2 of the exit latency from the measured idle duration
>   in teo_update() (instead of subtracting the entire exit latency).
>   This makes the idle state selection be slightly more performance-
>   oriented.

I have isolated the issue to a subset of the v7 to v8 changes, however it was
not the exit latency changes.
The partial revert to V7 changes I made were (on top of 5.3):

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index 7d05efd..d2892bb 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -283,9 +283,28 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		if (idx < 0)
 			idx = i; /* first enabled state */
 
-		if (s->target_residency > duration_us)
-			break;
+		if (s->target_residency > duration_us) {
+			/*
+			 * If the "hits" metric of the state matching the sleep
+			 * length is greater than its "misses" metric, that is
+			 * the one to use.
+			 */
+			if (cpu_data->states[idx].hits > cpu_data->states[idx].misses)
+				break;
+			/*
+			 * It is more likely that one of the shallower states
+			 * will match the idle duration measured after wakeup,
+			 * so take the one with the maximum "early hits" metric,
+			 * but if that cannot be determined, just use the state
+			 * selected so far.
+			 */
+			if (max_early_idx >= 0) {
+				idx = max_early_idx;
+				duration_us = drv->states[idx].target_residency;
+			}
+			break;
+		}
 		if (s->exit_latency > latency_req) {
 			/*
 			 * If we break out of the loop for latency reasons, use
@@ -294,7 +313,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 			 * as long as that target residency is low enough.
 			 */
 			duration_us = drv->states[idx].target_residency;
-			goto refine;
+			break;
 		}
 
 		idx = i;
@@ -307,21 +326,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		}
 	}
 
-	/*
-	 * If the "hits" metric of the idle state matching the sleep length is
-	 * greater than its "misses" metric, that is the one to use. Otherwise,
-	 * it is more likely that one of the shallower states will match the
-	 * idle duration observed after wakeup, so take the one with the maximum
-	 * "early hits" metric, but if that cannot be determined, just use the
-	 * state selected so far.
-	 */
-	if (cpu_data->states[idx].hits <= cpu_data->states[idx].misses &&
-	    max_early_idx >= 0) {
-		idx = max_early_idx;
-		duration_us = drv->states[idx].target_residency;
-	}
-
-refine:
 	if (idx < 0) {
 		idx = 0; /* No states enabled. Must use 0. */
 	} else if (idx > 0) {

Test results (note: I use my own monitoring utility, because it feeds my
graphing/html utility, but have used turbostat here in case others want to
repeat the test):

Processor: i7-2600K
Deepest idle state: 4 (C6)

The system is mostly idle for these tests.

turbostat command used:

sudo turbostat --Summary --quiet --hide IRQ,Avg_MHz,SMI,GFXMHz,TSC_MHz,GFXWatt,CorWatt,CPU%c1,CPU%c7,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,Pkg%pc6,C1,C1E,,C1%,C1E%,C3%,C6%
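The selection rule the partial revert restores can be modelled in a few lines
of Python (a toy model of the logic in the diff, with invented names and
metric values; not the kernel code): after the candidate state is chosen,
"hits" versus "misses" decides whether to keep it or fall back to the state
with the best "early hits" metric.

```python
def refine_choice(idx, hits, misses, max_early_idx):
    """Toy model of the restored v7 rule: keep idx when its "hits" beat its
    "misses"; otherwise fall back to the state with the maximum "early hits"
    metric, if one was recorded."""
    if hits[idx] > misses[idx]:
        return idx
    if max_early_idx >= 0:
        return max_early_idx
    return idx

# The sleep-length match (state 3) has more misses than hits, so a shallower
# "early hits" candidate (state 1) is chosen instead:
print(refine_choice(3, hits=[5, 9, 2, 4], misses=[1, 1, 1, 7], max_early_idx=1))  # 1
```

The point of contention between v7 and v8 is *where* this check runs: inside
the state-search loop (v7, as restored above) versus once after the loop (v8),
which behaves differently when the loop exits normally with the deepest state
disabled.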
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
On 2019.09.24 01:06 Mel Gorman wrote:
> On Thu, Sep 19, 2019 at 07:42:29AM -0700, Doug Smythies wrote:
>> On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>>> On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
>>> [...]
>
> Hence, I think this patchset should be considered on its own merits.

Agree.

> I think the patch is fine and should be merged with the main caveat being
> that some CPU families may need to use a different calculation to account
> for turbo boost which is a per-arch and per-cpu-family decision.

Agree.

> What, if anything, should change in this patchset before it can be merged?

Nothing, and apologies for the tangential discussion.

> Is that an acked-by?

Absolutely, if I am worthy of ack'ing then:

Acked-by: Doug Smythies

... Doug
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
> On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]
>> The problem with the test is its run to run variability, which was from
>> all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>>
>> While the challenges with this work flow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>>
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> in one fully loaded CPU at the highest CPU frequency.
>>
>> I do not know if such is practical, and I didn't re-visit the issue.
>
> You're absolutely right, pinning a serialized, fork-intensive workload such as
> gitsource gives you as good of a performance as you can get, because it takes
> the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high average
fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always be practical, as you mention
> 2. it's an adversary, worst-case sort of test for some scheduler code paths

Agree.
>> For reference against which all other results are compared
>> is the forced CPU affinity test run. i.e.:
>>
>> taskset -c 3 test_script.
>>
>> Mode      Governor      degradation  Power      Bzy_MHz
>> Reference perf 1 CPU    1.00         reference  3798
>> -         performance   1.2          6% worse   3618
>> passive   ondemand      2.3
>> active    powersave     2.6
>> passive   schedutil     2.7                     1600
>> passive   schedutil-4C  1.68                    2515
>>
>> Where degradation ratio is the time to execute / the reference time for
>> the same conditions. The test runs over a wide range of processes per
>> second, and the worst ratio has been selected for the above table.
>> I have yet to write up this experiment, but the graphs that will
>> eventually be used are at [4] and [5] (same data presented two
>> different ways).
>
> Your table is interesting; I'd say that the one to beat there (from the
> schedutil point of view) is intel_pstate(active)/performance. I'm slightly
> surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around but it's
> also true that the latter lost some grip on iowait_boost in one of the recent
> dev cycles.

?? intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave,
not worse, over the entire range of PIDs (forks) per second, and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA: 0.49
>> 8x-SKYLAKE-UMA: 0.55
>> 48x-HASWELL-NUMA: 0.69
>
> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multiple tens of percentage points of
> advantage.

Agreed. This frequency invariant addition is great. However, if schedutil is
"out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also?
It performs just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somehow more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
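The "worst ratio over the sweep" selection discussed here amounts to a simple
maximum over per-operating-point ratios. A sketch with invented example
numbers (the real data is the forks-per-second sweep behind the table and
graphs referenced above):

```python
def worst_degradation(times_by_rate, reference_by_rate):
    """Degradation ratio = execution time / reference (pinned) time at the
    same forks-per-second operating point; the reported figure is the worst
    ratio over the whole sweep. All numbers here are hypothetical."""
    ratios = {rate: times_by_rate[rate] / reference_by_rate[rate]
              for rate in times_by_rate}
    return max(ratios.values())

times = {10: 11.0, 100: 15.0, 400: 27.0}        # seconds, invented
reference = {10: 10.0, 100: 10.0, 400: 10.0}    # pinned-run times, invented
print(worst_degradation(times, reference))      # 2.7
```

This is why single-number summaries from such sweeps can differ between
machines (0.69 vs 0.49): the operating point where the worst case occurs moves
with core count and turbo behaviour.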
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
On 2019.09.11 08:28 Doug Smythies wrote:
> Hi Giovanni,
>
> Thank you for the great detail and test results you provided.
>
> On 2019.09.08.07:42 Giovanni Gherdovich wrote:
>
> ... [snip]...
>
>> The test we call "gitsource" (running the git unit test suite, a long-running
>> single-threaded shell script) appears rather spectacular in this table (gains
>> of 30-50% depending on the machine). It is to be noted, however, that
>> gitsource has no adjustable parameters (such as the number of jobs in
>> kernbench, which we average over in order to get a single-number summary
>> score) and is exactly the kind of low-parallelism workload that benefits the
>> most from this patch. When looking at the detailed tables of kernbench or
>> tbench4, at low process or client counts one can see similar numbers.
>
> I think the "gitsource" test is the one I learned about here two years
> ago, [1]. It is an extremely good (best I know of) example of single
> threaded, high PID consumption (about 400 / second average, my computer
> [3]), performance issues on a multi CPU computer. I.E., this:
>
> Dountil the list of tasks is finished:
>    Start the next task in the list of stuff to do.
> Enduntil
>
> The problem with the test is its run to run variability, which was from
> all the disk I/O, as far as I could determine.

I forgot: also some memory caching. I always toss out the first test, then do
it 5 more times. If I do not do much stuff with my hard disk in between tests,
it is repeatable enough.

I did the "make test" method and, presenting the numbers your way, got that 4C
took 0.69 times as long as the unpatched schedutil. Your numbers were same or
better (copied below, lower is better):

80x-BROADWELL-NUMA: 0.49
8x-SKYLAKE-UMA: 0.55
48x-HASWELL-NUMA: 0.69

> At the time,
> I studied this to death [2], and made a more repeatable test, without
> any disk I/O.
>
> While the challenges with this work flow have tended to be focused
> on the CPU frequency scaling driver, I have always considered
> the root issue here to be a scheduling issue. Excerpt from my notes
> [2]:
>
>> The issue is that performance is much much better if the system is
>> forced to use only 1 CPU rather than relying on the defaults where
>> the CPU scheduler decides what to do.
>> The scheduler seems to not realize that the current CPU has just
>> become free, and assigns the new task to a new CPU. Thus the load
>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>> It would be better if somehow the scheduler knew that the current
>> active CPU was now able to take on the new task, overall resulting
>> in one fully loaded CPU at the highest CPU frequency.
>
> I do not know if such is practical, and I didn't re-visit the issue.
>
> Anyway these are my results:
>
> Kernel: 5.3-rc8 and + these patches
> Processor: i7-2600K
>
> This is important, at least for the performance governor numbers:
>
> cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
> 35 * 100.0 = 3500.0 MHz max turbo 4 active cores
> 36 * 100.0 = 3600.0 MHz max turbo 3 active cores
> 37 * 100.0 = 3700.0 MHz max turbo 2 active cores
> 38 * 100.0 = 3800.0 MHz max turbo 1 active cores
>
> For reference against which all other results are compared
> is the forced CPU affinity test run. i.e.:
>
> taskset -c 3 test_script.
>
> Mode      Governor      degradation  Power      Bzy_MHz
> Reference perf 1 CPU    1.00         reference  3798
> -         performance   1.2          6% worse   3618
> passive   ondemand      2.3
> active    powersave     2.6
> passive   schedutil     2.7                     1600
> passive   schedutil-4C  1.68                    2515
>
> Where degradation ratio is the time to execute / the reference time for
> the same conditions. The test runs over a wide range of processes per
> second, and the worst ratio has been selected for the above table.
> I have yet to write up this experiment, but the graphs that will
> eventually be used are at [4] and [5] (same data presented two
> different ways).

The experiment write-up is at [6]; however, I wanted more data from the lower
tasks per second region, and so I re-did it, [7]. In the limit, as sequential
tasks per second goes to 0, the differences should diminish, and I wanted to
clearly observe this. Excerpt:

> Conclusion: the schedutil governor improves from the worst
> governor to (mostly) second only to the performance governor
> for unforced CPU affinity execution.
> The energy for the performance cases is worth more detail, as it
> is
RE: [PATCH 1/2] x86,sched: Add support for frequency invariance
Hi Giovanni,

Thank you for the great detail and test results you provided.

On 2019.09.08.07:42 Giovanni Gherdovich wrote:

... [snip]...

> The test we call "gitsource" (running the git unit test suite, a long-running
> single-threaded shell script) appears rather spectacular in this table (gains
> of 30-50% depending on the machine). It is to be noted, however, that
> gitsource has no adjustable parameters (such as the number of jobs in
> kernbench, which we average over in order to get a single-number summary
> score) and is exactly the kind of low-parallelism workload that benefits the
> most from this patch. When looking at the detailed tables of kernbench or
> tbench4, at low process or client counts one can see similar numbers.

I think the "gitsource" test is the one I learned about here two years ago,
[1]. It is an extremely good (best I know of) example of single threaded, high
PID consumption (about 400 / second average, my computer [3]), performance
issues on a multi CPU computer. I.E., this:

Dountil the list of tasks is finished:
   Start the next task in the list of stuff to do.
   Wait for it to finish
Enduntil

The problem with the test is its run to run variability, which was from all
the disk I/O, as far as I could determine. At the time, I studied this to
death [2], and made a more repeatable test, without any disk I/O.

While the challenges with this work flow have tended to be focused on the CPU
frequency scaling driver, I have always considered the root issue here to be a
scheduling issue. Excerpt from my notes [2]:

> The issue is that performance is much much better if the system is
> forced to use only 1 CPU rather than relying on the defaults where
> the CPU scheduler decides what to do.
> The scheduler seems to not realize that the current CPU has just
> become free, and assigns the new task to a new CPU. Thus the load
> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> It would be better if somehow the scheduler knew that the current
> active CPU was now able to take on the new task, overall resulting
> in one fully loaded CPU at the highest CPU frequency.

I do not know if such is practical, and I didn't re-visit the issue.

Anyway these are my results:

Kernel: 5.3-rc8 and + these patches
Processor: i7-2600K

This is important, at least for the performance governor numbers:

cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores

For reference against which all other results are compared is the forced CPU
affinity test run. i.e.:

taskset -c 3 test_script.

Mode      Governor      degradation  Power      Bzy_MHz
Reference perf 1 CPU    1.00         reference  3798
-         performance   1.2          6% worse   3618
passive   ondemand      2.3
active    powersave     2.6
passive   schedutil     2.7                     1600
passive   schedutil-4C  1.68                    2515

Where degradation ratio is the time to execute / the reference time for the
same conditions. The test runs over a wide range of processes per second, and
the worst ratio has been selected for the above table. I have yet to write up
this experiment, but the graphs that will eventually be used are at [4] and
[5] (same data presented two different ways).
The energy for the performance cases is worth more detail, as it is being
wasted with CPUs waking up and going to sleep, and can be observed in the IRQ
column of turbostat output:

$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60
Busy%  Bzy_MHz  IRQ     PkgTmp  PkgWatt  GFXWatt
12.52  3798     81407   49      22.17    0.12   <<< Forced to CPU 3 only
12.52  3798     81139   51      22.18    0.12
12.52  3798     81036   51      22.20    0.12
11.43  3704     267644  48      21.16    0.12   <<< Change over
12.56  3618     490994  48      23.43    0.12   <<< Let the scheduler decide
12.56  3620     491336  47      23.50    0.12
12.56  3619     491607  47      23.50    0.12
12.56  3619     491512  48      23.52    0.12
12.56  3619     490806  47      23.51    0.12
12.56  3618     491356  49      23.48    0.12
12.56  3618     491035  48      23.51    0.12
12.56  3618     491121  48      23.46    0.12

Note also the busy megahertz column, where other active cores (constantly
waking and sleeping as we rotate through which CPUs are used) are limiting the
highest frequency.

... Doug

[1] https://marc.info/?l=linux-kernel=149181369622980=2
[2] http://www.smythies.com/~doug/linux/single-threaded/index.html
[3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png
[4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png
[5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png
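The PkgWatt figures in the turbostat output above can be reduced to a single
overhead number: the pinned runs sit near 22.2 W while the scheduler-decides
runs sit near 23.5 W. A quick back-of-the-envelope using those values:

```python
# Average package power for the pinned (taskset to CPU 3) samples:
pinned = (22.17 + 22.18 + 22.20) / 3
# Average package power once the scheduler spreads the work:
spread = (23.43 + 23.50 + 23.50 + 23.52 + 23.51 + 23.48 + 23.51 + 23.46) / 8
# Relative extra package power for the same work rate:
print(f"{100 * (spread - pinned) / pinned:.1f}% more package power")  # prints 5.9% more package power
```

That is roughly the same ~6% penalty the earlier table attributes to the
unpinned performance-governor case, consistent with the waste coming from
cores constantly waking and sleeping.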
RE: [PATCH V4 2/2] cpufreq: intel_pstate: Implement QoS supported freq constraints
On 2019.08.08 19:16 Viresh Kumar wrote:
> On 08-08-19, 09:25, Doug Smythies wrote:
>> On 2019.08.07 00:06 Viresh Kumar wrote:
>> Tested-by: Doug Smythies
>> Thermald seems to now be working O.K. for all the governors.
>
> Thanks for testing Doug.
>
>> I do note that if one sets
>> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
>> it seems to override subsequent attempts via
>> /sys/devices/system/cpu/intel_pstate/max_perf_pct.
>> Myself, I find this confusing.
>>
>> So the question becomes which one is the "master"?
>
> No one is master, cpufreq takes all the requests for frequency
> constraints and tries to set the value based on aggregation of all. So
> for max frequency, the lowest value wins and is shown up in sysfs.
>
> So, everything looks okay to me.

O.K. While I understand the explanations, I still struggle with this scenario:

doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
50    <<< Note: 50% = 1.9 GHz in my system
doug@s15:~/temp$ grep . /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy2/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy3/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy5/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy6/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy7/scaling_max_freq:1900000

At this point I am not certain what I'll get if I try to set max_perf_pct to
100%, nor do I know how to find out with a user command. So, I'll try it:

doug@s15:~/temp$ echo 100 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100    <<< Note: 100% = 3.8 GHz in my system
doug@s15:~/temp$ grep .
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
/sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy2/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy3/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy5/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy6/scaling_max_freq:2200000
/sys/devices/system/cpu/cpufreq/policy7/scaling_max_freq:2200000

I guess I had set it sometime earlier, forgot, and then didn't get 3.8 GHz as
I had expected via max_perf_pct.

... Doug
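The aggregation rule Viresh describes ("for max frequency, the lowest value
wins") can be modelled in a few lines. This is a sketch of the policy, not the
kernel's freq-QoS code; `effective_limits` and the request representation are
invented for illustration:

```python
def effective_limits(requests):
    """Aggregate per-requester frequency constraints the way described
    above: the effective max is the lowest of all "max" requests, and the
    effective min is the highest of all "min" requests (values in kHz)."""
    mins = [v for kind, v in requests if kind == "min"]
    maxs = [v for kind, v in requests if kind == "max"]
    return (max(mins, default=0), min(maxs, default=None))

# A later request asking for 3.8 GHz loses to an earlier 2.2 GHz request,
# matching the puzzling sysfs session above:
requests = [("max", 2200000), ("max", 3800000), ("min", 1600000)]
print(effective_limits(requests))  # (1600000, 2200000)
```

This also explains why sysfs shows 2200000 after max_perf_pct is raised to
100%: the old scaling_max_freq request is still outstanding and still the
lowest, so it still wins.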
RE: [PATCH V5 2/2] cpufreq: intel_pstate: Implement QoS supported freq constraints
On 2019.08.08 19:23 Viresh Kumar wrote:
> ---
> V4->V5:
> - dev_pm_qos_update_request() can return 1 in case of success, handle
>   that.

O.K. Thanks, that fixes the "Fail" messages I was getting with V4.

... Doug
RE: [PATCH V4 2/2] cpufreq: intel_pstate: Implement QoS supported freq constraints
On 2019.08.07 00:06 Viresh Kumar wrote:

Thanks for your work on this.

> Intel pstate driver exposes min_perf_pct and max_perf_pct sysfs files,
> which can be used to force a limit on the min/max P state of the driver.
> Though these files eventually control the min/max frequencies that the
> CPUs will run at, they don't make a change to policy->min/max values.
>
> When the values of these files are changed (in passive mode of the
> driver), it leads to calling ->limits() callback of the cpufreq
> governors, like schedutil. On a call to it the governors shall
> forcefully update the frequency to come within the limits. Since the
> limits, i.e. policy->min/max, aren't updated by the driver, the
> governors fail to get the target freq within limit and sometimes abort
> the update believing that the frequency is already set to the target
> value.
>
> This patch implements the QoS supported frequency constraints to update
> policy->min/max values whenever min_perf_pct or max_perf_pct files are
> updated. This is only done for the passive mode as of now, as the driver
> is already working fine in active mode.
>
> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> Reported-by: Doug Smythies
> Signed-off-by: Viresh Kumar

Tested-by: Doug Smythies

Thermald seems to now be working O.K. for all the governors.

I do note that if one sets
/sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
it seems to override subsequent attempts via
/sys/devices/system/cpu/intel_pstate/max_perf_pct.
Myself, I find this confusing.

So the question becomes which one is the "master"?

Example:

# for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
2200000
2200000
2200000
2200000
2200000
2200000
2200000
2200000
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
...
(Note: 50% = 1900000)
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
1900000
1900000
1900000
1900000
1900000
1900000
1900000
1900000
root@s15:/home/doug/temp# echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
...
(Note: 100% = 3800000, and my expectation is 3.8 GHz below)
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
2200000
2200000
2200000
2200000
2200000
2200000
2200000
2200000

Similarly for the minimum side of things:

root@s15:/home/doug/temp# for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq; do echo "3200000" > $file; done
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
3200000
3200000
3200000
3200000
3200000
3200000
3200000
3200000
root@s15:/home/doug/temp# echo 42 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
42
...
(note 42% = 1600000 = processor minimum, and that is my expectation below.)
root@s15:/home/doug/temp# cat /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq
3200000
3200000
3200000
3200000
3200000
3200000
3200000
3200000

I thought these minimum anomalies would cause problems for thermald, but for
whatever reason, it seems to work properly.

> ---
> V3->V4:
> - Reimplemented the solution using QoS constraints instead of
>   resolve_freq() callback.
> 
>  drivers/cpufreq/intel_pstate.c | 120 +++--
>  1 file changed, 116 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index cc27d4c59dca..e9fbd6c36822 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -24,6 +24,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
> 
>  #include
> @@ -1085,6 +1086,47 @@ static ssize_t store_no_turbo(struct kobject *a, struct kobj_attribute *b,
>  	return count;
>  }
> 
> +static struct cpufreq_driver intel_pstate;
> +
> +static void update_qos_request(enum dev_pm_qos_req_type type)
> +{
> +	int max_state, turbo_max, freq, i, perf_pct;
> +	struct dev_pm_qos_request *req;
> +	struct cpufreq_policy *policy;
> +
> +	for_each_possible_cpu(i) {
> +		struct cpudata *cpu = all_cpu_data[i];
> +
> +		policy = cpufreq_cpu_get(i);
> +		if (!policy)
> +			continue;
> +
> +		req = policy->driver_data;
> +		cpufreq_cpu_put(policy);
> +
> +		if (!req)
> +			continue;
> +
> +		if (hwp_active)
> +			intel_pstate_get_hwp_max(i, &turbo_max, &max_state);
> +		else
> +
RE: [PATCH V3 2/2] cpufreq: intel_pstate: Implement ->resolve_freq()
On 2019.08.02 02:28 Rafael J. Wysocki wrote: > On Friday, August 2, 2019 11:17:55 AM CEST Rafael J. Wysocki wrote: >> On Fri, Aug 2, 2019 at 7:44 AM Viresh Kumar wrote: >>> >>> Intel pstate driver exposes min_perf_pct and max_perf_pct sysfs files, >>> which can be used to force a limit on the min/max P state of the driver. >>> Though these files eventually control the min/max frequencies that the >>> CPUs will run at, they don't make a change to policy->min/max values. >> >> That's correct. >> >>> When the values of these files are changed (in passive mode of the >>> driver), it leads to calling ->limits() callback of the cpufreq >>> governors, like schedutil. On a call to it the governors shall >>> forcefully update the frequency to come within the limits. >> >> OK, so the problem is that it is a bug to invoke the governor's ->limits() >> callback without updating policy->min/max, because that's what >> "limits" mean to the governors. >> >> Fair enough. > > AFAICS this can be addressed by adding PM QoS freq limits requests of each > CPU to > intel_pstate in the passive mode such that changing min_perf_pct or > max_perf_pct > will cause these requests to be updated. All governors for the intel_cpufreq (intel_pstate in passive mode) CPU frequency scaling driver are broken with respect to this issue, not just the schedutil governor. My initial escalation had been focused on acpi-cpufreq/schedutil and intel_cpufreq/schedutil, as they were both broken, and both fixed by my initially submitted reversion. What can I say, I missed that other intel_cpufreq governors were also involved. I tested all of them: conservative ondemand userspace powersave performance schedutil Note that no other governor uses resolve_freq(). ... Doug
RE: [PATCH V3 1/2] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.08.02 02:12 Rafael J. Wysocki wrote: > On Fri, Aug 2, 2019 at 7:44 AM Viresh Kumar wrote: >> >> To avoid reducing the frequency of a CPU prematurely, we skip reducing >> the frequency if the CPU had been busy recently. >> >> This should not be done when the limits of the policy are changed, for >> example due to thermal throttling. We should always get the frequency >> within the new limits as soon as possible. >> >> Trying to fix this by using only one flag, i.e. need_freq_update, can >> lead to a race condition where the flag gets cleared without forcing us >> to change the frequency at least once. And so this patch introduces >> another flag to avoid that race condition. >> >> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") >> Cc: v4.18+ # v4.18+ >> Reported-by: Doug Smythies >> Signed-off-by: Viresh Kumar >> --- >> V2->V3: >> - Updated commit log. >> >> V1->V2: >> - Fixed the race condition using a different flag. >> >> @Doug: I haven't changed the code since you last tested these. Your >> Tested-by tag can be useful while applying the patches. Thanks. Tested-by: Doug Smythies For acpi-cpufreq/schedutil only (which we already know). I tested including Rafael's suggested change. I.E. 
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 592ff72..ae3ec77 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -441,7 +441,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
 	unsigned long util, max;
 	unsigned int next_f;
-	bool busy = false;
+	bool busy;
 
 	sugov_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -452,8 +452,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
 		return;
 
 	/* Limits may have changed, don't skip frequency update */
-	if (!sg_policy->need_freq_update)
-		busy = sugov_cpu_is_busy(sg_cpu);
+	busy = !sg_policy->need_freq_update && sugov_cpu_is_busy(sg_cpu);
 
 	util = sugov_get_util(sg_cpu);
 	max = sg_cpu->max;

... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.31 23:17 Viresh Kumar wrote: > On 31-07-19, 17:20, Doug Smythies wrote: >> Summary: >> >> The old way, using UINT_MAX had two purposes: first, >> as a "need to do a frequency update" flag; but also second, to >> force any subsequent old/new frequency comparison to NOT be "the same, >> so why bother actually updating" (see: sugov_update_next_freq). All >> patches so far have been dealing with the flag, but only partially >> the comparisons. In a busy system, and when schedutil.c doesn't actually >> know the currently set system limits, the new frequency is dominated by >> values the same as the old frequency. So, when sugov_fast_switch calls >> sugov_update_next_freq, false is usually returned. > > And finally we know "Why" :) > > Good work Doug. Thanks for taking it to the end. > >> However, if we move the resetting of the flag and add another condition >> to the "no need to actually update" decision, then perhaps this patch >> version 1 will be O.K. It seems to be. (see way later in this e-mail). > >> With all this new knowledge, how about going back to >> version 1 of this patch, and then adding this: >> >> diff --git a/kernel/sched/cpufreq_schedutil.c >> b/kernel/sched/cpufreq_schedutil.c >> index 808d32b..f9156db 100644 >> --- a/kernel/sched/cpufreq_schedutil.c >> +++ b/kernel/sched/cpufreq_schedutil.c >> @@ -100,7 +100,12 @@ static bool sugov_should_update_freq(struct >> sugov_policy *sg_policy, u64 time) >> static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time, >>unsigned int next_freq) >> { >> - if (sg_policy->next_freq == next_freq) >> + /* >> +* Always force an update if the flag is set, regardless. >> +* In some implementations (intel_cpufreq) the frequency is clamped >> +* further downstream, and might not actually be different here. 
>> +*/ >> + if (sg_policy->next_freq == next_freq && >> !sg_policy->need_freq_update) >> return false; > > This is not correct because this is an optimization we have in place > to make things more efficient. And it was working by luck earlier and > my patch broke it for good :) Disagree. All I did was use a flag where it used to be set to UNIT_MAX, to basically implement the same thing. > Things need to get a bit more synchronized and something like this may > help (completely untested): > > diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c > index cc27d4c59dca..2d84361fbebc 100644 > --- a/drivers/cpufreq/intel_pstate.c > +++ b/drivers/cpufreq/intel_pstate.c > @@ -2314,6 +2314,18 @@ static int intel_cpufreq_target(struct cpufreq_policy > *policy, >return 0; > } > > +static unsigned int intel_cpufreq_resolve_freq(struct cpufreq_policy *policy, > + unsigned int target_freq) > +{ > + struct cpudata *cpu = all_cpu_data[policy->cpu]; > + int target_pstate; > + > + target_pstate = DIV_ROUND_UP(target_freq, cpu->pstate.scaling); > + target_pstate = intel_pstate_prepare_request(cpu, target_pstate); > + > + return target_pstate * cpu->pstate.scaling; > +} > + > static unsigned int intel_cpufreq_fast_switch(struct cpufreq_policy *policy, > unsigned int target_freq) > { > @@ -2350,6 +2362,7 @@ static struct cpufreq_driver intel_cpufreq = { > .verify = intel_cpufreq_verify_policy, > .target = intel_cpufreq_target, > .fast_switch= intel_cpufreq_fast_switch, > + .resolve_freq = intel_cpufreq_resolve_freq, > .init = intel_cpufreq_cpu_init, > .exit = intel_pstate_cpu_exit, > .stop_cpu = intel_cpufreq_stop_cpu, > > -8<- > > Please try this with my patch 2. O.K. > We need patch 2 instead of 1 because > of another race condition Rafael noticed. Disagree. Notice that my modifications to your patch1 addresses that condition by moving the clearing of "need_freq_update" to sometime later. 
> > cpufreq_schedutil calls driver specific resolve_freq() to find the new > target frequency and this is where the limits should get applied IMO. Oh! I didn't know. But yes, that makes sense. > > Rafael can help with reviewing this diff but it would be great if you > can give this a try Doug. Anyway, I added the above code (I am calling it patch3) to patch2, as you asked, and it d
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
Hi Viresh, Summary: The old way, using UINT_MAX had two purposes: first, as a "need to do a frequency update" flag; but also second, to force any subsequent old/new frequency comparison to NOT be "the same, so why bother actually updating" (see: sugov_update_next_freq). All patches so far have been dealing with the flag, but only partially the comparisons. In a busy system, and when schedutil.c doesn't actually know the currently set system limits, the new frequency is dominated by values the same as the old frequency. So, when sugov_fast_switch calls sugov_update_next_freq, false is usually returned. However, if we move the resetting of the flag and add another condition to the "no need to actually update" decision, then perhaps this patch version 1 will be O.K. It seems to be. (see way later in this e-mail). On 2019.07.29 01:38 Rafael J. Wysocki wrote: > On Mon, Jul 29, 2019 at 10:32 AM Viresh Kumar wrote: >> On 29-07-19, 00:55, Doug Smythies wrote: >>> On 2019.07.25 23:58 Viresh Kumar wrote: ...[snip]... >>>> Now, the frequency never gets down and so gets set to the maximum >>>> possible after a bit. >>>> >>>> - Then I did: >>>> >>>> echo > >>>> /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq >>>> >>>> Without my patch applied: >>>>The print never gets printed and so frequency doesn't go down. >>>> >>>> With my patch applied: >>>>The print gets printed immediately from sugov_work() and so >>>>the frequency reduces. >>>> >>>> Can you try with this diff along with my Patch2 ? I suspect there may >>>> be something wrong with the intel_cpufreq driver as the patch fixes >>>> the only path we have in the schedutil governor which takes busyness >>>> of a CPU into account. >>> >>> With this diff along with your patch2 There is never a print message >>> from sugov_work. There are from sugov_fast_switch. >> >> Which is okay. sugov_work won't get hit in your case as I explained >> above. O.K., I finally understand. 
>>> Note that for the intel_cpufreq CPU scaling driver and the schedutil >>> governor I adjust the maximum clock frequency this way: >>> >>> echo > /sys/devices/system/cpu/intel_pstate/max_perf_pct >> >> This should eventually call sugov_limits() in schedutil governor, this >> can be easily checked with another print message. >> >>> I also applied the pr_info messages to the reverted kernel, and re-did >>> my tests (where everything works as expected). There is never a print >>> message from sugov_work. There are from sugov_fast_switch. >> >> that's fine. >> >>> Notes: >>> >>> I do not know if: >>> /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq >>> /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq >>> Need to be accurate when using the intel_pstate driver in passive mode. >>> They are not. >>> The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0 >>> suggests that they might need to be representative. >>> I wonder if something similar to that commit is needed >>> for other global changes, such as max_perf_pct and min_perf_pct? >> >> We are already calling intel_pstate_update_policies() in that case, so >> it should be fine I believe. I now believe that lack of synchronization between /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq and /sys/devices/system/cpu/intel_pstate/max_perf_pct is the root issue here. The UINT_MAX next freq flag was also used to force a change because the limits are then forced to be checked and enforced. This e-mail continues with the assumption that this lack of synchronization is O.K., because that is the way it is now. However, if you want to have them synchronized then Viresh's patch1 will work fine afterwards. >> >>> intel_cpufreq/ondemand doesn't work properly on the reverted kernel. >> >> reverted kernel ? The patch you reverted was only for schedutil and it >> shouldn't have anything to do with ondemand. Agreed. This is on hold until I have time to look into it. 
>>> (just discovered, not investigated) >>> I don't know about other governors. >> >> When you do: >> >> echo > /sys/devices/system/cpu/intel_pstate/max_perf_pct >> >> How soon does the print from sugov_fast_switch() gets printed ? >> Immediately ? Check with both the kernels, with my patch and with the >> reverted patch. I don't really know how long. So I added a message so I could ge
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.31 Viresh Kumar wrote: > To avoid reducing the frequency of a CPU prematurely, we skip reducing > the frequency if the CPU had been busy recently. > > This should not be done when the limits of the policy are changed, for > example due to thermal throttling. We should always get the frequency > within the new limits as soon as possible. > > Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX") > Cc: v4.18+ # v4.18+ > Reported-by: Doug Smythies > Signed-off-by: Viresh Kumar > --- > @Doug: Can you please provide your Tested-by for this commit, as it > already fixed the issue around acpi-cpufreq driver. > > We will continue to see what's wrong with intel-pstate though. Please give me a few more hours. I'll reply to another thread with new information at that time. My recommendation will be to scrap this "patch2" and go back to "patch1" [1], with a couple of modifications. The logic of patch1 is sound. Teaser: it is working for intel_cpufreq/schedutil, but I have yet to test acpi-cpufreq/schedutil. [1] https://marc.info/?l=linux-pm=156377832225470=2 ... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.25 23:58 Viresh Kumar wrote: > On 25-07-19, 08:20, Doug Smythies wrote: >> I tried the patch ("patch2"). It did not fix the issue. >> >> To summarize, all kernel 5.2 based, all intel_cpufreq driver and schedutil >> governor: >> >> Test: Does a busy system respond to maximum CPU clock frequency reduction? >> >> stock, unaltered: No. >> revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes >> viresh patch: No. >> fast_switch edit: No. >> viresh patch2: No. > > Hmm, so I tried to reproduce your setup on my ARM board. > - booted only with CPU0 so I hit the sugov_update_single() routine > - And applied below diff to make CPU look permanently busy: > > -8<- >diff --git a/kernel/sched/cpufreq_schedutil.c >b/kernel/sched/cpufreq_schedutil.c > index 2f382b0959e5..afb47490e5dc 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -121,6 +121,7 @@ static void sugov_fast_switch(struct sugov_policy > *sg_policy, u64 time, > if (!sugov_update_next_freq(sg_policy, time, next_freq)) >return; > > + pr_info("%s: %d: %u\n", __func__, __LINE__, freq); ?? there is no "freq" variable here, and so this doesn't compile. 
However this works:

+	pr_info("%s: %d: %u\n", __func__, __LINE__, next_freq);

>  	next_freq = cpufreq_driver_fast_switch(policy, next_freq);
>  	if (!next_freq)
>  		return;
> 
> @@ -424,14 +425,10 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>  #ifdef CONFIG_NO_HZ_COMMON
>  static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
>  {
> -	unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
> -	bool ret = idle_calls == sg_cpu->saved_idle_calls;
> -
> -	sg_cpu->saved_idle_calls = idle_calls;
> -	return ret;
> +	return true;
>  }
>  #else
> -static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return true; }
>  #endif /* CONFIG_NO_HZ_COMMON */
> 
>  /*
> @@ -565,6 +562,7 @@ static void sugov_work(struct kthread_work *work)
>  	sg_policy->work_in_progress = false;
>  	raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
> 
> +	pr_info("%s: %d: %u\n", __func__, __LINE__, freq);
>  	mutex_lock(&sg_policy->work_lock);
>  	__cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
>  	mutex_unlock(&sg_policy->work_lock);
> 
> -8<-
> 
> Now, the frequency never gets down and so gets set to the maximum
> possible after a bit.
> 
> - Then I did:
> 
> echo > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
> 
> Without my patch applied:
>    The print never gets printed and so frequency doesn't go down.
> 
> With my patch applied:
>    The print gets printed immediately from sugov_work() and so
>    the frequency reduces.
> 
> Can you try with this diff along with my Patch2 ? I suspect there may
> be something wrong with the intel_cpufreq driver as the patch fixes
> the only path we have in the schedutil governor which takes busyness
> of a CPU into account.

With this diff along with your patch2, there is never a print message
from sugov_work. There are from sugov_fast_switch.
Note that for the intel_cpufreq CPU scaling driver and the schedutil governor I adjust the maximum clock frequency this way: echo > /sys/devices/system/cpu/intel_pstate/max_perf_pct I also applied the pr_info messages to the reverted kernel, and re-did my tests (where everything works as expected). There is never a print message from sugov_work. There are from sugov_fast_switch. Notes: I do not know if: /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq /sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq Need to be accurate when using the intel_pstate driver in passive mode. They are not. The commit comment for 9083e4986124389e2a7c0ffca95630a4983887f0 suggests that they might need to be representative. I wonder if something similar to that commit is needed for other global changes, such as max_perf_pct and min_perf_pct? intel_cpufreq/ondemand doesn't work properly on the reverted kernel. (just discovered, not investigated) I don't know about other governors. ... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
Hi, I am having trouble keeping up. Here is what I have so far: On 2019.07.24 04:43 Viresh Kumar wrote: > On 23-07-19, 12:27, Rafael J. Wysocki wrote: >> On Tue, Jul 23, 2019 at 11:15 AM Viresh Kumar >> wrote: >>> Though there is one difference between intel_cpufreq and acpi_cpufreq, >>> intel_cpufreq has fast_switch_possible=true and so it uses slightly >>> different path in schedutil. I tried to look from that perspective as >>> well but couldn't find anything wrong. >> >> acpi-cpufreq should use fast switching on the Doug's system too. > > Ah okay. > >>> If you still find intel_cpufreq to be broken, even with this patch, It is. >>> please set fast_switch_possible=false instead of true in >>> __intel_pstate_cpu_init() and try tests again. That shall make it very >>> much similar to acpi-cpufreq driver. >> >> I wonder if this helps. It does not help. >> Even so, we want fast switching to be used by intel_cpufreq. > > With both using fast switching it shouldn't make any difference. >> Anyway, it looks like the change reverted by the Doug's patch >> introduced a race condition that had not been present before. Namely, >> need_freq_update is cleared in get_next_freq() when it is set _or_ >> when the new freq is different from the cached one, so in the latter >> case if it happens to be set by sugov_limits() after evaluating >> sugov_should_update_freq() (which returned 'true' for timing reasons), >> that update will be lost now. [Previously the update would not be >> lost, because the clearing of need_freq_update depended only on its >> current value.] Where it matters is that in the "need_freq_update set" >> case, the "premature frequency reduction avoidance" should not be >> applied (as you noticed and hence the $subject patch). >> >> However, even with the $subject patch, need_freq_update may still be >> set by sugov_limits() after the check added by it and then cleared by >> get_next_freq(), so it doesn't really eliminate the problem. 
>> IMO eliminating would require invalidating next_freq this way or
>> another when need_freq_update is set in sugov_should_update_freq(),
>> which was done before commit ecd2884291261e3fddbc7651ee11a20d596bb514.
>
> Hmm, so to avoid locking in fast path we need two variable group to
> protect against this kind of issues. I still don't want to override
> next_freq with a special meaning as it can cause hidden bugs, we have
> seen that earlier.
>
> What about something like this then ?

I tried the patch ("patch2"). It did not fix the issue.

To summarize, all kernel 5.2 based, all intel_cpufreq driver and
schedutil governor:

Test: Does a busy system respond to maximum CPU clock frequency reduction?

stock, unaltered: No.
revert ecd2884291261e3fddbc7651ee11a20d596bb514: Yes.
viresh patch: No.
fast_switch edit: No.
viresh patch2: No.

References (and procedures used):
https://marc.info/?l=linux-pm=156346478429147=2
https://marc.info/?l=linux-kernel=156343125319461=2

... Doug
RE: [PATCH] cpufreq: schedutil: Don't skip freq update when limits change
On 2019.07.21 23:52 Viresh Kumar wrote:

> To avoid reducing the frequency of a CPU prematurely, we skip reducing
> the frequency if the CPU had been busy recently.
>
> This should not be done when the limits of the policy are changed, for
> example due to thermal throttling. We should always get the frequency
> within limits as soon as possible.
>
> Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
> Cc: v4.18+ # v4.18+
> Reported-by: Doug Smythies
> Signed-off-by: Viresh Kumar
> ---
> @Doug: Please try this patch, it must fix the issue you reported.

It fixes the driver = acpi-cpufreq ; governor = schedutil test case.
It does not fix the driver = intel_cpufreq ; governor = schedutil test case.

I have checked my results twice, but will check again in a day or two.

... Doug

> 
>  kernel/sched/cpufreq_schedutil.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 636ca6f88c8e..b53c4f02b0f1 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -447,7 +447,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>  	unsigned long util, max;
>  	unsigned int next_f;
> -	bool busy;
> +	bool busy = false;
> 
>  	sugov_iowait_boost(sg_cpu, time, flags);
>  	sg_cpu->last_update = time;
> @@ -457,7 +457,9 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
>  	if (!sugov_should_update_freq(sg_policy, time))
>  		return;
> 
> -	busy = sugov_cpu_is_busy(sg_cpu);
> +	/* Limits may have changed, don't skip frequency update */
> +	if (!sg_policy->need_freq_update)
> +		busy = sugov_cpu_is_busy(sg_cpu);
> 
>  	util = sugov_get_util(sg_cpu);
>  	max = sg_cpu->max;
> --
> 2.21.0.rc0.269.g1a574e7a288b
RE: [PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
On 2019.07.18 03:28 Viresh Kumar wrote: > On 17-07-19, 23:26, Doug Smythies wrote: >> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514. >> >> The commit caused a regression whereby reducing the maximum >> CPU clock frequency is ineffective while busy, and the CPU >> clock remains unchanged. Once the system has experienced >> some idle time, the new limit is implemented. > > Can you explain why this patch caused that issue ? I am sorry but I couldn't > understand it from your email. How are we trying to reduce the frequency? Is > clk_set_rate() getting called with that finally and not working ? The patch eliminates the "flag", UNIT_MAX, and it's related comment, that was used to indicate if it was a limit change that causes the subsequent execution of sugov_update_single. As for "clk_set_rate()", I don't know. I bisected the kernel and got here. I also looked at reverting B7EAF1AAB9F8, but it seemed to have some purpose which I don't know of more important than this one or not. I'll reinsert the first test below with more detail: On 2019.07.17 23:27 Doug wrote: > Driver: intel_cpufreq ; Governor: Schedutil > Kernel 5.2: $ uname -a Linux s15 5.2.0-stock #608 SMP PREEMPT Mon Jul 8 08:21:56 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq intel_cpufreq doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor schedutil schedutil schedutil schedutil schedutil schedutil schedutil schedutil > Test 1: No thermal throttling involved (I only ever use thermal, if anything): doug@s15:~$ sudo systemctl status thermald.service . thermald.service - Thermal Daemon Service Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled) Active: inactive (dead) > - load the system (some sort of CPU stress test). Use whatever you want. 
I use my own, only because I always have and it prints something every
so often which gives an indication of actual clock speed.

doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3000
[2] 3001
[3] 3002
[4] 3003
[5] 3004
[6] 3005
[7] 3006
[8] 3007
[9] 3008

Watch everything with turbostat:

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.03    1600    269     25      3.69
0.07    1600    390     25      3.72
0.06    1600    368     25      3.72
0.06    1600    343     25      3.71
0.04    1600    269     26      3.70
74.72   3474    30230   46      43.95   <<< Add load
100.00  3500    40228   50      58.27
100.00  3500    40196   52      58.59
100.00  3500    40215   55      58.91
100.00  3500    40211   56      59.12
100.00  3500    40291   58      59.33   <<< Try to set 60% max
100.00  3500    40278   59      59.55
100.00  3500    40591   61      59.73
100.00  3500    40279   61      60.10
100.00  3500    40292   63      60.35
100.00  3500    40401   64      60.85
99.99   3500    40352   65      61.12
100.00  3500    40230   67      61.32
100.00  3500    40228   67      61.52
100.00  3500    40400   68      61.60   <<< Try to set 42% max
100.00  3500    40279   69      61.74
100.00  3500    40258   68      61.85
100.00  3500    40226   70      62.01
100.00  3500    40220   70      62.10
100.00  3500    40211   71      62.25

> - adjust downwards the maximum clock frequency
> echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
60

... wait awhile ...

doug@s15:~/temp$ echo 42 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
42
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
42

> - observe the Bzy_MHz does not change on the turbostat output.

See annotated turbostat output above.

> - reduce the system load, such that there is some idle time.
> - observe the Bzy_MHz now does change.
> - increase the load again.
> - observe the CPU frequency is now pinned to the new limit.
Go back to 60% first:

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60

killall test1

... wait awhile ...

doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3040
[2] 3041
[3] 3042
[4] 3043
[5] 3044
[6] 3045
[7] 3046
[8] 3047
[9] 3048

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.00  3500    40289   80      64.32
100.00
[PATCH] Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"
This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.

The commit caused a regression whereby reducing the maximum CPU clock
frequency is ineffective while busy, and the CPU clock remains
unchanged. Once the system has experienced some idle time, the new
limit is implemented. A consequence is that any thermal throttling
monitoring and control based on max freq limits fails to respond in a
timely manner, if at all, to a thermal temperature trip on a busy
system.

Please review the original e-mail threads, as this attempted reversion
would then re-open some of those questions.

One possible alternative to this reversion would be to revert B7EAF1AAB9F8:

Author: Rafael J. Wysocki
Date: Wed Mar 22 00:08:50 2017 +0100

    cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

References:
https://marc.info/?t=15257618932=1=2
https://marc.info/?t=15258621902=1=2
https://marc.info/?t=15260716984=1=2
https://marc.info/?t=14901381392=1=2

---

Test Data:

Typical commands:

watch -n 5 cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq

Driver: intel_cpufreq ; Governor: Schedutil

Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test).
- adjust downwards the maximum clock frequency:
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try thermal throttling:
- By default thermald will use pstate control first. On my system a
  lower thermal trip point is used, because it doesn't get hot enough.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald adjusting
  the max_perf_pct but nothing happens in response.
- Once the lowest limit of max_perf_pct is reached, observe thermald
  fall through to intel-powerclamp, and kidle inject starts to be used.
  Therefore the "busy" condition (see the code) is no longer met, and
  the CPU frequency suddenly goes to minimum.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one option,
  thereby excluding help from the fall-through partner experienced in
  test 2.
- Observations here tracked test 2 until the lower limit of
  max_perf_pct is reached. The CPUs remain at maximum frequency until
  the load is removed. Processor temperature continues to climb.

Driver: intel_cpufreq ; Governor: Schedutil

Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency:
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try thermal throttling:
- By default thermald will use pstate control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald adjusting
  the max_perf_pct and observe the actual frequency adjust in response.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one option,
  adjusting max_perf_pct.
- Observations here tracked test 2. (Test 2 never fell through to
  intel_powerclamp and kidle inject.)
Driver: acpi-cpufreq ; Governor: Schedutil

Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency:
  for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "220" > $file; done
- observe that Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe that Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald adjusting the kidle_inj amounts.
- If the maximum kidle_inj is not enough, then observe thermald start to limit scaling_max_freq; here, however, the actual frequency response is immediate. Why? Because of the kidle inject, the system is no longer fully "busy", and so that condition is not met.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include
RE: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems
On 2019.07.03 12:12 Doug Smythies wrote: > On 2019.07.03 08:16 Daniel Lezcano wrote: >> On 03/07/2019 16:23, Doug Smythies wrote: >>> On 2019.06.20 04:58 Daniel Lezcano wrote: > ... >>> Anyway, I did a bunch of tests and such, but have deleted >>> most from this e-mail, because it's just noise. I'll >>> include just one set: >>> >>> For a work load that would normally result in a lot of use >>> of shallow idle states (single core pipe-test * 2 cores). >> >> Can you share the tests and the command lines? > > Yes, give me a few days to repeat the tests and write > it up properly. I am leaving town in an hour and for a day. O.K. I re-did the tests and made a new web page with, I think, everything I used. ... >> The governor can be better by selecting the shallow states, the >> scheduler has to interact with the governor to give clues about the >> load, that is identified and will be the next step. >> >> Is it possible to check with the schedutil governor instead? > > Oh, I already have some data, just didn't include it before: > > Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; > Processor package power: 40.4 watts; 4.9 uSec/loop > > Idle governor, mobile; CPU frequency scaling: intel-cpufreq/schedutil; > Processor package power: 12.7 watts; 19.7 uSec/loop > > Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; > Idle states 0-3 disabled (note: Idle state 4 is the deepest on my system) > Processor package power: 36.9 watts; 8.3 uSec/loop > In my notes I wrote: "Huh?? I do not understand this result, as I had > expected more similar to the mobile governor". But I did not investigate. The reason for the big difference was/is that with the "mobile" governor the CPU frequency never scales up for this test. It can be sluggish to scale up with the teo and the menu governors, but always eventually does. I also tried the acpi-cpufreq driver with similar results. > Anyway, the schedutil test is the one I'll repeat and write up better. 
New summary (similar to old):

governor          usec/loop  watts
mobile            19.8       12.67
teo                4.87      40.28
menu               4.85      40.25
teo-idle-0-3-dis   8.30      36.85

Graphs, details and source codes:
http://www.smythies.com/~doug/linux/idle/mobile/index.html

... Doug
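One way to read the summary table above is to combine its two columns into a derived metric: energy per unit of work (watts x usec/loop = microjoules per loop). The governor names and numbers below are taken from the table; the uJ/loop metric itself is an illustration added here, not a figure from the original tests:

```python
# Energy per pipe-test loop, derived from the summary table:
# watts * usec/loop = microjoules per loop iteration.
results = {
    # governor:         (usec_per_loop, watts)
    "mobile":           (19.8, 12.67),
    "teo":              (4.87, 40.28),
    "menu":             (4.85, 40.25),
    "teo-idle-0-3-dis": (8.30, 36.85),
}

for gov, (usec, watts) in results.items():
    uj_per_loop = usec * watts  # W * us = uJ
    print(f"{gov:18s} {uj_per_loop:7.1f} uJ/loop")
```

Worked out this way, the mobile governor spends roughly 251 uJ per loop versus about 196 uJ for teo or menu, despite drawing far fewer watts: the lower power does not translate into lower energy for this fixed amount of work.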
RE: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems
On 2019.07.03 08:16 Daniel Lezcano wrote: > On 03/07/2019 16:23, Doug Smythies wrote: >> On 2019.06.20 04:58 Daniel Lezcano wrote: ... >> Anyway, I did a bunch of tests and such, but have deleted >> most from this e-mail, because it's just noise. I'll >> include just one set: >> >> For a work load that would normally result in a lot of use >> of shallow idle states (single core pipe-test * 2 cores). > > Can you share the tests and the command lines? Yes, give me a few days to repeat the tests and write it up properly. I am leaving town in an hour and for a day. It'll be similar to this: http://www.smythies.com/~doug/linux/idle/teo8/pipe/index.html parent page (which I will do a better version): http://www.smythies.com/~doug/linux/idle/teo8/index.html ... >> I got (all kernel 5.2-rc5 + this patch): >> >> Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 40.4 watts; 4.9 uSec/loop >> >> Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 34 watts; 5.2 uSec/loop >> >> Idle governor, mobile; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 25.9 watts; 11.1 uSec/loop >> >> Idle governor, menu; CPU frequency scaling: intel-cpufreq/ondemand; >> Processor package power: 34.2 watts; 5.23 uSec/loop >> >> Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; >> Maximum CPU frequency limited to 73% to match mobile energy. >> Processor package power: 25.4 watts; 6.4 uSec/loop > > Ok that's interesting. Thanks for the values. > > The governor can be better by selecting the shallow states, the > scheduler has to interact with the governor to give clues about the > load, that is identified and will be the next step. > > Is it possible to check with the schedutil governor instead? 
Oh, I already have some data, just didn't include it before: Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; Processor package power: 40.4 watts; 4.9 uSec/loop Idle governor, mobile; CPU frequency scaling: intel-cpufreq/schedutil; Processor package power: 12.7 watts; 19.7 uSec/loop Idle governor, teo; CPU frequency scaling: intel-cpufreq/schedutil; Idle states 0-3 disabled (note: Idle state 4 is the deepest on my system) Processor package power: 36.9 watts; 8.3 uSec/loop In my notes I wrote: "Huh?? I do not understand this result, as I had expected more similar to the mobile governor". But I did not investigate. Anyway, the schedutil test is the one I'll repeat and write up better. ... Doug
RE: [PATCH] cpuidle/drivers/mobile: Add new governor for mobile/embedded systems
Hi Daniel,

I tried your "mobile" governor, albeit not on a mobile device.

On 2019.06.20 04:58 Daniel Lezcano wrote:
...
> The mobile governor is a new governor targeting embedded systems
> running on battery where the energy saving has a higher priority than
> servers or desktops. This governor aims to save energy as much as
> possible but with a performance degradation tolerance.
>
> In this way, we can optimize the governor for specific mobile workload
> and more generally embedded systems without impacting other platforms.

I just wanted to observe the lower energy, accepting performance degradation. My workloads may have been inappropriate.
...
> +
> +#define EMA_ALPHA_VAL		64
> +#define EMA_ALPHA_SHIFT		7
> +#define MAX_RESCHED_INTERVAL_MS	100
> +
> +static DEFINE_PER_CPU(struct mobile_device, mobile_devices);
> +
> +static int mobile_ema_new(s64 value, s64 ema_old)
> +{
> +	if (likely(ema_old))
> +		return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >>
> +				  EMA_ALPHA_SHIFT);
> +	return value;
> +}

Do you have any information as to why these numbers? Without any background, the filter seems overly new value weighted to me. It is an infinite impulse response type filter, currently at:

output = 0.5 * old + 0.5 * new

I tried, but didn't get anything conclusive:

output = 0.875 * old + 0.125 * new

I did it this way:

#define EMA_ALPHA_VAL		7
#define EMA_ALPHA_SHIFT		3
#define MAX_RESCHED_INTERVAL_MS	100

static DEFINE_PER_CPU(struct mobile_device, mobile_devices);

static int mobile_ema_new(s64 value, s64 ema_old)
{
	if (likely(ema_old))
		return ((ema_old * EMA_ALPHA_VAL) + value) >> EMA_ALPHA_SHIFT;
	return value;
}

...
> +	/*
> +	 * Sum all the residencies in order to compute the total
> +	 * duration of the idle task.
> +	 */
> +	residency = dev->last_residency - s->exit_latency;

What about when the CPU comes out of the idle state before it even gets fully into it?
Under such conditions it seems to hold much too hard at idle states that are too deep, to the point where energy goes up while performance goes down. Anyway, I did a bunch of tests and such, but have deleted most from this e-mail, because it's just noise. I'll include just one set: For a work load that would normally result in a lot of use of shallow idle states (single core pipe-test * 2 cores). I got (all kernel 5.2-rc5 + this patch): Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 40.4 watts; 4.9 uSec/loop Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 34 watts; 5.2 uSec/loop Idle governor, mobile; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 25.9 watts; 11.1 uSec/loop Idle governor, menu; CPU frequency scaling: intel-cpufreq/ondemand; Processor package power: 34.2 watts; 5.23 uSec/loop Idle governor, teo; CPU frequency scaling: intel-cpufreq/ondemand; Maximum CPU frequency limited to 73% to match mobile energy. Processor package power: 25.4 watts; 6.4 uSec/loop ... Doug
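Going back to the EMA filter question above: the two weightings can be compared with a quick userspace sketch. This is Python rather than kernel C, the function names mirror mobile_ema_new from the patch, and the step-response numbers are purely illustrative (Python's arithmetic right shift on negative values floors, matching the usual gcc behaviour for signed shifts):

```python
EMA_ALPHA_VAL = 64
EMA_ALPHA_SHIFT = 7

def ema_patch(value, ema_old):
    """The patch's filter: output = 0.5 * old + 0.5 * new (64 >> 7)."""
    if ema_old:
        return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >> EMA_ALPHA_SHIFT)
    return value

def ema_heavier(value, ema_old):
    """The variant I tried: output = 0.875 * old + 0.125 * new (7 >> 3)."""
    if ema_old:
        return ((ema_old * 7) + value) >> 3
    return value

# Step response: the measured residency drops from 1000 to 0.
old_p = old_h = 1000
for step in range(4):
    old_p = ema_patch(0, old_p)
    old_h = ema_heavier(0, old_h)
    print(step, old_p, old_h)
```

After four zero samples the patch's filter has decayed 1000 -> 62 while the heavier-history variant is still at 585, which is the "overly new value weighted" difference in concrete numbers.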
RE: NO_HZ_IDLE causes consistently low cpu "iowait" time (and higher cpu "idle" time)
On 2019.07.01 08:34 Alan Jenkins wrote:
> Hi

Hi,

> I tried running a simple test:
>
> dd if=testfile iflag=direct bs=1M of=/dev/null
>
> With my default settings, `vmstat 10` shows something like 85% idle time
> to 15% iowait time. I have 4 CPUs, so this is much less than one CPU
> worth of iowait time.
>
> If I boot with "nohz=off", I see idle time fall to 75% or below, and
> iowait rise to about 25%, equivalent to one CPU. That is what I had
> originally expected.
>
> (I can also see my expected numbers, if I disable *all* C-states and
> force polling using `pm_qos_resume_latency_us` in sysfs).
>
> The numbers above are from a kernel somewhere around v5.2-rc5. I saw
> the "wrong" results on some previous kernels as well. I just now
> realized the link to NO_HZ_IDLE.[1]
>
> [1] https://unix.stackexchange.com/questions/517757/my-basic-assumption-about-system-iowait-does-not-hold/527836#527836
>
> I did not find any information about this high level of inaccuracy. Can
> anyone explain, is this behaviour expected?

I'm not commenting on expected behaviour or not, just that it is inconsistent.

> I found several patches that mentioned "iowait" and NO_HZ_IDLE. But if
> they described this problem, it was not clear to me.
>
> I thought this might also be affecting the "IO pressure" values from the
> new "pressure stall information"... but I am too confused already, so I
> am only asking about iowait at the moment :-).

Using your workload, I confirm inconsistent behaviour for /proc/stat (which vmstat uses) between kernels 4.15, 4.16, and 4.17:

4.15 does what you expect, no matter idle states enabled or disabled.
4.16 doesn't do what you expect regardless (although a little erratic).
4.17 and later do what you expect with only idle state 0 enabled, and don't otherwise.
Actual test data, vmstat (/proc/stat) (8 CPUs, 12.5% = 1 CPU):

Kernel  idle/iowait %  Idle states >= 1
4.15    88/12          enabled
4.15    88/12          disabled
4.16    99/1           enabled
4.16    99/1           disabled
4.17    98/2           enabled
4.17    88/12          disabled

Note 1: I never booted with "nohz=off" because the tick never turns off for idle state 0, which is good enough for testing.

Note 2: Myself, I don't use /proc/stat for idle time statistics. I use:
/sys/devices/system/cpu/cpu*/cpuidle/state*/time
And they seem to always be consistent at the higher idle percentage number.

Unless someone has some insight, the next step is kernel bisection, once between kernel 4.15 and 4.16, then again between 4.16 and 4.17. The second bisection might go faster with knowledge gained from the first.

Alan: Can you do kernel bisection? I can only do it starting maybe Friday.

... Doug
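For reference, the idle/iowait percentages in the table above are derived the way vmstat derives them: take two snapshots of the "cpu" line in /proc/stat and divide the per-field deltas by the total delta. The sketch below shows that arithmetic; the two sample lines are invented for illustration, not captured from the test machine:

```python
# Fields in a /proc/stat "cpu" line (units of USER_HZ ticks):
# user nice system idle iowait irq softirq steal guest guest_nice
def parse_cpu_line(line):
    return [int(x) for x in line.split()[1:]]

# Hypothetical samples taken ~10 s apart on an idle-ish machine:
before = parse_cpu_line("cpu 1000 0 500 88000 12000 0 100 0 0 0")
after  = parse_cpu_line("cpu 1100 0 560 96800 13200 0 110 0 0 0")

delta = [a - b for a, b in zip(after, before)]
total = sum(delta)
idle_pct   = 100.0 * delta[3] / total   # 4th field: idle
iowait_pct = 100.0 * delta[4] / total   # 5th field: iowait
print(f"idle {idle_pct:.1f}%  iowait {iowait_pct:.1f}%")
```

The inconsistency discussed in this thread is in how the kernel attributes ticks to the idle versus iowait fields while the tick is stopped, not in this userspace arithmetic.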
RE: 5.2-rc2: low framerate in flightgear, cpu not running at full speed, thermal related?
On 2019.06.13 01:53 Rafael J. Wysocki wrote:
> I personally doubt that any thermal throttling is involved here.

In earlier e-mails on this thread, Pavel showed his core and package temperatures as 97 and 98 degrees. If thermal throttling is not involved, it should be. The description of the observed CPU frequencies also matched my experiences with thermal throttling for the intel-pstate/powersave example. Myself, I can not determine if throttling is involved for the acpi-cpufreq/ondemand case just from the clock frequencies, because, at least on my system, it uses the kidle_inject method instead of the pstate method.

I continued doing experiments, enabling thermald (normally disabled on my system) and forcing thermal throttling on my test server. My system never gets hot enough, so I used a low trip point of 55 degrees. The intel_pstate/powersave and intel_cpufreq/ondemand combinations, which both used the pstate method, outperformed acpi-cpufreq/ondemand by 30% in a constant thermal throttling mode. This seems the opposite of Pavel's results, if indeed his system is thermal throttling. (I can write these experiments up in more detail if desired.)

On 2019.06.13 Pavel Machek wrote:
> But it seems that your assumptions are incorrect for my workload.

Fair enough. I don't have the appropriate hardware for this, and am trying to guess at a similar workflow for testing. Perhaps I am just muddying things here rather than helping.

> flightgear is single-threaded, and in my configuration saturates the
> CPU, because it would like to achieve higher framerate than my system
> is capable of.

Are you sure? Use turbostat and observe.
Example 1: intel_cpufreq/ondemand, kernel 5.2-rc3, two 100% loads:

doug@s15:~$ sudo turbostat --quiet --hide IRQ,Avg_MHz,SMI,\
> GFXMHz,TSC_MHz,CorWatt,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,\
> Pkg%pc6,POLL,C1,C1E,C3,C6,C1%,C1E%,C3%,C6% \
> --interval 5
Core CPU  Busy%  Bzy_MHz POLL%  CPU%c1 CPU%c3 CPU%c6 CPU%c7 PkgTmp PkgWatt GFXWatt
-    -    25.24  2780    0.00   25.70  0.01   49.05  0.00   56     23.82   0.12
0    0     1.62  2781    0.00    1.90  0.02   96.46  0.00   56     23.82   0.12
0    4     0.05  2833    0.00    3.47
1    1   100.00  2780    0.00    0.00  0.00   0.00   0.00
1    5     0.03  2893    0.00   99.97
2    2     0.03  2906    0.00   99.97  0.00   0.00   0.00
2    6   100.00  2780    0.00    0.00
3    3     0.07  2797    0.00    0.18  0.01   99.74  0.00
3    7     0.10  2834    0.00    0.14

Example 2: acpi_cpufreq/ondemand, kernel 5.2-rc3, two 100% loads:

doug@s15:~$ sudo turbostat --quiet --hide IRQ,Avg_MHz,SMI,\
> GFXMHz,TSC_MHz,CorWatt,CoreTmp,GFX%rc6,Pkg%pc2,Pkg%pc3,\
> Pkg%pc6,POLL,C1,C1E,C3,C6,C1%,C1E%,C3%,C6% \
> --interval 5
Core CPU  Busy%  Bzy_MHz POLL%  CPU%c1 CPU%c3 CPU%c6 CPU%c7 PkgTmp PkgWatt GFXWatt
-    -    18.06  3069    0.00   18.45  0.01   63.47  0.00   55     21.50   0.12
0    0    15.55  3049    0.00    1.31  0.02   83.12  0.00   55     21.50   0.12
0    4     0.64  2651    0.00   16.22
1    1    62.43  3075    0.00    8.15  0.00   29.43  0.00
1    5     7.71  3068    0.00   62.81
2    2    50.56  3091    0.00    0.78  0.00   48.66  0.00
2    6     0.44  2346    0.00   50.89
3    3     2.30  2901    0.00    5.00  0.01   92.69  0.00
3    7     4.79  3016    0.00    2.49

With the additional C6 % coming from the kidle_inj tasks.

Observation 1: The loaded CPUs migrate much more often in this scenario.
Observation 2: While the package watts are different by over 2 watts, the long term (done over many hours) average was the same for all methods.

... Doug
RE: 5.2-rc2: low framerate in flightgear, cpu not running at full speed, thermal related?
On 2019.06.12 14:25 Rafael J. Wysocki wrote:
> On Wed, Jun 12, 2019 at 4:45 AM Doug Smythies wrote:
>>
>> So, currently there seems to be 3 issues in this thread
>> (and I am guessing a little, without definitive data):
>>
>> 1.) On your system Kernel 5.4-rc2 (or 4) defaults to the intel_pstate CPU frequency
>> scaling driver and the powersave governor, but kernel 4.6 defaults to the
>> acpi-cpufreq CPU frequency scaling driver and the ondemand governor.
>
> Which means that intel_pstate works in the active mode by default and
> so it uses its internal governor.

Not sure what you mean by "internal governor"? If you meant HWP (Hardware P-state), Pavel's processor doesn't have it. If you meant the active powersave governor code within the driver, then agreed.

> That governor is more performance-oriented than ondemand and it very
> well may cause more power to be allocated for the processor - at the
> expense of the GPU.

O.K. I mainly use servers and so have no experience with possible GPU versus CPU tradeoffs. However, I did re-do my tests measuring energy instead of CPU frequency, and found very little difference between acpi-cpufreq/ondemand versus intel_pstate/powersave as a function of single threaded load. Actually, I did the test twice, once at a 20 hertz work/sleep frequency and also at a 67 hertz work/sleep frequency. (Of course, Pavel's processor might well have a different curve, but it is a similar vintage to mine: i5-2520M versus i7-2600K.)

The worst difference was approximately 1.1 extra processor package watts (an extra 5.5%) in the 80% to 85% single threaded load range at the 67 hertz work/sleep frequency, for the intel-pstate/powersave driver/governor.

What am I saying? For a fixed amount of work to do per work/sleep cycle (i.e. maybe per video frame related type work), while the CPU frequency versus load curves might differ, the resulting processor energy curve differs much less. (i.e.
the extra power for higher CPU frequency is for less time, because it gets the job done faster.)

So, myself, I don't yet understand why only the one method would have hit thermal throttling, but not the other (if indeed it doesn't). Other differences between kernel 4.6 and 5.2-rc? might explain it. I did all my tests on kernel 5.2-rc3, except that one example from kernel 4.4 in my earlier reply, so that there were no variables other than CPU scaling driver and governor changes.

> The lower-than-expected frame rate may result from that, in principle.
> One way to mitigate that might be to use intel_pstate in the passive
> mode (pass intel_pstate=passive to the kernel in the command line)
> along with either ondemand or schedutil as the governor.

The CPU frequency versus load curves for those two governors are very similar for both the acpi_cpufreq and intel_cpufreq (which is the intel_pstate driver in passive mode) drivers.

Just for information: the CPU frequency versus single threaded load curves for the conservative governor are quite different between the two drivers. (Tests done in February; perhaps I should re-do them and also look at energy at the same time, or instead of CPU frequency.)

... Doug
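The "extra power for less time" point above can be put into numbers with a toy model: a fixed amount of work per work/sleep period, run either slower at lower package power or faster (then idle) at higher power. All the figures below are invented for illustration, not measurements:

```python
# Toy model: fixed work per period, either slow-and-steady or
# race-to-idle.  Power numbers are made up for the sketch.
WORK_CYCLES = 1.0e9   # CPU cycles of work per work/sleep period
PERIOD_S    = 0.5     # seconds per work/sleep period

def package_energy(freq_hz, busy_watts, idle_watts=4.0):
    busy_s = WORK_CYCLES / freq_hz
    idle_s = PERIOD_S - busy_s
    return busy_s * busy_watts + idle_s * idle_watts

slow = package_energy(2.5e9, 20.0)  # lower freq, lower power, busy longer
fast = package_energy(3.5e9, 30.0)  # higher freq, higher power, idles sooner
print(f"slow {slow:.2f} J/period  fast {fast:.2f} J/period")
```

With these made-up numbers the busy power differs by 50% (20 W versus 30 W) but the energy per period differs by only about 12% (8.4 J versus 9.4 J), which is the shape of the effect observed in the tests: the frequency-versus-load curves differ more than the energy curves do.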
RE: 5.2-rc2: low framerate in flightgear, cpu not running at full speed, thermal related?
Hi,

So, currently there seems to be 3 issues in this thread (and I am guessing a little, without definitive data):

1.) On your system Kernel 5.4-rc2 (or 4) defaults to the intel_pstate CPU frequency scaling driver and the powersave governor, but kernel 4.6 defaults to the acpi-cpufreq CPU frequency scaling driver and the ondemand governor. Suggest to check the related kernel configuration parameters:

$ grep -A 10 -i "CPU frequency scaling drivers" .config
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
CONFIG_X86_PCC_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_AMD_FREQ_SENSITIVITY=m
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_P4_CLOCKMOD=m

2.) With kernel 5.2-rc2 there is thermal throttling (thermald?), but for kernel 4.6 it either doesn't hit the throttling threshold or is not enabled. I don't have input here. However, note that the CPU frequency versus load graphs are quite different between acpi-cpufreq/ondemand and intel-pstate/powersave, with higher CPU frequencies at lower loads for the acpi-cpufreq driver and higher CPU frequencies at higher loads for the intel_pstate driver. The crossover point is about 35% load. So, the two driver/governor combinations might well end up at different operating points in terms of CPU frequencies. I do not have energy data for the CPU frequency versus load tests, which is really what matters.

3.) The older kernel is still using the older acpi-cpufreq stuff, where some CPU frequency registers show what was asked for and not what the system is actually giving. See below.

On 2019.06.09 04:18 Pavel Machek wrote:
> When I start flightgear, I get framerates around 20 fps and cpu at
> 3GHz:
>
> pavel@duo:~/bt$ cat /proc/cpuinfo | grep MHz
> cpu MHz : 3027.471
> cpu MHz : 2981.863
> cpu MHz : 2958.352
> cpu MHz : 2864.001
> pavel@duo:~/bt$
>
> (Ok, fgfs is really only running at single core, so why do both cores
> run at 3GHz?)
There is only 1 PLL (Phase Locked Loop) master clock generator in your processor. Basically, the CPU with the highest frequency demand wins. What is not shown in your data is that the not-busy core is also probably in a deep idle state for most of the time, consuming little or no energy.

Below is an example from my processor (i7-2600K) using turbostat (note C6 is my deepest idle state, and only it is shown for simplicity):

Core CPU  Busy%  Bzy_MHz CPU%c6 PkgTmp PkgWatt
-    -    12.52  3799    74.88  54     23.87
0    0     0.05  3664    99.89  54     23.87
0    4     0.00  3695
1    1     0.07  3676    99.87
1    5     0.00  3667
2    2     0.09  3669    99.78
2    6     0.00  3666
3    3     0.04  3786     0.00
3    7    99.92  3800

Observe the busy core spending no time in C6, but the other 3 spend most of the time there. However, for the little time that they are active, CPU 7 is the one demanding the high CPU frequency.

Anticipated question: "Then why aren't all the CPU frequencies exactly the same?" My processor's max turbo frequency is a function of how many cores are active, and so actually varies under peak demand. Excerpt from turbostat, not in quiet mode:

35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores

> The CPU is Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz . I guess it means
> it should be able to sustain both cores running at 2.5GHz?

No, that is more a function of your hardware's ability (or lack of) to move the waste heat away.

> Situation is very different with v4.6 distro based kernel.
>
> CPU MHz is only getting values round to 100MHz. It does not go above
> 2.5GHz, but it does not go below 2.5GHz under the load, either.

Yes, it is going above 2.5 GHz; see below.

> v4.6:
> cpu MHz : 2501.000
> cpu MHz : 2501.000
> cpu MHz : 2501.000
> cpu MHz : 2501.000

That seems to be the old acpi-cpufreq information, which is telling you what it's asking for and not what it actually is.
And the "01" means it is asking for turbo range frequencies. Example from my system, while running the above CPU 7 busy test: Conditions: Old kernel (4.4), acpi-cpufreq driver, ondemand gov. doug@s15:~$ grep MHz /proc/cpuinfo cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 1600.000 cpu MHz : 3401.000 CPUs 0-6 have very little load and are asking for the minimum available frequency. CPU 7 is asking for turbo. The turbostat data tells a different story. Conditions: New kernel (5.2-rc3), acpi-cpufreq driver, ondemand gov. doug@s15:~$ grep MHz /proc/cpuinfo cpu MHz : 3611.623 cpu MHz :
RE: [PATCH v2] cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
On 2019.02.07 03:51 Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > The current iowait boosting mechanism in intel_pstate_update_util() > is quite aggressive, as it goes to the maximum P-state right away, > and may cause excessive amounts of energy to be used, which is not > desirable and arguably isn't necessary too. > > Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost > more energy efficient") that reworked the analogous iowait boost > mechanism in the schedutil governor and make the iowait boosting > in intel_pstate_update_util() work along the same lines. > > Signed-off-by: Rafael J. Wysocki > --- > > -> v2: > * Follow the Doug's suggestion and drop the immediate jump to > max P-state if boost is max. The code is simpler this way and > the perf impact should not be noticeable on average. Hi Rafael, Something has broken on my incoming e-mail sorting stuff, and I missed this one (and some others). This V2 is not actually what I was proposing. I was O.K. with the immediate jump, but I didn't want the set_pstate step by-passed if it was already at max because that would also by-pass the trace sample, if it was enabled. Anyway, this V2 seems O.K. to me. I tested it compared to V1 and, as you mentioned, wasn't able to detect any energy consumption or performance differences. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
On 2019.02.05 04:04 Rafael J. Wysocki wrote: > On Friday, February 1, 2019 5:54:37 PM CET Doug Smythies wrote: >> On 2019.01.30 16:05 Rafael J. Wysocki wrote: >> >>> From: Rafael J. Wysocki >>> >>> The current iowait boosting mechanism in intel_pstate_update_util() >>> is quite aggressive, as it goes to the maximum P-state right away, >>> and may cause excessive amounts of energy to be used, which is not >>> desirable and arguably isn't necessary too. >>> >>> Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost >>> more energy efficient") that reworked the analogous iowait boost >>> mechanism in the schedutil governor and make the iowait boosting >>> in intel_pstate_update_util() work along the same lines. >>> >>> Signed-off-by: Rafael J. Wysocki >>> --- >>> drivers/cpufreq/intel_pstate.c | 46 >>> + >>> 1 file changed, 29 insertions(+), 17 deletions(-) >>> >>> Index: linux-pm/drivers/cpufreq/intel_pstate.c >>> === >>> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c >>> +++ linux-pm/drivers/cpufreq/intel_pstate.c >>> @@ -50,6 +50,8 @@ >>> #define int_tofp(X) ((int64_t)(X) << FRAC_BITS) >>> #define fp_toint(X) ((X) >> FRAC_BITS) >>> >>> +#define ONE_EIGHTH_FP ((int64_t)1 << (FRAC_BITS - 3)) >>> + >>> #define EXT_BITS 6 >>> #define EXT_FRAC_BITS (EXT_BITS + FRAC_BITS) >>> #define fp_ext_toint(X) ((X) >> EXT_FRAC_BITS) >>> @@ -1678,17 +1680,14 @@ static inline int32_t get_avg_pstate(str >>> static inline int32_t get_target_pstate(struct cpudata *cpu) >>> { >>> struct sample *sample = >sample; >>> - int32_t busy_frac, boost; >>> + int32_t busy_frac; >>> int target, avg_pstate; >>> >>> busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift, >>>sample->tsc); >>> >>> - boost = cpu->iowait_boost; >>> - cpu->iowait_boost >>= 1; >>> - >>> - if (busy_frac < boost) >>> - busy_frac = boost; >>> + if (busy_frac < cpu->iowait_boost) >>> + busy_frac = cpu->iowait_boost; >>> >>> sample->busy_scaled = busy_frac * 100; >>> >>> @@ -1767,22 +1766,35 @@ static void 
intel_pstate_update_util(str >>> if (smp_processor_id() != cpu->cpu) >>> return; >>> >>> + delta_ns = time - cpu->last_update; >>> if (flags & SCHED_CPUFREQ_IOWAIT) { >>> - cpu->iowait_boost = int_tofp(1); >>> - cpu->last_update = time; >>> - /* >>> -* The last time the busy was 100% so P-state was max anyway >>> -* so avoid overhead of computation. >>> -*/ >>> - if (fp_toint(cpu->sample.busy_scaled) == 100) >>> - return; >>> - >>> - goto set_pstate; >>> + /* Start over if the CPU may have been idle. */ >>> + if (delta_ns > TICK_NSEC) { >>> + cpu->iowait_boost = ONE_EIGHTH_FP; >>> + } else if (cpu->iowait_boost) { >>> + cpu->iowait_boost <<= 1; >>> + if (cpu->iowait_boost >= int_tofp(1)) { >>> + cpu->iowait_boost = int_tofp(1); >>> + cpu->last_update = time; >>> + /* >>> +* The last time the busy was 100% so P-state >>> +* was max anyway, so avoid the overhead of >>> +* computation. >>> +*/ >>> + if (fp_toint(cpu->sample.busy_scaled) == 100) >>> + return; >> >> Hi Rafael, >> >> By exiting here, the trace, if enabled, is also bypassed. >> We want the trace sample. > > Fair enough, but the return is there regardless of this patch. > > Maybe it should be fixed separately? O.K. >> Also, there is a generic: >> "If the target ptstate is the same as before, then don't set it" >> later on. >> Suggest to delete this test and exit condition. (I see that this early >> exit was done before also.) > > Well, exactly. > > It is not unreasonable to boost the frequency right away for an IO-waiter > without waiting for the next sample time IMO. I agree, but am just saying that it should include a trace sample, otherwise it is difficult to understand what happened. By the way, I forgot to mention before, I tried the patch and it does prevent CPU frequency spikes to maximum every few seconds in a very idle system. ... Doug
RE: [PATCH] cpufreq: intel_pstate: Rework iowait boosting to be less aggressive
On 2019.01.30 16:05 Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > The current iowait boosting mechanism in intel_pstate_update_util() > is quite aggressive, as it goes to the maximum P-state right away, > and may cause excessive amounts of energy to be used, which is not > desirable and arguably isn't necessary too. > > Follow commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost > more energy efficient") that reworked the analogous iowait boost > mechanism in the schedutil governor and make the iowait boosting > in intel_pstate_update_util() work along the same lines. > > Signed-off-by: Rafael J. Wysocki > --- > drivers/cpufreq/intel_pstate.c | 46 > + > 1 file changed, 29 insertions(+), 17 deletions(-) > > Index: linux-pm/drivers/cpufreq/intel_pstate.c > === > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > +++ linux-pm/drivers/cpufreq/intel_pstate.c > @@ -50,6 +50,8 @@ > #define int_tofp(X) ((int64_t)(X) << FRAC_BITS) > #define fp_toint(X) ((X) >> FRAC_BITS) > > +#define ONE_EIGHTH_FP ((int64_t)1 << (FRAC_BITS - 3)) > + > #define EXT_BITS 6 > #define EXT_FRAC_BITS (EXT_BITS + FRAC_BITS) > #define fp_ext_toint(X) ((X) >> EXT_FRAC_BITS) > @@ -1678,17 +1680,14 @@ static inline int32_t get_avg_pstate(str > static inline int32_t get_target_pstate(struct cpudata *cpu) > { > struct sample *sample = >sample; > - int32_t busy_frac, boost; > + int32_t busy_frac; > int target, avg_pstate; > > busy_frac = div_fp(sample->mperf << cpu->aperf_mperf_shift, > sample->tsc); > > - boost = cpu->iowait_boost; > - cpu->iowait_boost >>= 1; > - > - if (busy_frac < boost) > - busy_frac = boost; > + if (busy_frac < cpu->iowait_boost) > + busy_frac = cpu->iowait_boost; > > sample->busy_scaled = busy_frac * 100; > > @@ -1767,22 +1766,35 @@ static void intel_pstate_update_util(str > if (smp_processor_id() != cpu->cpu) > return; > > + delta_ns = time - cpu->last_update; > if (flags & SCHED_CPUFREQ_IOWAIT) { > - cpu->iowait_boost = int_tofp(1); > - cpu->last_update = time; > 
> -		/*
> -		 * The last time the busy was 100% so P-state was max anyway
> -		 * so avoid overhead of computation.
> -		 */
> -		if (fp_toint(cpu->sample.busy_scaled) == 100)
> -			return;
> -
> -		goto set_pstate;
> +		/* Start over if the CPU may have been idle. */
> +		if (delta_ns > TICK_NSEC) {
> +			cpu->iowait_boost = ONE_EIGHTH_FP;
> +		} else if (cpu->iowait_boost) {
> +			cpu->iowait_boost <<= 1;
> +			if (cpu->iowait_boost >= int_tofp(1)) {
> +				cpu->iowait_boost = int_tofp(1);
> +				cpu->last_update = time;
> +				/*
> +				 * The last time the busy was 100% so P-state
> +				 * was max anyway, so avoid the overhead of
> +				 * computation.
> +				 */
> +				if (fp_toint(cpu->sample.busy_scaled) == 100)
> +					return;

Hi Rafael,

By exiting here, the trace, if enabled, is also bypassed. We want the trace sample.

Also, there is a generic "if the target pstate is the same as before, then don't set it" later on. Suggest to delete this test and exit condition. (I see that this early exit was done before also.)

> +
> +				goto set_pstate;
> +			}
> +		} else {
> +			cpu->iowait_boost = ONE_EIGHTH_FP;
> +		}
>  	} else if (cpu->iowait_boost) {
>  		/* Clear iowait_boost if the CPU may have been idle. */
> -		delta_ns = time - cpu->last_update;
>  		if (delta_ns > TICK_NSEC)
>  			cpu->iowait_boost = 0;
> +		else
> +			cpu->iowait_boost >>= 1;
>  	}
>  	cpu->last_update = time;
>  	delta_ns = time - cpu->sample.time;
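The boost ladder this patch introduces can be modeled in a few lines of userspace Python. FRAC_BITS is 8 in intel_pstate.c, so int_tofp(1) == 256 and ONE_EIGHTH_FP == 32; the TICK_NSEC value here assumes an HZ=250 kernel, and the whole thing is a sketch of the patch's logic, not the kernel code itself:

```python
# Userspace model of the reworked iowait boost: start at 1/8, double on
# each IO-wait wakeup (capped at 1), halve on non-IO-wait updates, and
# reset after a tick's worth of possible idle time.
FRAC_BITS = 8
ONE = 1 << FRAC_BITS                  # int_tofp(1)
ONE_EIGHTH_FP = 1 << (FRAC_BITS - 3)  # 32
TICK_NSEC = 4_000_000                 # 4 ms tick, assuming HZ=250

def update_boost(boost, delta_ns, iowait_flag):
    if iowait_flag:
        if delta_ns > TICK_NSEC:      # CPU may have been idle: start over
            return ONE_EIGHTH_FP
        if boost:
            return min(boost << 1, ONE)  # double, capped at full boost
        return ONE_EIGHTH_FP
    if boost:
        if delta_ns > TICK_NSEC:      # clear after a possible idle period
            return 0
        return boost >> 1             # otherwise decay by halving
    return 0

# Four back-to-back IO-wait wakeups climb 1/8 -> 1/4 -> 1/2 -> 1:
boost = 0
for _ in range(4):
    boost = update_boost(boost, 100_000, iowait_flag=True)
    print(boost / ONE)
```

This is the "less aggressive" part of the rework: a single IO-wait wakeup now only buys a 1/8 busy-fraction floor instead of an immediate jump to the maximum P-state.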
[PATCH] cpuidle: poll_state: Fix default time limit.
The default time is declared in units of microseconds, but is used as nanoseconds, resulting in significant accounting errors for idle state 0 time when all idle states deeper than 0 are disabled. Under these unusual conditions, we don't really care about the poll time limit anyhow.

Signed-off-by: Doug Smythies
---
 drivers/cpuidle/poll_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index b17d153..23a1b27 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -21,7 +21,7 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev,
 	local_irq_enable();
 	if (!current_set_polling_and_test()) {
 		unsigned int loop_count = 0;
-		u64 limit = TICK_USEC;
+		u64 limit = TICK_NSEC;
 		int i;
 
 		for (i = 1; i < drv->state_count; i++) {
-- 
2.7.4
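The magnitude of the unit mix-up fixed above is easy to see with a small sketch. HZ=250 is an assumed example configuration, and TICK_NSEC is approximated as a plain division (the kernel's definition rounds, which does not change the point):

```python
# TICK_USEC misread as nanoseconds: the poll limit ends up ~1000x too
# short.  HZ=250 is an assumed example config.
HZ = 250
USEC_PER_SEC = 1_000_000
NSEC_PER_SEC = 1_000_000_000

TICK_USEC = USEC_PER_SEC // HZ   # one tick in microseconds: 4000
TICK_NSEC = NSEC_PER_SEC // HZ   # one tick in nanoseconds: 4000000

# Interpreted as a nanosecond limit, the buggy TICK_USEC value cut
# polling off after 4 us instead of the intended one-tick 4 ms.
print(TICK_NSEC // TICK_USEC)
```

So with all deeper states disabled, poll_idle() bailed out of polling three orders of magnitude early, which is where the idle state 0 accounting errors came from.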
RE: [PATCH] cpuidle: New timer events oriented governor for tickless systems
Hi Rafael,

Just typos, not already pointed out by Quentin:

On 2018.12.18 03:09 Rafael J. Wysocki wrote:

> The venerable menu governor does some thigns that are quite

s/thigns/things

> time frame in which no timer wakeups are possible (becuase it knows

s/becuase/because

> +obtain the the *sleep length*, which is the time until the closest timer event

s/the the/the

> +increased when when the given idle state "matches" the sleep length only and the

s/when when/when

... Doug
[PATCH] tools/power/x86/intel_pstate_tracer: Fix non root execution for post processing a trace file.
This script is supposed to be allowed to run with regular user
privileges if a previously captured trace is being post processed.

Commit fbe313884d7ddd73ce457473cbdf3763f5b1d3da
(tools/power/x86/intel_pstate_tracer: Free the trace buffer memory)
introduced a bug that breaks that option.

Commit 35459105deb26430653a7299a86bc66fb4dd5773
(tools/power/x86/intel_pstate_tracer: Add optional setting of trace
buffer memory allocation) moved the code but kept the bug.

This patch fixes the issue.

Signed-off-by: Doug Smythies
---
 tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py b/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py
index 84e2b64..2fa3c57 100755
--- a/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py
+++ b/tools/power/x86/intel_pstate_tracer/intel_pstate_tracer.py
@@ -585,9 +585,9 @@ current_max_cpu = 0
 
 read_trace_data(filename)
 
-clear_trace_file()
-# Free the memory
 if interval:
+    clear_trace_file()
+    # Free the memory
     free_trace_buffer()
 
 if graph_data_present == False:
-- 
2.7.4
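The shape of the fix is a simple guard pattern: only touch the root-owned trace infrastructure when this invocation captured a trace itself ("interval" set), so pure post processing works without root. A minimal sketch (the callable names are illustrative stand-ins, not the tracer's actual helpers):

```python
def finish(interval, clear_trace_file, free_trace_buffer):
    # Tear down tracing only if this run captured the trace itself.
    if interval:
        clear_trace_file()
        free_trace_buffer()
    # else: post-processing a previously saved file; nothing to clean up.

calls = []
# Post-processing mode: no interval given, no root-only operations attempted.
finish(None, lambda: calls.append("clear"), lambda: calls.append("free"))
print(calls)  # []
# Capture mode: interval given, cleanup runs.
finish(10, lambda: calls.append("clear"), lambda: calls.append("free"))
print(calls)  # ['clear', 'free']
```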
RE: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems
On 2018.12.11 03:50 Rafael J. Wysocki wrote:

...[snip]...

> With this version of the TEO governor I'm observing slight, but consistent
> performance improvements in some different benchmarks on a few different
> systems with respect to menu

Same here.

> and the corresponding power draw differences
> appear to be in the noise. The idle power doesn't seem to change either.

Same here.

> Also the "above" and "below" metrics (introduced by a separate patch under
> discussion) indicate changes in the "good" direction.

My graphs now include the "above" and "below" metrics. In particular see
the Idle State 1 "above" (was too deep) graphs in the links below.
However, performance is up and power about the same, so O.K.

> Overall, I like this one, so I may consider dropping the RFC/RFT tag from
> the next submission. :-)
>
> v7 -> v8:
>  * Apply the selection rules to the idle deepest state as well as to
>    the shallower ones (the deepest idle state was treated differently
>    before by mistake).
>  * Subtract 1/2 of the exit latency from the measured idle duration
>    in teo_update() (instead of subtracting the entire exit latency).
>    This makes the idle state selection be slightly more performance-
>    oriented.

I cherry picked a couple of the mmtests that Giovanni was doing.

Test kernels:

"stock" : kernel 4.20-rc5 + a couple of rjw patches. Specifically:
  2a110ed cpuidle: poll_state: Disregard disable idle states
  8f09875 cpuidle: Add 'above' and 'below' idle state metrics
  d6851a5 Documentation: admin-guide: PM: Add cpuidle document
  2595646 Linux 4.20-rc5
"teov6" : above + teov6 patch
"teov7" : above + teov7 patch
"teov8" : above + teov8 patch

1.)
mmtests - netperf-unbound test (UDP):

                           4.20-rc5              4.20-rc5              4.20-rc5              4.20-rc5
                              stock                  teo6                  teo7                  teo8
Hmean     send-64       129.64 (  0.00%)      132.45 *  2.17%*      130.55 *  0.71%*      132.87 *  2.49%*
Hmean     send-128      259.53 (  0.00%)      264.90 *  2.07%*      261.61 *  0.80%*      264.94 *  2.09%*
Hmean     send-256      515.24 (  0.00%)      525.41 *  1.97%*      517.41 *  0.42%*      524.88 *  1.87%*
Hmean     send-1024    2041.01 (  0.00%)     2079.22 *  1.87%*     2045.03 *  0.20%*     2077.25 *  1.78%*
Hmean     send-2048    3980.04 (  0.00%)     4071.09 *  2.29%*     4041.15 *  1.54%*     4057.09 *  1.94%*
Hmean     send-3312    6321.90 (  0.00%)     6460.23 *  2.19%*     6395.71 *  1.17%*     6409.09 *  1.38%*
Hmean     send-4096    7695.18 (  0.00%)     7882.81 *  2.44%*     7813.72 *  1.54%*     7832.77 *  1.79%*
Hmean     send-8192   13920.53 (  0.00%)    14146.47 *  1.62%*    13986.72 *  0.48%*    14079.07 *  1.14%*
Hmean     send-16384  24714.99 (  0.00%)    25225.95 *  2.07%*    24896.10 *  0.73%*    25131.52 *  1.69%*
Hmean     recv-64       129.64 (  0.00%)      132.45 *  2.17%*      130.55 *  0.71%*      132.87 *  2.49%*
Hmean     recv-128      259.53 (  0.00%)      264.90 *  2.07%*      261.61 *  0.80%*      264.94 *  2.09%*
Hmean     recv-256      515.24 (  0.00%)      525.41 *  1.97%*      517.41 *  0.42%*      524.88 *  1.87%*
Hmean     recv-1024    2041.01 (  0.00%)     2079.22 *  1.87%*     2045.03 *  0.20%*     2077.25 *  1.78%*
Hmean     recv-2048    3980.04 (  0.00%)     4071.09 *  2.29%*     4041.15 *  1.54%*     4057.09 *  1.94%*
Hmean     recv-3312    6321.88 (  0.00%)     6460.23 *  2.19%*     6395.71 *  1.17%*     6409.09 *  1.38%*
Hmean     recv-4096    7695.15 (  0.00%)     7882.81 *  2.44%*     7813.72 *  1.54%*     7832.75 *  1.79%*
Hmean     recv-8192   13920.52 (  0.00%)    14146.43 *  1.62%*    13986.72 *  0.48%*    14079.07 *  1.14%*
Hmean     recv-16384  24714.99 (  0.00%)    25225.90 *  2.07%*    24896.07 *  0.73%*    25131.49 *  1.69%*

Graphs: http://www.smythies.com/~doug/linux/idle/teo8/net-pref-udp-unbound/index.html
(note: slow upload speed from my server)

2.)
mmtests - sockperf-udp-throughput test:

                    4.20-rc5            4.20-rc5            4.20-rc5            4.20-rc5
                       stock                teo6                teo7                teo8
Hmean     14       24.57 (  0.00%)     25.91 *  5.46%*     25.99 *  5.78%*     25.73 *  4.75%*
Hmean     100     175.37 (  0.00%)    185.09 *  5.54%*    185.89 *  6.00%*    184.48 *  5.19%*
Hmean     300     523.81 (  0.00%)    553.47 *  5.66%*    554.70 *  5.90%*    550.16 *  5.03%*
Hmean     500     870.08 (  0.00%)    918.88 *  5.61%*    924.33 *  6.24%*    914.53 *  5.11%*
Hmean     850    1449.44 (  0.00%)   1530.84 *  5.62%*   1535.40 *  5.93%*   1522.53 *  5.04%*

Graphs:
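The "Hmean" rows in the tables above are harmonic means across test iterations, reported with a percentage change relative to the baseline ("stock") column. A minimal sketch of that computation, using made-up per-iteration throughput samples (not numbers from the runs above):

```python
def hmean(xs):
    """Harmonic mean, as used for mmtests' Hmean rows."""
    return len(xs) / sum(1.0 / x for x in xs)

# Illustrative per-iteration throughputs for a baseline and a candidate:
baseline = hmean([129.0, 130.0, 130.0])
candidate = hmean([132.0, 133.0, 132.5])

# The tables report the candidate's percentage change vs. the baseline:
pct_change = 100.0 * (candidate - baseline) / baseline
print("%.2f%%" % pct_change)
```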
RE: intel_pstate: Lowest frequency not reached with Intel i7-6700
On 2018.12.13 04:42 Paul Menzel wrote:
> On 12/13/18 11:39, Rafael J. Wysocki wrote:
>> On Thu, Dec 13, 2018 at 10:54 AM Paul Menzel wrote:
>>> On 12/13/18 00:06, Doug Smythies wrote:
>>>> On 2018.12.12 13:40 Paul Menzel wrote:
>>>>
>>>>> Using *powersave* as P-state selection algorithm, on an idle system
>>>>
>>>> Define "idle system".
>>>> If your computer is running a GUI, or is even a server without a GUI
>>>> but with many services running, then "idle" really isn't.
>>>> Below is from my test server, with many services disabled, so
>>>> "idle" really is quite "idle":
>>>>
>>>> doug@s15:~/temp$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 15
>>>> Busy%   Bzy_MHz PkgTmp  PkgWatt
>>>> 0.01    1608    27      3.71
>>>> 0.01    1619    27      3.71
>>>> 0.01    1600    28      3.71
>>>> 0.01    1600    28      3.70
>>>>
>>>> Note that p state 16 (1600 MHz) is the minimum for my older i7-2600k
>>>> processor.
>>>
>>> The thing is, on an Intel Kaby Lake laptop with Ubuntu 18.10 and GNOME
>>> running, it goes down to the lowest listed frequency.
>
> Checking the numbers again, I was mistaken. The lowest possible frequency
> of the Intel Kaby Lake i7-7500U in that laptop is 400 MHz, and it is
> going down to 600 MHz. Busy% from turbostat is 0.3 to 0.4.
>
>> Kaby Lake has hardware-managed P-states (HWP) which is a different mechanism.
>
> Isn’t HWP also available for the 6th generation?
>
> $ dmesg | grep intel_pstate
> [    2.092456] intel_pstate: Intel P-state driver initializing
> [    2.094820] intel_pstate: HWP enabled

Yes, you have HWP. I do not have HWP. I don't think HWP or not is
particularly relevant to this discussion.

>>>>> Shouldn’t it go down until 800 MHz?
>>>>
>>>> We would need some actual busy information, turbostat is the
>>>> recommended tool, to know for sure.
>>>
>>> Here you go.
>>>
>>> ```
>>> tools/power/x86/turbostat> sudo ./turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 15
>>> Busy%   Bzy_MHz PkgTmp  PkgWatt
>>> 3.59    1167    31      1.68
>>> 3.21    903     31      1.34
>>> 3.21    906     31      1.34
>>> 3.27    901     31      1.35
>>> 8.23    2715    30      2.32  ← stopping GDM (systemctl stop gdm)
>>> 2.95    915     30      1.18
>>> 2.91    906     30      1.18
>>> 2.92    903     30      1.17
>>> 2.90    900     29      1.17
>>> 2.89    903     29      1.18
>>> 2.91    903     30      1.18
>>> 2.89    903     29      1.18
>>> 2.89    900     29      1.18
>>> 2.90    903     30      1.18
>>> 2.90    903     29      1.17
>>> 2.90    903     29      1.17
>>> 2.90    900     29      1.16
>>> 2.90    903     29      1.14
>>> 2.90    903     28      1.11
>>> 2.90    903     29      1.10
>>> 2.91    900     29      1.16
>>> 2.91    903     29      1.14
>>> 2.90    903     29      1.12
>>> 2.90    903     29      1.16
>>> 2.90    900     28      1.17
>>> 2.92    903     29      1.16
>>> 2.90    903     29      1.16
>>> 2.90    903     29      1.16
>>> ```
>>>
>>> 800 MHz should be enough to keep GDM running, shouldn’t it?
>>
>> Well, depending.
>>
>>> Otherwise only SSH was running.
>>
>> There obviously is something that causes it to stay at 900 MHz.

Agreed. Clearly some more stuff was using some CPU.

The resulting requested CPU frequency is also highly dependent on the
frequency of the load. Example:

doug@s15:~/csv-plot$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 60
Busy%   Bzy_MHz PkgTmp  PkgWatt
0.08    2567    27      3.77  <<< Note the high active CPU frequency with virtually no load.
0.01    1600    26      3.69
0.01    1600    27      3.70
0.83    1600    30      4.18
1.80    1600    30      4.59  <<< This load is at 435 Hertz work / sleep frequency
1.80    1600    32      4.59
0.74    3033    27      4.49  <<< This load is at 1 Hertz work / sleep frequency
0.76    3380    27      4.63
0.76    3384    28      4.62
0.38    3356    28      4.16
0.01    1615    28      3.71
0.01    1609    27      3.70

> It’s not obvious to me, but you have more experience. It’d expect
> to at least one core(?) to go down to 800 MHz and not all to stay
> at 900 MHz.

There is only PLL (Phase Locked Loop). The
RE: intel_pstate: Lowest frequency not reached with Intel i7-6700
On 2018.12.12 13:40 Paul Menzel wrote:

> Using *powersave* as P-state selection algorithm, on an idle system

Define "idle system".
If your computer is running a GUI, or is even a server without a GUI
but with many services running, then "idle" really isn't.
Below is from my test server, with many services disabled, so
"idle" really is quite "idle":

doug@s15:~/temp$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt --interval 15
Busy%   Bzy_MHz PkgTmp  PkgWatt
0.01    1608    27      3.71
0.01    1619    27      3.71
0.01    1600    28      3.71
0.01    1600    28      3.70

Note that p state 16 (1600 MHz) is the minimum for my older i7-2600k
processor.

> Shouldn’t it go down until 800 MHz?

We would need some actual busy information, turbostat is the
recommended tool, to know for sure.

... Doug
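Turbostat's `--Summary` output, as requested with `--show` above, is plain whitespace-separated columns, so it is easy to post-process. A small hedged sketch parsing captured text into per-interval records (the sample text is from the reply above; the parsing approach is illustrative, not part of turbostat itself):

```python
sample = """\
Busy%   Bzy_MHz PkgTmp  PkgWatt
0.01    1608    27      3.71
0.01    1619    27      3.71
0.01    1600    28      3.71
0.01    1600    28      3.70"""

# Split the header into column names, then zip each data line with them.
header, *lines = sample.splitlines()
names = header.split()
rows = [dict(zip(names, map(float, line.split()))) for line in lines]

print(rows[0]["Bzy_MHz"])  # 1608.0
```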
RE: [PATCH v2] cpuidle: Add 'above' and 'below' idle state metrics
On 2018.12.10 02:52 Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 10:36:40PM +0100, Rafael J. Wysocki wrote:
>> On Mon, Dec 10, 2018 at 1:21 PM Peter Zijlstra wrote:
>>> Would not a tracepoint be better?; then there is no overhead in the
>>> normal case where nobody gives a crap about these here numbers.
>>
>> There is an existing tracepoint that in principle could be used to
>> produce this information, but it is such a major PITA in practice that
>> nobody does that. Guess why. :-)
>
> Sounds like you need to ship a convenient script or something :-)

For the histogram plots of idle durations that I sometimes provide,
trace is used. It is more work to do it the trace way. Very often, when
the rate of idle state entries/exits is high, turning on trace
influences the system under test significantly. Also, even if I allocate
the majority of my memory to the trace buffer (i.e. 15 of my 16
gigabytes), I can only acquire a few minutes of trace data before
filling the buffer. Some of my tests run for hours, and these new
counters provide a way to acquire potentially useful (I don't have
enough experience with them yet to know how useful) information, while
having no influence on the system under test, because I only take
samples once per minute, or sometimes 4 times per minute.

>> Also, the "usage" and "time" counters are there in sysfs, so why not
>> these two?

I agree; how are these two counters any different?

In about a week or so, I'll have some test data comparing 4.20-rc5 with
teov6 and teov7, along with the idle data (graphs) that I usually
provide, and also these new counters.

... Doug
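The low-overhead sampling approach described above can be sketched as a small script that reads the per-state cpuidle counters from sysfs periodically instead of running trace. The sysfs paths follow the standard cpuidle layout; the pattern argument is there only so the function can be exercised outside a live system:

```python
import glob

def read_idle_counters(pattern="/sys/devices/system/cpu/cpu*/cpuidle/state*/usage"):
    """Return {path: count} for every matching counter file."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            counts[path] = int(f.read())
    return counts

# Sampling once per minute, as in the tests described above, would be:
#   while True:
#       snapshot = read_idle_counters()
#       ... log or diff the snapshot ...
#       time.sleep(60)
```

The same pattern works for the `time`, and (with the patch under discussion) `above`/`below` files; diffing two snapshots gives per-interval counts with no effect on the system under test.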
RE: [PATCH] cpuidle: Add 'high' and 'low' idle state metrics
On 2018.12.06 01:09 Rafael J. Wysocki wrote:
> On Thu, Dec 6, 2018 at 12:08 AM Doug Smythies wrote:
>> On 2018.12.03 04:32 Rafael J. Wysocki wrote:
>>
>>> Add two new metrics for CPU idle states, "high" and "low", to count
>>> the number of times the given state had been asked for (or entered
>>> from the kernel's perspective), but the observed idle duration turned
>>> out to be too high or too low for it (respectively).
>>
>> I wonder about the "high" "low" terminology here.
>
> I took these names, because they are concise and simple. I could use
> "below" and "above" respectively I guess. What about these?

I see you already sent a new patch with these names. Yes, I like them
better myself. I am going to try to add these counts to my next sets
of graphs.

... Doug
RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems
On 2018.12.08 02:23 Giovanni Gherdovich wrote:

> sorry for the late reply, this week I was traveling.

No problem. Thanks very much for your very detailed reply, which
obviously took considerable time to write. While I was making progress,
your instructions really fill in some gaps and mistakes I was making.
Eventually (probably several days) I'll report back my test results.

> Some specific remarks you raise:
>
> On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
>> ...
>> My issue is that I do not understand the output or how it
>> might correlate with your tables.
>>
>> I get, for example:
>>
>>    31  1  0.13s  0.68s  0.80s  1003894.302  1003779.613
>>    31  1  0.16s  0.64s  0.80s  1008900.053  1008215.336
>>    31  1  0.14s  0.66s  0.80s  1009630.439  1008990.265
>> ...
>>
>> But I don't know what that means, nor have I been able to find
>> a description anywhere.
>
> I don't recognize this output. I hope the illustration above can
> clarify how MMTests is used.

Due to incompetence on my part, the config file being run for my tests
was always just the default config file from my original
git clone https://github.com/gormanm/mmtests.git command. So regardless
of what I thought I was doing, I was running "pft" (Page Fault Test).

... Doug