Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-08-06 Thread Rafael J. Wysocki
On Thursday, August 6, 2020 7:54:47 AM CEST Doug Smythies wrote:
> On 2020.08.03 10:09 Rafael J. Wysocki wrote:
> > On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> > > On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies  
> > > > wrote:
> > > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies 
> > > > > >  wrote:
> > > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > > > >> >
> > > > > >> > From: Rafael J. Wysocki 
> > > > > >> ...
> > > > > >> > Since the passive mode hasn't worked with HWP at all, and it is
> > > > > >> > not going to be the default for HWP systems anyway, I don't see
> > > > > >> > any drawbacks related to making this change, so I would consider
> > > > > >> > this as 5.9 material unless there are any serious objections.
> > > > > >>
> > > > > >> Good point.
> > > > >
> > > > > Actually, for those users that default to passive mode upon boot,
> > > > > this would mean they would find themselves using this new mode.
> > > > > Also, it isn't obvious from the typical "what driver and what
> > > > > governor" inquiry.
> > > >
> > > > So the change in behavior is that after this patch
> > > > intel_pstate=passive doesn't imply no_hwp any more.
> > > >
> > > > That's a very minor difference though and I'm not aware of any adverse
> > > > effects it can cause on HWP systems anyway.
> > >
> > > My point was that it will now default to something for which
> > > testing has not been completed.
> > >
> > > > The "what governor" is straightforward in the passive mode: that's
> > > > whatever cpufreq governor has been selected.
> > >
> > > I think you might have missed my point.
> > > From the normal methods of inquiry one does not know
> > > if HWP is being used or not. Why? Because with
> > > or without HWP one gets the same answers under:
> > >
> > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> > 
> > Yes, but this is also the case in the active mode, isn't it?
> 
> Yes, fair enough.
> But we aren't changing what it means by default
> between kernel 5.8 and 5.9-rc1.

No, we aren't.

The only (expected) change is when booting with intel_pstate=passive and
without intel_pstate=no_hwp in the command line.

Which should be easy enough to address by adding intel_pstate=no_hwp to the
command line in 5.9-rc1 and later (to achieve the same behavior after a
fresh boot).
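
For reference, a minimal sketch of that workaround (the intel_pstate
parameters are the documented ones; the GRUB file path and update command
vary by distribution):

    # /etc/default/grub -- keep the 5.8 passive-mode behavior on 5.9-rc1+
    GRUB_CMDLINE_LINUX="intel_pstate=passive intel_pstate=no_hwp"

    $ sudo update-grub && sudo reboot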

Cheers!





RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-08-05 Thread Doug Smythies
On 2020.08.03 10:09 Rafael J. Wysocki wrote:
> On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> > On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies  wrote:
> > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies  
> > > > > wrote:
> > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > > >> >
> > > > >> > From: Rafael J. Wysocki 
> > > > >> ...
> > > > >> > Since the passive mode hasn't worked with HWP at all, and it is
> > > > >> > not going to be the default for HWP systems anyway, I don't see
> > > > >> > any drawbacks related to making this change, so I would consider
> > > > >> > this as 5.9 material unless there are any serious objections.
> > > > >>
> > > > >> Good point.
> > > >
> > > > Actually, for those users that default to passive mode upon boot,
> > > > this would mean they would find themselves using this new mode.
> > > > Also, it isn't obvious from the typical "what driver and what governor"
> > > > inquiry.
> > >
> > > So the change in behavior is that after this patch
> > > intel_pstate=passive doesn't imply no_hwp any more.
> > >
> > > That's a very minor difference though and I'm not aware of any adverse
> > > effects it can cause on HWP systems anyway.
> >
> > My point was that it will now default to something for which
> > testing has not been completed.
> >
> > > The "what governor" is straightforward in the passive mode: that's
> > > whatever cpufreq governor has been selected.
> >
> > I think you might have missed my point.
> > From the normal methods of inquiry one does not know
> > if HWP is being used or not. Why? Because with
> > or without HWP one gets the same answers under:
> >
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> 
> Yes, but this is also the case in the active mode, isn't it?

Yes, fair enough.
But we aren't changing what it means by default
between kernel 5.8 and 5.9-rc1.

... Doug




Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-08-03 Thread Rafael J. Wysocki
On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> Hi Rafael,
> 
> On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies  wrote:
> > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies  
> > > > wrote:
> > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > >> >
> > > >> > From: Rafael J. Wysocki 
> > > >> ...
> > > >> > Since the passive mode hasn't worked with HWP at all, and it is not
> > > >> > going to be the default for HWP systems anyway, I don't see any
> > > >> > drawbacks related to making this change, so I would consider this as
> > > >> > 5.9 material unless there are any serious objections.
> > > >>
> > > >> Good point.
> > >
> > > Actually, for those users that default to passive mode upon boot,
> > > this would mean they would find themselves using this new mode.
> > > Also, it isn't obvious from the typical "what driver and what governor"
> > > inquiry.
> > 
> > So the change in behavior is that after this patch
> > intel_pstate=passive doesn't imply no_hwp any more.
> > 
> > That's a very minor difference though and I'm not aware of any adverse
> > effects it can cause on HWP systems anyway.
> 
> My point was that it will now default to something for which
> testing has not been completed.
> 
> > The "what governor" is straightforward in the passive mode: that's
> > whatever cpufreq governor has been selected.
> 
> I think you might have missed my point.
> From the normal methods of inquiry one does not know
> if HWP is being used or not. Why? Because with
> or without HWP one gets the same answers under:
> 
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Yes, but this is also the case in the active mode, isn't it?

Thanks!






RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-08-02 Thread Doug Smythies
Hi Rafael,

On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies  wrote:
> > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies  
> > > wrote:
> > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > >> >
> > >> > From: Rafael J. Wysocki 
> > >> ...
> > >> > Since the passive mode hasn't worked with HWP at all, and it is not
> > >> > going to be the default for HWP systems anyway, I don't see any
> > >> > drawbacks related to making this change, so I would consider this as
> > >> > 5.9 material unless there are any serious objections.
> > >>
> > >> Good point.
> >
> > Actually, for those users that default to passive mode upon boot,
> > this would mean they would find themselves using this new mode.
> > Also, it isn't obvious from the typical "what driver and what governor"
> > inquiry.
> 
> So the change in behavior is that after this patch
> intel_pstate=passive doesn't imply no_hwp any more.
> 
> That's a very minor difference though and I'm not aware of any adverse
> effects it can cause on HWP systems anyway.

My point was that it will now default to something for which
testing has not been completed.

> The "what governor" is straightforward in the passive mode: that's
> whatever cpufreq governor has been selected.

I think you might have missed my point.
From the normal methods of inquiry one does not know
if HWP is being used or not. Why? Because with
or without HWP one gets the same answers under:

/sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
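
For completeness, a few ways to answer the "is HWP in use?" question directly,
since the two files above read the same either way (a sketch assuming an Intel
system, mainline 5.8 paths, and the msr-tools package for the last command;
IA32_PM_ENABLE is MSR 0x770, and bit 0 set means HWP is enabled):

    $ grep -o 'hwp' /proc/cpuinfo | sort -u             # CPU advertises HWP
    $ dmesg | grep -i 'intel_pstate.*HWP'               # driver enabled it
    $ cat /sys/devices/system/cpu/intel_pstate/status   # active/passive/off
    $ sudo rdmsr 0x770                                  # 1 -> HWP enabled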

... Doug




Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-31 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:
>> 
>
> [cut]
>
>> >> >
>> >> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >> >
>> >> >> > But at the same time you seem to agree that without the non-CPU component
>> >> >> > (or thermal pressure) the existing CPU performance scaling would be
>> >> >> > sufficient.
>> >> >> >
>> >> >>
>> >> >> Yes, but not necessarily in order to allow the non-CPU component to draw
>> >> >> more power as you said above, but also because the existence of a
>> >> >> bottleneck in a non-CPU component gives us an opportunity to improve the
>> >> >> energy efficiency of the CPU, regardless of whether that allows the
>> >> >> workload to run faster.
>> >> >
>> >> > But why would the bottleneck be there otherwise?
>> >> >
>> >>
>> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> >> unrelated to its energy usage.
>> >
>> > Well, not quite.  Or at least in that case the performance cannot be improved
>> > by limiting the CPU frequency below the frequency looked for by scaling
>> > governors, AFAICS.
>> >
>> 
>> Right, but it might still be possible to improve the energy efficiency
>> of the workload even if its throughput cannot be improved, which seems
>> like a worthwhile purpose in itself.
>
> My point is that in this case the energy-efficiency of the processor cannot
> be improved without decreasing performance.
>

Well, in principle it can whenever there is a bottleneck in a non-CPU
component.

> For the processors that are relevant here, the most energy-efficient way to
> run them is in the minimum P-state, but that rarely provides the required
> performance.  Without the knowledge of how much performance really is required
> we assume the maximum achievable.  Anything else would need to involve some extra
> policy knobs (which are missing ATM) and that is another problem.
>

But there is a middle ground between limiting the workload to run at the
minimum P-state and having it run at the maximum achievable P-state.
The MEF estimation we were just talking about can help.  However, due to
its heuristic nature I certainly see the merit of having policy knobs
allowing the optimization to be controlled by users, which is why I
added such controls in v2.99, at your request.

> I'm not saying that it is not a problem, but it is not possible to say how much
> performance to sacrifice without any input from the user on that.
>
> IOW, this is a topic for another discussion.
>

I have the vague recollection that you brought that up already and I
agreed and implemented the changes you asked for.

>> > Scaling governors generally look for the maximum frequency at which there is no
>> > CPU idle time in the workload.  At that frequency the CPU time required by the
>> > workload to achieve the maximum performance is equal to the total CPU time
>> > available to it.  I will refer to that frequency as the maximum effective
>> > frequency (MEF) of the workload.
>> >
>> > By definition, running at frequencies above the MEF does not improve
>> > performance, but it causes CPU idle time to appear.  OTOH running at
>> > frequencies below the MEF increases the CPU time required by the workload
>> > to achieve the maximum performance, so effectively the workload does
>> > not get enough CPU time for the performance to be maximum, so it is lower
>> > than at the MEF.
>> >
>> 
>> Yes, we agree on that.
>> 
>> > Of course, the MEF is well-defined as long as the processor does not share
>> > the power budget with another component that is also used by the workload
>> > (say, a GPU).  Without the sharing of a power budget, the MEF can be determined
>> > by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
>> > the most convenient) alone, because it already depends on the speed of any
>> > memory etc accessed by the workload and slowing down the processor doesn't
>> > improve the performance (because the other components don't run any faster
>> > as a result of that).
>> >
>> > However, if the processor is sharing the power budget with a GPU (say), it
>> > is not enough to look at the CPU idle time to determine the MEF, because
>> > slowing down the processor generally causes the GPU to get more power which
>> > allows it to run faster and CPUs can do more work, because they spend less
>> > time waiting for the GPU, so the CPU time available to the workload effectively
>> > increases and it can achieve the maximum performance at a lower frequency.
>> > So there is "effective MEF" depending on the relative performance balance
>> > between the processor and the GPU and on what "fraction" of the workload
>> > runs on the GPU.
>> >
>> 
>> That doesn't mean that the MEF isn't well-defined in systems with a
>> shared power budget.  If you define it as the 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-31 Thread Rafael J. Wysocki
On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:
> 

[cut]

> >> >
> >> >> >> No, I explicitly dismissed that in my previous reply.
> >> >> >
> >> >> > But at the same time you seem to agree that without the non-CPU component
> >> >> > (or thermal pressure) the existing CPU performance scaling would be
> >> >> > sufficient.
> >> >> >
> >> >>
> >> >> Yes, but not necessarily in order to allow the non-CPU component to draw
> >> >> more power as you said above, but also because the existence of a
> >> >> bottleneck in a non-CPU component gives us an opportunity to improve the
> >> >> energy efficiency of the CPU, regardless of whether that allows the
> >> >> workload to run faster.
> >> >
> >> > But why would the bottleneck be there otherwise?
> >> >
> >>
> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
> >> unrelated to its energy usage.
> >
> > Well, not quite.  Or at least in that case the performance cannot be improved
> > by limiting the CPU frequency below the frequency looked for by scaling
> > governors, AFAICS.
> >
> 
> Right, but it might still be possible to improve the energy efficiency
> of the workload even if its throughput cannot be improved, which seems
> like a worthwhile purpose in itself.

My point is that in this case the energy-efficiency of the processor cannot
be improved without decreasing performance.

For the processors that are relevant here, the most energy-efficient way to
run them is in the minimum P-state, but that rarely provides the required
performance.  Without the knowledge of how much performance really is required
we assume the maximum achievable.  Anything else would need to involve some extra
policy knobs (which are missing ATM) and that is another problem.

I'm not saying that it is not a problem, but it is not possible to say how much
performance to sacrifice without any input from the user on that.

IOW, this is a topic for another discussion.

> > Scaling governors generally look for the maximum frequency at which there is no
> > CPU idle time in the workload.  At that frequency the CPU time required by the
> > workload to achieve the maximum performance is equal to the total CPU time
> > available to it.  I will refer to that frequency as the maximum effective
> > frequency (MEF) of the workload.
> >
> > By definition, running at frequencies above the MEF does not improve
> > performance, but it causes CPU idle time to appear.  OTOH running at
> > frequencies below the MEF increases the CPU time required by the workload
> > to achieve the maximum performance, so effectively the workload does
> > not get enough CPU time for the performance to be maximum, so it is lower
> > than at the MEF.
> >
> 
> Yes, we agree on that.
> 
> > Of course, the MEF is well-defined as long as the processor does not share
> > the power budget with another component that is also used by the workload
> > (say, a GPU).  Without the sharing of a power budget, the MEF can be determined
> > by looking at the CPU idle time (or CPU busy time, or CPU load, whichever is
> > the most convenient) alone, because it already depends on the speed of any
> > memory etc accessed by the workload and slowing down the processor doesn't
> > improve the performance (because the other components don't run any faster
> > as a result of that).
> >
> > However, if the processor is sharing the power budget with a GPU (say), it
> > is not enough to look at the CPU idle time to determine the MEF, because
> > slowing down the processor generally causes the GPU to get more power which
> > allows it to run faster and CPUs can do more work, because they spend less
> > time waiting for the GPU, so the CPU time available to the workload effectively
> > increases and it can achieve the maximum performance at a lower frequency.
> > So there is "effective MEF" depending on the relative performance balance
> > between the processor and the GPU and on what "fraction" of the workload
> > runs on the GPU.
> >
> 
> That doesn't mean that the MEF isn't well-defined in systems with a
> shared power budget.  If you define it as the lowest frequency at which
> the workload reaches maximum throughput, then there still is one even if
> the system is TDP-bound: the maximum frequency at which the CPU and
> device maintain their optimal power balance -- Above it the power budget
> of the device will be constrained, causing it to run at less than
> optimal throughput, also causing the workload as a whole to run at less
> than optimal throughput.

That is what I am calling the "effective MEF" and the main point is that it
cannot be determined by looking at the CPU idle time alone.

> That said I agree with you that it takes more than looking at the CPU
> utilization in order to determine the MEF in a system with a shared
> power budget -- Until you've reached a steady 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-29 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Wednesday, July 29, 2020 7:46:08 AM CEST Francisco Jerez wrote:
>> 
>> "Rafael J. Wysocki"  writes:
>> 
>> > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>> >>
>> >> "Rafael J. Wysocki"  writes:
>> >>
>> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>> >> >
>> >> > [cut]
>> >> >
>> >> >> >
>> >> >> > However, in the active mode the only updater of hwp_req_cached is
>> >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> >> >> > differences in behavior in that case.
>> >> >> >
>> >> >>
>> >> >> intel_pstate_hwp_set() is the only updater, but there are other
>> >> >> consumers that can get out of sync with the HWP request value written by
>> >> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
>> >> >> like the most concerning example I named earlier.
>> >> >>
>> >> >> >> > So there may be a short time window after the
>> >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> >> >> >> > value may not be in effect, but in general there is no guarantee that
>> >> >> >> > the new EPP will take effect immediately after updating the MSR
>> >> >> >> > anyway, so that race doesn't matter.
>> >> >> >> >
>> >> >> >> > That said, that race is avoidable, but I was thinking that trying to
>> >> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> >> >> >> > though, so I'm going to update the patch to that end.
>> >> >> >> >
>> >> >> >> >> Seems like a bug to me.
>> >> >> >> >
>> >> >> >> > It is racy, but not every race is a bug.
>> >> >> >> >
>> >> >> >>
>> >> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> >> >> >> AFAICT.
>> >> >> >
>> >> >> > If there is a bug, then what exactly is it, from the users' perspective?
>> >> >> >
>> >> >>
>> >> >> It can be reproduced easily as follows:
>> >> >>
>> >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>> >> >
>> >> > Is this the active mode or the passive mode with the $subject patch applied?
>> >> >
>> >> > If the former, the issue is there regardless of the patch, so it needs to be
>> >> > fixed.
>> >> >
>> >> > If the latter, there should be no effect of hwp_dynamic_boost (which was
>> >> > overlooked by me).
>> >> >
>> >>
>> >> This seems to be a problem in active mode only, so yeah the bug exists
>> >> regardless of your patch, but the fix is likely to allow you to simplify
>> >> this series slightly if it allows you to take full advantage of
>> >> hwp_req_cached and drop the additional EPP cache.
>> >
>> > The additional EPP cache is there to avoid synchronizing the scheduler
>> > context directly with a random process running on another CPU and doing
>> > things that may take time.
>> >
>> > The difference between the active mode and the passive mode in this respect
>> > is that in the latter case hwp_req_cached generally needs to be updated from
>> > the scheduler context, whereas in the former case it does not.
>> >
>> 
>> Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
>> performance penalty from synchronizing with a sysfs handler that should
>> hardly ever be called.  Anyway thanks for the fix.
>
> No problem.
>
>> > [cut]
>> >
>> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >
>> >> > But at the same time you seem to agree that without the non-CPU component
>> >> > (or thermal pressure) the existing CPU performance scaling would be
>> >> > sufficient.
>> >> >
>> >>
>> >> Yes, but not necessarily in order to allow the non-CPU component to draw
>> >> more power as you said above, but also because the existence of a
>> >> bottleneck in a non-CPU component gives us an opportunity to improve the
>> >> energy efficiency of the CPU, regardless of whether that allows the
>> >> workload to run faster.
>> >
>> > But why would the bottleneck be there otherwise?
>> >
>> 
>> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> unrelated to its energy usage.
>
> Well, not quite.  Or at least in that case the performance cannot be improved
> by limiting the CPU frequency below the frequency looked for by scaling
> governors, AFAICS.
>

Right, but it might still be possible to improve the energy efficiency
of the workload even if its throughput cannot be improved, which seems
like a worthwhile purpose in itself.

> Scaling governors 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-29 Thread Rafael J. Wysocki
On Wednesday, July 29, 2020 7:46:08 AM CEST Francisco Jerez wrote:
> 
> "Rafael J. Wysocki"  writes:
> 
> > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
> >>
> >> "Rafael J. Wysocki"  writes:
> >>
> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> >> >
> >> > [cut]
> >> >
> >> >> >
> >> >> > However, in the active mode the only updater of hwp_req_cached is
> >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
> >> >> > differences in behavior in that case.
> >> >> >
> >>
> >> intel_pstate_hwp_set() is the only updater, but there are other
> >> consumers that can get out of sync with the HWP request value written by
> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
> >> like the most concerning example I named earlier.
> >>
> >> >> >> > So there may be a short time window after the
> >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> >> >> >> > value may not be in effect, but in general there is no guarantee that
> >> >> >> > the new EPP will take effect immediately after updating the MSR
> >> >> >> > anyway, so that race doesn't matter.
> >> >> >> >
> >> >> >> > That said, that race is avoidable, but I was thinking that trying to
> >> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
> >> >> >> > though, so I'm going to update the patch to that end.
> >> >> >> >
> >> >> >> >> Seems like a bug to me.
> >> >> >> >
> >> >> >> > It is racy, but not every race is a bug.
> >> >> >> >
> >> >> >>
> >> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> >> >> >> AFAICT.
> >> >> >
> >> >> > If there is a bug, then what exactly is it, from the users' perspective?
> >> >> >
> >> >>
> >> >> It can be reproduced easily as follows:
> >> >>
> >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> >> >
> >> > Is this the active mode or the passive mode with the $subject patch applied?
> >> >
> >> > If the former, the issue is there regardless of the patch, so it needs to be
> >> > fixed.
> >> >
> >> > If the latter, there should be no effect of hwp_dynamic_boost (which was
> >> > overlooked by me).
> >> >
> >>
> >> This seems to be a problem in active mode only, so yeah the bug exists
> >> regardless of your patch, but the fix is likely to allow you to simplify
> >> this series slightly if it allows you to take full advantage of
> >> hwp_req_cached and drop the additional EPP cache.
> >
> > The additional EPP cache is there to avoid synchronizing the scheduler
> > context directly with a random process running on another CPU and doing
> > things that may take time.
> >
> > The difference between the active mode and the passive mode in this respect
> > is that in the latter case hwp_req_cached generally needs to be updated from
> > the scheduler context, whereas in the former case it does not.
> >
> 
> Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
> performance penalty from synchronizing with a sysfs handler that should
> hardly ever be called.  Anyway thanks for the fix.

No problem.

> > [cut]
> >
> >> >> No, I explicitly dismissed that in my previous reply.
> >> >
> >> > But at the same time you seem to agree that without the non-CPU component
> >> > (or thermal pressure) the existing CPU performance scaling would be
> >> > sufficient.
> >> >
> >>
> >> Yes, but not necessarily in order to allow the non-CPU component to draw
> >> more power as you said above, but also because the existence of a
> >> bottleneck in a non-CPU component gives us an opportunity to improve the
> >> energy efficiency of the CPU, regardless of whether that allows the
> >> workload to run faster.
> >
> > But why would the bottleneck be there otherwise?
> >
> 
> Because some resource of the system (e.g. memory bandwidth, GPU fill
> rate) may be close to 100% utilized, causing a bottleneck for reasons
> unrelated to its energy usage.

Well, not quite.  Or at least in that case the performance cannot be improved
by limiting the CPU frequency below the frequency looked for by scaling
governors, AFAICS.

Scaling governors generally look for the maximum frequency at which there is no
CPU idle time in the workload.  At that frequency the CPU time required by the
workload to achieve the maximum performance is equal to the total CPU time
available to it.  I will refer to that frequency as the maximum effective
frequency (MEF) of the workload.
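
To make the definition concrete (a sketch in this thread's terms, not from
it: let C be the CPU cycles the workload needs per period T, with all
non-CPU components held fixed, and let u(f) be the CPU utilization observed
at frequency f):

    u(f) = \frac{C}{f\,T}, \qquad
    f_{MEF} = \min\{\, f : u(f) \le 1 \,\} = \frac{C}{T}

Above f_MEF, u(f) < 1 and idle time appears; below it the workload cannot
get the CPU time it needs, so its performance drops.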

By definition, running at 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-28 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>>
>> "Rafael J. Wysocki"  writes:
>> 
>> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>> >
>> > [cut]
>> >
>> >> >
>> >> > However, in the active mode the only updater of hwp_req_cached is
>> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> >> > differences in behavior in that case.
>> >> >
>> >>
>> >> intel_pstate_hwp_set() is the only updater, but there are other
>> >> consumers that can get out of sync with the HWP request value written by
>> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
>> >> like the most concerning example I named earlier.
>> >>
>> >> >> > So there may be a short time window after the
>> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> >> >> > value may not be in effect, but in general there is no guarantee that
>> >> >> > the new EPP will take effect immediately after updating the MSR
>> >> >> > anyway, so that race doesn't matter.
>> >> >> >
>> >> >> > That said, that race is avoidable, but I was thinking that trying to
>> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> >> >> > though, so I'm going to update the patch to that end.
>> >> >> >
>> >> >> >> Seems like a bug to me.
>> >> >> >
>> >> >> > It is racy, but not every race is a bug.
>> >> >> >
>> >> >>
>> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> >> >> AFAICT.
>> >> >
>> >> > If there is a bug, then what exactly is it, from the users' perspective?
>> >> >
>> >>
>> >> It can be reproduced easily as follows:
>> >>
>> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>> >
>> > Is this the active mode or the passive mode with the $subject patch applied?
>> >
>> > If the former, the issue is there regardless of the patch, so it needs to be
>> > fixed.
>> >
>> > If the latter, there should be no effect of hwp_dynamic_boost (which was
>> > overlooked by me).
>> >
>> 
>> This seems to be a problem in active mode only, so yeah the bug exists
>> regardless of your patch, but the fix is likely to allow you to simplify
>> this series slightly if it allows you to take full advantage of
>> hwp_req_cached and drop the additional EPP cache.
>
> The additional EPP cache is there to avoid synchronizing the scheduler
> context directly with a random process running on another CPU and doing
> things that may take time.
>
> The difference between the active mode and the passive mode in this respect
> is that in the latter case hwp_req_cached generally needs to be updated from
> the scheduler context, whereas in the former case it does not.
>

Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
performance penalty from synchronizing with a sysfs handler that should
hardly ever be called.  Anyway thanks for the fix.

> [cut]
>
>> >> No, I explicitly dismissed that in my previous reply.
>> >
>> > But at the same time you seem to agree that without the non-CPU component
>> > (or thermal pressure) the existing CPU performance scaling would be
>> > sufficient.
>> >
>> 
>> Yes, but not necessarily in order to allow the non-CPU component to draw
>> more power as you said above, but also because the existence of a
>> bottleneck in a non-CPU component gives us an opportunity to improve the
>> energy efficiency of the CPU, regardless of whether that allows the
>> workload to run faster.
>
> But why would the bottleneck be there otherwise?
>

Because some resource of the system (e.g. memory bandwidth, GPU fill
rate) may be close to 100% utilized, causing a bottleneck for reasons
unrelated to its energy usage.

>> > [cut]
>> >
>> >> > Yes, it is, and so I don't quite see the connection between it and my question.
>> >> >
>> >> > Apparently, the unmodified performance scaling governors are not
>> >> > sufficient, so there must be something beyond the above which allows
>> >> > you to determine the frequency in question and so I'm asking what that
>> >> > is.
>> >> >
>> >>
>> >> The underlying heuristic assumption is the same as I outlined above, but
>> >> in any implementation of such a heuristic there is necessarily a
>> >> trade-off between responsiveness to short-term fluctuations and
>> >> long-term energy usage.  This trade-off is a function of the somewhat
>> >> arbitrary time interval I was referring to as "immediate past" -- A
>> >> longer time parameter allows the controller to consider a greater
>> >> portion of the workload's history while computing the response with
>> >> optimal energy usage, at the cost of increasing its reaction time to
>> >> discontinuous changes in the behavior of the workload (AKA increased
>> >> latency).
>> >
>> > OK
>> >
>> >> One of the key 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-28 Thread Rafael J. Wysocki
On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>
> "Rafael J. Wysocki"  writes:
> 
> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> >
> > [cut]
> >
> >> >
> >> > However, in the active mode the only updater of hwp_req_cached is
> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
> >> > differences in behavior in that case.
> >> >
> >>
> >> intel_pstate_hwp_set() is the only updater, but there are other
> >> consumers that can get out of sync with the HWP request value written by
> >> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
> >> like the most concerning example I named earlier.
> >>
> >> >> > So there may be a short time window after the
> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> >> >> > value may not be in effect, but in general there is no guarantee that
> >> >> > the new EPP will take effect immediately after updating the MSR
> >> >> > anyway, so that race doesn't matter.
> >> >> >
> >> >> > That said, that race is avoidable, but I was thinking that trying to
> >> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
> >> >> > though, so I'm going to update the patch to that end.
> >> >> >
> >> >> >> Seems like a bug to me.
> >> >> >
> >> >> > It is racy, but not every race is a bug.
> >> >> >
> >> >>
> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> >> >> AFAICT.
> >> >
> >> > If there is a bug, then what exactly is it, from the users' perspective?
> >> >
> >>
> >> It can be reproduced easily as follows:
> >>
> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> >
> > Is this the active mode or the passive mode with the $subject patch applied?
> >
> > If the former, the issue is there regardless of the patch, so it needs to be
> > fixed.
> >
> > If the latter, there should be no effect of hwp_dynamic_boost (which was
> > overlooked by me).
> >
> 
> This seems to be a problem in active mode only, so yeah the bug exists
> regardless of your patch, but the fix is likely to allow you to simplify
> this series slightly if it allows you to take full advantage of
> hwp_req_cached and drop the additional EPP cache.

The additional EPP cache is there to avoid synchronizing the scheduler
context directly with a random process running on another CPU and doing
things that may take time.

The difference between the active mode and the passive mode in this respect
is that in the latter case hwp_req_cached generally needs to be updated from
the scheduler context, whereas in the former case it does not.

[cut]

> >> No, I explicitly dismissed that in my previous reply.
> >
> > But at the same time you seem to agree that without the non-CPU component
> > (or thermal pressure) the existing CPU performance scaling would be
> > sufficient.
> >
> 
> Yes, but not necessarily in order to allow the non-CPU component to draw
> more power as you said above, but also because the existence of a
> bottleneck in a non-CPU component gives us an opportunity to improve the
> energy efficiency of the CPU, regardless of whether that allows the
> workload to run faster.

But why would the bottleneck be there otherwise?

> > [cut]
> >
> >> > Yes, it is, and so I don't quite see the connection between it and my question.
> >> >
> >> > Apparently, the unmodified performance scaling governors are not
> >> > sufficient, so there must be something beyond the above which allows
> >> > you to determine the frequency in question and so I'm asking what that
> >> > is.
> >> >
> >>
> >> The underlying heuristic assumption is the same as I outlined above, but
> >> in any implementation of such a heuristic there is necessarily a
> >> trade-off between responsiveness to short-term fluctuations and
> >> long-term energy usage.  This trade-off is a function of the somewhat
> >> arbitrary time interval I was referring to as "immediate past" -- A
> >> longer time parameter allows the controller to consider a greater
> >> portion of the workload's history while computing the response with
> >> optimal energy usage, at the cost of increasing its reaction time to
> >> discontinuous changes in the behavior of the workload (AKA increased
> >> latency).
> >
> > OK
> >
> >> One of the key differences between the governor I proposed and the
> >> pre-existing ones is that it doesn't attempt to come up with a magic
> >> time parameter that works for everybody, because there isn't such a
> >> thing, since different devices and applications have latency
> >> requirements which often differ by orders of magnitude.
> >
> > The problem with this approach is that, generally speaking, the kernel
> > has a definition of "close past" already, which comes from the PELT
> > signal in the scheduler.
> >
> > That signal is used 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-28 Thread Rafael J. Wysocki
On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> 

[cut]

> > If there is a bug, then what exactly is it, from the users' perspective?
> >
> 
> It can be reproduced easily as follows:
> 
> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> 
> Let's make sure that the EPP updates landed on the turbostat output:
> 
> |[..]
> | Core  CPU  Avg_MHz  Busy%  Bzy_MHz  HWP_REQ
> | -     -    1        0.05   2396     0x
> | 0     0    1        0.05   2153     0x2704
> | 0     4    1        0.04   2062     0x2704
> | 1     1    1        0.02   2938     0x2704
> | 1     5    2        0.09   2609     0x2704
> | 2     2    1        0.04   1857     0x2704
> | 2     6    1        0.05   2561     0x2704
> | 3     3    0        0.01   1883     0x2704
> | 3     7    2        0.07   2703     0x2704
> |[..]
> 
> Now let's do some non-trivial IO activity in order to trigger HWP
> dynamic boost, and watch while random CPUs start losing their EPP
> setting requested via sysfs:
> 
> |[..]
> | Core  CPU  Avg_MHz  Busy%  Bzy_MHz  HWP_REQ
> | -     -    16       0.81   2023     0x
> | 0     0    7        0.66   1069     0x80002704
>                                         ^^
> | 0     4    24       2.19   1116     0x80002704
>                                         ^^
> | 1     1    18       0.68   2618     0x2704
> | 1     5    1        0.03   2005     0x2704
> | 2     2    2        0.07   2512     0x2704
> | 2     6    33       1.35   2402     0x2704
> | 3     3    1        0.04   2470     0x2704
> | 3     7    45       1.42   3185     0x80002704
>                                         ^^

Actually, that's because intel_pstate_hwp_boost_up() and
intel_pstate_hwp_boost_down() use the hwp_req_cached value
for updating the HWP Request MSR and that is only written to
by intel_pstate_hwp_set() which is only invoked on policy changes,
so the MSR writes from intel_pstate_set_energy_pref_index()
basically get discarded.

So this is a matter of synchronizing intel_pstate_set_policy() with
intel_pstate_set_energy_pref_index() and they both acquire
intel_pstate_limits_lock already, so this shouldn't be too difficult to fix.

Let me cut a patch for that.
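
In the meantime, one way to watch for the stale-EPP symptom from userspace
while testing such a fix (a sketch assuming the msr-tools package; EPP lives
in bits 31:24 of IA32_HWP_REQUEST, MSR 0x774, so after "performance" is
written to energy_performance_preference every CPU should keep reading 0
here, even while hwp_dynamic_boost fires):

    # print the EPP bitfield of IA32_HWP_REQUEST on every CPU
    $ sudo rdmsr -a -f 31:24 0x774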





Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-27 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>
> [cut]
>
>> >
>> > However, in the active mode the only updater of hwp_req_cached is
>> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> > differences in behavior in that case.
>> >
>> 
>> intel_pstate_hwp_set() is the only updater, but there are other
>> consumers that can get out of sync with the HWP request value written by
>> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
>> like the most concerning example I named earlier.
>> 
>> >> > So there may be a short time window after the
>> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
>> >> > value may not be in effect, but in general there is no guarantee that
>> >> > the new EPP will take effect immediately after updating the MSR
>> >> > anyway, so that race doesn't matter.
>> >> >
>> >> > That said, that race is avoidable, but I was thinking that trying to
>> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
>> >> > though, so I'm going to update the patch to that end.
>> >> >
>> >> >> Seems like a bug to me.
>> >> >
>> >> > It is racy, but not every race is a bug.
>> >> >
>> >>
>> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
>> >> AFAICT.
>> >
>> > If there is a bug, then what exactly is it, from the users' perspective?
>> >
>> 
>> It can be reproduced easily as follows:
>> 
>> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>
> Is this the active mode or the passive mode with the $subject patch applied?
>
> If the former, the issue is there regardless of the patch, so it needs to be
> fixed.
>
> If the latter, there should be no effect of hwp_dynamic_boost (which was
> overlooked by me).
>

This seems to be a problem in active mode only, so yeah the bug exists
regardless of your patch, but the fix is likely to allow you to simplify
this series slightly if it allows you to take full advantage of
hwp_req_cached and drop the additional EPP cache.

> [cut]
>  
>> > To really decide what is better, the two alternatives would need to
>> > be compared quantitatively, but it doesn't look like they have been.
>> >
>> 
>> That's a fair request, I'm all for justifying design decisions
>> quantitatively -- And in fact we did compare my non-HWP controller with
>> the GuC-based solution quantitatively in a BXT platform, and results
>> showed the kernel-based solution to be superior by a significant margin.
>
> Why do you think that this result is significant beyond BXT?
>

Because the limitations of GuC power management relative to the
kernel-driven solution haven't fundamentally changed since BXT.  Of
course it might be that the gap we observed was due to more accidental
reasons, like bugs in the BXT GuC power management code, it would
certainly make sense to recheck.

>> That was a couple of years ago though and many things have changed, we
>> can get you the results of the previous comparison or an updated
>> comparison if you don't find it appealing to look at old numbers.
>> 
>
> [cut]
>
>> >>
>> >> If we define the instantaneous energy efficiency of a CPU (eta) to be
>> >> the ratio between its instantaneous frequency (f) and power consumption
>> >> (P),
>> >
>> > I'm sorry, but this definition is conceptually misguided.
>> >
>> > Energy-efficiency (denote it as \phi) can be defined as work/energy which 
>> > means
>> >
>> > \phi = dW / dE
>> >
>> > for the instantaneous one and in general that is not the same as the
>> > simple fraction below.
>> >
>> 
>> Hah!  Actually both definitions are mathematically equivalent everywhere
>> they're both defined.  I assume that the 'd' symbols in your expression
>> denote Leibniz's notation for a total derivative (if they didn't --
>> e.g. if they denoted some sort of finite increment instead then I would
>> disagree that your expression is a valid definition of *instantaneous*
>> energy efficiency).  In addition I assume that we agree on there being
>> two well-defined functions of time in the case that concerns us, which
>> we call "instantaneous power consumption" [P(t) = dE(t)/dt] and
>> "instantaneous frequency" [f(t) = dW(t)/dt].  With that in mind you just
>> have to apply the standard chain rule from calculus in order to prove
>> the equivalence of both expressions:
>> 
>> | \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 =
>> |  = f(t) * P(t)^-1 = eta
>
> Well, under certain assumptions, but OK, I stand corrected.
>
> [cut]
>
>> 
>> > Regardless, the ultimate goal appears to be to allow the non-CPU
>> > component you care about draw more power.
>> >
>> 
>> No, I explicitly dismissed that in my previous reply.
>
> But at the same time you seem to agree that without the non-CPU component
> (or thermal pressure) the existing CPU 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-27 Thread Rafael J. Wysocki
On Wednesday, July 22, 2020 1:14:42 AM CEST Francisco Jerez wrote:
> 
> Srinivas Pandruvada  writes:
> 
> > On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
> >> "Rafael J. Wysocki"  writes:
> >>
> >> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
> >> > curroje...@riseup.net> wrote:
> >> > > "Rafael J. Wysocki"  writes:
> >> > >
> > [...]
> >
> >> > Overall, so far, I'm seeing a claim that the CPU subsystem can be made
> >> > to use less energy and do as much work as before (which is what improving
> >> > the energy-efficiency means in general) if the maximum frequency of
> >> > CPUs is limited in a clever way.
> >> >
> >> > I'm failing to see what that clever way is, though.
> >> Hopefully the clarifications above help some.
> >
> > To simplify:
> >
> > Suppose I called a function numpy.multiply() to multiply two big arrays
> > and the thread is pegged to a CPU.  Let's say this causes the CPU to
> > finish the job in 10 ms using a P-state of 0x20, but the same job could
> > have been done in 10 ms even at a P-state of 0x16.  So we are not energy
> > efficient.  To really know where the bottleneck is there are a number of
> > perf counters; maybe the cache was the issue, and we could rather raise
> > the uncore frequency a little.  Simple APERF, MPERF counters are not
> > enough.
> 
> Yes, that's right, APERF and MPERF aren't sufficient to identify every
> kind of possible bottleneck, some visibility of the utilization of other
> subsystems is necessary in addition -- Like e.g. the instrumentation
> introduced in my series to detect a GPU bottleneck.  A bottleneck
> condition in an IO device can be communicated to CPUFREQ

It generally is not sufficient to communicate it to cpufreq.  It needs to be
communicated to the CPU scheduler.

> by adjusting a
> PM QoS latency request (link [2] in my previous reply) that effectively
> gives the governor permission to rearrange CPU work arbitrarily within
> the specified time frame (which should be of the order of the natural
> latency of the IO device -- e.g. at least the rendering time of a frame
> for a GPU) in order to minimize energy usage.

OK, we need to talk more about this.

> > Or we characterize the workload at different P-states and set limits.
> > I think this is not what you want to say for energy efficiency with your
> > changes.
> >
> > The way you are trying to improve "performance" is by having the caller
> > (the device driver) say how important the job at hand is.  Here, suppose
> > the device driver offloads this calculation to some GPU and can wait up
> > to 10 ms; you want to tell the CPU to be slow.  But the moment the
> > P-state driver observes that there is a chance of overshooting the
> > latency, it will immediately ask for a higher P-state.  So you want
> > P-state limits based on the latency requirements of the caller.  Since
> > the caller has more knowledge of the latency requirement, this allows
> > other devices sharing the power budget to get more or less power, and
> > improves overall energy efficiency as the combined performance of the
> > system is improved.
> > Is this correct?
> 
> Yes, pretty much.

OK





Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-27 Thread Rafael J. Wysocki
On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:

[cut]

> >
> > However, in the active mode the only updater of hwp_req_cached is
> > intel_pstate_hwp_set() and this patch doesn't introduce any
> > differences in behavior in that case.
> >
> 
> intel_pstate_hwp_set() is the only updater, but there are other
> consumers that can get out of sync with the HWP request value written by
> intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up() seems
> like the most concerning example I named earlier.
> 
> >> > So there may be a short time window after the
> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP
> >> > value may not be in effect, but in general there is no guarantee that
> >> > the new EPP will take effect immediately after updating the MSR
> >> > anyway, so that race doesn't matter.
> >> >
> >> > That said, that race is avoidable, but I was thinking that trying to
> >> > avoid it might not be worth it.  Now I see a better way to avoid it,
> >> > though, so I'm going to update the patch to that end.
> >> >
> >> >> Seems like a bug to me.
> >> >
> >> > It is racy, but not every race is a bug.
> >> >
> >>
> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index()
> >> AFAICT.
> >
> > If there is a bug, then what exactly is it, from the users' perspective?
> >
> 
> It can be reproduced easily as follows:
> 
> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done

Is this the active mode or the passive mode with the $subject patch applied?

If the former, the issue is there regardless of the patch, so it needs to be
fixed.

If the latter, there should be no effect of hwp_dynamic_boost (which was
overlooked by me).

[cut]
 
> > To really decide what is better, the two alternatives would need to
> > be compared quantitatively, but it doesn't look like they have been.
> >
> 
> That's a fair request, I'm all for justifying design decisions
> quantitatively -- And in fact we did compare my non-HWP controller with
> the GuC-based solution quantitatively in a BXT platform, and results
> showed the kernel-based solution to be superior by a significant margin.

Why do you think that this result is significant beyond BXT?

> That was a couple of years ago though and many things have changed, we
> can get you the results of the previous comparison or an updated
> comparison if you don't find it appealing to look at old numbers.
> 

[cut]

> >>
> >> If we define the instantaneous energy efficiency of a CPU (eta) to be
> >> the ratio between its instantaneous frequency (f) and power consumption
> >> (P),
> >
> > I'm sorry, but this definition is conceptually misguided.
> >
> > Energy-efficiency (denote it as \phi) can be defined as work/energy which 
> > means
> >
> > \phi = dW / dE
> >
> > for the instantaneous one and in general that is not the same as the
> > simple fraction below.
> >
> 
> Hah!  Actually both definitions are mathematically equivalent everywhere
> they're both defined.  I assume that the 'd' symbols in your expression
> denote Leibniz's notation for a total derivative (if they didn't --
> e.g. if they denoted some sort of finite increment instead then I would
> disagree that your expression is a valid definition of *instantaneous*
> energy efficiency).  In addition I assume that we agree on there being
> two well-defined functions of time in the case that concerns us, which
> we call "instantaneous power consumption" [P(t) = dE(t)/dt] and
> "instantaneous frequency" [f(t) = dW(t)/dt].  With that in mind you just
> have to apply the standard chain rule from calculus in order to prove
> the equivalence of both expressions:
> 
> | \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 =
> |  = f(t) * P(t)^-1 = eta

Well, under certain assumptions, but OK, I stand corrected.
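
For reference, that chain-rule argument in display form (it assumes W(t) and
E(t) are differentiable with dE/dt = P(t) > 0, so that t(E) is well-defined):

    \phi = \frac{dW}{dE} = \frac{dW/dt}{dE/dt} = \frac{f(t)}{P(t)} = \eta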

[cut]

> 
> > Regardless, the ultimate goal appears to be to allow the non-CPU
> > component you care about draw more power.
> >
> 
> No, I explicitly dismissed that in my previous reply.

But at the same time you seem to agree that without the non-CPU component
(or thermal pressure) the existing CPU performance scaling would be
sufficient.

[cut]

> > Yes, it is, and so I don't quite see the connection between it and my 
> > question.
> >
> > Apparently, the unmodified performance scaling governors are not
> > sufficient, so there must be something beyond the above which allows
> > you to determine the frequency in question and so I'm asking what that
> > is.
> >
> 
> The underlying heuristic assumption is the same as I outlined above, but
> in any implementation of such a heuristic there is necessarily a
> trade-off between responsiveness to short-term fluctuations and
> long-term energy usage.  This trade-off is a function of the somewhat
> arbitrary time interval I was referring to as "immediate past" -- A
> 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-21 Thread Francisco Jerez
Srinivas Pandruvada  writes:

> On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
>> "Rafael J. Wysocki"  writes:
>> 
>> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
>> > curroje...@riseup.net> wrote:
>> > > "Rafael J. Wysocki"  writes:
>> > > 
> [...]
>
>> > Overall, so far, I'm seeing a claim that the CPU subsystem can be made
>> > to use less energy and do as much work as before (which is what improving
>> > the energy-efficiency means in general) if the maximum frequency of
>> > CPUs is limited in a clever way.
>> > 
>> > I'm failing to see what that clever way is, though.
>> Hopefully the clarifications above help some.
>
> To simplify:
>
> Suppose I called a function numpy.multiply() to multiply two big arrays
> and the thread is pegged to a CPU.  Let's say this causes the CPU to
> finish the job in 10 ms using a P-state of 0x20, but the same job could
> have been done in 10 ms even at a P-state of 0x16.  So we are not energy
> efficient.  To really know where the bottleneck is there are a number of
> perf counters; maybe the cache was the issue, and we could rather raise
> the uncore frequency a little.  Simple APERF, MPERF counters are not
> enough.

Yes, that's right, APERF and MPERF aren't sufficient to identify every
kind of possible bottleneck, some visibility of the utilization of other
subsystems is necessary in addition -- Like e.g. the instrumentation
introduced in my series to detect a GPU bottleneck.  A bottleneck
condition in an IO device can be communicated to CPUFREQ by adjusting a
PM QoS latency request (link [2] in my previous reply) that effectively
gives the governor permission to rearrange CPU work arbitrarily within
the specified time frame (which should be of the order of the natural
latency of the IO device -- e.g. at least the rendering time of a frame
for a GPU) in order to minimize energy usage.

> or we characterize the workload at different P-states and set limits.
> I think this is not what you want to say for energy efficiency with
> your changes.
>
> The way you are trying to improve "performance" is by having the caller
> (device driver) say how important its job at hand is. Here, suppose the
> device driver offloads this calculation to some GPU and can wait up to
> 10 ms; you want to tell the CPU to be slow. But the moment the p-state
> driver observes that there is a chance of overshooting the latency, it
> will immediately ask for a higher P-state. So you want P-state limits
> based on the latency requirements of the caller. Since the caller has
> more knowledge of the latency requirement, this allows other devices
> sharing the power budget to get more or less power, and improves
> overall energy efficiency as the combined performance of the system is
> improved.
> Is this correct?

Yes, pretty much.


Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-21 Thread Srinivas Pandruvada
On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote:
> "Rafael J. Wysocki"  writes:
> 
> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez <
> > curroje...@riseup.net> wrote:
> > > "Rafael J. Wysocki"  writes:
> > > 
[...]

> > Overall, so far, I'm seeing a claim that the CPU subsystem can be
> > made to
> > use less energy and do as much work as before (which is what
> > improving
> > the energy-efficiency means in general) if the maximum frequency of
> > CPUs is limited in a clever way.
> > 
> > I'm failing to see what that clever way is, though.
> Hopefully the clarifications above help some.

To simplify:

Suppose I called a function numpy.multiply() to multiply two big arrays
and the thread is pinned to a CPU. Let's say it causes the CPU to
finish the job in 10ms while using a P-state of 0x20, but the same
job could have been done in 10ms even at a P-state of 0x16.
So we are not energy efficient. To really know where the bottleneck is
there are a number of perf counters; maybe cache was the issue and we
could rather raise the uncore frequency a little. The simple APERF and
MPERF counters are not enough. Or we characterize the workload at
different P-states and set limits.
I think this is not what you want to say for energy efficiency with
your changes.
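
As a concrete illustration of why APERF/MPERF alone cannot identify the
bottleneck: they only yield the average effective frequency over an
interval, nothing about why the core stalled.  A minimal userspace sketch
(assuming the msr driver is loaded and root access; cpu0 only):

    /* Reads IA32_MPERF (0xE7) and IA32_APERF (0xE8) via /dev/cpu/0/msr. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static uint64_t rdmsr(int fd, uint32_t reg)
    {
            uint64_t val = 0;

            pread(fd, &val, sizeof(val), reg);
            return val;
    }

    int main(void)
    {
            int fd = open("/dev/cpu/0/msr", O_RDONLY);

            if (fd < 0)
                    return 1;

            uint64_t m0 = rdmsr(fd, 0xE7), a0 = rdmsr(fd, 0xE8);
            sleep(1);
            uint64_t m1 = rdmsr(fd, 0xE7), a1 = rdmsr(fd, 0xE8);

            /* Effective frequency = base frequency * dAPERF/dMPERF. */
            printf("APERF/MPERF ratio: %.3f\n",
                   (double)(a1 - a0) / (double)(m1 - m0));
            close(fd);
            return 0;
    }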

The way you are trying to improve "performance" is by having the caller
(device driver) say how important its job at hand is. Here, suppose the
device driver offloads this calculation to some GPU and can wait up to
10 ms; you want to tell the CPU to be slow. But the moment the p-state
driver observes that there is a chance of overshooting the latency, it
will immediately ask for a higher P-state. So you want P-state limits
based on the latency requirements of the caller. Since the caller has
more knowledge of the latency requirement, this allows other devices
sharing the power budget to get more or less power, and improves
overall energy efficiency as the combined performance of the system is
improved.
Is this correct?


Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-20 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez  wrote:
>>
>> "Rafael J. Wysocki"  writes:
>>
>> > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez  
>> > wrote:
>> >>
>> >> "Rafael J. Wysocki"  writes:
>> >>
>> >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez  
>> >> > wrote:
>> >> >>
>> >> >> "Rafael J. Wysocki"  writes:
>> >> >>
>> >> >> > From: Rafael J. Wysocki 
>> >> >> >
>> >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> >> >> > make it set the HWP minimum performance limit (HWP floor) to the
>> >> >> > P-state value given by the target frequency supplied by the cpufreq
>> >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> >> >> > from working against each other, at least when the schedutil governor
>> >> >> > is in use, and update the intel_pstate documentation accordingly.
>> >> >> >
>> >> >> > Among other things, this allows utilization clamps to be taken
>> >> >> > into account, at least to a certain extent, when intel_pstate is
>> >> >> > in use and makes it more likely that sufficient capacity for
>> >> >> > deadline tasks will be provided.
>> >> >> >
>> >> >> > After this change, the resulting behavior of an HWP system with
>> >> >> > intel_pstate in the passive mode should be close to the behavior
>> >> >> > of the analogous non-HWP system with intel_pstate in the passive
>> >> >> > mode, except that in the frequency range below the base frequency
> >> >> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
>> >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> >> >> > the floor P-state set by intel_pstate with or without hardware
>> >> >> > coordination of P-states among CPUs in the same package.
>> >> >> >
>> >> >> > Also note that the setting of the HWP floor may not be taken into
>> >> >> > account by the processor in the following cases:
>> >> >> >
>> >> >> >  * For the HWP floor in the range of P-states above the base
>> >> >> >frequency, referred to as the turbo range, the processor has a
>> >> >> >license to choose any P-state from that range, either below or
>> >> >> >above the HWP floor, just like a non-HWP processor in the case
>> >> >> >when the target P-state falls into the turbo range.
>> >> >> >
>> >> >> >  * If P-states of the CPUs in the same package are coordinated
>> >> >> >at the hardware level, the processor may choose a P-state
>> >> >> >above the HWP floor, just like a non-HWP processor in the
>> >> >> >analogous case.
>> >> >> >
>> >> >> > With this change applied, intel_pstate in the passive mode
>> >> >> > assumes complete control over the HWP request MSR and concurrent
>> >> >> > changes of that MSR (eg. via the direct MSR access interface) are
>> >> >> > overridden by it.
>> >> >> >
>> >> >> > Signed-off-by: Rafael J. Wysocki 
>> >> >> > ---
>> >> >> >
>> >> >> > This basically unifies the passive mode behavior of intel_pstate for 
>> >> >> > systems
>> >> >> > with and without HWP enabled.  The only case in which there is a 
>> >> >> > difference
>> >> >> > between the two (after this patch) is below the turbo range, where 
>> >> >> > the HWP
> >> >> >> > algorithm can go above the floor regardless of whether or not
> >> >> >> > P-states are coordinated package-wide (this means it is mostly on
> >> >> >> > systems with per-core P-states that the difference can be somewhat
> >> >> >> > visible).
> >> >> >> >
> >> >> >> > Since the passive mode hasn't worked with HWP at all, and it is not
> >> >> >> > going to be
> >> >> >> > the default for HWP systems anyway, I don't see any drawbacks 
>> >> >> > related to making
>> >> >> > this change, so I would consider this as 5.9 material unless there 
>> >> >> > are any
>> >> >> > serious objections.
>> >> >> >
>> >> >> > Thanks!
>> >> >> >
>> >> >> > ---
>> >> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 
>> >> >> > +++-
>> >> >> >  drivers/cpufreq/intel_pstate.c|  141 
>> >> >> > --
>> >> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >> >> >
>> >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> >> >> > ===
>> >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> >> >> > @@ -36,6 +36,7 @@
>> >> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
>> >> >> >
>> >> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
>> >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>> >> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
>> >> >> >
>> >> >> >  #ifdef CONFIG_ACPI
>> >> >> > @@ -222,6 +223,7 @@ struct global_params {
>> >> >> >   *   preference/bias
>> >> >> >   * @epp_saved:   Saved EPP/EPB during system suspend or 
>> >> >> > CPU offline
>> >> >> >   *   operation
>> >> 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-19 Thread Rafael J. Wysocki
On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez  wrote:
>
> "Rafael J. Wysocki"  writes:
>
> > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez  
> > wrote:
> >>
> >> "Rafael J. Wysocki"  writes:
> >>
> >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez  
> >> > wrote:
> >> >>
> >> >> "Rafael J. Wysocki"  writes:
> >> >>
> >> >> > From: Rafael J. Wysocki 
> >> >> >
> >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
> >> >> > make it set the HWP minimum performance limit (HWP floor) to the
> >> >> > P-state value given by the target frequency supplied by the cpufreq
> >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> >> >> > from working against each other, at least when the schedutil governor
> >> >> > is in use, and update the intel_pstate documentation accordingly.
> >> >> >
> >> >> > Among other things, this allows utilization clamps to be taken
> >> >> > into account, at least to a certain extent, when intel_pstate is
> >> >> > in use and makes it more likely that sufficient capacity for
> >> >> > deadline tasks will be provided.
> >> >> >
> >> >> > After this change, the resulting behavior of an HWP system with
> >> >> > intel_pstate in the passive mode should be close to the behavior
> >> >> > of the analogous non-HWP system with intel_pstate in the passive
> >> >> > mode, except that in the frequency range below the base frequency
> >> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
> >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> >> >> > the floor P-state set by intel_pstate with or without hardware
> >> >> > coordination of P-states among CPUs in the same package.
> >> >> >
> >> >> > Also note that the setting of the HWP floor may not be taken into
> >> >> > account by the processor in the following cases:
> >> >> >
> >> >> >  * For the HWP floor in the range of P-states above the base
> >> >> >frequency, referred to as the turbo range, the processor has a
> >> >> >license to choose any P-state from that range, either below or
> >> >> >above the HWP floor, just like a non-HWP processor in the case
> >> >> >when the target P-state falls into the turbo range.
> >> >> >
> >> >> >  * If P-states of the CPUs in the same package are coordinated
> >> >> >at the hardware level, the processor may choose a P-state
> >> >> >above the HWP floor, just like a non-HWP processor in the
> >> >> >analogous case.
> >> >> >
> >> >> > With this change applied, intel_pstate in the passive mode
> >> >> > assumes complete control over the HWP request MSR and concurrent
> >> >> > changes of that MSR (eg. via the direct MSR access interface) are
> >> >> > overridden by it.
> >> >> >
> >> >> > Signed-off-by: Rafael J. Wysocki 
> >> >> > ---
> >> >> >
> >> >> > This basically unifies the passive mode behavior of intel_pstate for 
> >> >> > systems
> >> >> > with and without HWP enabled.  The only case in which there is a 
> >> >> > difference
> >> >> > between the two (after this patch) is below the turbo range, where 
> >> >> > the HWP
> >> >> > algorithm can go above the floor regardless of whether or not
> >> >> > P-states are coordinated package-wide (this means it is mostly on
> >> >> > systems with per-core P-states that the difference can be somewhat
> >> >> > visible).
> >> >> >
> >> >> > Since the passive mode hasn't worked with HWP at all, and it is not
> >> >> > going to be
> >> >> > the default for HWP systems anyway, I don't see any drawbacks related 
> >> >> > to making
> >> >> > this change, so I would consider this as 5.9 material unless there 
> >> >> > are any
> >> >> > serious objections.
> >> >> >
> >> >> > Thanks!
> >> >> >
> >> >> > ---
> >> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
> >> >> >  drivers/cpufreq/intel_pstate.c|  141 
> >> >> > --
> >> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
> >> >> >
> >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> >> >> > ===
> >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> >> >> > @@ -36,6 +36,7 @@
> >> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
> >> >> >
> >> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
> >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
> >> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
> >> >> >
> >> >> >  #ifdef CONFIG_ACPI
> >> >> > @@ -222,6 +223,7 @@ struct global_params {
> >> >> >   *   preference/bias
> >> >> >   * @epp_saved:   Saved EPP/EPB during system suspend or 
> >> >> > CPU offline
> >> >> >   *   operation
> >> >> > + * @epp_cached   Cached HWP energy-performance 
> >> >> > preference value
> >> >> >   * @hwp_req_cached:  Cached value of the 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-19 Thread Rafael J. Wysocki
Hi Doug,

On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies  wrote:
>
> Hi Rafael,
>
> Thank you for your reply.
>
> On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies  wrote:
> >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> >> >
> >> > From: Rafael J. Wysocki 
> >> ...
> >> > Since the passive mode hasn't worked with HWP at all, and it is not 
> >> > going to be
> >> > the default for HWP systems anyway, I don't see any drawbacks related to 
> >> > making
> >> > this change, so I would consider this as 5.9 material unless there are 
> >> > any
> >> > serious objections.
> >>
> >> Good point.
>
> Actually, for those users that default to passive mode upon boot,
> this would mean they would find themselves using this.
> Also, it isn't obvious, from the typical "what driver and what governor"
> inquiry.

So the change in behavior is that after this patch
intel_pstate=passive doesn't imply no_hwp any more.

That's a very minor difference though and I'm not aware of any adverse
effects it can cause on HWP systems anyway.

The "what governor" is straightforward in the passive mode: that's
whatever cpufreq governor has been selected.

The driver is "intel_cpufreq" which means that the processor is
requested to run at a frequency selected by the governor or higher,
unless in the turbo range.  This works similarly in both the HWP and
non-HWP cases, except that in the HWP case it is possible to adjust
the EPP (through the additional sysfs knob) and the base frequency is
exported (the latter two things can be used to distinguish between the
two cases just fine IMO).
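
A minimal sketch of that inquiry, assuming the sysfs paths below (cpu0 as
a representative): the base_frequency and energy_performance_preference
attributes are only exposed with HWP, so probing them distinguishes the
two passive-mode cases:

    #include <stdio.h>

    static int attr_exists(const char *path)
    {
            FILE *f = fopen(path, "r");

            if (!f)
                    return 0;
            fclose(f);
            return 1;
    }

    int main(void)
    {
            const char *dir = "/sys/devices/system/cpu/cpu0/cpufreq/";
            char path[256];

            snprintf(path, sizeof(path), "%sbase_frequency", dir);
            int hwp = attr_exists(path);

            snprintf(path, sizeof(path), "%senergy_performance_preference", dir);
            hwp = hwp && attr_exists(path);

            puts(hwp ? "passive mode with HWP" : "no HWP (or another driver)");
            return 0;
    }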

> >> Some of the tests I do involve labour intensive post processing of data.
> >> I want to automate some of that work, and it will take time.
> >> We might be into the 5.9-rc series before I have detailed feedback.
> >>
> >> However, so far:
> >>
> >> Inverse impulse response test [1]:
> >>
> >> High level test, i5-9600K, HWP-passive (this patch), ondemand:
> >> 3101 tests. 0 failures. (GOOD)
> >>
> >> From [1], re-stated:
> >> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. 
> >> > (HWP-active - powersave)
> >> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 
> >> > failures.
> >>
> >> My version of that cool Alexander named pipe test [2] serialized workflow:
> >>
> >> HWP-passive (this patch), performance: PASS.
> >>
> >> From [2], re-stated, and also re-tested.
> >> HWP-disabled passive - performance: FAIL.
> >
> > But I'm not quite sure how this is related to this patch?
>
> It isn't. The point being that it is different.

It is different, but kind of in a positive way IMO.

> But yes, that failure is because of our other discussion [3].

OK

> >
> > This test would still fail without the patch if the kernel was started
> > with intel_pstate=passive in the kernel command line, wouldn't it.
>
> Yes.

OK

Thanks!


RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-17 Thread Doug Smythies
Hi Rafael,

Thank you for your reply.

On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies  wrote:
>> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
>> >
>> > From: Rafael J. Wysocki 
>> ...
>> > Since the passive mode hasn't worked with HWP at all, and it is not going 
>> > to be
>> > the default for HWP systems anyway, I don't see any drawbacks related to 
>> > making
>> > this change, so I would consider this as 5.9 material unless there are any
>> > serious objections.
>>
>> Good point.

Actually, for those users that default to passive mode upon boot,
this would mean they would find themselves using this.
Also, it isn't obvious, from the typical "what driver and what governor"
inquiry.

>> Some of the tests I do involve labour intensive post processing of data.
>> I want to automate some of that work, and it will take time.
>> We might be into the 5.9-rc series before I have detailed feedback.
>>
>> However, so far:
>>
>> Inverse impulse response test [1]:
>>
>> High level test, i5-9600K, HWP-passive (this patch), ondemand:
>> 3101 tests. 0 failures. (GOOD)
>>
>> From [1], re-stated:
>> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. 
>> > (HWP-active - powersave)
>> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 
>> > failures.
>>
>> My version of that cool Alexander named pipe test [2] serialized workflow:
>>
>> HWP-passive (this patch), performance: PASS.
>>
>> From [2], re-stated, and also re-tested.
>> HWP-disabled passive - performance: FAIL.
> 
> But I'm not quite sure how this is related to this patch?

It isn't. The point being that it is different.
But yes, that failure is because of our other discussion [3].

> 
> This test would still fail without the patch if the kernel was started
> with intel_pstate=passive in the kernel command line, wouldn't it.

Yes.

> 
> > Although, I believe the issue to be EPB management, [3].
> 
> Well, that's kind of unexpected.
> 
> If this really is the case, then it looks like the effect of the EPB
> setting on the processor is different depending on whether or not HWP
> is enabled.
> 
>> And yes, I did see the reply to [3] that came earlier,
>> And have now re-done the test, with the referenced patch added.
>> It still is FAIL. I reply to the [3] thread, eventually.
>>
>> [1] https://marc.info/?l=linux-pm=159354421400342=2
>> [2] https://marc.info/?l=linux-pm=159155067328641=2
>> [3] https://marc.info/?l=linux-pm=159438804230744=2
>>
>> Kernel:
>>
>> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP 
>> if EPP is not supported
>> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
>> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference 
>> value
>> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
>> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
>> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 
>> 5.8-rc5
>>
>> Rules for this work:
>>
>> . never use x86_energy_perf_policy.
>> . For HWP disabled: never change from active to passive or vice versa, but 
>> rather do it via boot.
>> . after boot always check and reset the various power limit log bits that 
>> are set.
>> . never compile the kernel (well, until after any tests), which will set 
>> those bits again.
>> . never run prime95 high heat torture test, which will set those bits again.
>> . try to never do anything else that will set those bits again.
>>
>> To be clear, I do allow changing governors within the context of the above 
>> rules.
> 
> Thanks for the feedback!



Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-16 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez  
> wrote:
>>
>> "Rafael J. Wysocki"  writes:
>>
>> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez  
>> > wrote:
>> >>
>> >> "Rafael J. Wysocki"  writes:
>> >>
>> >> > From: Rafael J. Wysocki 
>> >> >
>> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> >> > make it set the HWP minimum performance limit (HWP floor) to the
>> >> > P-state value given by the target frequency supplied by the cpufreq
>> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> >> > from working against each other, at least when the schedutil governor
>> >> > is in use, and update the intel_pstate documentation accordingly.
>> >> >
>> >> > Among other things, this allows utilization clamps to be taken
>> >> > into account, at least to a certain extent, when intel_pstate is
>> >> > in use and makes it more likely that sufficient capacity for
>> >> > deadline tasks will be provided.
>> >> >
>> >> > After this change, the resulting behavior of an HWP system with
>> >> > intel_pstate in the passive mode should be close to the behavior
>> >> > of the analogous non-HWP system with intel_pstate in the passive
>> >> > mode, except that in the frequency range below the base frequency
>> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
>> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> >> > the floor P-state set by intel_pstate with or without hardware
>> >> > coordination of P-states among CPUs in the same package.
>> >> >
>> >> > Also note that the setting of the HWP floor may not be taken into
>> >> > account by the processor in the following cases:
>> >> >
>> >> >  * For the HWP floor in the range of P-states above the base
>> >> >frequency, referred to as the turbo range, the processor has a
>> >> >license to choose any P-state from that range, either below or
>> >> >above the HWP floor, just like a non-HWP processor in the case
>> >> >when the target P-state falls into the turbo range.
>> >> >
>> >> >  * If P-states of the CPUs in the same package are coordinated
>> >> >at the hardware level, the processor may choose a P-state
>> >> >above the HWP floor, just like a non-HWP processor in the
>> >> >analogous case.
>> >> >
>> >> > With this change applied, intel_pstate in the passive mode
>> >> > assumes complete control over the HWP request MSR and concurrent
>> >> > changes of that MSR (eg. via the direct MSR access interface) are
>> >> > overridden by it.
>> >> >
>> >> > Signed-off-by: Rafael J. Wysocki 
>> >> > ---
>> >> >
>> >> > This basically unifies the passive mode behavior of intel_pstate for 
>> >> > systems
>> >> > with and without HWP enabled.  The only case in which there is a 
>> >> > difference
>> >> > between the two (after this patch) is below the turbo range, where the 
>> >> > HWP
>> >> > algorithm can go above the floor regardless of whether or not
>> >> > P-states are coordinated package-wide (this means it is mostly on
>> >> > systems with per-core P-states that the difference can be somewhat
>> >> > visible).
>> >> >
>> >> > Since the passive mode hasn't worked with HWP at all, and it is not
>> >> > going to be
>> >> > the default for HWP systems anyway, I don't see any drawbacks 
>> >> > to making
>> >> > this change, so I would consider this as 5.9 material unless there are 
>> >> > any
>> >> > serious objections.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > ---
>> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
>> >> >  drivers/cpufreq/intel_pstate.c|  141 
>> >> > --
>> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >> >
>> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> >> > ===
>> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> >> > @@ -36,6 +36,7 @@
>> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
>> >> >
>> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
>> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
>> >> >
>> >> >  #ifdef CONFIG_ACPI
>> >> > @@ -222,6 +223,7 @@ struct global_params {
>> >> >   *   preference/bias
>> >> >   * @epp_saved:   Saved EPP/EPB during system suspend or 
>> >> > CPU offline
>> >> >   *   operation
>> >> > + * @epp_cached   Cached HWP energy-performance preference 
>> >> > value
>> >> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
>> >> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
>> >> >   * @last_io_update:  Last time when IO wake flag was set
>> >> > @@ -259,6 +261,7 @@ struct cpudata {
>> >> >   s16 epp_policy;
>> >> >   s16 epp_default;
>> >> >   

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-16 Thread Rafael J. Wysocki
On Thu, Jul 16, 2020 at 3:14 AM Srinivas Pandruvada
 wrote:
>
> On Wed, 2020-07-15 at 14:35 -0700, Francisco Jerez wrote:
> > "Rafael J. Wysocki"  writes:
> >
> > > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <
> > > curroje...@riseup.net> wrote:
> > > > "Rafael J. Wysocki"  writes:
> > > >
> > > > > From: Rafael J. Wysocki 
> > > > >
>
> [...]
>
> > I don't think that's accurate.  I've looked at hundreds of traces
> > while
> > my series [1] was in control of HWP_REQ_MAX and I've never seen an
> > excursion above the maximum HWP_REQ_MAX control specified by it
> > within a
> > given P-state domain, even while that maximum specified was well into
> > the turbo range.  So, yeah, I agree that HWP_REQ_MAX is nothing like
> > a
> > hard limit, particularly when multiple threads are running on the
> > same
> > clock domain, but the processor will still make its best effort to
> > limit
> > the clock frequency to the maximum of the requested maximums, even if
> > it
> > happens to be within the turbo range.  That doesn't make it useless.
> > The exact same thing can be said about controlling HWP_REQ_MIN as
> > you're
> > doing now in this revision of your patch, BTW.
> >
> > If you don't believe me here is the turbostat sample with maximum
> > Bzy_MHz I get on the computer I'm sitting on right now while
> > compiling a
> > kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
> >
> > > Core  CPU  Avg_MHz  Busy%  Bzy_MHz  HWP_REQ     PkgWatt  CorWatt
> > > -     -    757      27.03  2800     0x          7.13     4.90
> > > 0     0    2794     99.77  2800     0x80001c04  7.13     4.90
> > > 0     2    83       2.98   2800     0x80001c04
> > > 1     1    73       2.60   2800     0x80001c04
> > > 1     3    78       2.79   2800     0x80001c04
> >
> > With the default HWP_REQUEST:
> >
> > > Core  CPU  Avg_MHz  Busy%  Bzy_MHz  HWP_REQ     PkgWatt  CorWatt
> > > -     -    814      27.00  3015     0x          8.49     6.18
> > > 0     0    2968     98.24  3021     0x80001f04  8.49     6.18
> > > 0     2    84       2.81   2982     0x80001f04
> > > 1     1    99       3.34   2961     0x80001f04
> > > 1     3    105      3.60   2921     0x80001f04
>
> Correct. In HWP mode it is possible to conditionally lower the limit in
> the turbo region. In legacy mode you can't, because of the turbo
> activation ratio.
>
> But what we don't want is to set max and min perf and use them like
> "desired", running at a fixed P-state and overriding HWP, or to limit
> the range so much that HWP can't do any meaningful selection.

That's a good point too IMO.


Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-16 Thread Rafael J. Wysocki
On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez  wrote:
>
> "Rafael J. Wysocki"  writes:
>
> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez  
> > wrote:
> >>
> >> "Rafael J. Wysocki"  writes:
> >>
> >> > From: Rafael J. Wysocki 
> >> >
> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
> >> > make it set the HWP minimum performance limit (HWP floor) to the
> >> > P-state value given by the target frequency supplied by the cpufreq
> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> >> > from working against each other, at least when the schedutil governor
> >> > is in use, and update the intel_pstate documentation accordingly.
> >> >
> >> > Among other things, this allows utilization clamps to be taken
> >> > into account, at least to a certain extent, when intel_pstate is
> >> > in use and makes it more likely that sufficient capacity for
> >> > deadline tasks will be provided.
> >> >
> >> > After this change, the resulting behavior of an HWP system with
> >> > intel_pstate in the passive mode should be close to the behavior
> >> > of the analogous non-HWP system with intel_pstate in the passive
> >> > mode, except that in the frequency range below the base frequency
> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> >> > the floor P-state set by intel_pstate with or without hardware
> >> > coordination of P-states among CPUs in the same package.
> >> >
> >> > Also note that the setting of the HWP floor may not be taken into
> >> > account by the processor in the following cases:
> >> >
> >> >  * For the HWP floor in the range of P-states above the base
> >> >frequency, referred to as the turbo range, the processor has a
> >> >license to choose any P-state from that range, either below or
> >> >above the HWP floor, just like a non-HWP processor in the case
> >> >when the target P-state falls into the turbo range.
> >> >
> >> >  * If P-states of the CPUs in the same package are coordinated
> >> >at the hardware level, the processor may choose a P-state
> >> >above the HWP floor, just like a non-HWP processor in the
> >> >analogous case.
> >> >
> >> > With this change applied, intel_pstate in the passive mode
> >> > assumes complete control over the HWP request MSR and concurrent
> >> > changes of that MSR (eg. via the direct MSR access interface) are
> >> > overridden by it.
> >> >
> >> > Signed-off-by: Rafael J. Wysocki 
> >> > ---
> >> >
> >> > This basically unifies the passive mode behavior of intel_pstate for 
> >> > systems
> >> > with and without HWP enabled.  The only case in which there is a 
> >> > difference
> >> > between the two (after this patch) is below the turbo range, where the 
> >> > HWP
> >> > algorithm can go above the floor regardless of whether or not
> >> > P-states are coordinated package-wide (this means it is mostly on
> >> > systems with per-core P-states that the difference can be somewhat
> >> > visible).
> >> >
> >> > Since the passive mode hasn't worked with HWP at all, and it is not
> >> > going to be
> >> > the default for HWP systems anyway, I don't see any drawbacks related to 
> >> > making
> >> > this change, so I would consider this as 5.9 material unless there are 
> >> > any
> >> > serious objections.
> >> >
> >> > Thanks!
> >> >
> >> > ---
> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
> >> >  drivers/cpufreq/intel_pstate.c|  141 
> >> > --
> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
> >> >
> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> >> > ===
> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> >> > @@ -36,6 +36,7 @@
> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
> >> >
> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
> >> >
> >> >  #ifdef CONFIG_ACPI
> >> > @@ -222,6 +223,7 @@ struct global_params {
> >> >   *   preference/bias
> >> >   * @epp_saved:   Saved EPP/EPB during system suspend or CPU 
> >> > offline
> >> >   *   operation
> >> > + * @epp_cached   Cached HWP energy-performance preference 
> >> > value
> >> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
> >> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
> >> >   * @last_io_update:  Last time when IO wake flag was set
> >> > @@ -259,6 +261,7 @@ struct cpudata {
> >> >   s16 epp_policy;
> >> >   s16 epp_default;
> >> >   s16 epp_saved;
> >> > + s16 epp_cached;
> >> >   u64 hwp_req_cached;
> >> >   u64 hwp_cap_cached;
> >> >   u64 last_io_update;
> 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-16 Thread Rafael J. Wysocki
On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies  wrote:
>
> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> >
> > From: Rafael J. Wysocki 
> ...
> > Since the passive mode hasn't worked with HWP at all, and it is not going to be
> > the default for HWP systems anyway, I don't see any drawbacks related to 
> > making
> > this change, so I would consider this as 5.9 material unless there are any
> > serious objections.
>
> Good point.
> Some of the tests I do involve labour intensive post processing of data.
> I want to automate some of that work, and it will take time.
> We might be into the 5.9-rc series before I have detailed feedback.
>
> However, so far:
>
> Inverse impulse response test [1]:
>
> High level test, i5-9600K, HWP-passive (this patch), ondemand:
> 3101 tests. 0 failures. (GOOD)
>
> From [1], re-stated:
> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. 
> > (HWP-active - powersave)
> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.
>
> My version of that cool Alexander named pipe test [2] serialized workflow:
>
> HWP-passive (this patch), performance: PASS.
>
> From [2], re-stated, and also re-tested.
> HWP-disabled passive - performance: FAIL.

But I'm not quite sure how this is related to this patch?

This test would still fail without the patch if the kernel was started
with intel_pstate=passive in the kernel command line, wouldn't it.

> Although, I believe the issue to be EPB management, [3].

Well, that's kind of unexpected.

If this really is the case, then it looks like the effect of the EPB
setting on the processor is different depending on whether or not HWP
is enabled.

> And yes, I did see the reply to [3] that came earlier,
> And have now re-done the test, with the referenced patch added.
> It still is FAIL. I reply to the [3] thread, eventually.
>
> [1] https://marc.info/?l=linux-pm=159354421400342=2
> [2] https://marc.info/?l=linux-pm=159155067328641=2
> [3] https://marc.info/?l=linux-pm=159438804230744=2
>
> Kernel:
>
> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP 
> if EPP is not supported
> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference 
> value
> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5
>
> Rules for this work:
>
> . never use x86_energy_perf_policy.
> . For HWP disabled: never change from active to passive or vice versa, but 
> rather do it via boot.
> . after boot always check and reset the various power limit log bits that are 
> set.
> . never compile the kernel (well, until after any tests), which will set 
> those bits again.
> . never run prime95 high heat torture test, which will set those bits again.
> . try to never do anything else that will set those bits again.
>
> To be clear, I do allow changing governors within the context of the above 
> rules.

Thanks for the feedback!


Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-15 Thread Srinivas Pandruvada
On Wed, 2020-07-15 at 14:35 -0700, Francisco Jerez wrote:
> "Rafael J. Wysocki"  writes:
> 
> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <
> > curroje...@riseup.net> wrote:
> > > "Rafael J. Wysocki"  writes:
> > > 
> > > > From: Rafael J. Wysocki 
> > > > 

[...]

> I don't think that's accurate.  I've looked at hundreds of traces
> while
> my series [1] was in control of HWP_REQ_MAX and I've never seen an
> excursion above the maximum HWP_REQ_MAX control specified by it
> within a
> given P-state domain, even while that maximum specified was well into
> the turbo range.  So, yeah, I agree that HWP_REQ_MAX is nothing like
> a
> hard limit, particularly when multiple threads are running on the
> same
> clock domain, but the processor will still make its best effort to
> limit
> the clock frequency to the maximum of the requested maximums, even if
> it
> happens to be within the turbo range.  That doesn't make it useless.
> The exact same thing can be said about controlling HWP_REQ_MIN as
> you're
> doing now in this revision of your patch, BTW.
> 
> If you don't believe me here is the turbostat sample with maximum
> Bzy_MHz I get on the computer I'm sitting on right now while
> compiling a
> kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
> 
> > Core  CPU  Avg_MHz  Busy%  Bzy_MHz  HWP_REQ     PkgWatt  CorWatt
> > -     -    757      27.03  2800     0x          7.13     4.90
> > 0     0    2794     99.77  2800     0x80001c04  7.13     4.90
> > 0     2    83       2.98   2800     0x80001c04
> > 1     1    73       2.60   2800     0x80001c04
> > 1     3    78       2.79   2800     0x80001c04
> 
> With the default HWP_REQUEST:
> 
> > Core  CPU  Avg_MHz  Busy%  Bzy_MHz  HWP_REQ     PkgWatt  CorWatt
> > -     -    814      27.00  3015     0x          8.49     6.18
> > 0     0    2968     98.24  3021     0x80001f04  8.49     6.18
> > 0     2    84       2.81   2982     0x80001f04
> > 1     1    99       3.34   2961     0x80001f04
> > 1     3    105      3.60   2921     0x80001f04

Correct. In HWP mode it is possible to conditionally lower the limit in
the turbo region. In legacy mode you can't, because of the turbo
activation ratio.

But what we don't want is to set max and min perf and use them like
"desired", running at a fixed P-state and overriding HWP, or to limit
the range so much that HWP can't do any meaningful selection.
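
For reference, the HWP_REQ values in the turbostat output above pack all
of the per-CPU request fields into the one IA32_HWP_REQUEST MSR.  A small
sketch decoding 0x80001c04 per the documented field layout (bits 7:0
minimum, 15:8 maximum, 23:16 desired, 31:24 EPP) -- i.e. min 0x04,
max 0x1c, desired 0x00 (autonomous selection), EPP 0x80:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t req = 0x80001c04;  /* from the HWP_REQ_MAX experiment */

            printf("min perf: 0x%02x\n", (unsigned)(req & 0xff));
            printf("max perf: 0x%02x\n", (unsigned)((req >> 8) & 0xff));
            printf("desired:  0x%02x\n", (unsigned)((req >> 16) & 0xff));
            printf("EPP:      0x%02x\n", (unsigned)((req >> 24) & 0xff));
            return 0;
    }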

> > Generally, I'm not quite convinced that limiting the max frequency
> > is
> > really the right choice for controlling the processor's power draw
> > on
> > the systems in question.  There are other ways to do that, which in
> > theory should be more effective.  I mentioned RAPL somewhere in
> > this
> > context and there's the GUC firmware too.
> 
> I feel like we've had that conversation before and it's somewhat
> off-topic so I'll keep it short: Yes, in theory RAPL is more
> effective
> than HWP_REQ_MAX as a mechanism to limit the absolute power
> consumption
> of the processor package, but that's not the purpose of [1], its
> purpose
> is setting a lower limit to the energy efficiency of the processor
> when
> the maximum usable CPU frequency is known (due to the existence of an
> IO
> device bottleneck) -- And if the maximum usable CPU frequency is the
> information we have at hand, controlling the maximum CPU frequency
> directly is optimal, rather than trying to find the RAPL constraint
> that
> achieves the same average frequency by trial and error.  Also, in
> theory,
> even if you had an oracle to tell you what the appropriate RAPL
> constraint is, the result would necessarily be more energy-
> inefficient
> than controlling the maximum CPU frequency directly, since you're
> giving
> the processor additional freedom to run at frequencies above the one
> you
> want to average, which is guaranteed to be more energy-inefficient
> than
> running at that fixed frequency, assuming we are in the region of
> convexity of the processor's power curve.
> 
> Anyway, if you still have some disagreement on the theoretical
> details
> you're more than welcome to bring up the matter on the other thread
> [1],
> or accept the invitation for a presentation I sent you months ago...
> ;)
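
One way to make the convexity argument quoted above precise is Jensen's
inequality; a sketch, assuming P(f) is convex over the operating range:

    \text{If } \bar f = \mathbb{E}[f(t)] \text{ and } P \text{ is convex, then }
    \mathbb{E}[P(f(t))] \;\ge\; P(\bar f),

i.e. any distribution of frequencies averaging \bar f draws at least as
much power as running steadily at \bar f, which is why capping the
frequency directly is claimed to be at least as efficient as a RAPL
constraint tuned to produce the same average frequency.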



Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-15 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez  wrote:
>>
>> "Rafael J. Wysocki"  writes:
>>
>> > From: Rafael J. Wysocki 
>> >
>> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> > make it set the HWP minimum performance limit (HWP floor) to the
>> > P-state value given by the target frequency supplied by the cpufreq
>> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> > from working against each other, at least when the schedutil governor
>> > is in use, and update the intel_pstate documentation accordingly.
>> >
>> > Among other things, this allows utilization clamps to be taken
>> > into account, at least to a certain extent, when intel_pstate is
>> > in use and makes it more likely that sufficient capacity for
>> > deadline tasks will be provided.
>> >
>> > After this change, the resulting behavior of an HWP system with
>> > intel_pstate in the passive mode should be close to the behavior
>> > of the analogous non-HWP system with intel_pstate in the passive
>> > mode, except that in the frequency range below the base frequency
>> > (ie. the frequency returned by the base_frequency cpufreq attribute
>> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> > the floor P-state set by intel_pstate with or without hardware
>> > coordination of P-states among CPUs in the same package.
>> >
>> > Also note that the setting of the HWP floor may not be taken into
>> > account by the processor in the following cases:
>> >
>> >  * For the HWP floor in the range of P-states above the base
>> >frequency, referred to as the turbo range, the processor has a
>> >license to choose any P-state from that range, either below or
>> >above the HWP floor, just like a non-HWP processor in the case
>> >when the target P-state falls into the turbo range.
>> >
>> >  * If P-states of the CPUs in the same package are coordinated
>> >at the hardware level, the processor may choose a P-state
>> >above the HWP floor, just like a non-HWP processor in the
>> >analogous case.
>> >
>> > With this change applied, intel_pstate in the passive mode
>> > assumes complete control over the HWP request MSR and concurrent
>> > changes of that MSR (eg. via the direct MSR access interface) are
>> > overridden by it.
>> >
>> > Signed-off-by: Rafael J. Wysocki 
>> > ---
>> >
>> > This basically unifies the passive mode behavior of intel_pstate for 
>> > systems
>> > with and without HWP enabled.  The only case in which there is a difference
>> > between the two (after this patch) is below the turbo range, where the HWP
>> > algorithm can go above the floor regardless of whether or not P-states
>> > are coordinated package-wide (this means it is mostly on systems with
>> > per-core P-states that the difference can be somewhat visible).
>> >
>> > Since the passive mode hasn't worked with HWP at all, and it is not going
>> > to be
>> > the default for HWP systems anyway, I don't see any drawbacks related to 
>> > making
>> > this change, so I would consider this as 5.9 material unless there are any
>> > serious objections.
>> >
>> > Thanks!
>> >
>> > ---
>> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
>> >  drivers/cpufreq/intel_pstate.c|  141 
>> > --
>> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >
>> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> > ===
>> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> > @@ -36,6 +36,7 @@
>> >  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
>> >
>> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
>> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>> >  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
>> >
>> >  #ifdef CONFIG_ACPI
>> > @@ -222,6 +223,7 @@ struct global_params {
>> >   *   preference/bias
>> >   * @epp_saved:   Saved EPP/EPB during system suspend or CPU 
>> > offline
>> >   *   operation
>> > + * @epp_cached   Cached HWP energy-performance preference 
>> > value
>> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
>> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
>> >   * @last_io_update:  Last time when IO wake flag was set
>> > @@ -259,6 +261,7 @@ struct cpudata {
>> >   s16 epp_policy;
>> >   s16 epp_default;
>> >   s16 epp_saved;
>> > + s16 epp_cached;
>> >   u64 hwp_req_cached;
>> >   u64 hwp_cap_cached;
>> >   u64 last_io_update;
>> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>> >
>> >   value |= (u64)epp << 24;
>> >   ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
>> > +
>> > + WRITE_ONCE(cpu_data->epp_cached, epp);
>>
>> Why introduce a new EPP cache 

RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-15 Thread Doug Smythies
On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> 
> From: Rafael J. Wysocki 
...
> Since the passive mode hasn't worked with HWP at all, and it is not going to be
> the default for HWP systems anyway, I don't see any drawbacks related to 
> making
> this change, so I would consider this as 5.9 material unless there are any
> serious objections.

Good point.
Some of the tests I do involve labour intensive post processing of data.
I want to automate some of that work, and it will take time.
We might be into the 5.9-rc series before I have detailed feedback.

However, so far:

Inverse impulse response test [1]:

High level test, i5-9600K, HWP-passive (this patch), ondemand:
3101 tests. 0 failures. (GOOD)

From [1], re-stated:
> . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active 
> - powersave)
> . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.

My version of that cool Alexander named pipe test [2] serialized workflow:

HWP-passive (this patch), performance: PASS.

From [2], re-stated, and also re-tested.
HWP-disabled passive - performance: FAIL.
Although, I believe the issue to be EPB management, [3].

And yes, I did see the reply to [3] that came earlier,
And have now re-done the test, with the referenced patch added.
It still is FAIL. I reply to the [3] thread, eventually.

[1] https://marc.info/?l=linux-pm=159354421400342=2
[2] https://marc.info/?l=linux-pm=159155067328641=2
[3] https://marc.info/?l=linux-pm=159438804230744=2

Kernel:

b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if 
EPP is not supported
063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference 
value
bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5

Rules for this work:

. never use x86_energy_perf_policy.
. For HWP disabled: never change from active to passive or vice versa, but 
rather do it via boot.
. after boot always check and reset the various power limit log bits that are 
set.
. never compile the kernel (well, until after any tests), which will set those 
bits again.
. never run prime95 high heat torture test, which will set those bits again.
. try to never do anything else that will set those bits again.

To be clear, I do allow changing governors within the context of the above 
rules.

... Doug




Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-15 Thread Rafael J. Wysocki
On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez  wrote:
>
> "Rafael J. Wysocki"  writes:
>
> > From: Rafael J. Wysocki 
> >
> > Allow intel_pstate to work in the passive mode with HWP enabled and
> > make it set the HWP minimum performance limit (HWP floor) to the
> > P-state value given by the target frequency supplied by the cpufreq
> > governor, so as to prevent the HWP algorithm and the CPU scheduler
> > from working against each other, at least when the schedutil governor
> > is in use, and update the intel_pstate documentation accordingly.
> >
> > Among other things, this allows utilization clamps to be taken
> > into account, at least to a certain extent, when intel_pstate is
> > in use and makes it more likely that sufficient capacity for
> > deadline tasks will be provided.
> >
> > After this change, the resulting behavior of an HWP system with
> > intel_pstate in the passive mode should be close to the behavior
> > of the analogous non-HWP system with intel_pstate in the passive
> > mode, except that in the frequency range below the base frequency
> > (ie. the frequency returned by the base_frequency cpufreq attribute
> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
> > the floor P-state set by intel_pstate with or without hardware
> > coordination of P-states among CPUs in the same package.
> >
> > Also note that the setting of the HWP floor may not be taken into
> > account by the processor in the following cases:
> >
> >  * For the HWP floor in the range of P-states above the base
> >frequency, referred to as the turbo range, the processor has a
> >license to choose any P-state from that range, either below or
> >above the HWP floor, just like a non-HWP processor in the case
> >when the target P-state falls into the turbo range.
> >
> >  * If P-states of the CPUs in the same package are coordinated
> >at the hardware level, the processor may choose a P-state
> >above the HWP floor, just like a non-HWP processor in the
> >analogous case.
> >
> > With this change applied, intel_pstate in the passive mode
> > assumes complete control over the HWP request MSR and concurrent
> > changes of that MSR (eg. via the direct MSR access interface) are
> > overridden by it.
> >
> > Signed-off-by: Rafael J. Wysocki 
> > ---
> >
> > This basically unifies the passive mode behavior of intel_pstate for systems
> > with and without HWP enabled.  The only case in which there is a difference
> > between the two (after this patch) is below the turbo range, where the HWP
> > algorithm can go above the floor regardless of whether or not P-states are
> > coordinated package-wide (this means it is mostly on systems with per-core
> > P-states that the difference can be somewhat visible).
> >
> > Since the passive mode hasn't worked with HWP at all, and it is not going to be
> > the default for HWP systems anyway, I don't see any drawbacks related to 
> > making
> > this change, so I would consider this as 5.9 material unless there are any
> > serious objections.
> >
> > Thanks!
> >
> > ---
> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
> >  drivers/cpufreq/intel_pstate.c|  141 
> > --
> >  2 files changed, 152 insertions(+), 78 deletions(-)
> >
> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
> > ===
> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
> > @@ -36,6 +36,7 @@
> >  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
> >
> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
> >  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
> >
> >  #ifdef CONFIG_ACPI
> > @@ -222,6 +223,7 @@ struct global_params {
> >   *   preference/bias
> >   * @epp_saved:   Saved EPP/EPB during system suspend or CPU 
> > offline
> >   *   operation
> > + * @epp_cached   Cached HWP energy-performance preference value
> >   * @hwp_req_cached:  Cached value of the last HWP Request MSR
> >   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
> >   * @last_io_update:  Last time when IO wake flag was set
> > @@ -259,6 +261,7 @@ struct cpudata {
> >   s16 epp_policy;
> >   s16 epp_default;
> >   s16 epp_saved;
> > + s16 epp_cached;
> >   u64 hwp_req_cached;
> >   u64 hwp_cap_cached;
> >   u64 last_io_update;
> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
> >
> >   value |= (u64)epp << 24;
> >   ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> > +
> > + WRITE_ONCE(cpu_data->epp_cached, epp);
>
> Why introduce a new EPP cache variable if there is already
> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
> update hwp_req_cached maybe we should fix that 

Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-14 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> From: Rafael J. Wysocki 
>
> Allow intel_pstate to work in the passive mode with HWP enabled and
> make it set the HWP minimum performance limit (HWP floor) to the
> P-state value given by the target frequency supplied by the cpufreq
> governor, so as to prevent the HWP algorithm and the CPU scheduler
> from working against each other, at least when the schedutil governor
> is in use, and update the intel_pstate documentation accordingly.
>
> Among other things, this allows utilization clamps to be taken
> into account, at least to a certain extent, when intel_pstate is
> in use and makes it more likely that sufficient capacity for
> deadline tasks will be provided.
>
> After this change, the resulting behavior of an HWP system with
> intel_pstate in the passive mode should be close to the behavior
> of the analogous non-HWP system with intel_pstate in the passive
> mode, except that in the frequency range below the base frequency
> (ie. the frequency returned by the base_frequency cpufreq attribute
> in sysfs on HWP systems) the HWP algorithm is allowed to go above
> the floor P-state set by intel_pstate with or without hardware
> coordination of P-states among CPUs in the same package.
>
> Also note that the setting of the HWP floor may not be taken into
> account by the processor in the following cases:
>
>  * For the HWP floor in the range of P-states above the base
>frequency, referred to as the turbo range, the processor has a
>license to choose any P-state from that range, either below or
>above the HWP floor, just like a non-HWP processor in the case
>when the target P-state falls into the turbo range.
>
>  * If P-states of the CPUs in the same package are coordinated
>at the hardware level, the processor may choose a P-state
>above the HWP floor, just like a non-HWP processor in the
>analogous case.
>
> With this change applied, intel_pstate in the passive mode
> assumes complete control over the HWP request MSR and concurrent
> changes of that MSR (eg. via the direct MSR access interface) are
> overridden by it.
>
> Signed-off-by: Rafael J. Wysocki 
> ---
>
> This basically unifies the passive mode behavior of intel_pstate for systems
> with and without HWP enabled.  The only case in which there is a difference
> between the two (after this patch) is below the turbo range, where the HWP
> algorithm can go above the floor regardless of whether or not P-states are
> coordinated package-wide (this means it is mostly on systems with per-core
> P-states that the difference can be somewhat visible).
>
> Since the passive mode hasn't worked with HWP at all, and it is not going to be
> the default for HWP systems anyway, I don't see any drawbacks related to 
> making
> this change, so I would consider this as 5.9 material unless there are any
> serious objections.
>
> Thanks!
>
> ---
>  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
>  drivers/cpufreq/intel_pstate.c|  141 
> --
>  2 files changed, 152 insertions(+), 78 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -36,6 +36,7 @@
>  #define INTEL_PSTATE_SAMPLING_INTERVAL   (10 * NSEC_PER_MSEC)
>  
>  #define INTEL_CPUFREQ_TRANSITION_LATENCY 2
> +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP   5000
>  #define INTEL_CPUFREQ_TRANSITION_DELAY   500
>  
>  #ifdef CONFIG_ACPI
> @@ -222,6 +223,7 @@ struct global_params {
>   *   preference/bias
>   * @epp_saved:   Saved EPP/EPB during system suspend or CPU 
> offline
>   *   operation
> + * @epp_cached   Cached HWP energy-performance preference value
>   * @hwp_req_cached:  Cached value of the last HWP Request MSR
>   * @hwp_cap_cached:  Cached value of the last HWP Capabilities MSR
>   * @last_io_update:  Last time when IO wake flag was set
> @@ -259,6 +261,7 @@ struct cpudata {
>   s16 epp_policy;
>   s16 epp_default;
>   s16 epp_saved;
> + s16 epp_cached;
>   u64 hwp_req_cached;
>   u64 hwp_cap_cached;
>   u64 last_io_update;
> @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>  
>   value |= (u64)epp << 24;
>   ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
> +
> + WRITE_ONCE(cpu_data->epp_cached, epp);

Why introduce a new EPP cache variable if there is already
hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
update hwp_req_cached maybe we should fix that instead.  That will save
you a little bit of work in intel_cpufreq_adjust_hwp().
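
The alternative being suggested could look roughly like the fragment
below -- a hypothetical sketch (not the actual fix), assuming the
surrounding intel_pstate_set_energy_pref_index() context, where the EPP
byte is folded into hwp_req_cached instead of a separate cache:

    /* Hypothetical sketch; not the actual patch code. */
    u64 value = READ_ONCE(cpu_data->hwp_req_cached);

    value &= ~GENMASK_ULL(31, 24);          /* bits 31:24 hold the EPP */
    value |= (u64)epp << 24;
    ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
    if (!ret)
            WRITE_ONCE(cpu_data->hwp_req_cached, value);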

>   } else {
>   if (epp == -EINVAL)
>   epp = (pref_index - 1) << 2;
> @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
>   

[PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled

2020-07-14 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

Allow intel_pstate to work in the passive mode with HWP enabled and
make it set the HWP minimum performance limit (HWP floor) to the
P-state value given by the target frequency supplied by the cpufreq
governor, so as to prevent the HWP algorithm and the CPU scheduler
from working against each other, at least when the schedutil governor
is in use, and update the intel_pstate documentation accordingly.

Among other things, this allows utilization clamps to be taken
into account, at least to a certain extent, when intel_pstate is
in use and makes it more likely that sufficient capacity for
deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with
intel_pstate in the passive mode should be close to the behavior
of the analogous non-HWP system with intel_pstate in the passive
mode, except that in the frequency range below the base frequency
(i.e. the frequency returned by the base_frequency cpufreq attribute
in sysfs on HWP systems) the HWP algorithm is allowed to go above
the floor P-state set by intel_pstate with or without hardware
coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into
account by the processor in the following cases:

 * For the HWP floor in the range of P-states above the base
   frequency, referred to as the turbo range, the processor has a
   license to choose any P-state from that range, either below or
   above the HWP floor, just like a non-HWP processor in the case
   when the target P-state falls into the turbo range (see the
   example after this list).

 * If P-states of the CPUs in the same package are coordinated
   at the hardware level, the processor may choose a P-state
   above the HWP floor, just like a non-HWP processor in the
   analogous case.
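
For example (hypothetical figures): on a CPU with a base frequency of
2.0 GHz and a maximum turbo frequency of 4.0 GHz, an HWP floor
corresponding to 3.0 GHz falls into the turbo range, so the processor
is free to run at any P-state in that range, below or above the floor.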

With this change applied, intel_pstate in the passive mode
assumes complete control over the HWP request MSR and concurrent
changes of that MSR (e.g. via the direct MSR access interface) are
overridden by it.
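
The "complete control" above follows from the way updates are made:
the driver rebuilds the MSR value from its own cached copy instead of
reading the register back, so anything written to the MSR behind its
back is overwritten on the next frequency transition.  Schematically
(a simplified sketch using the driver's field names):

	u64 value = READ_ONCE(cpu->hwp_req_cached);	/* not rdmsrl() */

	/* ... update the request fields as needed ... */

	WRITE_ONCE(cpu->hwp_req_cached, value);
	wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);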

Signed-off-by: Rafael J. Wysocki 
---

This basically unifies the passive mode behavior of intel_pstate for systems
with and without HWP enabled.  The only case in which there is a difference
between the two (after this patch) is below the turbo range, where the HWP
algorithm can go above the floor regardless of whether or not P-states are
coordinated package-wide (this means that the difference will mostly be
visible on systems with per-core P-states).

Since the passive mode hasn't worked with HWP at all, and it is not going to be
the default for HWP systems anyway, I don't see any drawbacks related to making
this change, so I would consider this as 5.9 material unless there are any
serious objections.

Thanks!

---
 Documentation/admin-guide/pm/intel_pstate.rst |   89 +++-
 drivers/cpufreq/intel_pstate.c|  141 --
 2 files changed, 152 insertions(+), 78 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC)
 
#define INTEL_CPUFREQ_TRANSITION_LATENCY   20000
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY 500
 
 #ifdef CONFIG_ACPI
@@ -222,6 +223,7 @@ struct global_params {
  * preference/bias
  * @epp_saved: Saved EPP/EPB during system suspend or CPU offline
  * operation
+ * @epp_cached:  Cached HWP energy-performance preference value
  * @hwp_req_cached:Cached value of the last HWP Request MSR
  * @hwp_cap_cached:Cached value of the last HWP Capabilities MSR
  * @last_io_update:Last time when IO wake flag was set
@@ -259,6 +261,7 @@ struct cpudata {
s16 epp_policy;
s16 epp_default;
s16 epp_saved;
+   s16 epp_cached;
u64 hwp_req_cached;
u64 hwp_cap_cached;
u64 last_io_update;
@@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
 
value |= (u64)epp << 24;
ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
+
+   WRITE_ONCE(cpu_data->epp_cached, epp);
} else {
if (epp == -EINVAL)
epp = (pref_index - 1) << 2;
@@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
cpu->epp_default = -EINVAL;
cpu->epp_powersave = -EINVAL;
cpu->epp_saved = -EINVAL;
+   WRITE_ONCE(cpu->epp_cached, -EINVAL);
}
 
cpu = all_cpu_data[cpunum];
@@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-   intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+   if (hwp_active)
+