Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Thursday, August 6, 2020 7:54:47 AM CEST Doug Smythies wrote:
> On 2020.08.03 10:09 Rafael J. Wysocki wrote:
> > On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> > > On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote:
> > > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote:
> > > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > > > >> > From: Rafael J. Wysocki
> > > > > >> ...
> > > > > >> > Since the passive mode hasn't worked with HWP at all, and it is
> > > > > >> > not going to be the default for HWP systems anyway, I don't see
> > > > > >> > any drawbacks related to making this change, so I would consider
> > > > > >> > this as 5.9 material unless there are any serious objections.
> > > > > >>
> > > > > >> Good point.
> > > > >
> > > > > Actually, for those users that default to passive mode upon boot,
> > > > > this would mean they would find themselves using this.
> > > > > Also, it isn't obvious, from the typical "what driver and what
> > > > > governor" inquiry.
> > > >
> > > > So the change in behavior is that after this patch
> > > > intel_pstate=passive doesn't imply no_hwp any more.
> > > >
> > > > That's a very minor difference though and I'm not aware of any adverse
> > > > effects it can cause on HWP systems anyway.
> > >
> > > My point was that it will now default to something where
> > > testing has not been completed.
> > >
> > > > The "what governor" is straightforward in the passive mode: that's
> > > > whatever cpufreq governor has been selected.
> > >
> > > I think you might have missed my point.
> > > From the normal methods of inquiry one does not know
> > > if HWP is being used or not. Why? Because with
> > > or without HWP one gets the same answers under:
> > >
> > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> > > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> >
> > Yes, but this is also the case in the active mode, isn't it?
>
> Yes, fair enough.
> But we aren't changing what it means by default
> between kernel 5.8 and 5.9-rc1.

No, we aren't.  The only (expected) change is when booting with
intel_pstate=passive and without intel_pstate=no_hwp in the command line.

Which should be easy enough to address by adding intel_pstate=no_hwp to the
command line in 5.9-rc1 and later (to achieve the same behavior after a
fresh boot).

Cheers!
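[Editorial note: the following script is not part of the original thread. Since neither scaling_driver nor scaling_governor distinguishes the HWP and non-HWP cases, a small sketch like this can make the inquiry explicit. It only reads the standard intel_pstate sysfs status file and /proc/cpuinfo, and degrades gracefully when they are absent.]

```shell
#!/bin/sh
# Report the intel_pstate operation mode and whether the CPU advertises HWP.
# The "hwp" cpuinfo flag only means the hardware supports HWP, not that the
# driver is currently using it; booting with intel_pstate=no_hwp keeps the
# driver from enabling HWP even on HWP-capable parts.
mode=unavailable
status_file=/sys/devices/system/cpu/intel_pstate/status
if [ -r "$status_file" ]; then
    mode=$(cat "$status_file")
fi
echo "intel_pstate mode: $mode"

hwp=not-reported
if grep -qw hwp /proc/cpuinfo 2>/dev/null; then
    hwp=supported
fi
echo "hwp cpu flag: $hwp"
```

On an HWP-capable system booted with intel_pstate=passive on 5.9-rc1 this would report "passive" while the hwp flag stays "supported", which is exactly the ambiguity discussed above.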
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.08.03 10:09 Rafael J. Wysocki wrote:
> On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> > On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote:
> > > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote:
> > > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > > >> > From: Rafael J. Wysocki
> > > > >> ...
> > > > >> > Since the passive mode hasn't worked with HWP at all, and it is
> > > > >> > not going to be the default for HWP systems anyway, I don't see
> > > > >> > any drawbacks related to making this change, so I would consider
> > > > >> > this as 5.9 material unless there are any serious objections.
> > > > >>
> > > > >> Good point.
> > > >
> > > > Actually, for those users that default to passive mode upon boot,
> > > > this would mean they would find themselves using this.
> > > > Also, it isn't obvious, from the typical "what driver and what
> > > > governor" inquiry.
> > >
> > > So the change in behavior is that after this patch
> > > intel_pstate=passive doesn't imply no_hwp any more.
> > >
> > > That's a very minor difference though and I'm not aware of any adverse
> > > effects it can cause on HWP systems anyway.
> >
> > My point was that it will now default to something where
> > testing has not been completed.
> >
> > > The "what governor" is straightforward in the passive mode: that's
> > > whatever cpufreq governor has been selected.
> >
> > I think you might have missed my point.
> > From the normal methods of inquiry one does not know
> > if HWP is being used or not. Why? Because with
> > or without HWP one gets the same answers under:
> >
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>
> Yes, but this is also the case in the active mode, isn't it?

Yes, fair enough.
But we aren't changing what it means by default
between kernel 5.8 and 5.9-rc1.

... Doug
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Sunday, August 2, 2020 5:17:39 PM CEST Doug Smythies wrote:
> Hi Rafael,
>
> On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> > On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote:
> > > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote:
> > > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > > >> > From: Rafael J. Wysocki
> > > >> ...
> > > >> > Since the passive mode hasn't worked with HWP at all, and it is not
> > > >> > going to be the default for HWP systems anyway, I don't see any
> > > >> > drawbacks related to making this change, so I would consider this
> > > >> > as 5.9 material unless there are any serious objections.
> > > >>
> > > >> Good point.
> > >
> > > Actually, for those users that default to passive mode upon boot,
> > > this would mean they would find themselves using this.
> > > Also, it isn't obvious, from the typical "what driver and what governor"
> > > inquiry.
> >
> > So the change in behavior is that after this patch
> > intel_pstate=passive doesn't imply no_hwp any more.
> >
> > That's a very minor difference though and I'm not aware of any adverse
> > effects it can cause on HWP systems anyway.
>
> My point was that it will now default to something where
> testing has not been completed.
>
> > The "what governor" is straightforward in the passive mode: that's
> > whatever cpufreq governor has been selected.
>
> I think you might have missed my point.
> From the normal methods of inquiry one does not know
> if HWP is being used or not. Why? Because with
> or without HWP one gets the same answers under:
>
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Yes, but this is also the case in the active mode, isn't it?

Thanks!
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael,

On 2020.07.19 04:43 Rafael J. Wysocki wrote:
> On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote:
> > On 2020.07.16 05:08 Rafael J. Wysocki wrote:
> > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote:
> > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > >> > From: Rafael J. Wysocki
> > >> ...
> > >> > Since the passive mode hasn't worked with HWP at all, and it is not
> > >> > going to be the default for HWP systems anyway, I don't see any
> > >> > drawbacks related to making this change, so I would consider this as
> > >> > 5.9 material unless there are any serious objections.
> > >>
> > >> Good point.
> >
> > Actually, for those users that default to passive mode upon boot,
> > this would mean they would find themselves using this.
> > Also, it isn't obvious, from the typical "what driver and what governor"
> > inquiry.
>
> So the change in behavior is that after this patch
> intel_pstate=passive doesn't imply no_hwp any more.
>
> That's a very minor difference though and I'm not aware of any adverse
> effects it can cause on HWP systems anyway.

My point was that it will now default to something where
testing has not been completed.

> The "what governor" is straightforward in the passive mode: that's
> whatever cpufreq governor has been selected.

I think you might have missed my point.
From the normal methods of inquiry one does not know
if HWP is being used or not. Why? Because with
or without HWP one gets the same answers under:

/sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

... Doug
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes:

> On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:
>
> [cut]
>
>> >> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >> >
>> >> >> > But at the same time you seem to agree that without the non-CPU
>> >> >> > component (or thermal pressure) the existing CPU performance
>> >> >> > scaling would be sufficient.
>> >> >> >
>> >> >>
>> >> >> Yes, but not necessarily in order to allow the non-CPU component to
>> >> >> draw more power as you said above, but also because the existence of
>> >> >> a bottleneck in a non-CPU component gives us an opportunity to
>> >> >> improve the energy efficiency of the CPU, regardless of whether that
>> >> >> allows the workload to run faster.
>> >> >
>> >> > But why would the bottleneck be there otherwise?
>> >> >
>> >>
>> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> >> unrelated to its energy usage.
>> >
>> > Well, not quite.  Or at least in that case the performance cannot be
>> > improved by limiting the CPU frequency below the frequency looked for
>> > by scaling governors, AFAICS.
>> >
>>
>> Right, but it might still be possible to improve the energy efficiency
>> of the workload even if its throughput cannot be improved, which seems
>> like a worthwhile purpose in itself.
>
> My point is that in this case the energy-efficiency of the processor cannot
> be improved without decreasing performance.
>

Well, in principle it can whenever there is a bottleneck in a non-CPU
component.

> For the processors that are relevant here, the most energy-efficient way to
> run them is in the minimum P-state, but that rarely provides the required
> performance.  Without the knowledge on how much performance really is
> required we assume maximum achievable.  Anything else would need to involve
> some extra policy knobs (which are missing ATM) and that is another problem.
>

But there is a middle ground between limiting the workload to run at the
minimum P-state and having it run at the maximum achievable P-state.  The
MEF estimation we were just talking about can help.  However, due to its
heuristic nature I certainly see the merit of having policy knobs allowing
the optimization to be controlled by users, which is why I added such
controls in v2.99, at your request.

> I'm not saying that it is not a problem, but it is not possible to say how
> much performance to sacrifice without any input from the user on that.
>
> IOW, this is a topic for another discussion.
>

I have the vague recollection that you brought that up already and I agreed
and implemented the changes you asked for.

>> > Scaling governors generally look for the maximum frequency at which
>> > there is no CPU idle time in the workload.  At that frequency the CPU
>> > time required by the workload to achieve the maximum performance is
>> > equal to the total CPU time available to it.  I will refer to that
>> > frequency as the maximum effective frequency (MEF) of the workload.
>> >
>> > By definition, running at frequencies above the MEF does not improve
>> > performance, but it causes CPU idle time to appear.  OTOH running at
>> > frequencies below the MEF increases the CPU time required by the
>> > workload to achieve the maximum performance, so effectively the workload
>> > does not get enough CPU time for the performance to be maximum, so it is
>> > lower than at the MEF.
>> >
>>
>> Yes, we agree on that.
>>
>> > Of course, the MEF is well-defined as long as the processor does not
>> > share the power budget with another component that is also used by the
>> > workload (say, a GPU).  Without the sharing of a power budget, the MEF
>> > can be determined by looking at the CPU idle time (or CPU busy time, or
>> > CPU load, whichever is the most convenient) alone, because it already
>> > depends on the speed of any memory etc accessed by the workload and
>> > slowing down the processor doesn't improve the performance (because the
>> > other components don't run any faster as a result of that).
>> >
>> > However, if the processor is sharing the power budget with a GPU (say),
>> > it is not enough to look at the CPU idle time to determine the MEF,
>> > because slowing down the processor generally causes the GPU to get more
>> > power which allows it to run faster and CPUs can do more work, because
>> > they spend less time waiting for the GPU, so the CPU time available to
>> > the workload effectively increases and it can achieve the maximum
>> > performance at a lower frequency.  So there is an "effective MEF"
>> > depending on the relative performance balance between the processor and
>> > the GPU and on what "fraction" of the workload runs on the GPU.
>> >
>>
>> That doesn't mean that the MEF isn't well-defined in systems with a
>> shared power budget.  If you define it as the
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Thursday, July 30, 2020 2:49:34 AM CEST Francisco Jerez wrote:

[cut]

> >> >> >> No, I explicitly dismissed that in my previous reply.
> >> >> >
> >> >> > But at the same time you seem to agree that without the non-CPU
> >> >> > component (or thermal pressure) the existing CPU performance scaling
> >> >> > would be sufficient.
> >> >> >
> >> >>
> >> >> Yes, but not necessarily in order to allow the non-CPU component to
> >> >> draw more power as you said above, but also because the existence of a
> >> >> bottleneck in a non-CPU component gives us an opportunity to improve
> >> >> the energy efficiency of the CPU, regardless of whether that allows
> >> >> the workload to run faster.
> >> >
> >> > But why would the bottleneck be there otherwise?
> >> >
> >>
> >> Because some resource of the system (e.g. memory bandwidth, GPU fill
> >> rate) may be close to 100% utilized, causing a bottleneck for reasons
> >> unrelated to its energy usage.
> >
> > Well, not quite.  Or at least in that case the performance cannot be
> > improved by limiting the CPU frequency below the frequency looked for by
> > scaling governors, AFAICS.
> >
>
> Right, but it might still be possible to improve the energy efficiency
> of the workload even if its throughput cannot be improved, which seems
> like a worthwhile purpose in itself.

My point is that in this case the energy-efficiency of the processor cannot
be improved without decreasing performance.

For the processors that are relevant here, the most energy-efficient way to
run them is in the minimum P-state, but that rarely provides the required
performance.  Without the knowledge on how much performance really is
required we assume maximum achievable.  Anything else would need to involve
some extra policy knobs (which are missing ATM) and that is another problem.

I'm not saying that it is not a problem, but it is not possible to say how
much performance to sacrifice without any input from the user on that.

IOW, this is a topic for another discussion.

> > Scaling governors generally look for the maximum frequency at which there
> > is no CPU idle time in the workload.  At that frequency the CPU time
> > required by the workload to achieve the maximum performance is equal to
> > the total CPU time available to it.  I will refer to that frequency as
> > the maximum effective frequency (MEF) of the workload.
> >
> > By definition, running at frequencies above the MEF does not improve
> > performance, but it causes CPU idle time to appear.  OTOH running at
> > frequencies below the MEF increases the CPU time required by the workload
> > to achieve the maximum performance, so effectively the workload does
> > not get enough CPU time for the performance to be maximum, so it is lower
> > than at the MEF.
> >
>
> Yes, we agree on that.
>
> > Of course, the MEF is well-defined as long as the processor does not
> > share the power budget with another component that is also used by the
> > workload (say, a GPU).  Without the sharing of a power budget, the MEF
> > can be determined by looking at the CPU idle time (or CPU busy time, or
> > CPU load, whichever is the most convenient) alone, because it already
> > depends on the speed of any memory etc accessed by the workload and
> > slowing down the processor doesn't improve the performance (because the
> > other components don't run any faster as a result of that).
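[Editorial note, not part of the original exchange: the two ways of characterizing the MEF used in this thread can be written compactly. With t_idle(f) the CPU idle time observed at frequency f and perf(f) the workload's throughput, the quoted definition amounts to:]

```latex
f_{\mathrm{MEF}}
  \;=\; \max\,\{\, f \;:\; t_{\mathrm{idle}}(f) = 0 \,\}
  \;=\; \min\,\{\, f \;:\; \mathrm{perf}(f) = \mathrm{perf}_{\max} \,\}
```

[That is, the highest frequency with no idle time coincides with the lowest frequency achieving maximum throughput. The "effective MEF" discussed below amounts to observing that under a shared power budget the second characterization no longer follows from the first, so idle time alone is not enough to find it.]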
>
> > However, if the processor is sharing the power budget with a GPU (say), it
> > is not enough to look at the CPU idle time to determine the MEF, because
> > slowing down the processor generally causes the GPU to get more power
> > which allows it to run faster and CPUs can do more work, because they
> > spend less time waiting for the GPU, so the CPU time available to the
> > workload effectively increases and it can achieve the maximum performance
> > at a lower frequency.  So there is an "effective MEF" depending on the
> > relative performance balance between the processor and the GPU and on
> > what "fraction" of the workload runs on the GPU.
> >
>
> That doesn't mean that the MEF isn't well-defined in systems with a
> shared power budget.  If you define it as the lowest frequency at which
> the workload reaches maximum throughput, then there still is one even if
> the system is TDP-bound: the maximum frequency at which the CPU and
> device maintain their optimal power balance -- Above it the power budget
> of the device will be constrained, causing it to run at less than
> optimal throughput, also causing the workload as a whole to run at less
> than optimal throughput.

That is what I am calling the "effective MEF" and the main point is that it
cannot be determined by looking at the CPU idle time alone.

> That said I agree with you that it takes more than looking at the CPU
> utilization in order to determine the MEF in a system with a shared
> power budget -- Until you've reached a steady
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes:

> On Wednesday, July 29, 2020 7:46:08 AM CEST Francisco Jerez wrote:
>>
>> "Rafael J. Wysocki" writes:
>>
>> > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
>> >>
>> >> "Rafael J. Wysocki" writes:
>> >>
>> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
>> >> >
>> >> > [cut]
>> >> >
>> >> >> >
>> >> >> > However, in the active mode the only updater of hwp_req_cached is
>> >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
>> >> >> > differences in behavior in that case.
>> >> >> >
>> >> >>
>> >> >> intel_pstate_hwp_set() is the only updater, but there are other
>> >> >> consumers that can get out of sync with the HWP request value written
>> >> >> by intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up()
>> >> >> seems like the most concerning example I named earlier.
>> >> >>
>> >> >> >> > So there may be a short time window after the
>> >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new
>> >> >> >> > EPP value may not be in effect, but in general there is no
>> >> >> >> > guarantee that the new EPP will take effect immediately after
>> >> >> >> > updating the MSR anyway, so that race doesn't matter.
>> >> >> >> >
>> >> >> >> > That said, that race is avoidable, but I was thinking that
>> >> >> >> > trying to avoid it might not be worth it.  Now I see a better
>> >> >> >> > way to avoid it, though, so I'm going to update the patch to
>> >> >> >> > that end.
>> >> >> >> >
>> >> >> >> >> Seems like a bug to me.
>> >> >> >> >
>> >> >> >> > It is racy, but not every race is a bug.
>> >> >> >> >
>> >> >> >>
>> >> >> >> Still seems like there is a bug in
>> >> >> >> intel_pstate_set_energy_pref_index() AFAICT.
>> >> >> >
>> >> >> > If there is a bug, then what exactly is it, from the users'
>> >> >> > perspective?
>> >> >> >
>> >> >>
>> >> >> It can be reproduced easily as follows:
>> >> >>
>> >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
>> >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
>> >> >
>> >> > Is this the active mode or the passive mode with the $subject patch
>> >> > applied?
>> >> >
>> >> > If the former, the issue is there regardless of the patch, so it needs
>> >> > to be fixed.
>> >> >
>> >> > If the latter, there should be no effect of hwp_dynamic_boost (which
>> >> > was overlooked by me).
>> >> >
>> >>
>> >> This seems to be a problem in active mode only, so yeah the bug exists
>> >> regardless of your patch, but the fix is likely to allow you to simplify
>> >> this series slightly if it allows you to take full advantage of
>> >> hwp_req_cached and drop the additional EPP cache.
>> >
>> > The additional EPP cache is there to avoid synchronizing the scheduler
>> > context directly with a random process running on another CPU and doing
>> > things that may take time.
>> >
>> > The difference between the active mode and the passive mode in this
>> > respect is that in the latter case hwp_req_cached generally needs to be
>> > updated from the scheduler context, whereas in the former case it does
>> > not.
>> >
>>
>> Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
>> performance penalty from synchronizing with a sysfs handler that should
>> hardly ever be called.  Anyway thanks for the fix.
>
> No problem.
>
>> > [cut]
>> >
>> >> >> No, I explicitly dismissed that in my previous reply.
>> >> >
>> >> > But at the same time you seem to agree that without the non-CPU
>> >> > component (or thermal pressure) the existing CPU performance scaling
>> >> > would be sufficient.
>> >> >
>> >>
>> >> Yes, but not necessarily in order to allow the non-CPU component to draw
>> >> more power as you said above, but also because the existence of a
>> >> bottleneck in a non-CPU component gives us an opportunity to improve the
>> >> energy efficiency of the CPU, regardless of whether that allows the
>> >> workload to run faster.
>> >
>> > But why would the bottleneck be there otherwise?
>> >
>>
>> Because some resource of the system (e.g. memory bandwidth, GPU fill
>> rate) may be close to 100% utilized, causing a bottleneck for reasons
>> unrelated to its energy usage.
>
> Well, not quite.  Or at least in that case the performance cannot be
> improved by limiting the CPU frequency below the frequency looked for by
> scaling governors, AFAICS.
>

Right, but it might still be possible to improve the energy efficiency
of the workload even if its throughput cannot be improved, which seems
like a worthwhile purpose in itself.

> Scaling governors
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Wednesday, July 29, 2020 7:46:08 AM CEST Francisco Jerez wrote:
>
> "Rafael J. Wysocki" writes:
>
> > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote:
> >>
> >> "Rafael J. Wysocki" writes:
> >>
> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote:
> >> >
> >> > [cut]
> >> >
> >> >> >
> >> >> > However, in the active mode the only updater of hwp_req_cached is
> >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any
> >> >> > differences in behavior in that case.
> >> >> >
> >> >>
> >> >> intel_pstate_hwp_set() is the only updater, but there are other
> >> >> consumers that can get out of sync with the HWP request value written
> >> >> by intel_pstate_set_energy_pref_index().  intel_pstate_hwp_boost_up()
> >> >> seems like the most concerning example I named earlier.
> >> >>
> >> >> >> > So there may be a short time window after the
> >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new
> >> >> >> > EPP value may not be in effect, but in general there is no
> >> >> >> > guarantee that the new EPP will take effect immediately after
> >> >> >> > updating the MSR anyway, so that race doesn't matter.
> >> >> >> >
> >> >> >> > That said, that race is avoidable, but I was thinking that trying
> >> >> >> > to avoid it might not be worth it.  Now I see a better way to
> >> >> >> > avoid it, though, so I'm going to update the patch to that end.
> >> >> >> >
> >> >> >> >> Seems like a bug to me.
> >> >> >> >
> >> >> >> > It is racy, but not every race is a bug.
> >> >> >> >
> >> >> >>
> >> >> >> Still seems like there is a bug in
> >> >> >> intel_pstate_set_energy_pref_index() AFAICT.
> >> >> >
> >> >> > If there is a bug, then what exactly is it, from the users'
> >> >> > perspective?
> >> >> >
> >> >>
> >> >> It can be reproduced easily as follows:
> >> >>
> >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done
> >> >
> >> > Is this the active mode or the passive mode with the $subject patch
> >> > applied?
> >> >
> >> > If the former, the issue is there regardless of the patch, so it needs
> >> > to be fixed.
> >> >
> >> > If the latter, there should be no effect of hwp_dynamic_boost (which
> >> > was overlooked by me).
> >> >
> >>
> >> This seems to be a problem in active mode only, so yeah the bug exists
> >> regardless of your patch, but the fix is likely to allow you to simplify
> >> this series slightly if it allows you to take full advantage of
> >> hwp_req_cached and drop the additional EPP cache.
> >
> > The additional EPP cache is there to avoid synchronizing the scheduler
> > context directly with a random process running on another CPU and doing
> > things that may take time.
> >
> > The difference between the active mode and the passive mode in this
> > respect is that in the latter case hwp_req_cached generally needs to be
> > updated from the scheduler context, whereas in the former case it does
> > not.
> >
>
> Hm, that's unfortunate.  Though I'd be surprised to see any appreciable
> performance penalty from synchronizing with a sysfs handler that should
> hardly ever be called.  Anyway thanks for the fix.

No problem.

> > [cut]
> >
> >> >> No, I explicitly dismissed that in my previous reply.
> >> >
> >> > But at the same time you seem to agree that without the non-CPU
> >> > component (or thermal pressure) the existing CPU performance scaling
> >> > would be sufficient.
> >> >
> >>
> >> Yes, but not necessarily in order to allow the non-CPU component to draw
> >> more power as you said above, but also because the existence of a
> >> bottleneck in a non-CPU component gives us an opportunity to improve the
> >> energy efficiency of the CPU, regardless of whether that allows the
> >> workload to run faster.
> >
> > But why would the bottleneck be there otherwise?
> >
>
> Because some resource of the system (e.g. memory bandwidth, GPU fill
> rate) may be close to 100% utilized, causing a bottleneck for reasons
> unrelated to its energy usage.

Well, not quite.  Or at least in that case the performance cannot be improved
by limiting the CPU frequency below the frequency looked for by scaling
governors, AFAICS.

Scaling governors generally look for the maximum frequency at which there is
no CPU idle time in the workload.  At that frequency the CPU time required by
the workload to achieve the maximum performance is equal to the total CPU
time available to it.  I will refer to that frequency as the maximum
effective frequency (MEF) of the workload.

By definition, running at
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes: > On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote: >> >> "Rafael J. Wysocki" writes: >> >> > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote: >> > >> > [cut] >> > >> >> > >> >> > However, in the active mode the only updater of hwp_req_cached is >> >> > intel_pstate_hwp_set() and this patch doesn't introduce any >> >> > differences in behavior in that case. >> >> > >> >>=20 >> >> intel_pstate_hwp_set() is the only updater, but there are other >> >> consumers that can get out of sync with the HWP request value written by >> >> intel_pstate_set_energy_pref_index(). intel_pstate_hwp_boost_up() seems >> >> like the most concerning example I named earlier. >> >>=20 >> >> >> > So there may be a short time window after the >> >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP >> >> >> > value may not be in effect, but in general there is no guarantee th= >> at >> >> >> > the new EPP will take effect immediately after updating the MSR >> >> >> > anyway, so that race doesn't matter. >> >> >> > >> >> >> > That said, that race is avoidable, but I was thinking that trying to >> >> >> > avoid it might not be worth it. Now I see a better way to avoid it, >> >> >> > though, so I'm going to update the patch to that end. >> >> >> > >> >> >> >> Seems like a bug to me. >> >> >> > >> >> >> > It is racy, but not every race is a bug. >> >> >> > >> >> >> >> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index= >> () >> >> >> AFAICT. >> >> > >> >> > If there is a bug, then what exactly is it, from the users' perspectiv= >> e? >> >> > >> >>=20 >> >> It can be reproduced easily as follows: >> >>=20 >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost >> >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_pr= >> eference; do echo performance > $p; done >> > >> > Is this the active mode or the passive mode with the $subject patch appli= >> ed? 
>> > >> > If the former, the issue is there regardless of the patch, so it needs to= >> be >> > fixed. >> > >> > If the latter, there should be no effect of hwp_dynamic_boost (which was >> > overlooked by me). >> > >> >> This seems to be a problem in active mode only, so yeah the bug exists >> regardless of your patch, but the fix is likely to allow you to simplify >> this series slightly if it allows you to take full advantage of >> hwp_req_cached and drop the additional EPP cache. > > The additional EPP cache is there to avoid synchronizing the scheduler > context directly with a random process running on another CPU and doing > things that may take time. > > The difference between the active mode and the passive mode in this respect > is that in the latter case hwp_req_cached generally needs to be updated from > the scheduler context, whereas in the former case it does not. > Hm, that's unfortunate. Though I'd be surprised to see any appreciable performance penalty from synchronizing with a sysfs handler that should hardly ever be called. Anyway thanks for the fix. > [cut] > >> >> No, I explicitly dismissed that in my previous reply. >> > >> > But at the same time you seem to agree that without the non-CPU component >> > (or thermal pressure) the existing CPU performance scaling would be >> > sufficient. >> > >> >> Yes, but not necessarily in order to allow the non-CPU component to draw >> more power as you said above, but also because the existence of a >> bottleneck in a non-CPU component gives us an opportunity to improve the >> energy efficiency of the CPU, regardless of whether that allows the >> workload to run faster. > > But why would the bottleneck be there otherwise? > Because some resource of the system (e.g. memory bandwidth, GPU fill rate) may be close to 100% utilized, causing a bottleneck for reasons unrelated to its energy usage. >> > [cut] >> > >> >> > Yes, it is, and so I don't quite see the connection between it and my = >> question. 
>> >> > >> >> > Apparently, the unmodified performance scaling governors are not >> >> > sufficient, so there must be something beyond the above which allows >> >> > you to determine the frequency in question and so I'm asking what that >> >> > is. >> >> > >> >> >> >> The underlying heuristic assumption is the same as I outlined above, but >> >> in any implementation of such a heuristic there is necessarily a >> >> trade-off between responsiveness to short-term fluctuations and >> >> long-term energy usage. This trade-off is a function of the somewhat >> >> arbitrary time interval I was referring to as "immediate past" -- A >> >> longer time parameter allows the controller to consider a greater >> >> portion of the workload's history while computing the response with >> >> optimal energy usage, at the cost of increasing its reaction time to >> >> discontinuous changes in the behavior of the workload (AKA increased >> >> latency). >> > >> > OK >> > >> >> One of the key
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Tuesday, July 28, 2020 4:32:22 AM CEST Francisco Jerez wrote: > > "Rafael J. Wysocki" writes: > > > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote: > > > > [cut] > > > >> > > >> > However, in the active mode the only updater of hwp_req_cached is > >> > intel_pstate_hwp_set() and this patch doesn't introduce any > >> > differences in behavior in that case. > >> > > >> > >> intel_pstate_hwp_set() is the only updater, but there are other > >> consumers that can get out of sync with the HWP request value written by > >> intel_pstate_set_energy_pref_index(). intel_pstate_hwp_boost_up() seems > >> like the most concerning example I named earlier. > >> > >> >> > So there may be a short time window after the > >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP > >> >> > value may not be in effect, but in general there is no guarantee that > >> >> > the new EPP will take effect immediately after updating the MSR > >> >> > anyway, so that race doesn't matter. > >> >> > > >> >> > That said, that race is avoidable, but I was thinking that trying to > >> >> > avoid it might not be worth it. Now I see a better way to avoid it, > >> >> > though, so I'm going to update the patch to that end. > >> >> > > >> >> >> Seems like a bug to me. > >> >> > > >> >> > It is racy, but not every race is a bug. > >> >> > > >> >> > >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index() > >> >> AFAICT. > >> > > >> > If there is a bug, then what exactly is it, from the users' perspective? > >> > > >> > >> It can be reproduced easily as follows: > >> > >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost > >> | for p in /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do echo performance > $p; done > > > > Is this the active mode or the passive mode with the $subject patch applied? 
> > > > If the former, the issue is there regardless of the patch, so it needs to be > > fixed. > > > > If the latter, there should be no effect of hwp_dynamic_boost (which was > > overlooked by me). > > > > This seems to be a problem in active mode only, so yeah the bug exists regardless of your patch, but the fix is likely to allow you to simplify this series slightly if it allows you to take full advantage of hwp_req_cached and drop the additional EPP cache. The additional EPP cache is there to avoid synchronizing the scheduler context directly with a random process running on another CPU and doing things that may take time. The difference between the active mode and the passive mode in this respect is that in the latter case hwp_req_cached generally needs to be updated from the scheduler context, whereas in the former case it does not. [cut] > >> No, I explicitly dismissed that in my previous reply. > > > > But at the same time you seem to agree that without the non-CPU component > > (or thermal pressure) the existing CPU performance scaling would be > > sufficient. > > > > Yes, but not necessarily in order to allow the non-CPU component to draw > more power as you said above, but also because the existence of a > bottleneck in a non-CPU component gives us an opportunity to improve the > energy efficiency of the CPU, regardless of whether that allows the > workload to run faster. But why would the bottleneck be there otherwise? > > [cut] > > > >> > Yes, it is, and so I don't quite see the connection between it and my question. > >> > > >> > Apparently, the unmodified performance scaling governors are not > >> > sufficient, so there must be something beyond the above which allows > >> > you to determine the frequency in question and so I'm asking what that > >> > is. 
> >> > > >> > >> The underlying heuristic assumption is the same as I outlined above, but > >> in any implementation of such a heuristic there is necessarily a > >> trade-off between responsiveness to short-term fluctuations and > >> long-term energy usage. This trade-off is a function of the somewhat > >> arbitrary time interval I was referring to as "immediate past" -- A > >> longer time parameter allows the controller to consider a greater > >> portion of the workload's history while computing the response with > >> optimal energy usage, at the cost of increasing its reaction time to > >> discontinuous changes in the behavior of the workload (AKA increased > >> latency). > > > > OK > > > >> One of the key differences between the governor I proposed and the > >> pre-existing ones is that it doesn't attempt to come up with a magic > >> time parameter that works for everybody, because there isn't such a > >> thing, since different devices and applications have latency > >> requirements which often differ by orders of magnitude. > > > > The problem with this approach is that, generally speaking, the kernel > > has a definition of "close past" already, which comes from the PELT > > signal in the scheduler. > > > > That signal is used
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote: > [cut] > > If there is a bug, then what exactly is it, from the users' perspective? > > > > It can be reproduced easily as follows: > > | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost > | for p in > /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do > echo performance > $p; done > > Let's make sure that the EPP updates landed on the turbostat output: > > |[..]
> | Core CPU Avg_MHz Busy% Bzy_MHz HWP_REQ
> | -    -    1      0.05  2396    0x
> | 0    0    1      0.05  2153    0x2704
> | 0    4    1      0.04  2062    0x2704
> | 1    1    1      0.02  2938    0x2704
> | 1    5    2      0.09  2609    0x2704
> | 2    2    1      0.04  1857    0x2704
> | 2    6    1      0.05  2561    0x2704
> | 3    3    0      0.01  1883    0x2704
> | 3    7    2      0.07  2703    0x2704
> |[..]
> 
> Now let's do some non-trivial IO activity in order to trigger HWP dynamic boost, and watch while random CPUs start losing their EPP setting requested via sysfs: > > |[..]
> | Core CPU Avg_MHz Busy% Bzy_MHz HWP_REQ
> | -    -    16     0.81  2023    0x
> | 0    0    7      0.66  1069    0x80002704  ^^
> | 0    4    24     2.19  1116    0x80002704  ^^
> | 1    1    18     0.68  2618    0x2704
> | 1    5    1      0.03  2005    0x2704
> | 2    2    2      0.07  2512    0x2704
> | 2    6    33     1.35  2402    0x2704
> | 3    3    1      0.04  2470    0x2704
> | 3    7    45     1.42  3185    0x80002704  ^^
> |[..]
Actually, that's because intel_pstate_hwp_boost_up() and intel_pstate_hwp_boost_down() use the hwp_req_cached value for updating the HWP Request MSR and that is only written to by intel_pstate_hwp_set() which is only invoked on policy changes, so the MSR writes from intel_pstate_set_energy_pref_index() basically get discarded. So this is a matter of synchronizing intel_pstate_set_policy() with intel_pstate_set_energy_pref_index() and they both acquire intel_pstate_limits_lock already, so this shouldn't be too difficult to fix. Let me cut a patch for that.
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes: > On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote: > > [cut] > >> > >> > However, in the active mode the only updater of hwp_req_cached is >> > intel_pstate_hwp_set() and this patch doesn't introduce any >> > differences in behavior in that case. >> > >> >> intel_pstate_hwp_set() is the only updater, but there are other >> consumers that can get out of sync with the HWP request value written by >> intel_pstate_set_energy_pref_index(). intel_pstate_hwp_boost_up() seems >> like the most concerning example I named earlier. >> >> >> > So there may be a short time window after the >> >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP >> >> > value may not be in effect, but in general there is no guarantee that >> >> > the new EPP will take effect immediately after updating the MSR >> >> > anyway, so that race doesn't matter. >> >> > >> >> > That said, that race is avoidable, but I was thinking that trying to >> >> > avoid it might not be worth it. Now I see a better way to avoid it, >> >> > though, so I'm going to update the patch to that end. >> >> > >> >> >> Seems like a bug to me. >> >> > >> >> > It is racy, but not every race is a bug. >> >> > >> >> >> >> Still seems like there is a bug in intel_pstate_set_energy_pref_index() >> >> AFAICT. >> > >> > If there is a bug, then what exactly is it, from the users' perspective? >> > >> >> It can be reproduced easily as follows: >> >> | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost >> | for p in >> /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do >> echo performance > $p; done > > Is this the active mode or the passive mode with the $subject patch applied? > > If the former, the issue is there regardless of the patch, so it needs to be > fixed. > > If the latter, there should be no effect of hwp_dynamic_boost (which was > overlooked by me). 
> This seems to be a problem in active mode only, so yeah the bug exists regardless of your patch, but the fix is likely to allow you to simplify this series slightly if it allows you to take full advantage of hwp_req_cached and drop the additional EPP cache. > [cut] > >> > To really decide what is better, the two alternatives would need to >> > be compared quantitatively, but it doesn't look like they have been. >> > >> >> That's a fair request, I'm all for justifying design decisions >> quantitatively -- And in fact we did compare my non-HWP controller with >> the GuC-based solution quantitatively in a BXT platform, and results >> showed the kernel-based solution to be superior by a significant margin. > > Why do you think that this result is significant beyond BXT? > Because the limitations of GuC power management relative to the kernel-driven solution haven't fundamentally changed since BXT. Of course it might be that the gap we observed was due to more accidental reasons, like bugs in the BXT GuC power management code, it would certainly make sense to recheck. >> That was a couple of years ago though and many things have changed, we >> can get you the results of the previous comparison or an updated >> comparison if you don't find it appealing to look at old numbers. >> > > [cut] > >> >> >> >> If we define the instantaneous energy efficiency of a CPU (eta) to be >> >> the ratio between its instantaneous frequency (f) and power consumption >> >> (P), >> > >> > I'm sorry, but this definition is conceptually misguided. >> > >> > Energy-efficiency (denote it as \phi) can be defined as work/energy which >> > means >> > >> > \phi = dW / dE >> > >> > for the instantaneous one and in general that is not the same as the >> > simple fraction below. >> > >> >> Hah! Actually both definitions are mathematically equivalent everywhere >> they're both defined. 
I assume that the 'd' symbols in your expression >> denote Leibniz's notation for a total derivative (if they didn't -- >> e.g. if they denoted some sort of finite increment instead then I would >> disagree that your expression is a valid definition of *instantaneous* >> energy efficiency). In addition I assume that we agree on there being >> two well-defined functions of time in the case that concerns us, which >> we call "instantaneous power consumption" [P(t) = dE(t)/dt] and >> "instantaneous frequency" [f(t) = dW(t)/dt]. With that in mind you just >> have to apply the standard chain rule from calculus in order to prove >> the equivalence of both expressions: >> >> | \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 = >> | = f(t) * P(t)^-1 = eta > > Well, under certain assumptions, but OK, I stand corrected. > > [cut] > >> >> > Regardless, the ultimate goal appears to be to allow the non-CPU >> > component you care about to draw more power. >> > >> >> No, I explicitly dismissed that in my previous reply. > > But at the same time you seem to agree that without the non-CPU component > (or thermal pressure) the existing CPU
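For reference, the chain-rule identity being debated above can be written out in one line. This merely restates the quoted derivation, under the assumptions both parties end up agreeing on: W(t) and E(t) are differentiable and P(t) = dE/dt is nonzero, so t(E) is invertible:

```latex
\phi \;=\; \frac{dW}{dE}
     \;=\; \frac{dW}{dt}\cdot\frac{dt}{dE}
     \;=\; \frac{dW/dt}{dE/dt}
     \;=\; \frac{f(t)}{P(t)} \;=\; \eta .
```

Rafael's "under certain assumptions" caveat is exactly the invertibility/nonzero-power condition: where P(t) = 0 the middle step is undefined, so the two definitions only coincide where both are defined.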
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Wednesday, July 22, 2020 1:14:42 AM CEST Francisco Jerez wrote: > > Srinivas Pandruvada writes: > > > On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote: > >> "Rafael J. Wysocki" writes: > >> > >> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez < > >> > curroje...@riseup.net> wrote: > >> > > "Rafael J. Wysocki" writes: > >> > > > > [...] > > > >> > Overall, so far, I'm seeing a claim that the CPU subsystem can be > >> > made to > >> > use less energy and do as much work as before (which is what > >> > improving > >> > the energy-efficiency means in general) if the maximum frequency of > >> > CPUs is limited in a clever way. > >> > > >> > I'm failing to see what that clever way is, though. > >> Hopefully the clarifications above help some. > > > > To simplify: > > > > Suppose I called a function numpy.multiply() to multiply two big arrays > > and the thread is pegged to a CPU. Let's say it is causing the CPU to > > finish the job in 10ms and it is using a P-State of 0x20. But the same > > job could have been done in 10ms even if it was using a P-state of 0x16. > > So we are not energy efficient. To really know where the bottleneck is > > there are a number of perf counters; maybe the cache was the issue, we > > could rather raise the uncore frequency a little. Simple APERF/MPERF > > counters are not enough. > > Yes, that's right, APERF and MPERF aren't sufficient to identify every > kind of possible bottleneck, some visibility of the utilization of other > subsystems is necessary in addition -- like e.g. the instrumentation > introduced in my series to detect a GPU bottleneck. A bottleneck > condition in an IO device can be communicated to CPUFREQ It generally is not sufficient to communicate it to cpufreq. It needs to be communicated to the CPU scheduler. 
> by adjusting a > PM QoS latency request (link [2] in my previous reply) that effectively > gives the governor permission to rearrange CPU work arbitrarily within > the specified time frame (which should be of the order of the natural > latency of the IO device -- e.g. at least the rendering time of a frame > for a GPU) in order to minimize energy usage. OK, we need to talk more about this. > > or we characterize the workload at different P-states and set limits. > > I think this is not what you want to say for energy efficiency with your > > changes. > > > > The way you are trying to improve "performance" is by having the caller (device > > driver) say how important the job at hand is. Here the device driver is supposed to > > offload these calculations to some GPU and can wait up to 10 ms, so you > > want to tell the CPU to be slow. But if the P-state driver at some moment > > observes that there is a chance of overshoot of latency, it will > > immediately ask for a higher P-state. So you want P-state limits based on > > the latency requirements of the caller. Since the caller has more knowledge > > of the latency requirement, this allows other devices sharing the power > > budget to get more or less power, and improve overall energy efficiency > > as the combined performance of the system is improved. > > Is this correct? > > Yes, pretty much. OK
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Tuesday, July 21, 2020 1:20:14 AM CEST Francisco Jerez wrote: [cut] > > > > However, in the active mode the only updater of hwp_req_cached is > > intel_pstate_hwp_set() and this patch doesn't introduce any > > differences in behavior in that case. > > > > intel_pstate_hwp_set() is the only updater, but there are other > consumers that can get out of sync with the HWP request value written by > intel_pstate_set_energy_pref_index(). intel_pstate_hwp_boost_up() seems > like the most concerning example I named earlier. > > >> > So there may be a short time window after the > >> > intel_pstate_set_energy_pref_index() invocation in which the new EPP > >> > value may not be in effect, but in general there is no guarantee that > >> > the new EPP will take effect immediately after updating the MSR > >> > anyway, so that race doesn't matter. > >> > > >> > That said, that race is avoidable, but I was thinking that trying to > >> > avoid it might not be worth it. Now I see a better way to avoid it, > >> > though, so I'm going to update the patch to that end. > >> > > >> >> Seems like a bug to me. > >> > > >> > It is racy, but not every race is a bug. > >> > > >> > >> Still seems like there is a bug in intel_pstate_set_energy_pref_index() > >> AFAICT. > > > > If there is a bug, then what exactly is it, from the users' perspective? > > > > It can be reproduced easily as follows: > > | echo 1 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost > | for p in > /sys/devices/system/cpu/cpufreq/policy*/energy_performance_preference; do > echo performance > $p; done Is this the active mode or the passive mode with the $subject patch applied? If the former, the issue is there regardless of the patch, so it needs to be fixed. If the latter, there should be no effect of hwp_dynamic_boost (which was overlooked by me). [cut] > > To really decide what is better, the two alternatives would need to > > be compared quantitatively, but it doesn't look like they have been. 
> > > > That's a fair request, I'm all for justifying design decisions > quantitatively -- And in fact we did compare my non-HWP controller with > the GuC-based solution quantitatively in a BXT platform, and results > showed the kernel-based solution to be superior by a significant margin. Why do you think that this result is significant beyond BXT? > That was a couple of years ago though and many things have changed, we > can get you the results of the previous comparison or an updated > comparison if you don't find it appealing to look at old numbers. > [cut] > >> > >> If we define the instantaneous energy efficiency of a CPU (eta) to be > >> the ratio between its instantaneous frequency (f) and power consumption > >> (P), > > > > I'm sorry, but this definition is conceptually misguided. > > > > Energy-efficiency (denote it as \phi) can be defined as work/energy which > > means > > > > \phi = dW / dE > > > > for the instantaneous one and in general that is not the same as the > > simple fraction below. > > > > Hah! Actually both definitions are mathematically equivalent everywhere > they're both defined. I assume that the 'd' symbols in your expression > denote Leibniz's notation for a total derivative (if they didn't -- > e.g. if they denoted some sort of finite increment instead then I would > disagree that your expression is a valid definition of *instantaneous* > energy efficiency). In addition I assume that we agree on there being > two well-defined functions of time in the case that concerns us, which > we call "instantaneous power consumption" [P(t) = dE(t)/dt] and > "instantaneous frequency" [f(t) = dW(t)/dt]. With that in mind you just > have to apply the standard chain rule from calculus in order to prove > the equivalence of both expressions: > > | \phi = dW(t(E))/dE = dW(t)/dt * dt(E)/dE = dW(t)/dt * (dE(t)/dt)^-1 = > | = f(t) * P(t)^-1 = eta Well, under certain assumptions, but OK, I stand corrected. 
[cut] > > > Regardless, the ultimate goal appears to be to allow the non-CPU > > component you care about to draw more power. > > > > No, I explicitly dismissed that in my previous reply. But at the same time you seem to agree that without the non-CPU component (or thermal pressure) the existing CPU performance scaling would be sufficient. [cut] > > Yes, it is, and so I don't quite see the connection between it and my > > question. > > > > Apparently, the unmodified performance scaling governors are not > > sufficient, so there must be something beyond the above which allows > > you to determine the frequency in question and so I'm asking what that > > is. > > > > The underlying heuristic assumption is the same as I outlined above, but > in any implementation of such a heuristic there is necessarily a > trade-off between responsiveness to short-term fluctuations and > long-term energy usage. This trade-off is a function of the somewhat > arbitrary time interval I was referring to as "immediate past" -- A >
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Srinivas Pandruvada writes: > On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote: >> "Rafael J. Wysocki" writes: >> >> > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez < >> > curroje...@riseup.net> wrote: >> > > "Rafael J. Wysocki" writes: >> > > > [...] > >> > Overall, so far, I'm seeing a claim that the CPU subsystem can be >> > made to >> > use less energy and do as much work as before (which is what >> > improving >> > the energy-efficiency means in general) if the maximum frequency of >> > CPUs is limited in a clever way. >> > >> > I'm failing to see what that clever way is, though. >> Hopefully the clarifications above help some. > > To simplify: > > Suppose I called a function numpy.multiply() to multiply two big arrays > and the thread is pegged to a CPU. Let's say it is causing the CPU to > finish the job in 10ms and it is using a P-State of 0x20. But the same > job could have been done in 10ms even if it was using a P-state of 0x16. > So we are not energy efficient. To really know where the bottleneck is > there are a number of perf counters; maybe the cache was the issue, we > could rather raise the uncore frequency a little. Simple APERF/MPERF > counters are not enough. Yes, that's right, APERF and MPERF aren't sufficient to identify every kind of possible bottleneck, some visibility of the utilization of other subsystems is necessary in addition -- like e.g. the instrumentation introduced in my series to detect a GPU bottleneck. A bottleneck condition in an IO device can be communicated to CPUFREQ by adjusting a PM QoS latency request (link [2] in my previous reply) that effectively gives the governor permission to rearrange CPU work arbitrarily within the specified time frame (which should be of the order of the natural latency of the IO device -- e.g. at least the rendering time of a frame for a GPU) in order to minimize energy usage. > or we characterize the workload at different P-states and set limits. 
> I think this is not what you want to say for energy efficiency with your > changes. > > The way you are trying to improve "performance" is by having the caller (device > driver) say how important the job at hand is. Here the device driver is supposed to > offload these calculations to some GPU and can wait up to 10 ms, so you > want to tell the CPU to be slow. But if the P-state driver at some moment > observes that there is a chance of overshoot of latency, it will > immediately ask for a higher P-state. So you want P-state limits based on > the latency requirements of the caller. Since the caller has more knowledge > of the latency requirement, this allows other devices sharing the power > budget to get more or less power, and improve overall energy efficiency > as the combined performance of the system is improved. > Is this correct? Yes, pretty much.
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Mon, 2020-07-20 at 16:20 -0700, Francisco Jerez wrote: > "Rafael J. Wysocki" writes: > > > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez < > > curroje...@riseup.net> wrote: > > > "Rafael J. Wysocki" writes: > > > [...] > > Overall, so far, I'm seeing a claim that the CPU subsystem can be > > made to > > use less energy and do as much work as before (which is what > > improving > > the energy-efficiency means in general) if the maximum frequency of > > CPUs is limited in a clever way. > > > > I'm failing to see what that clever way is, though. > Hopefully the clarifications above help some. To simplify: Suppose I called a function numpy.multiply() to multiply two big arrays and the thread is pegged to a CPU. Let's say it is causing the CPU to finish the job in 10ms and it is using a P-State of 0x20. But the same job could have been done in 10ms even if it was using a P-state of 0x16. So we are not energy efficient. To really know where the bottleneck is there are a number of perf counters; maybe the cache was the issue, we could rather raise the uncore frequency a little. Simple APERF/MPERF counters are not enough. Or we characterize the workload at different P-states and set limits. I think this is not what you want to say for energy efficiency with your changes. The way you are trying to improve "performance" is by having the caller (device driver) say how important the job at hand is. Here the device driver is supposed to offload these calculations to some GPU and can wait up to 10 ms, so you want to tell the CPU to be slow. But if the P-state driver at some moment observes that there is a chance of overshoot of latency, it will immediately ask for a higher P-state. So you want P-state limits based on the latency requirements of the caller. Since the caller has more knowledge of the latency requirement, this allows other devices sharing the power budget to get more or less power, and improve overall energy efficiency as the combined performance of the system is improved. Is this correct?
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes: > On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez wrote: >> >> "Rafael J. Wysocki" writes: >> >> > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez >> > wrote: >> >> >> >> "Rafael J. Wysocki" writes: >> >> >> >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez >> >> > wrote: >> >> >> >> >> >> "Rafael J. Wysocki" writes: >> >> >> >> >> >> > From: Rafael J. Wysocki >> >> >> > >> >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and >> >> >> > make it set the HWP minimum performance limit (HWP floor) to the >> >> >> > P-state value given by the target frequency supplied by the cpufreq >> >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler >> >> >> > from working against each other, at least when the schedutil governor >> >> >> > is in use, and update the intel_pstate documentation accordingly. >> >> >> > >> >> >> > Among other things, this allows utilization clamps to be taken >> >> >> > into account, at least to a certain extent, when intel_pstate is >> >> >> > in use and makes it more likely that sufficient capacity for >> >> >> > deadline tasks will be provided. >> >> >> > >> >> >> > After this change, the resulting behavior of an HWP system with >> >> >> > intel_pstate in the passive mode should be close to the behavior >> >> >> > of the analogous non-HWP system with intel_pstate in the passive >> >> >> > mode, except that in the frequency range below the base frequency >> >> >> > (ie. the frequency returned by the base_frequency cpufreq attribute >> >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above >> >> >> > the floor P-state set by intel_pstate with or without hardware >> >> >> > coordination of P-states among CPUs in the same package. 
>> >> >> > >> >> >> > Also note that the setting of the HWP floor may not be taken into >> >> >> > account by the processor in the following cases: >> >> >> > >> >> >> > * For the HWP floor in the range of P-states above the base >> >> >> >frequency, referred to as the turbo range, the processor has a >> >> >> >license to choose any P-state from that range, either below or >> >> >> >above the HWP floor, just like a non-HWP processor in the case >> >> >> >when the target P-state falls into the turbo range. >> >> >> > >> >> >> > * If P-states of the CPUs in the same package are coordinated >> >> >> >at the hardware level, the processor may choose a P-state >> >> >> >above the HWP floor, just like a non-HWP processor in the >> >> >> >analogous case. >> >> >> > >> >> >> > With this change applied, intel_pstate in the passive mode >> >> >> > assumes complete control over the HWP request MSR and concurrent >> >> >> > changes of that MSR (eg. via the direct MSR access interface) are >> >> >> > overridden by it. >> >> >> > >> >> >> > Signed-off-by: Rafael J. Wysocki >> >> >> > --- >> >> >> > >> >> >> > This basically unifies the passive mode behavior of intel_pstate for >> >> >> > systems >> >> >> > with and without HWP enabled. The only case in which there is a >> >> >> > difference >> >> >> > between the two (after this patch) is below the turbo range, where >> >> >> > the HWP >> >> >> > algorithm can go above the floor regardless of whether or not >> >> >> > P-state are >> >> >> > coordinated package-wide (this means the systems with per-core >> >> >> > P-states >> >> >> > mostly is where the difference can be somewhat visible). >> >> >> > >> >> >> > Since the passive mode hasn't worked with HWP at all, and it is not >> >> >> > going to >> >> >> > the default for HWP systems anyway, I don't see any drawbacks >> >> >> > related to making >> >> >> > this change, so I would consider this as 5.9 material unless there >> >> >> > are any >> >> >> > serious objections. 
>> >> >> > >> >> >> > Thanks! >> >> >> > >> >> >> > --- >> >> >> > Documentation/admin-guide/pm/intel_pstate.rst | 89 >> >> >> > +++- >> >> >> > drivers/cpufreq/intel_pstate.c| 141 >> >> >> > -- >> >> >> > 2 files changed, 152 insertions(+), 78 deletions(-) >> >> >> > >> >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c >> >> >> > === >> >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c >> >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c >> >> >> > @@ -36,6 +36,7 @@ >> >> >> > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) >> >> >> > >> >> >> > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 >> >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 >> >> >> > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 >> >> >> > >> >> >> > #ifdef CONFIG_ACPI >> >> >> > @@ -222,6 +223,7 @@ struct global_params { >> >> >> > * preference/bias >> >> >> > * @epp_saved: Saved EPP/EPB during system suspend or >> >> >> > CPU offline >> >> >> > * operation >> >>
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Fri, Jul 17, 2020 at 2:21 AM Francisco Jerez wrote: > > "Rafael J. Wysocki" writes: > > > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez > > wrote: > >> > >> "Rafael J. Wysocki" writes: > >> > >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez > >> > wrote: > >> >> > >> >> "Rafael J. Wysocki" writes: > >> >> > >> >> > From: Rafael J. Wysocki > >> >> > > >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and > >> >> > make it set the HWP minimum performance limit (HWP floor) to the > >> >> > P-state value given by the target frequency supplied by the cpufreq > >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler > >> >> > from working against each other, at least when the schedutil governor > >> >> > is in use, and update the intel_pstate documentation accordingly. > >> >> > > >> >> > Among other things, this allows utilization clamps to be taken > >> >> > into account, at least to a certain extent, when intel_pstate is > >> >> > in use and makes it more likely that sufficient capacity for > >> >> > deadline tasks will be provided. > >> >> > > >> >> > After this change, the resulting behavior of an HWP system with > >> >> > intel_pstate in the passive mode should be close to the behavior > >> >> > of the analogous non-HWP system with intel_pstate in the passive > >> >> > mode, except that in the frequency range below the base frequency > >> >> > (ie. the frequency returned by the base_frequency cpufreq attribute > >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above > >> >> > the floor P-state set by intel_pstate with or without hardware > >> >> > coordination of P-states among CPUs in the same package. 
> >> >> > > >> >> > Also note that the setting of the HWP floor may not be taken into > >> >> > account by the processor in the following cases: > >> >> > > >> >> > * For the HWP floor in the range of P-states above the base > >> >> >frequency, referred to as the turbo range, the processor has a > >> >> >license to choose any P-state from that range, either below or > >> >> >above the HWP floor, just like a non-HWP processor in the case > >> >> >when the target P-state falls into the turbo range. > >> >> > > >> >> > * If P-states of the CPUs in the same package are coordinated > >> >> >at the hardware level, the processor may choose a P-state > >> >> >above the HWP floor, just like a non-HWP processor in the > >> >> >analogous case. > >> >> > > >> >> > With this change applied, intel_pstate in the passive mode > >> >> > assumes complete control over the HWP request MSR and concurrent > >> >> > changes of that MSR (eg. via the direct MSR access interface) are > >> >> > overridden by it. > >> >> > > >> >> > Signed-off-by: Rafael J. Wysocki > >> >> > --- > >> >> > > >> >> > This basically unifies the passive mode behavior of intel_pstate for > >> >> > systems > >> >> > with and without HWP enabled. The only case in which there is a > >> >> > difference > >> >> > between the two (after this patch) is below the turbo range, where > >> >> > the HWP > >> >> > algorithm can go above the floor regardless of whether or not P-state > >> >> > are > >> >> > coordinated package-wide (this means the systems with per-core > >> >> > P-states > >> >> > mostly is where the difference can be somewhat visible). > >> >> > > >> >> > Since the passive mode hasn't worked with HWP at all, and it is not > >> >> > going to > >> >> > the default for HWP systems anyway, I don't see any drawbacks related > >> >> > to making > >> >> > this change, so I would consider this as 5.9 material unless there > >> >> > are any > >> >> > serious objections. > >> >> > > >> >> > Thanks! 
> >> >> > > >> >> > --- > >> >> > Documentation/admin-guide/pm/intel_pstate.rst | 89 +++- > >> >> > drivers/cpufreq/intel_pstate.c| 141 > >> >> > -- > >> >> > 2 files changed, 152 insertions(+), 78 deletions(-) > >> >> > > >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c > >> >> > === > >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c > >> >> > @@ -36,6 +36,7 @@ > >> >> > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) > >> >> > > >> >> > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 > >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 > >> >> > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 > >> >> > > >> >> > #ifdef CONFIG_ACPI > >> >> > @@ -222,6 +223,7 @@ struct global_params { > >> >> > * preference/bias > >> >> > * @epp_saved: Saved EPP/EPB during system suspend or > >> >> > CPU offline > >> >> > * operation > >> >> > + * @epp_cached Cached HWP energy-performance > >> >> > preference value > >> >> > * @hwp_req_cached: Cached value of the
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Doug, On Fri, Jul 17, 2020 at 3:37 PM Doug Smythies wrote: > > Hi Rafael, > > Thank you for your reply. > > On 2020.07.16 05:08 Rafael J. Wysocki wrote: > > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote: > >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: > >> > > >> > From: Rafael J. Wysocki > >> ... > >> > Since the passive mode hasn't worked with HWP at all, and it is not > >> > going to > >> > the default for HWP systems anyway, I don't see any drawbacks related to > >> > making > >> > this change, so I would consider this as 5.9 material unless there are > >> > any > >> > serious objections. > >> > >> Good point. > > Actually, for those users that default to passive mode upon boot, > this would mean they would find themselves using this. > Also, it isn't obvious, from the typical "what driver and what governor" > inquiry. So the change in behavior is that after this patch intel_pstate=passive doesn't imply no_hwp any more. That's a very minor difference though and I'm not aware of any adverse effects it can cause on HWP systems anyway. The "what governor" is straightforward in the passive mode: that's whatever cpufreq governor has been selected. The driver is "intel_cpufreq" which means that the processor is requested to run at a frequency selected by the governor or higher, unless in the turbo range. This works similarly in both the HWP and non-HWP cases, except that in the HWP case it is possible to adjust the EPP (through the additional sysfs knob) and the base frequency is exported (the latter two things can be used to distinguish between the two cases just fine IMO). > >> Some of the tests I do involve labour intensive post processing of data. > >> I want to automate some of that work, and it will take time. > >> We might be into the 5.9-rc series before I have detailed feedback. > >> > >> However, so far: > >> > >> Inverse impulse response test [1]: > >> > >> High level test, i5-9600K, HWP-passive (this patch), ondemand: > >> 3101 tests. 
0 failures. (GOOD) > >> > >> From [1], re-stated: > >> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. > >> > (HWP-active - powersave) > >> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 > >> > failures. > >> > >> My version of that cool Alexander named pipe test [2] serialized workflow: > >> > >> HWP-passive (this patch), performance: PASS. > >> > >> From [2], re-stated, and also re-tested. > >> HWP-disabled passive - performance: FAIL. > > > > But I'm not quite sure how this is related to this patch? > > It isn't. The point being that it is different. It is different, but kind of in a positive way IMO. > But yes, that failure is because of our other discussion [3]. OK > > > > This test would still fail without the patch if the kernel was started > > with intel_pstate=passive in the kernel command line, wouldn't it. > > Yes. OK Thanks!
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
Hi Rafael, Thank you for your reply. On 2020.07.16 05:08 Rafael J. Wysocki wrote: > On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote: >> On 2020.07.14 11:16 Rafael J. Wysocki wrote: >> > >> > From: Rafael J. Wysocki >> ... >> > Since the passive mode hasn't worked with HWP at all, and it is not going >> > to >> > the default for HWP systems anyway, I don't see any drawbacks related to >> > making >> > this change, so I would consider this as 5.9 material unless there are any >> > serious objections. >> >> Good point. Actually, for those users that default to passive mode upon boot, this would mean they would find themselves using this. Also, it isn't obvious, from the typical "what driver and what governor" inquiry. >> Some of the tests I do involve labour intensive post processing of data. >> I want to automate some of that work, and it will take time. >> We might be into the 5.9-rc series before I have detailed feedback. >> >> However, so far: >> >> Inverse impulse response test [1]: >> >> High level test, i5-9600K, HWP-passive (this patch), ondemand: >> 3101 tests. 0 failures. (GOOD) >> >> From [1], re-stated: >> > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. >> > (HWP-active - powersave) >> > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 >> > failures. >> >> My version of that cool Alexander named pipe test [2] serialized workflow: >> >> HWP-passive (this patch), performance: PASS. >> >> From [2], re-stated, and also re-tested. >> HWP-disabled passive - performance: FAIL. > > But I'm not quite sure how this is related to this patch? It isn't. The point being that it is different. But yes, that failure is because of our other discussion [3]. > > This test would still fail without the patch if the kernel was started > with intel_pstate=passive in the kernel command line, wouldn't it. Yes. > > > Although, I believe the issue to be EPB management, [3]. > > Well, that's kind of unexpected. 
> > If this really is the case, then it looks like the effect of the EPB > setting on the processor is different depending on whether or not HWP > is enabled. > >> And yes, I did see the reply to [3] that came earlier, >> and have now re-done the test, with the referenced patch added. >> It still is FAIL. I will reply to the [3] thread eventually. >> >> [1] https://marc.info/?l=linux-pm=159354421400342=2 >> [2] https://marc.info/?l=linux-pm=159155067328641=2 >> [3] https://marc.info/?l=linux-pm=159438804230744=2 >> >> Kernel: >> >> b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP >> if EPP is not supported >> 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled >> 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference >> value >> bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency >> 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line >> 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux >> 5.8-rc5 >> >> Rules for this work: >> >> . never use x86_energy_perf_policy. >> . For HWP disabled: never change from active to passive or vice versa, but >> rather do it via boot. >> . after boot always check and reset the various power limit log bits that >> are set. >> . never compile the kernel (well, until after any tests), which will set >> those bits again. >> . never run prime95 high heat torture test, which will set those bits again. >> . try to never do anything else that will set those bits again. >> >> To be clear, I do allow changing governors within the context of the above >> rules. > > Thanks for the feedback!
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes: > On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez > wrote: >> >> "Rafael J. Wysocki" writes: >> >> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez >> > wrote: >> >> >> >> "Rafael J. Wysocki" writes: >> >> >> >> > From: Rafael J. Wysocki >> >> > >> >> > Allow intel_pstate to work in the passive mode with HWP enabled and >> >> > make it set the HWP minimum performance limit (HWP floor) to the >> >> > P-state value given by the target frequency supplied by the cpufreq >> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler >> >> > from working against each other, at least when the schedutil governor >> >> > is in use, and update the intel_pstate documentation accordingly. >> >> > >> >> > Among other things, this allows utilization clamps to be taken >> >> > into account, at least to a certain extent, when intel_pstate is >> >> > in use and makes it more likely that sufficient capacity for >> >> > deadline tasks will be provided. >> >> > >> >> > After this change, the resulting behavior of an HWP system with >> >> > intel_pstate in the passive mode should be close to the behavior >> >> > of the analogous non-HWP system with intel_pstate in the passive >> >> > mode, except that in the frequency range below the base frequency >> >> > (ie. the frequency retured by the base_frequency cpufreq attribute >> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above >> >> > the floor P-state set by intel_pstate with or without hardware >> >> > coordination of P-states among CPUs in the same package. 
>> >> > >> >> > Also note that the setting of the HWP floor may not be taken into >> >> > account by the processor in the following cases: >> >> > >> >> > * For the HWP floor in the range of P-states above the base >> >> >frequency, referred to as the turbo range, the processor has a >> >> >license to choose any P-state from that range, either below or >> >> >above the HWP floor, just like a non-HWP processor in the case >> >> >when the target P-state falls into the turbo range. >> >> > >> >> > * If P-states of the CPUs in the same package are coordinated >> >> >at the hardware level, the processor may choose a P-state >> >> >above the HWP floor, just like a non-HWP processor in the >> >> >analogous case. >> >> > >> >> > With this change applied, intel_pstate in the passive mode >> >> > assumes complete control over the HWP request MSR and concurrent >> >> > changes of that MSR (eg. via the direct MSR access interface) are >> >> > overridden by it. >> >> > >> >> > Signed-off-by: Rafael J. Wysocki >> >> > --- >> >> > >> >> > This basically unifies the passive mode behavior of intel_pstate for >> >> > systems >> >> > with and without HWP enabled. The only case in which there is a >> >> > difference >> >> > between the two (after this patch) is below the turbo range, where the >> >> > HWP >> >> > algorithm can go above the floor regardless of whether or not P-state >> >> > are >> >> > coordinated package-wide (this means the systems with per-core P-states >> >> > mostly is where the difference can be somewhat visible). >> >> > >> >> > Since the passive mode hasn't worked with HWP at all, and it is not >> >> > going to >> >> > the default for HWP systems anyway, I don't see any drawbacks related >> >> > to making >> >> > this change, so I would consider this as 5.9 material unless there are >> >> > any >> >> > serious objections. >> >> > >> >> > Thanks! 
>> >> > >> >> > --- >> >> > Documentation/admin-guide/pm/intel_pstate.rst | 89 +++- >> >> > drivers/cpufreq/intel_pstate.c| 141 >> >> > -- >> >> > 2 files changed, 152 insertions(+), 78 deletions(-) >> >> > >> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c >> >> > === >> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c >> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c >> >> > @@ -36,6 +36,7 @@ >> >> > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) >> >> > >> >> > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 >> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 >> >> > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 >> >> > >> >> > #ifdef CONFIG_ACPI >> >> > @@ -222,6 +223,7 @@ struct global_params { >> >> > * preference/bias >> >> > * @epp_saved: Saved EPP/EPB during system suspend or >> >> > CPU offline >> >> > * operation >> >> > + * @epp_cached Cached HWP energy-performance preference >> >> > value >> >> > * @hwp_req_cached: Cached value of the last HWP Request MSR >> >> > * @hwp_cap_cached: Cached value of the last HWP Capabilities MSR >> >> > * @last_io_update: Last time when IO wake flag was set >> >> > @@ -259,6 +261,7 @@ struct cpudata { >> >> > s16 epp_policy; >> >> > s16 epp_default; >> >> >
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Thu, Jul 16, 2020 at 3:14 AM Srinivas Pandruvada wrote: > On Wed, 2020-07-15 at 14:35 -0700, Francisco Jerez wrote: > > "Rafael J. Wysocki" writes: > > > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <curroje...@riseup.net> wrote: > > > > "Rafael J. Wysocki" writes: > > > > > From: Rafael J. Wysocki > > [...] > > I don't think that's accurate. I've looked at hundreds of traces while my series [1] was in control of HWP_REQ_MAX and I've never seen an excursion above the maximum HWP_REQ_MAX control specified by it within a given P-state domain, even while that maximum specified was well into the turbo range. So, yeah, I agree that HWP_REQ_MAX is nothing like a hard limit, particularly when multiple threads are running on the same clock domain, but the processor will still make its best effort to limit the clock frequency to the maximum of the requested maximums, even if it happens to be within the turbo range. That doesn't make it useless. The exact same thing can be said about controlling HWP_REQ_MIN as you're doing now in this revision of your patch, BTW. > > If you don't believe me, here is the turbostat sample with maximum Bzy_MHz I get on the computer I'm sitting on right now while compiling a kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
> >
> > > Core CPU Avg_MHz Busy% Bzy_MHz HWP_REQ    PkgWatt CorWatt
> > > -    -   757     27.03 2800    0x         7.1     34.90
> > > 0    0   2794    99.77 2800    0x80001c04 7.1     34.90
> > > 0    2   83      2.98  2800    0x80001c04
> > > 1    1   73      2.60  2800    0x80001c04
> > > 1    3   78      2.79  2800    0x80001c04
> >
> > With the default HWP_REQUEST:
> >
> > > Core CPU Avg_MHz Busy% Bzy_MHz HWP_REQ    PkgWatt CorWatt
> > > -    -   814     27.00 3015    0x         8.4     96.18
> > > 0    0   2968    98.24 3021    0x80001f04 8.4     96.18
> > > 0    2   84      2.81  2982    0x80001f04
> > > 1    1   99      3.34  2961    0x80001f04
> > > 1    3   105     3.60  2921    0x80001f04
>
> Correct. In HWP mode it is possible to conditionally lower the limit in the turbo region. In legacy mode you can't, because of the turbo activation ratio. But what we don't want is to set max and min perf and use them like desired, to run at a P-state overriding HWP, or to limit the range where HWP can't do any meaningful selection.

That's a good point too IMO.
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez wrote: > > "Rafael J. Wysocki" writes: > > > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez > > wrote: > >> > >> "Rafael J. Wysocki" writes: > >> > >> > From: Rafael J. Wysocki > >> > > >> > Allow intel_pstate to work in the passive mode with HWP enabled and > >> > make it set the HWP minimum performance limit (HWP floor) to the > >> > P-state value given by the target frequency supplied by the cpufreq > >> > governor, so as to prevent the HWP algorithm and the CPU scheduler > >> > from working against each other, at least when the schedutil governor > >> > is in use, and update the intel_pstate documentation accordingly. > >> > > >> > Among other things, this allows utilization clamps to be taken > >> > into account, at least to a certain extent, when intel_pstate is > >> > in use and makes it more likely that sufficient capacity for > >> > deadline tasks will be provided. > >> > > >> > After this change, the resulting behavior of an HWP system with > >> > intel_pstate in the passive mode should be close to the behavior > >> > of the analogous non-HWP system with intel_pstate in the passive > >> > mode, except that in the frequency range below the base frequency > >> > (ie. the frequency retured by the base_frequency cpufreq attribute > >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above > >> > the floor P-state set by intel_pstate with or without hardware > >> > coordination of P-states among CPUs in the same package. > >> > > >> > Also note that the setting of the HWP floor may not be taken into > >> > account by the processor in the following cases: > >> > > >> > * For the HWP floor in the range of P-states above the base > >> >frequency, referred to as the turbo range, the processor has a > >> >license to choose any P-state from that range, either below or > >> >above the HWP floor, just like a non-HWP processor in the case > >> >when the target P-state falls into the turbo range. 
> >> > > >> > * If P-states of the CPUs in the same package are coordinated > >> >at the hardware level, the processor may choose a P-state > >> >above the HWP floor, just like a non-HWP processor in the > >> >analogous case. > >> > > >> > With this change applied, intel_pstate in the passive mode > >> > assumes complete control over the HWP request MSR and concurrent > >> > changes of that MSR (eg. via the direct MSR access interface) are > >> > overridden by it. > >> > > >> > Signed-off-by: Rafael J. Wysocki > >> > --- > >> > > >> > This basically unifies the passive mode behavior of intel_pstate for > >> > systems > >> > with and without HWP enabled. The only case in which there is a > >> > difference > >> > between the two (after this patch) is below the turbo range, where the > >> > HWP > >> > algorithm can go above the floor regardless of whether or not P-state are > >> > coordinated package-wide (this means the systems with per-core P-states > >> > mostly is where the difference can be somewhat visible). > >> > > >> > Since the passive mode hasn't worked with HWP at all, and it is not > >> > going to > >> > the default for HWP systems anyway, I don't see any drawbacks related to > >> > making > >> > this change, so I would consider this as 5.9 material unless there are > >> > any > >> > serious objections. > >> > > >> > Thanks! 
> >> > > >> > --- > >> > Documentation/admin-guide/pm/intel_pstate.rst | 89 +++- > >> > drivers/cpufreq/intel_pstate.c| 141 > >> > -- > >> > 2 files changed, 152 insertions(+), 78 deletions(-) > >> > > >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c > >> > === > >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c > >> > @@ -36,6 +36,7 @@ > >> > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) > >> > > >> > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 > >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 > >> > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 > >> > > >> > #ifdef CONFIG_ACPI > >> > @@ -222,6 +223,7 @@ struct global_params { > >> > * preference/bias > >> > * @epp_saved: Saved EPP/EPB during system suspend or CPU > >> > offline > >> > * operation > >> > + * @epp_cached Cached HWP energy-performance preference > >> > value > >> > * @hwp_req_cached: Cached value of the last HWP Request MSR > >> > * @hwp_cap_cached: Cached value of the last HWP Capabilities MSR > >> > * @last_io_update: Last time when IO wake flag was set > >> > @@ -259,6 +261,7 @@ struct cpudata { > >> > s16 epp_policy; > >> > s16 epp_default; > >> > s16 epp_saved; > >> > + s16 epp_cached; > >> > u64 hwp_req_cached; > >> > u64 hwp_cap_cached; > >> > u64 last_io_update; >
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Wed, Jul 15, 2020 at 10:39 PM Doug Smythies wrote: > > On 2020.07.14 11:16 Rafael J. Wysocki wrote: > > > > From: Rafael J. Wysocki > ... > > Since the passive mode hasn't worked with HWP at all, and it is not going to > > be the default for HWP systems anyway, I don't see any drawbacks related to > > making > > this change, so I would consider this as 5.9 material unless there are any > > serious objections. > > Good point. > Some of the tests I do involve labour-intensive post processing of data. > I want to automate some of that work, and it will take time. > We might be into the 5.9-rc series before I have detailed feedback. > > However, so far: > > Inverse impulse response test [1]: > > High level test, i5-9600K, HWP-passive (this patch), ondemand: > 3101 tests. 0 failures. (GOOD) > > From [1], re-stated: > > . High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. > > (HWP-active - powersave) > > . Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures. > > My version of that cool Alexander named pipe test [2] serialized workflow: > > HWP-passive (this patch), performance: PASS. > > From [2], re-stated, and also re-tested. > HWP-disabled passive - performance: FAIL. But I'm not quite sure how this is related to this patch? This test would still fail without the patch if the kernel was started with intel_pstate=passive in the kernel command line, wouldn't it? > Although, I believe the issue to be EPB management, [3]. Well, that's kind of unexpected. If this really is the case, then it looks like the effect of the EPB setting on the processor is different depending on whether or not HWP is enabled. > And yes, I did see the reply to [3] that came earlier, > and have now re-done the test, with the referenced patch added. > It still is FAIL. I will reply to the [3] thread eventually.
> > [1] https://marc.info/?l=linux-pm=159354421400342=2 > [2] https://marc.info/?l=linux-pm=159155067328641=2 > [3] https://marc.info/?l=linux-pm=159438804230744=2 > > Kernel: > > b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP > if EPP is not supported > 063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled > 730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference > value > bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency > 199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line > 11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5 > > Rules for this work: > > . never use x86_energy_perf_policy. > . For HWP disabled: never change from active to passive or vice versa, but > rather do it via boot. > . after boot always check and reset the various power limit log bits that are > set. > . never compile the kernel (well, until after any tests), which will set > those bits again. > . never run prime95 high heat torture test, which will set those bits again. > . try to never do anything else that will set those bits again. > > To be clear, I do allow changing governors within the context of the above > rules. Thanks for the feedback!
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Wed, 2020-07-15 at 14:35 -0700, Francisco Jerez wrote: > "Rafael J. Wysocki" writes: > > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez <curroje...@riseup.net> wrote: > > > "Rafael J. Wysocki" writes: > > > > From: Rafael J. Wysocki > [...] > I don't think that's accurate. I've looked at hundreds of traces while my series [1] was in control of HWP_REQ_MAX and I've never seen an excursion above the maximum HWP_REQ_MAX control specified by it within a given P-state domain, even while that maximum specified was well into the turbo range. So, yeah, I agree that HWP_REQ_MAX is nothing like a hard limit, particularly when multiple threads are running on the same clock domain, but the processor will still make its best effort to limit the clock frequency to the maximum of the requested maximums, even if it happens to be within the turbo range. That doesn't make it useless. The exact same thing can be said about controlling HWP_REQ_MIN as you're doing now in this revision of your patch, BTW. > If you don't believe me, here is the turbostat sample with maximum Bzy_MHz I get on the computer I'm sitting on right now while compiling a kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
>
> > Core CPU Avg_MHz Busy% Bzy_MHz HWP_REQ    PkgWatt CorWatt
> > -    -   757     27.03 2800    0x         7.1     34.90
> > 0    0   2794    99.77 2800    0x80001c04 7.1     34.90
> > 0    2   83      2.98  2800    0x80001c04
> > 1    1   73      2.60  2800    0x80001c04
> > 1    3   78      2.79  2800    0x80001c04
>
> With the default HWP_REQUEST:
>
> > Core CPU Avg_MHz Busy% Bzy_MHz HWP_REQ    PkgWatt CorWatt
> > -    -   814     27.00 3015    0x         8.4     96.18
> > 0    0   2968    98.24 3021    0x80001f04 8.4     96.18
> > 0    2   84      2.81  2982    0x80001f04
> > 1    1   99      3.34  2961    0x80001f04
> > 1    3   105     3.60  2921    0x80001f04

Correct. In HWP mode it is possible to conditionally lower the limit in the turbo region. In legacy mode you can't, because of the turbo activation ratio.
But what we don't want is to set max and min perf and use them like desired, to run at a P-state overriding HWP, or to limit the range where HWP can't do any meaningful selection. > > Generally, I'm not quite convinced that limiting the max frequency is really the right choice for controlling the processor's power draw on the systems in question. There are other ways to do that, which in theory should be more effective. I mentioned RAPL somewhere in this context and there's the GuC firmware too. > I feel like we've had that conversation before and it's somewhat off-topic so I'll keep it short: Yes, in theory RAPL is more effective than HWP_REQ_MAX as a mechanism to limit the absolute power consumption of the processor package, but that's not the purpose of [1]; its purpose is setting a lower limit to the energy efficiency of the processor when the maximum usable CPU frequency is known (due to the existence of an IO device bottleneck) -- and if the maximum usable CPU frequency is the information we have at hand, controlling the maximum CPU frequency directly is optimal, rather than trying to find the RAPL constraint that achieves the same average frequency by trial and error. Also, in theory, even if you had an oracle to tell you what the appropriate RAPL constraint is, the result would necessarily be more energy-inefficient than controlling the maximum CPU frequency directly, since you're giving the processor additional freedom to run at frequencies above the one you want to average, which is guaranteed to be more energy-inefficient than running at that fixed frequency, assuming we are in the region of convexity of the processor's power curve. > Anyway, if you still have some disagreement on the theoretical details you're more than welcome to bring up the matter on the other thread [1], or accept the invitation for a presentation I sent you months ago... ;)
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes: > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez wrote: >> >> "Rafael J. Wysocki" writes: >> >> > From: Rafael J. Wysocki >> > >> > Allow intel_pstate to work in the passive mode with HWP enabled and >> > make it set the HWP minimum performance limit (HWP floor) to the >> > P-state value given by the target frequency supplied by the cpufreq >> > governor, so as to prevent the HWP algorithm and the CPU scheduler >> > from working against each other, at least when the schedutil governor >> > is in use, and update the intel_pstate documentation accordingly. >> > >> > Among other things, this allows utilization clamps to be taken >> > into account, at least to a certain extent, when intel_pstate is >> > in use and makes it more likely that sufficient capacity for >> > deadline tasks will be provided. >> > >> > After this change, the resulting behavior of an HWP system with >> > intel_pstate in the passive mode should be close to the behavior >> > of the analogous non-HWP system with intel_pstate in the passive >> > mode, except that in the frequency range below the base frequency >> > (ie. the frequency retured by the base_frequency cpufreq attribute >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above >> > the floor P-state set by intel_pstate with or without hardware >> > coordination of P-states among CPUs in the same package. >> > >> > Also note that the setting of the HWP floor may not be taken into >> > account by the processor in the following cases: >> > >> > * For the HWP floor in the range of P-states above the base >> >frequency, referred to as the turbo range, the processor has a >> >license to choose any P-state from that range, either below or >> >above the HWP floor, just like a non-HWP processor in the case >> >when the target P-state falls into the turbo range. 
>> > >> > * If P-states of the CPUs in the same package are coordinated >> > at the hardware level, the processor may choose a P-state >> > above the HWP floor, just like a non-HWP processor in the >> > analogous case. >> > >> > With this change applied, intel_pstate in the passive mode >> > assumes complete control over the HWP request MSR and concurrent >> > changes of that MSR (e.g. via the direct MSR access interface) are >> > overridden by it. >> > >> > Signed-off-by: Rafael J. Wysocki >> > --- >> > >> > This basically unifies the passive mode behavior of intel_pstate for >> > systems >> > with and without HWP enabled. The only case in which there is a difference >> > between the two (after this patch) is below the turbo range, where the HWP >> > algorithm can go above the floor regardless of whether or not P-states are >> > coordinated package-wide (this means it is mostly the systems with per-core >> > P-states where the difference can be somewhat visible). >> > >> > Since the passive mode hasn't worked with HWP at all, and it is not going >> > to be >> > the default for HWP systems anyway, I don't see any drawbacks related to >> > making >> > this change, so I would consider this as 5.9 material unless there are any >> > serious objections. >> > >> > Thanks!
>> > >> > --- >> > Documentation/admin-guide/pm/intel_pstate.rst | 89 +++- >> > drivers/cpufreq/intel_pstate.c| 141 >> > -- >> > 2 files changed, 152 insertions(+), 78 deletions(-) >> > >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c >> > === >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c >> > @@ -36,6 +36,7 @@ >> > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) >> > >> > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 >> > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 >> > >> > #ifdef CONFIG_ACPI >> > @@ -222,6 +223,7 @@ struct global_params { >> > * preference/bias >> > * @epp_saved: Saved EPP/EPB during system suspend or CPU >> > offline >> > * operation >> > + * @epp_cached Cached HWP energy-performance preference >> > value >> > * @hwp_req_cached: Cached value of the last HWP Request MSR >> > * @hwp_cap_cached: Cached value of the last HWP Capabilities MSR >> > * @last_io_update: Last time when IO wake flag was set >> > @@ -259,6 +261,7 @@ struct cpudata { >> > s16 epp_policy; >> > s16 epp_default; >> > s16 epp_saved; >> > + s16 epp_cached; >> > u64 hwp_req_cached; >> > u64 hwp_cap_cached; >> > u64 last_io_update; >> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_ >> > >> > value |= (u64)epp << 24; >> > ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value); >> > + >> > + WRITE_ONCE(cpu_data->epp_cached, epp); >> >> Why introduce a new EPP cache
RE: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On 2020.07.14 11:16 Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki
...
> Since the passive mode hasn't worked with HWP at all, and it is not going to
> be the default for HWP systems anyway, I don't see any drawbacks related to
> making this change, so I would consider this as 5.9 material unless there
> are any serious objections.

Good point.

Some of the tests I do involve labour-intensive post-processing of data. I
want to automate some of that work, and it will take time. We might be into
the 5.9-rc series before I have detailed feedback. However, so far:

Inverse impulse response test [1]:
High level test, i5-9600K, HWP-passive (this patch), ondemand:
3101 tests. 0 failures. (GOOD)

From [1], re-stated:
. High level: i5-9600K: 2453 tests, 60 failures, 2.45% fail rate. (HWP-active - powersave)
. Verify acpi-cpufreq/ondemand works fine: i5-9600K: 8975 tests. 0 failures.

My version of that cool Alexander named pipe test [2] serialized workflow:
HWP-passive (this patch), performance: PASS.

From [2], re-stated, and also re-tested:
HWP-disabled passive - performance: FAIL.

Although, I believe the issue to be EPB management [3]. And yes, I did see
the reply to [3] that came earlier, and have now re-done the test with the
referenced patch added. It still is FAIL. I will reply to the [3] thread
eventually.

[1] https://marc.info/?l=linux-pm&m=159354421400342&w=2
[2] https://marc.info/?l=linux-pm&m=159155067328641&w=2
[3] https://marc.info/?l=linux-pm&m=159438804230744&w=2

Kernel:
b08284a541ad (HEAD -> k58rc5-doug) cpufreq: intel_pstate: Avoid enabling HWP if EPP is not supported
063fd7ccabfe cpufreq: intel_pstate: Implement passive mode with HWP enabled
730ccf5054e9 cpufreq: intel_pstate: Allow raw energy performance preference value
bee36df01c68 cpufreq: intel_pstate: Allow enable/disable energy efficiency
199629d8200e cpufreq: intel_pstate: Fix active mode setting from command line
11ba468877bb (tag: v5.8-rc5, origin/master, origin/HEAD, master) Linux 5.8-rc5

Rules for this work:
. never use x86_energy_perf_policy.
. For HWP disabled: never change from active to passive or vice versa, but rather do it via boot.
. after boot always check and reset the various power limit log bits that are set.
. never compile the kernel (well, until after any tests), which will set those bits again.
. never run prime95 high heat torture test, which will set those bits again.
. try to never do anything else that will set those bits again.

To be clear, I do allow changing governors within the context of the above
rules.

... Doug
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez wrote: > > "Rafael J. Wysocki" writes: > > > From: Rafael J. Wysocki > > > > Allow intel_pstate to work in the passive mode with HWP enabled and > > make it set the HWP minimum performance limit (HWP floor) to the > > P-state value given by the target frequency supplied by the cpufreq > > governor, so as to prevent the HWP algorithm and the CPU scheduler > > from working against each other, at least when the schedutil governor > > is in use, and update the intel_pstate documentation accordingly. > > > > Among other things, this allows utilization clamps to be taken > > into account, at least to a certain extent, when intel_pstate is > > in use and makes it more likely that sufficient capacity for > > deadline tasks will be provided. > > > > After this change, the resulting behavior of an HWP system with > > intel_pstate in the passive mode should be close to the behavior > > of the analogous non-HWP system with intel_pstate in the passive > > mode, except that in the frequency range below the base frequency > > (ie. the frequency returned by the base_frequency cpufreq attribute > > in sysfs on HWP systems) the HWP algorithm is allowed to go above > > the floor P-state set by intel_pstate with or without hardware > > coordination of P-states among CPUs in the same package. > > > > Also note that the setting of the HWP floor may not be taken into > > account by the processor in the following cases: > > > > * For the HWP floor in the range of P-states above the base > > frequency, referred to as the turbo range, the processor has a > > license to choose any P-state from that range, either below or > > above the HWP floor, just like a non-HWP processor in the case > > when the target P-state falls into the turbo range. > > > > * If P-states of the CPUs in the same package are coordinated > > at the hardware level, the processor may choose a P-state > > above the HWP floor, just like a non-HWP processor in the > > analogous case. 
> > > > With this change applied, intel_pstate in the passive mode > > assumes complete control over the HWP request MSR and concurrent > > changes of that MSR (eg. via the direct MSR access interface) are > > overridden by it. > > > > Signed-off-by: Rafael J. Wysocki > > --- > > > > This basically unifies the passive mode behavior of intel_pstate for systems > > with and without HWP enabled. The only case in which there is a difference > > between the two (after this patch) is below the turbo range, where the HWP > > algorithm can go above the floor regardless of whether or not P-states are > > coordinated package-wide (this means the systems with per-core P-states > > are mostly where the difference can be somewhat visible). > > > > Since the passive mode hasn't worked with HWP at all, and it is not going to > > be the default for HWP systems anyway, I don't see any drawbacks related to > > making > > this change, so I would consider this as 5.9 material unless there are any > > serious objections. > > > > Thanks! 
> > > > --- > > Documentation/admin-guide/pm/intel_pstate.rst | 89 +++- > > drivers/cpufreq/intel_pstate.c| 141 > > -- > > 2 files changed, 152 insertions(+), 78 deletions(-) > > > > Index: linux-pm/drivers/cpufreq/intel_pstate.c > > === > > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > > +++ linux-pm/drivers/cpufreq/intel_pstate.c > > @@ -36,6 +36,7 @@ > > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) > > > > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 > > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 > > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 > > > > #ifdef CONFIG_ACPI > > @@ -222,6 +223,7 @@ struct global_params { > > * preference/bias > > * @epp_saved: Saved EPP/EPB during system suspend or CPU > > offline > > * operation > > + * @epp_cached Cached HWP energy-performance preference value > > * @hwp_req_cached: Cached value of the last HWP Request MSR > > * @hwp_cap_cached: Cached value of the last HWP Capabilities MSR > > * @last_io_update: Last time when IO wake flag was set > > @@ -259,6 +261,7 @@ struct cpudata { > > s16 epp_policy; > > s16 epp_default; > > s16 epp_saved; > > + s16 epp_cached; > > u64 hwp_req_cached; > > u64 hwp_cap_cached; > > u64 last_io_update; > > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_ > > > > value |= (u64)epp << 24; > > ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value); > > + > > + WRITE_ONCE(cpu_data->epp_cached, epp); > > Why introduce a new EPP cache variable if there is already > hwp_req_cached? If intel_pstate_set_energy_pref_index() is failing to > update hwp_req_cached maybe we should fix that
Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
"Rafael J. Wysocki" writes: > From: Rafael J. Wysocki > > Allow intel_pstate to work in the passive mode with HWP enabled and > make it set the HWP minimum performance limit (HWP floor) to the > P-state value given by the target frequency supplied by the cpufreq > governor, so as to prevent the HWP algorithm and the CPU scheduler > from working against each other, at least when the schedutil governor > is in use, and update the intel_pstate documentation accordingly. > > Among other things, this allows utilization clamps to be taken > into account, at least to a certain extent, when intel_pstate is > in use and makes it more likely that sufficient capacity for > deadline tasks will be provided. > > After this change, the resulting behavior of an HWP system with > intel_pstate in the passive mode should be close to the behavior > of the analogous non-HWP system with intel_pstate in the passive > mode, except that in the frequency range below the base frequency > (ie. the frequency retured by the base_frequency cpufreq attribute > in sysfs on HWP systems) the HWP algorithm is allowed to go above > the floor P-state set by intel_pstate with or without hardware > coordination of P-states among CPUs in the same package. > > Also note that the setting of the HWP floor may not be taken into > account by the processor in the following cases: > > * For the HWP floor in the range of P-states above the base >frequency, referred to as the turbo range, the processor has a >license to choose any P-state from that range, either below or >above the HWP floor, just like a non-HWP processor in the case >when the target P-state falls into the turbo range. > > * If P-states of the CPUs in the same package are coordinated >at the hardware level, the processor may choose a P-state >above the HWP floor, just like a non-HWP processor in the >analogous case. 
> > With this change applied, intel_pstate in the passive mode > assumes complete control over the HWP request MSR and concurrent > changes of that MSR (eg. via the direct MSR access interface) are > overridden by it. > > Signed-off-by: Rafael J. Wysocki > --- > > This basically unifies the passive mode behavior of intel_pstate for systems > with and without HWP enabled. The only case in which there is a difference > between the two (after this patch) is below the turbo range, where the HWP > algorithm can go above the floor regardless of whether or not P-states are > coordinated package-wide (this means the systems with per-core P-states > are mostly where the difference can be somewhat visible). > > Since the passive mode hasn't worked with HWP at all, and it is not going to > be the default for HWP systems anyway, I don't see any drawbacks related to > making > this change, so I would consider this as 5.9 material unless there are any > serious objections. > > Thanks! > > --- > Documentation/admin-guide/pm/intel_pstate.rst | 89 +++- > drivers/cpufreq/intel_pstate.c| 141 > -- > 2 files changed, 152 insertions(+), 78 deletions(-) > > Index: linux-pm/drivers/cpufreq/intel_pstate.c > === > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > +++ linux-pm/drivers/cpufreq/intel_pstate.c > @@ -36,6 +36,7 @@ > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) > > #define INTEL_CPUFREQ_TRANSITION_LATENCY 2 > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 > > #ifdef CONFIG_ACPI > @@ -222,6 +223,7 @@ struct global_params { > * preference/bias > * @epp_saved: Saved EPP/EPB during system suspend or CPU > offline > * operation > + * @epp_cached Cached HWP energy-performance preference value > * @hwp_req_cached: Cached value of the last HWP Request MSR > * @hwp_cap_cached: Cached value of the last HWP Capabilities MSR > * @last_io_update: Last time when IO wake flag was set > @@ -259,6 +261,7 @@ struct cpudata { > s16 
epp_policy; > s16 epp_default; > s16 epp_saved; > + s16 epp_cached; > u64 hwp_req_cached; > u64 hwp_cap_cached; > u64 last_io_update; > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_ > > value |= (u64)epp << 24; > ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value); > + > + WRITE_ONCE(cpu_data->epp_cached, epp); Why introduce a new EPP cache variable if there is already hwp_req_cached? If intel_pstate_set_energy_pref_index() is failing to update hwp_req_cached maybe we should fix that instead. That will save you a little bit of work in intel_cpufreq_adjust_hwp(). > } else { > if (epp == -EINVAL) > epp = (pref_index - 1) << 2; > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne >
[PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
From: Rafael J. Wysocki

Allow intel_pstate to work in the passive mode with HWP enabled and make it
set the HWP minimum performance limit (HWP floor) to the P-state value given
by the target frequency supplied by the cpufreq governor, so as to prevent
the HWP algorithm and the CPU scheduler from working against each other, at
least when the schedutil governor is in use, and update the intel_pstate
documentation accordingly.

Among other things, this allows utilization clamps to be taken into account,
at least to a certain extent, when intel_pstate is in use and makes it more
likely that sufficient capacity for deadline tasks will be provided.

After this change, the resulting behavior of an HWP system with intel_pstate
in the passive mode should be close to the behavior of the analogous non-HWP
system with intel_pstate in the passive mode, except that in the frequency
range below the base frequency (ie. the frequency returned by the
base_frequency cpufreq attribute in sysfs on HWP systems) the HWP algorithm
is allowed to go above the floor P-state set by intel_pstate with or without
hardware coordination of P-states among CPUs in the same package.

Also note that the setting of the HWP floor may not be taken into account by
the processor in the following cases:

* For the HWP floor in the range of P-states above the base frequency,
  referred to as the turbo range, the processor has a license to choose any
  P-state from that range, either below or above the HWP floor, just like a
  non-HWP processor in the case when the target P-state falls into the turbo
  range.

* If P-states of the CPUs in the same package are coordinated at the
  hardware level, the processor may choose a P-state above the HWP floor,
  just like a non-HWP processor in the analogous case.

With this change applied, intel_pstate in the passive mode assumes complete
control over the HWP request MSR and concurrent changes of that MSR (eg. via
the direct MSR access interface) are overridden by it. 
Signed-off-by: Rafael J. Wysocki
---

This basically unifies the passive mode behavior of intel_pstate for systems
with and without HWP enabled. The only case in which there is a difference
between the two (after this patch) is below the turbo range, where the HWP
algorithm can go above the floor regardless of whether or not P-states are
coordinated package-wide (this means the systems with per-core P-states are
mostly where the difference can be somewhat visible).

Since the passive mode hasn't worked with HWP at all, and it is not going to
be the default for HWP systems anyway, I don't see any drawbacks related to
making this change, so I would consider this as 5.9 material unless there are
any serious objections.

Thanks!

---
 Documentation/admin-guide/pm/intel_pstate.rst |  89 +++-
 drivers/cpufreq/intel_pstate.c                | 141 --
 2 files changed, 152 insertions(+), 78 deletions(-)

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -36,6 +36,7 @@
 #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
 
 #define INTEL_CPUFREQ_TRANSITION_LATENCY	2
+#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
 #define INTEL_CPUFREQ_TRANSITION_DELAY	500
 
 #ifdef CONFIG_ACPI
@@ -222,6 +223,7 @@ struct global_params {
 *			preference/bias
 * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
 *			operation
+ * @epp_cached		Cached HWP energy-performance preference value
 * @hwp_req_cached:	Cached value of the last HWP Request MSR
 * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
 * @last_io_update:	Last time when IO wake flag was set
@@ -259,6 +261,7 @@ struct cpudata {
 	s16	epp_policy;
 	s16	epp_default;
 	s16	epp_saved;
+	s16	epp_cached;
 	u64	hwp_req_cached;
 	u64	hwp_cap_cached;
 	u64	last_io_update;
@@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
 
 		value |= (u64)epp << 24;
 		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
+
+		WRITE_ONCE(cpu_data->epp_cached, epp);
 	} else {
 		if (epp == -EINVAL)
 			epp = (pref_index - 1) << 2;
@@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
 		cpu->epp_default = -EINVAL;
 		cpu->epp_powersave = -EINVAL;
 		cpu->epp_saved = -EINVAL;
+		WRITE_ONCE(cpu->epp_cached, -EINVAL);
 	}
 
 	cpu = all_cpu_data[cpunum];
@@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
 
 static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
 {
-	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
+	if (hwp_active)
+