Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
"Rafael J. Wysocki" writes:

> On Sat, Sep 15, 2018 at 8:53 AM Francisco Jerez wrote:
> [...]
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Sat, Sep 15, 2018 at 8:53 AM Francisco Jerez wrote:
> [...]
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
"Rafael J. Wysocki" writes:

> On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
> [...]
>
> Well, fair enough, but there are two things to consider here: the
> general energy-efficiency problem and the difference made by IOWAIT
> boosting.
>
> If the general energy-efficiency problem had existed for a relatively
> long time but has got worse recently due to the IOWAIT boosting, it may
> still be desirable to get the IOWAIT boosting out of the picture first
> and then get to the general problem.

IMHO what is needed in order to address the IOWAIT boosting energy
efficiency problem is roughly the same as what we need in order to
address the other energy efficiency problem: a mechanism along the lines
of [1] allowing us to determine whether the workload is IO-bound or not.
In the former case IOWAIT boosting won't be able to improve the
performance of the workload, since the limiting factor is the IO
throughput, so it will only increase the energy usage, potentially
exacerbating the bottleneck if the IO device is an integrated GPU.  In
the latter case, where the CPU and the IO devices being waited on are
both underutilized, it makes sense to optimize for low latency more
aggressively (certainly more aggressively than this patch does), which
will increase the utilization of the IO devices until at least one of
them becomes a bottleneck, at which point the throughput of the system
becomes roughly independent of the CPU frequency and we're back in the
former case.

[1] https://patchwork.kernel.org/patch/10312259/

> I'm not sure if that is the case, as my experience with Atoms is
> limited anyway, so please advise.
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
> [...]
>
> Not universally.  IOWAIT boosting helps under roughly the same
> conditions on Atom as it does on big core, so applying this patch will
> necessarily cause regressions too (see my reply from Sep. 3 for some
> numbers), and it won't completely restore the previous behavior, since
> it simply decreases the degree of IOWAIT boosting applied without being
> able to avoid it (cf. the series I'm working on, which does something
> similar to IOWAIT boosting when it's able to determine that the
> workload is actually CPU-bound, preventing energy-inefficient behavior
> for non-CPU-bound workloads that don't benefit from a higher CPU clock
> frequency anyway).

Well, OK.  That doesn't seem to be a clear-cut regression situation,
then, since going back is apparently not desirable.

Or would it restore the previous behavior if we didn't do any IOWAIT
boosting on Atoms at all?

>> That is what Srinivas is trying to address here AFAICS.
>>
>> Now, you seem to be saying that the overall behavior is suboptimal and
>> the IOWAIT boosting doesn't matter that much,
>
> I was just saying that IOWAIT boosting is less than half of the energy
> efficiency problem, and this patch only partially addresses that half
> of the problem.

Well, fair enough, but there are two things to consider here: the
general energy-efficiency problem and the difference made by IOWAIT
boosting.

If the general energy-efficiency problem had existed for a relatively
long time but has got worse recently due to the IOWAIT boosting, it may
still be desirable to get the IOWAIT boosting out of the picture first
and then get to the general problem.

I'm not sure if that is the case, as my experience with Atoms is limited
anyway, so please advise.

Thanks,
Rafael
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
"Rafael J. Wysocki" writes:

> On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> [...]
>> Unfortunately the current controller uses turbo frequently on Atoms
>> for TDP-limited graphics workloads regardless of IOWAIT boosting.
>> IOWAIT boosting simply exacerbated the pre-existing energy efficiency
>> problem.
>
> My current understanding of the issue at hand is that using IOWAIT
> boosting on Atoms is a regression relative to the previous behavior.

Not universally.  IOWAIT boosting helps under roughly the same
conditions on Atom as it does on big core, so applying this patch will
necessarily cause regressions too (see my reply from Sep. 3 for some
numbers), and it won't completely restore the previous behavior, since
it simply decreases the degree of IOWAIT boosting applied without being
able to avoid it (cf. the series I'm working on, which does something
similar to IOWAIT boosting when it's able to determine that the workload
is actually CPU-bound, preventing energy-inefficient behavior for
non-CPU-bound workloads that don't benefit from a higher CPU clock
frequency anyway).

> That is what Srinivas is trying to address here AFAICS.
>
> Now, you seem to be saying that the overall behavior is suboptimal and
> the IOWAIT boosting doesn't matter that much,

I was just saying that IOWAIT boosting is less than half of the energy
efficiency problem, and this patch only partially addresses that half of
the problem.

> so some deeper changes are needed anyway.  That may be the case, but
> if there is a meaningful regression, we should first get back to the
> point where it is not present and then take care of the more general
> problems.
>
> So, I'd like to understand how much of a problem the IOWAIT boosting
> really is in the first place.  If it is significant enough, let's
> address it first, one way or another, and move on to the other problems
> subsequently.

See the Unigine and Gfxbench numbers I provided in my reply from Sep. 3
to get an idea of the magnitude of the IOWAIT boosting problem vs. the
overall energy efficiency problem addressed by my series.

> Thanks,
> Rafael
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> [...]
>
> Unfortunately the current controller uses turbo frequently on Atoms for
> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
> boosting simply exacerbated the pre-existing energy efficiency problem.

My current understanding of the issue at hand is that using IOWAIT
boosting on Atoms is a regression relative to the previous behavior.
That is what Srinivas is trying to address here AFAICS.

Now, you seem to be saying that the overall behavior is suboptimal and
the IOWAIT boosting doesn't matter that much, so some deeper changes are
needed anyway.  That may be the case, but if there is a meaningful
regression, we should first get back to the point where it is not
present and then take care of the more general problems.

So, I'd like to understand how much of a problem the IOWAIT boosting
really is in the first place.  If it is significant enough, let's
address it first, one way or another, and move on to the other problems
subsequently.

Thanks,
Rafael
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Monday, September 3, 2018 5:10:27 PM CEST Eero Tamminen wrote:
> Hi,
>
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
> > As per testing by Eero Tamminen, the results are comparable to the
> > patchset
> > https://patchwork.kernel.org/patch/10312259/
> > But he has to watch results for several days to check the trend.
>
> It's close, but there is some gap compared to Francisco's version.

It is unclear what exactly you are referring to without a link to the
patch.

Thanks,
Rafael
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Friday, August 31, 2018 7:28:51 PM CEST Srinivas Pandruvada wrote:
> The io_boost mechanism, applied when the scheduler's update-util
> callback indicates a wake from IO, was intended as a short-term boost
> to improve disk IO performance.  But once the i915 graphics driver
> changed to io_schedule_timeout() from schedule_timeout() while waiting
> for a response from the GPU, this caused constant IO boost, causing
> intel_pstate to boost to turbo constantly.  This has a bad side effect
> on TDP-limited Atom platforms.  The following two links show the boost
> and frequency plots for a GPU test called triangle_benchmark:
> https://bugzilla.kernel.org/attachment.cgi?id=278091
> https://bugzilla.kernel.org/attachment.cgi?id=278093
> This type of behavior was observed with many graphics tests and during
> regular use of such platforms with a graphical interface.
>
> The fix in this patch is to optimize the io boost by:
> - Limiting the maximum boost to the base frequency on TDP-limited Atom
> platforms, with the max limit as 1-core turbo for the rest of the
> non-HWP platforms (same as the current algorithm).
> - Multilevel gradual boost: start with increment = half of max boost
> and increase to max on the subsequent IO wakes.
> - Once max is reached, subsequent IO wakes don't cause boost on
> TDP-limited Atom platforms, on the assumption that the target CPU has
> already built up enough load to run at a higher P-state, and the use of
> the average frequency in the current algorithm will also help to avoid
> reducing it drastically.  For other platforms this is not limited,
> similar to the current algorithm.
>
> With this fix the resultant plots show the reduced average frequency,
> and it also causes up to 10% improvement in some graphics tests on an
> Atom (Broxton) platform.
> https://bugzilla.kernel.org/attachment.cgi?id=278095
> https://bugzilla.kernel.org/attachment.cgi?id=278097
>
> As per testing by Eero Tamminen, the results are comparable to the
> patchset
> https://patchwork.kernel.org/patch/10312259/
> But he has to watch results for several days to check the trend.
>
> Since here the boost is getting limited to turbo and non-turbo, we
> would need some way to adjust the fractions corresponding to max
> non-turbo as well.  It is much easier to use the actual P-state limits
> for boost instead of fractions.  So here the P-state io boost limit is
> applied on top of the P-state limit calculated via the current
> algorithm, removing the current io_wait boost calculation using
> fractions.
>
> Since we prefer to use a common algorithm for all processor platforms,
> this change was tested on other client and server platforms as well.
> All results were within the margin of error.  Results:
> https://bugzilla.kernel.org/attachment.cgi?id=278149
>
> To identify TDP-limited platforms, a new callback boost_limited() is
> added, which will set a per-cpudata field called iowait_boost_limited
> to 1.  Currently this is set for Atom platforms only.
>
> Tested-by: Eero Tamminen
> Signed-off-by: Srinivas Pandruvada

Let me say I would do this differently in a couple of ways.

First off, the patch makes at least two different changes in one go,
which I don't quite agree with: it changes the way iowait boost works
for everybody and introduces differences between Atom and Core.  Doing
them both in one patch is quite questionable IMO.

It is not particularly clear that the iowait boost mechanism on
non-Atom needs to be changed at all, and if so, then why.  Yes, you have
results showing that the change doesn't regress the other systems, but
since we are going to distinguish between Atom and non-Atom anyway, why
change the others?

Also, it is not particularly clear to me if there's any benefit from
doing the iowait boosting on Atoms at all.
Wouldn't it be better to restrict the iowait boosting to Core CPUs only
and just leave Atoms alone?

On top of that, I don't quite like the new code organization, but let me
comment on the details below.

> ---
>  drivers/cpufreq/intel_pstate.c | 112 +++--
>  1 file changed, 92 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 1f2ce2f57842..15d9d5483d85 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -195,7 +195,11 @@ struct global_params {
>   * @policy:		CPUFreq policy value
>   * @update_util:	CPUFreq utility callback information
>   * @update_util_set:	CPUFreq utility callback is set
> - * @iowait_boost:	iowait-related boost fraction
> + * @iowait_boost:	iowait-related boost P-state
> + * @iowait_boost_active:	iowait boost processing is active
> + * @iowait_boost_max:	Max boost P-state to apply
> + * @iowait_boost_increment:	increment to last boost P-state

I don't think you need this field.  It just takes space for almost no
speed benefit.

> + * @iowait_boost_limited:	If set, give up boost once the max boost
> state is reached

This field doesn't appear to be necessary either.  The value of
iowait_boost_max depends on it, but it looks
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
Srinivas Pandruvada writes: > [...] > >> > > >> > > This patch causes a number of statistically significant >> > > regressions >> > > (with significance of 1%) on the two systems I've tested it >> > > on. On >> > > my >> > >> > Sure. These patches are targeted to Atom clients where some of >> > these >> > server like workload may have some minor regression on few watts >> > TDP >> > parts. >> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite, >> nor >> the 10% regression of warsaw qualify as small. And most of the test >> cases on the list of regressions aren't exclusively server-like, if >> at >> all. Warsaw, gtkperf, jxrendermark and lightsmark are all graphics >> benchmarks -- Latency is as important if not more for interactive >> workloads than it is for server workloads. In the case of a conflict >> like the one we're dealing with right now between optimizing for >> throughput (e.g. for the maximum number of requests per second) and >> optimizing for latency (e.g. for the minimum request duration), you >> are >> more likely to be concerned about the former than about the latter in >> a >> server setup. > > Eero, > Please add your test results here. > > No matter which algorithm you do, there will be variations. So you have > to look at the platforms which you are targeting. For this platform > number one item is use of less turbo and hope you know why? Unfortunately the current controller uses turbo frequently on Atoms for TDP-limited graphics workloads regardless of IOWAIT boosting. IOWAIT boosting simply exacerbated the pre-existing energy efficiency problem. > On this platform GFX patch caused this issue as it was submitted after > io boost patchset. So rather that should be reverted first before > optimizing anything. > > > >> >> > But weighing against reduced TURBO usage (which is enough trigger) >> > and >> > improvement in tests done by Eero (which was primary complaint to >> > us). >> > >> > It is trivial patch. 
Yes, the patch is not perfect and doesn't >> > close >> > doors for any improvements. >> > >> >> It's sort of self-contained because it's awfully incomplete. >> >> > I see your idea, but how to implement in acceptable way is a >> > challenge. >> >> Main challenge was getting the code to work without regressions in >> latency-sensitive workloads, which I did, because you told me that it >> wasn't acceptable for it to cause any regressions on the Phoronix >> daily-system-tracker, whether latency-bound or not. > Yes, because your intention was to have a global change not a low power > Atom specific, > My intention was to target low-power Atoms only since the first day we discussed this problem. The cover letter of the series I sent and the commit messages make this fairly clear. >> What is left in >> order to address Peter's concerns is for the most part plumbing, >> that's >> guaranteed not to have any functional impact on the heuristic. The >> fact >> that we don't expect it to change the performance of the system makes >> it >> substantially less time-consuming to validate than altering the >> performance trade-offs of the heuristic dynamically in order to avoid >> regressions (which is what has kept my systems busy most of the time >> lately). Seems like my series, even in its current state without the >> changes requested by Peter is closer to being ready for production >> than >> this patch is. > > Sorry, Not at all. We call such patches as experimental series. The numbers speak otherwise. > You caused 100% regression to idle power. There is no version 2 after > that, even if you fixed locally even to look. > I did post a link to a v2 that fixed the idle power issue on the v1 thread, but I didn't intend to send it for review yet. I'll send it out once I've fully taken into account Peter's feedback. > Thanks, > Srinivas
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
[...] > > > > > > This patch causes a number of statistically significant > > > regressions > > > (with significance of 1%) on the two systems I've tested it > > > on. On > > > my > > > > Sure. These patches are targeted to Atom clients where some of > > these > > server like workload may have some minor regression on few watts > > TDP > > parts. > > Neither the 36% regression of fs-mark, the 21% regression of sqlite, > nor > the 10% regression of warsaw qualify as small. And most of the test > cases on the list of regressions aren't exclusively server-like, if > at > all. Warsaw, gtkperf, jxrendermark and lightsmark are all graphics > benchmarks -- Latency is as important if not more for interactive > workloads than it is for server workloads. In the case of a conflict > like the one we're dealing with right now between optimizing for > throughput (e.g. for the maximum number of requests per second) and > optimizing for latency (e.g. for the minimum request duration), you > are > more likely to be concerned about the former than about the latter in > a > server setup. Eero, Please add your test results here. No matter which algorithm you do, there will be variations. So you have to look at the platforms which you are targeting. For this platform number one item is use of less turbo and hope you know why? On this platform GFX patch caused this issue as it was submitted after io boost patchset. So rather that should be reverted first before optimizing anything. > > > But weighing against reduced TURBO usage (which is enough trigger) > > and > > improvement in tests done by Eero (which was primary complaint to > > us). > > > > It is trivial patch. Yes, the patch is not perfect and doesn't > > close > > doors for any improvements. > > > > It's sort of self-contained because it's awfully incomplete. > > > I see your idea, but how to implement in acceptable way is a > > challenge.
> > Main challenge was getting the code to work without regressions in > latency-sensitive workloads, which I did, because you told me that it > wasn't acceptable for it to cause any regressions on the Phoronix > daily-system-tracker, whether latency-bound or not. Yes, because your intention was to have a global change not a low power Atom specific, > What is left in > order to address Peter's concerns is for the most part plumbing, > that's > guaranteed not to have any functional impact on the heuristic. The > fact > that we don't expect it to change the performance of the system makes > it > substantially less time-consuming to validate than altering the > performance trade-offs of the heuristic dynamically in order to avoid > regressions (which is what has kept my systems busy most of the time > lately). Seems like my series, even in its current state without the > changes requested by Peter is closer to being ready for production > than > this patch is. Sorry, Not at all. We call such patches as experimental series. You caused 100% regression to idle power. There is no version 2 after that, even if you fixed locally even to look. Thanks, Srinivas
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
Srinivas Pandruvada writes: > On Mon, 2018-09-03 at 23:53 -0700, Francisco Jerez wrote: >> Eero Tamminen writes: >> >> > Hi, >> > >> > On 31.08.2018 20:28, Srinivas Pandruvada wrote: >> > ... >> > > As per testing Eero Tamminen, the results are comparable to the >> > > patchset >> > > https://patchwork.kernel.org/patch/10312259/ >> > > But he has to watch results for several days to check trend. >> > >> > It's close, but there is some gap compared to Francisco's version. >> > >> > (Because of the large variance on TDP limited devices, and factors >> > causing extra perf differences e.g. between boots, it's hard to >> > give >> > exact number without having trends from several days / weeks. I >> > would >> > also need new version of Fransisco's patch set that applies to >> > latest >> > kernel like yours does.) >> > >> > >> > > Since here boost is getting limited to turbo and non turbo, we >> > > need some >> > > ways to adjust the fractions corresponding to max non turbo as >> > > well. It >> > > is much easier to use the actual P-state limits for boost instead >> > > of >> > > fractions. So here P-state io boost limit is applied on top of >> > > the >> > > P-state limit calculated via current algorithm by removing >> > > current >> > > io_wait boost calculation using fractions. >> > > >> > > Since we prefer to use common algorithm for all processor >> > > platforms, this >> > > change was tested on other client and sever platforms as well. >> > > All results >> > > were within the margin of errors. Results: >> > > https://bugzilla.kernel.org/attachment.cgi?id=278149 >> > >> > Good. >> > >> > Francisco, how well the listed PTS tests cover latency bound cases >> > you >> > were concerned about? 
[1] >> > >> > >> >- Eero >> > >> > [1] Fransisco was concerned that the patch: >> > * trade-off might regress latency bound cases (which I haven't >> > tested, I >> > tested only 3D throughput), and >> > * that it addressed only other of the sources of energy >> > inefficiencies >> > he had identified (which could explain slightly better 3D results >> > from >> > his more complex patch set). >> >> This patch causes a number of statistically significant regressions >> (with significance of 1%) on the two systems I've tested it on. On >> my > Sure. These patches are targeted to Atom clients where some of these > server like workload may have some minor regression on few watts TDP > parts. Neither the 36% regression of fs-mark, the 21% regression of sqlite, nor the 10% regression of warsaw qualify as small. And most of the test cases on the list of regressions aren't exclusively server-like, if at all. Warsaw, gtkperf, jxrendermark and lightsmark are all graphics benchmarks -- Latency is as important if not more for interactive workloads than it is for server workloads. In the case of a conflict like the one we're dealing with right now between optimizing for throughput (e.g. for the maximum number of requests per second) and optimizing for latency (e.g. for the minimum request duration), you are more likely to be concerned about the former than about the latter in a server setup. > But weighing against reduced TURBO usage (which is enough trigger) and > improvement in tests done by Eero (which was primary complaint to us). > > It is trivial patch. Yes, the patch is not perfect and doesn't close > doors for any improvements. > It's sort of self-contained because it's awfully incomplete. > I see your idea, but how to implement in acceptable way is a challenge. 
Main challenge was getting the code to work without regressions in latency-sensitive workloads, which I did, because you told me that it wasn't acceptable for it to cause any regressions on the Phoronix daily-system-tracker, whether latency-bound or not. What is left in order to address Peter's concerns is for the most part plumbing, that's guaranteed not to have any functional impact on the heuristic. The fact that we don't expect it to change the performance of the system makes it substantially less time-consuming to validate than altering the performance trade-offs of the heuristic dynamically in order to avoid regressions (which is what has kept my systems busy most of the time lately). Seems like my series, even in its current state without the changes requested by Peter is closer to being ready for production than this patch is. > So ultimately we will get there, but will require some time compared > with other high priority items. > > > Thanks > Srinivas > >> CHV N3050: >> >> > phoronix/fs- >> > mark/test=0: >> > XXX ±7.25% x34 -> XXX ±7.00% x39 d=-36.85% ±5.91% p=0.00% >> > phoronix/sqlite: >> > XXX ±1.86% x34 -> XXX ±1.88% x39 d=-21.73% >> > ±1.66% p=0.00% >> > warsow/benchsow: >> > XXX ±1.25% x34 -> XXX ±1.95% x39 d=-10.83% >> >
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Mon, 2018-09-03 at 23:53 -0700, Francisco Jerez wrote: > Eero Tamminen writes: > > > Hi, > > > > On 31.08.2018 20:28, Srinivas Pandruvada wrote: > > ... > > > As per testing Eero Tamminen, the results are comparable to the > > > patchset > > > https://patchwork.kernel.org/patch/10312259/ > > > But he has to watch results for several days to check trend. > > > > It's close, but there is some gap compared to Francisco's version. > > > > (Because of the large variance on TDP limited devices, and factors > > causing extra perf differences e.g. between boots, it's hard to > > give > > exact number without having trends from several days / weeks. I > > would > > also need new version of Fransisco's patch set that applies to > > latest > > kernel like yours does.) > > > > > > > Since here boost is getting limited to turbo and non turbo, we > > > need some > > > ways to adjust the fractions corresponding to max non turbo as > > > well. It > > > is much easier to use the actual P-state limits for boost instead > > > of > > > fractions. So here P-state io boost limit is applied on top of > > > the > > > P-state limit calculated via current algorithm by removing > > > current > > > io_wait boost calculation using fractions. > > > > > > Since we prefer to use common algorithm for all processor > > > platforms, this > > > change was tested on other client and sever platforms as well. > > > All results > > > were within the margin of errors. Results: > > > https://bugzilla.kernel.org/attachment.cgi?id=278149 > > > > Good. > > > > Francisco, how well the listed PTS tests cover latency bound cases > > you > > were concerned about? 
[1] > > > > > > - Eero > > > > [1] Fransisco was concerned that the patch: > > * trade-off might regress latency bound cases (which I haven't > > tested, I > > tested only 3D throughput), and > > * that it addressed only other of the sources of energy > > inefficiencies > > he had identified (which could explain slightly better 3D results > > from > > his more complex patch set). > > This patch causes a number of statistically significant regressions > (with significance of 1%) on the two systems I've tested it on. On > my Sure. These patches are targeted to Atom clients where some of these server like workload may have some minor regression on few watts TDP parts. But weighing against reduced TURBO usage (which is enough trigger) and improvement in tests done by Eero (which was primary complaint to us). It is trivial patch. Yes, the patch is not perfect and doesn't close doors for any improvements. I see your idea, but how to implement in acceptable way is a challenge. So ultimately we will get there, but will require some time compared with other high priority items. 
Thanks Srinivas > CHV N3050: > > > phoronix/fs- > > mark/test=0: > > XXX ±7.25% x34 -> XXX ±7.00% x39 d=-36.85% ±5.91% p=0.00% > > phoronix/sqlite: > > XXX ±1.86% x34 -> XXX ±1.88% x39 d=-21.73% > > ±1.66% p=0.00% > > warsow/benchsow: > > XXX ±1.25% x34 -> XXX ±1.95% x39 d=-10.83% > > ±1.53% p=0.00% > > phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read > > Performance: XXX ±1.70% x31 -> XXX ±1.02% x34 d=-7.39% > > ±1.36% p=0.00% > > phoronix/gtkperf/gtk- > > test=GtkComboBox: XXX > > ±1.15% x13 -> XXX ±1.59% x14 d=-5.37% ±1.35% p=0.00% > > lightsmark: > > XXX ±1.45% x34 -> XXX ±0.97% x41 d=-4.66% > > ±1.19% p=0.00% > > jxrendermark/rendering-test=Transformed Blit Bilinear/rendering- > > size=128x128: XXX ±1.04% x31 -> XXX ±1.04% x39 d=-4.58% > > ±1.01% p=0.00% > > jxrendermark/rendering-test=Linear Gradient Blend/rendering- > > size=128x128: XXX ±0.12% x31 -> XXX ±0.19% x39 d=-3.60% > > ±0.16% p=0.00% > > dbench/client- > > count=1: XX > > X ±0.50% x34 -> XXX ±0.50% x39 d=-2.51% ±0.49% p=0.00% > > On my BXT J3455: > > > fs- > > mark/test=0: > > XXX ±3.04% x6 -> XXX ±3.05% x9 d=-15.96% > > ±2.76% p=0.00% > > sqlite: > > XXX ±2.54% x6 -> XXX ±2.72% x9 d=-12.42% > > ±2.44% p=0.00% > > dbench/client- > > count=1: XX > > X ±0.42% x6 -> XXX ±0.36% x9 d=-6.52% ±0.37% p=0.00% > > dbench/client- > > count=2: XX > > X ±0.26% x6 -> XXX ±0.33% x9 d=-5.22% ±0.29% p=0.00% > > dbench/client- > > count=3: XX > > X ±0.34% x6 -> XXX ±0.53% x9 d=-2.92% ±0.45% p=0.00% > > x11perf/test=500px Compositing From
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
Eero Tamminen writes: > Hi, > > On 31.08.2018 20:28, Srinivas Pandruvada wrote: > ... >> As per testing Eero Tamminen, the results are comparable to the patchset >> https://patchwork.kernel.org/patch/10312259/ >> But he has to watch results for several days to check trend. > > It's close, but there is some gap compared to Francisco's version. > > (Because of the large variance on TDP limited devices, and factors > causing extra perf differences e.g. between boots, it's hard to give > exact number without having trends from several days / weeks. I would > also need new version of Fransisco's patch set that applies to latest > kernel like yours does.) > > >> Since here boost is getting limited to turbo and non turbo, we need some >> ways to adjust the fractions corresponding to max non turbo as well. It >> is much easier to use the actual P-state limits for boost instead of >> fractions. So here P-state io boost limit is applied on top of the >> P-state limit calculated via current algorithm by removing current >> io_wait boost calculation using fractions. >> >> Since we prefer to use common algorithm for all processor platforms, this >> change was tested on other client and sever platforms as well. All results >> were within the margin of errors. Results: >> https://bugzilla.kernel.org/attachment.cgi?id=278149 > > Good. > > Francisco, how well the listed PTS tests cover latency bound cases you > were concerned about? [1] > > > - Eero > > [1] Fransisco was concerned that the patch: > * trade-off might regress latency bound cases (which I haven't tested, I > tested only 3D throughput), and > * that it addressed only other of the sources of energy inefficiencies > he had identified (which could explain slightly better 3D results from > his more complex patch set). This patch causes a number of statistically significant regressions (with significance of 1%) on the two systems I've tested it on. 
On my CHV N3050: | phoronix/fs-mark/test=0: XXX ±7.25% x34 -> XXX ±7.00% x39 d=-36.85% ±5.91% p=0.00% | phoronix/sqlite: XXX ±1.86% x34 -> XXX ±1.88% x39 d=-21.73% ±1.66% p=0.00% | warsow/benchsow: XXX ±1.25% x34 -> XXX ±1.95% x39 d=-10.83% ±1.53% p=0.00% | phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read Performance: XXX ±1.70% x31 -> XXX ±1.02% x34 d=-7.39% ±1.36% p=0.00% | phoronix/gtkperf/gtk-test=GtkComboBox: XXX ±1.15% x13 -> XXX ±1.59% x14 d=-5.37% ±1.35% p=0.00% | lightsmark: XXX ±1.45% x34 -> XXX ±0.97% x41 d=-4.66% ±1.19% p=0.00% | jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: XXX ±1.04% x31 -> XXX ±1.04% x39 d=-4.58% ±1.01% p=0.00% | jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128: XXX ±0.12% x31 -> XXX ±0.19% x39 d=-3.60% ±0.16% p=0.00% | dbench/client-count=1: XXX ±0.50% x34 -> XXX ±0.50% x39 d=-2.51% ±0.49% p=0.00% On my BXT J3455: | fs-mark/test=0: XXX ±3.04% x6 -> XXX ±3.05% x9 d=-15.96% ±2.76% p=0.00% | sqlite: XXX ±2.54% x6 -> XXX ±2.72% x9 d=-12.42% ±2.44% p=0.00% | dbench/client-count=1: XXX ±0.42% x6 -> XXX ±0.36% x9 d=-6.52% ±0.37% p=0.00% | dbench/client-count=2: XXX ±0.26% x6 -> XXX ±0.33% x9 d=-5.22% ±0.29% p=0.00% | dbench/client-count=3: XXX ±0.34% x6 -> XXX ±0.53% x9 d=-2.92% ±0.45% p=0.00% | x11perf/test=500px Compositing From Pixmap To Window: XXX ±2.29% x16 -> XXX ±2.11% x19 d=-2.69% ±2.16% p=0.09% | lightsmark: XXX ±0.44% x6 -> XXX ±0.33% x9 d=-1.76% ±0.37% p=0.00% | j2dbench/rendering-test=Vector Graphics Rendering: XXX ±1.18% x16 -> XXX ±1.82% x19 d=-1.71% ±1.54% p=0.26% | gtkperf/gtk-test=GtkComboBox: XXX ±0.37% x6 -> XXX ±0.45% x9 d=-0.95% ±0.42% p=0.08% | jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: XXX ±0.21% x3 -> XXX ±0.23% x6 d=-0.87% ±0.22% p=0.08% This is not surprising given that the patch is making a hard trade-off between latency and energy efficiency without considering whether the workload is IO- or latency-bound, which is 
the reason why the series I submitted earlier [1] to address this problem implemented an IO utilization statistic in order to determine
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
On Mon, 2018-09-03 at 18:10 +0300, Eero Tamminen wrote: > Hi, > > On 31.08.2018 20:28, Srinivas Pandruvada wrote: > ... > > As per testing Eero Tamminen, the results are comparable to the > > patchset > > https://patchwork.kernel.org/patch/10312259/ > > But he has to watch results for several days to check trend. > > It's close, but there is some gap compared to Francisco's version. > > (Because of the large variance on TDP limited devices, and factors > causing extra perf differences e.g. between boots, it's hard to give > exact number without having trends from several days / weeks. I > would > also need new version of Fransisco's patch set that applies to > latest > kernel like yours does.) > > > > Since here boost is getting limited to turbo and non turbo, we need > > some > > ways to adjust the fractions corresponding to max non turbo as > > well. It > > is much easier to use the actual P-state limits for boost instead > > of > > fractions. So here P-state io boost limit is applied on top of the > > P-state limit calculated via current algorithm by removing current > > io_wait boost calculation using fractions. > > > > Since we prefer to use common algorithm for all processor > > platforms, this > > change was tested on other client and sever platforms as well. All > > results > > were within the margin of errors. Results: > > https://bugzilla.kernel.org/attachment.cgi?id=278149 > > Good. > > Francisco, how well the listed PTS tests cover latency bound cases > you > were concerned about? [1] > > > - Eero > > [1] Fransisco was concerned that the patch: > * trade-off might regress latency bound cases (which I haven't > tested, I > tested only 3D throughput), and > * that it addressed only other of the sources of energy > inefficiencies > he had identified (which could explain slightly better 3D results > from > his more complex patch set). It is always possible to submit an improvement patch later. At least this patch will reduce the usage of turbo. Thanks, Srinivas
Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
Hi, On 31.08.2018 20:28, Srinivas Pandruvada wrote: ... As per testing Eero Tamminen, the results are comparable to the patchset https://patchwork.kernel.org/patch/10312259/ But he has to watch results for several days to check trend. It's close, but there is some gap compared to Francisco's version. (Because of the large variance on TDP-limited devices, and factors causing extra perf differences e.g. between boots, it's hard to give an exact number without having trends from several days / weeks. I would also need a new version of Francisco's patch set that applies to the latest kernel like yours does.) Since here boost is getting limited to turbo and non turbo, we need some ways to adjust the fractions corresponding to max non turbo as well. It is much easier to use the actual P-state limits for boost instead of fractions. So here P-state io boost limit is applied on top of the P-state limit calculated via current algorithm by removing current io_wait boost calculation using fractions. Since we prefer to use common algorithm for all processor platforms, this change was tested on other client and server platforms as well. All results were within the margin of error. Results: https://bugzilla.kernel.org/attachment.cgi?id=278149 Good. Francisco, how well do the listed PTS tests cover the latency-bound cases you were concerned about? [1] - Eero [1] Francisco was concerned that the patch: * trade-off might regress latency-bound cases (which I haven't tested, I tested only 3D throughput), and * that it addressed only one of the sources of energy inefficiencies he had identified (which could explain slightly better 3D results from his more complex patch set).
[PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode
The io_boost mechanism, applied when the scheduler's update_util callback indicates a wake from IO, was intended as a short-term boost to improve disk IO performance. But once the i915 graphics driver switched from schedule_timeout() to io_schedule_timeout() while waiting for a response from the GPU, the boost became constant, causing intel_pstate to continuously boost to turbo. This has a bad side effect on TDP-limited Atom platforms. The following two links show the boost and frequency plots for a GPU test called triangle_benchmark: https://bugzilla.kernel.org/attachment.cgi?id=278091 https://bugzilla.kernel.org/attachment.cgi?id=278093 The same behavior was observed with many graphics tests and during regular use of such platforms with a graphical interface. The fix in this patch is to optimize the IO boost by: - Limiting the maximum boost to the base frequency on TDP-limited Atom platforms, and to the 1-core turbo limit on the rest of the non-HWP platforms (same as the current algorithm). - Multilevel gradual boost: start with increment = half of the max boost and increase to the max on subsequent IO wakes. - Once the max is reached, subsequent IO wakes don't cause further boost on TDP-limited Atom platforms, on the assumption that the target CPU has already built up enough load to run at a higher P-state; the use of average frequency in the current algorithm also helps avoid a drastic reduction. On other platforms this is not limited, similar to the current algorithm. With this fix the resulting plots show a reduced average frequency, and up to 10% improvement in some graphics tests on an Atom (Broxton) platform: https://bugzilla.kernel.org/attachment.cgi?id=278095 https://bugzilla.kernel.org/attachment.cgi?id=278097 As per testing by Eero Tamminen, the results are comparable to the patchset https://patchwork.kernel.org/patch/10312259/ But he has to watch results for several days to check the trend.
Since the boost here is limited to turbo and non-turbo levels, we would need some way to adjust the fractions corresponding to max non-turbo as well. It is much easier to use actual P-state limits for the boost instead of fractions. So here the P-state IO boost limit is applied on top of the P-state limit calculated via the current algorithm, replacing the current fraction-based io_wait boost calculation. Since we prefer a common algorithm for all processor platforms, this change was tested on other client and server platforms as well. All results were within the margin of error. Results: https://bugzilla.kernel.org/attachment.cgi?id=278149

To identify TDP-limited platforms, a new callback boost_limited() is added, which sets a per-cpudata field called iowait_boost_limited to 1. Currently this is set only for Atom platforms.

Tested-by: Eero Tamminen
Signed-off-by: Srinivas Pandruvada
---
 drivers/cpufreq/intel_pstate.c | 112 +++--
 1 file changed, 92 insertions(+), 20 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 1f2ce2f57842..15d9d5483d85 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -195,7 +195,11 @@ struct global_params {
  * @policy:		CPUFreq policy value
  * @update_util:	CPUFreq utility callback information
  * @update_util_set:	CPUFreq utility callback is set
- * @iowait_boost:	iowait-related boost fraction
+ * @iowait_boost:	iowait-related boost P-state
+ * @iowait_boost_active:	iowait boost processing is active
+ * @iowait_boost_max:	Max boost P-state to apply
+ * @iowait_boost_increment:	increment to last boost P-state
+ * @iowait_boost_limited:	If set, give up boost once max boost state is reached
  * @last_update:	Time of the last update.
  * @pstate:		Stores P state limits for this CPU
  * @vid:		Stores VID limits for this CPU
@@ -254,6 +258,10 @@ struct cpudata {
 	bool valid_pss_table;
 #endif
 	unsigned int iowait_boost;
+	unsigned int iowait_boost_active;
+	int iowait_boost_max;
+	int iowait_boost_increment;
+	int iowait_boost_limited;
 	s16 epp_powersave;
 	s16 epp_policy;
 	s16 epp_default;
@@ -276,6 +284,7 @@ static struct cpudata **all_cpu_data;
  * @get_scaling:	Callback to get frequency scaling factor
  * @get_val:		Callback to convert P state to actual MSR write value
  * @get_vid:		Callback to get VID data for Atom platforms
+ * @boost_limited:	Callback to get max boost P-state, when applicable
  *
  * Core and Atom CPU models have different way to get P State limits. This
  * structure is used to store those callbacks.
@@ -289,6 +298,7 @@ struct pstate_funcs {
 	int (*get_aperf_mperf_shift)(void);
 	u64 (*get_val)(struct cpudata*, int pstate);
 	void (*get_vid)(struct cpudata *);
+	void (*boost_limited)(struct cpudata *cpudata);
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
 
@@ -1251,6 +1261,11 @@ static void atom_get_vid(struct cpudata *cpudata)