Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-17 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Sat, Sep 15, 2018 at 8:53 AM Francisco Jerez  wrote:
>>
>> "Rafael J. Wysocki"  writes:
>>
>> > On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
>> >>
>> >> "Rafael J. Wysocki"  writes:
>> >>
>> >> > On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
>> >> >
>> >> >> Srinivas Pandruvada  writes:
>> >> >>=20
>> >> >> > [...]
>> >> >> >
>> >> >> >> > >=3D20
>> >> >> >> > > This patch causes a number of statistically significant
>> >> >> >> > > regressions
>> >> >> >> > > (with significance of 1%) on the two systems I've tested it
>> >> >> >> > > on.  On
>> >> >> >> > > my
>> >> >> >> >=3D20
>> >> >> >> > Sure. These patches are targeted to Atom clients where some of
>> >> >> >> > these
>> >> >> >> > server like workload may have some minor regression on few watts
>> >> >> >> > TDP
>> >> >> >> > parts.
>> >> >> >>=3D20
>> >> >> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
>> >> >> >> nor
>> >> >> >> the 10% regression of warsaw qualify as small.  And most of the test
>> >> >> >> cases on the list of regressions aren't exclusively server-like, if
>> >> >> >> at
>> >> >> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
>> >> >> >> benchmarks -- Latency is as important if not more for interactive
>> >> >> >> workloads than it is for server workloads.  In the case of a 
>> >> >> >> conflict
>> >> >> >> like the one we're dealing with right now between optimizing for
>> >> >> >> throughput (e.g. for the maximum number of requests per second) and
>> >> >> >> optimizing for latency (e.g. for the minimum request duration), you
>> >> >> >> are
>> >> >> >> more likely to be concerned about the former than about the latter 
>> >> >> >> in
>> >> >> >> a
>> >> >> >> server setup.
>> >> >> >
>> >> >> > Eero,
>> >> >> > Please add your test results here.
>> >> >> >
>> >> >> > No matter which algorithm you do, there will be variations. So you 
>> >> >> > have
>> >> >> > to look at the platforms which you are targeting. For this 
>> >> >> > platform=3D=
>> >> 20
>> >> >> > number one item is use of less turbo and hope you know why?
>> >> >>=20
>> >> >> Unfortunately the current controller uses turbo frequently on Atoms for
>> >> >> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
>> >> >> boosting simply exacerbated the pre-existing energy efficiency problem.
>> >> >
>> >> > My current understanding of the issue at hand is that using IOWAIT 
>> >> > boosti=
>> >> ng
>> >> > on Atoms is a regression relative to the previous behavior.
>> >>
>> >> Not universally.  IOWAIT boosting helps under roughly the same
>> >> conditions on Atom as it does on big core, so applying this patch will
>> >> necessarily cause regressions too (see my reply from Sep. 3 for some
>> >> numbers), and won't completely restore the previous behavior since it
>> >> simply decreases the degree of IOWAIT boosting applied without being
>> >> able to avoid it (c.f. the series I'm working on that does something
>> >> similar to IOWAIT boosting when it's able to determine it's actually
>> >> CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
>> >> workloads that don't benefit from a higher CPU clock frequency anyway).
>> >
>> > Well, OK.  That doesn't seem to be a clear-cut regression situation, then,
>> > since getting back is not desirable, apparently.
>> >
>> > Or would it restore the previous behavior if we didn't do any IOWAIT
>> > boosting on Atoms at all?
>> >
>> >> > That is what Srinivas is trying to address here AFAICS.
>> >> >
>> >> > Now, you seem to be saying that the overall behavior is suboptimal and 
>> >> > the
>> >> > IOWAIT boosting doesn't matter that much,
>> >>
>> >> I was just saying that IOWAIT boosting is less than half of the energy
>> >> efficiency problem, and this patch only partially addresses that half of
>> >> the problem.
>> >
>> > Well, fair enough, but there are two things to consider here, the general
>> > energy-efficiency problem and the difference made by IOWAIT boosting.
>> >
>> > If the general energy-efficiency problem had existed for a relatively long
>> > time, but it has got worse recently due to the IOWAIT boosting, it still
>> > may be desirable to get the IOWAIT boosting out of the picture first
>> > and then get to the general problem.
>> >
>>
>> IMHO what is needed in order to address the IOWAIT boosting energy
>> efficiency problem is roughly the same we need in order to address the
>> other energy efficiency problem: A mechanism along the lines of [1]
>> allowing us to determine whether the workload is IO-bound or not.  In
>> the former case IOWAIT boosting won't be able to improve the performance
>> of the workload since the limiting factor is the IO throughput, so it
>> will only increase the energy usage, potentially exacerbating the
>> bottleneck if the IO device is an integrated GPU.  In the latter case
>> where the CPU and IO devices being w

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-17 Thread Rafael J. Wysocki
On Sat, Sep 15, 2018 at 8:53 AM Francisco Jerez  wrote:
>
> "Rafael J. Wysocki"  writes:
>
> > On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
> >>
> >> "Rafael J. Wysocki"  writes:
> >>
> >> > On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> >> >
> >> >> Srinivas Pandruvada  writes:
> >> >>=20
> >> >> > [...]
> >> >> >
> >> >> >> > >=3D20
> >> >> >> > > This patch causes a number of statistically significant
> >> >> >> > > regressions
> >> >> >> > > (with significance of 1%) on the two systems I've tested it
> >> >> >> > > on.  On
> >> >> >> > > my
> >> >> >> >=3D20
> >> >> >> > Sure. These patches are targeted to Atom clients where some of
> >> >> >> > these
> >> >> >> > server like workload may have some minor regression on few watts
> >> >> >> > TDP
> >> >> >> > parts.
> >> >> >>=3D20
> >> >> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
> >> >> >> nor
> >> >> >> the 10% regression of warsaw qualify as small.  And most of the test
> >> >> >> cases on the list of regressions aren't exclusively server-like, if
> >> >> >> at
> >> >> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
> >> >> >> benchmarks -- Latency is as important if not more for interactive
> >> >> >> workloads than it is for server workloads.  In the case of a conflict
> >> >> >> like the one we're dealing with right now between optimizing for
> >> >> >> throughput (e.g. for the maximum number of requests per second) and
> >> >> >> optimizing for latency (e.g. for the minimum request duration), you
> >> >> >> are
> >> >> >> more likely to be concerned about the former than about the latter in
> >> >> >> a
> >> >> >> server setup.
> >> >> >
> >> >> > Eero,
> >> >> > Please add your test results here.
> >> >> >
> >> >> > No matter which algorithm you do, there will be variations. So you 
> >> >> > have
> >> >> > to look at the platforms which you are targeting. For this 
> >> >> > platform=3D=
> >> 20
> >> >> > number one item is use of less turbo and hope you know why?
> >> >>=20
> >> >> Unfortunately the current controller uses turbo frequently on Atoms for
> >> >> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
> >> >> boosting simply exacerbated the pre-existing energy efficiency problem.
> >> >
> >> > My current understanding of the issue at hand is that using IOWAIT 
> >> > boosti=
> >> ng
> >> > on Atoms is a regression relative to the previous behavior.
> >>
> >> Not universally.  IOWAIT boosting helps under roughly the same
> >> conditions on Atom as it does on big core, so applying this patch will
> >> necessarily cause regressions too (see my reply from Sep. 3 for some
> >> numbers), and won't completely restore the previous behavior since it
> >> simply decreases the degree of IOWAIT boosting applied without being
> >> able to avoid it (c.f. the series I'm working on that does something
> >> similar to IOWAIT boosting when it's able to determine it's actually
> >> CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
> >> workloads that don't benefit from a higher CPU clock frequency anyway).
> >
> > Well, OK.  That doesn't seem to be a clear-cut regression situation, then,
> > since getting back is not desirable, apparently.
> >
> > Or would it restore the previous behavior if we didn't do any IOWAIT
> > boosting on Atoms at all?
> >
> >> > That is what Srinivas is trying to address here AFAICS.
> >> >
> >> > Now, you seem to be saying that the overall behavior is suboptimal and 
> >> > the
> >> > IOWAIT boosting doesn't matter that much,
> >>
> >> I was just saying that IOWAIT boosting is less than half of the energy
> >> efficiency problem, and this patch only partially addresses that half of
> >> the problem.
> >
> > Well, fair enough, but there are two things to consider here, the general
> > energy-efficiency problem and the difference made by IOWAIT boosting.
> >
> > If the general energy-efficiency problem had existed for a relatively long
> > time, but it has got worse recently due to the IOWAIT boosting, it still
> > may be desirable to get the IOWAIT boosting out of the picture first
> > and then get to the general problem.
> >
>
> IMHO what is needed in order to address the IOWAIT boosting energy
> efficiency problem is roughly the same we need in order to address the
> other energy efficiency problem: A mechanism along the lines of [1]
> allowing us to determine whether the workload is IO-bound or not.  In
> the former case IOWAIT boosting won't be able to improve the performance
> of the workload since the limiting factor is the IO throughput, so it
> will only increase the energy usage, potentially exacerbating the
> bottleneck if the IO device is an integrated GPU.  In the latter case
> where the CPU and IO devices being waited on are both underutilized it
> makes sense to optimize for low latency more aggressively (certainly
> more aggressively than this patch does) which will incr

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-14 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
>>
>> "Rafael J. Wysocki"  writes:
>> 
>> > On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
>> >
>> >> Srinivas Pandruvada  writes:
>> >>=20
>> >> > [...]
>> >> >
>> >> >> > >=3D20
>> >> >> > > This patch causes a number of statistically significant
>> >> >> > > regressions
>> >> >> > > (with significance of 1%) on the two systems I've tested it
>> >> >> > > on.  On
>> >> >> > > my
>> >> >> >=3D20
>> >> >> > Sure. These patches are targeted to Atom clients where some of
>> >> >> > these
>> >> >> > server like workload may have some minor regression on few watts
>> >> >> > TDP
>> >> >> > parts.
>> >> >>=3D20
>> >> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
>> >> >> nor
>> >> >> the 10% regression of warsaw qualify as small.  And most of the test
>> >> >> cases on the list of regressions aren't exclusively server-like, if
>> >> >> at
>> >> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
>> >> >> benchmarks -- Latency is as important if not more for interactive
>> >> >> workloads than it is for server workloads.  In the case of a conflict
>> >> >> like the one we're dealing with right now between optimizing for
>> >> >> throughput (e.g. for the maximum number of requests per second) and
>> >> >> optimizing for latency (e.g. for the minimum request duration), you
>> >> >> are
>> >> >> more likely to be concerned about the former than about the latter in
>> >> >> a
>> >> >> server setup.
>> >> >
>> >> > Eero,
>> >> > Please add your test results here.
>> >> >
>> >> > No matter which algorithm you do, there will be variations. So you have
>> >> > to look at the platforms which you are targeting. For this platform=3D=
>> 20
>> >> > number one item is use of less turbo and hope you know why?
>> >>=20
>> >> Unfortunately the current controller uses turbo frequently on Atoms for
>> >> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
>> >> boosting simply exacerbated the pre-existing energy efficiency problem.
>> >
>> > My current understanding of the issue at hand is that using IOWAIT boosti=
>> ng
>> > on Atoms is a regression relative to the previous behavior.
>> 
>> Not universally.  IOWAIT boosting helps under roughly the same
>> conditions on Atom as it does on big core, so applying this patch will
>> necessarily cause regressions too (see my reply from Sep. 3 for some
>> numbers), and won't completely restore the previous behavior since it
>> simply decreases the degree of IOWAIT boosting applied without being
>> able to avoid it (c.f. the series I'm working on that does something
>> similar to IOWAIT boosting when it's able to determine it's actually
>> CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
>> workloads that don't benefit from a higher CPU clock frequency anyway).
>
> Well, OK.  That doesn't seem to be a clear-cut regression situation, then,
> since getting back is not desirable, apparently.
>
> Or would it restore the previous behavior if we didn't do any IOWAIT
> boosting on Atoms at all?
>
>> > That is what Srinivas is trying to address here AFAICS.
>> >
>> > Now, you seem to be saying that the overall behavior is suboptimal and the
>> > IOWAIT boosting doesn't matter that much,
>> 
>> I was just saying that IOWAIT boosting is less than half of the energy
>> efficiency problem, and this patch only partially addresses that half of
>> the problem.
>
> Well, fair enough, but there are two things to consider here, the general
> energy-efficiency problem and the difference made by IOWAIT boosting.
>
> If the general energy-efficiency problem had existed for a relatively long
> time, but it has got worse recently due to the IOWAIT boosting, it still
> may be desirable to get the IOWAIT boosting out of the picture first
> and then get to the general problem.
>

IMHO what is needed in order to address the IOWAIT boosting energy
efficiency problem is roughly the same we need in order to address the
other energy efficiency problem: A mechanism along the lines of [1]
allowing us to determine whether the workload is IO-bound or not.  In
the former case IOWAIT boosting won't be able to improve the performance
of the workload since the limiting factor is the IO throughput, so it
will only increase the energy usage, potentially exacerbating the
bottleneck if the IO device is an integrated GPU.  In the latter case
where the CPU and IO devices being waited on are both underutilized it
makes sense to optimize for low latency more aggressively (certainly
more aggressively than this patch does) which will increase the
utilization of the IO devices until at least one IO device becomes a
bottleneck, at which point the throughput of the system becomes roughly
independent of the CPU frequency and we're back to the former case.

[1] https://patchwork.kernel.org/patch/10312259/

> I'm not sure if that is 

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-14 Thread Rafael J. Wysocki
On Tuesday, September 11, 2018 7:35:15 PM CEST Francisco Jerez wrote:
>
> "Rafael J. Wysocki"  writes:
> 
> > On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> >
> >> Srinivas Pandruvada  writes:
> >>=20
> >> > [...]
> >> >
> >> >> > >=3D20
> >> >> > > This patch causes a number of statistically significant
> >> >> > > regressions
> >> >> > > (with significance of 1%) on the two systems I've tested it
> >> >> > > on.  On
> >> >> > > my
> >> >> >=3D20
> >> >> > Sure. These patches are targeted to Atom clients where some of
> >> >> > these
> >> >> > server like workload may have some minor regression on few watts
> >> >> > TDP
> >> >> > parts.
> >> >>=3D20
> >> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
> >> >> nor
> >> >> the 10% regression of warsaw qualify as small.  And most of the test
> >> >> cases on the list of regressions aren't exclusively server-like, if
> >> >> at
> >> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
> >> >> benchmarks -- Latency is as important if not more for interactive
> >> >> workloads than it is for server workloads.  In the case of a conflict
> >> >> like the one we're dealing with right now between optimizing for
> >> >> throughput (e.g. for the maximum number of requests per second) and
> >> >> optimizing for latency (e.g. for the minimum request duration), you
> >> >> are
> >> >> more likely to be concerned about the former than about the latter in
> >> >> a
> >> >> server setup.
> >> >
> >> > Eero,
> >> > Please add your test results here.
> >> >
> >> > No matter which algorithm you do, there will be variations. So you have
> >> > to look at the platforms which you are targeting. For this platform=3D=
> 20
> >> > number one item is use of less turbo and hope you know why?
> >>=20
> >> Unfortunately the current controller uses turbo frequently on Atoms for
> >> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
> >> boosting simply exacerbated the pre-existing energy efficiency problem.
> >
> > My current understanding of the issue at hand is that using IOWAIT boosti=
> ng
> > on Atoms is a regression relative to the previous behavior.
> 
> Not universally.  IOWAIT boosting helps under roughly the same
> conditions on Atom as it does on big core, so applying this patch will
> necessarily cause regressions too (see my reply from Sep. 3 for some
> numbers), and won't completely restore the previous behavior since it
> simply decreases the degree of IOWAIT boosting applied without being
> able to avoid it (c.f. the series I'm working on that does something
> similar to IOWAIT boosting when it's able to determine it's actually
> CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
> workloads that don't benefit from a higher CPU clock frequency anyway).

Well, OK.  That doesn't seem to be a clear-cut regression situation, then,
since getting back is not desirable, apparently.

Or would it restore the previous behavior if we didn't do any IOWAIT
boosting on Atoms at all?

> > That is what Srinivas is trying to address here AFAICS.
> >
> > Now, you seem to be saying that the overall behavior is suboptimal and the
> > IOWAIT boosting doesn't matter that much,
> 
> I was just saying that IOWAIT boosting is less than half of the energy
> efficiency problem, and this patch only partially addresses that half of
> the problem.

Well, fair enough, but there are two things to consider here, the general
energy-efficiency problem and the difference made by IOWAIT boosting.

If the general energy-efficiency problem had existed for a relatively long
time, but it has got worse recently due to the IOWAIT boosting, it still
may be desirable to get the IOWAIT boosting out of the picture first
and then get to the general problem.

I'm not sure if that is the case as my experience with Atoms is limited
anyway, so please advise.

Thanks,
Rafael



Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-11 Thread Francisco Jerez
"Rafael J. Wysocki"  writes:

> On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
>> 
>> --==-=-=
>> Content-Type: multipart/mixed; boundary="=-=-="
>> 
>> --=-=-=
>> Content-Type: text/plain; charset=utf-8
>> Content-Disposition: inline
>> Content-Transfer-Encoding: quoted-printable
>> 
>> Srinivas Pandruvada  writes:
>> 
>> > [...]
>> >
>> >> > >=20
>> >> > > This patch causes a number of statistically significant
>> >> > > regressions
>> >> > > (with significance of 1%) on the two systems I've tested it
>> >> > > on.  On
>> >> > > my
>> >> >=20
>> >> > Sure. These patches are targeted to Atom clients where some of
>> >> > these
>> >> > server like workload may have some minor regression on few watts
>> >> > TDP
>> >> > parts.
>> >>=20
>> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
>> >> nor
>> >> the 10% regression of warsaw qualify as small.  And most of the test
>> >> cases on the list of regressions aren't exclusively server-like, if
>> >> at
>> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
>> >> benchmarks -- Latency is as important if not more for interactive
>> >> workloads than it is for server workloads.  In the case of a conflict
>> >> like the one we're dealing with right now between optimizing for
>> >> throughput (e.g. for the maximum number of requests per second) and
>> >> optimizing for latency (e.g. for the minimum request duration), you
>> >> are
>> >> more likely to be concerned about the former than about the latter in
>> >> a
>> >> server setup.
>> >
>> > Eero,
>> > Please add your test results here.
>> >
>> > No matter which algorithm you do, there will be variations. So you have
>> > to look at the platforms which you are targeting. For this platform=20
>> > number one item is use of less turbo and hope you know why?
>> 
>> Unfortunately the current controller uses turbo frequently on Atoms for
>> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
>> boosting simply exacerbated the pre-existing energy efficiency problem.
>
> My current understanding of the issue at hand is that using IOWAIT boosting
> on Atoms is a regression relative to the previous behavior.

Not universally.  IOWAIT boosting helps under roughly the same
conditions on Atom as it does on big core, so applying this patch will
necessarily cause regressions too (see my reply from Sep. 3 for some
numbers), and won't completely restore the previous behavior since it
simply decreases the degree of IOWAIT boosting applied without being
able to avoid it (c.f. the series I'm working on that does something
similar to IOWAIT boosting when it's able to determine it's actually
CPU-bound, which prevents energy inefficient behavior for non-CPU-bound
workloads that don't benefit from a higher CPU clock frequency anyway).

> That is what Srinivas is trying to address here AFAICS.
>
> Now, you seem to be saying that the overall behavior is suboptimal and the
> IOWAIT boosting doesn't matter that much,

I was just saying that IOWAIT boosting is less than half of the energy
efficiency problem, and this patch only partially addresses that half of
the problem.

> so some deeper changes are needed anyway.  That may be the case, but
> if there is a meaningful regression, we should first get back to the
> point where it is not present and then to take care of the more
> general problems.
>
> So, I'd like to understand how much of a problem the IOWAIT boosting really is
> in the first place.  If it is significant enough, let's address it first, this
> way or another, and move on to the other problems subsequently.
>

See the Unigine and Gfxbench numbers I provided in my reply from Sep. 3
to get an idea of the magnitude of the IOWAIT boosting problem vs. the
overall energy efficiency problem addressed by my series.

> Thanks,
> Rafael


signature.asc
Description: PGP signature


Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-11 Thread Rafael J. Wysocki
On Thursday, September 6, 2018 6:20:08 AM CEST Francisco Jerez wrote:
> 
> --==-=-=
> Content-Type: multipart/mixed; boundary="=-=-="
> 
> --=-=-=
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
> Content-Transfer-Encoding: quoted-printable
> 
> Srinivas Pandruvada  writes:
> 
> > [...]
> >
> >> > >=20
> >> > > This patch causes a number of statistically significant
> >> > > regressions
> >> > > (with significance of 1%) on the two systems I've tested it
> >> > > on.  On
> >> > > my
> >> >=20
> >> > Sure. These patches are targeted to Atom clients where some of
> >> > these
> >> > server like workload may have some minor regression on few watts
> >> > TDP
> >> > parts.
> >>=20
> >> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
> >> nor
> >> the 10% regression of warsaw qualify as small.  And most of the test
> >> cases on the list of regressions aren't exclusively server-like, if
> >> at
> >> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
> >> benchmarks -- Latency is as important if not more for interactive
> >> workloads than it is for server workloads.  In the case of a conflict
> >> like the one we're dealing with right now between optimizing for
> >> throughput (e.g. for the maximum number of requests per second) and
> >> optimizing for latency (e.g. for the minimum request duration), you
> >> are
> >> more likely to be concerned about the former than about the latter in
> >> a
> >> server setup.
> >
> > Eero,
> > Please add your test results here.
> >
> > No matter which algorithm you do, there will be variations. So you have
> > to look at the platforms which you are targeting. For this platform=20
> > number one item is use of less turbo and hope you know why?
> 
> Unfortunately the current controller uses turbo frequently on Atoms for
> TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
> boosting simply exacerbated the pre-existing energy efficiency problem.

My current understanding of the issue at hand is that using IOWAIT boosting
on Atoms is a regression relative to the previous behavior.  That is what
Srinivas is trying to address here AFAICS.

Now, you seem to be saying that the overall behavior is suboptimal and the
IOWAIT boosting doesn't matter that much, so some deeper changes are needed
anyway.  That may be the case, but if there is a meaningful regression, we
should first get back to the point where it is not present and then to take
care of the more general problems.

So, I'd like to understand how much of a problem the IOWAIT boosting really is
in the first place.  If it is significant enough, let's address it first, this
way or another, and move on to the other problems subsequently.

Thanks,
Rafael



Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-11 Thread Rafael J. Wysocki
On Monday, September 3, 2018 5:10:27 PM CEST Eero Tamminen wrote:
> Hi,
> 
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
> > As per testing Eero Tamminen, the results are comparable to the patchset
> > https://patchwork.kernel.org/patch/10312259/
> > But he has to watch results for several days to check trend.
> 
> It's close, but there is some gap compared to Francisco's version.

It is unclear what exactly you are referring to without a link to the
patch.

Thanks,
Rafael



Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-11 Thread Rafael J. Wysocki
On Friday, August 31, 2018 7:28:51 PM CEST Srinivas Pandruvada wrote:
> The io_boost mechanism, when scheduler update util callback indicates wake
> from IO was intended for short term boost to improve disk IO performance.
> But once i915 graphics driver changed to io-schedule_timeout() from
> schedule_timeout() during waiting for response from GPU, this caused
> constant IO boost, causing intel_pstate to constantly boost to turbo.
> This has bad side effect on TDP limited Atom platforms. The following two
> links shows the boost and frequency plot for GPU Test called
> triangle_benchmark.
> https://bugzilla.kernel.org/attachment.cgi?id=278091
> https://bugzilla.kernel.org/attachment.cgi?id=278093
> This type of behavior was observed with many graphics tests and regular
> use of such platforms with graphical interface.
> 
> The fix in this patch is to optimize the io boost by:
> - Limit the maximum boost to base frequency on TDP limited Atom platforms
> and max limit as 1 core turbo for the rest of the non-HWP platforms (same
> as the current algorithm).
> - Multilevel gradual boost: Start with increment = half of max boost and
> increase to max on the subsequent IO wakes.
> - Once max is reached and subsequent IO wakes don't cause boost for TDP
> limited Atom platforms, with assumption that the target CPU already built
> up enough load to run at higher P-state and the use of average frequency
> in the current algorithm will also help not to reduce drastically. For
> other platforms this is not limited similar to the current algorithm.
> 
> With this fix the resultant plots show the reduced average frequency and
> also causes upto 10% improvement in some graphics tests on Atom (Broxton)
> platform.
> https://bugzilla.kernel.org/attachment.cgi?id=278095
> https://bugzilla.kernel.org/attachment.cgi?id=278097
> 
> As per testing Eero Tamminen, the results are comparable to the patchset
> https://patchwork.kernel.org/patch/10312259/
> But he has to watch results for several days to check trend.
> 
> Since here boost is getting limited to turbo and non turbo, we need some
> ways to adjust the fractions corresponding to max non turbo as well. It
> is much easier to use the actual P-state limits for boost instead of
> fractions. So here P-state io boost limit is applied on top of the
> P-state limit calculated via current algorithm by removing current
> io_wait boost calculation using fractions.
> 
> Since we prefer to use common algorithm for all processor platforms, this
> change was tested on other client and sever platforms as well. All results
> were within the margin of errors. Results:
> https://bugzilla.kernel.org/attachment.cgi?id=278149
> 
> To identify TDP limited platforms a new callback boost_limited() is
> added, which will set a per cpudata field called iowait_boost_limited to
> 1. Currently this is set for Atom platforms only.
> 
> Tested-by: Eero Tamminen 
> Signed-off-by: Srinivas Pandruvada 

Let me say I would do this differently in a couple of ways.

First off, the patch makes at least two different changes in one go which
I don't quite agree with: it changes the way iowait boost works for everybody
and introduces differences betwee Atom and Core.  Doing them both in one
patch is quite questionable IMO.

It is not particularly clear that the iowait boost mechanism on non-Atom
needs to be changed at all and if so, then why.  Yes, you have results showing
that the change doesn't regress the other systems, but since we are going to
distinguish between Atom and non-Atom anyway, why change the others?

Also, it is not particularly clear to me if there's any benefit from doing
the iowait boosting on Atoms at all.  Wouldn't it be better to restrict
the iowait boosting to Core CPUs only and just leave Atoms alone?

On top of that, I don't quite like the new code organization, but let me
comment on the details below.

> ---
>  drivers/cpufreq/intel_pstate.c | 112 +++--
>  1 file changed, 92 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
> index 1f2ce2f57842..15d9d5483d85 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -195,7 +195,11 @@ struct global_params {
>   * @policy:  CPUFreq policy value
>   * @update_util: CPUFreq utility callback information
>   * @update_util_set: CPUFreq utility callback is set
> - * @iowait_boost:iowait-related boost fraction
> + * @iowait_boost:iowait-related boost P-state
> + * @iowait_boost_active: iowait boost processing is active
> + * @iowait_boost_max:Max boost P-state to apply
> + * @iowait_boost_increment: increment to last boost P-state

I don't think you need this field.  It just takes space for almost no
speed benefit.

> + * @owait_boost_limited: If set give up boost, once reach max boost state

This field doesn't appear to be necessary too.  The value of iowait_boost_max
depends on it, but it looks

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-05 Thread Francisco Jerez
Srinivas Pandruvada  writes:

> [...]
>
>> > > 
>> > > This patch causes a number of statistically significant
>> > > regressions
>> > > (with significance of 1%) on the two systems I've tested it
>> > > on.  On
>> > > my
>> > 
>> > Sure. These patches are targeted to Atom clients where some of
>> > these
>> > server like workload may have some minor regression on few watts
>> > TDP
>> > parts.
>> 
>> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
>> nor
>> the 10% regression of warsaw qualify as small.  And most of the test
>> cases on the list of regressions aren't exclusively server-like, if
>> at
>> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
>> benchmarks -- Latency is as important if not more for interactive
>> workloads than it is for server workloads.  In the case of a conflict
>> like the one we're dealing with right now between optimizing for
>> throughput (e.g. for the maximum number of requests per second) and
>> optimizing for latency (e.g. for the minimum request duration), you
>> are
>> more likely to be concerned about the former than about the latter in
>> a
>> server setup.
>
> Eero,
> Please add your test results here.
>
> No matter which algorithm you do, there will be variations. So you have
> to look at the platforms which you are targeting. For this platform 
> number one item is use of less turbo and hope you know why?

Unfortunately the current controller uses turbo frequently on Atoms for
TDP-limited graphics workloads regardless of IOWAIT boosting.  IOWAIT
boosting simply exacerbated the pre-existing energy efficiency problem.

> On this platform GFX patch caused this issue as it was submitted after
> io boost patchset. So rather that should be reverted first before
> optimizing anything.
>
>
>
>> 
>> > But weighing against reduced TURBO usage (which is enough trigger)
>> > and
>> > improvement in tests done by Eero (which was primary complaint to
>> > us).
>> > 
>> > It is trivial patch. Yes, the patch is not perfect and doesn't
>> > close
>> > doors for any improvements.
>> > 
>> 
>> It's sort of self-contained because it's awfully incomplete.Don't
>> agtr
>
>> 
>> > I see your idea, but how to implement in acceptable way is a
>> > challenge.
>> 
>> Main challenge was getting the code to work without regressions in
>> latency-sensitive workloads, which I did, because you told me that it
>> wasn't acceptable for it to cause any regressions on the Phoronix
>> daily-system-tracker, whether latency-bound or not.
> Yes, because your intention was to have a global change not a low power
> Atom specific,
>

My intention was to target low-power Atoms only since the first day we
discussed this problem.  The cover letter of the series I sent and the
commit messages make this fairly clear.

>>   What is left in
>> order to address Peter's concerns is for the most part plumbing,
>> that's
>> guaranteed not to have any functional impact on the heuristic.  The
>> fact
>> that we don't expect it to change the performance of the system makes
>> it
>> substantially less time-consuming to validate than altering the
>> performance trade-offs of the heuristic dynamically in order to avoid
>> regressions (which is what has kept my systems busy most of the time
>> lately).  Seems like my series, even in its current state without the
>> changes requested by Peter is closer to being ready for production
>> than
>> this patch is.
>
> Sorry, Not at all. We call such patches as experimental series.

The numbers speak otherwise.

> You caused 100% regression to idle power. There is no version 2 after
> that, even if you fixed locally even to look.
>

I did post a link to a v2 that fixed the idle power issue on the v1
thread, but I didn't intend to send it for review yet.  I'll send it out
once I've fully taken into account Peter's feedback.

> Thanks,
> Srinivas


signature.asc
Description: PGP signature


Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-04 Thread Srinivas Pandruvada


[...]

> > > 
> > > This patch causes a number of statistically significant
> > > regressions
> > > (with significance of 1%) on the two systems I've tested it
> > > on.  On
> > > my
> > 
> > Sure. These patches are targeted to Atom clients where some of
> > these
> > server like workload may have some minor regression on few watts
> > TDP
> > parts.
> 
> Neither the 36% regression of fs-mark, the 21% regression of sqlite,
> nor
> the 10% regression of warsaw qualify as small.  And most of the test
> cases on the list of regressions aren't exclusively server-like, if
> at
> all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
> benchmarks -- Latency is as important if not more for interactive
> workloads than it is for server workloads.  In the case of a conflict
> like the one we're dealing with right now between optimizing for
> throughput (e.g. for the maximum number of requests per second) and
> optimizing for latency (e.g. for the minimum request duration), you
> are
> more likely to be concerned about the former than about the latter in
> a
> server setup.

Eero,
Please add your test results here.

No matter which algorithm you do, there will be variations. So you have
to look at the platforms which you are targeting. For this platform 
number one item is use of less turbo and hope you know why?
On this platform GFX patch caused this issue as it was submitted after
io boost patchset. So rather that should be reverted first before
optimizing anything.


> 
> > But weighing against reduced TURBO usage (which is enough trigger)
> > and
> > improvement in tests done by Eero (which was primary complaint to
> > us).
> > 
> > It is trivial patch. Yes, the patch is not perfect and doesn't
> > close
> > doors for any improvements.
> > 
> 
> It's sort of self-contained because it's awfully incomplete.Don't
> agtr

> 
> > I see your idea, but how to implement in acceptable way is a
> > challenge.
> 
> Main challenge was getting the code to work without regressions in
> latency-sensitive workloads, which I did, because you told me that it
> wasn't acceptable for it to cause any regressions on the Phoronix
> daily-system-tracker, whether latency-bound or not.
Yes, because your intention was to have a global change not a low power
Atom specific,

>   What is left in
> order to address Peter's concerns is for the most part plumbing,
> that's
> guaranteed not to have any functional impact on the heuristic.  The
> fact
> that we don't expect it to change the performance of the system makes
> it
> substantially less time-consuming to validate than altering the
> performance trade-offs of the heuristic dynamically in order to avoid
> regressions (which is what has kept my systems busy most of the time
> lately).  Seems like my series, even in its current state without the
> changes requested by Peter is closer to being ready for production
> than
> this patch is.

Sorry, Not at all. We call such patches as experimental series.
You caused 100% regression to idle power. There is no version 2 after
that, even if you fixed locally even to look.

Thanks,
Srinivas



Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-04 Thread Francisco Jerez
Srinivas Pandruvada  writes:

> On Mon, 2018-09-03 at 23:53 -0700, Francisco Jerez wrote:
>> Eero Tamminen  writes:
>> 
>> > Hi,
>> > 
>> > On 31.08.2018 20:28, Srinivas Pandruvada wrote:
>> > ...
>> > > As per testing Eero Tamminen, the results are comparable to the
>> > > patchset
>> > > https://patchwork.kernel.org/patch/10312259/
>> > > But he has to watch results for several days to check trend.
>> > 
>> > It's close, but there is some gap compared to Francisco's version.
>> > 
>> > (Because of the large variance on TDP limited devices, and factors 
>> > causing extra perf differences e.g. between boots, it's hard to
>> > give 
>> > exact number without having trends from several days / weeks.  I
>> > would 
>> > also need new version of Fransisco's patch set that applies to
>> > latest 
>> > kernel like yours does.)
>> > 
>> > 
>> > > Since here boost is getting limited to turbo and non turbo, we
>> > > need some
>> > > ways to adjust the fractions corresponding to max non turbo as
>> > > well. It
>> > > is much easier to use the actual P-state limits for boost instead
>> > > of
>> > > fractions. So here P-state io boost limit is applied on top of
>> > > the
>> > > P-state limit calculated via current algorithm by removing
>> > > current
>> > > io_wait boost calculation using fractions.
>> > > 
>> > > Since we prefer to use common algorithm for all processor
>> > > platforms, this
>> > > change was tested on other client and sever platforms as well.
>> > > All results
>> > > were within the margin of errors. Results:
>> > > https://bugzilla.kernel.org/attachment.cgi?id=278149
>> > 
>> > Good.
>> > 
>> > Francisco, how well the listed PTS tests cover latency bound cases
>> > you 
>> > were concerned about?  [1]
>> > 
>> > 
>> >- Eero
>> > 
>> > [1] Fransisco was concerned that the patch:
>> > * trade-off might regress latency bound cases (which I haven't
>> > tested, I 
>> > tested only 3D throughput), and
>> > * that it addressed only other of the sources of energy
>> > inefficiencies 
>> > he had identified (which could explain slightly better 3D results
>> > from 
>> > his more complex patch set).
>> 
>> This patch causes a number of statistically significant regressions
>> (with significance of 1%) on the two systems I've tested it on.  On
>> my
> Sure. These patches are targeted to Atom clients where some of these
> server like workload may have some minor regression on few watts TDP
> parts.

Neither the 36% regression of fs-mark, the 21% regression of sqlite, nor
the 10% regression of warsaw qualify as small.  And most of the test
cases on the list of regressions aren't exclusively server-like, if at
all.  Warsaw, gtkperf, jxrendermark and lightsmark are all graphics
benchmarks -- Latency is as important if not more for interactive
workloads than it is for server workloads.  In the case of a conflict
like the one we're dealing with right now between optimizing for
throughput (e.g. for the maximum number of requests per second) and
optimizing for latency (e.g. for the minimum request duration), you are
more likely to be concerned about the former than about the latter in a
server setup.

> But weighing against reduced TURBO usage (which is enough trigger) and
> improvement in tests done by Eero (which was primary complaint to us).
>
> It is trivial patch. Yes, the patch is not perfect and doesn't close
> doors for any improvements.
>

It's sort of self-contained because it's awfully incomplete.

> I see your idea, but how to implement in acceptable way is a challenge.

Main challenge was getting the code to work without regressions in
latency-sensitive workloads, which I did, because you told me that it
wasn't acceptable for it to cause any regressions on the Phoronix
daily-system-tracker, whether latency-bound or not.  What is left in
order to address Peter's concerns is for the most part plumbing, that's
guaranteed not to have any functional impact on the heuristic.  The fact
that we don't expect it to change the performance of the system makes it
substantially less time-consuming to validate than altering the
performance trade-offs of the heuristic dynamically in order to avoid
regressions (which is what has kept my systems busy most of the time
lately).  Seems like my series, even in its current state without the
changes requested by Peter is closer to being ready for production than
this patch is.

> So ultimately we will get there, but will require some time compared
> with other high priority items.
>
>
> Thanks
> Srinivas
>
>> CHV N3050:
>> 
>> > phoronix/fs-
>> > mark/test=0:   
>> > XXX ±7.25% x34 -> XXX ±7.00% x39  d=-36.85% ±5.91%  p=0.00%
>> > phoronix/sqlite:   
>> > XXX ±1.86% x34 -> XXX ±1.88% x39  d=-21.73%
>> > ±1.66%  p=0.00%
>> > warsow/benchsow:   
>> > XXX ±1.25% x34 -> XXX ±1.95% x39  d=-10.83%
>> > 

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-04 Thread Srinivas Pandruvada
On Mon, 2018-09-03 at 23:53 -0700, Francisco Jerez wrote:
> Eero Tamminen  writes:
> 
> > Hi,
> > 
> > On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> > ...
> > > As per testing Eero Tamminen, the results are comparable to the
> > > patchset
> > > https://patchwork.kernel.org/patch/10312259/
> > > But he has to watch results for several days to check trend.
> > 
> > It's close, but there is some gap compared to Francisco's version.
> > 
> > (Because of the large variance on TDP limited devices, and factors 
> > causing extra perf differences e.g. between boots, it's hard to
> > give 
> > exact number without having trends from several days / weeks.  I
> > would 
> > also need new version of Fransisco's patch set that applies to
> > latest 
> > kernel like yours does.)
> > 
> > 
> > > Since here boost is getting limited to turbo and non turbo, we
> > > need some
> > > ways to adjust the fractions corresponding to max non turbo as
> > > well. It
> > > is much easier to use the actual P-state limits for boost instead
> > > of
> > > fractions. So here P-state io boost limit is applied on top of
> > > the
> > > P-state limit calculated via current algorithm by removing
> > > current
> > > io_wait boost calculation using fractions.
> > > 
> > > Since we prefer to use common algorithm for all processor
> > > platforms, this
> > > change was tested on other client and sever platforms as well.
> > > All results
> > > were within the margin of errors. Results:
> > > https://bugzilla.kernel.org/attachment.cgi?id=278149
> > 
> > Good.
> > 
> > Francisco, how well the listed PTS tests cover latency bound cases
> > you 
> > were concerned about?  [1]
> > 
> > 
> > - Eero
> > 
> > [1] Fransisco was concerned that the patch:
> > * trade-off might regress latency bound cases (which I haven't
> > tested, I 
> > tested only 3D throughput), and
> > * that it addressed only other of the sources of energy
> > inefficiencies 
> > he had identified (which could explain slightly better 3D results
> > from 
> > his more complex patch set).
> 
> This patch causes a number of statistically significant regressions
> (with significance of 1%) on the two systems I've tested it on.  On
> my
Sure. These patches are targeted to Atom clients where some of these
server like workload may have some minor regression on few watts TDP
parts. But weighing against reduced TURBO usage (which is enough
trigger) and improvement in tests done by Eero (which was primary
complaint to us).

It is trivial patch. Yes, the patch is not perfect and doesn't close
doors for any improvements.

I see your idea, but how to implement in acceptable way is a challenge.
So ultimately we will get there, but will require some time compared
with other high priority items.


Thanks
Srinivas

> CHV N3050:
> 
> > phoronix/fs-
> > mark/test=0:   
> > XXX ±7.25% x34 -> XXX ±7.00% x39  d=-36.85% ±5.91%  p=0.00%
> > phoronix/sqlite:   
> > XXX ±1.86% x34 -> XXX ±1.88% x39  d=-21.73%
> > ±1.66%  p=0.00%
> > warsow/benchsow:   
> > XXX ±1.25% x34 -> XXX ±1.95% x39  d=-10.83%
> > ±1.53%  p=0.00%
> > phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read
> > Performance:   XXX ±1.70% x31 -> XXX ±1.02% x34  d=-7.39%
> > ±1.36%   p=0.00%
> > phoronix/gtkperf/gtk-
> > test=GtkComboBox: XXX
> > ±1.15% x13 -> XXX ±1.59% x14   d=-5.37% ±1.35%  p=0.00%
> > lightsmark:
> > XXX ±1.45% x34 -> XXX ±0.97% x41   d=-4.66%
> > ±1.19%  p=0.00%
> > jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-
> > size=128x128:  XXX ±1.04% x31 -> XXX ±1.04% x39   d=-4.58%
> > ±1.01%  p=0.00%
> > jxrendermark/rendering-test=Linear Gradient Blend/rendering-
> > size=128x128:  XXX ±0.12% x31 -> XXX ±0.19% x39   d=-3.60%
> > ±0.16%  p=0.00%
> > dbench/client-
> > count=1: XX
> > X ±0.50% x34 -> XXX ±0.50% x39   d=-2.51% ±0.49%  p=0.00%
> 
> On my BXT J3455:
> 
> > fs-
> > mark/test=0:   
> >  XXX ±3.04% x6 ->   XXX ±3.05% x9  d=-15.96%
> > ±2.76%  p=0.00%
> > sqlite:
> > XXX ±2.54% x6 ->   XXX ±2.72% x9  d=-12.42%
> > ±2.44%  p=0.00%
> > dbench/client-
> > count=1: XX
> > X ±0.42% x6 ->   XXX ±0.36% x9   d=-6.52% ±0.37%  p=0.00%
> > dbench/client-
> > count=2: XX
> > X ±0.26% x6 ->   XXX ±0.33% x9   d=-5.22% ±0.29%  p=0.00%
> > dbench/client-
> > count=3: XX
> > X ±0.34% x6 ->   XXX ±0.53% x9   d=-2.92% ±0.45%  p=0.00%
> > x11perf/test=500px Compositing From

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-04 Thread Francisco Jerez
Eero Tamminen  writes:

> Hi,
>
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
>> As per testing Eero Tamminen, the results are comparable to the patchset
>> https://patchwork.kernel.org/patch/10312259/
>> But he has to watch results for several days to check trend.
>
> It's close, but there is some gap compared to Francisco's version.
>
> (Because of the large variance on TDP limited devices, and factors 
> causing extra perf differences e.g. between boots, it's hard to give 
> exact number without having trends from several days / weeks.  I would 
> also need new version of Fransisco's patch set that applies to latest 
> kernel like yours does.)
>
>
>> Since here boost is getting limited to turbo and non turbo, we need some
>> ways to adjust the fractions corresponding to max non turbo as well. It
>> is much easier to use the actual P-state limits for boost instead of
>> fractions. So here P-state io boost limit is applied on top of the
>> P-state limit calculated via current algorithm by removing current
>> io_wait boost calculation using fractions.
>> 
>> Since we prefer to use common algorithm for all processor platforms, this
>> change was tested on other client and sever platforms as well. All results
>> were within the margin of errors. Results:
>> https://bugzilla.kernel.org/attachment.cgi?id=278149
>
> Good.
>
> Francisco, how well the listed PTS tests cover latency bound cases you 
> were concerned about?  [1]
>
>
>   - Eero
>
> [1] Fransisco was concerned that the patch:
> * trade-off might regress latency bound cases (which I haven't tested, I 
> tested only 3D throughput), and
> * that it addressed only other of the sources of energy inefficiencies 
> he had identified (which could explain slightly better 3D results from 
> his more complex patch set).

This patch causes a number of statistically significant regressions
(with significance of 1%) on the two systems I've tested it on.  On my
CHV N3050:

| phoronix/fs-mark/test=0:  
 XXX ±7.25% x34 -> XXX ±7.00% x39  d=-36.85% ±5.91%  p=0.00%
| phoronix/sqlite:  
 XXX ±1.86% x34 -> XXX ±1.88% x39  d=-21.73% ±1.66%  p=0.00%
| warsow/benchsow:  
 XXX ±1.25% x34 -> XXX ±1.95% x39  d=-10.83% ±1.53%  p=0.00%
| phoronix/iozone/record-size=4Kb/file-size=2GB/test=Read Performance:  
 XXX ±1.70% x31 -> XXX ±1.02% x34  d=-7.39% ±1.36%   p=0.00%
| phoronix/gtkperf/gtk-test=GtkComboBox:
 XXX ±1.15% x13 -> XXX ±1.59% x14   d=-5.37% ±1.35%  p=0.00%
| lightsmark:   
 XXX ±1.45% x34 -> XXX ±0.97% x41   d=-4.66% ±1.19%  p=0.00%
| jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: 
 XXX ±1.04% x31 -> XXX ±1.04% x39   d=-4.58% ±1.01%  p=0.00%
| jxrendermark/rendering-test=Linear Gradient Blend/rendering-size=128x128: 
 XXX ±0.12% x31 -> XXX ±0.19% x39   d=-3.60% ±0.16%  p=0.00%
| dbench/client-count=1:
 XXX ±0.50% x34 -> XXX ±0.50% x39   d=-2.51% ±0.49%  p=0.00%

On my BXT J3455:

| fs-mark/test=0:   
 XXX ±3.04% x6 ->   XXX ±3.05% x9  d=-15.96% ±2.76%  p=0.00%
| sqlite:   
 XXX ±2.54% x6 ->   XXX ±2.72% x9  d=-12.42% ±2.44%  p=0.00%
| dbench/client-count=1:
 XXX ±0.42% x6 ->   XXX ±0.36% x9   d=-6.52% ±0.37%  p=0.00%
| dbench/client-count=2:
 XXX ±0.26% x6 ->   XXX ±0.33% x9   d=-5.22% ±0.29%  p=0.00%
| dbench/client-count=3:
 XXX ±0.34% x6 ->   XXX ±0.53% x9   d=-2.92% ±0.45%  p=0.00%
| x11perf/test=500px Compositing From Pixmap To Window: 
 XXX ±2.29% x16 ->  XXX ±2.11% x19  d=-2.69% ±2.16%  p=0.09%
| lightsmark:   
 XXX ±0.44% x6 ->   XXX ±0.33% x9   d=-1.76% ±0.37%  p=0.00%
| j2dbench/rendering-test=Vector Graphics Rendering:
 XXX ±1.18% x16 ->  XXX ±1.82% x19  d=-1.71% ±1.54%  p=0.26%
| gtkperf/gtk-test=GtkComboBox: 
 XXX ±0.37% x6 ->   XXX ±0.45% x9   d=-0.95% ±0.42%  p=0.08%
| jxrendermark/rendering-test=Transformed Blit Bilinear/rendering-size=128x128: 
 XXX ±0.21% x3 ->   XXX ±0.23% x6   d=-0.87% ±0.22%  p=0.08%

This is not surprising given that the patch is making a hard trade-off
between latency and energy efficiency without considering whether the
workload is IO- or latency-bound, which is the reason why the series I
submitted earlier [1] to address this problem implemented an IO
utilization statistic in order to determine

Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-03 Thread Srinivas Pandruvada
On Mon, 2018-09-03 at 18:10 +0300, Eero Tamminen wrote:
> Hi,
> 
> On 31.08.2018 20:28, Srinivas Pandruvada wrote:
> ...
> > As per testing Eero Tamminen, the results are comparable to the
> > patchset
> > https://patchwork.kernel.org/patch/10312259/
> > But he has to watch results for several days to check trend.
> 
> It's close, but there is some gap compared to Francisco's version.
> 
> (Because of the large variance on TDP limited devices, and factors 
> causing extra perf differences e.g. between boots, it's hard to give 
> exact number without having trends from several days / weeks.  I
> would 
> also need new version of Fransisco's patch set that applies to
> latest 
> kernel like yours does.)
> 
> 
> > Since here boost is getting limited to turbo and non turbo, we need
> > some
> > ways to adjust the fractions corresponding to max non turbo as
> > well. It
> > is much easier to use the actual P-state limits for boost instead
> > of
> > fractions. So here P-state io boost limit is applied on top of the
> > P-state limit calculated via current algorithm by removing current
> > io_wait boost calculation using fractions.
> > 
> > Since we prefer to use common algorithm for all processor
> > platforms, this
> > change was tested on other client and sever platforms as well. All
> > results
> > were within the margin of errors. Results:
> > https://bugzilla.kernel.org/attachment.cgi?id=278149
> 
> Good.
> 
> Francisco, how well the listed PTS tests cover latency bound cases
> you 
> were concerned about?  [1]
> 
> 
>   - Eero
> 
> [1] Fransisco was concerned that the patch:
> * trade-off might regress latency bound cases (which I haven't
> tested, I 
> tested only 3D throughput), and
> * that it addressed only other of the sources of energy
> inefficiencies 
> he had identified (which could explain slightly better 3D results
> from 
> his more complex patch set).
It is always possible to submit improvement patch later. Atleast this
patch will reduce the usage of turbo.

Thanks,
Srinivas 




Re: [PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-09-03 Thread Eero Tamminen

Hi,

On 31.08.2018 20:28, Srinivas Pandruvada wrote:
...

As per testing Eero Tamminen, the results are comparable to the patchset
https://patchwork.kernel.org/patch/10312259/
But he has to watch results for several days to check trend.


It's close, but there is some gap compared to Francisco's version.

(Because of the large variance on TDP limited devices, and factors 
causing extra perf differences e.g. between boots, it's hard to give 
exact number without having trends from several days / weeks.  I would 
also need new version of Fransisco's patch set that applies to latest 
kernel like yours does.)




Since here boost is getting limited to turbo and non turbo, we need some
ways to adjust the fractions corresponding to max non turbo as well. It
is much easier to use the actual P-state limits for boost instead of
fractions. So here P-state io boost limit is applied on top of the
P-state limit calculated via current algorithm by removing current
io_wait boost calculation using fractions.

Since we prefer to use common algorithm for all processor platforms, this
change was tested on other client and sever platforms as well. All results
were within the margin of errors. Results:
https://bugzilla.kernel.org/attachment.cgi?id=278149


Good.

Francisco, how well the listed PTS tests cover latency bound cases you 
were concerned about?  [1]



- Eero

[1] Fransisco was concerned that the patch:
* trade-off might regress latency bound cases (which I haven't tested, I 
tested only 3D throughput), and
* that it addressed only other of the sources of energy inefficiencies 
he had identified (which could explain slightly better 3D results from 
his more complex patch set).




[PATCH] cpufreq: intel_pstate: Optimize IO boost in non HWP mode

2018-08-31 Thread Srinivas Pandruvada
The io_boost mechanism, when scheduler update util callback indicates wake
from IO was intended for short term boost to improve disk IO performance.
But once i915 graphics driver changed to io-schedule_timeout() from
schedule_timeout() during waiting for response from GPU, this caused
constant IO boost, causing intel_pstate to constantly boost to turbo.
This has bad side effect on TDP limited Atom platforms. The following two
links shows the boost and frequency plot for GPU Test called
triangle_benchmark.
https://bugzilla.kernel.org/attachment.cgi?id=278091
https://bugzilla.kernel.org/attachment.cgi?id=278093
This type of behavior was observed with many graphics tests and regular
use of such platforms with graphical interface.

The fix in this patch is to optimize the io boost by:
- Limit the maximum boost to base frequency on TDP limited Atom platforms
and max limit as 1 core turbo for the rest of the non-HWP platforms (same
as the current algorithm).
- Multilevel gradual boost: Start with increment = half of max boost and
increase to max on the subsequent IO wakes.
- Once max is reached and subsequent IO wakes don't cause boost for TDP
limited Atom platforms, with assumption that the target CPU already built
up enough load to run at higher P-state and the use of average frequency
in the current algorithm will also help not to reduce drastically. For
other platforms this is not limited similar to the current algorithm.

With this fix the resultant plots show the reduced average frequency and
also causes upto 10% improvement in some graphics tests on Atom (Broxton)
platform.
https://bugzilla.kernel.org/attachment.cgi?id=278095
https://bugzilla.kernel.org/attachment.cgi?id=278097

As per testing Eero Tamminen, the results are comparable to the patchset
https://patchwork.kernel.org/patch/10312259/
But he has to watch results for several days to check trend.

Since here boost is getting limited to turbo and non turbo, we need some
ways to adjust the fractions corresponding to max non turbo as well. It
is much easier to use the actual P-state limits for boost instead of
fractions. So here P-state io boost limit is applied on top of the
P-state limit calculated via current algorithm by removing current
io_wait boost calculation using fractions.

Since we prefer to use common algorithm for all processor platforms, this
change was tested on other client and sever platforms as well. All results
were within the margin of errors. Results:
https://bugzilla.kernel.org/attachment.cgi?id=278149

To identify TDP limited platforms a new callback boost_limited() is
added, which will set a per cpudata field called iowait_boost_limited to
1. Currently this is set for Atom platforms only.

Tested-by: Eero Tamminen 
Signed-off-by: Srinivas Pandruvada 
---
 drivers/cpufreq/intel_pstate.c | 112 +++--
 1 file changed, 92 insertions(+), 20 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 1f2ce2f57842..15d9d5483d85 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -195,7 +195,11 @@ struct global_params {
  * @policy:CPUFreq policy value
  * @update_util:   CPUFreq utility callback information
  * @update_util_set:   CPUFreq utility callback is set
- * @iowait_boost:  iowait-related boost fraction
+ * @iowait_boost:  iowait-related boost P-state
+ * @iowait_boost_active: iowait boost processing is active
+ * @iowait_boost_max:  Max boost P-state to apply
+ * @iowait_boost_increment: increment to last boost P-state
+ * @owait_boost_limited: If set give up boost, once reach max boost state
  * @last_update:   Time of the last update.
  * @pstate:Stores P state limits for this CPU
  * @vid:   Stores VID limits for this CPU
@@ -254,6 +258,10 @@ struct cpudata {
bool valid_pss_table;
 #endif
unsigned int iowait_boost;
+   unsigned int iowait_boost_active;
+   int iowait_boost_max;
+   int iowait_boost_increment;
+   int iowait_boost_limited;
s16 epp_powersave;
s16 epp_policy;
s16 epp_default;
@@ -276,6 +284,7 @@ static struct cpudata **all_cpu_data;
  * @get_scaling:   Callback to get frequency scaling factor
  * @get_val:   Callback to convert P state to actual MSR write value
  * @get_vid:   Callback to get VID data for Atom platforms
+ * @boost_limited: Callback to get max boost P-state, when applicable
  *
  * Core and Atom CPU models have different way to get P State limits. This
  * structure is used to store those callbacks.
@@ -289,6 +298,7 @@ struct pstate_funcs {
int (*get_aperf_mperf_shift)(void);
u64 (*get_val)(struct cpudata*, int pstate);
void (*get_vid)(struct cpudata *);
+   void (*boost_limited)(struct cpudata *cpudata);
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
@@ -1251,6 +1261,11 @@ static void atom_get_vid(struct cpudata *cpudata)