subject:"Re\: \[PATCH 4\/4\] sched\/fair\: Use a recently used CPU as an idle candidate and the basis for SIS"

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Mel Gorman

On Mon, Feb 05, 2018 at 09:04:25AM -0800, Srinivas Pandruvada wrote:
> > I know EPP is a range, default from bios usually appear to be 6 or 7
> > but
> > I didn't do much experiementation to see if there is another value
> > that
> > works better. Even if there is, the default may need to change as not
> > many
> > people even know what EPP is or how it should be tuned.
> I think you are talking about EPB not EPP because of ranges you
> mentioned here. EPP is a value from 0 to 255. EPP is part of
> HWP_REQUEST MSR.
> EPB with HWP is used only in Broadwell server. I think you are using
> Skylake here.
> 

You're right in that I mixed up EPB (0-15 IIRC) and EPP which has a
different range whose exact meaning and semantics are not super-clear to
me. The machine in question is Skylake but for the purposes of the current
discussion, the issue is the same. There appears to be a lack of giving
decent hints to the hardware based on knowledge of migrations or short
periods of idleness that the processor may be very busy for very short
bursts that are important even if overall utilisation is low.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Mel Gorman

On Mon, Feb 05, 2018 at 09:04:25AM -0800, Srinivas Pandruvada wrote:
> > I know EPP is a range, default from bios usually appear to be 6 or 7
> > but
> > I didn't do much experiementation to see if there is another value
> > that
> > works better. Even if there is, the default may need to change as not
> > many
> > people even know what EPP is or how it should be tuned.
> I think you are talking about EPB not EPP because of ranges you
> mentioned here. EPP is a value from 0 to 255. EPP is part of
> HWP_REQUEST MSR.
> EPB with HWP is used only in Broadwell server. I think you are using
> Skylake here.
> 

You're right in that I mixed up EPB (0-15 IIRC) and EPP which has a
different range whose exact meaning and semantics are not super-clear to
me. The machine in question is Skylake but for the purposes of the current
discussion, the issue is the same. There appears to be a lack of giving
decent hints to the hardware based on knowledge of migrations or short
periods of idleness that the processor may be very busy for very short
bursts that are important even if overall utilisation is low.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Srinivas Pandruvada

On Mon, 2018-02-05 at 11:10 +, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > > Sure, but the lack on detection when tasks are low utilisation
> > > but
> > > still
> > > latency/throughput sensitive is problematic. Users shouldn't have
> > > to
> > > know they need to disable HWP or set performance goernor out of
> > > the
> > > box.
> > > It's only going to get worse as sockets get larger.
> > 
> > I am not saying that we shouldn't do anything. Can you give me some
> > workloads which you care the most?
> > 
> 
> The proprietary workloads I'm aware of are useless to the discussion
> as they cannot be trivially reproduced and are typically only
> available
> under NDA. However, hints can be gotten by looking at the number of
> cases
> where recommended tunings limits C-states, set the performance
> governor,
> alter intel_pstate setpoint (if not HWP) etc.
> 
> For the purposes of illustration, dbench at low thread counts does
> a reasonable job even though it's not that interesting a workload in
> general. With ext4 in particular, the journalling thread interactions
> bounce tasks around the machine and the short sleeps for IO both
> combine
> to have relatively low utilisation on individual CPUs. It's less
> pronounced
> on xfs as it bounces less due to using kworkers instead of kthreads.
> 
> > > 
> > > > There are totally different way HWP is handled in client an
> > > > servers.
> > > > If you set desired all heuristics they collected will be
> > > > dumped, so
> > > > they suggest don't set desired when you are in autonomous mode.
> > > > If
> > > > we
> > > > really want a boost set the EPP. We know that EPP makes lots of
> > > > measurable difference.
> > > > 
> > > 
> > > Sure boosting EPP makes a difference -- it's essentially what the
> > > performance
> > > goveror does and I know that can be done by a user but it's still
> > > basically a
> > > cop-out. Default performance for low utilisation or lightly
> > > loaded
> > > machines
> > > is poor. Maybe it should be set based on the ACPI preferred
> > > profile
> > > but
> > > that information is not always available. It would be nice if
> > > *some*
> > > sort of hint about new migrations or tasks waking from IO would
> > > be
> > > desirable.
> > 
> > EPP is a range not a single value. So you don't need to make EPP=0
> > as a
> > performance governor. PeterZ gave me some scheduler change to
> > experiment, which can be used as hint to play with EPP. 
> > 
> 
> I know EPP is a range, default from bios usually appear to be 6 or 7
> but
> I didn't do much experiementation to see if there is another value
> that
> works better. Even if there is, the default may need to change as not
> many
> people even know what EPP is or how it should be tuned.
I think you are talking about EPB not EPP because of ranges you
mentioned here. EPP is a value from 0 to 255. EPP is part of
HWP_REQUEST MSR.
EPB with HWP is used only in Broadwell server. I think you are using
Skylake here.

Thanks,
Srinivas

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Srinivas Pandruvada

On Mon, 2018-02-05 at 11:10 +, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > > Sure, but the lack on detection when tasks are low utilisation
> > > but
> > > still
> > > latency/throughput sensitive is problematic. Users shouldn't have
> > > to
> > > know they need to disable HWP or set performance goernor out of
> > > the
> > > box.
> > > It's only going to get worse as sockets get larger.
> > 
> > I am not saying that we shouldn't do anything. Can you give me some
> > workloads which you care the most?
> > 
> 
> The proprietary workloads I'm aware of are useless to the discussion
> as they cannot be trivially reproduced and are typically only
> available
> under NDA. However, hints can be gotten by looking at the number of
> cases
> where recommended tunings limits C-states, set the performance
> governor,
> alter intel_pstate setpoint (if not HWP) etc.
> 
> For the purposes of illustration, dbench at low thread counts does
> a reasonable job even though it's not that interesting a workload in
> general. With ext4 in particular, the journalling thread interactions
> bounce tasks around the machine and the short sleeps for IO both
> combine
> to have relatively low utilisation on individual CPUs. It's less
> pronounced
> on xfs as it bounces less due to using kworkers instead of kthreads.
> 
> > > 
> > > > There are totally different way HWP is handled in client an
> > > > servers.
> > > > If you set desired all heuristics they collected will be
> > > > dumped, so
> > > > they suggest don't set desired when you are in autonomous mode.
> > > > If
> > > > we
> > > > really want a boost set the EPP. We know that EPP makes lots of
> > > > measurable difference.
> > > > 
> > > 
> > > Sure boosting EPP makes a difference -- it's essentially what the
> > > performance
> > > goveror does and I know that can be done by a user but it's still
> > > basically a
> > > cop-out. Default performance for low utilisation or lightly
> > > loaded
> > > machines
> > > is poor. Maybe it should be set based on the ACPI preferred
> > > profile
> > > but
> > > that information is not always available. It would be nice if
> > > *some*
> > > sort of hint about new migrations or tasks waking from IO would
> > > be
> > > desirable.
> > 
> > EPP is a range not a single value. So you don't need to make EPP=0
> > as a
> > performance governor. PeterZ gave me some scheduler change to
> > experiment, which can be used as hint to play with EPP. 
> > 
> 
> I know EPP is a range, default from bios usually appear to be 6 or 7
> but
> I didn't do much experiementation to see if there is another value
> that
> works better. Even if there is, the default may need to change as not
> many
> people even know what EPP is or how it should be tuned.
I think you are talking about EPB not EPP because of ranges you
mentioned here. EPP is a value from 0 to 255. EPP is part of
HWP_REQUEST MSR.
EPB with HWP is used only in Broadwell server. I think you are using
Skylake here.

Thanks,
Srinivas

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Mel Gorman

On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > Sure, but the lack on detection when tasks are low utilisation but
> > still
> > latency/throughput sensitive is problematic. Users shouldn't have to
> > know they need to disable HWP or set performance goernor out of the
> > box.
> > It's only going to get worse as sockets get larger.
>
> I am not saying that we shouldn't do anything. Can you give me some
> workloads which you care the most?
> 

The proprietary workloads I'm aware of are useless to the discussion
as they cannot be trivially reproduced and are typically only available
under NDA. However, hints can be gotten by looking at the number of cases
where recommended tunings limits C-states, set the performance governor,
alter intel_pstate setpoint (if not HWP) etc.

For the purposes of illustration, dbench at low thread counts does
a reasonable job even though it's not that interesting a workload in
general. With ext4 in particular, the journalling thread interactions
bounce tasks around the machine and the short sleeps for IO both combine
to have relatively low utilisation on individual CPUs. It's less pronounced
on xfs as it bounces less due to using kworkers instead of kthreads.

> > 
> > > There are totally different way HWP is handled in client an
> > > servers.
> > > If you set desired all heuristics they collected will be dumped, so
> > > they suggest don't set desired when you are in autonomous mode. If
> > > we
> > > really want a boost set the EPP. We know that EPP makes lots of
> > > measurable difference.
> > > 
> > 
> > Sure boosting EPP makes a difference -- it's essentially what the
> > performance
> > goveror does and I know that can be done by a user but it's still
> > basically a
> > cop-out. Default performance for low utilisation or lightly loaded
> > machines
> > is poor. Maybe it should be set based on the ACPI preferred profile
> > but
> > that information is not always available. It would be nice if *some*
> > sort of hint about new migrations or tasks waking from IO would be
> > desirable.
> EPP is a range not a single value. So you don't need to make EPP=0 as a
> performance governor. PeterZ gave me some scheduler change to
> experiment, which can be used as hint to play with EPP. 
> 

I know EPP is a range, default from bios usually appear to be 6 or 7 but
I didn't do much experiementation to see if there is another value that
works better. Even if there is, the default may need to change as not many
people even know what EPP is or how it should be tuned.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Mel Gorman

On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > Sure, but the lack on detection when tasks are low utilisation but
> > still
> > latency/throughput sensitive is problematic. Users shouldn't have to
> > know they need to disable HWP or set performance goernor out of the
> > box.
> > It's only going to get worse as sockets get larger.
>
> I am not saying that we shouldn't do anything. Can you give me some
> workloads which you care the most?
> 

The proprietary workloads I'm aware of are useless to the discussion
as they cannot be trivially reproduced and are typically only available
under NDA. However, hints can be gotten by looking at the number of cases
where recommended tunings limits C-states, set the performance governor,
alter intel_pstate setpoint (if not HWP) etc.

For the purposes of illustration, dbench at low thread counts does
a reasonable job even though it's not that interesting a workload in
general. With ext4 in particular, the journalling thread interactions
bounce tasks around the machine and the short sleeps for IO both combine
to have relatively low utilisation on individual CPUs. It's less pronounced
on xfs as it bounces less due to using kworkers instead of kthreads.

> > 
> > > There are totally different way HWP is handled in client an
> > > servers.
> > > If you set desired all heuristics they collected will be dumped, so
> > > they suggest don't set desired when you are in autonomous mode. If
> > > we
> > > really want a boost set the EPP. We know that EPP makes lots of
> > > measurable difference.
> > > 
> > 
> > Sure boosting EPP makes a difference -- it's essentially what the
> > performance
> > goveror does and I know that can be done by a user but it's still
> > basically a
> > cop-out. Default performance for low utilisation or lightly loaded
> > machines
> > is poor. Maybe it should be set based on the ACPI preferred profile
> > but
> > that information is not always available. It would be nice if *some*
> > sort of hint about new migrations or tasks waking from IO would be
> > desirable.
> EPP is a range not a single value. So you don't need to make EPP=0 as a
> performance governor. PeterZ gave me some scheduler change to
> experiment, which can be used as hint to play with EPP. 
> 

I know EPP is a range, default from bios usually appear to be 6 or 7 but
I didn't do much experiementation to see if there is another value that
works better. Even if there is, the default may need to change as not many
people even know what EPP is or how it should be tuned.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > > 
> > > > +static void __intel_pstate_hwp_set_desired(int val)
> > > > +{
> > > > +   u64 value;
> > > > +
> > > > +   value = rdmsrl(MSR_HWP_REQUEST);
> > > > +   value &= ~GENMASK_ULL(23, 16);
> > > > +   value |= (val & 0xff) << 16;
> > > > +   wrmsrl(MSR_HWP_REQUEST, val);
> > > > +}
> > > 
> > > Also, if we keep a software shadow of that MSR, we can avoid the
> > > rdmsr, which might also help.
> > The reason we don't keep a software shadow as users can use x86-energy-
> > perf utility or via BMC to adjust.  CCed  to Len. 
> 
> Total NAK on that. People using /dev/msr to prod at state get to keep
> whatever pieces they end up with.
> 
> We _really_ should make /dev/msr taint the kernel on write. Ingo,
> Thomas, can we pretty please just do that?

Sure - mind sending a patch?

Thanks,

Ingo

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Ingo Molnar


* Peter Zijlstra  wrote:

> On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > > 
> > > > +static void __intel_pstate_hwp_set_desired(int val)
> > > > +{
> > > > +   u64 value;
> > > > +
> > > > +   value = rdmsrl(MSR_HWP_REQUEST);
> > > > +   value &= ~GENMASK_ULL(23, 16);
> > > > +   value |= (val & 0xff) << 16;
> > > > +   wrmsrl(MSR_HWP_REQUEST, val);
> > > > +}
> > > 
> > > Also, if we keep a software shadow of that MSR, we can avoid the
> > > rdmsr, which might also help.
> > The reason we don't keep a software shadow as users can use x86-energy-
> > perf utility or via BMC to adjust.  CCed  to Len. 
> 
> Total NAK on that. People using /dev/msr to prod at state get to keep
> whatever pieces they end up with.
> 
> We _really_ should make /dev/msr taint the kernel on write. Ingo,
> Thomas, can we pretty please just do that?

Sure - mind sending a patch?

Thanks,

Ingo

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Peter Zijlstra

On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > 
> > > +static void __intel_pstate_hwp_set_desired(int val)
> > > +{
> > > + u64 value;
> > > +
> > > + value = rdmsrl(MSR_HWP_REQUEST);
> > > + value &= ~GENMASK_ULL(23, 16);
> > > + value |= (val & 0xff) << 16;
> > > + wrmsrl(MSR_HWP_REQUEST, val);
> > > +}
> > 
> > Also, if we keep a software shadow of that MSR, we can avoid the
> > rdmsr, which might also help.
> The reason we don't keep a software shadow as users can use x86-energy-
> perf utility or via BMC to adjust.  CCed  to Len. 

Total NAK on that. People using /dev/msr to prod at state get to keep
whatever pieces they end up with.

We _really_ should make /dev/msr taint the kernel on write. Ingo,
Thomas, can we pretty please just do that?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-05 Thread Peter Zijlstra

On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > 
> > > +static void __intel_pstate_hwp_set_desired(int val)
> > > +{
> > > + u64 value;
> > > +
> > > + value = rdmsrl(MSR_HWP_REQUEST);
> > > + value &= ~GENMASK_ULL(23, 16);
> > > + value |= (val & 0xff) << 16;
> > > + wrmsrl(MSR_HWP_REQUEST, val);
> > > +}
> > 
> > Also, if we keep a software shadow of that MSR, we can avoid the
> > rdmsr, which might also help.
> The reason we don't keep a software shadow as users can use x86-energy-
> perf utility or via BMC to adjust.  CCed  to Len. 

Total NAK on that. People using /dev/msr to prod at state get to keep
whatever pieces they end up with.

We _really_ should make /dev/msr taint the kernel on write. Ingo,
Thomas, can we pretty please just do that?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-04 Thread Rafael J. Wysocki

On Friday, February 2, 2018 8:48:01 PM CET Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it matches
> > > > > with
> > > > > the intent here. But I've no idea what our current HWP
> > > > > implementation
> > > > > actually does with it.
> > > > Desired !=0 will disable HWP autonomous mode of frequency
> > > > selection.
> > > But I don't think it will just run at "desired" then, will it?
> > HWP all are these hints only not a guarantee.
> 
> Sure, but the lack on detection when tasks are low utilisation but still
> latency/throughput sensitive is problematic. Users shouldn't have to
> know they need to disable HWP or set performance goernor out of the box.
> It's only going to get worse as sockets get larger.

OK, so that's the case I was thinking about when the kernel actually knows
more about what's going on than the HW-based logic.  That's when it makes
sense to provide hints, of course.

> > There are totally different way HWP is handled in client an servers.
> > If you set desired all heuristics they collected will be dumped, so
> > they suggest don't set desired when you are in autonomous mode. If we
> > really want a boost set the EPP. We know that EPP makes lots of
> > measurable difference.
> > 
> 
> Sure boosting EPP makes a difference -- it's essentially what the performance
> goveror does and I know that can be done by a user but it's still basically a
> cop-out. Default performance for low utilisation or lightly loaded machines
> is poor. Maybe it should be set based on the ACPI preferred profile but
> that information is not always available. It would be nice if *some*
> sort of hint about new migrations or tasks waking from IO would be desirable.

Agreed, we only need to figure out a clean way to do that.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-04 Thread Rafael J. Wysocki

On Friday, February 2, 2018 8:48:01 PM CET Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it matches
> > > > > with
> > > > > the intent here. But I've no idea what our current HWP
> > > > > implementation
> > > > > actually does with it.
> > > > Desired !=0 will disable HWP autonomous mode of frequency
> > > > selection.
> > > But I don't think it will just run at "desired" then, will it?
> > HWP all are these hints only not a guarantee.
> 
> Sure, but the lack on detection when tasks are low utilisation but still
> latency/throughput sensitive is problematic. Users shouldn't have to
> know they need to disable HWP or set performance goernor out of the box.
> It's only going to get worse as sockets get larger.

OK, so that's the case I was thinking about when the kernel actually knows
more about what's going on than the HW-based logic.  That's when it makes
sense to provide hints, of course.

> > There are totally different way HWP is handled in client an servers.
> > If you set desired all heuristics they collected will be dumped, so
> > they suggest don't set desired when you are in autonomous mode. If we
> > really want a boost set the EPP. We know that EPP makes lots of
> > measurable difference.
> > 
> 
> Sure boosting EPP makes a difference -- it's essentially what the performance
> goveror does and I know that can be done by a user but it's still basically a
> cop-out. Default performance for low utilisation or lightly loaded machines
> is poor. Maybe it should be set based on the ACPI preferred profile but
> that information is not always available. It would be nice if *some*
> sort of hint about new migrations or tasks waking from IO would be desirable.

Agreed, we only need to figure out a clean way to do that.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-04 Thread Rafael J. Wysocki

On Friday, February 2, 2018 3:54:24 PM CET Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> > On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> > wrote:
> > > 
> > > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > > 
> > > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > 
> > > > > 
> > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > > wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > > "Maximum_Performance" and
> > > > > > > > "Desired_Performance" fields which can be used to give
> > > > > > > > explicit
> > > > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > > > 
> > > > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > > > hardware really
> > > > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > > That just means we might need to throttle writing to it, like
> > > > > > it
> > > > > > already
> > > > > > does for the regular pstate (PERF_CTRL) msr in any case
> > > > > > (also, is
> > > > > > that a
> > > > > > cheap msr?)
> > > > > > 
> > > > > > Not touching it at all seems silly.
> > > > > OK
> > > > > 
> > > > > So what field precisely would you touch?  "desired"?  If so,
> > > > > does
> > > > > that actually
> > > > > guarantee anything to happen?
> > > > No idea, desired would be the one I would start with, it matches
> > > > with
> > > > the intent here. But I've no idea what our current HWP
> > > > implementation
> > > > actually does with it.
> > > Desired !=0 will disable HWP autonomous mode of frequency
> > > selection.
> > But I don't think it will just run at "desired" then, will it?
> HWP all are these hints only not a guarantee.
> There are totally different way HWP is handled in client an servers.
> If you set desired all heuristics they collected will be dumped, so
> they suggest don't set desired when you are in autonomous mode. If we
> really want a boost set the EPP. We know that EPP makes lots of
> measurable difference.

Yeah, we've been consistently told to use EPP rather than "desired" for hints
like that.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-04 Thread Rafael J. Wysocki

On Friday, February 2, 2018 3:54:24 PM CET Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> > On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> > wrote:
> > > 
> > > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > > 
> > > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > 
> > > > > 
> > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > > wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > > "Maximum_Performance" and
> > > > > > > > "Desired_Performance" fields which can be used to give
> > > > > > > > explicit
> > > > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > > > 
> > > > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > > > hardware really
> > > > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > > That just means we might need to throttle writing to it, like
> > > > > > it
> > > > > > already
> > > > > > does for the regular pstate (PERF_CTRL) msr in any case
> > > > > > (also, is
> > > > > > that a
> > > > > > cheap msr?)
> > > > > > 
> > > > > > Not touching it at all seems silly.
> > > > > OK
> > > > > 
> > > > > So what field precisely would you touch?  "desired"?  If so,
> > > > > does
> > > > > that actually
> > > > > guarantee anything to happen?
> > > > No idea, desired would be the one I would start with, it matches
> > > > with
> > > > the intent here. But I've no idea what our current HWP
> > > > implementation
> > > > actually does with it.
> > > Desired !=0 will disable HWP autonomous mode of frequency
> > > selection.
> > But I don't think it will just run at "desired" then, will it?
> HWP all are these hints only not a guarantee.
> There are totally different way HWP is handled in client an servers.
> If you set desired all heuristics they collected will be dumped, so
> they suggest don't set desired when you are in autonomous mode. If we
> really want a boost set the EPP. We know that EPP makes lots of
> measurable difference.

Yeah, we've been consistently told to use EPP rather than "desired" for hints
like that.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-03 Thread Srinivas Pandruvada

On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> 
> > +static void __intel_pstate_hwp_set_desired(int val)
> > +{
> > +   u64 value;
> > +
> > +   value = rdmsrl(MSR_HWP_REQUEST);
> > +   value &= ~GENMASK_ULL(23, 16);
> > +   value |= (val & 0xff) << 16;
> > +   wrmsrl(MSR_HWP_REQUEST, val);
> > +}
> 
> Also, if we keep a software shadow of that MSR, we can avoid the
> rdmsr, which might also help.
The reason we don't keep a software shadow as users can use x86-energy-
perf utility or via BMC to adjust.  CCed  to Len.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-03 Thread Srinivas Pandruvada

On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> 
> > +static void __intel_pstate_hwp_set_desired(int val)
> > +{
> > +   u64 value;
> > +
> > +   value = rdmsrl(MSR_HWP_REQUEST);
> > +   value &= ~GENMASK_ULL(23, 16);
> > +   value |= (val & 0xff) << 16;
> > +   wrmsrl(MSR_HWP_REQUEST, val);
> > +}
> 
> Also, if we keep a software shadow of that MSR, we can avoid the
> rdmsr, which might also help.
The reason we don't keep a software shadow as users can use x86-energy-
perf utility or via BMC to adjust.  CCed  to Len.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Srinivas Pandruvada

On Fri, 2018-02-02 at 19:48 +, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it
> > > > > matches
> > > > > with
> > > > > the intent here. But I've no idea what our current HWP
> > > > > implementation
> > > > > actually does with it.
> > > > 
> > > > Desired !=0 will disable HWP autonomous mode of frequency
> > > > selection.
> > > 
> > > But I don't think it will just run at "desired" then, will it?
> > 
> > HWP all are these hints only not a guarantee.
> 
> Sure, but the lack on detection when tasks are low utilisation but
> still
> latency/throughput sensitive is problematic. Users shouldn't have to
> know they need to disable HWP or set performance goernor out of the
> box.
> It's only going to get worse as sockets get larger.
I am not saying that we shouldn't do anything. Can you give me some
workloads which you care the most?

> 
> > There are totally different way HWP is handled in client an
> > servers.
> > If you set desired all heuristics they collected will be dumped, so
> > they suggest don't set desired when you are in autonomous mode. If
> > we
> > really want a boost set the EPP. We know that EPP makes lots of
> > measurable difference.
> > 
> 
> Sure boosting EPP makes a difference -- it's essentially what the
> performance
> goveror does and I know that can be done by a user but it's still
> basically a
> cop-out. Default performance for low utilisation or lightly loaded
> machines
> is poor. Maybe it should be set based on the ACPI preferred profile
> but
> that information is not always available. It would be nice if *some*
> sort of hint about new migrations or tasks waking from IO would be
> desirable.
EPP is a range not a single value. So you don't need to make EPP=0 as a
performance governor. PeterZ gave me some scheduler change to
experiment, which can be used as hint to play with EPP. 

Thanks,
Srinivas

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Srinivas Pandruvada

On Fri, 2018-02-02 at 19:48 +, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it
> > > > > matches
> > > > > with
> > > > > the intent here. But I've no idea what our current HWP
> > > > > implementation
> > > > > actually does with it.
> > > > 
> > > > Desired !=0 will disable HWP autonomous mode of frequency
> > > > selection.
> > > 
> > > But I don't think it will just run at "desired" then, will it?
> > 
> > HWP all are these hints only not a guarantee.
> 
> Sure, but the lack on detection when tasks are low utilisation but
> still
> latency/throughput sensitive is problematic. Users shouldn't have to
> know they need to disable HWP or set performance goernor out of the
> box.
> It's only going to get worse as sockets get larger.
I am not saying that we shouldn't do anything. Can you give me some
workloads which you care the most?

> 
> > There are totally different way HWP is handled in client an
> > servers.
> > If you set desired all heuristics they collected will be dumped, so
> > they suggest don't set desired when you are in autonomous mode. If
> > we
> > really want a boost set the EPP. We know that EPP makes lots of
> > measurable difference.
> > 
> 
> Sure boosting EPP makes a difference -- it's essentially what the
> performance
> goveror does and I know that can be done by a user but it's still
> basically a
> cop-out. Default performance for low utilisation or lightly loaded
> machines
> is poor. Maybe it should be set based on the ACPI preferred profile
> but
> that information is not always available. It would be nice if *some*
> sort of hint about new migrations or tasks waking from IO would be
> desirable.
EPP is a range not a single value. So you don't need to make EPP=0 as a
performance governor. PeterZ gave me some scheduler change to
experiment, which can be used as hint to play with EPP. 

Thanks,
Srinivas

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Mel Gorman

On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > No idea, desired would be the one I would start with, it matches
> > > > with
> > > > the intent here. But I've no idea what our current HWP
> > > > implementation
> > > > actually does with it.
> > > Desired !=0 will disable HWP autonomous mode of frequency
> > > selection.
> > But I don't think it will just run at "desired" then, will it?
> HWP all are these hints only not a guarantee.

Sure, but the lack on detection when tasks are low utilisation but still
latency/throughput sensitive is problematic. Users shouldn't have to
know they need to disable HWP or set performance goernor out of the box.
It's only going to get worse as sockets get larger.

> There are totally different way HWP is handled in client an servers.
> If you set desired all heuristics they collected will be dumped, so
> they suggest don't set desired when you are in autonomous mode. If we
> really want a boost set the EPP. We know that EPP makes lots of
> measurable difference.
> 

Sure boosting EPP makes a difference -- it's essentially what the performance
goveror does and I know that can be done by a user but it's still basically a
cop-out. Default performance for low utilisation or lightly loaded machines
is poor. Maybe it should be set based on the ACPI preferred profile but
that information is not always available. It would be nice if *some*
sort of hint about new migrations or tasks waking from IO would be desirable.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Mel Gorman

On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > No idea, desired would be the one I would start with, it matches
> > > > with
> > > > the intent here. But I've no idea what our current HWP
> > > > implementation
> > > > actually does with it.
> > > Desired !=0 will disable HWP autonomous mode of frequency
> > > selection.
> > But I don't think it will just run at "desired" then, will it?
> HWP all are these hints only not a guarantee.

Sure, but the lack on detection when tasks are low utilisation but still
latency/throughput sensitive is problematic. Users shouldn't have to
know they need to disable HWP or set performance goernor out of the box.
It's only going to get worse as sockets get larger.

> There are totally different way HWP is handled in client an servers.
> If you set desired all heuristics they collected will be dumped, so
> they suggest don't set desired when you are in autonomous mode. If we
> really want a boost set the EPP. We know that EPP makes lots of
> measurable difference.
> 

Sure boosting EPP makes a difference -- it's essentially what the performance
goveror does and I know that can be done by a user but it's still basically a
cop-out. Default performance for low utilisation or lightly loaded machines
is poor. Maybe it should be set based on the ACPI preferred profile but
that information is not always available. It would be nice if *some*
sort of hint about new migrations or tasks waking from IO would be desirable.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Srinivas Pandruvada

On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> wrote:
> > 
> > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > 
> > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > wrote:
> > > > 
> > > > 
> > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > wrote:
> > > > > 
> > > > > 
> > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > > wrote:
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > "Maximum_Performance" and
> > > > > > > "Desired_Performance" fields which can be used to give
> > > > > > > explicit
> > > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > > 
> > > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > > hardware really
> > > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > That just means we might need to throttle writing to it, like
> > > > > it
> > > > > already
> > > > > does for the regular pstate (PERF_CTRL) msr in any case
> > > > > (also, is
> > > > > that a
> > > > > cheap msr?)
> > > > > 
> > > > > Not touching it at all seems silly.
> > > > OK
> > > > 
> > > > So what field precisely would you touch?  "desired"?  If so,
> > > > does
> > > > that actually
> > > > guarantee anything to happen?
> > > No idea, desired would be the one I would start with, it matches
> > > with
> > > the intent here. But I've no idea what our current HWP
> > > implementation
> > > actually does with it.
> > Desired !=0 will disable HWP autonomous mode of frequency
> > selection.
> But I don't think it will just run at "desired" then, will it?
HWP all are these hints only not a guarantee.
There are totally different way HWP is handled in client an servers.
If you set desired all heuristics they collected will be dumped, so
they suggest don't set desired when you are in autonomous mode. If we
really want a boost set the EPP. We know that EPP makes lots of
measurable difference.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Srinivas Pandruvada

On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> wrote:
> > 
> > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > 
> > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > wrote:
> > > > 
> > > > 
> > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > wrote:
> > > > > 
> > > > > 
> > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > > wrote:
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > "Maximum_Performance" and
> > > > > > > "Desired_Performance" fields which can be used to give
> > > > > > > explicit
> > > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > > 
> > > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > > hardware really
> > > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > That just means we might need to throttle writing to it, like
> > > > > it
> > > > > already
> > > > > does for the regular pstate (PERF_CTRL) msr in any case
> > > > > (also, is
> > > > > that a
> > > > > cheap msr?)
> > > > > 
> > > > > Not touching it at all seems silly.
> > > > OK
> > > > 
> > > > So what field precisely would you touch?  "desired"?  If so,
> > > > does
> > > > that actually
> > > > guarantee anything to happen?
> > > No idea, desired would be the one I would start with, it matches
> > > with
> > > the intent here. But I've no idea what our current HWP
> > > implementation
> > > actually does with it.
> > Desired !=0 will disable HWP autonomous mode of frequency
> > selection.
> But I don't think it will just run at "desired" then, will it?
HWP all are these hints only not a guarantee.
There are totally different way HWP is handled in client an servers.
If you set desired all heuristics they collected will be dumped, so
they suggest don't set desired when you are in autonomous mode. If we
really want a boost set the EPP. We know that EPP makes lots of
measurable difference.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:

> +static void __intel_pstate_hwp_set_desired(int val)
> +{
> + u64 value;
> +
> + value = rdmsrl(MSR_HWP_REQUEST);
> + value &= ~GENMASK_ULL(23, 16);
> + value |= (val & 0xff) << 16;
> + wrmsrl(MSR_HWP_REQUEST, val);
> +}

Also, if we keep a software shadow of that MSR, we can avoid the
rdmsr, which might also help.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:

> +static void __intel_pstate_hwp_set_desired(int val)
> +{
> + u64 value;
> +
> + value = rdmsrl(MSR_HWP_REQUEST);
> + value &= ~GENMASK_ULL(23, 16);
> + value |= (val & 0xff) << 16;
> + wrmsrl(MSR_HWP_REQUEST, val);
> +}

Also, if we keep a software shadow of that MSR, we can avoid the
rdmsr, which might also help.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Mel Gorman

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > > > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > > > You should _never_ do things like:
> > > > 
> > > >   rdmsr_on_cpu()
> > > >   /* frob value */
> > > >   wrmsr_on_cpu()
> > > > 
> > > > That's insane.
> > > 
> > > I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> > > at all?
> > 
> > Yes, too many synchronous IPIs, which themselves are typically already
> > more expensive than the MSR access.
> 
> We could do all of the updates in one IPI (as Srinivas said), but it would be
> more code, and custom code for that matter.
> 
> Is this really worth it for a slow path like this one?
> 

Maybe it's a slow path at the moment but don't forget that one motivation
for this series is that HWP does not properly react when utilisation of a
CPU is artifically low because a task recently slept for IO or recently
migrated. In both cases, the task may be busy and sensitive to either
latency, throughput or both but HWP will use a low p-state. A standard
driver can do io-boost and while it currently does not do so, it could
also trivially do idle-boosting -- HWP does neither.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Mel Gorman

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > > > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > > > You should _never_ do things like:
> > > > 
> > > >   rdmsr_on_cpu()
> > > >   /* frob value */
> > > >   wrmsr_on_cpu()
> > > > 
> > > > That's insane.
> > > 
> > > I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> > > at all?
> > 
> > Yes, too many synchronous IPIs, which themselves are typically already
> > more expensive than the MSR access.
> 
> We could do all of the updates in one IPI (as Srinivas said), but it would be
> more code, and custom code for that matter.
> 
> Is this really worth it for a slow path like this one?
> 

Maybe it's a slow path at the moment but don't forget that one motivation
for this series is that HWP does not properly react when utilisation of a
CPU is artifically low because a task recently slept for IO or recently
migrated. In both cases, the task may be busy and sensitive to either
latency, throughput or both but HWP will use a low p-state. A standard
driver can do io-boost and while it currently does not do so, it could
also trivially do idle-boosting -- HWP does neither.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > Yes, too many synchronous IPIs, which themselves are typically already
> > more expensive than the MSR access.
> 
> We could do all of the updates in one IPI (as Srinivas said), but it would be
> more code, and custom code for that matter.
> 
> Is this really worth it for a slow path like this one?

Probably not, still, the pattern really annoys me ;-)

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > Yes, too many synchronous IPIs, which themselves are typically already
> > more expensive than the MSR access.
> 
> We could do all of the updates in one IPI (as Srinivas said), but it would be
> more code, and custom code for that matter.
> 
> Is this really worth it for a slow path like this one?

Probably not, still, the pattern really annoys me ;-)

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> +static void __intel_pstate_hwp_set_desired(int val)
> +{
> + u64 value;
> +
> + value = rdmsrl(MSR_HWP_REQUEST);
> + value &= ~GENMASK_ULL(23, 16);
> + value |= (val & 0xff) << 16;
> + wrmsrl(MSR_HWP_REQUEST, val);

D'0h: s/val/value/

> +}

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> +static void __intel_pstate_hwp_set_desired(int val)
> +{
> + u64 value;
> +
> + value = rdmsrl(MSR_HWP_REQUEST);
> + value &= ~GENMASK_ULL(23, 16);
> + value |= (val & 0xff) << 16;
> + wrmsrl(MSR_HWP_REQUEST, val);

D'0h: s/val/value/

> +}

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > If you really care you can do async IPIs and do a custom serialization
> > that only waits when you do back-to-back things, which should be fairly
> > uncommon I'd think.
> 
> In this particular case we don't want to return to user space before the
> MSR is actually written with the new value.

Why not?

I was thinking of something like the below, which would in fact do
exactly that.

---
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 7edf7a0e5a96..f0caa5cc7adb 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -767,6 +768,40 @@ static void intel_pstate_hwp_set(unsigned int cpu)
wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
 }
 
+static void __intel_pstate_hwp_set_desired(int val)
+{
+   u64 value;
+
+   value = rdmsrl(MSR_HWP_REQUEST);
+   value &= ~GENMASK_ULL(23, 16);
+   value |= (val & 0xff) << 16;
+   wrmsrl(MSR_HWP_REQUEST, val);
+}
+
+static void __intel_pstate_hwp_func(void *data)
+{
+   __intel_pstate_hwp_set_desired((int)data);
+}
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct __call_single_data, csd_data);
+
+static void intel_pstate_hwp_set_desired(int cpu, int val)
+{
+   struct call_function_data *csd;
+
+   preempt_disable();
+   csd = this_cpu_ptr(_data);
+   /* wait for previous invocation to complete */
+   csd_lock_wait(csd);
+
+   csd->func = __intel_pstate_hwp_func;
+   csd->info = (unsigned long)val;
+
+   smp_call_function_single_async(cpu, csd);
+   preempt_enable();
+}
+
+
 static int intel_pstate_hwp_save_state(struct cpufreq_policy *policy)
 {
struct cpudata *cpu_data = all_cpu_data[policy->cpu];
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..2bc125ec6146 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -14,6 +14,11 @@
 #include 
 #include 
 
+enum {
+   CSD_FLAG_LOCK   = 0x01,
+   CSD_FLAG_SYNCHRONOUS= 0x02,
+};
+
 typedef void (*smp_call_func_t)(void *info);
 struct __call_single_data {
struct llist_node llist;
@@ -26,6 +31,11 @@ struct __call_single_data {
 typedef struct __call_single_data call_single_data_t
__aligned(sizeof(struct __call_single_data));
 
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
+{
+   smp_cond_load_acquire(>flags, !(VAL & CSD_FLAG_LOCK));
+}
+
 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;
 
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..af0ef9eb7679 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -22,11 +22,6 @@
 
 #include "smpboot.h"
 
-enum {
-   CSD_FLAG_LOCK   = 0x01,
-   CSD_FLAG_SYNCHRONOUS= 0x02,
-};
-
 struct call_function_data {
call_single_data_t  __percpu *csd;
cpumask_var_t   cpumask;
@@ -103,11 +98,6 @@ void __init call_function_init(void)
  * previous function call. For multi-cpu calls its even more interesting
  * as we'll have to ensure no other cpu is observing our csd.
  */
-static __always_inline void csd_lock_wait(call_single_data_t *csd)
-{
-   smp_cond_load_acquire(>flags, !(VAL & CSD_FLAG_LOCK));
-}
-
 static __always_inline void csd_lock(call_single_data_t *csd)
 {
csd_lock_wait(csd);

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Peter Zijlstra

On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote:
> > If you really care you can do async IPIs and do a custom serialization
> > that only waits when you do back-to-back things, which should be fairly
> > uncommon I'd think.
> 
> In this particular case we don't want to return to user space before the
> MSR is actually written with the new value.

Why not?

I was thinking of something like the below, which would in fact do
exactly that.

---
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 7edf7a0e5a96..f0caa5cc7adb 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -767,6 +768,40 @@ static void intel_pstate_hwp_set(unsigned int cpu)
wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value);
 }
 
+static void __intel_pstate_hwp_set_desired(int val)
+{
+   u64 value;
+
+   value = rdmsrl(MSR_HWP_REQUEST);
+   value &= ~GENMASK_ULL(23, 16);
+   value |= (val & 0xff) << 16;
+   wrmsrl(MSR_HWP_REQUEST, val);
+}
+
+static void __intel_pstate_hwp_func(void *data)
+{
+   __intel_pstate_hwp_set_desired((int)data);
+}
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct __call_single_data, csd_data);
+
+static void intel_pstate_hwp_set_desired(int cpu, int val)
+{
+   struct call_function_data *csd;
+
+   preempt_disable();
+   csd = this_cpu_ptr(_data);
+   /* wait for previous invocation to complete */
+   csd_lock_wait(csd);
+
+   csd->func = __intel_pstate_hwp_func;
+   csd->info = (unsigned long)val;
+
+   smp_call_function_single_async(cpu, csd);
+   preempt_enable();
+}
+
+
 static int intel_pstate_hwp_save_state(struct cpufreq_policy *policy)
 {
struct cpudata *cpu_data = all_cpu_data[policy->cpu];
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 9fb239e12b82..2bc125ec6146 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -14,6 +14,11 @@
 #include 
 #include 
 
+enum {
+   CSD_FLAG_LOCK   = 0x01,
+   CSD_FLAG_SYNCHRONOUS= 0x02,
+};
+
 typedef void (*smp_call_func_t)(void *info);
 struct __call_single_data {
struct llist_node llist;
@@ -26,6 +31,11 @@ struct __call_single_data {
 typedef struct __call_single_data call_single_data_t
__aligned(sizeof(struct __call_single_data));
 
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
+{
+   smp_cond_load_acquire(>flags, !(VAL & CSD_FLAG_LOCK));
+}
+
 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;
 
diff --git a/kernel/smp.c b/kernel/smp.c
index 084c8b3a2681..af0ef9eb7679 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -22,11 +22,6 @@
 
 #include "smpboot.h"
 
-enum {
-   CSD_FLAG_LOCK   = 0x01,
-   CSD_FLAG_SYNCHRONOUS= 0x02,
-};
-
 struct call_function_data {
call_single_data_t  __percpu *csd;
cpumask_var_t   cpumask;
@@ -103,11 +98,6 @@ void __init call_function_init(void)
  * previous function call. For multi-cpu calls its even more interesting
  * as we'll have to ensure no other cpu is observing our csd.
  */
-static __always_inline void csd_lock_wait(call_single_data_t *csd)
-{
-   smp_cond_load_acquire(>flags, !(VAL & CSD_FLAG_LOCK));
-}
-
 static __always_inline void csd_lock(call_single_data_t *csd)
 {
csd_lock_wait(csd);

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Rafael J. Wysocki

On Thursday, February 1, 2018 10:11:04 AM CET Peter Zijlstra wrote:
> On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > 
> > > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > > > "Desired_Performance" fields which can be used to give explicit
> > > > > frequency hints. And we really _should_ be doing that.
> > > > > 
> > > > > Because, esp. in this scenario; a task migrating; the hardware really
> > > > > can't do anything sensible, whereas the OS _knows_.
> > > > 
> > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > 
> > > That just means we might need to throttle writing to it, like it already
> > > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> > > cheap msr?)
> > > 
> > > Not touching it at all seems silly.
> > 
> > OK
> > 
> > So what field precisely would you touch?  "desired"?  If so, does that 
> > actually
> > guarantee anything to happen?
> 
> No idea, desired would be the one I would start with, it matches with
> the intent here. But I've no idea what our current HWP implementation
> actually does with it.
> 
> > > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > > You should _never_ do things like:
> > > 
> > >   rdmsr_on_cpu()
> > >   /* frob value */
> > >   wrmsr_on_cpu()
> > > 
> > > That's insane.
> > 
> > I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> > at all?
> 
> Yes, too many synchronous IPIs, which themselves are typically already
> more expensive than the MSR access.

We could do all of the updates in one IPI (as Srinivas said), but it would be
more code, and custom code for that matter.

Is this really worth it for a slow path like this one?

> At one point I looked to getting rid of the *msr_on_cpu() crud entirely,
> but there's just too much users out there I didn't feel like touching.
> 
> If you really care you can do async IPIs and do a custom serialization
> that only waits when you do back-to-back things, which should be fairly
> uncommon I'd think.

In this particular case we don't want to return to user space before the
MSR is actually written with the new value.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Rafael J. Wysocki

On Thursday, February 1, 2018 10:11:04 AM CET Peter Zijlstra wrote:
> On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > 
> > > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > > > "Desired_Performance" fields which can be used to give explicit
> > > > > frequency hints. And we really _should_ be doing that.
> > > > > 
> > > > > Because, esp. in this scenario; a task migrating; the hardware really
> > > > > can't do anything sensible, whereas the OS _knows_.
> > > > 
> > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > 
> > > That just means we might need to throttle writing to it, like it already
> > > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> > > cheap msr?)
> > > 
> > > Not touching it at all seems silly.
> > 
> > OK
> > 
> > So what field precisely would you touch?  "desired"?  If so, does that 
> > actually
> > guarantee anything to happen?
> 
> No idea, desired would be the one I would start with, it matches with
> the intent here. But I've no idea what our current HWP implementation
> actually does with it.
> 
> > > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > > You should _never_ do things like:
> > > 
> > >   rdmsr_on_cpu()
> > >   /* frob value */
> > >   wrmsr_on_cpu()
> > > 
> > > That's insane.
> > 
> > I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> > at all?
> 
> Yes, too many synchronous IPIs, which themselves are typically already
> more expensive than the MSR access.

We could do all of the updates in one IPI (as Srinivas said), but it would be
more code, and custom code for that matter.

Is this really worth it for a slow path like this one?

> At one point I looked to getting rid of the *msr_on_cpu() crud entirely,
> but there's just too much users out there I didn't feel like touching.
> 
> If you really care you can do async IPIs and do a custom serialization
> that only waits when you do back-to-back things, which should be fairly
> uncommon I'd think.

In this particular case we don't want to return to user space before the
MSR is actually written with the new value.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Rafael J. Wysocki

On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada wrote:
> On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > > 
> > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > wrote:
> > > > 
> > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > 
> > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > wrote:
> > > > > 
> > > > > > 
> > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > "Maximum_Performance" and
> > > > > > "Desired_Performance" fields which can be used to give
> > > > > > explicit
> > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > 
> > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > hardware really
> > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > That just means we might need to throttle writing to it, like it
> > > > already
> > > > does for the regular pstate (PERF_CTRL) msr in any case (also, is
> > > > that a
> > > > cheap msr?)
> > > > 
> > > > Not touching it at all seems silly.
> > > OK
> > > 
> > > So what field precisely would you touch?  "desired"?  If so, does
> > > that actually
> > > guarantee anything to happen?
> > No idea, desired would be the one I would start with, it matches with
> > the intent here. But I've no idea what our current HWP implementation
> > actually does with it.
> Desired !=0 will disable HWP autonomous mode of frequency selection.

But I don't think it will just run at "desired" then, will it?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-02 Thread Rafael J. Wysocki

On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada wrote:
> On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > > 
> > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > wrote:
> > > > 
> > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > 
> > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > > wrote:
> > > > > 
> > > > > > 
> > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > "Maximum_Performance" and
> > > > > > "Desired_Performance" fields which can be used to give
> > > > > > explicit
> > > > > > frequency hints. And we really _should_ be doing that.
> > > > > > 
> > > > > > Because, esp. in this scenario; a task migrating; the
> > > > > > hardware really
> > > > > > can't do anything sensible, whereas the OS _knows_.
> > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > That just means we might need to throttle writing to it, like it
> > > > already
> > > > does for the regular pstate (PERF_CTRL) msr in any case (also, is
> > > > that a
> > > > cheap msr?)
> > > > 
> > > > Not touching it at all seems silly.
> > > OK
> > > 
> > > So what field precisely would you touch?  "desired"?  If so, does
> > > that actually
> > > guarantee anything to happen?
> > No idea, desired would be the one I would start with, it matches with
> > the intent here. But I've no idea what our current HWP implementation
> > actually does with it.
> Desired !=0 will disable HWP autonomous mode of frequency selection.

But I don't think it will just run at "desired" then, will it?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-01 Thread Srinivas Pandruvada

On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > 
> > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > wrote:
> > > 
> > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > wrote:
> > > > 
> > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > wrote:
> > > > 
> > > > > 
> > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > "Maximum_Performance" and
> > > > > "Desired_Performance" fields which can be used to give
> > > > > explicit
> > > > > frequency hints. And we really _should_ be doing that.
> > > > > 
> > > > > Because, esp. in this scenario; a task migrating; the
> > > > > hardware really
> > > > > can't do anything sensible, whereas the OS _knows_.
> > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > That just means we might need to throttle writing to it, like it
> > > already
> > > does for the regular pstate (PERF_CTRL) msr in any case (also, is
> > > that a
> > > cheap msr?)
> > > 
> > > Not touching it at all seems silly.
> > OK
> > 
> > So what field precisely would you touch?  "desired"?  If so, does
> > that actually
> > guarantee anything to happen?
> No idea, desired would be the one I would start with, it matches with
> the intent here. But I've no idea what our current HWP implementation
> actually does with it.
Desired !=0 will disable HWP autonomous mode of frequency selection.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-01 Thread Srinivas Pandruvada

On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> > 
> > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > wrote:
> > > 
> > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > wrote:
> > > > 
> > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra
> > > > wrote:
> > > > 
> > > > > 
> > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > "Maximum_Performance" and
> > > > > "Desired_Performance" fields which can be used to give
> > > > > explicit
> > > > > frequency hints. And we really _should_ be doing that.
> > > > > 
> > > > > Because, esp. in this scenario; a task migrating; the
> > > > > hardware really
> > > > > can't do anything sensible, whereas the OS _knows_.
> > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > That just means we might need to throttle writing to it, like it
> > > already
> > > does for the regular pstate (PERF_CTRL) msr in any case (also, is
> > > that a
> > > cheap msr?)
> > > 
> > > Not touching it at all seems silly.
> > OK
> > 
> > So what field precisely would you touch?  "desired"?  If so, does
> > that actually
> > guarantee anything to happen?
> No idea, desired would be the one I would start with, it matches with
> the intent here. But I've no idea what our current HWP implementation
> actually does with it.
Desired !=0 will disable HWP autonomous mode of frequency selection.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-01 Thread Peter Zijlstra

On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > 
> > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > > "Desired_Performance" fields which can be used to give explicit
> > > > frequency hints. And we really _should_ be doing that.
> > > > 
> > > > Because, esp. in this scenario; a task migrating; the hardware really
> > > > can't do anything sensible, whereas the OS _knows_.
> > > 
> > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > 
> > That just means we might need to throttle writing to it, like it already
> > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> > cheap msr?)
> > 
> > Not touching it at all seems silly.
> 
> OK
> 
> So what field precisely would you touch?  "desired"?  If so, does that 
> actually
> guarantee anything to happen?

No idea, desired would be the one I would start with, it matches with
the intent here. But I've no idea what our current HWP implementation
actually does with it.

> > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > You should _never_ do things like:
> > 
> >   rdmsr_on_cpu()
> >   /* frob value */
> >   wrmsr_on_cpu()
> > 
> > That's insane.
> 
> I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> at all?

Yes, too many synchronous IPIs, which themselves are typically already
more expensive than the MSR access.

At one point I looked to getting rid of the *msr_on_cpu() crud entirely,
but there's just too much users out there I didn't feel like touching.

If you really care you can do async IPIs and do a custom serialization
that only waits when you do back-to-back things, which should be fairly
uncommon I'd think.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-01 Thread Peter Zijlstra

On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote:
> On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > 
> > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > > "Desired_Performance" fields which can be used to give explicit
> > > > frequency hints. And we really _should_ be doing that.
> > > > 
> > > > Because, esp. in this scenario; a task migrating; the hardware really
> > > > can't do anything sensible, whereas the OS _knows_.
> > > 
> > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > 
> > That just means we might need to throttle writing to it, like it already
> > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> > cheap msr?)
> > 
> > Not touching it at all seems silly.
> 
> OK
> 
> So what field precisely would you touch?  "desired"?  If so, does that 
> actually
> guarantee anything to happen?

No idea, desired would be the one I would start with, it matches with
the intent here. But I've no idea what our current HWP implementation
actually does with it.

> > But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> > You should _never_ do things like:
> > 
> >   rdmsr_on_cpu()
> >   /* frob value */
> >   wrmsr_on_cpu()
> > 
> > That's insane.
> 
> I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
> at all?

Yes, too many synchronous IPIs, which themselves are typically already
more expensive than the MSR access.

At one point I looked to getting rid of the *msr_on_cpu() crud entirely,
but there's just too much users out there I didn't feel like touching.

If you really care you can do async IPIs and do a custom serialization
that only waits when you do back-to-back things, which should be fairly
uncommon I'd think.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-01 Thread Peter Zijlstra

On Wed, Jan 31, 2018 at 09:44:18AM -0800, Srinivas Pandruvada wrote:

> Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
> much slower compared to PERF_CTL (as high as 10:1). 

Still much better than what other architectures have to deal with ;-)

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-02-01 Thread Peter Zijlstra

On Wed, Jan 31, 2018 at 09:44:18AM -0800, Srinivas Pandruvada wrote:

> Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
> much slower compared to PERF_CTL (as high as 10:1). 

Still much better than what other architectures have to deal with ;-)

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Rafael J. Wysocki

On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> 
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
> 
> Not touching it at all seems silly.

OK

So what field precisely would you touch?  "desired"?  If so, does that actually
guarantee anything to happen?

> But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> You should _never_ do things like:
> 
>   rdmsr_on_cpu()
>   /* frob value */
>   wrmsr_on_cpu()
> 
> That's insane.

I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
at all?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Rafael J. Wysocki

On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> 
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
> 
> Not touching it at all seems silly.

OK

So what field precisely would you touch?  "desired"?  If so, does that actually
guarantee anything to happen?

> But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> You should _never_ do things like:
> 
>   rdmsr_on_cpu()
>   /* frob value */
>   wrmsr_on_cpu()
> 
> That's insane.

I guess you mean it does too many IPIs?  Or that it shouldn't do any IPIs
at all?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Srinivas Pandruvada

On Wed, 2018-01-31 at 11:17 +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance"
> > > and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware
> > > really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it
> already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is
> that a
> cheap msr?)
Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
much slower compared to PERF_CTL (as high as 10:1). 

> 
> Not touching it at all seems silly.
> 
> But now that you made me look, intel_pstate_hwp_set() is horrible
> crap.
> You should _never_ do things like:
> 
>   rdmsr_on_cpu()
>   /* frob value */
>   wrmsr_on_cpu()
> 
> That's insane.

Since the cpufreq callback is not guaranteed to be called on the same
CPU, we have to use rd/wrmsr_on_cpu().
But we can use smp_call_function_single() and optimize this.
This function is called only during init, when usermode changes
frequency limits and from thermal, so very few times.

Thanks,
Srinivas

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Srinivas Pandruvada

On Wed, 2018-01-31 at 11:17 +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance"
> > > and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware
> > > really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it
> already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is
> that a
> cheap msr?)
Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
much slower compared to PERF_CTL (as high as 10:1). 

> 
> Not touching it at all seems silly.
> 
> But now that you made me look, intel_pstate_hwp_set() is horrible
> crap.
> You should _never_ do things like:
> 
>   rdmsr_on_cpu()
>   /* frob value */
>   wrmsr_on_cpu()
> 
> That's insane.

Since the cpufreq callback is not guaranteed to be called on the same
CPU, we have to use rd/wrmsr_on_cpu().
But we can use smp_call_function_single() and optimize this.
This function is called only during init, when usermode changes
frequency limits and from thermal, so very few times.

Thanks,
Srinivas

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Mel Gorman

On Wed, Jan 31, 2018 at 11:17:10AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> 
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
> 
> Not touching it at all seems silly.
> 

Note that even if we do such programming, it would still be desirable to
minimise the number of times we have to reprogram it so the series as it
stands would still be useful.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Mel Gorman

On Wed, Jan 31, 2018 at 11:17:10AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> 
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > > 
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> 
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
> 
> Not touching it at all seems silly.
> 

Note that even if we do such programming, it would still be desirable to
minimise the number of times we have to reprogram it so the series as it
stands would still be useful.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Peter Zijlstra

On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:

> > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > "Desired_Performance" fields which can be used to give explicit
> > frequency hints. And we really _should_ be doing that.
> > 
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
> 
> But IA32_HWP_REQUEST is not a cheap MSR to write to.

That just means we might need to throttle writing to it, like it already
does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
cheap msr?)

Not touching it at all seems silly.

But now that you made me look, intel_pstate_hwp_set() is horrible crap.
You should _never_ do things like:

  rdmsr_on_cpu()
  /* frob value */
  wrmsr_on_cpu()

That's insane.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Peter Zijlstra

On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:

> > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > "Desired_Performance" fields which can be used to give explicit
> > frequency hints. And we really _should_ be doing that.
> > 
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
> 
> But IA32_HWP_REQUEST is not a cheap MSR to write to.

That just means we might need to throttle writing to it, like it already
does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
cheap msr?)

Not touching it at all seems silly.

But now that you made me look, intel_pstate_hwp_set() is horrible crap.
You should _never_ do things like:

  rdmsr_on_cpu()
  /* frob value */
  wrmsr_on_cpu()

That's insane.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Rafael J. Wysocki

On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> > > 
> > 
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
> 
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
> 
> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.

But IA32_HWP_REQUEST is not a cheap MSR to write to.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-31 Thread Rafael J. Wysocki

On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> > > 
> > 
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
> 
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
> 
> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.

But IA32_HWP_REQUEST is not a cheap MSR to write to.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 02:40:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 01:25:27PM +, Mel Gorman wrote:
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > Potentially yes. One option without HWP would be to track utilisation
> > for a task or artifically boost it for a short period after migration so
> > a higher p-state is potentially selected. With HWP, a hint would have to
> > be given to the hardware to try select a higher frequency but I've no idea
> > how expensive that is or how it would behave on different implementations
> > of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
> > configuration correct.
> 
> We have much of this infrastructure and have been working on improving
> it [1]. We already track per task utilization and feed it into a cpufreq
> governor (schedutil).
> 

Which is potentially great without HWP but ...

> > One advantage of using fewer cores is that it should
> > work regardless of cpufreq driver.
> 
> Sure, and I started out by saying this patch isn't necessarily bad; but
> I think our 'use' [2] of HWP _is_.
> 

This is where we get caught -- HWP means we had full responsibility to
the hardware with limited feedback options. As far as I can tell, any
recent intel CPU is going to have HWP and we've enabled it by default if
available since 4.6. 

Even if further hints can be given from the OS, it does not necessarily
mean this patch is unhelpful. On low utilisation, it's still somewhat
sensible to use the minimum number of CPUs necessary (less cache traffic,
higher p-states, maybe beneficial to turbo boost etc).

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 02:40:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 01:25:27PM +, Mel Gorman wrote:
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> > 
> > Potentially yes. One option without HWP would be to track utilisation
> > for a task or artifically boost it for a short period after migration so
> > a higher p-state is potentially selected. With HWP, a hint would have to
> > be given to the hardware to try select a higher frequency but I've no idea
> > how expensive that is or how it would behave on different implementations
> > of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
> > configuration correct.
> 
> We have much of this infrastructure and have been working on improving
> it [1]. We already track per task utilization and feed it into a cpufreq
> governor (schedutil).
> 

Which is potentially great without HWP but ...

> > One advantage of using fewer cores is that it should
> > work regardless of cpufreq driver.
> 
> Sure, and I started out by saying this patch isn't necessarily bad; but
> I think our 'use' [2] of HWP _is_.
> 

This is where we get caught -- HWP means we had full responsibility to
the hardware with limited feedback options. As far as I can tell, any
recent intel CPU is going to have HWP and we've enabled it by default if
available since 4.6. 

Even if further hints can be given from the OS, it does not necessarily
mean this patch is unhelpful. On low utilisation, it's still somewhat
sensible to use the minimum number of CPUs necessary (less cache traffic,
higher p-states, maybe beneficial to turbo boost etc).

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 01:25:27PM +, Mel Gorman wrote:
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
> 
> Potentially yes. One option without HWP would be to track utilisation
> for a task or artifically boost it for a short period after migration so
> a higher p-state is potentially selected. With HWP, a hint would have to
> be given to the hardware to try select a higher frequency but I've no idea
> how expensive that is or how it would behave on different implementations
> of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
> configuration correct.

We have much of this infrastructure and have been working on improving
it [1]. We already track per task utilization and feed it into a cpufreq
governor (schedutil).

> One advantage of using fewer cores is that it should
> work regardless of cpufreq driver.

Sure, and I started out by saying this patch isn't necessarily bad; but
I think our 'use' [2] of HWP _is_.

[1] https://lkml.kernel.org/r/20180123180847.4477-1-patrick.bell...@arm.com
[2] IIRC we basically don't do _anything_ when HWP.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 01:25:27PM +, Mel Gorman wrote:
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
> 
> Potentially yes. One option without HWP would be to track utilisation
> for a task or artifically boost it for a short period after migration so
> a higher p-state is potentially selected. With HWP, a hint would have to
> be given to the hardware to try select a higher frequency but I've no idea
> how expensive that is or how it would behave on different implementations
> of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
> configuration correct.

We have much of this infrastructure and have been working on improving
it [1]. We already track per task utilization and feed it into a cpufreq
governor (schedutil).

> One advantage of using fewer cores is that it should
> work regardless of cpufreq driver.

Sure, and I started out by saying this patch isn't necessarily bad; but
I think our 'use' [2] of HWP _is_.

[1] https://lkml.kernel.org/r/20180123180847.4477-1-patrick.bell...@arm.com
[2] IIRC we basically don't do _anything_ when HWP.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mike Galbraith

On Tue, 2018-01-30 at 14:25 +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> > On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> > > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > > 
> > > > Mel, what hardware are you testing this on?
> > > 
> > > The primary one was a single socket skylake machine with 8 threads (HT
> > > enabled).
> > 
> > I took it for a spin in a 2 x Gold 6152 box.
> 
> Can you translate that marketing speak for a simpleton like me? That's
> what we should've called SKL-EP, right?

Um um.. danged if I know.  It's SKL, w. 22 cores each + SMT.

-Mike

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mike Galbraith

On Tue, 2018-01-30 at 14:25 +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> > On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> > > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > > 
> > > > Mel, what hardware are you testing this on?
> > > 
> > > The primary one was a single socket skylake machine with 8 threads (HT
> > > enabled).
> > 
> > I took it for a spin in a 2 x Gold 6152 box.
> 
> Can you translate that marketing speak for a simpleton like me? That's
> what we should've called SKL-EP, right?

Um um.. danged if I know.  It's SKL, w. 22 cores each + SMT.

-Mike

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > 
> > > Mel, what hardware are you testing this on?
> > 
> > The primary one was a single socket skylake machine with 8 threads (HT
> > enabled).
> 
> I took it for a spin in a 2 x Gold 6152 box.

Can you translate that marketing speak for a simpleton like me? That's
what we should've called SKL-EP, right?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > 
> > > Mel, what hardware are you testing this on?
> > 
> > The primary one was a single socket skylake machine with 8 threads (HT
> > enabled).
> 
> I took it for a spin in a 2 x Gold 6152 box.

Can you translate that marketing speak for a simpleton like me? That's
what we should've called SKL-EP, right?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 02:15:31PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> > > 
> > 
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
> 
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
> 

They can be although these are usually set by the bios or setup early
in boot and then left alone. It's not clear how or if these should be
tuned on the fly or what variables would drive dynamic tuning. The data
collected would still be per-cpu so if all CPUs have low utilisation,
the decisions will still be poor except maybe for things like IO boosting.

> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.
> 

Potentially yes. One option without HWP would be to track utilisation
for a task or artifically boost it for a short period after migration so
a higher p-state is potentially selected. With HWP, a hint would have to
be given to the hardware to try select a higher frequency but I've no idea
how expensive that is or how it would behave on different implementations
of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
configuration correct. One advantage of using fewer cores is that it should
work regardless of cpufreq driver.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 02:15:31PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> > > 
> > 
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
> 
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
> 

They can be although these are usually set by the bios or setup early
in boot and then left alone. It's not clear how or if these should be
tuned on the fly or what variables would drive dynamic tuning. The data
collected would still be per-cpu so if all CPUs have low utilisation,
the decisions will still be poor except maybe for things like IO boosting.

> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.
> 

Potentially yes. One option without HWP would be to track utilisation
for a task or artifically boost it for a short period after migration so
a higher p-state is potentially selected. With HWP, a hint would have to
be given to the hardware to try select a higher frequency but I've no idea
how expensive that is or how it would behave on different implementations
of HWP. It may also be a game of whack-a-mole trying to get every cpufreq
configuration correct. One advantage of using fewer cores is that it should
work regardless of cpufreq driver.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 02:06:06PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > > The results can be less dramatic on NUMA where automatic balancing 
> > > interferes
> > > with the test. It's also known that network benchmarks running on 
> > > localhost
> > > also benefit quite a bit from this patch (roughly 10% on netperf RR for 
> > > UDP
> > > and TCP depending on the machine). Hackbench also seens small improvements
> > > (6-11% depending on machine and thread count). The facebook schbench was 
> > > also
> > > tested but in most cases showed little or no different to wakeup 
> > > latencies.
> > 
> > What cpufreq setting were you using for these tests?
> 
> I cannot measure any hackbench variation one way or the other with these
> patches using 'performance' mode. So I'll assume you've been running
> things with HWP or something.

Correct, I am not using the performance governor as it's not the default and
not necessarily desirable as a default.  HWP in some cases is enabled but
even machines using the intel_pstate driver on machines without HWP benefit.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 02:06:06PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > > The results can be less dramatic on NUMA where automatic balancing 
> > > interferes
> > > with the test. It's also known that network benchmarks running on 
> > > localhost
> > > also benefit quite a bit from this patch (roughly 10% on netperf RR for 
> > > UDP
> > > and TCP depending on the machine). Hackbench also seens small improvements
> > > (6-11% depending on machine and thread count). The facebook schbench was 
> > > also
> > > tested but in most cases showed little or no different to wakeup 
> > > latencies.
> > 
> > What cpufreq setting were you using for these tests?
> 
> I cannot measure any hackbench variation one way or the other with these
> patches using 'performance' mode. So I'll assume you've been running
> things with HWP or something.

Correct, I am not using the performance governor as it's not the default and
not necessarily desirable as a default.  HWP in some cases is enabled but
even machines using the intel_pstate driver on machines without HWP benefit.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mike Galbraith

On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > Mel, what hardware are you testing this on?
> 
> The primary one was a single socket skylake machine with 8 threads (HT
> enabled).

I took it for a spin in a 2 x Gold 6152 box.

-Mike

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mike Galbraith

On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> 
> > Mel, what hardware are you testing this on?
> 
> The primary one was a single socket skylake machine with 8 threads (HT
> enabled).

I took it for a spin in a 2 x Gold 6152 box.

-Mike

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:

> > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > still give it hints, and it looks like we're not doing that.
> > 
> 
> I'm not sure if HWP can fix it because of the per-cpu nature of its
> decisions. I believe it can only give the most basic of hints to hardware
> like an energy performance profile or bias (EPP and EPB respectively).
> Of course HWP can be turned off but not many people can detect that it's
> an appropriate decision, or even desirable, and there is always the caveat
> that disabling it increases the system CPU footprint.

IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
"Desired_Performance" fields which can be used to give explicit
frequency hints. And we really _should_ be doing that.

Because, esp. in this scenario; a task migrating; the hardware really
can't do anything sensible, whereas the OS _knows_.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:

> > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > still give it hints, and it looks like we're not doing that.
> > 
> 
> I'm not sure if HWP can fix it because of the per-cpu nature of its
> decisions. I believe it can only give the most basic of hints to hardware
> like an energy performance profile or bias (EPP and EPB respectively).
> Of course HWP can be turned off but not many people can detect that it's
> an appropriate decision, or even desirable, and there is always the caveat
> that disabling it increases the system CPU footprint.

IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
"Desired_Performance" fields which can be used to give explicit
frequency hints. And we really _should_ be doing that.

Because, esp. in this scenario; a task migrating; the hardware really
can't do anything sensible, whereas the OS _knows_.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > The results can be less dramatic on NUMA where automatic balancing 
> > interferes
> > with the test. It's also known that network benchmarks running on localhost
> > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > and TCP depending on the machine). Hackbench also seens small improvements
> > (6-11% depending on machine and thread count). The facebook schbench was 
> > also
> > tested but in most cases showed little or no different to wakeup latencies.
> 
> What cpufreq setting were you using for these tests?

I cannot measure any hackbench variation one way or the other with these
patches using 'performance' mode. So I'll assume you've been running
things with HWP or something.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > The results can be less dramatic on NUMA where automatic balancing 
> > interferes
> > with the test. It's also known that network benchmarks running on localhost
> > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > and TCP depending on the machine). Hackbench also seens small improvements
> > (6-11% depending on machine and thread count). The facebook schbench was 
> > also
> > tested but in most cases showed little or no different to wakeup latencies.
> 
> What cpufreq setting were you using for these tests?

I cannot measure any hackbench variation one way or the other with these
patches using 'performance' mode. So I'll assume you've been running
things with HWP or something.

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > The results can be less dramatic on NUMA where automatic balancing 
> > interferes
> > with the test. It's also known that network benchmarks running on localhost
> > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > and TCP depending on the machine). Hackbench also seens small improvements
> > (6-11% depending on machine and thread count). The facebook schbench was 
> > also
> > tested but in most cases showed little or no different to wakeup latencies.
> 
> What cpufreq setting were you using for these tests?

Default powersave settings in general -- intel_pstate driver is the one
used most often but HWP was not always available. By and large, no special
tuning was applied. Two sets of tests were run, one with turbostat and perf
keeping track of p-states and migrations and one without to make sure the
measured increase was real. The irony when hitting cpufreq-related problems
is that monitoring can mask the issue by increasing overall utilisation.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > The results can be less dramatic on NUMA where automatic balancing 
> > interferes
> > with the test. It's also known that network benchmarks running on localhost
> > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> > and TCP depending on the machine). Hackbench also seens small improvements
> > (6-11% depending on machine and thread count). The facebook schbench was 
> > also
> > tested but in most cases showed little or no different to wakeup latencies.
> 
> What cpufreq setting were you using for these tests?

Default powersave settings in general -- intel_pstate driver is the one
used most often but HWP was not always available. By and large, no special
tuning was applied. Two sets of tests were run, one with turbostat and perf
keeping track of p-states and migrations and one without to make sure the
measured increase was real. The irony when hitting cpufreq-related problems
is that monitoring can mask the issue by increasing overall utilisation.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
> > Rewrite and improve select_idle_siblings()") replaced a domain iteration
> > with a search that broadly speaking does a wrapped walk of the scheduler
> > domain sharing a last-level-cache. While this had a number of improvements,
> > one consequence is that two tasks that share a waker/wakee relationship push
> > each other around a socket. Even though two tasks may be active, all cores
> > are evenly used. This is great from a search perspective and spreads a load
> > across individual cores but it has adverse consequences for cpufreq. As each
> > CPU has relatively low utilisation, cpufreq may decide the utilisation is
> > too low to used a higher P-state and overall computation throughput suffers.
> 
> > While individual cpufreq and cpuidle drivers may compensate by artifically
> > boosting P-state (at c0) or avoiding lower C-states (during idle), it does
> > not help if hardware-based cpufreq (e.g. HWP) is used.
> 
> Not saying this patch is bad; but Rafael / Srinivas we really should do
> better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> still give it hints, and it looks like we're not doing that.
> 

I'm not sure if HWP can fix it because of the per-cpu nature of its
decisions. I believe it can only give the most basic of hints to hardware
like an energy performance profile or bias (EPP and EPB respectively).
Of course HWP can be turned off but not many people can detect that it's
an appropriate decision, or even desirable, and there is always the caveat
that disabling it increases the system CPU footprint.

> Mel, what hardware are you testing this on?

The primary one was a single socket skylake machine with 8 threads (HT
enabled). However, 11 machines were used in total across multiple
generations to reduce the chance of a regression slipping in that was
machine-specific.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Mel Gorman

On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
> > Rewrite and improve select_idle_siblings()") replaced a domain iteration
> > with a search that broadly speaking does a wrapped walk of the scheduler
> > domain sharing a last-level-cache. While this had a number of improvements,
> > one consequence is that two tasks that share a waker/wakee relationship push
> > each other around a socket. Even though two tasks may be active, all cores
> > are evenly used. This is great from a search perspective and spreads a load
> > across individual cores but it has adverse consequences for cpufreq. As each
> > CPU has relatively low utilisation, cpufreq may decide the utilisation is
> > too low to used a higher P-state and overall computation throughput suffers.
> 
> > While individual cpufreq and cpuidle drivers may compensate by artifically
> > boosting P-state (at c0) or avoiding lower C-states (during idle), it does
> > not help if hardware-based cpufreq (e.g. HWP) is used.
> 
> Not saying this patch is bad; but Rafael / Srinivas we really should do
> better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> still give it hints, and it looks like we're not doing that.
> 

I'm not sure if HWP can fix it because of the per-cpu nature of its
decisions. I believe it can only give the most basic of hints to hardware
like an energy performance profile or bias (EPP and EPB respectively).
Of course HWP can be turned off but not many people can detect that it's
an appropriate decision, or even desirable, and there is always the caveat
that disabling it increases the system CPU footprint.

> Mel, what hardware are you testing this on?

The primary one was a single socket skylake machine with 8 threads (HT
enabled). However, 11 machines were used in total across multiple
generations to reduce the chance of a regression slipping in that was
machine-specific.

-- 
Mel Gorman
SUSE Labs

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> The results can be less dramatic on NUMA where automatic balancing interferes
> with the test. It's also known that network benchmarks running on localhost
> also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> and TCP depending on the machine). Hackbench also seens small improvements
> (6-11% depending on machine and thread count). The facebook schbench was also
> tested but in most cases showed little or no different to wakeup latencies.

What cpufreq setting were you using for these tests?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> The results can be less dramatic on NUMA where automatic balancing interferes
> with the test. It's also known that network benchmarks running on localhost
> also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
> and TCP depending on the machine). Hackbench also seens small improvements
> (6-11% depending on machine and thread count). The facebook schbench was also
> tested but in most cases showed little or no different to wakeup latencies.

What cpufreq setting were you using for these tests?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
> Rewrite and improve select_idle_siblings()") replaced a domain iteration
> with a search that broadly speaking does a wrapped walk of the scheduler
> domain sharing a last-level-cache. While this had a number of improvements,
> one consequence is that two tasks that share a waker/wakee relationship push
> each other around a socket. Even though two tasks may be active, all cores
> are evenly used. This is great from a search perspective and spreads a load
> across individual cores but it has adverse consequences for cpufreq. As each
> CPU has relatively low utilisation, cpufreq may decide the utilisation is
> too low to used a higher P-state and overall computation throughput suffers.

> While individual cpufreq and cpuidle drivers may compensate by artifically
> boosting P-state (at c0) or avoiding lower C-states (during idle), it does
> not help if hardware-based cpufreq (e.g. HWP) is used.

Not saying this patch is bad; but Rafael / Srinivas we really should do
better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
still give it hints, and it looks like we're not doing that.

Mel, what hardware are you testing this on?

Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS

2018-01-30 Thread Peter Zijlstra

On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core:
> Rewrite and improve select_idle_siblings()") replaced a domain iteration
> with a search that broadly speaking does a wrapped walk of the scheduler
> domain sharing a last-level-cache. While this had a number of improvements,
> one consequence is that two tasks that share a waker/wakee relationship push
> each other around a socket. Even though two tasks may be active, all cores
> are evenly used. This is great from a search perspective and spreads a load
> across individual cores but it has adverse consequences for cpufreq. As each
> CPU has relatively low utilisation, cpufreq may decide the utilisation is
> too low to used a higher P-state and overall computation throughput suffers.

> While individual cpufreq and cpuidle drivers may compensate by artifically
> boosting P-state (at c0) or avoiding lower C-states (during idle), it does
> not help if hardware-based cpufreq (e.g. HWP) is used.

Not saying this patch is bad; but Rafael / Srinivas we really should do
better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
still give it hints, and it looks like we're not doing that.

Mel, what hardware are you testing this on?

78 matches

Mail list logo