Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Mon, Feb 05, 2018 at 09:04:25AM -0800, Srinivas Pandruvada wrote:
> > I know EPP is a range, defaults from the BIOS usually appear to be
> > 6 or 7 but I didn't do much experimentation to see if there is
> > another value that works better. Even if there is, the default may
> > need to change as not many people even know what EPP is or how it
> > should be tuned.
>
> I think you are talking about EPB, not EPP, because of the ranges you
> mentioned here. EPP is a value from 0 to 255. EPP is part of the
> HWP_REQUEST MSR. EPB with HWP is used only in Broadwell server. I
> think you are using Skylake here.

You're right in that I mixed up EPB (0-15 IIRC) and EPP, which has a
different range whose exact meaning and semantics are not super-clear
to me. The machine in question is Skylake, but for the purposes of the
current discussion the issue is the same. There appears to be a lack of
decent hints given to the hardware, based on knowledge of migrations or
short periods of idleness, that the processor may be very busy for very
short bursts that are important even if overall utilisation is low.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Mon, 2018-02-05 at 11:10 +, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > > Sure, but the lack of detection when tasks are low utilisation
> > > but still latency/throughput sensitive is problematic. Users
> > > shouldn't have to know they need to disable HWP or set the
> > > performance governor out of the box. It's only going to get worse
> > > as sockets get larger.
> >
> > I am not saying that we shouldn't do anything. Can you give me some
> > workloads which you care the most about?
>
> The proprietary workloads I'm aware of are useless to the discussion
> as they cannot be trivially reproduced and are typically only
> available under NDA. However, hints can be gotten by looking at the
> number of cases where recommended tunings limit C-states, set the
> performance governor, alter the intel_pstate setpoint (if not HWP)
> etc.
>
> For the purposes of illustration, dbench at low thread counts does a
> reasonable job even though it's not that interesting a workload in
> general. With ext4 in particular, the journalling thread interactions
> bounce tasks around the machine and the short sleeps for IO both
> combine to give relatively low utilisation on individual CPUs. It's
> less pronounced on xfs as it bounces less due to using kworkers
> instead of kthreads.
>
> > > > There are totally different ways HWP is handled in clients and
> > > > servers. If you set desired, all the heuristics they collected
> > > > will be dumped, so they suggest not setting desired when you
> > > > are in autonomous mode. If we really want a boost, set the EPP.
> > > > We know that EPP makes lots of measurable difference.
> > >
> > > Sure, boosting EPP makes a difference -- it's essentially what
> > > the performance governor does and I know that can be done by a
> > > user, but it's still basically a cop-out. Default performance for
> > > low utilisation or lightly loaded machines is poor. Maybe it
> > > should be set based on the ACPI preferred profile, but that
> > > information is not always available. *Some* sort of hint about
> > > new migrations or tasks waking from IO would be desirable.
> >
> > EPP is a range, not a single value. So you don't need to make EPP=0
> > as the performance governor does. PeterZ gave me some scheduler
> > changes to experiment with, which can be used as a hint to play
> > with EPP.
>
> I know EPP is a range, defaults from the BIOS usually appear to be 6
> or 7 but I didn't do much experimentation to see if there is another
> value that works better. Even if there is, the default may need to
> change as not many people even know what EPP is or how it should be
> tuned.

I think you are talking about EPB, not EPP, because of the ranges you
mentioned here. EPP is a value from 0 to 255. EPP is part of the
HWP_REQUEST MSR. EPB with HWP is used only in Broadwell server. I think
you are using Skylake here.

Thanks,
Srinivas
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 12:01:37PM -0800, Srinivas Pandruvada wrote:
> > Sure, but the lack of detection when tasks are low utilisation but
> > still latency/throughput sensitive is problematic. Users shouldn't
> > have to know they need to disable HWP or set the performance
> > governor out of the box. It's only going to get worse as sockets
> > get larger.
>
> I am not saying that we shouldn't do anything. Can you give me some
> workloads which you care the most about?

The proprietary workloads I'm aware of are useless to the discussion as
they cannot be trivially reproduced and are typically only available
under NDA. However, hints can be gotten by looking at the number of
cases where recommended tunings limit C-states, set the performance
governor, alter the intel_pstate setpoint (if not HWP) etc.

For the purposes of illustration, dbench at low thread counts does a
reasonable job even though it's not that interesting a workload in
general. With ext4 in particular, the journalling thread interactions
bounce tasks around the machine and the short sleeps for IO both
combine to give relatively low utilisation on individual CPUs. It's
less pronounced on xfs as it bounces less due to using kworkers instead
of kthreads.

> > > There are totally different ways HWP is handled in clients and
> > > servers. If you set desired, all the heuristics they collected
> > > will be dumped, so they suggest not setting desired when you are
> > > in autonomous mode. If we really want a boost, set the EPP. We
> > > know that EPP makes lots of measurable difference.
> >
> > Sure, boosting EPP makes a difference -- it's essentially what the
> > performance governor does and I know that can be done by a user,
> > but it's still basically a cop-out. Default performance for low
> > utilisation or lightly loaded machines is poor. Maybe it should be
> > set based on the ACPI preferred profile, but that information is
> > not always available. *Some* sort of hint about new migrations or
> > tasks waking from IO would be desirable.
>
> EPP is a range, not a single value. So you don't need to make EPP=0
> as the performance governor does. PeterZ gave me some scheduler
> changes to experiment with, which can be used as a hint to play with
> EPP.

I know EPP is a range, defaults from the BIOS usually appear to be 6 or
7 but I didn't do much experimentation to see if there is another value
that works better. Even if there is, the default may need to change as
not many people even know what EPP is or how it should be tuned.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
* Peter Zijlstra wrote:

> On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> > On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> > > >
> > > > > +static void __intel_pstate_hwp_set_desired(int val)
> > > > > +{
> > > > > +	u64 value;
> > > > > +
> > > > > +	rdmsrl(MSR_HWP_REQUEST, value);
> > > > > +	value &= ~GENMASK_ULL(23, 16);
> > > > > +	value |= (val & 0xff) << 16;
> > > > > +	wrmsrl(MSR_HWP_REQUEST, value);
> > > > > +}
> > > >
> > > > Also, if we keep a software shadow of that MSR, we can avoid
> > > > the rdmsr, which might also help.
> >
> > The reason we don't keep a software shadow is that users can adjust
> > via the x86-energy-perf utility or via the BMC. CCed to Len.
>
> Total NAK on that. People using /dev/msr to prod at state get to keep
> whatever pieces they end up with.
>
> We _really_ should make /dev/msr taint the kernel on write. Ingo,
> Thomas, can we pretty please just do that?

Sure - mind sending a patch?

Thanks,

	Ingo
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Sat, Feb 03, 2018 at 08:30:48AM -0800, Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> > On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
> >
> > > +static void __intel_pstate_hwp_set_desired(int val)
> > > +{
> > > +	u64 value;
> > > +
> > > +	rdmsrl(MSR_HWP_REQUEST, value);
> > > +	value &= ~GENMASK_ULL(23, 16);
> > > +	value |= (val & 0xff) << 16;
> > > +	wrmsrl(MSR_HWP_REQUEST, value);
> > > +}
> >
> > Also, if we keep a software shadow of that MSR, we can avoid the
> > rdmsr, which might also help.
>
> The reason we don't keep a software shadow is that users can adjust
> via the x86-energy-perf utility or via the BMC. CCed to Len.

Total NAK on that. People using /dev/msr to prod at state get to keep
whatever pieces they end up with.

We _really_ should make /dev/msr taint the kernel on write. Ingo,
Thomas, can we pretty please just do that?
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Friday, February 2, 2018 8:48:01 PM CET Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it
> > > > > matches the intent here. But I've no idea what our current
> > > > > HWP implementation actually does with it.
> > > >
> > > > Desired != 0 will disable HWP autonomous mode of frequency
> > > > selection.
> > >
> > > But I don't think it will just run at "desired" then, will it?
> >
> > With HWP all of these are hints only, not a guarantee.
>
> Sure, but the lack of detection when tasks are low utilisation but
> still latency/throughput sensitive is problematic. Users shouldn't
> have to know they need to disable HWP or set the performance governor
> out of the box. It's only going to get worse as sockets get larger.

OK, so that's the case I was thinking about, when the kernel actually
knows more about what's going on than the HW-based logic. That's when
it makes sense to provide hints, of course.

> > There are totally different ways HWP is handled in clients and
> > servers. If you set desired, all the heuristics they collected will
> > be dumped, so they suggest not setting desired when you are in
> > autonomous mode. If we really want a boost, set the EPP. We know
> > that EPP makes lots of measurable difference.
>
> Sure, boosting EPP makes a difference -- it's essentially what the
> performance governor does and I know that can be done by a user, but
> it's still basically a cop-out. Default performance for low
> utilisation or lightly loaded machines is poor. Maybe it should be
> set based on the ACPI preferred profile, but that information is not
> always available. *Some* sort of hint about new migrations or tasks
> waking from IO would be desirable.

Agreed, we only need to figure out a clean way to do that.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Friday, February 2, 2018 3:54:24 PM CET Srinivas Pandruvada wrote:
> On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote:
> > On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada
> > wrote:
> > > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki
> > > > wrote:
> > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra
> > > > > wrote:
> > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki
> > > > > > wrote:
> > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter
> > > > > > > Zijlstra wrote:
> > > > > > > >
> > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance",
> > > > > > > > "Maximum_Performance" and "Desired_Performance" fields
> > > > > > > > which can be used to give explicit frequency hints. And
> > > > > > > > we really _should_ be doing that.
> > > > > > > >
> > > > > > > > Because, esp. in this scenario, a task migrating, the
> > > > > > > > hardware really can't do anything sensible, whereas the
> > > > > > > > OS _knows_.
> > > > > > >
> > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to.
> > > > > >
> > > > > > That just means we might need to throttle writing to it,
> > > > > > like it already does for the regular pstate (PERF_CTRL) MSR
> > > > > > in any case (also, is that a cheap MSR?)
> > > > > >
> > > > > > Not touching it at all seems silly.
> > > > >
> > > > > OK
> > > > >
> > > > > So what field precisely would you touch? "desired"? If so,
> > > > > does that actually guarantee anything to happen?
> > > >
> > > > No idea, desired would be the one I would start with, it
> > > > matches the intent here. But I've no idea what our current HWP
> > > > implementation actually does with it.
> > >
> > > Desired != 0 will disable HWP autonomous mode of frequency
> > > selection.
> >
> > But I don't think it will just run at "desired" then, will it?
>
> With HWP all of these are hints only, not a guarantee.
> There are totally different ways HWP is handled in clients and
> servers. If you set desired, all the heuristics they collected will
> be dumped, so they suggest not setting desired when you are in
> autonomous mode. If we really want a boost, set the EPP. We know that
> EPP makes lots of measurable difference.

Yeah, we've been consistently told to use EPP rather than "desired"
for hints like that.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, 2018-02-02 at 15:08 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote:
>
> > +static void __intel_pstate_hwp_set_desired(int val)
> > +{
> > +	u64 value;
> > +
> > +	rdmsrl(MSR_HWP_REQUEST, value);
> > +	value &= ~GENMASK_ULL(23, 16);
> > +	value |= (val & 0xff) << 16;
> > +	wrmsrl(MSR_HWP_REQUEST, value);
> > +}
>
> Also, if we keep a software shadow of that MSR, we can avoid the
> rdmsr, which might also help.

The reason we don't keep a software shadow is that users can adjust via
the x86-energy-perf utility or via the BMC. CCed to Len.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, 2018-02-02 at 19:48 +, Mel Gorman wrote:
> On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote:
> > > > > No idea, desired would be the one I would start with, it
> > > > > matches the intent here. But I've no idea what our current
> > > > > HWP implementation actually does with it.
> > > >
> > > > Desired != 0 will disable HWP autonomous mode of frequency
> > > > selection.
> > >
> > > But I don't think it will just run at "desired" then, will it?
> >
> > With HWP all of these are hints only, not a guarantee.
>
> Sure, but the lack of detection when tasks are low utilisation but
> still latency/throughput sensitive is problematic. Users shouldn't
> have to know they need to disable HWP or set the performance governor
> out of the box. It's only going to get worse as sockets get larger.

I am not saying that we shouldn't do anything. Can you give me some
workloads which you care the most about?

> > There are totally different ways HWP is handled in clients and
> > servers. If you set desired, all the heuristics they collected will
> > be dumped, so they suggest not setting desired when you are in
> > autonomous mode. If we really want a boost, set the EPP. We know
> > that EPP makes lots of measurable difference.
>
> Sure, boosting EPP makes a difference -- it's essentially what the
> performance governor does and I know that can be done by a user, but
> it's still basically a cop-out. Default performance for low
> utilisation or lightly loaded machines is poor. Maybe it should be
> set based on the ACPI preferred profile, but that information is not
> always available. *Some* sort of hint about new migrations or tasks
> waking from IO would be desirable.

EPP is a range, not a single value. So you don't need to make EPP=0 as
the performance governor does. PeterZ gave me some scheduler changes to
experiment with, which can be used as a hint to play with EPP.

Thanks,
Srinivas
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 06:54:24AM -0800, Srinivas Pandruvada wrote: > > > > No idea, desired would be the one I would start with, it matches > > > > with > > > > the intent here. But I've no idea what our current HWP > > > > implementation > > > > actually does with it. > > > Desired !=0 will disable HWP autonomous mode of frequency > > > selection. > > But I don't think it will just run at "desired" then, will it? > HWP all are these hints only not a guarantee. Sure, but the lack of detection when tasks are low-utilisation but still latency/throughput sensitive is problematic. Users shouldn't have to know they need to disable HWP or set the performance governor out of the box. It's only going to get worse as sockets get larger. > There are totally different way HWP is handled in client an servers. > If you set desired all heuristics they collected will be dumped, so > they suggest don't set desired when you are in autonomous mode. If we > really want a boost set the EPP. We know that EPP makes lots of > measurable difference. > Sure, boosting EPP makes a difference -- it's essentially what the performance governor does, and I know that can be done by a user, but it's still basically a cop-out. Default performance for low-utilisation or lightly loaded machines is poor. Maybe it should be set based on the ACPI preferred profile, but that information is not always available. *Some* sort of hint about new migrations or tasks waking from IO would be desirable. -- Mel Gorman SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, 2018-02-02 at 12:00 +0100, Rafael J. Wysocki wrote: > On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada > wrote: > > > > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote: > > > > > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki > > > wrote: > > > > > > > > > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra > > > > wrote: > > > > > > > > > > > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki > > > > > wrote: > > > > > > > > > > > > > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance", > > > > > > > "Maximum_Performance" and > > > > > > > "Desired_Performance" fields which can be used to give > > > > > > > explicit > > > > > > > frequency hints. And we really _should_ be doing that. > > > > > > > > > > > > > > Because, esp. in this scenario; a task migrating; the > > > > > > > hardware really > > > > > > > can't do anything sensible, whereas the OS _knows_. > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to. > > > > > That just means we might need to throttle writing to it, like > > > > > it > > > > > already > > > > > does for the regular pstate (PERF_CTRL) msr in any case > > > > > (also, is > > > > > that a > > > > > cheap msr?) > > > > > > > > > > Not touching it at all seems silly. > > > > OK > > > > > > > > So what field precisely would you touch? "desired"? If so, > > > > does > > > > that actually > > > > guarantee anything to happen? > > > No idea, desired would be the one I would start with, it matches > > > with > > > the intent here. But I've no idea what our current HWP > > > implementation > > > actually does with it. > > Desired !=0 will disable HWP autonomous mode of frequency > > selection. > But I don't think it will just run at "desired" then, will it? These are all hints to HWP, not a guarantee.
HWP is handled in totally different ways on clients and on servers. If you set desired, all the heuristics the hardware has collected will be dumped, so the suggestion is not to set desired while in autonomous mode. If we really want a boost, set the EPP instead; we know that EPP makes a lot of measurable difference.
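For context, the EPP being discussed is the Energy_Performance_Preference field of the IA32_HWP_REQUEST MSR (bits 31:24, where 0 biases the hardware fully toward performance and 255 fully toward energy saving). A minimal userspace sketch of the field manipulation a driver would do before writing the MSR back; the helper name is invented for illustration:

```c
#include <stdint.h>

/* IA32_HWP_REQUEST layout (per the Intel SDM):
 *   bits  7:0   Minimum_Performance
 *   bits 15:8   Maximum_Performance
 *   bits 23:16  Desired_Performance
 *   bits 31:24  Energy_Performance_Preference (EPP)
 * Illustrative helper: returns the request value with a new EPP,
 * leaving the other fields untouched.
 */
static inline uint64_t hwp_req_set_epp(uint64_t req, uint8_t epp)
{
	req &= ~(0xffULL << 24);	/* clear the old EPP field */
	req |= (uint64_t)epp << 24;	/* insert the new preference */
	return req;
}
```

In the kernel the result would then go back out with something like wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, req).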
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote: > +static void __intel_pstate_hwp_set_desired(int val) > +{ > + u64 value; > + > + value = rdmsrl(MSR_HWP_REQUEST); > + value &= ~GENMASK_ULL(23, 16); > + value |= (val & 0xff) << 16; > + wrmsrl(MSR_HWP_REQUEST, val); > +} Also, if we keep a software shadow of that MSR, we can avoid the rdmsr, which might also help.
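The software-shadow idea can be sketched in userspace as follows. All names here are illustrative assumptions: msr_write() stands in for wrmsrl(), and in the kernel the shadow would be a per-CPU variable updated under the same serialization as the MSR itself.

```c
#include <stdint.h>

/* Sketch of the "software shadow" idea: keep the last value written
 * to MSR_HWP_REQUEST in a variable, so that updating one field needs
 * only a wrmsr, not a rdmsr+wrmsr pair.
 */
static uint64_t hwp_req_shadow;	/* would be per-CPU in the kernel */
static uint64_t last_msr_write;	/* test hook: records the last write */

static void msr_write(uint64_t value)
{
	last_msr_write = value;	/* real code: wrmsrl(MSR_HWP_REQUEST, value) */
}

static void hwp_set_desired_cached(uint8_t desired)
{
	uint64_t value = hwp_req_shadow;	/* no rdmsr needed */

	value &= ~(0xffULL << 16);		/* desired is bits 23:16 */
	value |= (uint64_t)desired << 16;
	hwp_req_shadow = value;			/* keep the shadow current */
	msr_write(value);
}
```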
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote: > > > > But now that you made me look, intel_pstate_hwp_set() is horrible crap. > > > > You should _never_ do things like: > > > > > > > > rdmsr_on_cpu() > > > > /* frob value */ > > > > wrmsr_on_cpu() > > > > > > > > That's insane. > > > > > > I guess you mean it does too many IPIs? Or that it shouldn't do any IPIs > > > at all? > > > > Yes, too many synchronous IPIs, which themselves are typically already > > more expensive than the MSR access. > > We could do all of the updates in one IPI (as Srinivas said), but it would be > more code, and custom code for that matter. > > Is this really worth it for a slow path like this one? > Maybe it's a slow path at the moment, but don't forget that one motivation for this series is that HWP does not properly react when utilisation of a CPU is artificially low because a task recently slept for IO or recently migrated. In both cases, the task may be busy and sensitive to latency, throughput or both, but HWP will use a low p-state. A standard driver can do io-boost and, while it currently does not do so, it could also trivially do idle-boosting -- HWP does neither. -- Mel Gorman SUSE Labs
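The io-boost mentioned here works roughly like the schedutil governor's iowait boosting: each wakeup from IO raises a boost value that inflates the requested performance, and the boost decays when the IO wakeups stop, so short IO-bound bursts are not penalised by their low average utilisation. A simplified sketch of that boost/decay logic; the constants and names are illustrative, not intel_pstate code:

```c
#include <stdint.h>
#include <stdbool.h>

#define BOOST_MIN 128	/* first boost step (illustrative)   */
#define BOOST_MAX 1024	/* cap, i.e. "max performance" level */

/* Each IO wakeup doubles the boost up to the cap; every update
 * without an IO wakeup halves it, so the boost disappears quickly
 * once the task stops doing IO.
 */
static unsigned int io_boost_update(unsigned int boost, bool io_wakeup)
{
	if (io_wakeup) {
		if (!boost)
			return BOOST_MIN;
		return boost * 2 > BOOST_MAX ? BOOST_MAX : boost * 2;
	}
	return boost / 2;	/* decay when the task is not IO-bound */
}
```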
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote: > > Yes, too many synchronous IPIs, which themselves are typically already > > more expensive than the MSR access. > > We could do all of the updates in one IPI (as Srinivas said), but it would be > more code, and custom code for that matter. > > Is this really worth it for a slow path like this one? Probably not, still, the pattern really annoys me ;-)
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 01:46:47PM +0100, Peter Zijlstra wrote: > +static void __intel_pstate_hwp_set_desired(int val) > +{ > + u64 value; > + > + value = rdmsrl(MSR_HWP_REQUEST); > + value &= ~GENMASK_ULL(23, 16); > + value |= (val & 0xff) << 16; > + wrmsrl(MSR_HWP_REQUEST, val); D'0h: s/val/value/ > +}
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Fri, Feb 02, 2018 at 12:42:29PM +0100, Rafael J. Wysocki wrote: > > If you really care you can do async IPIs and do a custom serialization > > that only waits when you do back-to-back things, which should be fairly > > uncommon I'd think. > > In this particular case we don't want to return to user space before the > MSR is actually written with the new value. Why not? I was thinking of something like the below, which would in fact do exactly that. --- diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index 7edf7a0e5a96..f0caa5cc7adb 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include @@ -767,6 +768,40 @@ static void intel_pstate_hwp_set(unsigned int cpu) wrmsrl_on_cpu(cpu, MSR_HWP_REQUEST, value); } +static void __intel_pstate_hwp_set_desired(int val) +{ + u64 value; + + value = rdmsrl(MSR_HWP_REQUEST); + value &= ~GENMASK_ULL(23, 16); + value |= (val & 0xff) << 16; + wrmsrl(MSR_HWP_REQUEST, val); +} + +static void __intel_pstate_hwp_func(void *data) +{ + __intel_pstate_hwp_set_desired((int)data); +} + +static DEFINE_PER_CPU_SHARED_ALIGNED(struct __call_single_data, csd_data); + +static void intel_pstate_hwp_set_desired(int cpu, int val) +{ + struct call_function_data *csd; + + preempt_disable(); + csd = this_cpu_ptr(&csd_data); + /* wait for previous invocation to complete */ + csd_lock_wait(csd); + + csd->func = __intel_pstate_hwp_func; + csd->info = (unsigned long)val; + + smp_call_function_single_async(cpu, csd); + preempt_enable(); +} + + static int intel_pstate_hwp_save_state(struct cpufreq_policy *policy) { struct cpudata *cpu_data = all_cpu_data[policy->cpu]; diff --git a/include/linux/smp.h b/include/linux/smp.h index 9fb239e12b82..2bc125ec6146 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -14,6 +14,11 @@ #include #include +enum { + CSD_FLAG_LOCK = 0x01, + CSD_FLAG_SYNCHRONOUS= 0x02, +}; + typedef void 
(*smp_call_func_t)(void *info); struct __call_single_data { struct llist_node llist; @@ -26,6 +31,11 @@ struct __call_single_data { typedef struct __call_single_data call_single_data_t __aligned(sizeof(struct __call_single_data)); +static __always_inline void csd_lock_wait(call_single_data_t *csd) +{ + smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK)); +} + /* total number of cpus in this system (may exceed NR_CPUS) */ extern unsigned int total_cpus; diff --git a/kernel/smp.c b/kernel/smp.c index 084c8b3a2681..af0ef9eb7679 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -22,11 +22,6 @@ #include "smpboot.h" -enum { - CSD_FLAG_LOCK = 0x01, - CSD_FLAG_SYNCHRONOUS= 0x02, -}; - struct call_function_data { call_single_data_t __percpu *csd; cpumask_var_t cpumask; @@ -103,11 +98,6 @@ void __init call_function_init(void) * previous function call. For multi-cpu calls its even more interesting * as we'll have to ensure no other cpu is observing our csd. */ -static __always_inline void csd_lock_wait(call_single_data_t *csd) -{ - smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK)); -} - static __always_inline void csd_lock(call_single_data_t *csd) { csd_lock_wait(csd);
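As a userspace analogue of the csd pattern in the patch above, the following sketch (pthreads standing in for the IPI machinery, all names illustrative) shows the property being discussed: an async call only has to wait when it is issued back-to-back with a still-pending one, which is what csd_lock_wait() enforces.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int csd_lock_flag;	/* nonzero = previous call in flight */
static atomic_int work_done;		/* counts completed "IPI handler" runs */

/* The remote side: do the work, then release the slot (csd_unlock()). */
static void *remote_side(void *arg)
{
	(void)arg;
	atomic_fetch_add(&work_done, 1);
	atomic_store_explicit(&csd_lock_flag, 0, memory_order_release);
	return NULL;
}

/* The caller: wait only if the previous invocation is still pending
 * (csd_lock_wait()), mark the slot busy (csd_lock()), then fire off
 * the async call without waiting for it to complete.
 */
static void call_async(pthread_t *thr)
{
	while (atomic_load_explicit(&csd_lock_flag, memory_order_acquire))
		;	/* spin, like smp_cond_load_acquire() */
	atomic_store(&csd_lock_flag, 1);
	pthread_create(thr, NULL, remote_side, NULL);
}
```

The second of two back-to-back call_async() invocations spins until the first worker clears the flag, matching "only waits when you do back-to-back things".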
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Thursday, February 1, 2018 10:11:04 AM CET Peter Zijlstra wrote: > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote: > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote: > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote: > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote: > > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and > > > > > "Desired_Performance" fields which can be used to give explicit > > > > > frequency hints. And we really _should_ be doing that. > > > > > > > > > > Because, esp. in this scenario; a task migrating; the hardware really > > > > > can't do anything sensible, whereas the OS _knows_. > > > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to. > > > > > > That just means we might need to throttle writing to it, like it already > > > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a > > > cheap msr?) > > > > > > Not touching it at all seems silly. > > > > OK > > > > So what field precisely would you touch? "desired"? If so, does that > > actually > > guarantee anything to happen? > > No idea, desired would be the one I would start with, it matches with > the intent here. But I've no idea what our current HWP implementation > actually does with it. > > > > But now that you made me look, intel_pstate_hwp_set() is horrible crap. > > > You should _never_ do things like: > > > > > > rdmsr_on_cpu() > > > /* frob value */ > > > wrmsr_on_cpu() > > > > > > That's insane. > > > > I guess you mean it does too many IPIs? Or that it shouldn't do any IPIs > > at all? > > Yes, too many synchronous IPIs, which themselves are typically already > more expensive than the MSR access. We could do all of the updates in one IPI (as Srinivas said), but it would be more code, and custom code for that matter. Is this really worth it for a slow path like this one? 
> At one point I looked to getting rid of the *msr_on_cpu() crud entirely, > but there's just too much users out there I didn't feel like touching. > > If you really care you can do async IPIs and do a custom serialization > that only waits when you do back-to-back things, which should be fairly > uncommon I'd think. In this particular case we don't want to return to user space before the MSR is actually written with the new value.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Thursday, February 1, 2018 2:18:12 PM CET Srinivas Pandruvada wrote: > On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote: > > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote: > > > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra > > > wrote: > > > > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki > > > > wrote: > > > > > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra > > > > > wrote: > > > > > > > > > > > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance", > > > > > > "Maximum_Performance" and > > > > > > "Desired_Performance" fields which can be used to give > > > > > > explicit > > > > > > frequency hints. And we really _should_ be doing that. > > > > > > > > > > > > Because, esp. in this scenario; a task migrating; the > > > > > > hardware really > > > > > > can't do anything sensible, whereas the OS _knows_. > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to. > > > > That just means we might need to throttle writing to it, like it > > > > already > > > > does for the regular pstate (PERF_CTRL) msr in any case (also, is > > > > that a > > > > cheap msr?) > > > > > > > > Not touching it at all seems silly. > > > OK > > > > > > So what field precisely would you touch? "desired"? If so, does > > > that actually > > > guarantee anything to happen? > > No idea, desired would be the one I would start with, it matches with > > the intent here. But I've no idea what our current HWP implementation > > actually does with it. > Desired !=0 will disable HWP autonomous mode of frequency selection. But I don't think it will just run at "desired" then, will it?
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Thu, 2018-02-01 at 10:11 +0100, Peter Zijlstra wrote: > On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote: > > > > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra > > wrote: > > > > > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki > > > wrote: > > > > > > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra > > > > wrote: > > > > > > > > > > > > > > IA32_HWP_REQUEST has "Minimum_Performance", > > > > > "Maximum_Performance" and > > > > > "Desired_Performance" fields which can be used to give > > > > > explicit > > > > > frequency hints. And we really _should_ be doing that. > > > > > > > > > > Because, esp. in this scenario; a task migrating; the > > > > > hardware really > > > > > can't do anything sensible, whereas the OS _knows_. > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to. > > > That just means we might need to throttle writing to it, like it > > > already > > > does for the regular pstate (PERF_CTRL) msr in any case (also, is > > > that a > > > cheap msr?) > > > > > > Not touching it at all seems silly. > > OK > > > > So what field precisely would you touch? "desired"? If so, does > > that actually > > guarantee anything to happen? > No idea, desired would be the one I would start with, it matches with > the intent here. But I've no idea what our current HWP implementation > actually does with it. Desired !=0 will disable HWP autonomous mode of frequency selection.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Thu, Feb 01, 2018 at 08:50:28AM +0100, Rafael J. Wysocki wrote: > On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote: > > On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote: > > > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote: > > > > > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and > > > > "Desired_Performance" fields which can be used to give explicit > > > > frequency hints. And we really _should_ be doing that. > > > > > > > > Because, esp. in this scenario; a task migrating; the hardware really > > > > can't do anything sensible, whereas the OS _knows_. > > > > > > But IA32_HWP_REQUEST is not a cheap MSR to write to. > > > > That just means we might need to throttle writing to it, like it already > > does for the regular pstate (PERF_CTRL) msr in any case (also, is that a > > cheap msr?) > > > > Not touching it at all seems silly. > > OK > > So what field precisely would you touch? "desired"? If so, does that > actually > guarantee anything to happen? No idea; desired would be the one I would start with, as it matches the intent here. But I've no idea what our current HWP implementation actually does with it. > > But now that you made me look, intel_pstate_hwp_set() is horrible crap. > > You should _never_ do things like: > > > > rdmsr_on_cpu() > > /* frob value */ > > wrmsr_on_cpu() > > > > That's insane. > > I guess you mean it does too many IPIs? Or that it shouldn't do any IPIs > at all? Yes, too many synchronous IPIs, which themselves are typically already more expensive than the MSR access. At one point I looked at getting rid of the *msr_on_cpu() crud entirely, but there are just too many users out there that I didn't feel like touching. If you really care you can do async IPIs and do a custom serialization that only waits when you do back-to-back things, which should be fairly uncommon I'd think.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Wed, Jan 31, 2018 at 09:44:18AM -0800, Srinivas Pandruvada wrote:
> Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
> much slower compared to PERF_CTL (as high as 10:1).

Still much better than what other architectures have to deal with ;-)
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Wednesday, January 31, 2018 11:17:10 AM CET Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > >
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> >
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
>
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
>
> Not touching it at all seems silly.

OK

So what field precisely would you touch? "desired"? If so, does that
actually guarantee anything to happen?

> But now that you made me look, intel_pstate_hwp_set() is horrible crap.
> You should _never_ do things like:
>
> 	rdmsr_on_cpu()
> 	/* frob value */
> 	wrmsr_on_cpu()
>
> That's insane.

I guess you mean it does too many IPIs? Or that it shouldn't do any IPIs
at all?
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Wed, 2018-01-31 at 11:17 +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance"
> > > and "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > >
> > > Because, esp. in this scenario; a task migrating; the hardware
> > > really can't do anything sensible, whereas the OS _knows_.
> >
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
>
> That just means we might need to throttle writing to it, like it
> already does for the regular pstate (PERF_CTRL) msr in any case (also,
> is that a cheap msr?)

Much more throttling required compared to PERF_CTL. MSR_HWP_REQUEST is
much slower compared to PERF_CTL (as high as 10:1).

> Not touching it at all seems silly.
>
> But now that you made me look, intel_pstate_hwp_set() is horrible
> crap. You should _never_ do things like:
>
> 	rdmsr_on_cpu()
> 	/* frob value */
> 	wrmsr_on_cpu()
>
> That's insane.

Since the cpufreq callback is not guaranteed to be called on the same
CPU, we have to use rd/wrmsr_on_cpu(). But we can use
smp_call_function_single() and optimize this. This function is called
only during init, when usermode changes frequency limits and from
thermal, so very few times.

Thanks,
Srinivas
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Wed, Jan 31, 2018 at 11:17:10AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> > On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > > "Desired_Performance" fields which can be used to give explicit
> > > frequency hints. And we really _should_ be doing that.
> > >
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> >
> > But IA32_HWP_REQUEST is not a cheap MSR to write to.
>
> That just means we might need to throttle writing to it, like it already
> does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
> cheap msr?)
>
> Not touching it at all seems silly.

Note that even if we do such programming, it would still be desirable to
minimise the number of times we have to reprogram it, so the series as
it stands would still be useful.

--
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Wed, Jan 31, 2018 at 10:22:49AM +0100, Rafael J. Wysocki wrote:
> On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> > IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> > "Desired_Performance" fields which can be used to give explicit
> > frequency hints. And we really _should_ be doing that.
> >
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
>
> But IA32_HWP_REQUEST is not a cheap MSR to write to.

That just means we might need to throttle writing to it, like it already
does for the regular pstate (PERF_CTRL) msr in any case (also, is that a
cheap msr?)

Not touching it at all seems silly.

But now that you made me look, intel_pstate_hwp_set() is horrible crap.
You should _never_ do things like:

	rdmsr_on_cpu()
	/* frob value */
	wrmsr_on_cpu()

That's insane.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tuesday, January 30, 2018 2:15:31 PM CET Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> >
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
>
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.
>
> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.

But IA32_HWP_REQUEST is not a cheap MSR to write to.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 02:40:49PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 01:25:27PM +, Mel Gorman wrote:
> > > Because, esp. in this scenario; a task migrating; the hardware really
> > > can't do anything sensible, whereas the OS _knows_.
> >
> > Potentially yes. One option without HWP would be to track utilisation
> > for a task or artificially boost it for a short period after migration so
> > a higher p-state is potentially selected. With HWP, a hint would have to
> > be given to the hardware to try to select a higher frequency but I've no
> > idea how expensive that is or how it would behave on different
> > implementations of HWP. It may also be a game of whack-a-mole trying to
> > get every cpufreq configuration correct.
>
> We have much of this infrastructure and have been working on improving
> it [1]. We already track per task utilization and feed it into a cpufreq
> governor (schedutil).

Which is potentially great without HWP but ...

> > One advantage of using fewer cores is that it should
> > work regardless of cpufreq driver.
>
> Sure, and I started out by saying this patch isn't necessarily bad; but
> I think our 'use' [2] of HWP _is_.

This is where we get caught -- HWP means we hand full responsibility to
the hardware with limited feedback options. As far as I can tell, any
recent Intel CPU is going to have HWP and we've enabled it by default if
available since 4.6. Even if further hints can be given from the OS, it
does not necessarily mean this patch is unhelpful. On low utilisation,
it's still somewhat sensible to use the minimum number of CPUs necessary
(less cache traffic, higher p-states, maybe beneficial to turbo boost
etc).

--
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 01:25:27PM +, Mel Gorman wrote:
> > Because, esp. in this scenario; a task migrating; the hardware really
> > can't do anything sensible, whereas the OS _knows_.
>
> Potentially yes. One option without HWP would be to track utilisation
> for a task or artificially boost it for a short period after migration so
> a higher p-state is potentially selected. With HWP, a hint would have to
> be given to the hardware to try to select a higher frequency but I've no
> idea how expensive that is or how it would behave on different
> implementations of HWP. It may also be a game of whack-a-mole trying to
> get every cpufreq configuration correct.

We have much of this infrastructure and have been working on improving
it [1]. We already track per task utilization and feed it into a cpufreq
governor (schedutil).

> One advantage of using fewer cores is that it should
> work regardless of cpufreq driver.

Sure, and I started out by saying this patch isn't necessarily bad; but
I think our 'use' [2] of HWP _is_.

[1] https://lkml.kernel.org/r/20180123180847.4477-1-patrick.bell...@arm.com
[2] IIRC we basically don't do _anything_ when HWP.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, 2018-01-30 at 14:25 +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> > On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> > > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > > >
> > > > Mel, what hardware are you testing this on?
> > >
> > > The primary one was a single socket skylake machine with 8 threads (HT
> > > enabled).
> >
> > I took it for a spin in a 2 x Gold 6152 box.
>
> Can you translate that marketing speak for a simpleton like me? That's
> what we should've called SKL-EP, right?

Um um.. danged if I know. It's SKL, w. 22 cores each + SMT.

	-Mike
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 02:15:44PM +0100, Mike Galbraith wrote:
> On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> >
> > > Mel, what hardware are you testing this on?
> >
> > The primary one was a single socket skylake machine with 8 threads (HT
> > enabled).
>
> I took it for a spin in a 2 x Gold 6152 box.

Can you translate that marketing speak for a simpleton like me? That's
what we should've called SKL-EP, right?
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 02:15:31PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> > On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > > still give it hints, and it looks like we're not doing that.
> >
> > I'm not sure if HWP can fix it because of the per-cpu nature of its
> > decisions. I believe it can only give the most basic of hints to hardware
> > like an energy performance profile or bias (EPP and EPB respectively).
> > Of course HWP can be turned off but not many people can detect that it's
> > an appropriate decision, or even desirable, and there is always the caveat
> > that disabling it increases the system CPU footprint.
>
> IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
> "Desired_Performance" fields which can be used to give explicit
> frequency hints. And we really _should_ be doing that.

They can be, although these are usually set by the BIOS or set up early
in boot and then left alone. It's not clear how or if these should be
tuned on the fly or what variables would drive dynamic tuning. The data
collected would still be per-cpu, so if all CPUs have low utilisation,
the decisions will still be poor except maybe for things like IO
boosting.

> Because, esp. in this scenario; a task migrating; the hardware really
> can't do anything sensible, whereas the OS _knows_.

Potentially yes. One option without HWP would be to track utilisation
for a task or artificially boost it for a short period after migration
so a higher p-state is potentially selected. With HWP, a hint would have
to be given to the hardware to try to select a higher frequency, but
I've no idea how expensive that is or how it would behave on different
implementations of HWP. It may also be a game of whack-a-mole trying to
get every cpufreq configuration correct. One advantage of using fewer
cores is that it should work regardless of cpufreq driver.

--
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 02:06:06PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote:
> > > The results can be less dramatic on NUMA where automatic balancing
> > > interferes with the test. It's also known that network benchmarks
> > > running on localhost also benefit quite a bit from this patch (roughly
> > > 10% on netperf RR for UDP and TCP depending on the machine). Hackbench
> > > also sees small improvements (6-11% depending on machine and thread
> > > count). The facebook schbench was also tested but in most cases showed
> > > little or no difference to wakeup latencies.
> >
> > What cpufreq setting were you using for these tests?
>
> I cannot measure any hackbench variation one way or the other with these
> patches using 'performance' mode. So I'll assume you've been running
> things with HWP or something.

Correct, I am not using the performance governor as it's not the default
and not necessarily desirable as a default. HWP in some cases is enabled,
but even machines using the intel_pstate driver without HWP benefit.

--
Mel Gorman
SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, 2018-01-30 at 12:57 +, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
>
> > Mel, what hardware are you testing this on?
>
> The primary one was a single socket skylake machine with 8 threads (HT
> enabled).

I took it for a spin in a 2 x Gold 6152 box.

	-Mike
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 12:57:18PM +, Mel Gorman wrote:
> On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote:
> > Not saying this patch is bad; but Rafael / Srinivas we really should do
> > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can
> > still give it hints, and it looks like we're not doing that.
>
> I'm not sure if HWP can fix it because of the per-cpu nature of its
> decisions. I believe it can only give the most basic of hints to hardware
> like an energy performance profile or bias (EPP and EPB respectively).
> Of course HWP can be turned off but not many people can detect that it's
> an appropriate decision, or even desirable, and there is always the caveat
> that disabling it increases the system CPU footprint.

IA32_HWP_REQUEST has "Minimum_Performance", "Maximum_Performance" and
"Desired_Performance" fields which can be used to give explicit
frequency hints. And we really _should_ be doing that.

Because, esp. in this scenario; a task migrating; the hardware really
can't do anything sensible, whereas the OS _knows_.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote: > On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote: > > The results can be less dramatic on NUMA where automatic balancing > > interferes > > with the test. It's also known that network benchmarks running on localhost > > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP > > and TCP depending on the machine). Hackbench also sees small improvements > > (6-11% depending on machine and thread count). The Facebook schbench was > > also > > tested but in most cases showed little or no difference in wakeup latencies. > > What cpufreq setting were you using for these tests? I cannot measure any hackbench variation one way or the other with these patches using 'performance' mode. So I'll assume you've been running things with HWP or something.
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 12:53:49PM +0100, Peter Zijlstra wrote: > On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote: > > The results can be less dramatic on NUMA where automatic balancing > > interferes > > with the test. It's also known that network benchmarks running on localhost > > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP > > and TCP depending on the machine). Hackbench also sees small improvements > > (6-11% depending on machine and thread count). The Facebook schbench was > > also > > tested but in most cases showed little or no difference in wakeup latencies. > > What cpufreq setting were you using for these tests? Default powersave settings in general -- intel_pstate driver is the one used most often but HWP was not always available. By and large, no special tuning was applied. Two sets of tests were run, one with turbostat and perf keeping track of p-states and migrations and one without to make sure the measured increase was real. The irony when hitting cpufreq-related problems is that monitoring can mask the issue by increasing overall utilisation. -- Mel Gorman SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 12:50:54PM +0100, Peter Zijlstra wrote: > On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote: > > The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core: > > Rewrite and improve select_idle_siblings()") replaced a domain iteration > > with a search that broadly speaking does a wrapped walk of the scheduler > > domain sharing a last-level-cache. While this had a number of improvements, > > one consequence is that two tasks that share a waker/wakee relationship push > > each other around a socket. Even though two tasks may be active, all cores > > are evenly used. This is great from a search perspective and spreads a load > > across individual cores but it has adverse consequences for cpufreq. As each > > CPU has relatively low utilisation, cpufreq may decide the utilisation is > > too low to use a higher P-state and overall computation throughput suffers. > > > While individual cpufreq and cpuidle drivers may compensate by artificially > > boosting P-state (at c0) or avoiding lower C-states (during idle), it does > > not help if hardware-based cpufreq (e.g. HWP) is used. > > Not saying this patch is bad; but Rafael / Srinivas we really should do > better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can > still give it hints, and it looks like we're not doing that. > I'm not sure if HWP can fix it because of the per-cpu nature of its decisions. I believe it can only give the most basic of hints to hardware like an energy performance profile or bias (EPP and EPB respectively). Of course HWP can be turned off but not many people can detect that it's an appropriate decision, or even desirable, and there is always the caveat that disabling it increases the system CPU footprint. > Mel, what hardware are you testing this on? The primary one was a single socket Skylake machine with 8 threads (HT enabled). 
However, 11 machines were used in total across multiple generations to reduce the chance of a regression slipping in that was machine-specific. -- Mel Gorman SUSE Labs
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote: > The results can be less dramatic on NUMA where automatic balancing interferes > with the test. It's also known that network benchmarks running on localhost > also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP > and TCP depending on the machine). Hackbench also sees small improvements > (6-11% depending on machine and thread count). The Facebook schbench was also > tested but in most cases showed little or no difference in wakeup latencies. What cpufreq setting were you using for these tests?
Re: [PATCH 4/4] sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
On Tue, Jan 30, 2018 at 10:45:55AM +, Mel Gorman wrote: > The select_idle_sibling (SIS) rewrite in commit 10e2f1acd010 ("sched/core: > Rewrite and improve select_idle_siblings()") replaced a domain iteration > with a search that broadly speaking does a wrapped walk of the scheduler > domain sharing a last-level-cache. While this had a number of improvements, > one consequence is that two tasks that share a waker/wakee relationship push > each other around a socket. Even though two tasks may be active, all cores > are evenly used. This is great from a search perspective and spreads a load > across individual cores but it has adverse consequences for cpufreq. As each > CPU has relatively low utilisation, cpufreq may decide the utilisation is > too low to use a higher P-state and overall computation throughput suffers. > While individual cpufreq and cpuidle drivers may compensate by artificially > boosting P-state (at c0) or avoiding lower C-states (during idle), it does > not help if hardware-based cpufreq (e.g. HWP) is used. Not saying this patch is bad; but Rafael / Srinivas we really should do better. Why isn't cpufreq (esp. sugov) fixing this? HWP or not, we can still give it hints, and it looks like we're not doing that. Mel, what hardware are you testing this on?
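To make the "wrapped walk" in the quoted commit message concrete, here is a heavily simplified userspace sketch of the search pattern only; the real select_idle_sibling scan operates on cpumasks and bounds its cost, so this is an illustration of why the walk spreads (and dilutes) per-CPU utilisation, not the kernel code:

```c
/* Sketch of a wrapped walk: scan CPUs starting at 'target' and wrap
 * around the domain, returning the first idle CPU found.  Because the
 * start point follows the task, two tasks in a waker/wakee relationship
 * keep landing on fresh cores, so every core in the LLC domain sees a
 * little utilisation and none sees enough to justify a high P-state. */
static int wrapped_walk_idle(const int *cpu_idle, int nr_cpus, int target)
{
    for (int i = 0; i < nr_cpus; i++) {
        int cpu = (target + i) % nr_cpus;  /* wrap past the last CPU */
        if (cpu_idle[cpu])
            return cpu;
    }
    return -1;  /* no idle CPU in the domain */
}
```

The patch under discussion counters this by also considering a recently used CPU as a candidate, so the walk does not always start somewhere "fresh".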