Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
>
> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> > On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >> On 1/6/21 12:30 AM, Barry Song wrote:
> >>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>> cluster has 4 cpus. All clusters share L3 cache data while each cluster
> >>> has local L3 tag. On the other hand, each cluster will share some
> >>> internal system bus. This means cache is much more affine inside one cluster
> >>> than across clusters.
> >>
> >> There is a similar need for clustering in x86. Some x86 cores could share
> >> L2 caches that is similar to the cluster in Kunpeng 920 (e.g. on
> >> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing
> >> a separate L2, and 24 cores sharing L3). Having a sched domain at the
> >> L2 cluster helps spread load among L2 domains. This will reduce L2
> >> cache contention and help with performance for low to moderate load
> >> scenarios.
> >
> > IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> > between L2 caches while Barry is after consolidating tasks within the
> > boundaries of a L3 tag cache. One helps cache utilization, the other
> > communication latency between tasks. Am I missing something?
> >
> > IMHO, we need some numbers on the table to say which way to go. Looking
> > at just benchmarks of one type doesn't show that this is a good idea in
> > general.
>
> I think it is going to depend on the workload. If there are dependent
> tasks that communicate with one another, putting them together
> in the same cluster will be the right thing to do to reduce communication
> costs. On the other hand, if the tasks are independent, putting them
> together on the same cluster will increase resource contention and
> spreading them out will be better.

Agree. That is exactly where I'm coming from. This is all about the task
placement policy. We generally tend to spread tasks to avoid resource
contention, SMT and caches, which seems to be what you are proposing to
extend. I think that makes sense given it can produce significant
benefits.

> Any thoughts on what is the right clustering "tag" to use to clump
> related tasks together?
> Cgroup? Pid? Tasks with same mm?

I think this is the real question. I think the closest thing we have at
the moment is the wakee/waker flip heuristic. This seems to be related.
Perhaps the wake_affine tricks can serve as starting point?

Morten
Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> On 1/6/21 12:30 AM, Barry Song wrote:
> > ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data while each cluster
> > has local L3 tag. On the other hand, each cluster will share some
> > internal system bus. This means cache is much more affine inside one cluster
> > than across clusters.
>
> There is a similar need for clustering in x86. Some x86 cores could share
> L2 caches that is similar to the cluster in Kunpeng 920 (e.g. on
> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
> separate L2, and 24 cores sharing L3). Having a sched domain at the L2
> cluster helps spread load among L2 domains. This will reduce L2 cache
> contention and help with performance for low to moderate load scenarios.

IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
between L2 caches while Barry is after consolidating tasks within the
boundaries of a L3 tag cache. One helps cache utilization, the other
communication latency between tasks. Am I missing something?

IMHO, we need some numbers on the table to say which way to go. Looking
at just benchmarks of one type doesn't show that this is a good idea in
general.

Morten
Re: [PATCH] sched: Add schedutil overview
On Fri, Dec 18, 2020 at 11:33:09AM +, Valentin Schneider wrote:
> On 18/12/20 10:32, Peter Zijlstra wrote:
> > +Schedutil / DVFS
> > +
> > +
> > +Every time the scheduler load tracking is updated (task wakeup, task
> > +migration, time progression) we call out to schedutil to update the hardware
> > +DVFS state.
> > +
> > +The basis is the CPU runqueue's 'running' metric, which per the above it is
> > +the frequency invariant utilization estimate of the CPU. From this we compute
> > +a desired frequency like:
> > +
> > +             max( running, util_est );  if UTIL_EST
> > +  u_cfs := { running;                   otherwise
> > +
> > +  u_clamp := clamp( u_cfs, u_min, u_max )
> > +
> > +  u := u_cfs + u_rt + u_irq + u_dl;     [approx. see source for more detail]
> > +
> > +  f_des := min( f_max, 1.25 u * f_max )
> > +
>
> In schedutil_cpu_util(), uclamp clamps both u_cfs and u_rt. I'm afraid the
> below might just bring more confusion; what do you think?
>
>                clamp( u_cfs + u_rt, u_min, u_max );  if UCLAMP_TASK
>   u_clamp := { u_cfs + u_rt;                         otherwise
>
>   u := u_clamp + u_irq + u_dl;            [approx. see source for more detail]

It is reflecting the code so I think it is worth it. It also fixes the
typo in the original sum (u_cfs -> u_clamp).

> (also, does this need a word about runnable rt tasks => goto max?)

What is actually the intended policy there? I thought it was goto max
unless rt was clamped, but if I read the code correctly in
schedutil_cpu_util() the current policy is only goto max if uclamp isn't
in use at all, including cfs.

The write-up looks good to me.

Reviewed-by: Morten Rasmussen

Morten
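[Archive note: the frequency-selection arithmetic discussed in this thread can be sketched in plain C. This is an illustrative model only, not the kernel's schedutil code: the function names, the 1024-based utilization scale, and the parameter set are assumptions made for the example; it follows Valentin's corrected sum (u_clamp + u_irq + u_dl) with the 1.25 headroom factor.]

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL	/* assumed utilization scale */

static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical helper modelling the desired-frequency computation:
 *   u_cfs   = max(running, util_est)            (with UTIL_EST)
 *   u_clamp = clamp(u_cfs + u_rt, u_min, u_max) (with UCLAMP_TASK)
 *   u       = u_clamp + u_irq + u_dl
 *   f_des   = min(f_max, 1.25 * u * f_max / SCALE)
 */
static unsigned long desired_freq(unsigned long running, unsigned long util_est,
				  unsigned long u_rt, unsigned long u_irq,
				  unsigned long u_dl, unsigned long u_min,
				  unsigned long u_max, unsigned long f_max)
{
	unsigned long u_cfs = running > util_est ? running : util_est;
	unsigned long u_clamp = clamp_ul(u_cfs + u_rt, u_min, u_max);
	unsigned long u = u_clamp + u_irq + u_dl;
	unsigned long f = u * f_max / SCHED_CAPACITY_SCALE;

	f += f / 4;	/* the 1.25 headroom factor */
	return f < f_max ? f : f_max;
}
```

For example, a half-utilized CPU (u = 512 of 1024) with f_max = 2 GHz yields 1.25 GHz, while full utilization saturates at f_max.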
Re: [RFC] Documentation/scheduler/schedutil.txt
Hi Peter,

Looks like a nice summary to me.

On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> Hi,
>
> I was recently asked to explain how schedutil works, the below write-up
> is the result of that and I figured we might as well stick it in the
> tree.
>
> Not as a patch for easy reading and commenting.
>
> ---
>
> NOTE; all this assumes a linear relation between frequency and work capacity,
> we know this is flawed, but it is the best workable approximation.

If you replace frequency with performance level everywhere (CPPC style),
most of it should still work without that assumption. The assumption
might have been made in hw or fw instead though.

Morten
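[Archive note: the linear frequency/capacity assumption mentioned in the NOTE is what makes frequency-invariant utilization a simple ratio: time spent running is scaled by f_cur/f_max. A minimal sketch, with illustrative names (this is not the kernel's arch_scale_freq_capacity() implementation):]

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Under the linear assumption, running 50% of wall time at half the
 * maximum frequency represents the same amount of work as running 25%
 * of the time at f_max. Scale raw running time by f_cur/f_max.
 */
static unsigned long util_freq_inv(unsigned long util_raw,
				   unsigned long f_cur, unsigned long f_max)
{
	unsigned long scale = f_cur * SCHED_CAPACITY_SCALE / f_max;

	return util_raw * scale / SCHED_CAPACITY_SCALE;
}
```

If frequency and capacity are not actually linearly related (as the NOTE concedes), this scaling over- or under-estimates delivered work, which is why a CPPC-style performance-level formulation can sidestep the problem.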
Re: [PATCH v4 1/4] PM / EM: Add a flag indicating units of power values in Energy Model
On Thu, Nov 05, 2020 at 10:09:05AM +, Lukasz Luba wrote:
>
> On 11/5/20 9:18 AM, Morten Rasmussen wrote:
> > On Tue, Nov 03, 2020 at 09:05:57AM +, Lukasz Luba wrote:
> > > @@ -79,7 +82,8 @@ struct em_data_callback {
> > >  struct em_perf_domain *em_cpu_get(int cpu);
> > >  struct em_perf_domain *em_pd_get(struct device *dev);
> > >  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> > > -		struct em_data_callback *cb, cpumask_t *span);
> > > +		struct em_data_callback *cb, cpumask_t *spani,
> >
> > "spani" looks like a typo?
>
> Good catch, yes, the vim 'i'.
> Thank you Morten. I will resend this patch when you don't
> find other issues in the rest of patches.

The rest of the series looks okay to me.

Morten
Re: [PATCH v4 1/4] PM / EM: Add a flag indicating units of power values in Energy Model
On Tue, Nov 03, 2020 at 09:05:57AM +, Lukasz Luba wrote:
> @@ -79,7 +82,8 @@ struct em_data_callback {
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> -		struct em_data_callback *cb, cpumask_t *span);
> +		struct em_data_callback *cb, cpumask_t *spani,

"spani" looks like a typo?
Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.
On Mon, Oct 19, 2020 at 02:41:57PM +0100, Jonathan Cameron wrote: > On Mon, 19 Oct 2020 15:10:52 +0200 > Morten Rasmussen wrote: > > > Hi Jonathan, > > > > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > > > On Mon, 19 Oct 2020 12:35:22 +0200 > > > Peter Zijlstra wrote: > > > > > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > > > > Both ACPI and DT provide the ability to describe additional layers of > > > > > topology between that of individual cores and higher level constructs > > > > > such as the level at which the last level cache is shared. > > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > > > > Node Structure [1] that is the parent of the CPU cores and in turn > > > > > has a parent Processor Hierarchy Nodes Structure representing > > > > > a higher level of topology. > > > > > > > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > > > > any cache resources, but the interconnect topology is such that > > > > > the cost to transfer ownership of a cacheline between CPUs within > > > > > a cluster is lower than between CPUs in different clusters on the same > > > > > die. Hence, it can make sense to deliberately schedule threads > > > > > sharing data to a single cluster. > > > > > > > > > > This patch simply exposes this information to userspace libraries > > > > > like hwloc by providing cluster_cpus and related sysfs attributes. > > > > > PoC of HWLOC support at [2]. > > > > > > > > > > Note this patch only handle the ACPI case. > > > > > > > > > > Special consideration is needed for SMT processors, where it is > > > > > necessary to move 2 levels up the hierarchy from the leaf nodes > > > > > (thus skipping the processor core level). > > > > > > Hi Peter, > > > > > > > > > > > I'm confused by all of this. The core level is exactly what you seem to > > > > want. 
> > > It's the level above the core, whether in a multi-threaded core
> > > or a single threaded core. This may correspond to the level
> > > at which caches are shared (typically L3). Cores are already well
> > > represented via thread_siblings and similar. Extra confusion is that
> > > the current core_siblings (deprecated) sysfs interface, actually reflects
> > > the package level and ignores anything in between core and
> > > package (such as die on x86)
> > >
> > > So in a typical system with a hierarchical interconnect you would have
> > >
> > >   thread
> > >   core
> > >   cluster (possibly multiple layers as mentioned in Brice's reply).
> > >   die
> > >   package
> > >
> > > Unfortunately as pointed out in other branches of this thread, there is
> > > no consistent generic name. I'm open to suggestions!
> >
> > IIUC, you are actually proposing another "die" level? I'm not sure if we
> > can actually come up with a generic name since interconnects are highly
> > implementation dependent.
>
> Brice mentioned hwloc is using 'group'. That seems generic enough perhaps.
>
> > How is your memory distributed? Do you already have NUMA nodes? If you
> > want to keep tasks together, it might make sense to define the clusters
> > (in your case) as NUMA nodes.
>
> We already have all of the standard levels. We need at least one more.
> On a near future platform we'll have full set (kunpeng920 is single thread)
>
> So on kunpeng 920 we have
> cores
> (clusters)
> die / llc shared at this level

IIRC, LLC sharing isn't tied to a specific level in the user-space
topology description. On some Arm systems LLC is per cluster while the
package has a single die with two clusters.

I'm slightly confused about the cache sharing. You said above that your
clusters don't share cache resources? This list says LLC is at die
level, which is above cluster level?

> package (multiple NUMA nodes in each package)

What is your NUMA node span?
Couldn't you just make it equivalent to your clusters?

> > > For example, in zen2 this would correspond to a 'core complex'
> > > consisting of 4 CPU cores (each one 2 threads) sharing some local L3 cache.
> > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
> > > In zen3 it looks like this
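[Archive note: a user-space consumer such as hwloc would discover the proposed cluster masks through sysfs, where masks are also exposed in the human-readable CPU list format (e.g. "0-3,8-11"). Below is a hedged sketch of parsing that format; the attribute name cluster_cpus_list mirrors the RFC's "cluster_cpus and related sysfs attributes" but is an assumption, and the helper itself is purely illustrative:]

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/*
 * Count the CPUs covered by a sysfs CPU-list string such as the
 * contents of a (hypothetical) cluster_cpus_list attribute,
 * e.g. "0-3,8-11" -> 8 CPUs, "5" -> 1 CPU.
 */
static int cpulist_weight(const char *list)
{
	char buf[256];
	int count = 0, a, b;
	char *tok;

	strncpy(buf, list, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
		if (sscanf(tok, "%d-%d", &a, &b) == 2)
			count += b - a + 1;	/* range "a-b" */
		else if (sscanf(tok, "%d", &a) == 1)
			count += 1;		/* single CPU */
	}
	return count;
}
```

On a Kunpeng 920 as described in this thread, each cluster's list would cover 4 CPUs; the same parser works for the existing package_cpus_list/core_siblings_list attributes.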
Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.
On Mon, Oct 19, 2020 at 02:50:53PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> > On Mon, 19 Oct 2020 12:35:22 +0200
> > Peter Zijlstra wrote:
> >
> > > I'm confused by all of this. The core level is exactly what you seem to
> > > want.
> >
> > It's the level above the core, whether in a multi-threaded core
> > or a single threaded core. This may correspond to the level
> > at which caches are shared (typically L3). Cores are already well
> > represented via thread_siblings and similar. Extra confusion is that
> > the current core_siblings (deprecated) sysfs interface, actually reflects
> > the package level and ignores anything in between core and
> > package (such as die on x86)
>
> That seems wrong. core-mask should be whatever cores share L3. So on an
> Intel Core2-Quad (just to pick an example) you should have 4 CPUs in a
> package, but only 2 CPUs for the core-mask.
>
> It just so happens that L3 and package were the same for a long while in
> x86 land, although recent chips started breaking that trend.
>
> And I know nothing about the core-mask being deprecated; it's what the
> scheduler uses. It's not going anywhere.

Don't get confused over the user-space topology and the scheduler
topology, they are _not_ the same despite having similar names for some
things :-)

> So if your 'cluster' is a group of single cores (possibly with SMT) that
> do not share cache but have a faster cache connection and you want them
> to behave as-if they were a multi-core group that did share cache, then
> core-mask it is.

In the scheduler, yes. There is no core-mask exposed to user-space. We
have to be clear about whether we discuss scheduler or user-space
topology :-)
Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.
Hi Jonathan,

On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 12:35:22 +0200
> Peter Zijlstra wrote:
>
> > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:
> > > Both ACPI and DT provide the ability to describe additional layers of
> > > topology between that of individual cores and higher level constructs
> > > such as the level at which the last level cache is shared.
> > > In ACPI this can be represented in PPTT as a Processor Hierarchy
> > > Node Structure [1] that is the parent of the CPU cores and in turn
> > > has a parent Processor Hierarchy Nodes Structure representing
> > > a higher level of topology.
> > >
> > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share
> > > any cache resources, but the interconnect topology is such that
> > > the cost to transfer ownership of a cacheline between CPUs within
> > > a cluster is lower than between CPUs in different clusters on the same
> > > die. Hence, it can make sense to deliberately schedule threads
> > > sharing data to a single cluster.
> > >
> > > This patch simply exposes this information to userspace libraries
> > > like hwloc by providing cluster_cpus and related sysfs attributes.
> > > PoC of HWLOC support at [2].
> > >
> > > Note this patch only handles the ACPI case.
> > >
> > > Special consideration is needed for SMT processors, where it is
> > > necessary to move 2 levels up the hierarchy from the leaf nodes
> > > (thus skipping the processor core level).
>
> Hi Peter,
>
> > I'm confused by all of this. The core level is exactly what you seem to
> > want.
>
> It's the level above the core, whether in a multi-threaded core
> or a single threaded core. This may correspond to the level
> at which caches are shared (typically L3). Cores are already well
> represented via thread_siblings and similar. Extra confusion is that
> the current core_siblings (deprecated) sysfs interface, actually reflects
> the package level and ignores anything in between core and
> package (such as die on x86)
>
> So in a typical system with a hierarchical interconnect you would have
>
>   thread
>   core
>   cluster (possibly multiple layers as mentioned in Brice's reply).
>   die
>   package
>
> Unfortunately as pointed out in other branches of this thread, there is
> no consistent generic name. I'm open to suggestions!

IIUC, you are actually proposing another "die" level? I'm not sure if we
can actually come up with a generic name since interconnects are highly
implementation dependent.

How is your memory distributed? Do you already have NUMA nodes? If you
want to keep tasks together, it might make sense to define the clusters
(in your case) as NUMA nodes.

> Both ACPI PPTT and DT provide generic structures to represent layers of
> topology. They don't name them as such, but in ACPI there are flags to
> indicate package, core, thread.

I think that is because those are the only ones that are fairly generic
:-) It is also the only ones that the scheduler cares about (plus NUMA).

> For example, in zen2 this would correspond to a 'core complex' consisting
> of 4 CPU cores (each one 2 threads) sharing some local L3 cache.
> https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
> In zen3 it looks like this level will be the same as that for the die.
>
> Given they used the name in knights landing (and as is pointed out in
> another branch of this thread, it's the CPUID description) I think Intel
> calls these 'tiles' (anyone confirm that?)
>
> A similar concept exists for some ARM processors.
> https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110
> CCLs in the diagram on that page.
>
> Centriq 2400 had 2 core 'duplexes' which shared l2.
> https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3
>
> From the info released at hotchips, it looks like the thunderx3 deploys
> a similar ring interconnect with groups of cores, each with 4 threads.
> Not sure what they plan to call them yet though or whether they will
> choose to represent that layer of the topology in their firmware tables.
>
> Arm's CMN600 interconnect also supports such 'clusters' though I have no
> idea if anyone has used it in this form yet. In that case, they are
> called "processor compute clusters"
> https://developer.arm.com/documentation/100180/0103/
>
> Xuantie-910 is cluster based as well (shares l2).
>
> So in many cases the cluster level corresponds to something we already have
> visibility of due to cache sharing etc, but that isn't true in kunpeng 920.

The problem I see is that the benefit of keeping tasks together due to
the interconnect layout might vary significantly between systems. So if
we introduce a new cpumask for cluster it has to represent roughly the
same system properties otherwise generic software consuming this
information could be tricked. If there is a provable benefit of
Re: [PATCH] arm64: dts: sdm845: Add CPU topology
On Thu, Jun 06, 2019 at 10:44:58AM +0200, Vincent Guittot wrote: > On Thu, 6 Jun 2019 at 10:34, Dietmar Eggemann > wrote: > > > > On 6/6/19 10:20 AM, Vincent Guittot wrote: > > > On Thu, 6 Jun 2019 at 09:49, Quentin Perret > > > wrote: > > >> > > >> Hi Vincent, > > >> > > >> On Thursday 06 Jun 2019 at 09:05:16 (+0200), Vincent Guittot wrote: > > >>> Hi Quentin, > > >>> > > >>> On Wed, 5 Jun 2019 at 19:21, Quentin Perret > > >>> wrote: > > > > On Friday 17 May 2019 at 14:55:19 (-0700), Stephen Boyd wrote: > > > Quoting Amit Kucheria (2019-05-16 04:54:45) > > >> (cc'ing Andy's correct email address) > > >> > > >> On Wed, May 15, 2019 at 2:46 AM Stephen Boyd > > >> wrote: > > >>> > > >>> Quoting Amit Kucheria (2019-05-13 04:54:12) > > On Mon, May 13, 2019 at 4:31 PM Amit Kucheria > > wrote: > > > > > > On Tue, Jan 15, 2019 at 12:13 AM Matthias Kaehlcke > > > wrote: > > >> > > >> The 8 CPU cores of the SDM845 are organized in two clusters of 4 > > >> big > > >> ("gold") and 4 little ("silver") cores. Add a cpu-map node to > > >> the DT > > >> that describes this topology. > > > > > > This is partly true. There are two groups of gold and silver > > > cores, > > > but AFAICT they are in a single cluster, not two separate ones. > > > SDM845 > > > is one of the early examples of ARM's Dynamiq architecture. > > > > > >> Signed-off-by: Matthias Kaehlcke > > > > > > I noticed that this patch sneaked through for this merge window > > > but > > > perhaps we can whip up a quick fix for -rc2? > > > > > > > And please find attached a patch to fix this up. Andy, since this > > hasn't landed yet (can we still squash this into the original > > patch?), > > I couldn't add a Fixes tag. > > > > >>> > > >>> I had the same concern. Thanks for catching this. I suspect this > > >>> must > > >>> cause some problem for IPA given that it can't discern between the > > >>> big > > >>> and little "power clusters"? > > >> > > >> Both EAS and IPA, I believe. 
> >>> It influences the scheduler's view of the topology.
> >>
> >> And EAS and IPA are OK with the real topology? I'm just curious if
> >> changing the topology to reflect reality will be a problem for those
> >> two.
> >
> > FWIW, neither EAS nor IPA depends on this. Not the upstream version of
> > EAS at least (which is used in recent Android kernels -- 4.19+).
> >
> > But doing this is still required for other things in the scheduler (the
> > so-called 'capacity-awareness' code). So until we have a better
> > solution, this patch is doing the right thing.
> >>>
> >>> I'm not sure to catch what you mean ?
> >>> Which so-called 'capacity-awareness' code are you speaking about ? and
> >>> what is the problem ?
> >>
> >> I'm talking about the wake-up path. ATM select_idle_sibling() is totally
> >> unaware of capacity differences. In its current form, this function
> >> basically assumes that all CPUs in a given sd_llc have the same
> >> capacity, which would be wrong if we had a single MC level for SDM845.
> >> So, until select_idle_sibling() is 'fixed' to be capacity-aware, we need
> >> two levels of sd for asymmetric systems (including DynamIQ) so the
> >> wake_cap() story actually works.
> >>
> >> I hope that clarifies it :)
> >
> > hmm... does this justify this wrong topology ?

No, it doesn't. It relies heavily on how nested clusters are interpreted
too, so it is quite fragile.

> > select_idle_sibling() is called only when system is overloaded and
> > scheduler disables the EAS path
> > In this case, the scheduler looks either for an idle cpu or for evenly
> > spreading the loads
> > This is maybe not always optimal and should probably be fixed but
> > doesn't justify a wrong topology description IMHO
>
> The big/Little cluster detection in wake_cap() doesn't work anymore with
> DynamIQ w/o Phantom (DIE) domain. So the decision of going sis() or slow
> path is IMHO broken.

> That's probably not the right thread to discuss this further but I'm
> not sure to understand why wake_cap() doesn't work as it compares the
> capacity_orig of local cpu and prev cpu which are the same whatever
> the sched domain

We have had this discussion a couple of times over the last couple of
years. The story, IIRC, is that when we introduced capacity awareness
in the wake-up path (wake_cap()) we realised (I think it was actually
you) that we could use select_idle_sibling() in cases where we know
that the search space is limited to cpus with sufficient capacity so we
didn't have to take the long route through find_idlest_cpu(). Back
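[Archive note: the capacity comparison being debated can be illustrated with a small sketch. The ~20% headroom margin mirrors the kernel's fits_capacity() convention, but the function names and the decision structure below are hypothetical, not the actual wake_cap() implementation:]

```c
#include <assert.h>

/*
 * Non-zero if task_util fits within cpu_cap with roughly 20% headroom,
 * i.e. util * 1280 < cap * 1024 (util below 80% of capacity).
 */
static int capacity_fits(unsigned long task_util, unsigned long cpu_cap)
{
	return task_util * 1280 < cpu_cap * 1024;
}

/*
 * wake_cap()-style decision sketch: the fast path
 * (select_idle_sibling()) is only safe when the task fits on the
 * smaller of the waker's and the previous CPU's original capacity;
 * otherwise fall back to the slow path (find_idlest_cpu()).
 */
static int use_fast_path(unsigned long task_util,
			 unsigned long cap_this, unsigned long cap_prev)
{
	unsigned long min_cap = cap_this < cap_prev ? cap_this : cap_prev;

	return capacity_fits(task_util, min_cap);
}
```

This also shows why the check degenerates on a system described with a single flat domain: if capacity_orig is the same for the CPUs compared, the asymmetry the check is meant to detect never surfaces.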
Re: [PATCH v6 1/7] Documentation: DT: arm: add support for sockets defining package boundaries
On Fri, May 31, 2019 at 10:37:43AM +0100, Sudeep Holla wrote: > On Thu, May 30, 2019 at 10:42:54PM +0100, Russell King - ARM Linux admin > wrote: > > On Thu, May 30, 2019 at 12:51:03PM +0100, Morten Rasmussen wrote: > > > On Wed, May 29, 2019 at 07:39:17PM -0400, Andrew F. Davis wrote: > > > > On 5/29/19 5:13 PM, Atish Patra wrote: > > > > >From: Sudeep Holla > > > > > > > > > >The current ARM DT topology description provides the operating system > > > > >with a topological view of the system that is based on leaf nodes > > > > >representing either cores or threads (in an SMT system) and a > > > > >hierarchical set of cluster nodes that creates a hierarchical topology > > > > >view of how those cores and threads are grouped. > > > > > > > > > >However this hierarchical representation of clusters does not allow to > > > > >describe what topology level actually represents the physical package > > > > >or > > > > >the socket boundary, which is a key piece of information to be used by > > > > >an operating system to optimize resource allocation and scheduling. > > > > > > > > > > > > > Are physical package descriptions really needed? What does "socket" > > > > imply > > > > that a higher layer "cluster" node grouping does not? It doesn't imply a > > > > different NUMA distance and the definition of "socket" is already not > > > > well > > > > defined, is a dual chiplet processor not just a fancy dual "socket" or > > > > are > > > > dual "sockets" on a server board "slotket" card, will we need new names > > > > for > > > > those too.. > > > > > > Socket (or package) just implies what you suggest, a grouping of CPUs > > > based on the physical socket (or package). Some resources might be > > > associated with packages and more importantly socket information is > > > exposed to user-space. At the moment clusters are being exposed to > > > user-space as sockets which is less than ideal for some topologies. 
> > Please point out a 32-bit ARM system that has multiple "socket"s.
> >
> > As far as I'm aware, all 32-bit systems do not have socketed CPUs
> > (modern ARM CPUs are part of a larger SoC), and the CPUs are always
> > in one package.
> >
> > Even the test systems I've seen do not have socketed CPUs.
>
> As far as we know, there's none. So we simply have to assume all
> those systems are single socket (IOW all CPUs reside inside a single
> SoC package) systems.

Right, but we don't make that assumption. Clusters are reported as
sockets/packages for arm, just like they are for arm64. My comment
above applied to what can be described using DT, not what systems
actually exist. We need to be able to describe packages for
architectures where we can't make assumptions.

arm example (ARM TC2):

root@morras01-tc2:~# lstopo
Machine (985MB)
  Package L#0
    Core L#0 + PU L#0 (P#0)
    Core L#1 + PU L#1 (P#1)
  Package L#1
    Core L#2 + PU L#2 (P#2)
    Core L#3 + PU L#3 (P#3)
    Core L#4 + PU L#4 (P#4)

Morten
Re: [PATCH v6 1/7] Documentation: DT: arm: add support for sockets defining package boundaries
On Thu, May 30, 2019 at 08:56:03AM -0400, Andrew F. Davis wrote: > On 5/30/19 7:51 AM, Morten Rasmussen wrote: > >On Wed, May 29, 2019 at 07:39:17PM -0400, Andrew F. Davis wrote: > >>On 5/29/19 5:13 PM, Atish Patra wrote: > >>>From: Sudeep Holla > >>> > >>>The current ARM DT topology description provides the operating system > >>>with a topological view of the system that is based on leaf nodes > >>>representing either cores or threads (in an SMT system) and a > >>>hierarchical set of cluster nodes that creates a hierarchical topology > >>>view of how those cores and threads are grouped. > >>> > >>>However this hierarchical representation of clusters does not allow to > >>>describe what topology level actually represents the physical package or > >>>the socket boundary, which is a key piece of information to be used by > >>>an operating system to optimize resource allocation and scheduling. > >>> > >> > >>Are physical package descriptions really needed? What does "socket" imply > >>that a higher layer "cluster" node grouping does not? It doesn't imply a > >>different NUMA distance and the definition of "socket" is already not well > >>defined, is a dual chiplet processor not just a fancy dual "socket" or are > >>dual "sockets" on a server board "slotket" card, will we need new names for > >>those too.. > > > >Socket (or package) just implies what you suggest, a grouping of CPUs > >based on the physical socket (or package). Some resources might be > >associated with packages and more importantly socket information is > >exposed to user-space. At the moment clusters are being exposed to > >user-space as sockets which is less than ideal for some topologies. > > > > I see the benefit of reporting the physical layout and packaging information > to user-space for tracking reasons, but from software perspective this > doesn't matter, and the resource partitioning should be described elsewhere > (NUMA nodes being the go to example). 
That would make defining a NUMA node mandatory even for non-NUMA systems? > >At the moment user-space is only told about hw threads, cores, and > >sockets. In the very near future it is going to be told about dies too > >(look for Len Brown's multi-die patch set). > > > > Seems my hypothetical case is already in the works :( Indeed. IIUC, the reasoning behind it is related to actual multi-die x86 packages and some rapl stuff being per-die or per-core. > > >I don't see how we can provide correct information to user-space based > >on the current information in DT. I'm not convinced it was a good idea > >to expose this information to user-space to begin with but that is > >another discussion. > > > > Fair enough, it's a little late now to un-expose this info to userspace so > we should at least present it correctly. My worry was this getting out of > hand with layering, for instance what happens when we need to add die nodes > in-between cluster and socket? If we want the die mask to be correct for arm/arm64/riscv we need die information from somewhere. I'm not in favour of adding more topology layers to the user-space visible topology description, but others might have a valid reason and if it is exposed I would prefer if we try to expose the right information. Btw, for packages, we already have that information in ACPI/PPTT so it would be nice if we could have that for DT based systems too. Morten
Re: [PATCH v6 1/7] Documentation: DT: arm: add support for sockets defining package boundaries
On Wed, May 29, 2019 at 07:39:17PM -0400, Andrew F. Davis wrote: > On 5/29/19 5:13 PM, Atish Patra wrote: > >From: Sudeep Holla > > > >The current ARM DT topology description provides the operating system > >with a topological view of the system that is based on leaf nodes > >representing either cores or threads (in an SMT system) and a > >hierarchical set of cluster nodes that creates a hierarchical topology > >view of how those cores and threads are grouped. > > > >However this hierarchical representation of clusters does not allow to > >describe what topology level actually represents the physical package or > >the socket boundary, which is a key piece of information to be used by > >an operating system to optimize resource allocation and scheduling. > > > > Are physical package descriptions really needed? What does "socket" imply > that a higher layer "cluster" node grouping does not? It doesn't imply a > different NUMA distance and the definition of "socket" is already not well > defined, is a dual chiplet processor not just a fancy dual "socket" or are > dual "sockets" on a server board "slotket" card, will we need new names for > those too.. Socket (or package) just implies what you suggest, a grouping of CPUs based on the physical socket (or package). Some resources might be associated with packages and more importantly socket information is exposed to user-space. At the moment clusters are being exposed to user-space as sockets which is less than ideal for some topologies. At the moment user-space is only told about hw threads, cores, and sockets. In the very near future it is going to be told about dies too (look for Len Brown's multi-die patch set). I don't see how we can provide correct information to user-space based on the current information in DT. I'm not convinced it was a good idea to expose this information to user-space to begin with but that is another discussion. Morten
Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block
On Tue, May 07, 2019 at 03:13:40PM +0100, Quentin Perret wrote: > On Monday 06 May 2019 at 06:48:33 (+0200), Luca Abeni wrote: > > @@ -1591,6 +1626,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int > > sd_flag, int flags) > > > > rcu_read_lock(); > > curr = READ_ONCE(rq->curr); /* unlocked access */ > > + het = static_branch_unlikely(_asym_cpucapacity); > > Nit: not sure what the generated code looks like, but I wonder if this > could potentially make you lose the benefit of the static key? I have to take the blame for this bit :-) I would be surprised if the static_key gives us anything here, but that is actually not the point. It is purely to know whether we have to be capacity aware or not. I don't think we are in a critical path, and the variable providing the necessary condition just happened to be a static_key. We might be able to make better use of it if we refactor the code a bit. Morten
Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller
Hi, On Mon, Apr 08, 2019 at 02:45:32PM -0700, Song Liu wrote: > Servers running latency sensitive workload usually aren't fully loaded for > various reasons including disaster readiness. The machines running our > interactive workloads (referred as main workload) have a lot of spare CPU > cycles that we would like to use for optimistic side jobs like video > encoding. However, our experiments show that the side workload has strong > impact on the latency of main workload:
>
>   side-job    main-load-level    main-avg-latency
>   none        1.0                1.00
>   none        1.1                1.10
>   none        1.2                1.10
>   none        1.3                1.10
>   none        1.4                1.15
>   none        1.5                1.24
>   none        1.6                1.74
>
>   ffmpeg      1.0                1.82
>   ffmpeg      1.1                2.74
>
> Note: both the main-load-level and the main-avg-latency numbers are > _normalized_. Could you reveal what level of utilization those main-load-level numbers correspond to? I'm trying to understand why the latency seems to increase rapidly once you hit 1.5. Is that the point where the system hits 100% utilization? > In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1 > (lowest priority). However, it consumes all idle CPU cycles in the > system and causes high latency for the main workload. Further experiments > and analysis (more details below) show that, for the main workload to meet > its latency targets, it is necessary to limit the CPU usage of the side > workload so that there are some _idle_ CPUs. There are various reasons > behind the need for idle CPU time. First, shared CPU resource saturation > starts to happen way before time-measured utilization reaches 100%. > Secondly, scheduling latency starts to impact the main workload as CPU > reaches full utilization. > > Currently, the cpu controller provides two mechanisms to protect the main > workload: cpu.weight and cpu.max. However, neither of them is sufficient > in these use cases.
As shown in the experiments above, side workload with > cpu.weight of 1 (lowest priority) would still consume all idle CPU and add > unacceptable latency to the main workload. cpu.max can throttle the CPU > usage of the side workload and preserve some idle CPU. However, cpu.max > cannot react to changes in load levels. For example, when the main > workload uses 40% of CPU, cpu.max of 30% for the side workload would yield > good latencies for the main workload. However, when the workload > experiences higher load levels and uses more CPU, the same setting (cpu.max > of 30%) would cause the interactive workload to miss its latency target. > > These experiments demonstrated the need for a mechanism to effectively > throttle CPU usage of the side workload and preserve idle CPU cycles. > The mechanism should be able to adjust the level of throttling based on > the load level of the main workload. > > This patchset introduces a new knob for cpu controller: cpu.headroom. > cgroup of the main workload uses cpu.headroom to ensure side workload to > use limited CPU cycles. For example, if a main workload has a cpu.headroom > of 30%. The side workload will be throttled to give 30% overall idle CPU. > If the main workload uses more than 70% of CPU, the side workload will only > run with configurable minimal cycles. This configurable minimal cycles is > referred as "tolerance" of the main workload. IIUC, you are proposing to basically apply dynamic bandwidth throttling to side-jobs to preserve a specific headroom of idle cycles. The bit that isn't clear to me is _why_ adding idle cycles helps your workload. I'm not convinced that adding headroom gives any latency improvements beyond watering down the impact of your side jobs. AFAIK, the throttling mechanism effectively removes the throttled tasks from the schedule according to a specific duty cycle.
When the side job is not throttled the main workload experiences the same latency issues as before, but by dynamically tuning the side job throttling you can achieve a better average latency. Am I missing something? Have you looked at your distribution of main job latency and tried to compare with when throttling is active/not active? I'm wondering if the headroom solution is really the right solution for your use-case, or if what you are really after is something which is lower priority than just setting the weight to 1. Something that (nearly) always gets pre-empted by your main job (SCHED_BATCH and SCHED_IDLE might not be enough). If your main job consists of lots of relatively short wake-ups, things like the min_granularity could have significant latency impact. Morten
Re: [PATCH 0/14] v2 multi-die/package topology support
On Tue, Feb 26, 2019 at 07:53:58PM +0100, Peter Zijlstra wrote: > On Tue, Feb 26, 2019 at 01:19:58AM -0500, Len Brown wrote: > > Added sysfs package_threads, package_threads_list > > > > Added this attribute to show threads siblings in a package. > > Exactly same as "core_siblings above", a name now deprecated. > > This attribute name and definition is immune to future > > topology changes. > > > > Suggested by Brice. > > > > Added sysfs die_threads, die_threads_list > > > > Added this attribute to show which threads siblings in a die. > > V1 had proposed putting this info into "core_siblings", but we > > decided to leave that legacy attribute alone. > > This attribute name and definition is immune to future > > topology changes. > > > > On a single die-package system this attribute has same contents > > as "package_threads". > > > > Suggested by Brice. > > > > Added sysfs core_threads, core_threads_list > > > > Added this attribute to show which threads siblings in a core. > > Exactly same as "thread_siblings", a name now deprecated. > > This attribute name and definition is immune to future > > topology changes. > > > > Suggested by Brice. > > I think I prefer 's/threads/cpus/g' on that. Threads makes me think SMT, > and I don't think there's any guarantee the part in question will have > SMT on. I think 'threads' is a bit confusing as well. We seem to be using 'cpu' everywhere for something we can schedule tasks on, including the sysfs /sys/devices/system/cpu/ subdirs for each SMT thread on SMT systems. Another thing that I find confusing is that with this series we get a new die id/mask which is totally unrelated to the DIE level in the sched_domain hierarchy. We should rename DIE level to something that reflects what it really is. If we can agree on that ;-) NODE level? Morten
Re: [PATCH 05/14] cpu topology: Export die_id
Hi Len, On Tue, Feb 26, 2019 at 01:20:03AM -0500, Len Brown wrote: > From: Len Brown > > Export die_id in cpu topology, for the benefit of hardware that > has multiple-die/package. > > Signed-off-by: Len Brown > Cc: linux-...@vger.kernel.org > --- > Documentation/cputopology.txt | 6 ++ > arch/x86/include/asm/topology.h | 1 + > drivers/base/topology.c | 4 > include/linux/topology.h| 3 +++ > 4 files changed, 14 insertions(+) > > diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt > index cb61277e2308..4e6be7f68fd8 100644 > --- a/Documentation/cputopology.txt > +++ b/Documentation/cputopology.txt > @@ -12,6 +12,12 @@ physical_package_id: > socket number, but the actual value is architecture and platform > dependent. > > +die_id: > + > + the CPU die ID of cpuX. Typically it is the hardware platform's > + identifier (rather than the kernel's). The actual value is > + architecture and platform dependent. > + > core_id: Can we add the details about die_id further down in cputopology.txt as well? 
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 6c25ce682c90..77b65583081e 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -97,6 +97,7 @@ For an architecture to support this feature, it must define some of these
 macros in include/asm-XXX/topology.h::

 	#define topology_physical_package_id(cpu)
+	#define topology_die_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
@@ -116,10 +117,11 @@ provides default definitions for any of the above macros that are not
 defined by include/asm-XXX/topology.h:

 1) topology_physical_package_id: -1
-2) topology_core_id: 0
-3) topology_sibling_cpumask: just the given CPU
-4) topology_core_cpumask: just the given CPU
-5) topology_die_cpumask: just the given CPU
+2) topology_die_id: -1
+3) topology_core_id: 0
+4) topology_sibling_cpumask: just the given CPU
+5) topology_core_cpumask: just the given CPU
+6) topology_die_cpumask: just the given CPU

> > 	the CPU core ID of cpuX. Typically it is the hardware platform's
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 453cf38a1c33..281be6bbc80d 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
>
> #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
> #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
> +#define topology_die_id(cpu)		(cpu_data(cpu).cpu_die_id)
> #define topology_core_id(cpu)		(cpu_data(cpu).cpu_core_id)
>
> #ifdef CONFIG_SMP

The above is x86 specific and seems to fit better with the next patch in the series. Morten
[tip:sched/core] sched/fair: Add over-utilization/tipping point indicator
Commit-ID:  2802bf3cd936fe2c8033a696d375a4d9d3974de4
Gitweb:     https://git.kernel.org/tip/2802bf3cd936fe2c8033a696d375a4d9d3974de4
Author:     Morten Rasmussen
AuthorDate: Mon, 3 Dec 2018 09:56:25 +
Committer:  Ingo Molnar
CommitDate: Tue, 11 Dec 2018 15:17:01 +0100

sched/fair: Add over-utilization/tipping point indicator

Energy-aware scheduling is only meant to be active while the system is _not_ over-utilized. That is, there are spare cycles available to shift tasks around based on their actual utilization to get a more energy-efficient task distribution without depriving any tasks. When above the tipping point, task placement is done the traditional way based on load_avg, spreading the tasks across as many cpus as possible based on priority-scaled load to preserve smp_nice. Below the tipping point we want to use util_avg instead. We need to define a criterion for when we make the switch.

The util_avg for each cpu converges towards 100% regardless of how many additional tasks we may put on it. If we define over-utilized as:

  sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even when the above condition is false. That should be okay as long as we try to spread the tasks out to avoid per-cpu over-utilization as much as possible and if all tasks have the _same_ priority. If the latter isn't true, we have to consider priority to preserve smp_nice. For example, we could have n_cpus nice=-10 util_avg=55% tasks and n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is less over-utilized than 55%+60% for those cpus that have to be shared. The system utilization is only 85% of the system capacity, but we are breaking smp_nice.
To be sure not to break smp_nice, we have defined over-utilization conservatively as when any cpu in the system is fully utilized at its highest frequency instead:

  cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg to factor in priority to preserve smp_nice. With this definition, we can skip periodic load-balance as no cpu has an always-running task when the system is not over-utilized. All tasks will be periodic and we can balance them at wake-up. This conservative condition does however mean that some scenarios that could benefit from energy-aware decisions even if one cpu is fully utilized would not get those benefits. For systems where some cpus might have reduced capacity (RT-pressure and/or big.LITTLE), we want periodic load-balance checks as soon as just a single cpu is fully utilized, as it might be one of those with reduced capacity and in that case we want to migrate it. [ peterz: Added a comment explaining why new tasks are not accounted during overutilization detection.
]
Signed-off-by: Morten Rasmussen
Signed-off-by: Quentin Perret
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: adhar...@codeaurora.org
Cc: chris.redp...@arm.com
Cc: curroje...@riseup.net
Cc: dietmar.eggem...@arm.com
Cc: edubez...@gmail.com
Cc: gre...@linuxfoundation.org
Cc: javi.mer...@kernel.org
Cc: j...@joelfernandes.org
Cc: juri.le...@redhat.com
Cc: patrick.bell...@arm.com
Cc: pkond...@codeaurora.org
Cc: r...@rjwysocki.net
Cc: skan...@codeaurora.org
Cc: smuc...@google.com
Cc: srinivas.pandruv...@linux.intel.com
Cc: thara.gopin...@linaro.org
Cc: tk...@google.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-13-quentin.per...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c  | 59 ++--
 kernel/sched/sched.h |  4
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e04f29098ec7..767e7675774b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5082,6 +5082,24 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif

+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5139,8 +5157,26 @@ enqueue_task_fair(struct rq *rq, s
Re: [RFT PATCH v1 0/4] Unify CPU topology across ARM64 & RISC-V
Hi, On Thu, Nov 29, 2018 at 03:28:16PM -0800, Atish Patra wrote: > The cpu-map DT entry in ARM64 can describe the CPU topology in > much better way compared to other existing approaches. RISC-V can > easily adopt this binding to represent its own CPU topology. > Thus, both cpu-map DT binding and topology parsing code can be > moved to a common location so that RISC-V or any other > architecture can leverage that. > > The relevant discussion regarding unifying cpu topology can be > found in [1]. > > arch_topology seems to be a perfect place to move the common > code. I have not introduced any functional changes in the moved > code. The only downside in this approach is that the capacity > code will be executed for RISC-V as well. But, it will exit > immediately after not able to find the appropriate DT node. If > the overhead is considered too much, we can always compile out > capacity related functions under a different config for the > architectures that do not support them. > > The patches have been tested for RISC-V and compile tested for > ARM64 & x86. The cpu-map bindings are used for arch/arm too, and so is arch_topology.c. In fact, it was introduced to allow code-sharing between arm and arm64. Applying patch three breaks arm. If we move the DT parsing to arch_topology.c, we have to unify all three architectures. Be aware that arm and arm64 have some differences in how they detect cpu capacities. I think we might have to look at the split of code between arch/* and arch_topology.c again :-/ Morten
Re: [PATCH v5 2/2] sched/fair: update scale invariance of PELT
On Mon, Nov 05, 2018 at 10:10:34AM +0100, Vincent Guittot wrote: > On Fri, 2 Nov 2018 at 16:36, Dietmar Eggemann > wrote: > > > > On 10/26/18 6:11 PM, Vincent Guittot wrote: > > > The current implementation of load tracking invariance scales the > > > contribution with current frequency and uarch performance (only for > > > utilization) of the CPU. One main result of this formula is that the > > > figures are capped by current capacity of CPU. Another one is that the > > > load_avg is not invariant because not scaled with uarch. > > > > > > The util_avg of a periodic task that runs r time slots every p time slots > > > varies in the range : > > > > > > U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) > > > > > > with U being the max util_avg value = SCHED_CAPACITY_SCALE > > > > > > At a lower capacity, the range becomes: > > > > > > U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * > > > (1-y^r')/(1-y^p) > > > > > > with C reflecting the compute capacity ratio between current capacity and > > > max capacity. > > > > > > so C tries to compensate changes in (1-y^r') but it can't be accurate. > > > > > > Instead of scaling the contribution value of the PELT algo, we should scale > > > the > > > running time. The PELT signal aims to track the amount of computation of > > > tasks and/or rq so it seems more correct to scale the running time to > > > reflect the effective amount of computation done since the last update. > > > > > > In order to be fully invariant, we need to apply the same amount of > > > running time and idle time whatever the current capacity. Because running > > > at lower capacity implies that the task will run longer, we have to ensure > > > that the same amount of idle time will be applied when the system becomes idle > > > and no idle time has been "stolen".
But reaching the maximum utilization > > > value (SCHED_CAPACITY_SCALE) means that the task is seen as an > > > always-running task whatever the capacity of the CPU (even at max compute > > > capacity). In this case, we can discard these "stolen" idle times, which > > > become meaningless. > > > > > > In order to achieve this time scaling, a new clock_pelt is created per rq. > > > The increase of this clock scales with current capacity when something > > > is running on rq and synchronizes with clock_task when rq is idle. With > > > this mechanism, we ensure the same running and idle time whatever the > > > current capacity. > > > > Thinking about this new approach on a big.LITTLE platform: > > > > CPU Capacities big: 1024 LITTLE: 512, performance CPUfreq governor > > > > A 50% (runtime/period) task on a big CPU will become an always running > > task on the little CPU. The utilization signal of the task and the > > cfs_rq of the little CPU converges to 1024. > > > > With contrib scaling the utilization signal of the 50% task converges to > > 512 on the little CPU, even though it is always running on it, and so does the > > one of the cfs_rq. > > > > Two 25% tasks on a big CPU will become two 50% tasks on a little CPU. > > The utilization signal of the tasks converges to 512 and the one of the > > cfs_rq of the little CPU converges to 1024. > > > > With contrib scaling the utilization signal of the 25% tasks converges > > to 256 on the little CPU, even though they each run 50% on it, and the one of > > the cfs_rq converges to 512. > > > > So what do we consider system-wide invariance? I thought that e.g. a 25% > > task should have a utilization value of 256 no matter on which CPU it is > > running? > > > > In both cases, the little CPU is not going idle whereas the big CPU does. > > IMO, the key point here is that there is no idle time.
As soon as > there is no idle time, you don't know if a task has enough compute > capacity, so you can't tell the difference between the 50% running task and > an always running task on the little core. > It's also interesting to notice that the task will reach the always > running state after more than 600ms on the little core with utilization > starting from 0. > > Then considering the system-wide invariance, the tasks are not really > invariant. If we take a 50% running task that runs 40ms in a period of > 80ms, the max utilization of the task will be 721 on the big core and > 512 on the little core. > Then, if you take a 39ms running task instead, the utilization on the > big core will reach 709 but it will be 507 on the little core. So your > utilization depends on the current capacity. > With the new proposal, the max utilization will be 709 on big and > little cores for the 39ms running task. For the 40ms running task, the > utilization will be 721 on the big core. Then if the task moves to the > little, it will reach the value 721 after 80ms, then 900 after more > than 160ms and 1000 after 320ms. It has always been debatable what to do with utilization when there are no spare cycles. In Dietmar's example where two 25% tasks are put on a 512 (50%) capacity CPU we add just enough utilization to have no
Re: [PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection
On Mon, Sep 10, 2018 at 10:21:11AM +0200, Ingo Molnar wrote: > > * Morten Rasmussen wrote: > > > The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the > > sched_domain in the hierarchy where all cpu capacities are visible for > > any cpu's point of view on asymmetric cpu capacity systems. The > > > /* > > + * Find the sched_domain_topology_level where all cpu capacities are > > visible > > + * for all cpus. > > + */ > > > + /* > > +* Examine topology from all cpu's point of views to detect the lowest > > +* sched_domain_topology_level where a highest capacity cpu is visible > > +* to everyone. > > +*/ > > > #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ > > -#define SD_ASYM_CPUCAPACITY 0x0040 /* Groups have different max cpu > > capacities */ > > +#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different cpu > > capacities */ > > For future reference: *please* capitalize 'CPU' and 'CPUs' in future patches > like the rest of > the scheduler does. > > You can see it spelled right above the new definition: 'waking CPU' ;-) > > (I fixed this up in this patch.) Noted. Thanks for fixing up the patch. Morten
[tip:sched/core] sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
Commit-ID:  9c63e84db29bcf584040931ad97c2edd11e35f6c
Gitweb:     https://git.kernel.org/tip/9c63e84db29bcf584040931ad97c2edd11e35f6c
Author:     Morten Rasmussen
AuthorDate: Wed, 4 Jul 2018 11:17:50 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:54 +0200

sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains

The 'prefer sibling' sched_domain flag is intended to encourage spreading tasks to sibling sched_domains to take advantage of more caches and cores on SMT systems. It has recently been changed to be on at all non-NUMA topology levels. However, spreading across domains with CPU capacity asymmetry isn't desirable: e.g. spreading from high-capacity to low-capacity CPUs, even if the high-capacity CPUs aren't over-utilized, might give access to more cache but the CPU will be slower and possibly lead to worse overall throughput. To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain level immediately below SD_ASYM_CPUCAPACITY.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/topology.c | 12
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2536e1b938f9..7ffad0d3a4eb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1126,7 +1126,7 @@ sd_init(struct sched_domain_topology_level *tl,
 					| 0*SD_SHARE_CPUCAPACITY
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 0*SD_SERIALIZE
-					| 0*SD_PREFER_SIBLING
+					| 1*SD_PREFER_SIBLING
 					| 0*SD_NUMA
 					| sd_flags
 					,
@@ -1152,17 +1152,21 @@ sd_init(struct sched_domain_topology_level *tl,
 	if (sd->flags & SD_ASYM_CPUCAPACITY) {
 		struct sched_domain *t = sd;

+		/*
+		 * Don't attempt to spread across CPUs of different capacities.
+		 */
+		if (sd->child)
+			sd->child->flags &= ~SD_PREFER_SIBLING;
+
 		for_each_lower_domain(t)
 			t->flags |= SD_BALANCE_WAKE;
 	}

 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */

 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
@@ -1173,6 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->busy_idx = 3;
 		sd->idle_idx = 2;

+		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1182,7 +1187,6 @@ sd_init(struct sched_domain_topology_level *tl,
 #endif
 	} else {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
[tip:sched/core] sched/fair: Consider misfit tasks when load-balancing
Commit-ID:  cad68e552e7774b68ae6a2c5fedb792936098b72
Gitweb:     https://git.kernel.org/tip/cad68e552e7774b68ae6a2c5fedb792936098b72
Author:     Morten Rasmussen
AuthorDate: Wed, 4 Jul 2018 11:17:42 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:50 +0200

sched/fair: Consider misfit tasks when load-balancing

On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand. In these scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-CPU capacity and a destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
   balancing.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe04315d57b3..24fe39e57bc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
+	enum group_type		src_grp_type;
 	struct list_head	tasks;
 };
 
@@ -7873,6 +7874,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/*
+	 * Don't try to pull misfit tasks we can't help.
+	 * We can use max_capacity here as reduction in capacity on some
+	 * CPUs in the group should either be possible to resolve
+	 * internally or be covered by avg_load imbalance (eventually).
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+	     !group_has_capacity(env, &sds->local_stat)))
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -7895,6 +7907,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
+	/*
+	 * If we have more than one misfit sg go with the biggest misfit.
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+		return false;
+
 asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8192,8 +8211,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * factors in sg capacity and sgs with smaller group_type are
 	 * skipped when updating the busiest sg:
 	 */
-	if (busiest->avg_load <= sds->avg_load ||
-	    local->avg_load >= sds->avg_load) {
+	if (busiest->group_type != group_misfit_task &&
+	    (busiest->avg_load <= sds->avg_load ||
+	     local->avg_load >= sds->avg_load)) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
@@ -8227,6 +8247,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
 
+	/* Boost imbalance to allow misfit task to be balanced. */
+	if (busiest->group_type == group_misfit_task) {
+		env->imbalance = max_t(long, env->imbalance,
+				       busiest->group_misfit_task_load);
+	}
+
 	/*
 	 * if *imbalance is less than the average load per runnable task
 	 * there is no guarantee that any tasks will be moved so we'll have
@@ -8293,6 +8319,10 @@ static s
[tip:sched/core] sched/fair: Add 'group_misfit_task' load-balance type
Commit-ID:  3b1baa6496e6b7ad016342a9d256bdfb072ce902
Gitweb:     https://git.kernel.org/tip/3b1baa6496e6b7ad016342a9d256bdfb072ce902
Author:     Morten Rasmussen
AuthorDate: Wed, 4 Jul 2018 11:17:40 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:49 +0200

sched/fair: Add 'group_misfit_task' load-balance type

To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE), load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity in addition to the current average
load based load-balancing policy. Tasks with high utilization that are
scheduled on a lower capacity CPU need to be identified and migrated to
a higher capacity CPU if possible to maximize throughput.

To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only
considered if the system is not overloaded or imbalanced
('group_imbalanced' or 'group_overloaded').

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c  | 54 ++++++++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h |  2 ++
 2 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e5071aeb117..6e04bea5b11a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -693,6 +693,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -1446,7 +1447,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 static unsigned long weighted_cpuload(struct rq *rq);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
@@ -3647,6 +3647,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	WRITE_ONCE(p->se.avg.util_est, ue);
 }
 
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+	return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return;
+
+	if (!p) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	rq->misfit_task_load = task_h_load(p);
+}
+
 #else /* CONFIG_SMP */
 
 #define UPDATE_TG	0x0
@@ -3676,6 +3699,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 static inline void
 util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
		 bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
@@ -6201,7 +6225,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return !task_fits_capacity(p, min_cap);
 }
 
 /*
@@ -6618,9 +6642,12 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	update_misfit_status(p, rq);
+
 	return p;
 
 idle:
+	update_misfit_status(NULL, rq);
 	new_tasks = idle_balance(rq, rf);
 
 	/*
@@ -6826,6 +6853,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+enum group_type {
+	group_other = 0,
+	group_misfit_task,
+	group_imbalanced,
+	group_overloaded,
+};
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_
[tip:sched/core] sched/fair: Add sched_group per-CPU max capacity
Commit-ID:  e3d6d0cb66f2351cbfd09fbae04eb9804afe9577
Gitweb:     https://git.kernel.org/tip/e3d6d0cb66f2351cbfd09fbae04eb9804afe9577
Author:     Morten Rasmussen
AuthorDate: Wed, 4 Jul 2018 11:17:41 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:49 +0200

sched/fair: Add sched_group per-CPU max capacity

The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group in some scenarios, e.g. a
sched_group with multiple CPUs where only one is under heavy pressure.
Tracking maximum capacity isn't perfect either, but it is a better
choice for some situations as it indicates that the sched_group is
definitely compute capacity constrained, either due to rt/irq pressure
on all CPUs or asymmetric CPU capacities (e.g. big.LITTLE).

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c     | 24 ++++++++++++++++++++----
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e04bea5b11a..fe04315d57b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7557,13 +7557,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
+	sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, min_capacity;
+	unsigned long capacity, min_capacity, max_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -7577,6 +7578,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 	capacity = 0;
 	min_capacity = ULONG_MAX;
+	max_capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -7607,6 +7609,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			}
 
 			min_capacity = min(capacity, min_capacity);
+			max_capacity = max(capacity, max_capacity);
 		}
 	} else {
 		/*
@@ -7620,12 +7623,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 			capacity += sgc->capacity;
 			min_capacity = min(sgc->min_capacity, min_capacity);
+			max_capacity = max(sgc->max_capacity, max_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = min_capacity;
+	sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
@@ -7721,16 +7726,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 }
 
 /*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
  */
 static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
 	return sg->sgc->min_capacity * capacity_margin <
 						ref->sgc->min_capacity * 1024;
 }
 
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->max_capacity * capacity_margin <
+						ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -7876,7 +7892,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * power/energy consequences are not considered.
 	 */
 	if (sgs->sum_nr_running <= sgs->group_weight &&
-	    group_smaller_cpu_capacity(sds->local, sg))
+	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
 asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/
[tip:sched/core] sched/topology: Add static_key for asymmetric CPU capacity optimizations
Commit-ID:  df054e8445a4011e3d693c2268129c0456108663
Gitweb:     https://git.kernel.org/tip/df054e8445a4011e3d693c2268129c0456108663
Author:     Morten Rasmussen
AuthorDate: Wed, 4 Jul 2018 11:17:39 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:48 +0200

sched/topology: Add static_key for asymmetric CPU capacity optimizations

The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, as has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 kernel/sched/fair.c     | 3 +++
 kernel/sched/sched.h    | 1 +
 kernel/sched/topology.c | 9 ++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf2a868..3e5071aeb117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,6 +6188,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
 
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return 0;
+
 	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
 	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..0f36adc31ba5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1185,6 +1185,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5c4d583d53ee..b0cdf5e95bda 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -1705,6 +1706,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 	struct sched_domain_topology_level *tl_asym;
+	bool has_asym = false;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
 	if (alloc_state != sa_rootdomain)
@@ -1720,8 +1722,10 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		for_each_sd_topology(tl) {
 			int dflags = 0;
 
-			if (tl == tl_asym)
+			if (tl == tl_asym) {
 				dflags |= SD_ASYM_CPUCAPACITY;
+				has_asym = true;
+			}
 
 			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
 
@@ -1773,6 +1777,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 	rcu_read_unlock();
 
+	if (has_asym)
+		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+
 	if (rq && sched_debug_enabled) {
 		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
 			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
[tip:sched/core] sched/topology, arch/arm: Rebuild sched_domain hierarchy when CPU capacity changes
Commit-ID:  e1799a80a4f5a463f252b7325da8bb66dfd55471
Gitweb:     https://git.kernel.org/tip/e1799a80a4f5a463f252b7325da8bb66dfd55471
Author:     Morten Rasmussen
AuthorDate: Fri, 20 Jul 2018 14:32:34 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:48 +0200

sched/topology, arch/arm: Rebuild sched_domain hierarchy when CPU capacity changes

Asymmetric CPU capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later, triggered by the arch_topology driver. A full rebuild requires the
arch code to implement arch_update_cpu_topology(), which isn't yet
implemented for arm. This patch points the arm implementation to the
arch_topology driver to ensure that a full hierarchy rebuild happens when
needed.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Russell King
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-5-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 arch/arm/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 5d88d2f22b2c..2a786f54d8b8 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -33,6 +33,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity	topology_get_cpu_scale
 
+/* Enable topology flag updates */
+#define arch_update_cpu_topology	topology_update_cpu_topology
+
 #else
 
 static inline void init_cpu_topology(void) { }
[tip:sched/core] sched/topology, drivers/base/arch_topology: Rebuild the sched_domain hierarchy when capacities change
Commit-ID:  bb1fbdd3c3fd12b612c7d8cdf13bd6bfeebdefa3
Gitweb:     https://git.kernel.org/tip/bb1fbdd3c3fd12b612c7d8cdf13bd6bfeebdefa3
Author:     Morten Rasmussen
AuthorDate: Fri, 20 Jul 2018 14:32:32 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:47 +0200

sched/topology, drivers/base/arch_topology: Rebuild the sched_domain hierarchy when capacities change

The setting of SD_ASYM_CPUCAPACITY depends on the per-CPU capacities.
These might not have their final values when the hierarchy is initially
built, as the values depend on cpufreq being initialized or on the
values being set through sysfs. To ensure that the flags are set
correctly we need to rebuild the sched_domain hierarchy whenever the
reported per-CPU capacity (arch_scale_cpu_capacity()) changes.

This patch ensures that a full sched_domain rebuild happens when CPU
capacity changes occur.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Greg Kroah-Hartman
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-3-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 drivers/base/arch_topology.c  | 26 ++++++++++++++++++++++++++
 include/linux/arch_topology.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6ade81..edfcf8d982e4 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/sched/topology.h>
+#include <linux/cpuset.h>
 
 DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
 
@@ -47,6 +48,9 @@ static ssize_t cpu_capacity_show(struct device *dev,
 	return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
 }
 
+static void update_topology_flags_workfn(struct work_struct *work);
+static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
+
 static ssize_t cpu_capacity_store(struct device *dev,
 				  struct device_attribute *attr,
 				  const char *buf,
@@ -72,6 +76,8 @@ static ssize_t cpu_capacity_store(struct device *dev,
 		topology_set_cpu_scale(i, new_capacity);
 	mutex_unlock(&cpu_scale_mutex);
 
+	schedule_work(&update_topology_flags_work);
+
 	return count;
 }
 
@@ -96,6 +102,25 @@ static int register_cpu_capacity_sysctl(void)
 }
 subsys_initcall(register_cpu_capacity_sysctl);
 
+static int update_topology;
+
+int topology_update_cpu_topology(void)
+{
+	return update_topology;
+}
+
+/*
+ * Updating the sched_domains can't be done directly from cpufreq callbacks
+ * due to locking, so queue the work for later.
+ */
+static void update_topology_flags_workfn(struct work_struct *work)
+{
+	update_topology = 1;
+	rebuild_sched_domains();
+	pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
+	update_topology = 0;
+}
+
 static u32 capacity_scale;
 static u32 *raw_capacity;
 
@@ -201,6 +226,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 
 	if (cpumask_empty(cpus_to_visit)) {
 		topology_normalize_cpu_scale();
+		schedule_work(&update_topology_flags_work);
 		free_raw_capacity();
 		pr_debug("cpu_capacity: parsing done\n");
 		schedule_work(&parsing_done_work);
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 2b709416de05..d9bdc1a7f4e7 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -9,6 +9,7 @@
 #include <linux/percpu.h>
 
 void topology_normalize_cpu_scale(void);
+int topology_update_cpu_topology(void);
 
 struct device_node;
 bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu);
[tip:sched/core] sched/topology, arch/arm64: Rebuild the sched_domain hierarchy when the CPU capacity changes
Commit-ID:  3ba09df4b8b6e3f01ed6381e8fb890840fd0bca3
Gitweb:     https://git.kernel.org/tip/3ba09df4b8b6e3f01ed6381e8fb890840fd0bca3
Author:     Morten Rasmussen
AuthorDate: Fri, 20 Jul 2018 14:32:33 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:47 +0200

sched/topology, arch/arm64: Rebuild the sched_domain hierarchy when the CPU capacity changes

Asymmetric CPU capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later, triggered by the arch_topology driver. A full rebuild requires the
arch code to implement arch_update_cpu_topology(), which isn't yet
implemented for arm64. This patch points the arm64 implementation to the
arch_topology driver to ensure that a full hierarchy rebuild happens when
needed.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Catalin Marinas
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-4-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar
---
 arch/arm64/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index 49a0fee4f89b..0524f2438649 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -45,6 +45,9 @@ int pcibus_to_node(struct pci_bus *bus);
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity	topology_get_cpu_scale
 
+/* Enable topology flag updates */
+#define arch_update_cpu_topology	topology_update_cpu_topology
+
 #include <asm-generic/topology.h>
 
 #endif  /* _ASM_ARM_TOPOLOGY_H */
[tip:sched/core] sched/topology: Add SD_ASYM_CPUCAPACITY flag detection
Commit-ID:  05484e0984487d42e97c417cbb0697fa9d16e7e9
Gitweb:     https://git.kernel.org/tip/05484e0984487d42e97c417cbb0697fa9d16e7e9
Author:     Morten Rasmussen
AuthorDate: Fri, 20 Jul 2018 14:32:31 +0100
Committer:  Ingo Molnar
CommitDate: Mon, 10 Sep 2018 11:05:45 +0200

sched/topology: Add SD_ASYM_CPUCAPACITY flag detection

The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all CPU capacities are visible from
any CPU's point of view on asymmetric CPU capacity systems. The
scheduler can then try to take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available CPU
capacities, as asymmetric systems might often appear symmetric at the
smallest level(s) of the sched_domain hierarchy.

The flag has been around for a while but has so far only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all CPUs in the system, i.e. when hotplugging
all CPUs out except those with one particular CPU capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flag should be set therefore depends not only on
topology information but also on the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.

Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state are already taken care of. All the arch is
responsible for is to implement arch_scale_cpu_capacity() and force a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.

Signed-off-by: Morten Rasmussen
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmus...@arm.com
[ Fixed 'CPU' capitalization. ]
Signed-off-by: Ingo Molnar
---
 include/linux/sched/topology.h |  6 +--
 kernel/sched/topology.c        | 81 +++++++++++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..6b9976180c1e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -23,10 +23,10 @@
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY	0x0040	/* Groups have different max cpu capacities */
-#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
+#define SD_ASYM_CPUCAPACITY	0x0040	/* Domain members have different CPU capacities */
+#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share CPU capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
+#define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share CPU pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
 #define SD_ASYM_PACKING		0x0800	/* Place busy groups earlier in the domain */
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 505a41c42b96..5c4d583d53ee 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1061,7 +1061,6 @@ static struct cpumask ***sched_domains_numa_masks;
  *   SD_SHARE_PKG_RESOURCES - describes shared caches
  *   SD_NUMA                - describes NUMA topologies
  *   SD_SHARE_POWERDOMAIN   - describes shared power domain
- *   SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
  *
  * Odd one out, which beside describing the topology has a quirk also
  * prescribes the desired behaviour that goes along with it:
@@ -1073,13 +1072,12 @@ static struct cpumask ***sched_domains_numa_masks;
 	(SD_SHARE_CPUCAPACITY	|	\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA		|	\
 	 SD_ASYM_PACKING	|	\
-	 SD_ASYM_CPUCAPACITY	|	\
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl,
 	const struct cpumask *cpu_map,
-	struct sched_domain *child, int cpu)
+	struct sched_domain *child, int dflags, int cpu)
Re: [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations
On Tue, Jul 31, 2018 at 12:59:16PM +0200, Peter Zijlstra wrote: > > Combined with that SD_ASYM.. rework I ended up with the below. > > Holler if you want it changed :-) Looks good to me. Thanks, Morten
Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
On Tue, Jul 31, 2018 at 01:10:49PM +0100, Valentin Schneider wrote: > Hi Peter, > > On 31/07/18 13:00, Peter Zijlstra wrote: > > > > > > Aside from the first patch, which I posted the change on, I've picked up > > until 10. I think that other SD_ASYM patch-set replaces 11 and 12, > > right? > > > 11 is no longer needed, but AFAICT we still need 12 - we don't want > PREFER_SIBLING to interfere with asymmetric systems. Yes, we still want patch 12 if possible.
Re: [PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection
On Mon, Jul 23, 2018 at 05:07:50PM +0100, Qais Yousef wrote:
> On 23/07/18 16:27, Morten Rasmussen wrote:
> >It does increase the cost of things like hotplug slightly and
> >repartitioning of root_domains a slightly but I don't see how we can
> >avoid it if we want generic code to set this flag. If the costs are not
> >acceptable I think the only option is to make the detection architecture
> >specific.
> 
> I think hotplug is already expensive and this overhead would be small in
> comparison. But this could be called when frequency changes if I understood
> correctly - this is the one I wasn't sure how 'hot' it could be. I wouldn't
> expect frequency changes at a very high rate because it's relatively
> expensive too..

A frequency change shouldn't lead to a flag change or a rebuild of the
sched_domain hierarchy. The situations where the hierarchy should be
rebuilt to update the flag are: during boot, as we only know the amount
of asymmetry once cpufreq has been initialized; when cpus are hotplugged
in/out; and when root_domains change due to cpuset reconfiguration. So
it should be a relatively rare event.
Re: [PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection
On Mon, Jul 23, 2018 at 02:25:34PM +0100, Qais Yousef wrote: > Hi Morten > > On 20/07/18 14:32, Morten Rasmussen wrote: > >The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the > >sched_domain in the hierarchy where all cpu capacities are visible for > >any cpu's point of view on asymmetric cpu capacity systems. The > >scheduler can then take to take capacity asymmetry into account when > > Did you mean "s/take to take/try to take/"? Yes. [...] > >+/* > >+ * Examine topology from all cpu's point of views to detect the lowest > >+ * sched_domain_topology_level where a highest capacity cpu is visible > >+ * to everyone. > >+ */ > >+for_each_cpu(i, cpu_map) { > >+unsigned long max_capacity = arch_scale_cpu_capacity(NULL, i); > >+int tl_id = 0; > >+ > >+for_each_sd_topology(tl) { > >+if (tl_id < asym_level) > >+goto next_level; > >+ > > I think if you increment and then continue here you might save the extra > branch. I didn't look at any disassembly though to verify the generated > code. > > I wonder if we can introduce for_each_sd_topology_from(tl, starting_level) > so that you can start searching from a provided level - which will make this > skipping logic unnecessary? So the code will look like > > for_each_sd_topology_from(tl, asymc_level) { > ... > } Both options would work. Increment+continue instead of goto would be slightly less readable I think since we would still have the increment at the end of the loop, but easy to do. Introducing for_each_sd_topology_from() improves things too, but I wonder if it is worth it. 
> >@@ -1647,18 +1707,27 @@ build_sched_domains(const struct cpumask *cpu_map, > >struct sched_domain_attr *attr > > struct s_data d; > > struct rq *rq = NULL; > > int i, ret = -ENOMEM; > >+struct sched_domain_topology_level *tl_asym; > > alloc_state = __visit_domain_allocation_hell(&d, cpu_map); > > if (alloc_state != sa_rootdomain) > > goto error; > >+tl_asym = asym_cpu_capacity_level(cpu_map); > >+ > > Or maybe this is not a hot path and we don't care that much about optimizing > the search since you call it unconditionally here even for systems that > don't care? It does increase the cost of things like hotplug slightly and repartitioning of root_domains slightly but I don't see how we can avoid it if we want generic code to set this flag. If the costs are not acceptable I think the only option is to make the detection architecture specific. In any case, AFAIK rebuilding the sched_domain hierarchy shouldn't be a normal and common thing to do. If checking for the flag is not acceptable on SMP-only architectures, I can move it under arch/arm[,64] although it is not as clean. Morten
Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
On Thu, Jul 05, 2018 at 04:03:11PM +0100, Quentin Perret wrote: > On Thursday 05 Jul 2018 at 15:13:49 (+0100), Morten Rasmussen wrote: > > 3. Detecting the flag in generic kernel/sched/* code means that all > > architectures will pay for the overhead when building/rebuilding the > > sched_domain hierarchy, and all architectures that set the cpu > > capacities to asymmetric will set the flag whether they like it or not. > > I'm not sure if this is a problem. > > That is true as well ... > > > > > In the end it is really about how much of this we want in generic code > > and how much we hide in arch/, and if we dare to touch the sched_domain > > build code ;-) > > Right so you can argue that the arch code is here to give you a > system-level information, and that if the scheduler wants to virtually > split that system, then it's its job to make sure that happens properly. > That is exactly what your patch does (IIUC), and I now think that this > is a very sensible middle-ground option. But this is debatable so I'm > interested to see what others think :-) I went ahead and hacked up some patches that set the flag automatically as part of the sched_domain build process. I posted them so people can have a look: 1532093554-30504-1-git-send-email-morten.rasmus...@arm.com With those patches this patch has to be reverted/dropped. Morten
[PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the sched_domain in the hierarchy where all cpu capacities are visible from any cpu's point of view on asymmetric cpu capacity systems. The scheduler can then try to take capacity asymmetry into account when balancing at this level. It also serves as an indicator for how wide task placement heuristics have to search to consider all available cpu capacities, as asymmetric systems might often appear symmetric at the smallest level(s) of the sched_domain hierarchy. The flag has been around for a while but has so far only been set by out-of-tree code in Android kernels. One solution is to let each architecture provide the flag through a custom sched_domain topology array and associated mask and flag functions. However, SD_ASYM_CPUCAPACITY is special in the sense that it depends on the capacity and presence of all cpus in the system, i.e. when hotplugging all cpus out except those with one particular cpu capacity the flag should disappear even if the sched_domains don't collapse. Similarly, the flag is affected by cpusets where load-balancing is turned off. Detecting when the flag should be set therefore depends not only on topology information but also on the cpuset configuration and hotplug state. The arch code doesn't have easy access to the cpuset configuration. Instead, this patch implements the flag detection in generic code where cpusets and hotplug state are already taken care of. All the arch is responsible for is implementing arch_scale_cpu_capacity() and forcing a full rebuild of the sched_domain hierarchy if capacities are updated, e.g. later in the boot process when cpufreq has initialized. 
cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- include/linux/sched/topology.h | 2 +- kernel/sched/topology.c| 81 ++ 2 files changed, 76 insertions(+), 7 deletions(-) diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 26347741ba50..4fe2e49ab13b 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -23,7 +23,7 @@ #define SD_BALANCE_FORK0x0008 /* Balance on fork, clone */ #define SD_BALANCE_WAKE0x0010 /* Balance on wakeup */ #define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */ -#define SD_ASYM_CPUCAPACITY0x0040 /* Groups have different max cpu capacities */ +#define SD_ASYM_CPUCAPACITY0x0040 /* Domain members have different cpu capacities */ #define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu capacity */ #define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */ #define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */ diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 05a831427bc7..b8f41d557612 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1061,7 +1061,6 @@ static struct cpumask ***sched_domains_numa_masks; * SD_SHARE_PKG_RESOURCES - describes shared caches * SD_NUMA- describes NUMA topologies * SD_SHARE_POWERDOMAIN - describes shared power domain - * SD_ASYM_CPUCAPACITY- describes mixed capacity topologies * * Odd one out, which beside describing the topology has a quirk also * prescribes the desired behaviour that goes along with it: @@ -1073,13 +1072,12 @@ static struct cpumask ***sched_domains_numa_masks; SD_SHARE_PKG_RESOURCES | \ SD_NUMA| \ SD_ASYM_PACKING| \ -SD_ASYM_CPUCAPACITY| \ SD_SHARE_POWERDOMAIN) static struct sched_domain * sd_init(struct sched_domain_topology_level *tl, const struct cpumask *cpu_map, - struct sched_domain *child, int cpu) + struct sched_domain *child, int dflags, int cpu) { struct sd_data *sdd = &tl->data; struct sched_domain *sd = 
*per_cpu_ptr(sdd->sd, cpu); @@ -1100,6 +1098,9 @@ sd_init(struct sched_domain_topology_level *tl, "wrong sd_flags in topology description\n")) sd_flags &= ~TOPOLOGY_SD_FLAGS; + /* Apply detected topology flags */ + sd_flags |= dflags; + *sd = (struct sched_domain){ .min_interval = sd_weight, .max_interval = 2*sd_weight, @@ -1607,9 +1608,9 @@ static void __sdt_free(const struct cpumask *cpu_map) static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl, const struct cpumask *cpu_map, struct sched_domain_attr *attr, - struct sched_domain *child, int cpu) + struct sched_domain *child, int dflags, int cpu) { - struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu); + struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu); if (child) { sd->l
[PATCH 4/4] arch/arm: Rebuild sched_domain hierarchy when cpu capacity changes
Asymmetric cpu capacity can not necessarily be determined accurately at the time the initial sched_domain hierarchy is built during boot. It is therefore necessary to be able to force a full rebuild of the hierarchy later triggered by the arch_topology driver. A full rebuild requires the arch-code to implement arch_update_cpu_topology() which isn't yet implemented for arm. This patch points the arm implementation at the arch_topology driver to ensure that a full hierarchy rebuild happens when needed. cc: Russell King Signed-off-by: Morten Rasmussen --- arch/arm/include/asm/topology.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h index 5d88d2f22b2c..2a786f54d8b8 100644 --- a/arch/arm/include/asm/topology.h +++ b/arch/arm/include/asm/topology.h @@ -33,6 +33,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu); /* Replace task scheduler's default cpu-invariant accounting */ #define arch_scale_cpu_capacity topology_get_cpu_scale +/* Enable topology flag updates */ +#define arch_update_cpu_topology topology_update_cpu_topology + #else static inline void init_cpu_topology(void) { } -- 2.7.4
[PATCH 2/4] drivers/base/arch_topology: Rebuild sched_domain hierarchy when capacities change
The setting of SD_ASYM_CPUCAPACITY depends on the per-cpu capacities. These might not have their final values when the hierarchy is initially built as the values depend on cpufreq being initialized or the values being set through sysfs. To ensure that the flags are set correctly we need to rebuild the sched_domain hierarchy whenever the reported per-cpu capacity (arch_scale_cpu_capacity()) changes. This patch ensures that a full sched_domain rebuild happens when cpu capacity changes occur. cc: Greg Kroah-Hartman Signed-off-by: Morten Rasmussen --- drivers/base/arch_topology.c | 26 ++ include/linux/arch_topology.h | 1 + 2 files changed, 27 insertions(+) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index e7cb0c6ade81..edfcf8d982e4 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -15,6 +15,7 @@ #include #include #include +#include DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE; @@ -47,6 +48,9 @@ static ssize_t cpu_capacity_show(struct device *dev, return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id)); } +static void update_topology_flags_workfn(struct work_struct *work); +static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); + static ssize_t cpu_capacity_store(struct device *dev, struct device_attribute *attr, const char *buf, @@ -72,6 +76,8 @@ static ssize_t cpu_capacity_store(struct device *dev, topology_set_cpu_scale(i, new_capacity); mutex_unlock(&cpu_scale_mutex); + schedule_work(&update_topology_flags_work); + return count; } @@ -96,6 +102,25 @@ static int register_cpu_capacity_sysctl(void) } subsys_initcall(register_cpu_capacity_sysctl); +static int update_topology; + +int topology_update_cpu_topology(void) +{ + return update_topology; +} + +/* + * Updating the sched_domains can't be done directly from cpufreq callbacks + * due to locking, so queue the work for later. 
+ */ +static void update_topology_flags_workfn(struct work_struct *work) +{ + update_topology = 1; + rebuild_sched_domains(); + pr_debug("sched_domain hierarchy rebuilt, flags updated\n"); + update_topology = 0; +} + static u32 capacity_scale; static u32 *raw_capacity; @@ -201,6 +226,7 @@ init_cpu_capacity_callback(struct notifier_block *nb, if (cpumask_empty(cpus_to_visit)) { topology_normalize_cpu_scale(); + schedule_work(&update_topology_flags_work); free_raw_capacity(); pr_debug("cpu_capacity: parsing done\n"); schedule_work(&parsing_done_work); diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h index 2b709416de05..d9bdc1a7f4e7 100644 --- a/include/linux/arch_topology.h +++ b/include/linux/arch_topology.h @@ -9,6 +9,7 @@ #include void topology_normalize_cpu_scale(void); +int topology_update_cpu_topology(void); struct device_node; bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu); -- 2.7.4
[PATCH 0/4] sched/topology: Set SD_ASYM_CPUCAPACITY flag automatically
The SD_ASYM_CPUCAPACITY flag has been around for some time now with no code to actually set it. Android has carried patches to do this out-of-tree in the meantime. The flag is meant to indicate cpu capacity asymmetry and is set at the topology level where the sched_domain spans all available cpu capacity in the system, i.e. all core types are visible, for any cpu in the system. The flag was merged as being a topology flag, meaning that the architecture had to provide the flag explicitly. However, when mixed with cpusets splitting the system into multiple root_domains, the flag can't be set without knowledge about the cpusets. Rather than exposing cpusets to architecture code, this patch set moves the responsibility for setting the flag to generic topology code, which is simpler and makes the code architecture-agnostic. Morten Rasmussen (4): sched/topology: SD_ASYM_CPUCAPACITY flag detection drivers/base/arch_topology: Rebuild sched_domain hierarchy when capacities change arch/arm64: Rebuild sched_domain hierarchy when cpu capacity changes arch/arm: Rebuild sched_domain hierarchy when cpu capacity changes arch/arm/include/asm/topology.h | 3 ++ arch/arm64/include/asm/topology.h | 3 ++ drivers/base/arch_topology.c | 26 + include/linux/arch_topology.h | 1 + include/linux/sched/topology.h| 2 +- kernel/sched/topology.c | 81 --- 6 files changed, 109 insertions(+), 7 deletions(-) -- 2.7.4
[PATCH 3/4] arch/arm64: Rebuild sched_domain hierarchy when cpu capacity changes
Asymmetric cpu capacity can not necessarily be determined accurately at the time the initial sched_domain hierarchy is built during boot. It is therefore necessary to be able to force a full rebuild of the hierarchy later triggered by the arch_topology driver. A full rebuild requires the arch-code to implement arch_update_cpu_topology() which isn't yet implemented for arm64. This patch points the arm64 implementation at the arch_topology driver to ensure that a full hierarchy rebuild happens when needed. cc: Catalin Marinas cc: Will Deacon Signed-off-by: Morten Rasmussen --- arch/arm64/include/asm/topology.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h index df48212f767b..61ba09d48237 100644 --- a/arch/arm64/include/asm/topology.h +++ b/arch/arm64/include/asm/topology.h @@ -43,6 +43,9 @@ int pcibus_to_node(struct pci_bus *bus); /* Replace task scheduler's default cpu-invariant accounting */ #define arch_scale_cpu_capacity topology_get_cpu_scale +/* Enable topology flag updates */ +#define arch_update_cpu_topology topology_update_cpu_topology + #include #endif /* _ASM_ARM_TOPOLOGY_H */ -- 2.7.4
Re: arm: v4.18-rc5 with cpuidle on TC2 (A7 boot) spectre v2 issue
On Thu, Jul 19, 2018 at 02:32:22PM +0100, Russell King - ARM Linux wrote: > On Thu, Jul 19, 2018 at 11:01:10AM +0100, Russell King - ARM Linux wrote: > > On Thu, Jul 19, 2018 at 11:42:50AM +0200, Dietmar Eggemann wrote: > > > Hi, > > > > > > running v4.18-rc5 (plus still missing "power: vexpress: fix corruption in > > > notifier registration", otherwise I get this rcu_sched stall issue) on TC2 > > > (A7 boot) with vanilla multi_v7_defconfig plus > > > CONFIG_ARM_BIG_LITTLE_CPUIDLE=y gives me continuous: > > > > > > ... > > > CPUX: Spectre v2: incorrect context switching function, system vulnerable > > > ... > > > > > > messages. > > > > > > Work around is to disable CONFIG_HARDEN_BRANCH_PREDICTOR. > > > > or disable big.Little if you want the hardening. > > > > The choices are currently either protection against Spectre or big.Little > > support since the two are mutually exclusive at the moment. > > An alternative would be to give the patches in the attachment a test. > They're not finished yet, so I haven't sent them out, but still worth > testing.

Thanks for sharing. I can confirm that your patches do cure the flood of warnings.

TC2 booting on A7:
[0.002922] CPU: Testing write buffer coherency: ok
[0.003347] CPU0: thread -1, cpu 0, socket 1, mpidr 8100
[0.004022] Setting up static identity map for 0x8010 - 0x80100060
[0.004265] ARM CCI driver probed
[0.004648] TC2 power management initialized
[0.004930] Hierarchical SRCU implementation.
[0.006956] smp: Bringing up secondary CPUs ...
[0.008712] CPU1: thread -1, cpu 0, socket 0, mpidr 8000
[0.008720] CPU1: Spectre v2: firmware did not set auxiliary control register IBE bit, system vulnerable
[0.009934] CPU2: thread -1, cpu 1, socket 0, mpidr 8001
[0.009940] CPU2: Spectre v2: firmware did not set auxiliary control register IBE bit, system vulnerable
[0.011147] CPU3: thread -1, cpu 1, socket 1, mpidr 8101
[0.012350] CPU4: thread -1, cpu 2, socket 1, mpidr 8102
[0.012468] smp: Brought up 1 node, 5 CPUs
[0.012490] SMP: Total of 5 processors activated (240.00 BogoMIPS).
[0.012499] CPU: All CPU(s) started in SVC mode.

TC2 booting on A15:
[0.002045] CPU0: Spectre v2: firmware did not set auxiliary control register IBE bit, system vulnerable
[0.002311] CPU0: thread -1, cpu 0, socket 0, mpidr 8000
[0.002809] Setting up static identity map for 0x8010 - 0x80100060
[0.003000] ARM CCI driver probed
[0.003408] TC2 power management initialized
[0.003637] Hierarchical SRCU implementation.
[0.005177] smp: Bringing up secondary CPUs ...
[0.006170] CPU1: thread -1, cpu 1, socket 0, mpidr 8001
[0.006176] CPU1: Spectre v2: firmware did not set auxiliary control register IBE bit, system vulnerable
[0.008137] CPU2: thread -1, cpu 0, socket 1, mpidr 8100
[0.009304] CPU3: thread -1, cpu 1, socket 1, mpidr 8101
[0.010405] CPU4: thread -1, cpu 2, socket 1, mpidr 8102
[0.010537] smp: Brought up 1 node, 5 CPUs
[0.010562] SMP: Total of 5 processors activated (240.00 BogoMIPS).
[0.010572] CPU: All CPU(s) started in SVC mode.

No further warnings for either configuration.
For reference, this is a partial output from later in the boot process when booting on A7 with 4.18-rc5 _without_ your patches:

[5.576176] device-mapper: ioctl: 4.39.0-ioctl (2018-04-03) initialised: dm-de...@redhat.com
[5.601689] cpu cpu0: bL_cpufreq_init: CPU 0 initialized
[5.618670] cpu cpu1: bL_cpufreq_init: CPU 1 initialized
[5.635583] arm_big_little: bL_cpufreq_register: Registered platform driver: vexpress-spc
[5.661112] mmci-pl18x 1c05.mmci: Got CD GPIO
[5.675235] mmci-pl18x 1c05.mmci: Got WP GPIO
[5.687783] CPU2: Spectre v2: incorrect context switching function, system vulnerable
[5.689623] mmci-pl18x 1c05.mmci: mmc0: PL180 manf 41 rev0 at 0x1c05 irq 26,27 (pio)
[5.713217] CPU1: Spectre v2: incorrect context switching function, system vulnerable
[5.718044] CPU2: Spectre v2: incorrect context switching function, system vulnerable
[5.727896] CPU2: Spectre v2: incorrect context switching function, system vulnerable
Re: arm: v4.18-rc5 with cpuidle on TC2 (A7 boot) spectre v2 issue
On Thu, Jul 19, 2018 at 11:01:10AM +0100, Russell King - ARM Linux wrote: > On Thu, Jul 19, 2018 at 11:42:50AM +0200, Dietmar Eggemann wrote: > > Hi, > > > > running v4.18-rc5 (plus still missing "power: vexpress: fix corruption in > > notifier registration", otherwise I get this rcu_sched stall issue) on TC2 > > (A7 boot) with vanilla multi_v7_defconfig plus > > CONFIG_ARM_BIG_LITTLE_CPUIDLE=y gives me continuous: > > > > ... > > CPUX: Spectre v2: incorrect context switching function, system vulnerable > > ... > > > > messages. > > > > Work around is to disable CONFIG_HARDEN_BRANCH_PREDICTOR. > > or disable big.Little if you want the hardening. > > The choices are currently either protection against Spectre or big.Little > support since the two are mutually exclusive at the moment. Would it be possible to make those messages only appear once, like they do when booting on A15? As it is, we have to change the default setting of this new option to make the platform usable, as those messages are flooding the console. I see >40 messages per second. Morten
Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote: > Hi Morten, > > On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen > wrote: > > > > On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial > > for performance that cpu intensive tasks are aggressively migrated to > > high capacity cpus as soon as those become available. The capacity > > awareness tweaks already in the wake-up path can't handle this as such > > tasks might run or be runnable forever. If they happen to be placed on a > > low capacity cpu from the beginning they are stuck there forever while > > high capacity cpus may have become available in the meantime. > > > > To address this issue this patch set introduces a new "misfit" > > load-balancing scenario in periodic/nohz/newly idle balance which tweaks > > the load-balance conditions to ignore load per capacity in certain > > cases. Since misfit tasks are commonly running alone on a cpu, more > > aggressive active load-balancing is needed too. > > > > The fundamental idea of this patch set has been in Android kernels for a > > long time and is absolutely essential for consistent performance on > > asymmetric cpu capacity systems. > > > > As already said , I'm not convinced by the proposal which seems quite > complex and also adds some kind of arbitrary and fixed power > management policy by deciding which tasks can or not go on big cores > whereas there are other frameworks to take such decision like EAS or > cgroups. The misfit patches are a crucial part of the EAS solution but they also make sense for some users on their own without an energy model. This is why they are posted separately. 
We have already discussed at length why the patches are needed and why they look like they do in this thread: https://lore.kernel.org/lkml/cakftptd4skw_3sak--vbec5-m1ua48bjoqys0pdqw3npsps...@mail.gmail.com/ > Furthermore, there is already something similar in the kernel > with SD_ASYM_PACKING and IMO, it would be better to improve this > feature (if needed) instead of adding a new one which often do similar > things. As said in the previous thread, while it might look similar it isn't. SD_ASYM_PACKING isn't utilization-based, which is the key metric used for EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a different purpose (see the previous thread for details). > I have rerun your tests and got same results than misfit task patchset > on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology > and fake dynamiQ topology. And it give better performance when the > pinned tasks are short and scheduler has to wait for the task to > increase their utilization before getting a chance to migrate on big > core. Right, the test cases are quite simple and could be served better by SD_ASYM_PACKING. As we already discussed in that thread, that is due to the PELT lag, but this is the cost we have to pay if we don't have additional information about the requirements of the task and we don't want to default to big-first with all its implications. We have covered all this in the thread in early April. > Then, I have tested SD_ASYM_PACKING with EAS patchset and they work > together for b/L and dynamiQ topology Could you provide some more details about your evaluation? It probably works well for some use-cases but it isn't really designed for what we need for EAS. Morten
Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote: > On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen > wrote: > > > > The 'prefer sibling' sched_domain flag is intended to encourage > > spreading tasks to sibling sched_domain to take advantage of more caches > > and core for SMT systems. It has recently been changed to be on all > > non-NUMA topology level. However, spreading across domains with cpu > > capacity asymmetry isn't desirable, e.g. spreading from high capacity to > > low capacity cpus even if high capacity cpus aren't overutilized might > > give access to more cache but the cpu will be slower and possibly lead > > to worse overall throughput. > > > > To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain > > level immediately below SD_ASYM_CPUCAPACITY. > > This makes sense. Nevertheless, this patch also raises a scheduling > problem and break the 1 task per CPU policy that is enforced by > SD_PREFER_SIBLING. Scheduling one task per cpu when n_task == n_cpus on asymmetric topologies is generally broken already, and this patch set doesn't fix that problem. SD_PREFER_SIBLING might seem to help in very specific cases: n_little_cpus == n_big_cpus. In that case the little group might be classified as overloaded. It doesn't guarantee that anything gets pulled, as the grp_load/grp_capacity in the imbalance calculation on some systems still says the little cpus are more loaded than the bigs despite one of them being idle. That depends on the little cpu capacities. On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken, as it assumes the group_weight to be the same. This is the case on Juno and several other platforms. IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might help for a limited subset of topologies/capacities, but the right solution is to change the imbalance calculation. As the name says, it is meant to spread tasks and does so unconditionally.
For asymmetric systems we would like to consider cpu capacity before migrating tasks. > When running the tests of your cover letter, 1 long > running task is often co scheduled on a big core whereas short pinned > tasks are still running and a little core is idle which is not an > optimal scheduling decision This can easily happen with SD_PREFER_SIBLING enabled too, so I wouldn't say that this patch breaks anything that isn't broken already. In fact we see this happening with and without this patch applied. Morten
Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
On Thu, Jul 05, 2018 at 02:31:43PM +0100, Quentin Perret wrote:
> Hi Morten,
>
> On Wednesday 04 Jul 2018 at 11:17:49 (+0100), Morten Rasmussen wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 71330e0e41db..29c186961345 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
> >  	sd_id = cpumask_first(sched_domain_span(sd));
> >
> >  	/*
> > +	 * Check if cpu_map eclipses cpu capacity asymmetry.
> > +	 */
> > +
> > +	if (sd->flags & SD_ASYM_CPUCAPACITY) {
> > +		int i;
> > +		bool disable = true;
> > +		long capacity = arch_scale_cpu_capacity(NULL, sd_id);
> > +
> > +		for_each_cpu(i, sched_domain_span(sd)) {
> > +			if (capacity != arch_scale_cpu_capacity(NULL, i)) {
> > +				disable = false;
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (disable)
> > +			sd->flags &= ~SD_ASYM_CPUCAPACITY;
> > +	}
> > +
> > +	/*
> >  	 * Convert topological properties into behaviour.
> >  	 */
>
> If SD_ASYM_CPUCAPACITY means that some CPUs have different
> arch_scale_cpu_capacity() values, we could also automatically _set_
> the flag in sd_init() no ? Why should we let the arch set it and just
> correct it later ?
>
> I understand the moment at which we know the capacities of CPUs varies
> from arch to arch, but the arch code could just call
> rebuild_sched_domain when the capacities of CPUs change and let the
> scheduler detect things automatically. I mean, even if the arch code
> sets the flag in its topology level table, it will have to rebuild
> the sched domains anyway ...
>
> What do you think ?

We could as well set the flag here so the architecture doesn't have to do it. It is a bit more complicated though, for a few reasons:

1. Detecting when to disable the flag is a lot simpler than checking which level it should be set on.
You basically have to work your way up from the lowest topology level until you get to a level spanning all the capacities available in the system to figure out where the flag should be set. I don't think this fits easily with how we build the sched_domain hierarchy. It can of course be done.

2. As you say, we still need the arch code (or cpufreq?) to rebuild the whole thing once we know that the capacities have been determined. That currently implies implementing arch_update_cpu_topology(), which is arch-specific. So we would need some arch code to make the rebuild happen at the right point in time. If the scheduler itself should trigger the rebuild, we need another way to force a full rebuild. This can also be done.

3. Detecting the flag in generic kernel/sched/* code means that all architectures will pay for the overhead when building/rebuilding the sched_domain hierarchy, and all architectures that set the cpu capacities to asymmetric will set the flag whether they like it or not. I'm not sure if this is a problem.

In the end it is really about how much of this we want in generic code and how much we hide in arch/, and if we dare to touch the sched_domain build code ;-) Morten
Re: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
Hi, On Tue, Jul 03, 2018 at 02:28:28AM +, Gaku Inami wrote: > Hi, > > > -Original Message- > > From: Morten Rasmussen > > Sent: Wednesday, June 20, 2018 6:06 PM > > To: pet...@infradead.org; mi...@redhat.com > > Cc: valentin.schnei...@arm.com; dietmar.eggem...@arm.com; > > vincent.guit...@linaro.org; Gaku Inami > > ; linux-kernel@vger.kernel.org; Morten Rasmussen > > > > Subject: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric > > capacity systems > [snip] > > > > The patches have been tested on: > >1. Arm Juno (r0): 2+4 Cortex A57/A53 > >2. Hikey960: 4+4 Cortex A73/A53 > > > > Test case: > > Big cpus are always kept busy. Pin a shorter running sysbench tasks to > > big cpus, while creating a longer running set of unpinned sysbench > > tasks. > > I have tested v3 patches on Renesas SoC again. It looks fine. > > You can add: > > Tested-by: Gaku Inami > > The patches have been tested on: > 3. Renesas R-Car H3 : 4+4 Cortex A57/A53
>
> Results:
> Single runs with completion time of each task
> R-Car H3 (tip)
> total time: 0.9391s
> total time: 0.9865s
> total time: 1.3691s
> total time: 1.6740s
>
> R-Car H3 (misfit)
> total time: 0.9368s
> total time: 0.9475s
> total time: 0.9471s
> total time: 0.9505s
>
> 10 run summary (tracking longest running task for each run)
> R-Car H3
>          avg     max
> tip      1.6742  1.6750
> misfit   0.9784  0.9905

Thanks for testing again. I have just posted v4 with some minor changes. Behaviour for the test-cases should be the same. Morten
[PATCHv4 02/12] sched/fair: Add group_misfit_task load-balance type
To maximize throughput in systems with asymmetric cpu capacities (e.g. ARM big.LITTLE), load-balancing has to consider task and cpu utilization as well as per-cpu compute capacity, in addition to the current average-load-based policy. Tasks with high utilization that are scheduled on a lower capacity cpu need to be identified and migrated to a higher capacity cpu if possible.

To implement this additional policy, an additional group_type (load-balance scenario) is added: group_misfit_task. This represents scenarios where a sched_group has one or more tasks that are not suitable for its per-cpu capacity. group_misfit_task is only considered if the system is not overloaded or imbalanced (group_imbalanced or group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid taking remote rq locks to examine source sched_groups for misfit tasks, each cpu is responsible for tracking misfit tasks itself and updating the rq->misfit_task flag. This means checking task utilization when tasks are scheduled and on sched_tick.
cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 54 kernel/sched/sched.h | 2 ++ 2 files changed, 48 insertions(+), 8 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 85fb7e8ff5c8..e05e5202a1d2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se) static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu); static unsigned long task_h_load(struct task_struct *p); +static unsigned long capacity_of(int cpu); /* Give new sched_entity start runnable values to heavy its load in infant time */ void init_entity_runnable_average(struct sched_entity *se) @@ -1448,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, static unsigned long weighted_cpuload(struct rq *rq); static unsigned long source_load(int cpu, int type); static unsigned long target_load(int cpu, int type); -static unsigned long capacity_of(int cpu); /* Cached statistics for all CPUs within a node */ struct numa_stats { @@ -4035,6 +4035,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) WRITE_ONCE(p->se.avg.util_est, ue); } +static inline int task_fits_capacity(struct task_struct *p, long capacity) +{ + return capacity * 1024 > task_util_est(p) * capacity_margin; +} + +static inline void update_misfit_status(struct task_struct *p, struct rq *rq) +{ + if (!static_branch_unlikely(&sched_asym_cpucapacity)) + return; + + if (!p) { + rq->misfit_task_load = 0; + return; + } + + if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) { + rq->misfit_task_load = 0; + return; + } + + rq->misfit_task_load = task_h_load(p); +} + #else /* CONFIG_SMP */ static inline int @@ -4070,6 +4093,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {} static inline void util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) {} +static inline void
update_misfit_status(struct task_struct *p, struct rq *rq) {} #endif /* CONFIG_SMP */ @@ -6596,7 +6620,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) /* Bring task utilization in sync with prev_cpu */ sync_entity_load_avg(&p->se); - return min_cap * 1024 < task_util(p) * capacity_margin; + return !task_fits_capacity(p, min_cap); } /* @@ -7013,9 +7037,12 @@ done: __maybe_unused; if (hrtick_enabled(rq)) hrtick_start_fair(rq, p); + update_misfit_status(p, rq); + return p; idle: + update_misfit_status(NULL, rq); new_tasks = idle_balance(rq, rf); /* @@ -7221,6 +7248,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; enum fbq_type { regular, remote, all }; +enum group_type { + group_other = 0, + group_misfit_task, + group_imbalanced, + group_overloaded, +}; + #define LBF_ALL_PINNED 0x01 #define LBF_NEED_BREAK 0x02 #define LBF_DST_PINNED 0x04 @@ -7762,12 +7796,6 @@ static unsigned long task_h_load(struct task_struct *p) /****** Helpers for find_busiest_group ******/ -enum group_type { - group_other = 0, - group_imbalanced, - group_overloaded, -}; - /* * sg_lb_stats - stats of a sched_group required for load_balancing */ @@ -7783,6 +7811,7 @@ struct sg_lb_stats { unsigned int group_weight; enum group_type group_type; int group_no_capacity; + unsigned long group_misfit_task_load; /* A cpu has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANC
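As a quick illustration of the fitness test the patch adds: with the mainline capacity_margin of 1280, task_fits_capacity() reduces to a scaled comparison where a task "fits" a cpu only if its estimated utilization stays below roughly 80% of the cpu's capacity. The standalone sketch below mirrors that arithmetic outside the kernel; the utilization and capacity values are made up for illustration, not taken from a live system.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024
/* Mainline uses capacity_margin = 1280, i.e. a ~25% headroom requirement. */
static const unsigned long capacity_margin = 1280;

/* Mirrors the kernel's task_fits_capacity() check: the task fits if
 * capacity * 1024 > util * 1280, i.e. util < ~80% of capacity. */
int task_fits_capacity(unsigned long task_util, unsigned long capacity)
{
	return capacity * SCHED_CAPACITY_SCALE > task_util * capacity_margin;
}
```

For example, a task with utilization 400 fits a little cpu of capacity 512 (400 < 0.8 * 512), while one at 450 does not and stays flagged as misfit via rq->misfit_task_load until it can be migrated to a bigger cpu.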
[PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial for performance that cpu intensive tasks are aggressively migrated to high capacity cpus as soon as those become available. The capacity awareness tweaks already in the wake-up path can't handle this as such tasks might run or be runnable forever. If they happen to be placed on a low capacity cpu from the beginning they are stuck there forever while high capacity cpus may have become available in the meantime. To address this issue this patch set introduces a new "misfit" load-balancing scenario in periodic/nohz/newly idle balance which tweaks the load-balance conditions to ignore load per capacity in certain cases. Since misfit tasks are commonly running alone on a cpu, more aggressive active load-balancing is needed too. The fundamental idea of this patch set has been in Android kernels for a long time and is absolutely essential for consistent performance on asymmetric cpu capacity systems.

The patches have been tested on:
1. Arm Juno (r0): 2+4 Cortex A57/A53
2. Hikey960: 4+4 Cortex A73/A53

Test case: Big cpus are always kept busy. Pin shorter-running sysbench tasks to the big cpus while creating a longer-running set of unpinned sysbench tasks.
REQUESTS=1000
BIGS="1 2"
LITTLES="0 3 4 5"

# Don't care about the score for those, just keep the bigs busy
for i in $BIGS; do
    taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
        --test=cpu run &>/dev/null &
done

for i in $LITTLES; do
    sysbench --max-requests=$REQUESTS --test=cpu run \
        | grep "total time:" &
done

wait

Results:
Single runs with completion time of each task

Juno (tip)
total time: 1.2608s
total time: 1.2995s
total time: 1.5954s
total time: 1.7463s

Juno (misfit)
total time: 1.2575s
total time: 1.3004s
total time: 1.5860s
total time: 1.5871s

Hikey960 (tip)
total time: 1.7431s
total time: 2.2914s
total time: 2.5976s
total time: 1.7280s

Hikey960 (misfit)
total time: 1.7866s
total time: 1.7513s
total time: 1.6918s
total time: 1.6965s

10 run summary (tracking longest running task for each run)
         Juno             Hikey960
         avg     max      avg     max
tip      1.7465  1.7469   2.5997  2.6131
misfit   1.6016  1.6192   1.8506  1.9666

Changelog:

v4
- Added check for empty cpu_map in sd_init().
- Added patch to disable SD_ASYM_CPUCAPACITY for root_domains that don't observe capacity asymmetry if the system as a whole is asymmetric.
- Added patch to disable SD_PREFER_SIBLING on the sched_domain level below SD_ASYM_CPUCAPACITY.
- Rebased against tip/sched/core.
- Fixed uninitialised variable introduced in update_sd_lb_stats.
- Added patch to do a slight variable initialisation cleanup in update_sd_lb_stats.
- Removed superfluous type changes for temp variables assigned to root_domain->overload.
- Reworded commit for the patch setting rq->rd->overload when misfit.
- v3 Tested-by: Gaku Inami

v3
- Fixed locking around static_key.
- Changed group per-cpu capacity comparison to be based on max rather than min capacity.
- Added patch to prevent occasional pointless high->low capacity migrations.
- Changed type of group_misfit_task_load and misfit_task_load to unsigned long.
- Changed fbq() to pick the cpu with highest misfit_task_load rather than breaking when the first is found.
- Rebased against tip/sched/core.
- v2 Tested-by: Gaku Inami

v2
- Removed redundant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao
- Dropped patch #4 as although the patch seems to make sense no benefit has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami

Chris Redpath (1):
  sched/fair: Don't move tasks to lower capacity cpus unless necessary

Morten Rasmussen (6):
  sched: Add static_key for asymmetric cpu capacity optimizations
  sched/fair: Add group_misfit_task load-balance type
  sched: Add sched_group per-cpu max capacity
  sched/fair: Consider misfit tasks when load-balancing
  sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
  sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains

Valentin Schneider (5):
  sched/fair: Kick nohz balance if rq->misfit_task_load
  sched/fair: Change prefer_sibling type to bool
  sched: Change r
[PATCHv4 03/12] sched: Add sched_group per-cpu max capacity
The current sg->min_capacity tracks the lowest per-cpu compute capacity available in the sched_group when rt/irq pressure is taken into account. Minimum capacity isn't the ideal metric for deciding whether a sched_group needs offloading to another sched_group in some scenarios, e.g. a sched_group with multiple cpus where only one is under heavy pressure. Tracking maximum capacity isn't perfect either, but it is a better choice for some situations as it indicates that the sched_group is definitely compute capacity constrained, either due to rt/irq pressure on all cpus or asymmetric cpu capacities (e.g. big.LITTLE). cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 24 kernel/sched/sched.h | 1 + kernel/sched/topology.c | 2 ++ 3 files changed, 23 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e05e5202a1d2..09ede4321a3d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7927,13 +7927,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu) cpu_rq(cpu)->cpu_capacity = capacity; sdg->sgc->capacity = capacity; sdg->sgc->min_capacity = capacity; + sdg->sgc->max_capacity = capacity; } void update_group_capacity(struct sched_domain *sd, int cpu) { struct sched_domain *child = sd->child; struct sched_group *group, *sdg = sd->groups; - unsigned long capacity, min_capacity; + unsigned long capacity, min_capacity, max_capacity; unsigned long interval; interval = msecs_to_jiffies(sd->balance_interval); @@ -7947,6 +7948,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) capacity = 0; min_capacity = ULONG_MAX; + max_capacity = 0; if (child->flags & SD_OVERLAP) { /* @@ -7977,6 +7979,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu) } min_capacity = min(capacity, min_capacity); + max_capacity = max(capacity, max_capacity); } } else { /* @@ -7990,12 +7993,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu) capacity += sgc->capacity; min_capacity =
min(sgc->min_capacity, min_capacity); + max_capacity = max(sgc->max_capacity, max_capacity); group = group->next; } while (group != child->groups); } sdg->sgc->capacity = capacity; sdg->sgc->min_capacity = min_capacity; + sdg->sgc->max_capacity = max_capacity; } /* @@ -8091,16 +8096,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) } /* - * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller + * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller * per-CPU capacity than sched_group ref. */ static inline bool -group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref) +group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref) { return sg->sgc->min_capacity * capacity_margin < ref->sgc->min_capacity * 1024; } +/* + * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller + * per-CPU capacity_orig than sched_group ref. + */ +static inline bool +group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) +{ + return sg->sgc->max_capacity * capacity_margin < + ref->sgc->max_capacity * 1024; +} + static inline enum group_type group_classify(struct sched_group *group, struct sg_lb_stats *sgs) @@ -8246,7 +8262,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * power/energy consequences are not considered. 
*/ if (sgs->sum_nr_running <= sgs->group_weight && - group_smaller_cpu_capacity(sds->local, sg)) + group_smaller_min_cpu_capacity(sds->local, sg)) return false; asym_packing: diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3376bacab712..6c39a07e8a68 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1172,6 +1172,7 @@ struct sched_group_capacity { */ unsigned long capacity; unsigned long min_capacity; /* Min per-CPU capacity in group */ + unsigned long max_capacity; /* Max per-CPU capacity in group */ unsigned long next_update; int imbalance; /* XXX unrelated to capacity but shared group state */ diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index
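The new max-capacity comparison applies the same ~25% margin as the existing min-capacity one: a group only counts as "smaller" if its highest per-cpu capacity falls below the reference group's by more than the margin. A minimal standalone sketch of that comparison (capacity values illustrative, capacity_margin = 1280 as in mainline):

```c
#include <assert.h>

static const unsigned long capacity_margin = 1280;

/* Mirrors group_smaller_max_cpu_capacity(): sg is only considered
 * smaller than ref if sg_max * 1280 < ref_max * 1024, i.e. sg's best
 * cpu is more than ~25% weaker than ref's best cpu. */
int group_smaller_max_cpu_capacity(unsigned long sg_max, unsigned long ref_max)
{
	return sg_max * capacity_margin < ref_max * 1024;
}
```

On a hypothetical big.LITTLE system with little cpus at capacity 446 and big cpus at 1024, the little group is smaller than the big group; two groups whose best cpus are within ~25% of each other are treated as equal, which helps avoid pointless back-and-forth migrations between near-identical groups.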
[PATCHv4 05/12] sched/fair: Kick nohz balance if rq->misfit_task_load
From: Valentin Schneider There already are a few conditions in nohz_kick_needed() to ensure a nohz kick is triggered, but they are not enough for some misfit task scenarios. Excluding asym packing, those are: * rq->nr_running >=2: Not relevant here because we are running a misfit task, it needs to be migrated regardless and potentially through active balance. * sds->nr_busy_cpus > 1: If there is only the misfit task being run on a group of low capacity cpus, this will be evaluated to False. * rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here, misfit task needs to be migrated regardless of rt/IRQ pressure As such, this commit adds an rq->misfit_task_load condition to trigger a nohz kick. The idea to kick a nohz balance for misfit tasks originally came from Leo Yan , and a similar patch was submitted for the Android Common Kernel - see [1]. [1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6e885d92fad2..acec93e1dc51 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9505,7 +9505,7 @@ static void nohz_balancer_kick(struct rq *rq) if (time_before(now, nohz.next_balance)) goto out; - if (rq->nr_running >= 2) { + if (rq->nr_running >= 2 || rq->misfit_task_load) { flags = NOHZ_KICK_MASK; goto out; } -- 2.7.4
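The change is a one-liner but the decision it encodes is worth spelling out: a cpu now kicks a nohz balance either on classic overload (two or more runnable tasks) or when its current task is a misfit. A hypothetical standalone sketch of just that condition:

```c
#include <assert.h>

/* Sketch of the post-patch nohz kick trigger: classic overload
 * (nr_running >= 2) or a nonzero misfit load recorded on the rq. */
int nohz_kick_wanted(unsigned int nr_running, unsigned long misfit_task_load)
{
	return nr_running >= 2 || misfit_task_load != 0;
}
```

A single misfit task running alone on a low capacity cpu (nr_running == 1) previously never triggered the kick; with the patch its misfit_task_load does, so idle big cpus get a chance to pull it.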
[PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
The 'prefer sibling' sched_domain flag is intended to encourage spreading tasks to sibling sched_domains to take advantage of more caches and cores on SMT systems. It has recently been changed to be on at all non-NUMA topology levels. However, spreading across domains with cpu capacity asymmetry isn't desirable, e.g. spreading from high capacity to low capacity cpus even if the high capacity cpus aren't overutilized might give access to more cache, but the cpu will be slower and this may lead to worse overall throughput. To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain level immediately below SD_ASYM_CPUCAPACITY. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/topology.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 29c186961345..00c7a08c7f77 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl, | 0*SD_SHARE_CPUCAPACITY | 0*SD_SHARE_PKG_RESOURCES | 0*SD_SERIALIZE - | 0*SD_PREFER_SIBLING + | 1*SD_PREFER_SIBLING | 0*SD_NUMA | sd_flags , @@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl, if (sd->flags & SD_ASYM_CPUCAPACITY) { struct sched_domain *t = sd; + /* +* Don't attempt to spread across cpus of different capacities.
+*/ + if (sd->child) + sd->child->flags &= ~SD_PREFER_SIBLING; + for_each_lower_domain(t) t->flags |= SD_BALANCE_WAKE; } if (sd->flags & SD_SHARE_CPUCAPACITY) { - sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { - sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 117; sd->cache_nice_tries = 1; sd->busy_idx = 2; @@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl, sd->busy_idx = 3; sd->idle_idx = 2; + sd->flags &= ~SD_PREFER_SIBLING; sd->flags |= SD_SERIALIZE; if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) { sd->flags &= ~(SD_BALANCE_EXEC | @@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl, #endif } else { - sd->flags |= SD_PREFER_SIBLING; sd->cache_nice_tries = 1; sd->busy_idx = 2; sd->idle_idx = 1; -- 2.7.4
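The patch's net effect on the flag can be sketched with plain bit operations: SD_PREFER_SIBLING is now set by default in sd_init() and selectively cleared on the level below SD_ASYM_CPUCAPACITY and on NUMA levels. The flag bit values below are placeholders for illustration, not the kernel's actual assignments.

```c
#include <assert.h>

/* Placeholder bit values, illustrative only. */
#define SD_PREFER_SIBLING	0x1u
#define SD_NUMA			0x4u

/* Sketch of the post-patch policy: start with SD_PREFER_SIBLING set,
 * then clear it where spreading is undesirable (the child of an
 * SD_ASYM_CPUCAPACITY domain, and NUMA levels). */
unsigned int adjust_flags(unsigned int flags, int parent_is_asym)
{
	flags |= SD_PREFER_SIBLING;		/* 1*SD_PREFER_SIBLING default */
	if (parent_is_asym || (flags & SD_NUMA))
		flags &= ~SD_PREFER_SIBLING;	/* don't spread across capacities/nodes */
	return flags;
}
```

Expressing it as "default on, clear where harmful" keeps the SMT and LLC levels spreading as before while making capacity-asymmetric and NUMA boundaries opt out, which matches the diff's removal of the per-level `sd->flags |= SD_PREFER_SIBLING` lines.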
[PATCHv4 04/12] sched/fair: Consider misfit tasks when load-balancing
On asymmetric cpu capacity systems load intensive tasks can end up on cpus that don't suit their compute demand. In this scenarios 'misfit' tasks should be migrated to cpus with higher compute capacity to ensure better throughput. group_misfit_task indicates this scenario, but tweaks to the load-balance code are needed to make the migrations happen. Misfit balancing only makes sense between a source group of lower per-cpu capacity and destination group of higher compute capacity. Otherwise, misfit balancing is ignored. group_misfit_task has lowest priority so any imbalance due to overload is dealt with first. The modifications are: 1. Only pick a group containing misfit tasks as the busiest group if the destination group has higher capacity and has spare capacity. 2. When the busiest group is a 'misfit' group, skip the usual average load and group capacity checks. 3. Set the imbalance for 'misfit' balancing sufficiently high for a task to be pulled ignoring average load. 4. Pick the cpu with the highest misfit load as the source cpu. 5. If the misfit task is alone on the source cpu, go for active balancing. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 51 +-- 1 file changed, 49 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 09ede4321a3d..6e885d92fad2 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7285,6 +7285,7 @@ struct lb_env { unsigned intloop_max; enum fbq_type fbq_type; + enum group_type src_grp_type; struct list_headtasks; }; @@ -8243,6 +8244,17 @@ static bool update_sd_pick_busiest(struct lb_env *env, { struct sg_lb_stats *busiest = >busiest_stat; + /* +* Don't try to pull misfit tasks we can't help. +* We can use max_capacity here as reduction in capacity on some +* cpus in the group should either be possible to resolve +* internally or be covered by avg_load imbalance (eventually). 
+*/ + if (sgs->group_type == group_misfit_task && + (!group_smaller_max_cpu_capacity(sg, sds->local) || +!group_has_capacity(env, &sds->local_stat))) + return false; + if (sgs->group_type > busiest->group_type) return true; @@ -8265,6 +8277,13 @@ static bool update_sd_pick_busiest(struct lb_env *env, group_smaller_min_cpu_capacity(sds->local, sg)) return false; + /* +* If we have more than one misfit sg go with the biggest misfit. +*/ + if (sgs->group_type == group_misfit_task && + sgs->group_misfit_task_load < busiest->group_misfit_task_load) + return false; + asym_packing: /* This is the busiest node in its class. */ if (!(env->sd->flags & SD_ASYM_PACKING)) @@ -8562,8 +8581,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s * factors in sg capacity and sgs with smaller group_type are * skipped when updating the busiest sg: */ - if (busiest->avg_load <= sds->avg_load || - local->avg_load >= sds->avg_load) { + if (busiest->group_type != group_misfit_task && + (busiest->avg_load <= sds->avg_load || +local->avg_load >= sds->avg_load)) { env->imbalance = 0; return fix_small_imbalance(env, sds); } @@ -8597,6 +8617,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s (sds->avg_load - local->avg_load) * local->group_capacity ) / SCHED_CAPACITY_SCALE; + /* Boost imbalance to allow misfit task to be balanced. 
*/ + if (busiest->group_type == group_misfit_task) { + env->imbalance = max_t(long, env->imbalance, + busiest->group_misfit_task_load); + } + /* * if *imbalance is less than the average load per runnable task * there is no guarantee that any tasks will be moved so we'll have @@ -8663,6 +8689,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env) busiest->group_no_capacity) goto force_balance; + /* Misfit tasks should be dealt with regardless of the avg load */ + if (busiest->group_type == group_misfit_task) + goto force_balance; + /* * If the local group is busier than the selected busiest group * don't try and pull any tasks. @@ -8700,6 +8730,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) force_balance: /* Looks like there is an imbalance. Compute it */ + env->src_grp_type = busiest->group_type;
[PATCHv4 09/12] sched/fair: Set rq->rd->overload when misfit
From: Valentin Schneider Idle balance is a great opportunity to pull a misfit task. However, there are scenarios where misfit tasks are present but idle balance is prevented by the overload flag. A good example of this is a workload of n identical tasks. Let's suppose we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly CPU-intensive tasks - for the sake of simplicity let's say they are just CPU hogs, even when running on big CPUs. They are identical tasks, so on an SMP system they should all end at (roughly) the same time. However, in our case the LITTLE CPUs are less performant than the big CPUs, so tasks running on the LITTLEs will have a longer completion time. This means that the big CPUs will complete their work earlier, at which point they should pull the tasks from the LITTLEs. What we want to happen is summarized as follows: a,b,c,d are our CPU-hogging tasks _ signifies idling LITTLE_0 | a a a a _ _ LITTLE_1 | b b b b _ _ -|- big_0 | c c c c a a big_1 | d d d d b b ^ ^ Tasks end on the big CPUs, idle balance happens and the misfit tasks are pulled straight away This however won't happen, because currently the overload flag is only set when there is any CPU that has more than one runnable task - which may very well not be the case here if our CPU-hogging workload is all there is to run. As such, this commit sets the overload flag in update_sg_lb_stats when a group is flagged as having a misfit task. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 6 -- kernel/sched/sched.h | 6 +- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d0641ba7bea1..de84f5a9a65a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8163,7 +8163,7 @@ static bool update_nohz_stats(struct rq *rq, bool force) * @load_idx: Load index of sched_domain of this_cpu for load calc. * @local_group: Does group contain this_cpu. 
* @sgs: variable to hold the statistics for this group. - * @overload: Indicate more than one runnable task for any CPU. + * @overload: Indicate pullable load (e.g. >1 runnable task). */ static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, @@ -8207,8 +8207,10 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->idle_cpus++; if (env->sd->flags & SD_ASYM_CPUCAPACITY && - sgs->group_misfit_task_load < rq->misfit_task_load) + sgs->group_misfit_task_load < rq->misfit_task_load) { sgs->group_misfit_task_load = rq->misfit_task_load; + *overload = 1; + } } /* Adjust by relative CPU capacity of the group */ diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index e1dc85d1bfdd..377545b5aa15 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -695,7 +695,11 @@ struct root_domain { cpumask_var_t span; cpumask_var_t online; - /* Indicate more than one runnable task for any CPU */ + /* +* Indicate pullable load on at least one CPU, e.g: +* - More than one runnable task +* - Running task is misfit +*/ int overload; /* -- 2.7.4
[PATCHv4 10/12] sched/fair: Don't move tasks to lower capacity cpus unless necessary
From: Chris Redpath When lower capacity CPUs are load balancing and considering pulling something from a higher capacity group, we should not pull tasks from a cpu with only one task running as this is guaranteed to impede progress for that task. If there is more than one task running, load balance in the higher capacity group would have already made any possible moves to resolve imbalance and we should make better use of system compute capacity by moving a task if we still have more than one running. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Chris Redpath Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index de84f5a9a65a..06beefa02420 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8793,6 +8793,17 @@ static struct rq *find_busiest_queue(struct lb_env *env, capacity = capacity_of(i); + /* +* For ASYM_CPUCAPACITY domains, don't pick a cpu that could +* eventually lead to active_balancing high->low capacity. +* Higher per-cpu capacity is considered better than balancing +* average load. +*/ + if (env->sd->flags & SD_ASYM_CPUCAPACITY && + capacity_of(env->dst_cpu) < capacity && + rq->nr_running == 1) + continue; + wl = weighted_cpuload(rq); /* -- 2.7.4
[PATCHv4 08/12] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
From: Valentin Schneider This variable can be read and set locklessly within update_sd_lb_stats(). As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong can happen because of the compiler. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 6 +++--- kernel/sched/sched.h | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ee26eeb188ef..d0641ba7bea1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8428,8 +8428,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd if (!env->sd->parent) { /* update overload indicator if we are at root domain */ - if (env->dst_rq->rd->overload != overload) - env->dst_rq->rd->overload = overload; + if (READ_ONCE(env->dst_rq->rd->overload) != overload) + WRITE_ONCE(env->dst_rq->rd->overload, overload); } } @@ -9872,7 +9872,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf) rq_unpin_lock(this_rq, rf); if (this_rq->avg_idle < sysctl_sched_migration_cost || - !this_rq->rd->overload) { + !READ_ONCE(this_rq->rd->overload)) { rcu_read_lock(); sd = rcu_dereference_check_sched_domain(this_rq->sd); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 648224b23287..e1dc85d1bfdd 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1672,8 +1672,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count) if (prev_nr < 2 && rq->nr_running >= 2) { #ifdef CONFIG_SMP - if (!rq->rd->overload) - rq->rd->overload = 1; + if (!READ_ONCE(rq->rd->overload)) + WRITE_ONCE(rq->rd->overload, 1); #endif } -- 2.7.4
[PATCHv4 06/12] sched/fair: Change prefer_sibling type to bool
From: Valentin Schneider This variable is entirely local to update_sd_lb_stats, so we can safely change its type and slightly clean up its initialisation. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index acec93e1dc51..ee26eeb188ef 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8352,11 +8352,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd struct sched_group *sg = env->sd->groups; struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; - int load_idx, prefer_sibling = 0; + int load_idx; bool overload = false; + bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; - - if (child && child->flags & SD_PREFER_SIBLING) - prefer_sibling = 1; #ifdef CONFIG_NO_HZ_COMMON if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked)) -- 2.7.4
[PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
When hotplugging cpus out or creating exclusive cpusets (disabling sched_load_balance) systems which were asymmetric at boot might become symmetric. In this case leaving the flag set might lead to suboptimal scheduling decisions. The arch code providing the flag doesn't have visibility of the cpuset configuration so it must either be told by passing a cpumask or the generic topology code has to verify if the flag should still be set when taking the actual sched_domain_span() into account. This patch implements the latter approach. We need to detect capacity based on calling arch_scale_cpu_capacity() directly as rq->cpu_capacity_orig hasn't been set yet early in the boot process. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/topology.c | 20 1 file changed, 20 insertions(+) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 71330e0e41db..29c186961345 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl, sd_id = cpumask_first(sched_domain_span(sd)); /* +* Check if cpu_map eclipses cpu capacity asymmetry. +*/ + + if (sd->flags & SD_ASYM_CPUCAPACITY) { + int i; + bool disable = true; + long capacity = arch_scale_cpu_capacity(NULL, sd_id); + + for_each_cpu(i, sched_domain_span(sd)) { + if (capacity != arch_scale_cpu_capacity(NULL, i)) { + disable = false; + break; + } + } + + if (disable) + sd->flags &= ~SD_ASYM_CPUCAPACITY; + } + + /* * Convert topological properties into behaviour. */ -- 2.7.4
[PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
The 'prefer sibling' sched_domain flag is intended to encourage spreading tasks to sibling sched_domains to take advantage of more caches and cores for SMT systems. It has recently been changed to be on all non-NUMA topology levels. However, spreading across domains with cpu capacity asymmetry isn't desirable, e.g. spreading from high capacity to low capacity cpus even if high capacity cpus aren't overutilized might give access to more cache but the cpu will be slower and possibly lead to worse overall throughput. To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain level immediately below SD_ASYM_CPUCAPACITY. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/topology.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 29c186961345..00c7a08c7f77 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl, | 0*SD_SHARE_CPUCAPACITY | 0*SD_SHARE_PKG_RESOURCES | 0*SD_SERIALIZE - | 0*SD_PREFER_SIBLING + | 1*SD_PREFER_SIBLING | 0*SD_NUMA | sd_flags , @@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl, if (sd->flags & SD_ASYM_CPUCAPACITY) { struct sched_domain *t = sd; + /* +* Don't attempt to spread across cpus of different capacities. 
+*/ + if (sd->child) + sd->child->flags &= ~SD_PREFER_SIBLING; + for_each_lower_domain(t) t->flags |= SD_BALANCE_WAKE; } if (sd->flags & SD_SHARE_CPUCAPACITY) { - sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 110; sd->smt_gain = 1178; /* ~15% */ } else if (sd->flags & SD_SHARE_PKG_RESOURCES) { - sd->flags |= SD_PREFER_SIBLING; sd->imbalance_pct = 117; sd->cache_nice_tries = 1; sd->busy_idx = 2; @@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl, sd->busy_idx = 3; sd->idle_idx = 2; + sd->flags &= ~SD_PREFER_SIBLING; sd->flags |= SD_SERIALIZE; if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) { sd->flags &= ~(SD_BALANCE_EXEC | @@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl, #endif } else { - sd->flags |= SD_PREFER_SIBLING; sd->cache_nice_tries = 1; sd->busy_idx = 2; sd->idle_idx = 1; -- 2.7.4
[PATCHv4 07/12] sched: Change root_domain->overload type to int
From: Valentin Schneider sizeof(_Bool) is implementation-defined, so let's just go with 'int' as is done for other structures e.g. sched_domain_shared->has_idle_cores. The local 'overload' variable used in update_sd_lb_stats can remain bool, as it won't impact any struct layout and can be assigned to the root_domain field. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Valentin Schneider Signed-off-by: Morten Rasmussen --- kernel/sched/sched.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 6c39a07e8a68..648224b23287 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -696,7 +696,7 @@ struct root_domain { cpumask_var_t online; /* Indicate more than one runnable task for any CPU */ - bool overload; + int overload; /* * The bit corresponding to a CPU gets set here if such CPU has more @@ -1673,7 +1673,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count) if (prev_nr < 2 && rq->nr_running >= 2) { #ifdef CONFIG_SMP if (!rq->rd->overload) - rq->rd->overload = true; + rq->rd->overload = 1; #endif } -- 2.7.4
[PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations
The existing asymmetric cpu capacity code should cause minimal overhead for others. Putting it behind a static_key, as has been done for SMT optimizations, would make it easier to extend and improve without causing harm to others moving forward. cc: Ingo Molnar cc: Peter Zijlstra Signed-off-by: Morten Rasmussen --- kernel/sched/fair.c | 3 +++ kernel/sched/sched.h | 1 + kernel/sched/topology.c | 19 +++ 3 files changed, 23 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 321cd5dcf2e8..85fb7e8ff5c8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6583,6 +6583,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu) { long min_cap, max_cap; + if (!static_branch_unlikely(&sched_asym_cpucapacity)) + return 0; + min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu)); max_cap = cpu_rq(cpu)->rd->max_cpu_capacity; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c7742dcc136c..35ce218f0157 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1160,6 +1160,7 @@ DECLARE_PER_CPU(int, sd_llc_id); DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DECLARE_PER_CPU(struct sched_domain *, sd_numa); DECLARE_PER_CPU(struct sched_domain *, sd_asym); +extern struct static_key_false sched_asym_cpucapacity; struct sched_group_capacity { atomic_t ref; diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 05a831427bc7..0cfdeff669fe 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id); DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared); DEFINE_PER_CPU(struct sched_domain *, sd_numa); DEFINE_PER_CPU(struct sched_domain *, sd_asym); +DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity); static void update_top_cache_domain(int cpu) { @@ -425,6 +426,21 @@ static void update_top_cache_domain(int cpu) rcu_assign_pointer(per_cpu(sd_asym, cpu), sd); } +static void update_asym_cpucapacity(int cpu) +{ + int enable = false; + + 
rcu_read_lock(); + if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY)) + enable = true; + rcu_read_unlock(); + + if (enable) { + /* This expects to be hotplug-safe */ + static_branch_enable_cpuslocked(&sched_asym_cpucapacity); + } +} + /* * Attach the domain 'sd' to 'cpu' as its base domain. Callers must * hold the hotplug lock. @@ -1707,6 +1723,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att } rcu_read_unlock(); + if (!cpumask_empty(cpu_map)) + update_asym_cpucapacity(cpumask_first(cpu_map)); + if (rq && sched_debug_enabled) { pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n", cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity); -- 2.7.4