Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-11 Thread Morten Rasmussen
On Fri, Jan 08, 2021 at 12:22:41PM -0800, Tim Chen wrote:
> 
> 
> On 1/8/21 7:12 AM, Morten Rasmussen wrote:
> > On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> >> On 1/6/21 12:30 AM, Barry Song wrote:
> >>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> >>> cluster has 4 cpus. All clusters share L3 cache data while each cluster
> >>> has local L3 tag. On the other hand, each cluster will share some
> >>> internal system bus. This means cache is much more affine inside one
> >>> cluster than across clusters.
> >>
> >> There is a similar need for clustering in x86.  Some x86 cores could
> >> share L2 caches that is similar to the cluster in Kunpeng 920 (e.g. on
> >> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing
> >> a separate L2, and 24 cores sharing L3).
> >> Having a sched domain at the L2 cluster helps spread load among 
> >> L2 domains.  This will reduce L2 cache contention and help with
> >> performance for low to moderate load scenarios.
> > 
> > IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
> > between L2 caches while Barry is after consolidating tasks within the
> > boundaries of a L3 tag cache. One helps cache utilization, the other
> > communication latency between tasks. Am I missing something? 
> > 
> > IMHO, we need some numbers on the table to say which way to go. Looking
> > at just benchmarks of one type doesn't show that this is a good idea in
> > general.
> > 
> 
> I think it is going to depend on the workload.  If there are dependent
> tasks that communicate with one another, putting them together
> in the same cluster will be the right thing to do to reduce communication
> costs.  On the other hand, if the tasks are independent, putting them
> together on the same cluster will increase resource contention and
> spreading them out will be better.

Agreed. That is exactly where I'm coming from. This is all about the task
placement policy. We generally tend to spread tasks to avoid resource
contention (SMT, caches), which seems to be what you are proposing to
extend. I think that makes sense given it can produce significant
benefits.

> 
> Any thoughts on what is the right clustering "tag" to use to clump
> related tasks together?
> Cgroup? Pid? Tasks with same mm?

I think this is the real question. I think the closest thing we have at
the moment is the wakee/waker flip heuristic. This seems to be related.
Perhaps the wake_affine tricks can serve as a starting point?
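
To make that a bit more concrete, a toy sketch (not kernel code, all names
made up) of the kind of waker/wakee bookkeeping I have in mind; the kernel's
wake_affine/wake_wide code tracks something similar via last_wakee and
wakee_flips:

/*
 * Toy sketch, not kernel code: treat a waker/wakee pair as "related"
 * if the waker keeps waking the same task. The threshold below is a
 * made-up illustration.
 */
struct wake_relation {
        struct wake_relation *last_wakee;
        unsigned int flips;
};

static void record_wakeup(struct wake_relation *waker,
                          struct wake_relation *wakee)
{
        if (waker->last_wakee != wakee) {
                waker->last_wakee = wakee;
                waker->flips++;         /* switched target: weaker relation */
        } else if (waker->flips) {
                waker->flips--;         /* repeated target: stronger relation */
        }
}

static int tasks_look_related(const struct wake_relation *waker,
                              const struct wake_relation *wakee)
{
        return waker->last_wakee == wakee && waker->flips < 2;
}

Whether something this simple is a good enough "tag" is of course exactly
the open question.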

Morten


Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler

2021-01-08 Thread Morten Rasmussen
On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote:
> On 1/6/21 12:30 AM, Barry Song wrote:
> > ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data while each cluster
> > has local L3 tag. On the other hand, each cluster will share some
> > internal system bus. This means cache is much more affine inside one cluster
> > than across clusters.
> 
> There is a similar need for clustering in x86.  Some x86 cores could
> share L2 caches that is similar to the cluster in Kunpeng 920 (e.g. on
> Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
> separate L2, and 24 cores sharing L3).
> Having a sched domain at the L2 cluster helps spread load among 
> L2 domains.  This will reduce L2 cache contention and help with
> performance for low to moderate load scenarios.

IIUC, you are arguing for the exact opposite behaviour, i.e. balancing
between L2 caches while Barry is after consolidating tasks within the
boundaries of a L3 tag cache. One helps cache utilization, the other
communication latency between tasks. Am I missing something? 

IMHO, we need some numbers on the table to say which way to go. Looking
at just benchmarks of one type doesn't show that this is a good idea in
general.

Morten


Re: [PATCH] sched: Add schedutil overview

2020-12-18 Thread Morten Rasmussen
On Fri, Dec 18, 2020 at 11:33:09AM +, Valentin Schneider wrote:
> On 18/12/20 10:32, Peter Zijlstra wrote:
> > +Schedutil / DVFS
> > +
> > +
> > +Every time the scheduler load tracking is updated (task wakeup, task
> > +migration, time progression) we call out to schedutil to update the
> > +hardware DVFS state.
> > +
> > +The basis is the CPU runqueue's 'running' metric, which per the above it is
> > +the frequency invariant utilization estimate of the CPU. From this we compute
> > +a desired frequency like:
> > +
> > +             max( running, util_est );  if UTIL_EST
> > +  u_cfs := {  running;                   otherwise
> > +
> > +  u_clamp := clamp( u_cfs, u_min, u_max )
> > +
> > +  u := u_cfs + u_rt + u_irq + u_dl;   [approx. see source for more detail]
> > +
> > +  f_des := min( f_max, 1.25 u * f_max )
> > +
> 
> In schedutil_cpu_util(), uclamp clamps both u_cfs and u_rt. I'm afraid the
> below might just bring more confusion; what do you think?
> 
>               clamp( u_cfs + u_rt, u_min, u_max );  if UCLAMP_TASK
>   u_clamp := { u_cfs + u_rt;                         otherwise
> 
>   u := u_clamp + u_irq + u_dl;[approx. see source for more detail]

It is reflecting the code so I think it is worth it. It also fixes the
typo in the original sum (u_cfs -> u_clamp).
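
For anyone trying to follow along, a small standalone sketch of how I read
the combined formula (with the clamp over u_cfs + u_rt as above); the helper
names, the kHz units in the example and the normalization to 1024 are my own
assumptions, not the kernel implementation:

#include <stdio.h>

#define MAX_CAP 1024ULL   /* SCHED_CAPACITY_SCALE */

static unsigned long long clampval(unsigned long long v,
                                   unsigned long long lo,
                                   unsigned long long hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

/* All utilization terms normalized to MAX_CAP; returns a frequency in kHz. */
static unsigned long long desired_freq(unsigned long long u_cfs,
                                       unsigned long long u_rt,
                                       unsigned long long u_irq,
                                       unsigned long long u_dl,
                                       unsigned long long u_min,
                                       unsigned long long u_max,
                                       unsigned long long f_max)
{
        /* uclamp applied to the cfs + rt contribution, as discussed above */
        unsigned long long u_clamp = clampval(u_cfs + u_rt, u_min, u_max);
        /* remaining pressure terms; an approximation, see the source */
        unsigned long long u = u_clamp + u_irq + u_dl;
        /* 1.25 * (u / MAX_CAP) * f_max, capped at f_max */
        unsigned long long f_des = (5 * u * f_max) / (4 * MAX_CAP);

        return f_des < f_max ? f_des : f_max;
}

int main(void)
{
        /* CPU ~50% busy, no rt/irq/dl pressure, no clamps, 2 GHz max */
        printf("%llu kHz\n", desired_freq(512, 0, 0, 0, 0, MAX_CAP, 2000000));
        return 0;
}

A half-busy CPU ends up at 62.5% of f_max, which is the 25% headroom at work.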

> (also, does this need a word about runnable rt tasks => goto max?)

What is actually the intended policy there? I thought it was goto max
unless rt was clamped, but if I read the code correctly in
schedutil_cpu_util() the current policy is only goto max if uclamp isn't
in use at all, including cfs.

The write-up looks good to me.

Reviewed-by: Morten Rasmussen 

Morten


Re: [RFC] Documentation/scheduler/schedutil.txt

2020-11-20 Thread Morten Rasmussen
Hi Peter,

Looks like a nice summary to me.

On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> Hi,
> 
> I was recently asked to explain how schedutil works, the below write-up
> is the result of that and I figured we might as well stick it in the
> tree.
> 
> Not as a patch for easy reading and commenting.
> 
> ---
> 
> NOTE; all this assumes a linear relation between frequency and work capacity,
> we know this is flawed, but it is the best workable approximation.

If you replace frequency with performance level everywhere (CPPC style),
most of it should still work without that assumption. The assumption
might have been made in hw or fw instead, though.

Morten


Re: [PATCH v4 1/4] PM / EM: Add a flag indicating units of power values in Energy Model

2020-11-05 Thread Morten Rasmussen
On Thu, Nov 05, 2020 at 10:09:05AM +, Lukasz Luba wrote:
> 
> 
> On 11/5/20 9:18 AM, Morten Rasmussen wrote:
> > On Tue, Nov 03, 2020 at 09:05:57AM +, Lukasz Luba wrote:
> > > @@ -79,7 +82,8 @@ struct em_data_callback {
> > >   struct em_perf_domain *em_cpu_get(int cpu);
> > >   struct em_perf_domain *em_pd_get(struct device *dev);
> > > int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> > > - struct em_data_callback *cb, cpumask_t *span);
> > > + struct em_data_callback *cb, cpumask_t *spani,
> > 
> > "spani" looks like a typo?
> > 
> 
> Good catch, yes, the vim 'i'.
> 
> Thank you Morten. I will resend this patch when you don't
> find other issues in the rest of patches.

The rest of the series looks okay to me.

Morten


Re: [PATCH v4 1/4] PM / EM: Add a flag indicating units of power values in Energy Model

2020-11-05 Thread Morten Rasmussen
On Tue, Nov 03, 2020 at 09:05:57AM +, Lukasz Luba wrote:
> @@ -79,7 +82,8 @@ struct em_data_callback {
>  struct em_perf_domain *em_cpu_get(int cpu);
>  struct em_perf_domain *em_pd_get(struct device *dev);
>  int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
> - struct em_data_callback *cb, cpumask_t *span);
> + struct em_data_callback *cb, cpumask_t *spani,

"spani" looks like a typo?


Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.

2020-10-19 Thread Morten Rasmussen
On Mon, Oct 19, 2020 at 02:41:57PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 15:10:52 +0200
> Morten Rasmussen  wrote:
> 
> > Hi Jonathan,
> > 
> > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> > > On Mon, 19 Oct 2020 12:35:22 +0200
> > > Peter Zijlstra  wrote:
> > >   
> > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:  
> > > > > Both ACPI and DT provide the ability to describe additional layers of
> > > > > topology between that of individual cores and higher level constructs
> > > > > such as the level at which the last level cache is shared.
> > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy
> > > > > Node Structure [1] that is the parent of the CPU cores and in turn
> > > > > has a parent Processor Hierarchy Nodes Structure representing
> > > > > a higher level of topology.
> > > > > 
> > > > > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share
> > > > > any cache resources, but the interconnect topology is such that
> > > > > the cost to transfer ownership of a cacheline between CPUs within
> > > > > a cluster is lower than between CPUs in different clusters on the same
> > > > > die.   Hence, it can make sense to deliberately schedule threads
> > > > > sharing data to a single cluster.
> > > > > 
> > > > > This patch simply exposes this information to userspace libraries
> > > > > like hwloc by providing cluster_cpus and related sysfs attributes.
> > > > > PoC of HWLOC support at [2].
> > > > > 
> > > > > Note this patch only handle the ACPI case.
> > > > > 
> > > > > Special consideration is needed for SMT processors, where it is
> > > > > necessary to move 2 levels up the hierarchy from the leaf nodes
> > > > > (thus skipping the processor core level).
> > > 
> > > Hi Peter,
> > >   
> > > > 
> > > > I'm confused by all of this. The core level is exactly what you seem to
> > > > want.  
> > > 
> > > It's the level above the core, whether in an multi-threaded core
> > > or a single threaded core.   This may correspond to the level
> > > at which caches are shared (typically L3).  Cores are already well
> > > represented via thread_siblings and similar.  Extra confusion is that
> > > the current core_siblings (deprecated) sysfs interface, actually reflects
> > > the package level and ignores anything in between core and
> > > package (such as die on x86)
> > > 
> > > So in a typical system with a hierarchical interconnect you would have
> > > 
> > > thread
> > > core
> > > cluster (possibly multiple layers as mentioned in Brice's reply).
> > > die
> > > package
> > > 
> > > Unfortunately as pointed out in other branches of this thread, there is
> > > no consistent generic name.  I'm open to suggestions!  
> > 
> > IIUC, you are actually proposing another "die" level? I'm not sure if we
> > can actually come up with a generic name since interconnects are highly
> > implementation dependent.
> 
> Brice mentioned hwloc is using 'group'.  That seems generic enough perhaps.
> 
> > 
> > How is your memory distributed? Do you already have NUMA nodes? If you
> > want to keep tasks together, it might make sense to define the clusters
> > (in your case) as NUMA nodes.
> 
> We already have all of the standard levels.  We need at least one more.
> On a near future platform we'll have full set (kunpeng920 is single thread)
> 
> So on kunpeng 920 we have
> cores
> (clusters)
> die / llc shared at this level

IIRC, LLC sharing isn't tied to a specific level in the user-space
topology description. On some Arm systems LLC is per cluster while the
package has a single die with two clusters.

I'm slightly confused about the cache sharing. You said above that your
clusters don't share cache resources? This list says LLC is at die
level, which is above cluster level?

> package (multiple NUMA nodes in each package) 

What is your NUMA node span? Couldn't you just make it equivalent to
your clusters?

> > > For example, in zen2 this would correspond to a 'core complex' consisting
> > > 4 CPU cores (each one 2 threads) sharing some local L3 cache.
> > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
> > > In zen3 it looks like this 

Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.

2020-10-19 Thread Morten Rasmussen
On Mon, Oct 19, 2020 at 02:50:53PM +0200, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> > On Mon, 19 Oct 2020 12:35:22 +0200
> > Peter Zijlstra  wrote:
> 
> > > I'm confused by all of this. The core level is exactly what you seem to
> > > want.
> > 
> > It's the level above the core, whether in an multi-threaded core
> > or a single threaded core.   This may correspond to the level
> > at which caches are shared (typically L3).  Cores are already well
> > represented via thread_siblings and similar.  Extra confusion is that
> > the current core_siblings (deprecated) sysfs interface, actually reflects
> > the package level and ignores anything in between core and
> > package (such as die on x86)
> 
> That seems wrong. core-mask should be whatever cores share L3. So on a
> Intel Core2-Quad (just to pick an example) you should have 4 CPU in a
> package, but only 2 CPUs for the core-mask.
> 
> It just so happens that L3 and package were the same for a long while in
> x86 land, although recent chips started breaking that trend.
> 
> And I know nothing about the core-mask being depricated; it's what the
> scheduler uses. It's not going anywhere.

Don't get confused over the user-space topology and the scheduler
topology; they are _not_ the same despite having similar names for some
things :-)

> So if your 'cluster' is a group of single cores (possibly with SMT) that
> do not share cache but have a faster cache connection and you want them
> to behave as-if they were a multi-core group that did share cache, then
> core-mask it is.

In the scheduler, yes. There is no core-mask exposed to user-space.

We have to be clear about whether we discuss scheduler or user-space
topology :-)


Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.

2020-10-19 Thread Morten Rasmussen
Hi Jonathan,

On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote:
> On Mon, 19 Oct 2020 12:35:22 +0200
> Peter Zijlstra  wrote:
> 
> > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote:
> > > Both ACPI and DT provide the ability to describe additional layers of
> > > topology between that of individual cores and higher level constructs
> > > such as the level at which the last level cache is shared.
> > > In ACPI this can be represented in PPTT as a Processor Hierarchy
> > > Node Structure [1] that is the parent of the CPU cores and in turn
> > > has a parent Processor Hierarchy Nodes Structure representing
> > > a higher level of topology.
> > > 
> > > For example Kunpeng 920 has clusters of 4 CPUs.  These do not share
> > > any cache resources, but the interconnect topology is such that
> > > the cost to transfer ownership of a cacheline between CPUs within
> > > a cluster is lower than between CPUs in different clusters on the same
> > > die.   Hence, it can make sense to deliberately schedule threads
> > > sharing data to a single cluster.
> > > 
> > > This patch simply exposes this information to userspace libraries
> > > like hwloc by providing cluster_cpus and related sysfs attributes.
> > > PoC of HWLOC support at [2].
> > > 
> > > Note this patch only handle the ACPI case.
> > > 
> > > Special consideration is needed for SMT processors, where it is
> > > necessary to move 2 levels up the hierarchy from the leaf nodes
> > > (thus skipping the processor core level).  
> 
> Hi Peter,
> 
> > 
> > I'm confused by all of this. The core level is exactly what you seem to
> > want.
> 
> It's the level above the core, whether in an multi-threaded core
> or a single threaded core.   This may correspond to the level
> at which caches are shared (typically L3).  Cores are already well
> represented via thread_siblings and similar.  Extra confusion is that
> the current core_siblings (deprecated) sysfs interface, actually reflects
> the package level and ignores anything in between core and
> package (such as die on x86)
> 
> So in a typical system with a hierarchical interconnect you would have
> 
> thread
> core
> cluster (possibly multiple layers as mentioned in Brice's reply).
> die
> package
> 
> Unfortunately as pointed out in other branches of this thread, there is
> no consistent generic name.  I'm open to suggestions!

IIUC, you are actually proposing another "die" level? I'm not sure if we
can actually come up with a generic name since interconnects are highly
implementation dependent.

How is your memory distributed? Do you already have NUMA nodes? If you
want to keep tasks together, it might make sense to define the clusters
(in your case) as NUMA nodes.

> Both ACPI PPTT and DT provide generic structures to represent layers of
> topology.   They don't name as such, but in ACPI there are flags to indicate
> package, core, thread.

I think that is because those are the only ones that are fairly generic
:-) They are also the only ones the scheduler cares about (plus NUMA).

> 
> For example, in zen2 this would correspond to a 'core complex' consisting
> 4 CPU cores (each one 2 threads) sharing some local L3 cache.
> https://en.wikichip.org/wiki/amd/microarchitectures/zen_2
> In zen3 it looks like this level will be the same as that for the die.
> 
> Given they used the name in knights landing (and as is pointed out in
> another branch of this thread, it's the CPUID description) I think Intel
> calls these 'tiles' (anyone confirm that?) 
> 
> A similar concept exists for some ARM processors. 
> https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110
> CCLs in the diagram on that page.
> 
> Centriq 2400 had 2 core 'duplexes' which shared l2.
> https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3
> 
> From the info release at hotchips, it looks like the thunderx3 deploys
> a similar ring interconnect with groups of cores, each with 4 threads.
> Not sure what they plan to call them yet though or whether they will chose
> to represent that layer of the topology in their firmware tables.
> 
> Arms CMN600 interconnect also support such 'clusters' though I have no idea
> if anyone has used it in this form yet.  In that case, they are called
> "processor compute clusters"
> https://developer.arm.com/documentation/100180/0103/
> 
> Xuantie-910 is cluster based as well (shares l2).
> 
> So in many cases the cluster level corresponds to something we already have
> visibility of due to cache sharing etc, but that isn't true in kunpeng 920.

The problem I see is that the benefit of keeping tasks together due to
the interconnect layout might vary significantly between systems. So if
we introduce a new cpumask for clusters, it has to represent roughly
the same system properties, otherwise generic software consuming this
information could be tricked.
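
Just to illustrate what I mean by generic software consuming it, a minimal
sketch that reads such an attribute from sysfs (the cluster_cpus_list name
and its location under the usual topology directory are assumptions based on
the attributes proposed in the patch, not something that exists today):

#include <stdio.h>

/* Print the proposed per-cpu cluster sibling list, if the kernel exposes it. */
int main(void)
{
        char path[128], buf[256];
        FILE *f;
        int cpu;

        for (cpu = 0; cpu < 8; cpu++) {
                snprintf(path, sizeof(path),
                         "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list",
                         cpu);
                f = fopen(path, "r");
                if (!f)
                        continue;       /* attribute not present on this kernel */
                if (fgets(buf, sizeof(buf), f))
                        printf("cpu%d cluster siblings: %s", cpu, buf);
                fclose(f);
        }
        return 0;
}

Such a consumer has no idea whether "cluster" means shared L2, shared L3 tag
or just a faster interconnect hop, which is why the semantics need to be
roughly the same everywhere.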

If there is a provable benefit of 

Re: [PATCH] arm64: dts: sdm845: Add CPU topology

2019-06-06 Thread Morten Rasmussen
On Thu, Jun 06, 2019 at 10:44:58AM +0200, Vincent Guittot wrote:
> On Thu, 6 Jun 2019 at 10:34, Dietmar Eggemann  
> wrote:
> >
> > On 6/6/19 10:20 AM, Vincent Guittot wrote:
> > > On Thu, 6 Jun 2019 at 09:49, Quentin Perret  
> > > wrote:
> > >>
> > >> Hi Vincent,
> > >>
> > >> On Thursday 06 Jun 2019 at 09:05:16 (+0200), Vincent Guittot wrote:
> > >>> Hi Quentin,
> > >>>
> > >>> On Wed, 5 Jun 2019 at 19:21, Quentin Perret  
> > >>> wrote:
> > 
> >  On Friday 17 May 2019 at 14:55:19 (-0700), Stephen Boyd wrote:
> > > Quoting Amit Kucheria (2019-05-16 04:54:45)
> > >> (cc'ing Andy's correct email address)
> > >>
> > >> On Wed, May 15, 2019 at 2:46 AM Stephen Boyd  
> > >> wrote:
> > >>>
> > >>> Quoting Amit Kucheria (2019-05-13 04:54:12)
> >  On Mon, May 13, 2019 at 4:31 PM Amit Kucheria 
> >   wrote:
> > >
> > > On Tue, Jan 15, 2019 at 12:13 AM Matthias Kaehlcke 
> > >  wrote:
> > >>
> > >> The 8 CPU cores of the SDM845 are organized in two clusters of 4 big
> > >> ("gold") and 4 little ("silver") cores. Add a cpu-map node to the DT
> > >> that describes this topology.
> > >
> > > This is partly true. There are two groups of gold and silver cores,
> > > but AFAICT they are in a single cluster, not two separate ones. SDM845
> > > is one of the early examples of ARM's DynamIQ architecture.
> > >
> > >> Signed-off-by: Matthias Kaehlcke 
> > >
> > > I noticed that this patch sneaked through for this merge window but
> > > perhaps we can whip up a quick fix for -rc2?
> > >
> > 
> >  And please find attached a patch to fix this up. Andy, since this
> >  hasn't landed yet (can we still squash this into the original patch?),
> >  I couldn't add a Fixes tag.
> > 
> > >>>
> > >>> I had the same concern. Thanks for catching this. I suspect this must
> > >>> cause some problem for IPA given that it can't discern between the big
> > >>> and little "power clusters"?
> > >>
> > >> Both EAS and IPA, I believe. It influences the scheduler's view of
> > >> the topology.
> > >
> > > And EAS and IPA are OK with the real topology? I'm just curious if
> > > changing the topology to reflect reality will be a problem for those
> > > two.
> > 
> >  FWIW, neither EAS nor IPA depends on this. Not the upstream version of
> >  EAS at least (which is used in recent Android kernels -- 4.19+).
> > 
> >  But doing this is still required for other things in the scheduler (the
> >  so-called 'capacity-awareness' code). So until we have a better
> >  solution, this patch is doing the right thing.
> > >>>
> > >>> I'm not sure to catch what you mean ?
> > >>> Which so-called 'capacity-awareness' code are you speaking about ? and
> > >>> what is the problem ?
> > >>
> > >> I'm talking about the wake-up path. ATM select_idle_sibling() is totally
> > >> unaware of capacity differences. In its current form, this function
> > >> basically assumes that all CPUs in a given sd_llc have the same
> > >> capacity, which would be wrong if we had a single MC level for SDM845.
> > >> So, until select_idle_sibling() is 'fixed' to be capacity-aware, we need
> > >> two levels of sd for asymmetric systems (including DynamIQ) so the
> > >> wake_cap() story actually works.
> > >>
> > >> I hope that clarifies it :)
> > >
> > > hmm... does this justifies this wrong topology ?

No, it doesn't. It relies heavily on how nested clusters are interpreted
too, so it is quite fragile.

> > > select_idle_sibling() is called only when system is overloaded and
> > > scheduler disables the EAS path
> > > In this case, the scheduler looks either for an idle cpu or for evenly
> > > spreading the loads
> > > This is maybe not always optimal and should probably be fixed but
> > > doesn't justifies a wrong topology description IMHO
> >
> > The big/Little cluster detection in wake_cap() doesn't work anymore with
> > DynamIQ w/o Phantom (DIE) domain. So the decision of going sis() or slow
> > path is IMHO broken.
> 
> That's probably not the right thread to discuss this further but i'm
> not sure to understand why wake_cap() doesn't work as it compares the
> capacity_orig of local cpu and prev cpu which are the same whatever
> the sched domain

We have had this discussion a couple of times over the last couple of
years. The story, IIRC, is that when we introduced capacity awareness in
the wake-up path (wake_cap()) we realised (I think it was actually you)
that we could use select_idle_sibling() in cases where we know that the
search space is limited to cpus with sufficient capacity so we didn't
have to take the long route through find_idlest_cpu(). Back 

Re: [PATCH v6 1/7] Documentation: DT: arm: add support for sockets defining package boundaries

2019-05-31 Thread Morten Rasmussen
On Fri, May 31, 2019 at 10:37:43AM +0100, Sudeep Holla wrote:
> On Thu, May 30, 2019 at 10:42:54PM +0100, Russell King - ARM Linux admin 
> wrote:
> > On Thu, May 30, 2019 at 12:51:03PM +0100, Morten Rasmussen wrote:
> > > On Wed, May 29, 2019 at 07:39:17PM -0400, Andrew F. Davis wrote:
> > > > On 5/29/19 5:13 PM, Atish Patra wrote:
> > > > >From: Sudeep Holla 
> > > > >
> > > > >The current ARM DT topology description provides the operating system
> > > > >with a topological view of the system that is based on leaf nodes
> > > > >representing either cores or threads (in an SMT system) and a
> > > > >hierarchical set of cluster nodes that creates a hierarchical topology
> > > > >view of how those cores and threads are grouped.
> > > > >
> > > > >However this hierarchical representation of clusters does not allow to
> > > > >describe what topology level actually represents the physical package or
> > > > >the socket boundary, which is a key piece of information to be used by
> > > > >an operating system to optimize resource allocation and scheduling.
> > > > >
> > > >
> > > > Are physical package descriptions really needed? What does "socket" imply
> > > > that a higher layer "cluster" node grouping does not? It doesn't imply a
> > > > different NUMA distance and the definition of "socket" is already not well
> > > > defined, is a dual chiplet processor not just a fancy dual "socket" or are
> > > > dual "sockets" on a server board "slotket" card, will we need new names for
> > > > those too..
> > >
> > > Socket (or package) just implies what you suggest, a grouping of CPUs
> > > based on the physical socket (or package). Some resources might be
> > > associated with packages and more importantly socket information is
> > > exposed to user-space. At the moment clusters are being exposed to
> > > user-space as sockets which is less than ideal for some topologies.
> >
> > Please point out a 32-bit ARM system that has multiple "socket"s.
> >
> > As far as I'm aware, all 32-bit systems do not have socketed CPUs
> > (modern ARM CPUs are part of a larger SoC), and the CPUs are always
> > in one package.
> >
> > Even the test systems I've seen do not have socketed CPUs.
> >
> 
> As far as we know, there's none. So we simply have to assume all
> those systems are single socket(IOW all CPUs reside inside a single
> SoC package) system.

Right, but we don't make that assumption. Clusters are reported as
sockets/packages for arm, just like they are for arm64. My comment above
applied to what can be described using DT, not what systems actually
exist. We need to be able to describe packages for architectures where we
can't make assumptions.

arm example (ARM TC2):
root@morras01-tc2:~# lstopo
Machine (985MB)
  Package L#0
Core L#0 + PU L#0 (P#0)
Core L#1 + PU L#1 (P#1)
  Package L#1
Core L#2 + PU L#2 (P#2)
Core L#3 + PU L#3 (P#3)
Core L#4 + PU L#4 (P#4)

Morten


Re: [PATCH v6 1/7] Documentation: DT: arm: add support for sockets defining package boundaries

2019-05-30 Thread Morten Rasmussen
On Thu, May 30, 2019 at 08:56:03AM -0400, Andrew F. Davis wrote:
> On 5/30/19 7:51 AM, Morten Rasmussen wrote:
> >On Wed, May 29, 2019 at 07:39:17PM -0400, Andrew F. Davis wrote:
> >>On 5/29/19 5:13 PM, Atish Patra wrote:
> >>>From: Sudeep Holla 
> >>>
> >>>The current ARM DT topology description provides the operating system
> >>>with a topological view of the system that is based on leaf nodes
> >>>representing either cores or threads (in an SMT system) and a
> >>>hierarchical set of cluster nodes that creates a hierarchical topology
> >>>view of how those cores and threads are grouped.
> >>>
> >>>However this hierarchical representation of clusters does not allow to
> >>>describe what topology level actually represents the physical package or
> >>>the socket boundary, which is a key piece of information to be used by
> >>>an operating system to optimize resource allocation and scheduling.
> >>>
> >>
> >>Are physical package descriptions really needed? What does "socket" imply
> >>that a higher layer "cluster" node grouping does not? It doesn't imply a
> >>different NUMA distance and the definition of "socket" is already not well
> >>defined, is a dual chiplet processor not just a fancy dual "socket" or are
> >>dual "sockets" on a server board "slotket" card, will we need new names for
> >>those too..
> >
> >Socket (or package) just implies what you suggest, a grouping of CPUs
> >based on the physical socket (or package). Some resources might be
> >associated with packages and more importantly socket information is
> >exposed to user-space. At the moment clusters are being exposed to
> >user-space as sockets which is less than ideal for some topologies.
> >
> 
> I see the benefit of reporting the physical layout and packaging information
> to user-space for tracking reasons, but from software perspective this
> doesn't matter, and the resource partitioning should be described elsewhere
> (NUMA nodes being the go to example).

That would make defining a NUMA node mandatory even for non-NUMA
systems?

> >At the moment user-space is only told about hw threads, cores, and
> >sockets. In the very near future it is going to be told about dies too
> >(look for Len Brown's multi-die patch set).
> >
> 
> Seems my hypothetical case is already in the works :(

Indeed. IIUC, the reasoning behind it is related to actual multi-die
x86 packages and some rapl stuff being per-die or per-core.

> 
> >I don't see how we can provide correct information to user-space based
> >on the current information in DT. I'm not convinced it was a good idea
> >to expose this information to user-space to begin with but that is
> >another discussion.
> >
> 
> Fair enough, it's a little late now to un-expose this info to userspace so
> we should at least present it correctly. My worry was this getting out of
> hand with layering, for instance what happens when we need to add die nodes
> in-between cluster and socket?

If we want the die mask to be correct for arm/arm64/riscv we need die
information from somewhere. I'm not in favour of adding more topology
layers to the user-space visible topology description, but others might
have a valid reason and if it is exposed I would prefer if we try to
expose the right information.

Btw, for packages, we already have that information in ACPI/PPTT so it
would be nice if we could have that for DT based systems too.

Morten


Re: [PATCH v6 1/7] Documentation: DT: arm: add support for sockets defining package boundaries

2019-05-30 Thread Morten Rasmussen
On Wed, May 29, 2019 at 07:39:17PM -0400, Andrew F. Davis wrote:
> On 5/29/19 5:13 PM, Atish Patra wrote:
> >From: Sudeep Holla 
> >
> >The current ARM DT topology description provides the operating system
> >with a topological view of the system that is based on leaf nodes
> >representing either cores or threads (in an SMT system) and a
> >hierarchical set of cluster nodes that creates a hierarchical topology
> >view of how those cores and threads are grouped.
> >
> >However this hierarchical representation of clusters does not allow to
> >describe what topology level actually represents the physical package or
> >the socket boundary, which is a key piece of information to be used by
> >an operating system to optimize resource allocation and scheduling.
> >
> 
> Are physical package descriptions really needed? What does "socket" imply
> that a higher layer "cluster" node grouping does not? It doesn't imply a
> different NUMA distance and the definition of "socket" is already not well
> defined, is a dual chiplet processor not just a fancy dual "socket" or are
> dual "sockets" on a server board "slotket" card, will we need new names for
> those too..

Socket (or package) just implies what you suggest, a grouping of CPUs
based on the physical socket (or package). Some resources might be
associated with packages and more importantly socket information is
exposed to user-space. At the moment clusters are being exposed to
user-space as sockets which is less than ideal for some topologies.

At the moment user-space is only told about hw threads, cores, and
sockets. In the very near future it is going to be told about dies too
(look for Len Brown's multi-die patch set).

I don't see how we can provide correct information to user-space based
on the current information in DT. I'm not convinced it was a good idea
to expose this information to user-space to begin with but that is
another discussion.

Morten


Re: [RFC PATCH 3/6] sched/dl: Try better placement even for deadline tasks that do not block

2019-05-07 Thread Morten Rasmussen
On Tue, May 07, 2019 at 03:13:40PM +0100, Quentin Perret wrote:
> On Monday 06 May 2019 at 06:48:33 (+0200), Luca Abeni wrote:
> > @@ -1591,6 +1626,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int 
> > sd_flag, int flags)
> >  
> > rcu_read_lock();
> > curr = READ_ONCE(rq->curr); /* unlocked access */
> > +   het = static_branch_unlikely(_asym_cpucapacity);
> 
> Nit: not sure how the generated code looks like but I wonder if this
> could potentially make you loose the benefit of the static key ?

I have to take the blame for this bit :-)

I would be surprised if the static_key gives us anything here, but that is
actually not the point. It is purely to know whether we have to be
capacity aware or not. I don't think we are in a critical path and the
variable providing the necessary condition just happened to be a
static_key.

We might be able to make better use of it if we refactor the code a bit.

Morten


Re: [PATCH 0/7] introduce cpu.headroom knob to cpu controller

2019-04-10 Thread Morten Rasmussen
Hi,

On Mon, Apr 08, 2019 at 02:45:32PM -0700, Song Liu wrote:
> Servers running latency sensitive workload usually aren't fully loaded for 
> various reasons including disaster readiness. The machines running our 
> interactive workloads (referred as main workload) have a lot of spare CPU 
> cycles that we would like to use for optimistic side jobs like video 
> encoding. However, our experiments show that the side workload has strong
> impact on the latency of main workload:
> 
>   side-job   main-load-level   main-avg-latency
>  none  1.0  1.00
>  none  1.1  1.10
>  none  1.2  1.10 
>  none  1.3  1.10
>  none  1.4  1.15
>  none  1.5  1.24
>  none  1.6  1.74
> 
>  ffmpeg1.0  1.82
>  ffmpeg1.1  2.74
> 
> Note: both the main-load-level and the main-avg-latency numbers are
>  _normalized_.

Could you reveal what level of utilization those main-load-level numbers
correspond to? I'm trying to understand why the latency seems to
increase rapidly once you hit 1.5. Is that the point where the system
hits 100% utilization?

> In these experiments, ffmpeg is put in a cgroup with cpu.weight of 1 
> (lowest priority). However, it consumes all idle CPU cycles in the 
> system and causes high latency for the main workload. Further experiments
> and analysis (more details below) shows that, for the main workload to meet
> its latency targets, it is necessary to limit the CPU usage of the side
> workload so that there are some _idle_ CPU. There are various reasons
> behind the need of idle CPU time. First, shared CPU resouce saturation 
> starts to happen way before time-measured utilization reaches 100%. 
> Secondly, scheduling latency starts to impact the main workload as CPU 
> reaches full utilization. 
> 
> Currently, the cpu controller provides two mechanisms to protect the main 
> workload: cpu.weight and cpu.max. However, neither of them is sufficient 
> in these use cases. As shown in the experiments above, side workload with 
> cpu.weight of 1 (lowest priority) would still consume all idle CPU and add 
> unacceptable latency to the main workload. cpu.max can throttle the CPU 
> usage of the side workload and preserve some idle CPU. However, cpu.max 
> cannot react to changes in load levels. For example, when the main 
> workload uses 40% of CPU, cpu.max of 30% for the side workload would yield 
> good latencies for the main workload. However, when the workload 
> experiences higher load levels and uses more CPU, the same setting (cpu.max 
> of 30%) would cause the interactive workload to miss its latency target. 
> 
> These experiments demonstrated the need for a mechanism to effectively 
> throttle CPU usage of the side workload and preserve idle CPU cycles. 
> The mechanism should be able to adjust the level of throttling based on
> the load level of the main workload. 
> 
> This patchset introduces a new knob for cpu controller: cpu.headroom. 
> cgroup of the main workload uses cpu.headroom to ensure side workload to 
> use limited CPU cycles. For example, if a main workload has a cpu.headroom 
> of 30%. The side workload will be throttled to give 30% overall idle CPU. 
> If the main workload uses more than 70% of CPU, the side workload will only 
> run with configurable minimal cycles. This configurable minimal cycles is
> referred as "tolerance" of the main workload.

IIUC, you are proposing to basically apply dynamic bandwidth throttling to
side-jobs to preserve a specific headroom of idle cycles.
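
If I got the arithmetic right, the proposal boils down to something like the
sketch below (my own reading of the cover letter, with made-up names; all
values as fractions of total CPU):

/*
 * Sketch of the cpu.headroom arithmetic as I understand it: the side
 * workload may use whatever is left after the main workload's usage
 * plus the configured headroom, but never less than the "tolerance".
 */
static double side_job_budget(double main_usage, double headroom,
                              double tolerance)
{
        double budget = 1.0 - main_usage - headroom;

        return budget > tolerance ? budget : tolerance;
}

With a headroom of 30%: a main load of 40% leaves the side job up to 30%,
while a main load above 70% leaves it only the tolerance.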

The bit that isn't clear to me is _why_ adding idle cycles helps your
workload. I'm not convinced that adding headroom gives any latency
improvements beyond watering down the impact of your side jobs. AFAIK,
the throttling mechanism effectively removes the throttled tasks from
the schedule according to a specific duty cycle. When the side job is
not throttled the main workload is experiencing the same latency issues
as before, but by dynamically tuning the side job throttling you can
achieve a better average latency. Am I missing something?

Have you looked at your distribution of main job latency and tried to
compare with when throttling is active/not active?

I'm wondering if the headroom solution is really the right solution for
your use-case or if what you are really after is something which is
lower priority than just setting the weight to 1. Something that
(nearly) always gets pre-empted by your main job (SCHED_BATCH and
SCHED_IDLE might not be enough). If your main job consist
of lots of relatively short wake-ups things like the min_granularity
could have significant latency impact.

Morten


Re: [PATCH 0/14] v2 multi-die/package topology support

2019-03-07 Thread Morten Rasmussen
On Tue, Feb 26, 2019 at 07:53:58PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 26, 2019 at 01:19:58AM -0500, Len Brown wrote:
> > Added sysfs package_threads, package_threads_list
> > 
> > Added this attribute to show threads siblings in a package.
> > Exactly same as "core_siblings above", a name now deprecated.
> > This attribute name and definition is immune to future
> > topology changes.
> > 
> > Suggested by Brice.
> > 
> > Added sysfs die_threads, die_threads_list
> > 
> > Added this attribute to show which threads siblings in a die.
> > V1 had proposed putting this info into "core_siblings", but we
> > decided to leave that legacy attribute alone.
> > This attribute name and definition is immune to future
> > topology changes.
> > 
> > On a single die-package system this attribute has same contents
> > as "package_threads".
> > 
> > Suggested by Brice.
> > 
> > Added sysfs core_threads, core_threads_list
> > 
> > Added this attribute to show which threads siblings in a core.
> > Exactly same as "thread_siblings", a name now deprecated.
> > This attribute name and definition is immune to future
> > topology changes.
> > 
> > Suggested by Brice.
> 
> I think I prefer 's/threads/cpus/g' on that. Threads makes me think SMT,
> and I don't think there's any guarantee the part in question will have
> SMT on.

I think 'threads' is a bit confusing as well. We seem to be using 'cpu'
everywhere for something we can schedule tasks on, including the sysfs
/sys/devices/system/cpu/ subdirs for each SMT thread on SMT systems.

Another thing that I find confusing is that with this series we get a new
die id/mask which is totally unrelated to the DIE level in the
sched_domain hierarchy. We should rename DIE level to something that
reflects what it really is. If we can agree on that ;-)

NODE level?

Morten


Re: [PATCH 05/14] cpu topology: Export die_id

2019-03-07 Thread Morten Rasmussen
Hi Len,

On Tue, Feb 26, 2019 at 01:20:03AM -0500, Len Brown wrote:
> From: Len Brown 
> 
> Export die_id in cpu topology, for the benefit of hardware that
> has multiple-die/package.
> 
> Signed-off-by: Len Brown 
> Cc: linux-...@vger.kernel.org
> ---
>  Documentation/cputopology.txt   | 6 ++
>  arch/x86/include/asm/topology.h | 1 +
>  drivers/base/topology.c | 4 
>  include/linux/topology.h| 3 +++
>  4 files changed, 14 insertions(+)
> 
> diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
> index cb61277e2308..4e6be7f68fd8 100644
> --- a/Documentation/cputopology.txt
> +++ b/Documentation/cputopology.txt
> @@ -12,6 +12,12 @@ physical_package_id:
>   socket number, but the actual value is architecture and platform
>   dependent.
>  
> +die_id:
> +
> + the CPU die ID of cpuX. Typically it is the hardware platform's
> + identifier (rather than the kernel's).  The actual value is
> + architecture and platform dependent.
> +
>  core_id:

Can we add the details about die_id further down in cputopology.txt as
well?

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 6c25ce682c90..77b65583081e 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -97,6 +97,7 @@ For an architecture to support this feature, it must define some of
 these macros in include/asm-XXX/topology.h::
 
#define topology_physical_package_id(cpu)
+   #define topology_die_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
@@ -116,10 +117,11 @@ provides default definitions for any of the above macros that are
 not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
-2) topology_core_id: 0
-3) topology_sibling_cpumask: just the given CPU
-4) topology_core_cpumask: just the given CPU
-5) topology_die_cpumask: just the given CPU
+2) topology_die_id: -1
+3) topology_core_id: 0
+4) topology_sibling_cpumask: just the given CPU
+5) topology_core_cpumask: just the given CPU
+6) topology_die_cpumask: just the given CPU
 
>  
>   the CPU core ID of cpuX. Typically it is the hardware platform's
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index 453cf38a1c33..281be6bbc80d 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -106,6 +106,7 @@ extern const struct cpumask *cpu_coregroup_mask(int cpu);
>  
>  #define topology_logical_package_id(cpu) (cpu_data(cpu).logical_proc_id)
>  #define topology_physical_package_id(cpu)(cpu_data(cpu).phys_proc_id)
> +#define topology_die_id(cpu) (cpu_data(cpu).cpu_die_id)
>  #define topology_core_id(cpu)
> (cpu_data(cpu).cpu_core_id)
>  
>  #ifdef CONFIG_SMP

The above is x86 specific and seems to fit better with the next patch in
the series.

Morten


[tip:sched/core] sched/fair: Add over-utilization/tipping point indicator

2018-12-11 Thread tip-bot for Morten Rasmussen
Commit-ID:  2802bf3cd936fe2c8033a696d375a4d9d3974de4
Gitweb: https://git.kernel.org/tip/2802bf3cd936fe2c8033a696d375a4d9d3974de4
Author: Morten Rasmussen 
AuthorDate: Mon, 3 Dec 2018 09:56:25 +
Committer:  Ingo Molnar 
CommitDate: Tue, 11 Dec 2018 15:17:01 +0100

sched/fair: Add over-utilization/tipping point indicator

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criteria for when we
make the switch.

The util_avg for each cpu converges towards 100% regardless of how many
additional tasks we may put on it. If we define over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own as we have 1.5*n_cpus tasks in total and 55%+55% is less
over-utilized than 55%+60% for those cpus that have to be shared. The
system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For systems where some cpus might have reduced capacity
(RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon as just a single cpu is fully utilized, as it might be one of those with
reduced capacity, and in that case we want to migrate it.

[ peterz: Added a comment explaining why new tasks are not accounted during
  overutilization detection. ]

Signed-off-by: Morten Rasmussen 
Signed-off-by: Quentin Perret 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: adhar...@codeaurora.org
Cc: chris.redp...@arm.com
Cc: curroje...@riseup.net
Cc: dietmar.eggem...@arm.com
Cc: edubez...@gmail.com
Cc: gre...@linuxfoundation.org
Cc: javi.mer...@kernel.org
Cc: j...@joelfernandes.org
Cc: juri.le...@redhat.com
Cc: patrick.bell...@arm.com
Cc: pkond...@codeaurora.org
Cc: r...@rjwysocki.net
Cc: skan...@codeaurora.org
Cc: smuc...@google.com
Cc: srinivas.pandruv...@linux.intel.com
Cc: thara.gopin...@linaro.org
Cc: tk...@google.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Cc: viresh.ku...@linaro.org
Link: https://lkml.kernel.org/r/20181203095628.11858-13-quentin.per...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c  | 59 ++--
 kernel/sched/sched.h |  4 
 2 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e04f29098ec7..767e7675774b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5082,6 +5082,24 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+   return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+   if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+   WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5139,8 +5157,26 @@ enqueue_task_fair(struct rq *rq, s

Re: [RFT PATCH v1 0/4] Unify CPU topology across ARM64 & RISC-V

2018-12-07 Thread Morten Rasmussen
Hi,

On Thu, Nov 29, 2018 at 03:28:16PM -0800, Atish Patra wrote:
> The cpu-map DT entry in ARM64 can describe the CPU topology in
> much better way compared to other existing approaches. RISC-V can
> easily adopt this binding to represent it's own CPU topology.
> Thus, both cpu-map DT binding and topology parsing code can be
> moved to a common location so that RISC-V or any other
> architecture can leverage that.
> 
> The relevant discussion regarding unifying cpu topology can be
> found in [1].
> 
> arch_topology seems to be a perfect place to move the common
> code. I have not introduced any functional changes in the moved
> code. The only downside in this approach is that the capacity
> code will be executed for RISC-V as well. But, it will exit
> immediately after not able to find the appropriate DT node. If
> the overhead is considered too much, we can always compile out
> capacity related functions under a different config for the
> architectures that do not support them.
> 
> The patches have been tested for RISC-V and compile tested for
> ARM64 & x86.

The cpu-map bindings are used for arch/arm too, and so is
arch_topology.c. In fact, it was introduced to allow code-sharing
between arm and arm64. Applying patch three breaks arm.

If we move the DT parsing to arch_topology.c, we have to unify all three
architectures. Be aware that arm and arm64 have some differences in how
they detect cpu capacities. I think we might have to look at the split
of code between arch/* and arch_topology.c again :-/

Morten


Re: [PATCH v5 2/2] sched/fair: update scale invariance of PELT

2018-11-05 Thread Morten Rasmussen
On Mon, Nov 05, 2018 at 10:10:34AM +0100, Vincent Guittot wrote:
> On Fri, 2 Nov 2018 at 16:36, Dietmar Eggemann  
> wrote:
> >
> > On 10/26/18 6:11 PM, Vincent Guittot wrote:
> > > The current implementation of load tracking invariance scales the
> > > contribution with current frequency and uarch performance (only for
> > > utilization) of the CPU. One main result of this formula is that the
> > > figures are capped by current capacity of CPU. Another one is that the
> > > load_avg is not invariant because not scaled with uarch.
> > >
> > > The util_avg of a periodic task that runs r time slots every p time slots
> > > varies in the range :
> > >
> > >  U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
> > >
> > > with U is the max util_avg value = SCHED_CAPACITY_SCALE
> > >
> > > At a lower capacity, the range becomes:
> > >
> > >  U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
> > >
> > > with C reflecting the compute capacity ratio between current capacity and
> > > max capacity.
> > >
> > > so C tries to compensate changes in (1-y^r') but it can't be accurate.
> > >
> > > Instead of scaling the contribution value of PELT algo, we should scale the
> > > running time. The PELT signal aims to track the amount of computation of
> > > tasks and/or rq so it seems more correct to scale the running time to
> > > reflect the effective amount of computation done since the last update.
> > >
> > > In order to be fully invariant, we need to apply the same amount of
> > > running time and idle time whatever the current capacity. Because running
> > > at lower capacity implies that the task will run longer, we have to ensure
> > > that the same amount of idle time will be apply when system becomes idle
> > > and no idle time has been "stolen". But reaching the maximum utilization
> > > value (SCHED_CAPACITY_SCALE) means that the task is seen as an
> > > always-running task whatever the capacity of the CPU (even at max compute
> > > capacity). In this case, we can discard this "stolen" idle times which
> > > becomes meaningless.
> > >
> > > In order to achieve this time scaling, a new clock_pelt is created per rq.
> > > The increase of this clock scales with current capacity when something
> > > is running on rq and synchronizes with clock_task when rq is idle. With
> > > this mecanism, we ensure the same running and idle time whatever the
> > > current capacity.
> >
> > Thinking about this new approach on a big.LITTLE platform:
> >
> > CPU Capacities big: 1024 LITTLE: 512, performance CPUfreq governor
> >
> > A 50% (runtime/period) task on a big CPU will become an always running
> > task on the little CPU. The utilization signal of the task and the
> > cfs_rq of the little CPU converges to 1024.
> >
> > With contrib scaling the utilization signal of the 50% task converges to
> > 512 on the little CPU, even it is always running on it, and so does the
> > one of the cfs_rq.
> >
> > Two 25% tasks on a big CPU will become two 50% tasks on a little CPU.
> > The utilization signal of the tasks converges to 512 and the one of the
> > cfs_rq of the little CPU converges to 1024.
> >
> > With contrib scaling the utilization signal of the 25% tasks converges
> > to 256 on the little CPU, even they run each 50% on it, and the one of
> > the cfs_rq converges to 512.
> >
> > So what do we consider system-wide invariance? I thought that e.g. a 25%
> > task should have a utilization value of 256 no matter on which CPU it is
> > running?
> >
> > In both cases, the little CPU is not going idle whereas the big CPU does.
> 
> IMO, the key point here is that there is no idle time. As soon as
> there is no idle time, you don't know if a task has enough compute
> capacity so you can't make difference between the 50% running task or
> an always running task on the little core.
> That's also interesting to noticed that the task will reach the always
> running state after more than 600ms on little core with utilization
> starting from 0.
> 
> Then considering the system-wide invariance, the task are not really
> invariant. If we take a 50% running task that run 40ms in a period of
> 80ms, the max utilization of the task will be 721 on the big core and
> 512 on the little core.
> Then, if you take a 39ms running task instead, the utilization on the
> big core will reach 709 but it will be 507 on little core. So your
> utilization depends on the current capacity
> With the new proposal, the max utilization will be 709 on big and
> little cores for the 39ms running task. For the 40ms running task, the
> utilization will be 721 on big core. then if the task moves on the
> little, it will reach the value 721 after 80ms,  then 900 after more
> than 160ms and 1000 after 320ms

It has always been debatable what to do with utilization when there are
no spare cycles.
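
As a cross-check of the numbers above, a small standalone sketch of the
steady-state bound U*(1-y^r)/(1-y^p) quoted from the patch description,
using the usual PELT half-life of 32ms (my own toy calculation, not kernel
code):

#include <math.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024.0

/* Peak util_avg of a periodic task running run_ms out of every period_ms. */
static double pelt_peak_util(double run_ms, double period_ms)
{
        double y = pow(0.5, 1.0 / 32.0);        /* y^32 = 0.5 */

        return SCHED_CAPACITY_SCALE *
               (1.0 - pow(y, run_ms)) / (1.0 - pow(y, period_ms));
}

int main(void)
{
        printf("40ms/80ms: %.0f\n", pelt_peak_util(40.0, 80.0));  /* ~721 */
        printf("39ms/80ms: %.0f\n", pelt_peak_util(39.0, 80.0));  /* ~709 */
        return 0;
}

It reproduces the 721/709 figures quoted above for the big core.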

In Dietmar's example where two 25% tasks are put on a 512 (50%) capacity
CPU we add just enough utilization to have no 

Re: [PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection

2018-09-11 Thread Morten Rasmussen
On Mon, Sep 10, 2018 at 10:21:11AM +0200, Ingo Molnar wrote:
> 
> * Morten Rasmussen  wrote:
> 
> > The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
> > sched_domain in the hierarchy where all cpu capacities are visible for
> > any cpu's point of view on asymmetric cpu capacity systems. The
> 
> >  /*
> > + * Find the sched_domain_topology_level where all cpu capacities are 
> > visible
> > + * for all cpus.
> > + */
> 
> > +   /*
> > +* Examine topology from all cpu's point of views to detect the lowest
> > +* sched_domain_topology_level where a highest capacity cpu is visible
> > +* to everyone.
> > +*/
> 
> >  #define SD_WAKE_AFFINE   0x0020  /* Wake task to waking CPU */
> > -#define SD_ASYM_CPUCAPACITY  0x0040  /* Groups have different max cpu 
> > capacities */
> > +#define SD_ASYM_CPUCAPACITY  0x0040  /* Domain members have different cpu 
> > capacities */
> 
> For future reference: *please* capitalize 'CPU' and 'CPUs' in future patches 
> like the rest of 
> the scheduler does.
> 
> You can see it spelled right above the new definition: 'waking CPU' ;-)
> 
> (I fixed this up in this patch.)

Noted. Thanks for fixing up the patch.

Morten
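
For readers following along, the scan described by the quoted comments
("examine topology from all cpu's point of views to detect the lowest
sched_domain_topology_level where a highest capacity cpu is visible to
everyone") can be modelled in a few lines. Everything below is a standalone toy
sketch, not the patch itself: the 8-CPU big.LITTLE layout, the bitmask spans and
the level numbering are invented for illustration.

#include <stdio.h>

#define NR_CPUS   8
#define NR_LEVELS 2     /* 0: MC (cluster), 1: DIE (whole system) - hypothetical */

/* Hypothetical big.LITTLE: CPUs 0-3 are LITTLE (512), CPUs 4-7 are big (1024). */
static const unsigned long cpu_capacity[NR_CPUS] = {
        512, 512, 512, 512, 1024, 1024, 1024, 1024
};

/* span[level][cpu]: bitmask of CPUs visible from 'cpu' at that level. */
static const unsigned int span[NR_LEVELS][NR_CPUS] = {
        { 0x0f, 0x0f, 0x0f, 0x0f, 0xf0, 0xf0, 0xf0, 0xf0 },    /* MC  */
        { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff },    /* DIE */
};

int main(void)
{
        int asym_level = 0;

        /* Examine the topology from every CPU's point of view and record the
         * lowest level at which a higher-capacity CPU becomes visible. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                unsigned long max_cap = cpu_capacity[cpu];

                for (int level = asym_level; level < NR_LEVELS; level++) {
                        for (int j = 0; j < NR_CPUS; j++) {
                                if (!(span[level][cpu] & (1u << j)))
                                        continue;
                                if (cpu_capacity[j] > max_cap) {
                                        max_cap = cpu_capacity[j];
                                        asym_level = level;
                                }
                        }
                }
        }

        printf("SD_ASYM_CPUCAPACITY level: %d\n", asym_level);
        return 0;
}

For this layout the LITTLE CPUs only see a 1024-capacity CPU at the DIE level,
so that is the level where the flag ends up.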


[tip:sched/core] sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  9c63e84db29bcf584040931ad97c2edd11e35f6c
Gitweb: https://git.kernel.org/tip/9c63e84db29bcf584040931ad97c2edd11e35f6c
Author: Morten Rasmussen 
AuthorDate: Wed, 4 Jul 2018 11:17:50 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:54 +0200

sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains

The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domains to take advantage of more
caches, and more cores on SMT systems. It has recently been changed to
be on all non-NUMA topology levels. However, spreading across domains
with CPU capacity asymmetry isn't desirable, e.g. spreading from high
capacity to low capacity CPUs even if the high capacity CPUs aren't
overutilized might give access to more cache, but the CPU will be slower
and possibly lead to worse overall throughput.

To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1530699470-29808-13-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/topology.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2536e1b938f9..7ffad0d3a4eb 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1126,7 +1126,7 @@ sd_init(struct sched_domain_topology_level *tl,
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
| 0*SD_SERIALIZE
-   | 0*SD_PREFER_SIBLING
+   | 1*SD_PREFER_SIBLING
| 0*SD_NUMA
| sd_flags
,
@@ -1152,17 +1152,21 @@ sd_init(struct sched_domain_topology_level *tl,
if (sd->flags & SD_ASYM_CPUCAPACITY) {
struct sched_domain *t = sd;
 
+   /*
+* Don't attempt to spread across CPUs of different capacities.
+*/
+   if (sd->child)
+   sd->child->flags &= ~SD_PREFER_SIBLING;
+
for_each_lower_domain(t)
t->flags |= SD_BALANCE_WAKE;
}
 
if (sd->flags & SD_SHARE_CPUCAPACITY) {
-   sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
 
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-   sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 117;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
@@ -1173,6 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->busy_idx = 3;
sd->idle_idx = 2;
 
+   sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] > 
RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1182,7 +1187,6 @@ sd_init(struct sched_domain_topology_level *tl,
 
 #endif
} else {
-   sd->flags |= SD_PREFER_SIBLING;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
sd->idle_idx = 1;
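
The effect of the hunk above can be shown with a toy flag walk: once a level is
marked SD_ASYM_CPUCAPACITY, the level immediately below it loses
SD_PREFER_SIBLING. This is a standalone sketch; the flag values and struct are
illustrative only, not the kernel's.

#include <stdio.h>

#define SD_PREFER_SIBLING   0x0040      /* values are illustrative only */
#define SD_ASYM_CPUCAPACITY 0x0080

struct toy_sd {
        unsigned int flags;
        struct toy_sd *child;
};

int main(void)
{
        struct toy_sd mc  = { .flags = SD_PREFER_SIBLING, .child = NULL };
        struct toy_sd die = { .flags = SD_PREFER_SIBLING | SD_ASYM_CPUCAPACITY,
                              .child = &mc };

        /* Mirror of the hunk above: don't attempt to spread across CPUs of
         * different capacities, so clear the flag one level down. */
        if ((die.flags & SD_ASYM_CPUCAPACITY) && die.child)
                die.child->flags &= ~SD_PREFER_SIBLING;

        printf("MC still prefers sibling spreading: %s\n",
               (mc.flags & SD_PREFER_SIBLING) ? "yes" : "no");
        return 0;
}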


[tip:sched/core] sched/fair: Consider misfit tasks when load-balancing

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  cad68e552e7774b68ae6a2c5fedb792936098b72
Gitweb: https://git.kernel.org/tip/cad68e552e7774b68ae6a2c5fedb792936098b72
Author: Morten Rasmussen 
AuthorDate: Wed, 4 Jul 2018 11:17:42 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:50 +0200

sched/fair: Consider misfit tasks when load-balancing

On asymmetric CPU capacity systems load intensive tasks can end up on
CPUs that don't suit their compute demand. In such scenarios 'misfit'
tasks should be migrated to CPUs with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-CPU capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the CPU with the highest misfit load as the source CPU.
5. If the misfit task is alone on the source CPU, go for active
   balancing.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1530699470-29808-5-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 51 +--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe04315d57b3..24fe39e57bc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6890,6 +6890,7 @@ struct lb_env {
unsigned int loop_max;
 
enum fbq_type   fbq_type;
+   enum group_type src_grp_type;
struct list_head tasks;
 };
 
@@ -7873,6 +7874,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+   /*
+* Don't try to pull misfit tasks we can't help.
+* We can use max_capacity here as reduction in capacity on some
+* CPUs in the group should either be possible to resolve
+* internally or be covered by avg_load imbalance (eventually).
+*/
+   if (sgs->group_type == group_misfit_task &&
+   (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+!group_has_capacity(env, &sds->local_stat)))
+   return false;
+
if (sgs->group_type > busiest->group_type)
return true;
 
@@ -7895,6 +7907,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
group_smaller_min_cpu_capacity(sds->local, sg))
return false;
 
+   /*
+* If we have more than one misfit sg go with the biggest misfit.
+*/
+   if (sgs->group_type == group_misfit_task &&
+   sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+   return false;
+
 asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8192,8 +8211,9 @@ static inline void calculate_imbalance(struct lb_env 
*env, struct sd_lb_stats *s
 * factors in sg capacity and sgs with smaller group_type are
 * skipped when updating the busiest sg:
 */
-   if (busiest->avg_load <= sds->avg_load ||
-   local->avg_load >= sds->avg_load) {
+   if (busiest->group_type != group_misfit_task &&
+   (busiest->avg_load <= sds->avg_load ||
+local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8227,6 +8247,12 @@ static inline void calculate_imbalance(struct lb_env 
*env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
 
+   /* Boost imbalance to allow misfit task to be balanced. */
+   if (busiest->group_type == group_misfit_task) {
+   env->imbalance = max_t(long, env->imbalance,
+  busiest->group_misfit_task_load);
+   }
+
/*
 * if *imbalance is less than the average load per runnable task
 * there is no guarantee that any tasks will be moved so we'll have
@@ -8293,6 +8319,10 @@ static s

[tip:sched/core] sched/fair: Add 'group_misfit_task' load-balance type

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  3b1baa6496e6b7ad016342a9d256bdfb072ce902
Gitweb: https://git.kernel.org/tip/3b1baa6496e6b7ad016342a9d256bdfb072ce902
Author: Morten Rasmussen 
AuthorDate: Wed, 4 Jul 2018 11:17:40 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:49 +0200

sched/fair: Add 'group_misfit_task' load-balance type

To maximize throughput in systems with asymmetric CPU capacities (e.g.
ARM big.LITTLE), load-balancing has to consider task and CPU utilization
as well as per-CPU compute capacity, in addition to the current average
load based load-balancing policy. Tasks with high utilization that are
scheduled on a lower capacity CPU need to be identified and migrated to
a higher capacity CPU if possible to maximize throughput.

To implement this additional policy an additional group_type
(load-balance scenario) is added: 'group_misfit_task'. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-CPU capacity. 'group_misfit_task' is only considered
if the system is not overloaded or imbalanced ('group_imbalanced' or
'group_overloaded').

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each CPU is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on each sched_tick.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1530699470-29808-3-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c  | 54 
 kernel/sched/sched.h |  2 ++
 2 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e5071aeb117..6e04bea5b11a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -693,6 +693,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant 
time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -1446,7 +1447,6 @@ bool should_numa_migrate_memory(struct task_struct *p, 
struct page * page,
 static unsigned long weighted_cpuload(struct rq *rq);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
@@ -3647,6 +3647,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct 
task_struct *p, bool task_sleep)
WRITE_ONCE(p->se.avg.util_est, ue);
 }
 
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+   return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+   if (!static_branch_unlikely(&sched_asym_cpucapacity))
+   return;
+
+   if (!p) {
+   rq->misfit_task_load = 0;
+   return;
+   }
+
+   if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+   rq->misfit_task_load = 0;
+   return;
+   }
+
+   rq->misfit_task_load = task_h_load(p);
+}
+
 #else /* CONFIG_SMP */
 
 #define UPDATE_TG  0x0
@@ -3676,6 +3699,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct 
task_struct *p) {}
 static inline void
 util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
 bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
 #endif /* CONFIG_SMP */
 
@@ -6201,7 +6225,7 @@ static int wake_cap(struct task_struct *p, int cpu, int 
prev_cpu)
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
 
-   return min_cap * 1024 < task_util(p) * capacity_margin;
+   return !task_fits_capacity(p, min_cap);
 }
 
 /*
@@ -6618,9 +6642,12 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
 
+   update_misfit_status(p, rq);
+
return p;
 
 idle:
+   update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
 
/*
@@ -6826,6 +6853,13 @@ static unsigned long __read_mostly 
max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+enum group_type {
+   group_other = 0,
+   group_misfit_task,
+   group_imbalanced,
+   group_overloaded,
+};
+
 #define LBF_ALL_PINNED 0x01
 #define LBF_
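
The fits/misfit decision in the diff above boils down to fixed-point
arithmetic: a task fits if capacity * 1024 > util * capacity_margin. A minimal
user-space sketch of that check, assuming the capacity_margin value of 1280
(~20% headroom) that fair.c used at the time:

#include <stdio.h>

/* Assumed margin: 1280/1024, i.e. a task "fits" if its estimated utilization
 * stays below roughly 80% of the CPU's capacity. */
static const unsigned long capacity_margin = 1280;

static int task_fits_capacity(unsigned long task_util, unsigned long capacity)
{
        return capacity * 1024 > task_util * capacity_margin;
}

int main(void)
{
        /* A util-400 task fits a LITTLE CPU (512); a util-600 one does not. */
        printf("util 400 on cap 512:  %s\n",
               task_fits_capacity(400, 512) ? "fits" : "misfit");
        printf("util 600 on cap 512:  %s\n",
               task_fits_capacity(600, 512) ? "fits" : "misfit");
        printf("util 600 on cap 1024: %s\n",
               task_fits_capacity(600, 1024) ? "fits" : "misfit");
        return 0;
}

So on a 512-capacity LITTLE CPU anything above roughly 80% of its capacity
(util ~410) is flagged as a misfit and becomes a candidate for migration to a
big CPU on the next balance.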

[tip:sched/core] sched/fair: Add sched_group per-CPU max capacity

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  e3d6d0cb66f2351cbfd09fbae04eb9804afe9577
Gitweb: https://git.kernel.org/tip/e3d6d0cb66f2351cbfd09fbae04eb9804afe9577
Author: Morten Rasmussen 
AuthorDate: Wed, 4 Jul 2018 11:17:41 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:49 +0200

sched/fair: Add sched_group per-CPU max capacity

The current sg->min_capacity tracks the lowest per-CPU compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for deciding whether a sched_group
needs offloading to another sched_group in some scenarios, e.g. a
sched_group with multiple CPUs where only one is under heavy pressure.
Tracking maximum capacity isn't perfect either, but it is a better choice
in some situations as it indicates that the sched_group is definitely
compute capacity constrained, either due to rt/irq pressure on all CPUs
or asymmetric CPU capacities (e.g. big.LITTLE).

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1530699470-29808-4-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 24 
 kernel/sched/sched.h|  1 +
 kernel/sched/topology.c |  2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e04bea5b11a..fe04315d57b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7557,13 +7557,14 @@ static void update_cpu_capacity(struct sched_domain 
*sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
+   sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
-   unsigned long capacity, min_capacity;
+   unsigned long capacity, min_capacity, max_capacity;
unsigned long interval;
 
interval = msecs_to_jiffies(sd->balance_interval);
@@ -7577,6 +7578,7 @@ void update_group_capacity(struct sched_domain *sd, int 
cpu)
 
capacity = 0;
min_capacity = ULONG_MAX;
+   max_capacity = 0;
 
if (child->flags & SD_OVERLAP) {
/*
@@ -7607,6 +7609,7 @@ void update_group_capacity(struct sched_domain *sd, int 
cpu)
}
 
min_capacity = min(capacity, min_capacity);
+   max_capacity = max(capacity, max_capacity);
}
} else  {
/*
@@ -7620,12 +7623,14 @@ void update_group_capacity(struct sched_domain *sd, int 
cpu)
 
capacity += sgc->capacity;
min_capacity = min(sgc->min_capacity, min_capacity);
+   max_capacity = max(sgc->max_capacity, max_capacity);
group = group->next;
} while (group != child->groups);
}
 
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = min_capacity;
+   sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
@@ -7721,16 +7726,27 @@ group_is_overloaded(struct lb_env *env, struct 
sg_lb_stats *sgs)
 }
 
 /*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
  */
 static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
return sg->sgc->min_capacity * capacity_margin <
ref->sgc->min_capacity * 1024;
 }
 
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+   return sg->sgc->max_capacity * capacity_margin <
+   ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
  struct sg_lb_stats *sgs)
@@ -7876,7 +7892,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 * power/energy consequences are not considered.
 */
if (sgs->sum_nr_running <= sgs->group_weight &&
-   group_smaller_cpu_capacity(sds->local, sg))
+   group_smaller_min_cpu_capacity(sds->local, sg))
return false;
 
 asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/
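
A small example of why min and max capacity answer different questions. Suppose
group A is a pair of big CPUs with one of them under heavy rt/irq pressure, and
group B is a pair of LITTLEs; the numbers are made up and the margin check only
mirrors the group_smaller_*_cpu_capacity() helpers in the diff above.

#include <stdio.h>

/* group A: big CPUs, one heavily pressured by rt/irq; group B: LITTLEs. */
static const unsigned long group_a[] = { 1024, 300 };
static const unsigned long group_b[] = { 512, 512 };
static const unsigned long capacity_margin = 1280;  /* assumed ~20% margin */

static void min_max(const unsigned long *g, int n,
                    unsigned long *min, unsigned long *max)
{
        *min = ~0UL;
        *max = 0;
        for (int i = 0; i < n; i++) {
                if (g[i] < *min)
                        *min = g[i];
                if (g[i] > *max)
                        *max = g[i];
        }
}

int main(void)
{
        unsigned long amin, amax, bmin, bmax;

        min_max(group_a, 2, &amin, &amax);
        min_max(group_b, 2, &bmin, &bmax);

        /* By min capacity A (300) looks "smaller" than B (512)... */
        printf("A smaller than B by min capacity: %d\n",
               amin * capacity_margin < bmin * 1024);
        /* ...but A still contains an unpressured 1024-capacity CPU, which the
         * max capacity comparison captures. */
        printf("A smaller than B by max capacity: %d\n",
               amax * capacity_margin < bmax * 1024);
        return 0;
}

The min-capacity comparison flags A as capacity constrained even though it can
still host a big task, which is the false positive the max-capacity metric and
the misfit load-balancing path avoid.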

[tip:sched/core] sched/topology: Add static_key for asymmetric CPU capacity optimizations

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  df054e8445a4011e3d693c2268129c0456108663
Gitweb: https://git.kernel.org/tip/df054e8445a4011e3d693c2268129c0456108663
Author: Morten Rasmussen 
AuthorDate: Wed, 4 Jul 2018 11:17:39 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:48 +0200

sched/topology: Add static_key for asymmetric CPU capacity optimizations

The existing asymmetric CPU capacity code should cause minimal overhead
for others. Putting it behind a static_key, as has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: gaku.inami...@renesas.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 3 +++
 kernel/sched/sched.h| 1 +
 kernel/sched/topology.c | 9 -
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f808ddf2a868..3e5071aeb117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6188,6 +6188,9 @@ static int wake_cap(struct task_struct *p, int cpu, int 
prev_cpu)
 {
long min_cap, max_cap;
 
+   if (!static_branch_unlikely(&sched_asym_cpucapacity))
+   return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8cae63c4..0f36adc31ba5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1185,6 +1185,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
atomic_t ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 5c4d583d53ee..b0cdf5e95bda 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -1705,6 +1706,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct 
sched_domain_attr *att
struct rq *rq = NULL;
int i, ret = -ENOMEM;
struct sched_domain_topology_level *tl_asym;
+   bool has_asym = false;
 
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
@@ -1720,8 +1722,10 @@ build_sched_domains(const struct cpumask *cpu_map, 
struct sched_domain_attr *att
for_each_sd_topology(tl) {
int dflags = 0;
 
-   if (tl == tl_asym)
+   if (tl == tl_asym) {
dflags |= SD_ASYM_CPUCAPACITY;
+   has_asym = true;
+   }
 
sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, 
i);
 
@@ -1773,6 +1777,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct 
sched_domain_attr *att
}
rcu_read_unlock();
 
+   if (has_asym)
+   static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);


[tip:sched/core] sched/topology, arch/arm: Rebuild sched_domain hierarchy when CPU capacity changes

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  e1799a80a4f5a463f252b7325da8bb66dfd55471
Gitweb: https://git.kernel.org/tip/e1799a80a4f5a463f252b7325da8bb66dfd55471
Author: Morten Rasmussen 
AuthorDate: Fri, 20 Jul 2018 14:32:34 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:48 +0200

sched/topology, arch/arm: Rebuild sched_domain hierarchy when CPU capacity 
changes

Asymmetric CPU capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later, triggered by the arch_topology driver. A full rebuild requires the
arch code to implement arch_update_cpu_topology(), which isn't yet
implemented for arm. This patch points the arm implementation to the
arch_topology driver to ensure that a full hierarchy rebuild happens when
needed.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Russell King 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1532093554-30504-5-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 arch/arm/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 5d88d2f22b2c..2a786f54d8b8 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -33,6 +33,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity topology_get_cpu_scale
 
+/* Enable topology flag updates */
+#define arch_update_cpu_topology topology_update_cpu_topology
+
 #else
 
 static inline void init_cpu_topology(void) { }


[tip:sched/core] sched/topology, drivers/base/arch_topology: Rebuild the sched_domain hierarchy when capacities change

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  bb1fbdd3c3fd12b612c7d8cdf13bd6bfeebdefa3
Gitweb: https://git.kernel.org/tip/bb1fbdd3c3fd12b612c7d8cdf13bd6bfeebdefa3
Author: Morten Rasmussen 
AuthorDate: Fri, 20 Jul 2018 14:32:32 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:47 +0200

sched/topology, drivers/base/arch_topology: Rebuild the sched_domain hierarchy 
when capacities change

The setting of SD_ASYM_CPUCAPACITY depends on the per-CPU capacities.
These might not have their final values when the hierarchy is initially
built, as the values depend on cpufreq being initialized or on values
being set through sysfs. To ensure that the flags are set correctly we
need to rebuild the sched_domain hierarchy whenever the reported per-CPU
capacity (arch_scale_cpu_capacity()) changes.

This patch ensures that a full sched_domain rebuild happens when CPU
capacity changes occur.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Greg Kroah-Hartman 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1532093554-30504-3-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 drivers/base/arch_topology.c  | 26 ++
 include/linux/arch_topology.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6ade81..edfcf8d982e4 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
 
@@ -47,6 +48,9 @@ static ssize_t cpu_capacity_show(struct device *dev,
return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
 }
 
+static void update_topology_flags_workfn(struct work_struct *work);
+static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
+
 static ssize_t cpu_capacity_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf,
@@ -72,6 +76,8 @@ static ssize_t cpu_capacity_store(struct device *dev,
topology_set_cpu_scale(i, new_capacity);
mutex_unlock(&cpu_scale_mutex);
 
+   schedule_work(&update_topology_flags_work);
+
return count;
 }
 
@@ -96,6 +102,25 @@ static int register_cpu_capacity_sysctl(void)
 }
 subsys_initcall(register_cpu_capacity_sysctl);
 
+static int update_topology;
+
+int topology_update_cpu_topology(void)
+{
+   return update_topology;
+}
+
+/*
+ * Updating the sched_domains can't be done directly from cpufreq callbacks
+ * due to locking, so queue the work for later.
+ */
+static void update_topology_flags_workfn(struct work_struct *work)
+{
+   update_topology = 1;
+   rebuild_sched_domains();
+   pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
+   update_topology = 0;
+}
+
 static u32 capacity_scale;
 static u32 *raw_capacity;
 
@@ -201,6 +226,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 
if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
+   schedule_work(&update_topology_flags_work);
free_raw_capacity();
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 2b709416de05..d9bdc1a7f4e7 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -9,6 +9,7 @@
 #include 
 
 void topology_normalize_cpu_scale(void);
+int topology_update_cpu_topology(void);
 
 struct device_node;
 bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu);
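
The deferred-rebuild dance in the diff above (the sysfs/cpufreq path can't
rebuild sched_domains directly because of locking, so it queues work; the
workfn raises a flag that arch_update_cpu_topology() reports while
rebuild_sched_domains() runs) can be modelled in a few lines. This is a
standalone toy with all kernel plumbing faked; only the names mirror the diff.

#include <stdio.h>

static int update_topology;

/* What the arch hook reports back to the scheduler during the rebuild. */
static int topology_update_cpu_topology(void)
{
        return update_topology;
}

static void rebuild_sched_domains(void)
{
        /* the real rebuild asks the arch whether the topology changed */
        printf("rebuild: arch_update_cpu_topology() = %d\n",
               topology_update_cpu_topology());
}

/* Runs from a workqueue in the real driver, never from the notifier itself. */
static void update_topology_flags_workfn(void)
{
        update_topology = 1;
        rebuild_sched_domains();
        update_topology = 0;
}

int main(void)
{
        /* e.g. after a cpu_capacity sysfs write, or once cpufreq has provided
         * capacities for all CPUs */
        update_topology_flags_workfn();
        return 0;
}

In the real driver the work is queued both from cpu_capacity_store() and once
the capacity parsing of all CPUs has completed, as the two schedule_work()
calls in the diff show.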


[tip:sched/core] sched/topology, arch/arm64: Rebuild the sched_domain hierarchy when the CPU capacity changes

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  3ba09df4b8b6e3f01ed6381e8fb890840fd0bca3
Gitweb: https://git.kernel.org/tip/3ba09df4b8b6e3f01ed6381e8fb890840fd0bca3
Author: Morten Rasmussen 
AuthorDate: Fri, 20 Jul 2018 14:32:33 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:47 +0200

sched/topology, arch/arm64: Rebuild the sched_domain hierarchy when the CPU 
capacity changes

Asymmetric CPU capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later triggered by the arch_topology driver. A full rebuild requires the
arch-code to implement arch_update_cpu_topology() which isn't yet
implemented for arm64. This patch points the arm64 implementation to the
arch_topology driver to ensure that a full hierarchy rebuild happens when
needed.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Catalin Marinas 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1532093554-30504-4-git-send-email-morten.rasmus...@arm.com
Signed-off-by: Ingo Molnar 
---
 arch/arm64/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/include/asm/topology.h 
b/arch/arm64/include/asm/topology.h
index 49a0fee4f89b..0524f2438649 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -45,6 +45,9 @@ int pcibus_to_node(struct pci_bus *bus);
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity topology_get_cpu_scale
 
+/* Enable topology flag updates */
+#define arch_update_cpu_topology topology_update_cpu_topology
+
 #include 
 
 #endif /* _ASM_ARM_TOPOLOGY_H */
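
For context on how this hook is consumed: partition_sched_domains() polls
arch_update_cpu_topology() and only matches and reuses existing
sched_domains while it returns zero; a nonzero return forces the whole
hierarchy to be torn down and rebuilt, which is what picks up updated CPU
capacities and flags. Roughly (abridged from kernel/sched/topology.c of
this era, not part of this patch):

	/* Let the architecture update CPU core mappings: */
	new_topology = arch_update_cpu_topology();

	/*
	 * Current domains are only compared against the new ones (and kept)
	 * while new_topology is zero; otherwise everything is destroyed and
	 * built from scratch.
	 */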


[tip:sched/core] sched/topology: Add SD_ASYM_CPUCAPACITY flag detection

2018-09-10 Thread tip-bot for Morten Rasmussen
Commit-ID:  05484e0984487d42e97c417cbb0697fa9d16e7e9
Gitweb: https://git.kernel.org/tip/05484e0984487d42e97c417cbb0697fa9d16e7e9
Author: Morten Rasmussen 
AuthorDate: Fri, 20 Jul 2018 14:32:31 +0100
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 11:05:45 +0200

sched/topology: Add SD_ASYM_CPUCAPACITY flag detection

The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all CPU capacities are visible for
any CPU's point of view on asymmetric CPU capacity systems. The
scheduler can then try to take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available CPU
capacities, as asymmetric systems might often appear symmetric at the
smallest level(s) of the sched_domain hierarchy.

The flag has been around for a while but has so far only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all CPUs in the system, i.e. when hotplugging
all CPUs out except those with one particular CPU capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flags should be set therefore depends not only on
topology information but also the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.

Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state are already taken care of. All the arch is
responsible for is to implement arch_scale_cpu_capacity() and force a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.

Signed-off-by: Morten Rasmussen 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: dietmar.eggem...@arm.com
Cc: valentin.schnei...@arm.com
Cc: vincent.guit...@linaro.org
Link: 
http://lkml.kernel.org/r/1532093554-30504-2-git-send-email-morten.rasmus...@arm.com
[ Fixed 'CPU' capitalization. ]
Signed-off-by: Ingo Molnar 
---
 include/linux/sched/topology.h |  6 ++--
 kernel/sched/topology.c| 81 ++
 2 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..6b9976180c1e 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -23,10 +23,10 @@
 #define SD_BALANCE_FORK0x0008  /* Balance on fork, clone */
 #define SD_BALANCE_WAKE0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE 0x0020  /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY0x0040  /* Groups have different max cpu 
capacities */
-#define SD_SHARE_CPUCAPACITY   0x0080  /* Domain members share cpu capacity */
+#define SD_ASYM_CPUCAPACITY0x0040  /* Domain members have different CPU 
capacities */
+#define SD_SHARE_CPUCAPACITY   0x0080  /* Domain members share CPU capacity */
 #define SD_SHARE_POWERDOMAIN   0x0100  /* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg 
resources */
+#define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share CPU pkg 
resources */
 #define SD_SERIALIZE   0x0400  /* Only a single load balancing 
instance */
 #define SD_ASYM_PACKING0x0800  /* Place busy groups earlier in 
the domain */
 #define SD_PREFER_SIBLING  0x1000  /* Prefer to place tasks in a sibling 
domain */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 505a41c42b96..5c4d583d53ee 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1061,7 +1061,6 @@ static struct cpumask 
***sched_domains_numa_masks;
  *   SD_SHARE_PKG_RESOURCES - describes shared caches
  *   SD_NUMA- describes NUMA topologies
  *   SD_SHARE_POWERDOMAIN   - describes shared power domain
- *   SD_ASYM_CPUCAPACITY- describes mixed capacity topologies
  *
  * Odd one out, which beside describing the topology has a quirk also
  * prescribes the desired behaviour that goes along with it:
@@ -1073,13 +1072,12 @@ static struct cpumask   
***sched_domains_numa_masks;
 SD_SHARE_PKG_RESOURCES |   \
 SD_NUMA|   \
 SD_ASYM_PACKING|   \
-SD_ASYM_CPUCAPACITY|   \
 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map,
-   struct sched_domain *child, int cpu)
+   struct sched_domain *child, int dflags, int cpu

Re: [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations

2018-08-02 Thread Morten Rasmussen
On Tue, Jul 31, 2018 at 12:59:16PM +0200, Peter Zijlstra wrote:
> 
> Combined with that SD_ASYM.. rework I ended up with the below.
> 
> Holler if you want it changed :-)

Looks good to me.

Thanks,
Morten


Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

2018-07-31 Thread Morten Rasmussen
On Tue, Jul 31, 2018 at 01:10:49PM +0100, Valentin Schneider wrote:
> Hi Peter,
> 
> On 31/07/18 13:00, Peter Zijlstra wrote:
> > 
> > 
> > Aside from the first patch, which I posted the change on, I've picked up
> > until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
> > right?
> >
> 11 is no longer needed, but AFAICT we still need 12 - we don't want
> PREFER_SIBLING to interfere with asymmetric systems.

Yes, we still want patch 12 if possible.


Re: [PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection

2018-07-24 Thread Morten Rasmussen
On Mon, Jul 23, 2018 at 05:07:50PM +0100, Qais Yousef wrote:
> On 23/07/18 16:27, Morten Rasmussen wrote:
> >It does increase the cost of things like hotplug slightly and
> >repartitioning of root_domains a slightly but I don't see how we can
> >avoid it if we want generic code to set this flag. If the costs are not
> >acceptable I think the only option is to make the detection architecture
> >specific.
> 
> I think hotplug is already expensive and this overhead would be small in
> comparison. But this could be called when frequency changes if I understood
> correctly - this is the one I wasn't sure how 'hot' it could be. I wouldn't
> expect frequency changes at a very high rate because it's relatively
> expensive too..

A frequency change shouldn't lead to a flag change or a rebuild of the
sched_domain hierarchy. The situations where the hierarchy should be
rebuilt to update the flag are during boot (as we only know the amount of
asymmetry once cpufreq has been initialized), when cpus are hotplugged
in/out, and when root_domains change due to cpuset reconfiguration. So
it should be a relatively rare event.


Re: [PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection

2018-07-23 Thread Morten Rasmussen
On Mon, Jul 23, 2018 at 02:25:34PM +0100, Qais Yousef wrote:
> Hi Morten
> 
> On 20/07/18 14:32, Morten Rasmussen wrote:
> >The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
> >sched_domain in the hierarchy where all cpu capacities are visible for
> >any cpu's point of view on asymmetric cpu capacity systems. The
> >scheduler can then take to take capacity asymmetry into account when
> 
> Did you mean "s/take to take/try to take/"?

Yes.


[...]

> >+/*
> >+ * Examine topology from all cpu's point of views to detect the lowest
> >+ * sched_domain_topology_level where a highest capacity cpu is visible
> >+ * to everyone.
> >+ */
> >+for_each_cpu(i, cpu_map) {
> >+unsigned long max_capacity = arch_scale_cpu_capacity(NULL, i);
> >+int tl_id = 0;
> >+
> >+for_each_sd_topology(tl) {
> >+if (tl_id < asym_level)
> >+goto next_level;
> >+
> 
> I think if you increment and then continue here you might save the extra
> branch. I didn't look at any disassembly though to verify the generated
> code.
> 
> I wonder if we can introduce for_each_sd_topology_from(tl, starting_level)
> so that you can start searching from a provided level - which will make this
> skipping logic unnecessary? So the code will look like
> 
>             for_each_sd_topology_from(tl, asymc_level) {
>                 ...
>             }

Both options would work. Increment+continue instead of goto would be
slightly less readable I think since we would still have the increment
at the end of the loop, but it is easy to do. Introducing
for_each_sd_topology_from() would improve things too, but I wonder if it
is worth it.
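
For reference, such a helper would just be a one-line variant of the
existing macro, along these lines (hypothetical; it is not part of the
posted series and assumes the usual sched_domain_topology table terminated
by a NULL ->mask):

#define for_each_sd_topology_from(tl, level)				\
	for (tl = sched_domain_topology + (level); tl->mask; tl++)

That would let the scan start directly at asym_level instead of skipping
the lower levels with a goto.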

> >@@ -1647,18 +1707,27 @@ build_sched_domains(const struct cpumask *cpu_map, 
> >struct sched_domain_attr *att
> > struct s_data d;
> > struct rq *rq = NULL;
> > int i, ret = -ENOMEM;
> >+struct sched_domain_topology_level *tl_asym;
> > alloc_state = __visit_domain_allocation_hell(, cpu_map);
> > if (alloc_state != sa_rootdomain)
> > goto error;
> >+tl_asym = asym_cpu_capacity_level(cpu_map);
> >+
> 
> Or maybe this is not a hot path and we don't care that much about optimizing
> the search since you call it unconditionally here even for systems that
> don't care?

It does increase the cost of things like hotplug slightly and
repartitioning of root_domains slightly, but I don't see how we can
avoid it if we want generic code to set this flag. If the costs are not
acceptable I think the only option is to make the detection architecture
specific.

In any case, AFAIK rebuilding the sched_domain hierarchy shouldn't be a
normal and common thing to do. If checking for the flag is not
acceptable on SMP-only architectures, I can move it under arch/arm[,64]
although it is not as clean.

Morten


Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry

2018-07-20 Thread Morten Rasmussen
On Thu, Jul 05, 2018 at 04:03:11PM +0100, Quentin Perret wrote:
> On Thursday 05 Jul 2018 at 15:13:49 (+0100), Morten Rasmussen wrote:
> > 3. Detecting the flag in generic kernel/sched/* code means that all
> > architectures will pay the for the overhead when building/rebuilding the
> > sched_domain hierarchy, and all architectures that sets the cpu
> > capacities to asymmetric will set the flag whether they like it or not.
> > I'm not sure if this is a problem.
> 
> That is true as well ...
> 
> > 
> > In the end it is really about how much of this we want in generic code
> > and how much we hide in arch/, and if we dare to touch the sched_domain
> > build code ;-)
> 
> Right so you can argue that the arch code is here to give you a
> system-level information, and that if the scheduler wants to virtually
> split that system, then it's its job to make sure that happens properly.
> That is exactly what your patch does (IIUC), and I now think that this
> is a very sensible middle-ground option. But this is debatable so I'm
> interested to see what others think :-)

I went ahead and hacked up some patches that set the flag automatically
as part of the sched_domain build process. I posted them so people can
have a look: 1532093554-30504-1-git-send-email-morten.rasmus...@arm.com

With those patches this patch has to be reverted/dropped.

Morten


[PATCH 1/4] sched/topology: SD_ASYM_CPUCAPACITY flag detection

2018-07-20 Thread Morten Rasmussen
The SD_ASYM_CPUCAPACITY sched_domain flag is supposed to mark the
sched_domain in the hierarchy where all cpu capacities are visible for
any cpu's point of view on asymmetric cpu capacity systems. The
scheduler can then take to take capacity asymmetry into account when
balancing at this level. It also serves as an indicator for how wide
task placement heuristics have to search to consider all available cpu
capacities as asymmetric systems might often appear symmetric at
smallest level(s) of the sched_domain hierarchy.

The flag has been around for while but so far only been set by
out-of-tree code in Android kernels. One solution is to let each
architecture provide the flag through a custom sched_domain topology
array and associated mask and flag functions. However,
SD_ASYM_CPUCAPACITY is special in the sense that it depends on the
capacity and presence of all cpus in the system, i.e. when hotplugging
all cpus out except those with one particular cpu capacity the flag
should disappear even if the sched_domains don't collapse. Similarly,
the flag is affected by cpusets where load-balancing is turned off.
Detecting when the flags should be set therefore depends not only on
topology information but also the cpuset configuration and hotplug
state. The arch code doesn't have easy access to the cpuset
configuration.

Instead, this patch implements the flag detection in generic code where
cpusets and hotplug state is already taken care of. All the arch is
responsible for is to implement arch_scale_cpu_capacity() and force a
full rebuild of the sched_domain hierarchy if capacities are updated,
e.g. later in the boot process when cpufreq has initialized.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 include/linux/sched/topology.h |  2 +-
 kernel/sched/topology.c| 81 ++
 2 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..4fe2e49ab13b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -23,7 +23,7 @@
 #define SD_BALANCE_FORK0x0008  /* Balance on fork, clone */
 #define SD_BALANCE_WAKE0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE 0x0020  /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY0x0040  /* Groups have different max cpu 
capacities */
+#define SD_ASYM_CPUCAPACITY0x0040  /* Domain members have different cpu 
capacities */
 #define SD_SHARE_CPUCAPACITY   0x0080  /* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN   0x0100  /* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg 
resources */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05a831427bc7..b8f41d557612 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1061,7 +1061,6 @@ static struct cpumask 
***sched_domains_numa_masks;
  *   SD_SHARE_PKG_RESOURCES - describes shared caches
  *   SD_NUMA- describes NUMA topologies
  *   SD_SHARE_POWERDOMAIN   - describes shared power domain
- *   SD_ASYM_CPUCAPACITY- describes mixed capacity topologies
  *
  * Odd one out, which beside describing the topology has a quirk also
  * prescribes the desired behaviour that goes along with it:
@@ -1073,13 +1072,12 @@ static struct cpumask   
***sched_domains_numa_masks;
 SD_SHARE_PKG_RESOURCES |   \
 SD_NUMA|   \
 SD_ASYM_PACKING|   \
-SD_ASYM_CPUCAPACITY|   \
 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map,
-   struct sched_domain *child, int cpu)
+   struct sched_domain *child, int dflags, int cpu)
 {
struct sd_data *sdd = &tl->data;
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@@ -1100,6 +1098,9 @@ sd_init(struct sched_domain_topology_level *tl,
"wrong sd_flags in topology description\n"))
sd_flags &= ~TOPOLOGY_SD_FLAGS;
 
+   /* Apply detected topology flags */
+   sd_flags |= dflags;
+
*sd = (struct sched_domain){
.min_interval   = sd_weight,
.max_interval   = 2*sd_weight,
@@ -1607,9 +1608,9 @@ static void __sdt_free(const struct cpumask *cpu_map)
 
 static struct sched_domain *build_sched_domain(struct 
sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
-   struct sched_domain *child, int cpu)
+   struct sched_domain *child, int dflags, int cpu)
 {
-   struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);
+   struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu);
 
if (child) {
sd->l

[PATCH 4/4] arch/arm: Rebuild sched_domain hierarchy when cpu capacity changes

2018-07-20 Thread Morten Rasmussen
Asymmetric cpu capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later triggered by the arch_topology driver. A full rebuild requires the
arch-code to implement arch_update_cpu_topology() which isn't yet
implemented for arm. This patch points the arm implementation to the
arch_topology driver to ensure that a full hierarchy rebuild happens when
needed.

cc: Russell King 

Signed-off-by: Morten Rasmussen 
---
 arch/arm/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 5d88d2f22b2c..2a786f54d8b8 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -33,6 +33,9 @@ const struct cpumask *cpu_coregroup_mask(int cpu);
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity topology_get_cpu_scale
 
+/* Enable topology flag updates */
+#define arch_update_cpu_topology topology_update_cpu_topology
+
 #else
 
 static inline void init_cpu_topology(void) { }
-- 
2.7.4



[PATCH 2/4] drivers/base/arch_topology: Rebuild sched_domain hierarchy when capacities change

2018-07-20 Thread Morten Rasmussen
The setting of SD_ASYM_CPUCAPACITY depends on the per-cpu capacities.
These might not have their final values when the hierarchy is initially
built as the values depend on cpufreq to be initialized or the values
being set through sysfs. To ensure that the flags are set correctly we
need to rebuild the sched_domain hierarchy whenever the reported per-cpu
capacity (arch_scale_cpu_capacity()) changes.

This patch ensures that a full sched_domain rebuild happens when cpu
capacity changes occur.

cc: Greg Kroah-Hartman 

Signed-off-by: Morten Rasmussen 
---
 drivers/base/arch_topology.c  | 26 ++
 include/linux/arch_topology.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6ade81..edfcf8d982e4 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
 
@@ -47,6 +48,9 @@ static ssize_t cpu_capacity_show(struct device *dev,
return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
 }
 
+static void update_topology_flags_workfn(struct work_struct *work);
+static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
+
 static ssize_t cpu_capacity_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf,
@@ -72,6 +76,8 @@ static ssize_t cpu_capacity_store(struct device *dev,
topology_set_cpu_scale(i, new_capacity);
mutex_unlock(&cpu_scale_mutex);
 
+   schedule_work(&update_topology_flags_work);
+
return count;
 }
 
@@ -96,6 +102,25 @@ static int register_cpu_capacity_sysctl(void)
 }
 subsys_initcall(register_cpu_capacity_sysctl);
 
+static int update_topology;
+
+int topology_update_cpu_topology(void)
+{
+   return update_topology;
+}
+
+/*
+ * Updating the sched_domains can't be done directly from cpufreq callbacks
+ * due to locking, so queue the work for later.
+ */
+static void update_topology_flags_workfn(struct work_struct *work)
+{
+   update_topology = 1;
+   rebuild_sched_domains();
+   pr_debug("sched_domain hierarchy rebuilt, flags updated\n");
+   update_topology = 0;
+}
+
 static u32 capacity_scale;
 static u32 *raw_capacity;
 
@@ -201,6 +226,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 
if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
+   schedule_work(&update_topology_flags_work);
free_raw_capacity();
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 2b709416de05..d9bdc1a7f4e7 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -9,6 +9,7 @@
 #include 
 
 void topology_normalize_cpu_scale(void);
+int topology_update_cpu_topology(void);
 
 struct device_node;
 bool topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu);
-- 
2.7.4



[PATCH 0/4] sched/topology: Set SD_ASYM_CPUCAPACITY flag automatically

2018-07-20 Thread Morten Rasmussen
The SD_ASYM_CPUCAPACITY flag has been around for some time now with no code to
actually set it. Android has carried patches to do this out-of-tree in the
meantime. The flag is meant to indicate cpu capacity asymmetry and is set at
the topology level where the sched_domain spans all available cpu capacity in
the system, i.e. all core types are visible, for any cpu in the system.

The flag was merged as being a topology flag, meaning that the architecture had
to provide the flag explicitly; however, when mixed with cpusets splitting the
system into multiple root_domains the flag can't be set without knowledge about
the cpusets. Rather than exposing cpusets to architecture code, this patch set
moves the responsibility for setting the flag to generic topology code, which is
simpler and makes the code architecture agnostic.
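
For illustration, the core of the detection added by the first patch boils
down to the loop below (a lightly simplified sketch; variable declarations
and the early bail-out for fully symmetric systems are left out):

	/*
	 * Examine the topology from every CPU's point of view and record
	 * the lowest level at which a CPU of the highest capacity becomes
	 * visible to everyone.
	 */
	for_each_cpu(i, cpu_map) {
		unsigned long max_capacity = arch_scale_cpu_capacity(NULL, i);
		int tl_id = 0;

		for_each_sd_topology(tl) {
			if (tl_id < asym_level)
				goto next_level;

			for_each_cpu_and(j, tl->mask(i), cpu_map) {
				unsigned long capacity;

				capacity = arch_scale_cpu_capacity(NULL, j);
				if (capacity <= max_capacity)
					continue;

				/* A bigger CPU becomes visible at this level */
				max_capacity = capacity;
				asym_level = tl_id;
				asym_tl = tl;
			}
next_level:
			tl_id++;
		}
	}

	/*
	 * build_sched_domains() then sets SD_ASYM_CPUCAPACITY in dflags for
	 * the topology level returned here (asym_tl, or NULL if symmetric).
	 */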

Morten Rasmussen (4):
  sched/topology: SD_ASYM_CPUCAPACITY flag detection
  drivers/base/arch_topology: Rebuild sched_domain hierarchy when
capacities change
  arch/arm64: Rebuild sched_domain hierarchy when cpu capacity changes
  arch/arm: Rebuild sched_domain hierarchy when cpu capacity changes

 arch/arm/include/asm/topology.h   |  3 ++
 arch/arm64/include/asm/topology.h |  3 ++
 drivers/base/arch_topology.c  | 26 +
 include/linux/arch_topology.h |  1 +
 include/linux/sched/topology.h|  2 +-
 kernel/sched/topology.c   | 81 ---
 6 files changed, 109 insertions(+), 7 deletions(-)

-- 
2.7.4



[PATCH 3/4] arch/arm64: Rebuild sched_domain hierarchy when cpu capacity changes

2018-07-20 Thread Morten Rasmussen
Asymmetric cpu capacity can not necessarily be determined accurately at
the time the initial sched_domain hierarchy is built during boot. It is
therefore necessary to be able to force a full rebuild of the hierarchy
later triggered by the arch_topology driver. A full rebuild requires the
arch-code to implement arch_update_cpu_topology() which isn't yet
implemented for arm64. This patch points the arm64 implementation to
arch_topology driver to ensure that full hierarchy rebuild happens when
needed.

cc: Catalin Marinas 
cc: Will Deacon 

Signed-off-by: Morten Rasmussen 
---
 arch/arm64/include/asm/topology.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/include/asm/topology.h 
b/arch/arm64/include/asm/topology.h
index df48212f767b..61ba09d48237 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -43,6 +43,9 @@ int pcibus_to_node(struct pci_bus *bus);
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity topology_get_cpu_scale
 
+/* Enable topology flag updates */
+#define arch_update_cpu_topology topology_update_cpu_topology
+
 #include 
 
 #endif /* _ASM_ARM_TOPOLOGY_H */
-- 
2.7.4



Re: arm: v4.18-rc5 with cpuidle on TC2 (A7 boot) spectre v2 issue

2018-07-20 Thread Morten Rasmussen
On Thu, Jul 19, 2018 at 02:32:22PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jul 19, 2018 at 11:01:10AM +0100, Russell King - ARM Linux wrote:
> > On Thu, Jul 19, 2018 at 11:42:50AM +0200, Dietmar Eggemann wrote:
> > > Hi,
> > > 
> > > running v4.18-rc5 (plus still missing "power: vexpress: fix corruption in
> > > notifier registration", otherwise I get this rcu_sched stall issue) on TC2
> > > (A7 boot) with vanilla multi_v7_defconfig plus
> > > CONFIG_ARM_BIG_LITTLE_CPUIDLE=y gives me continuous:
> > > 
> > > ...
> > >  CPUX: Spectre v2: incorrect context switching function, system vulnerable
> > > ...
> > > 
> > > messages.
> > > 
> > > Work around is to disable CONFIG_HARDEN_BRANCH_PREDICTOR.
> > 
> > or disable big.Little if you want the hardening.
> > 
> > The choices are currently either protection against Spectre or big.Little
> > support since the two are mutually exclusive at the moment.
> 
> An alternative would be to give the patches in the attachment a test.
> They're not finished yet, so I haven't sent them out, but still worth
> testing.

Thanks for sharing. I can confirm that your patches do cure the flood of 
warnings.

TC2 booting on A7:

[0.002922] CPU: Testing write buffer coherency: ok
[0.003347] CPU0: thread -1, cpu 0, socket 1, mpidr 8100
[0.004022] Setting up static identity map for 0x8010 - 0x80100060
[0.004265] ARM CCI driver probed
[0.004648] TC2 power management initialized
[0.004930] Hierarchical SRCU implementation.
[0.006956] smp: Bringing up secondary CPUs ...
[0.008712] CPU1: thread -1, cpu 0, socket 0, mpidr 8000
[0.008720] CPU1: Spectre v2: firmware did not set auxiliary control 
register IBE bit, system vulnerable
[0.009934] CPU2: thread -1, cpu 1, socket 0, mpidr 8001
[0.009940] CPU2: Spectre v2: firmware did not set auxiliary control 
register IBE bit, system vulnerable
[0.011147] CPU3: thread -1, cpu 1, socket 1, mpidr 8101
[0.012350] CPU4: thread -1, cpu 2, socket 1, mpidr 8102
[0.012468] smp: Brought up 1 node, 5 CPUs
[0.012490] SMP: Total of 5 processors activated (240.00 BogoMIPS).
[0.012499] CPU: All CPU(s) started in SVC mode.

TC2 booting on A15:

[0.002045] CPU0: Spectre v2: firmware did not set auxiliary control 
register IBE bit, system vulnerable
[0.002311] CPU0: thread -1, cpu 0, socket 0, mpidr 8000
[0.002809] Setting up static identity map for 0x8010 - 0x80100060
[0.003000] ARM CCI driver probed
[0.003408] TC2 power management initialized
[0.003637] Hierarchical SRCU implementation.
[0.005177] smp: Bringing up secondary CPUs ...
[0.006170] CPU1: thread -1, cpu 1, socket 0, mpidr 8001
[0.006176] CPU1: Spectre v2: firmware did not set auxiliary control 
register IBE bit, system vulnerable
[0.008137] CPU2: thread -1, cpu 0, socket 1, mpidr 8100
[0.009304] CPU3: thread -1, cpu 1, socket 1, mpidr 8101
[0.010405] CPU4: thread -1, cpu 2, socket 1, mpidr 8102
[0.010537] smp: Brought up 1 node, 5 CPUs
[0.010562] SMP: Total of 5 processors activated (240.00 BogoMIPS).
[0.010572] CPU: All CPU(s) started in SVC mode.

No further warnings for either configuration.

For reference, this a partial output from later in the boot process when
booting on A7 with 4.18-rc5 _without_ your patches:

[5.576176] device-mapper: ioctl: 4.39.0-ioctl (2018-04-03) initialised: 
dm-de...@redhat.com
[5.601689] cpu cpu0: bL_cpufreq_init: CPU 0 initialized
[5.618670] cpu cpu1: bL_cpufreq_init: CPU 1 initialized
[5.635583] arm_big_little: bL_cpufreq_register: Registered platform driver: 
vexpress-spc
[5.661112] mmci-pl18x 1c05.mmci: Got CD GPIO
[5.675235] mmci-pl18x 1c05.mmci: Got WP GPIO
[5.687783] CPU2: Spectre v2: incorrect context switching function, system 
vulnerable
[5.689623] mmci-pl18x 1c05.mmci: mmc0: PL180 manf 41 rev0 at 0x1c05 
irq 26,27 (pio)
[5.713217] CPU1: Spectre v2: incorrect context switching function, system 
vulnerable
[5.718044] CPU2: Spectre v2: incorrect context switching function, system 
vulnerable
[5.727896] CPU2: Spectre v2: incorrect context switching function, system 
vulnerable


Re: arm: v4.18-rc5 with cpuidle on TC2 (A7 boot) spectre v2 issue

2018-07-19 Thread Morten Rasmussen
On Thu, Jul 19, 2018 at 11:01:10AM +0100, Russell King - ARM Linux wrote:
> On Thu, Jul 19, 2018 at 11:42:50AM +0200, Dietmar Eggemann wrote:
> > Hi,
> > 
> > running v4.18-rc5 (plus still missing "power: vexpress: fix corruption in
> > notifier registration", otherwise I get this rcu_sched stall issue) on TC2
> > (A7 boot) with vanilla multi_v7_defconfig plus
> > CONFIG_ARM_BIG_LITTLE_CPUIDLE=y gives me continuous:
> > 
> > ...
> >  CPUX: Spectre v2: incorrect context switching function, system vulnerable
> > ...
> > 
> > messages.
> > 
> > Work around is to disable CONFIG_HARDEN_BRANCH_PREDICTOR.
> 
> or disable big.Little if you want the hardening.
> 
> The choices are currently either protection against Spectre or big.Little
> support since the two are mutually exclusive at the moment.

Would it be possible to make those messages only appear once like they do
when booting on A15?

As it is we have to change the default setting of this new option to
make the platform useable as those messages are flooding the console. I
see >40 messages per second.
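
A minimal sketch of the kind of thing I mean (untested, illustration only;
pr_warn(), DEFINE_PER_CPU() and this_cpu_xchg() are the existing kernel
helpers, the function itself is made up): warn once per CPU instead of on
every context switch, similar to what we already get when booting on A15:

	/* One-shot per-CPU variant of the warning; illustration only. */
	static DEFINE_PER_CPU(bool, spectre_v2_warned);

	static void spectre_v2_warn_once(void)
	{
		/* Only the first occurrence on each CPU gets printed. */
		if (this_cpu_xchg(spectre_v2_warned, true))
			return;

		pr_warn("CPU%u: Spectre v2: incorrect context switching function, system vulnerable\n",
			smp_processor_id());
	}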

Morten


Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

2018-07-09 Thread Morten Rasmussen
On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen  
> wrote:
> >
> > On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> > for performance that cpu intensive tasks are aggressively migrated to
> > high capacity cpus as soon as those become available. The capacity
> > awareness tweaks already in the wake-up path can't handle this as such
> > tasks might run or be runnable forever. If they happen to be placed on a
> > low capacity cpu from the beginning they are stuck there forever while
> > high capacity cpus may have become available in the meantime.
> >
> > To address this issue this patch set introduces a new "misfit"
> > load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> > the load-balance conditions to ignore load per capacity in certain
> > cases. Since misfit tasks are commonly running alone on a cpu, more
> > aggressive active load-balancing is needed too.
> >
> > The fundamental idea of this patch set has been in Android kernels for a
> > long time and is absolutely essential for consistent performance on
> > asymmetric cpu capacity systems.
> >
> 
> As already said , I'm not convinced by the proposal which seems quite
> complex and also adds some kind of arbitrary and fixed power
> management policy by deciding which tasks can or not go on big cores
> whereas there are other frameworks to take such decision like EAS or
> cgroups.

The misfit patches are a crucial part of the EAS solution but they also
make sense for some users on their own without an energy model. This is
why they are posted separately.

We have already discussed at length why the patches are needed and why
they look like they do in this thread:

https://lore.kernel.org/lkml/cakftptd4skw_3sak--vbec5-m1ua48bjoqys0pdqw3npsps...@mail.gmail.com/

> Furthermore, there is already something similar in the kernel
> with SD_ASYM_PACKING and IMO, it would be better to improve this
> feature (if needed) instead of adding a new one which often do similar
> things.

As said in the previous thread, while it might look similar it isn't.
SD_ASYM_PACKING isn't utilization-based which is the key metric used for
EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
different purpose (see previous thread for details).
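
To make the distinction concrete, here is a rough sketch of the two decision
points (illustration only; the wrapper functions are made up, while
sched_asym_prefer() and capacity_margin are the existing kernel names):

	/* Packing: purely CPU-priority based, the size of the task doesn't matter. */
	static bool asym_packing_wants_move(int dst_cpu, int src_cpu)
	{
		return sched_asym_prefer(dst_cpu, src_cpu);
	}

	/* Misfit: purely utilization based, compares the task against its CPU. */
	static bool misfit_wants_move(unsigned long task_util, unsigned long cpu_capacity)
	{
		return task_util * capacity_margin > cpu_capacity * 1024;
	}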

> I have rerun your tests and got same results than misfit task patchset
> on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
> and fake dynamiQ topology. And it give better performance when the
> pinned tasks are short and scheduler has to wait for the task to
> increase their utilization before getting a chance to migrate on big
> core.

Right, the test cases are quite simple and could be served better by
SD_ASYM_PACKING. As we already discussed in that thread, that is due to
the PELT lag, but this is the cost we have to pay if we don't have
additional information about the requirements of the task and we don't
want to default to big-first with all its implications.

We have covered all this in the thread in early April.

> Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
> together for b/L and dynamiQ topology

Could you provide some more details about your evaluation? It probably
works well for some use-cases but it isn't really designed for what we
need for EAS.

Morten


Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains

2018-07-06 Thread Morten Rasmussen
On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen  
> wrote:
> >
> > The 'prefer sibling' sched_domain flag is intended to encourage
> > spreading tasks to sibling sched_domain to take advantage of more caches
> > and core for SMT systems. It has recently been changed to be on all
> > non-NUMA topology level. However, spreading across domains with cpu
> > capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> > low capacity cpus even if high capacity cpus aren't overutilized might
> > give access to more cache but the cpu will be slower and possibly lead
> > to worse overall throughput.
> >
> > To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> > level immediately below SD_ASYM_CPUCAPACITY.
> 
> This makes sense. Nevertheless, this patch also raises a scheduling
> problem and break the 1 task per CPU policy that is enforced by
> SD_PREFER_SIBLING.

Scheduling one task per cpu when n_task == n_cpus on asymmetric
topologies is generally broken already and this patch set doesn't fix
that problem.

SD_PREFER_SIBLING might seem to help in very specific cases:
n_little_cpus == n_big_cpus. In that case the little group might be
classified as overloaded. It doesn't guarantee that anything gets pulled,
as the grp_load/grp_capacity in the imbalance calculation on some systems
still says the little cpus are more loaded than the bigs despite one of
them being idle. That depends on the little cpu capacities.

On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
as it assumes the group_weight to be the same. This is the case on Juno
and several other platforms.

IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
help for a limited subset of topologies/capacities but the right
solution is to change the imbalance calculation. As the name says, it is
meant to spread tasks and does so unconditionally. For asymmetric
systems we would like to consider cpu capacity before migrating tasks.
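
Something along these lines in the busiest-group selection is what I have in
mind (sketch only, not a patch; group_has_capacity() is the existing helper
and max_capacity comes from patch 3 of this series):

	/*
	 * Don't spread from a higher-capacity group to a lower-capacity one
	 * just because SD_PREFER_SIBLING wants one task per group; only do it
	 * if the source group is genuinely out of capacity.
	 */
	static bool worth_spreading(struct lb_env *env, struct sg_lb_stats *busiest_stats,
				    struct sched_group *busiest, struct sched_group *local)
	{
		if (local->sgc->max_capacity < busiest->sgc->max_capacity)
			return !group_has_capacity(env, busiest_stats);

		return true;
	}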

> When running the tests of your cover letter, 1 long
> running task is often co scheduled on a big core whereas short pinned
> tasks are still running and a little core is idle which is not an
> optimal scheduling decision

This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
say that this patch breaks anything that isn't broken already. In fact
we see this happening with and without this patch applied.

Morten


Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry

2018-07-05 Thread Morten Rasmussen
On Thu, Jul 05, 2018 at 02:31:43PM +0100, Quentin Perret wrote:
> Hi Morten,
> 
> On Wednesday 04 Jul 2018 at 11:17:49 (+0100), Morten Rasmussen wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 71330e0e41db..29c186961345 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
> > sd_id = cpumask_first(sched_domain_span(sd));
> >  
> > /*
> > +* Check if cpu_map eclipses cpu capacity asymmetry.
> > +*/
> > +
> > +   if (sd->flags & SD_ASYM_CPUCAPACITY) {
> > +   int i;
> > +   bool disable = true;
> > +   long capacity = arch_scale_cpu_capacity(NULL, sd_id);
> > +
> > +   for_each_cpu(i, sched_domain_span(sd)) {
> > +   if (capacity != arch_scale_cpu_capacity(NULL, i)) {
> > +   disable = false;
> > +   break;
> > +   }
> > +   }
> > +
> > +   if (disable)
> > +   sd->flags &= ~SD_ASYM_CPUCAPACITY;
> > +   }
> > +
> > +   /*
> >  * Convert topological properties into behaviour.
> >  */
> 
> If SD_ASYM_CPUCAPACITY means that some CPUs have different
> arch_scale_cpu_capacity() values, we could also automatically _set_
> the flag in sd_init() no ? Why should we let the arch set it and just
> correct it later ?
> 
> I understand the moment at which we know the capacities of CPUs varies
> from arch to arch, but the arch code could just call
> rebuild_sched_domain when the capacities of CPUs change and let the
> scheduler detect things automatically. I mean, even if the arch code
> sets the flag in its topology level table, it will have to rebuild
> the sched domains anyway ...
> 
> What do you think ?

We could as well set the flag here so the architecture doesn't have to
do it. It is a bit more complicated though for few reasons:

1. Detecting when to disable the flag is a lot simpler than checking
which level it should be set on. You basically have to work your way up
from the lowest topology level until you get to a level spanning all the
capacities available in the system to figure out where the flag should
be set (a rough sketch of that search follows after point 3 below). I
don't think this fits easily with how we build the sched_domain
hierarchy. It can of course be done.

2. As you say, we still need the arch code (or cpufreq?) to rebuild the
whole thing once we know that the capacities have been determined. That
currently implies implementing arch_update_cpu_topology() which is
arch-specific. So we would need some arch code to make the rebuild happen at
the right point in time. If the scheduler should be triggering the rebuild
itself, we need another way to force a full rebuild. This can also be done.

3. Detecting the flag in generic kernel/sched/* code means that all
architectures will pay for the overhead when building/rebuilding the
sched_domain hierarchy, and all architectures that set the cpu
capacities to asymmetric will set the flag whether they like it or not.
I'm not sure if this is a problem.
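
Rough sketch of the bottom-up search mentioned in point 1 (illustration only,
not a patch; the topology level iteration is left out and only the existing
cpumask/arch_scale_cpu_capacity() interfaces are used):

	/*
	 * True if 'span' contains at least one cpu of every capacity value
	 * present in 'cpu_map'. SD_ASYM_CPUCAPACITY would belong on the lowest
	 * topology level for which this holds for every cpu's span.
	 */
	static bool span_covers_all_capacities(const struct cpumask *span,
					       const struct cpumask *cpu_map)
	{
		int i, j;

		for_each_cpu(i, cpu_map) {
			bool found = false;

			for_each_cpu(j, span) {
				if (arch_scale_cpu_capacity(NULL, j) ==
				    arch_scale_cpu_capacity(NULL, i))
					found = true;
			}
			if (!found)
				return false;
		}

		return true;
	}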

In the end it is really about how much of this we want in generic code
and how much we hide in arch/, and if we dare to touch the sched_domain
build code ;-)

Morten


Re: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

2018-07-04 Thread Morten Rasmussen
Hi,

On Tue, Jul 03, 2018 at 02:28:28AM +, Gaku Inami wrote:
> Hi,
> 
> > -Original Message-
> > From: Morten Rasmussen 
> > Sent: Wednesday, June 20, 2018 6:06 PM
> > To: pet...@infradead.org; mi...@redhat.com
> > Cc: valentin.schnei...@arm.com; dietmar.eggem...@arm.com; 
> > vincent.guit...@linaro.org; Gaku Inami
> > ; linux-kernel@vger.kernel.org; Morten Rasmussen 
> > 
> > Subject: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric 
> > capacity systems
> [snip]
> > 
> > The patches have been tested on:
> >1. Arm Juno (r0): 2+4 Cortex A57/A53
> >2. Hikey960: 4+4 Cortex A73/A53
> > 
> > Test case:
> > Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> > big cpus, while creating a longer running set of unpinned sysbench
> > tasks.
> 
> I have tested v3 patches on Renesas SoC again. It looks fine.
> 
> You can add:
> 
> Tested-by: Gaku Inami 
> 
> The patches have been tested on:
>3. Renesas R-Car H3 : 4+4 Cortex A57/A53
> 
> Results:
> Single runs with completion time of each task
> R-Car H3 (tip)
> total time:  0.9391s
> total time:  0.9865s
> total time:  1.3691s
> total time:  1.6740s
> 
> R-Car H3 (misfit)
> total time:  0.9368s
> total time:  0.9475s
> total time:  0.9471s
> total time:  0.9505s
> 
> 10 run summary (tracking longest running task for each run)
>   R-Car H3
>   avg max
> tip 1.6742  1.6750
> misfit  0.9784  0.9905

Thanks for testing again. I have just posted v4 with some minor changes.
Behaviour for the test-cases should be the same.

Morten


[PATCHv4 02/12] sched/fair: Add group_misfit_task load-balance type

2018-07-04 Thread Morten Rasmussen
To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.

To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks itself and updating
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.
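
As a rough worked example of the fits check added below (assuming the
existing capacity_margin of 1280):

	capacity * 1024 > util * 1280   <=>   util < capacity * 0.8

i.e. a task is flagged as misfit once its estimated utilization exceeds
roughly 80% of its current cpu's capacity, e.g. util above ~819 on a
capacity-1024 big cpu or above ~357 on a capacity-446 LITTLE cpu.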

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c  | 54 
 kernel/sched/sched.h |  2 ++
 2 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85fb7e8ff5c8..e05e5202a1d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant 
time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -1448,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, 
struct page * page,
 static unsigned long weighted_cpuload(struct rq *rq);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
@@ -4035,6 +4035,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct 
task_struct *p, bool task_sleep)
WRITE_ONCE(p->se.avg.util_est, ue);
 }
 
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+   return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+   if (!static_branch_unlikely(&sched_asym_cpucapacity))
+   return;
+
+   if (!p) {
+   rq->misfit_task_load = 0;
+   return;
+   }
+
+   if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+   rq->misfit_task_load = 0;
+   return;
+   }
+
+   rq->misfit_task_load = task_h_load(p);
+}
+
 #else /* CONFIG_SMP */
 
 static inline int
@@ -4070,6 +4093,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct 
task_struct *p) {}
 static inline void
 util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
 bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) 
{}
 
 #endif /* CONFIG_SMP */
 
@@ -6596,7 +6620,7 @@ static int wake_cap(struct task_struct *p, int cpu, int 
prev_cpu)
/* Bring task utilization in sync with prev_cpu */
	sync_entity_load_avg(&p->se);
 
-   return min_cap * 1024 < task_util(p) * capacity_margin;
+   return !task_fits_capacity(p, min_cap);
 }
 
 /*
@@ -7013,9 +7037,12 @@ done: __maybe_unused;
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
 
+   update_misfit_status(p, rq);
+
return p;
 
 idle:
+   update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
 
/*
@@ -7221,6 +7248,13 @@ static unsigned long __read_mostly 
max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+enum group_type {
+   group_other = 0,
+   group_misfit_task,
+   group_imbalanced,
+   group_overloaded,
+};
+
 #define LBF_ALL_PINNED 0x01
 #define LBF_NEED_BREAK 0x02
 #define LBF_DST_PINNED  0x04
@@ -7762,12 +7796,6 @@ static unsigned long task_h_load(struct task_struct *p)
 
 /** Helpers for find_busiest_group /
 
-enum group_type {
-   group_other = 0,
-   group_imbalanced,
-   group_overloaded,
-};
-
 /*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
@@ -7783,6 +7811,7 @@ struct sg_lb_stats {
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+   unsigned long group_misfit_task_load; /* A cpu has a task too big for 
its capacity */
 #ifdef CONFIG_NUMA_BALANC

[PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems

2018-07-04 Thread Morten Rasmussen
On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever. If they happen to be placed on a
low capacity cpu from the beginning they are stuck there forever while
high capacity cpus may have become available in the meantime.

To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.

The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.

The patches have been tested on:
   1. Arm Juno (r0): 2+4 Cortex A57/A53
   2. Hikey960: 4+4 Cortex A73/A53

Test case:
Big cpus are always kept busy. Pin shorter running sysbench tasks to
big cpus, while creating a longer running set of unpinned sysbench
tasks.

REQUESTS=1000
BIGS="1 2"
LITTLES="0 3 4 5"
 
# Don't care about the score for those, just keep the bigs busy
for i in $BIGS; do
    taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
        --test=cpu run &>/dev/null &
done

for i in $LITTLES; do
    sysbench --max-requests=$REQUESTS --test=cpu run \
        | grep "total time:" &
done
 
wait

Results:
Single runs with completion time of each task
Juno (tip)
total time:  1.2608s
total time:  1.2995s
total time:  1.5954s
total time:  1.7463s

Juno (misfit)
total time:  1.2575s
total time:  1.3004s
total time:  1.5860s
total time:  1.5871s

Hikey960 (tip)
total time:  1.7431s
total time:  2.2914s
total time:  2.5976s
total time:  1.7280s

Hikey960 (misfit)
total time:  1.7866s
total time:  1.7513s
total time:  1.6918s
total time:  1.6965s

10 run summary (tracking longest running task for each run)
        Juno            Hikey960
        avg     max     avg     max
tip     1.7465  1.7469  2.5997  2.6131
misfit  1.6016  1.6192  1.8506  1.9666

Changelog:
v4
- Added check for empty cpu_map in sd_init().
- Added patch to disable SD_ASYM_CPUCAPACITY for root_domains that don't
  observe capacity asymmetry if the system as a whole is asymmetric.
- Added patch to disable SD_PREFER_SIBLING on the sched_domain level below
  SD_ASYM_CPUCAPACITY.
- Rebased against tip/sched/core.
- Fixed uninitialised variable introduced in update_sd_lb_stats.
- Added patch to do a slight variable initialisation cleanup in 
update_sd_lb_stats.
- Removed superfluous type changes for temp variables assigned to 
root_domain->overload.
- Reworded commit for the patch setting rq->rd->overload when misfit.
- v3 Tested-by: Gaku Inami 

v3
- Fixed locking around static_key.
- Changed group per-cpu capacity comparison to be based on max rather
  than min capacity.
- Added patch to prevent occasional pointless high->low capacity
  migrations.
- Changed type of group_misfit_task_load and misfit_task_load to
  unsigned long.
- Changed fbq() to pick the cpu with highest misfit_task_load rather
  than breaking when the first is found.
- Rebased against tip/sched/core.
- v2 Tested-by: Gaku Inami 

v2
- Removed redundant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao 
- Dropped patch #4 as although the patch seems to make sense no benefit
  has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami 

Chris Redpath (1):
  sched/fair: Don't move tasks to lower capacity cpus unless necessary

Morten Rasmussen (6):
  sched: Add static_key for asymmetric cpu capacity optimizations
  sched/fair: Add group_misfit_task load-balance type
  sched: Add sched_group per-cpu max capacity
  sched/fair: Consider misfit tasks when load-balancing
  sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without
asymmetry
  sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity
domains

Valentin Schneider (5):
  sched/fair: Kick nohz balance if rq->misfit_task_load
  sched/fair: Change prefer_sibling type to bool
  sched: Change r

[PATCHv4 03/12] sched: Add sched_group per-cpu max capacity

2018-07-04 Thread Morten Rasmussen
The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group in some scenarios, e.g. a
sched_group with multiple cpus where only one is under heavy pressure.
Tracking maximum capacity isn't perfect either, but it is a better choice
for some situations as it indicates that the sched_group is definitely
compute capacity constrained, either due to rt/irq pressure on all cpus or
asymmetric cpu capacities (e.g. big.LITTLE).
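
As an illustration of the difference (numbers invented for the example),
consider a two-cpu group where only one cpu is under heavy rt/irq pressure:

	cpu0 capacity (after pressure):  200
	cpu1 capacity (after pressure): 1024

	min_capacity = 200   -> group looks capacity constrained
	max_capacity = 1024  -> group still has a full-capacity cpu

On a LITTLE cluster both values stay well below the big cpus' capacity, so
max_capacity still identifies genuine asymmetry without over-reacting to
pressure on a single cpu.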

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c | 24 
 kernel/sched/sched.h|  1 +
 kernel/sched/topology.c |  2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e05e5202a1d2..09ede4321a3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7927,13 +7927,14 @@ static void update_cpu_capacity(struct sched_domain 
*sd, int cpu)
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
+   sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
-   unsigned long capacity, min_capacity;
+   unsigned long capacity, min_capacity, max_capacity;
unsigned long interval;
 
interval = msecs_to_jiffies(sd->balance_interval);
@@ -7947,6 +7948,7 @@ void update_group_capacity(struct sched_domain *sd, int 
cpu)
 
capacity = 0;
min_capacity = ULONG_MAX;
+   max_capacity = 0;
 
if (child->flags & SD_OVERLAP) {
/*
@@ -7977,6 +7979,7 @@ void update_group_capacity(struct sched_domain *sd, int 
cpu)
}
 
min_capacity = min(capacity, min_capacity);
+   max_capacity = max(capacity, max_capacity);
}
} else  {
/*
@@ -7990,12 +7993,14 @@ void update_group_capacity(struct sched_domain *sd, int 
cpu)
 
capacity += sgc->capacity;
min_capacity = min(sgc->min_capacity, min_capacity);
+   max_capacity = max(sgc->max_capacity, max_capacity);
group = group->next;
} while (group != child->groups);
}
 
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = min_capacity;
+   sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
@@ -8091,16 +8096,27 @@ group_is_overloaded(struct lb_env *env, struct 
sg_lb_stats *sgs)
 }
 
 /*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
  */
 static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
return sg->sgc->min_capacity * capacity_margin <
ref->sgc->min_capacity * 1024;
 }
 
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+   return sg->sgc->max_capacity * capacity_margin <
+   ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
  struct sg_lb_stats *sgs)
@@ -8246,7 +8262,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 * power/energy consequences are not considered.
 */
if (sgs->sum_nr_running <= sgs->group_weight &&
-   group_smaller_cpu_capacity(sds->local, sg))
+   group_smaller_min_cpu_capacity(sds->local, sg))
return false;
 
 asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3376bacab712..6c39a07e8a68 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1172,6 +1172,7 @@ struct sched_group_capacity {
 */
unsigned long   capacity;
unsigned long   min_capacity;   /* Min per-CPU capacity 
in group */
+   unsigned long   max_capacity;   /* Max per-CPU capacity 
in group */
unsigned long   next_update;
int imbalance;  /* XXX unrelated to 
capacity but shared group state */
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index

[PATCHv4 05/12] sched/fair: Kick nohz balance if rq->misfit_task_load

2018-07-04 Thread Morten Rasmussen
From: Valentin Schneider 

There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:

* rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
* sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity cpus, this will evaluate to false.
* rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure

As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.

The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan , and a similar patch was submitted for
the Android Common Kernel - see [1].

[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Valentin Schneider 
Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e885d92fad2..acec93e1dc51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9505,7 +9505,7 @@ static void nohz_balancer_kick(struct rq *rq)
if (time_before(now, nohz.next_balance))
goto out;
 
-   if (rq->nr_running >= 2) {
+   if (rq->nr_running >= 2 || rq->misfit_task_load) {
flags = NOHZ_KICK_MASK;
goto out;
}
-- 
2.7.4



[PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains

2018-07-04 Thread Morten Rasmussen
The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domains to take advantage of more
caches and cores on SMT systems. It has recently been changed to be set
on all non-NUMA topology levels. However, spreading across domains with
cpu capacity asymmetry isn't desirable: spreading from high capacity to
low capacity cpus, even when the high capacity cpus aren't overutilized,
might give access to more cache, but the cpus are slower and this can
lead to worse overall throughput.

To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 kernel/sched/topology.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 29c186961345..00c7a08c7f77 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl,
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
| 0*SD_SERIALIZE
-   | 0*SD_PREFER_SIBLING
+   | 1*SD_PREFER_SIBLING
| 0*SD_NUMA
| sd_flags
,
@@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl,
if (sd->flags & SD_ASYM_CPUCAPACITY) {
struct sched_domain *t = sd;
 
+   /*
+* Don't attempt to spread across cpus of different capacities.
+*/
+   if (sd->child)
+   sd->child->flags &= ~SD_PREFER_SIBLING;
+
for_each_lower_domain(t)
t->flags |= SD_BALANCE_WAKE;
}
 
if (sd->flags & SD_SHARE_CPUCAPACITY) {
-   sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
 
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-   sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 117;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
@@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl,
sd->busy_idx = 3;
sd->idle_idx = 2;
 
+   sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl,
 
 #endif
} else {
-   sd->flags |= SD_PREFER_SIBLING;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
sd->idle_idx = 1;
-- 
2.7.4



[PATCHv4 04/12] sched/fair: Consider misfit tasks when load-balancing

2018-07-04 Thread Morten Rasmussen
On asymmetric cpu capacity systems, load-intensive tasks can end up on
cpus that don't suit their compute demand.  In these scenarios, 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-cpu capacity and a destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has the lowest
priority, so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
   balancing.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c | 51 +--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09ede4321a3d..6e885d92fad2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7285,6 +7285,7 @@ struct lb_env {
unsigned int    loop_max;
 
enum fbq_type   fbq_type;
+   enum group_type src_grp_type;
struct list_head    tasks;
 };
 
@@ -8243,6 +8244,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+   /*
+* Don't try to pull misfit tasks we can't help.
+* We can use max_capacity here as reduction in capacity on some
+* cpus in the group should either be possible to resolve
+* internally or be covered by avg_load imbalance (eventually).
+*/
+   if (sgs->group_type == group_misfit_task &&
+   (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+    !group_has_capacity(env, &sds->local_stat)))
+   return false;
+
if (sgs->group_type > busiest->group_type)
return true;
 
@@ -8265,6 +8277,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
group_smaller_min_cpu_capacity(sds->local, sg))
return false;
 
+   /*
+* If we have more than one misfit sg go with the biggest misfit.
+*/
+   if (sgs->group_type == group_misfit_task &&
+   sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+   return false;
+
 asym_packing:
/* This is the busiest node in its class. */
if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8562,8 +8581,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 * factors in sg capacity and sgs with smaller group_type are
 * skipped when updating the busiest sg:
 */
-   if (busiest->avg_load <= sds->avg_load ||
-   local->avg_load >= sds->avg_load) {
+   if (busiest->group_type != group_misfit_task &&
+   (busiest->avg_load <= sds->avg_load ||
+local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8597,6 +8617,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
 
+   /* Boost imbalance to allow misfit task to be balanced. */
+   if (busiest->group_type == group_misfit_task) {
+   env->imbalance = max_t(long, env->imbalance,
+  busiest->group_misfit_task_load);
+   }
+
/*
 * if *imbalance is less than the average load per runnable task
 * there is no guarantee that any tasks will be moved so we'll have
@@ -8663,6 +8689,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
busiest->group_no_capacity)
goto force_balance;
 
+   /* Misfit tasks should be dealt with regardless of the avg load */
+   if (busiest->group_type == group_misfit_task)
+   goto force_balance;
+
/*
 * If the local group is busier than the selected busiest group
 * don't try and pull any tasks.
@@ -8700,6 +8730,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 force_balance:
/* Looks like there is an imbalance. Compute it */
+   env->src_grp_type = busi
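
A quick numerical sketch of modification (3) above (illustrative numbers
only, not kernel code; the max_t() boost is the relevant part):

#include <stdio.h>

#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

int main(void)
{
	long imbalance = 128;			/* from the avg_load path */
	long group_misfit_task_load = 800;	/* load of the stuck misfit task */

	/* Boost the imbalance so the whole misfit task can be pulled. */
	imbalance = max_t(long, imbalance, group_misfit_task_load);

	printf("imbalance = %ld\n", imbalance);	/* prints 800 */
	return 0;
}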

[PATCHv4 09/12] sched/fair: Set rq->rd->overload when misfit

2018-07-04 Thread Morten Rasmussen
From: Valentin Schneider 

Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are slower
than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks
_ signifies idling

LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
-|-
  big_0  | c c c c a a
  big_1  | d d d d b b
  ^
  ^
Tasks end on the big CPUs, idle balance happens
and the misfit tasks are pulled straight away

This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.

As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Valentin Schneider 
Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c  | 6 --
 kernel/sched/sched.h | 6 +-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0641ba7bea1..de84f5a9a65a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8163,7 +8163,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
@@ -8207,8 +8207,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
 
if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-   sgs->group_misfit_task_load < rq->misfit_task_load)
+   sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
+   *overload = 1;
+   }
}
 
/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e1dc85d1bfdd..377545b5aa15 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -695,7 +695,11 @@ struct root_domain {
cpumask_var_t   span;
cpumask_var_t   online;
 
-   /* Indicate more than one runnable task for any CPU */
+   /*
+* Indicate pullable load on at least one CPU, e.g:
+* - More than one runnable task
+* - Running task is misfit
+*/
int overload;
 
/*
-- 
2.7.4



[PATCHv4 10/12] sched/fair: Don't move tasks to lower capacity cpus unless necessary

2018-07-04 Thread Morten Rasmussen
From: Chris Redpath 

When lower capacity CPUs are load balancing and considering pulling
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running, as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve the imbalance, and we should make better use of system compute
capacity by moving a task if we still have more than one running.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Chris Redpath 
Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de84f5a9a65a..06beefa02420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8793,6 +8793,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
capacity = capacity_of(i);
 
+   /*
+* For ASYM_CPUCAPACITY domains, don't pick a cpu that could
+* eventually lead to active_balancing high->low capacity.
+* Higher per-cpu capacity is considered better than balancing
+* average load.
+*/
+   if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+   capacity_of(env->dst_cpu) < capacity &&
+   rq->nr_running == 1)
+   continue;
+
wl = weighted_cpuload(rq);
 
/*
-- 
2.7.4



[PATCHv4 08/12] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE

2018-07-04 Thread Morten Rasmussen
From: Valentin Schneider 

This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Valentin Schneider 
Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee26eeb188ef..d0641ba7bea1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8428,8 +8428,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
if (!env->sd->parent) {
/* update overload indicator if we are at root domain */
-   if (env->dst_rq->rd->overload != overload)
-   env->dst_rq->rd->overload = overload;
+   if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+   WRITE_ONCE(env->dst_rq->rd->overload, overload);
}
 }
 
@@ -9872,7 +9872,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
rq_unpin_lock(this_rq, rf);
 
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-   !this_rq->rd->overload) {
+   !READ_ONCE(this_rq->rd->overload)) {
 
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 648224b23287..e1dc85d1bfdd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1672,8 +1672,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
-   if (!rq->rd->overload)
-   rq->rd->overload = 1;
+   if (!READ_ONCE(rq->rd->overload))
+   WRITE_ONCE(rq->rd->overload, 1);
 #endif
}
 
-- 
2.7.4
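
A minimal user-space analogue of the pattern above (an illustration only,
not the kernel macros, which live in the compiler headers and handle more
cases): a volatile access forces the compiler to emit exactly one load or
store, so the compiler can't keep the flag cached in a register, merge, or
re-fetch the accesses. Requires GNU C typeof (gcc/clang):

#include <stdio.h>

#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

static int overload;

int main(void)
{
	/* Same shape as the add_nr_running() hunk above. */
	if (!READ_ONCE(overload))
		WRITE_ONCE(overload, 1);

	printf("overload = %d\n", READ_ONCE(overload));	/* prints 1 */
	return 0;
}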



[PATCHv4 06/12] sched/fair: Change prefer_sibling type to bool

2018-07-04 Thread Morten Rasmussen
From: Valentin Schneider 

This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Valentin Schneider 
Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acec93e1dc51..ee26eeb188ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8352,11 +8352,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
-   int load_idx, prefer_sibling = 0;
+   int load_idx;
bool overload = false;
-
-   if (child && child->flags & SD_PREFER_SIBLING)
-   prefer_sibling = 1;
+   bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 
 #ifdef CONFIG_NO_HZ_COMMON
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
-- 
2.7.4



[PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry

2018-07-04 Thread Morten Rasmussen
When hotplugging cpus out or creating exclusive cpusets (disabling
sched_load_balance), systems which were asymmetric at boot might become
symmetric. In this case leaving the flag set might lead to suboptimal
scheduling decisions.

The arch code providing the flag doesn't have visibility of the cpuset
configuration, so either it must be told by passing a cpumask, or the
generic topology code has to verify whether the flag should still be set
when taking the actual sched_domain_span() into account. This patch
implements the latter approach.

We need to detect capacity based on calling arch_scale_cpu_capacity()
directly as rq->cpu_capacity_orig hasn't been set yet early in the boot
process.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 kernel/sched/topology.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 71330e0e41db..29c186961345 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
sd_id = cpumask_first(sched_domain_span(sd));
 
/*
+* Check if cpu_map eclipses cpu capacity asymmetry.
+*/
+
+   if (sd->flags & SD_ASYM_CPUCAPACITY) {
+   int i;
+   bool disable = true;
+   long capacity = arch_scale_cpu_capacity(NULL, sd_id);
+
+   for_each_cpu(i, sched_domain_span(sd)) {
+   if (capacity != arch_scale_cpu_capacity(NULL, i)) {
+   disable = false;
+   break;
+   }
+   }
+
+   if (disable)
+   sd->flags &= ~SD_ASYM_CPUCAPACITY;
+   }
+
+   /*
 * Convert topological properties into behaviour.
 */
 
-- 
2.7.4
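
A standalone sketch of the detection loop above (not kernel code; the
capacity values are illustrative): SD_ASYM_CPUCAPACITY only survives if
at least one cpu in the domain span reports a capacity different from
the first cpu's.

#include <stdbool.h>
#include <stdio.h>

static bool span_is_asymmetric(const unsigned long *cap, int nr_cpus)
{
	for (int i = 1; i < nr_cpus; i++)
		if (cap[i] != cap[0])
			return true;
	return false;
}

int main(void)
{
	unsigned long boot_span[]    = { 430, 430, 1024, 1024 };  /* 2+2 big.LITTLE */
	unsigned long hotplug_span[] = { 1024, 1024 };  /* LITTLEs hotplugged out */

	printf("%d %d\n",
	       span_is_asymmetric(boot_span, 4),	/* 1: keep the flag */
	       span_is_asymmetric(hotplug_span, 2));	/* 0: clear the flag */
	return 0;
}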



[PATCHv4 07/12] sched: Change root_domain->overload type to int

2018-07-04 Thread Morten Rasmussen
From: Valentin Schneider 

sizeof(_Bool) is implementation-defined, so let's just go with 'int', as
is done for other structures, e.g. sched_domain_shared->has_idle_cores.

The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Valentin Schneider 
Signed-off-by: Morten Rasmussen 
---
 kernel/sched/sched.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6c39a07e8a68..648224b23287 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -696,7 +696,7 @@ struct root_domain {
cpumask_var_t   online;
 
/* Indicate more than one runnable task for any CPU */
-   bool overload;
+   int overload;
 
/*
 * The bit corresponding to a CPU gets set here if such CPU has more
@@ -1673,7 +1673,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
if (!rq->rd->overload)
-   rq->rd->overload = true;
+   rq->rd->overload = 1;
 #endif
}
 
-- 
2.7.4
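
A trivial check of the sizeof(_Bool) point above (the output is
implementation-specific; on common Linux ABIs _Bool is 1 byte and int is 4):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	printf("sizeof(_Bool) = %zu, sizeof(int) = %zu\n",
	       sizeof(bool), sizeof(int));
	return 0;
}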



[PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations

2018-07-04 Thread Morten Rasmussen
The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, as has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

cc: Ingo Molnar 
cc: Peter Zijlstra 

Signed-off-by: Morten Rasmussen 
---
 kernel/sched/fair.c |  3 +++
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 19 +++
 3 files changed, 23 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 321cd5dcf2e8..85fb7e8ff5c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6583,6 +6583,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
long min_cap, max_cap;
 
+   if (!static_branch_unlikely(&sched_asym_cpucapacity))
+   return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..35ce218f0157 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1160,6 +1160,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
atomic_t    ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05a831427bc7..0cfdeff669fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -425,6 +426,21 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+static void update_asym_cpucapacity(int cpu)
+{
+   int enable = false;
+
+   rcu_read_lock();
+   if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+   enable = true;
+   rcu_read_unlock();
+
+   if (enable) {
+   /* This expects to be hotplug-safe */
+   static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+   }
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -1707,6 +1723,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
rcu_read_unlock();
 
+   if (!cpumask_empty(cpu_map))
+   update_asym_cpucapacity(cpumask_first(cpu_map));
+
if (rq && sched_debug_enabled) {
pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
-- 
2.7.4


