Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-07 Thread Subhra Mazumdar



On 02/07/2018 12:42 AM, Peter Zijlstra wrote:

On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:


I meant the SMT balance patch. That does comparison with only one other
random core and takes the decision in O(1). Any potential scan of all cores
or cpus is O(n) and doesn't scale and will only get worse in future. That
applies to both select_idle_core() and select_idle_cpu().

We only do the full scan if we think to know there is indeed an idle
core to be had, and if there are idle cores the machine isn't terribly
busy.

If there are no idle cores, we do not in fact scan everything. We limit
the amount of scanning based on the average idle time with a minimum of
4.

This logic may not be working as well as you think. I had sent the cost of
select_idle_sibling() w/ and w/o my patch and there was a huge difference:

Following is the cost (in us) of select_idle_sibling() with hackbench 16
groups:

function                   baseline-rc6 (us)  %stdev   patch (us)        %stdev
select_idle_sibling()      0.556              1.72     0.263 (-52.70%)   0.78


(Except when you switch off things like SIS_PROP, then you scan
unconditionally and reduce tail latency at the cost of throughput --
like said, some people want this).

O(1) sounds nice, but it has horrible worst case latencies.

And like I said, I don't see how a rotor is particularly more random
than the previous cpu the task you happen to be waking ran on.

The rotor randomness is not the main point here. I think the benchmark
improvements come from the fact that the cost of select_idle_sibling() has
been reduced a lot. Reducing that cost while still maintaining a good spread
of threads can be achieved by this SMT balance scheme, which in turn requires
a fast, decent random number generator; the rotor is just an easy way to
achieve that.

Is there any reason this randomized approach is not acceptable even if
benchmarks show improvement? Are there other benchmarks I should try?

Looking at just one other core has a fairly high chance of not finding
idle. I really cannot remember all the benchmarks, but Mike did
something a little less random but still O(1) a few years back and that
crashed and burned.


Also your suggestion to keep the SMT utilization but still do a traversal
of cores in select_idle_core() while remembering the least loaded core will
still have the problem of potentially traversing all cores. I can compare
this with a core level only SMT balancing, is that useful to decide? I will
also test on SPARC machines with higher degree of SMT.

Please learn to use your email client, that's near unreadable for
whitespace damage, time is really too precious to try and untangle crap
like that.

Sorry about that



You had also mentioned to do it for only SMT >2, not sure I understand why
as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
the scalability problem.

For SMT2 you don't need the occupation counter with atomic crud, a
simple !atomic core-idle works just fine. And your 'patch' had soo many
moving parts it was too hard to say which aspect changed what.

hackbench really isn't _that_ interesting a benchmark and its fairly
easy to make it go faster, but doing so typically wrecks something else.

I also had uperf and Oracle DB TPC-C numbers in the patch which showed
improvements.


There's a whole array of netperf benchmarks which you have to run at
different utilization levels, there's the sysbench oltp stuff, which you
can run on MariaDB and PostgreSQL, again at different utilization levels
(both databases behave quite differently). There's ebizzy, tbench,
dbench and others.

There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.

There's the facebook schbench, which is a bit finicky to set up.

And I'm sure I'm forgetting half, look at the patches and regression
reports for more clues, that's what Google is for. You could have
figured all that out yourself, much of these are listed in the
changelogs and if you'd bothered to find lkml regression reports you
could find more.

Ok, will find out and run them.

Thanks,
Subhra


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-07 Thread Peter Zijlstra
On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:

> I meant the SMT balance patch. That does comparison with only one other
> random core and takes the decision in O(1). Any potential scan of all cores
> or cpus is O(n) and doesn't scale and will only get worse in future. That
> applies to both select_idle_core() and select_idle_cpu().

We only do the full scan if we think to know there is indeed an idle
core to be had, and if there are idle cores the machine isn't terribly
busy.

If there are no idle cores, we do not in fact scan everything. We limit
the amount of scanning based on the average idle time with a minimum of
4.

(Except when you switch off things like SIS_PROP, then you scan
unconditionally and reduce tail latency at the cost of throughput --
like said, some people want this).
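
(For reference, the scan limit described above is the SIS_PROP logic in
select_idle_cpu(); the condensed sketch below is recalled from mainline
kernel/sched/fair.c of roughly that era and may differ in detail from any
given version.)

	/* this_sd is this cpu's LLC domain, sd the domain being scanned */
	u64 avg_idle = this_rq()->avg_idle / 512;	/* large fuzz factor */
	u64 avg_cost = this_sd->avg_scan_cost + 1;
	int cpu, nr = INT_MAX;

	if (sched_feat(SIS_PROP)) {
		/* Scan budget proportional to avg_idle / avg_cost ... */
		u64 span_avg = sd->span_weight * avg_idle;

		if (span_avg > 4 * avg_cost)
			nr = div_u64(span_avg, avg_cost);
		else
			nr = 4;			/* ... with a floor of 4 */
	}

	for_each_cpu_wrap(cpu, sched_domain_span(sd), target) {
		if (!--nr)
			return -1;		/* scan budget exhausted */
		if (idle_cpu(cpu))
			break;
	}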

O(1) sounds nice, but it has horrible worst case latencies.

And like I said, I don't see how a rotor is particularly more random
than the previous cpu the task you happen to be waking ran on.

> Is there any reason this randomized approach is not acceptable even if
> benchmarks show improvement? Are there other benchmarks I should try?

Looking at just one other core has a fairly high chance of not finding
idle. I really cannot remember all the benchmarks, but Mike did
something a little less random but still O(1) a few years back and that
crashed and burned.

> Also your suggestion to keep the SMT utilization but still do a traversal of
> cores
> in select_idle_core() while remembering the least loaded core will still
> have
> the problem of potentially traversing all cores. I can compare this with a
> core
> level only SMT balancing, is that useful to decide? I will also test on
> SPARC
> machines with higher degree of SMT.

Please learn to use your email client, that's near unreadable for
whitespace damage, time is really too precious to try and untangle crap
like that.

> You had also mentioned to do it for only SMT >2, not sure I understand why
> as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
> the scalability problem.

For SMT2 you don't need the occupation counter with atomic crud, a
simple !atomic core-idle works just fine. And your 'patch' had soo many
moving parts it was too hard to say which aspect changed what.

hackbench really isn't _that_ interesting a benchmark and its fairly
easy to make it go faster, but doing so typically wrecks something else.

There's a whole array of netperf benchmarks which you have to run at
different utilization levels, there's the sysbench oltp stuff, which you
can run on MariaDB and PostgreSQL, again at different utilization levels
(both databases behave quite differently). There's ebizzy, tbench,
dbench and others.

There's also NUMA stuff, like NAS benchmarks and specJBB nonsense.

There's the facebook schbench, which is a bit finicky to set up.

And I'm sure I'm forgetting half, look at the patches and regression
reports for more clues, that's what Google is for. You could have
figured all that out yourself, much of these are listed in the
changelogs and if you'd bothered to find lkml regression reports you
could find more.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-06 Thread Subhra Mazumdar



On 02/06/2018 01:12 AM, Peter Zijlstra wrote:

On Mon, Feb 05, 2018 at 02:09:11PM -0800, Subhra Mazumdar wrote:

The pseudo random is also used for choosing a random core to compare with,
how will transposing achieve that?

Not entirely sure what your point is. Current code doesn't compare to
just _one_ other core, and I don't think we'd ever want to do that.

So currently select_idle_core() will, if there is an idle core, iterate
the whole thing trying to find it. If it fails, it clears the
'have_idle_core' state.

select_idle_cpu, which we'll fall back to, will limit the scanning based
on the average idle time.


The crucial point however, is that concurrent wakeups will not, on
average, do the same iteration because of the target offset.

I meant the SMT balance patch. That does comparison with only one other
random core and takes the decision in O(1). Any potential scan of all cores
or cpus is O(n) and doesn't scale and will only get worse in future. That
applies to both select_idle_core() and select_idle_cpu().

Is there any reason this randomized approach is not acceptable even if
benchmarks show improvement? Are there other benchmarks I should try?

Also your suggestion to keep the SMT utilization but still do a traversal
of cores in select_idle_core() while remembering the least loaded core will
still have the problem of potentially traversing all cores. I can compare
this with a core level only SMT balancing, is that useful to decide? I will
also test on SPARC machines with higher degree of SMT.

You had also mentioned to do it for only SMT >2, not sure I understand why
as even for SMT=2 (intel) benchmarks show improvement. This clearly shows
the scalability problem.

Thanks,
Subhra


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-06 Thread Peter Zijlstra
On Mon, Feb 05, 2018 at 02:09:11PM -0800, Subhra Mazumdar wrote:
> The pseudo random is also used for choosing a random core to compare with,
> how will transposing achieve that?

Not entirely sure what your point is. Current code doesn't compare to
just _one_ other core, and I don't think we'd ever want to do that.

So currently select_idle_core() will, if there is an idle core, iterate
the whole thing trying to find it. If it fails, it clears the
'have_idle_core' state.

select_idle_cpu, which we'll fall back to, will limit the scanning based
on the average idle time.


The crucial point however, is that concurrent wakeups will not, on
average, do the same iteration because of the target offset.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-05 Thread Subhra Mazumdar



On 02/05/2018 09:03 AM, Peter Zijlstra wrote:

On Mon, Feb 05, 2018 at 01:48:54PM +0100, Peter Zijlstra wrote:

So while I see the point of tracking these numbers (for SMT>2), I don't
think its worth doing outside of the core, and then we still need some
powerpc (or any other architecture with abysmal atomics) tested.

FWIW Power has another 'fun' feature, their cores have asymmetric SMT.

Their cores have a static power level, based on _which_ SMT sibling is
running, not how many. A single SMT2 runs (much) slower than a single
SMT0.

So that random selection stuff really doesn't work well for them. Now
'sadly' x86 can also have ASYM_PACKING set on its SMT domain, so I'm
going to have to figure out what to do about all that.

Even the existing code doesn't handle that. The SMT balancing compares the
remaining SMT capacity, so even with asymmetric cores it should work OK.


Thanks,
Subhra


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-05 Thread Subhra Mazumdar



On 02/05/2018 04:19 AM, Peter Zijlstra wrote:

On Fri, Feb 02, 2018 at 09:37:02AM -0800, Subhra Mazumdar wrote:

In the scheme of SMT balance, if the idle cpu search is done _not_ in the
last run core, then we need a random cpu to start from. If the idle cpu
search is done in the last run core we can start the search from last run
cpu. Since we need the random index for the first case I just did it for
both.

That shouldn't be too hard to fix. I think we can simply transpose the
CPU number. That is, something like:

   cpu' = core'_id + (cpu - core_id)

should work for most sane cases. We don't give any guarantees this will
in fact work, but (almost) all actual CPU enumeration schemes I've seen
this should work for.

And if it doesn't work, we're not worse off than we are now.

I just couldn't readily find a place where we need to do this for cores
with the current code. But I think we have one place between LLCs where
it can be done:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b6535987500..eb8b8d0a026c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6109,7 +6109,7 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
if (!static_branch_likely(&sched_smt_present))
return -1;
  
-   for_each_cpu(cpu, cpu_smt_mask(target)) {
+   for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
if (idle_cpu(cpu))
@@ -6357,8 +6357,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
if (cpu == prev_cpu)
goto pick_cpu;
  
-   if (wake_affine(affine_sd, p, prev_cpu, sync))
-   new_cpu = cpu;
+   if (wake_affine(affine_sd, p, prev_cpu, sync)) {
+   /*
+* Transpose prev_cpu's offset into this cpu's
+* LLC domain to retain the 'random' search offset
+* for for_each_cpu_wrap().
+*/
+   new_cpu = per_cpu(sd_llc_id, cpu) +
+ (prev_cpu - per_cpu(sd_llc_id, prev_cpu));
+   if (unlikely(!cpus_share_cache(new_cpu, cpu)))
+   new_cpu = cpu;
+   }
}
  
  	if (sd && !(sd_flag & SD_BALANCE_FORK)) {

The pseudo random is also used for choosing a random core to compare with,
how will transposing achieve that?


Thanks,
Subhra


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-05 Thread Peter Zijlstra
On Mon, Feb 05, 2018 at 01:48:54PM +0100, Peter Zijlstra wrote:
> So while I see the point of tracking these numbers (for SMT>2), I don't
> think its worth doing outside of the core, and then we still need some
> powerpc (or any other architecture with abysmal atomics) tested.

FWIW Power has another 'fun' feature, their cores have asymmetric SMT.

Their cores have a static power level, based on _which_ SMT sibling is
running, not how many. A single SMT2 runs (much) slower than a single
SMT0.

So that random selection stuff really doesn't work well for them. Now
'sadly' x86 can also have ASYM_PACKING set on its SMT domain, so I'm
going to have to figure out what to do about all that.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-05 Thread Mike Galbraith
On Mon, 2018-02-05 at 13:48 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> > On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> 
> > > But then you get that atomic crud to contend on the cluster level, which
> > > is even worse than it contending on the core level.
> > 
> > True, but it can still be a net win if we make better scheduling decisions.
> > A saving grace is that the atomic counter is only updated if the cpu
> > makes a transition from idle to busy or vice versa.
> 
> Which can still be a very high rate for some workloads. I always forget
which, but there are plenty workloads that have very frequent very
> short idle times. Mike, do you remember what comes apart when we take
> out the sysctl_sched_migration_cost test in idle_balance()?

Used to be anything scheduling cross-core heftily suffered, ie pretty
much any localhost communication heavy load.  I just tried disabling it
in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
difference.  I presume that's due to the meanwhile added this_rq->rd->overload
and/or curr_cost checks.  I don't recall the original cost
details beyond it having been "a sh*tload".

-Mike


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-05 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> On 2/2/2018 2:59 PM, Peter Zijlstra wrote:

> > But then you get that atomic crud to contend on the cluster level, which
> > is even worse than it contending on the core level.
> 
> True, but it can still be a net win if we make better scheduling decisions.
> A saving grace is that the atomic counter is only updated if the cpu
> makes a transition from idle to busy or vice versa.

Which can still be a very high rate for some workloads. I always forget
which, but there are plenty workloads that have very frequent very
short idle times. Mike, do you remember what comes apart when we take
out the sysctl_sched_migration_cost test in idle_balance()?

> We need data for this type of system, showing improvements for normal
> workloads, and showing little downside for a high context switch rate
> torture test.

So core-wide atomics should, on architectures that can do atomics in L1,
be relatively fast. Once you leave L1, atomic contention goes up a fair
bit. And then there's architectures that simply don't do atomics in L1
(like Power).

Testing on my SKL desktop, atomics contending between SMT siblings is
basically free (weirdly enough my test says its cheaper), atomics
contending on the L3 is 10x as expensive, and this is with only 4 cores.

If I test it on my 10 core IVB, I'm up to 20x, and I can't imagine that
getting any better with bigger core count (my IVB does not show SMT
contention as lower, but not particularly more expensive either).
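
(A rough userspace analogue of that measurement, in case someone wants to
reproduce the shape of it: two threads hammering one atomic counter, pinned
either to two SMT siblings or to two cores that share only the L3, and timed
externally with time(1). The CPU numbers below are placeholders for the local
topology, not any particular machine mentioned above.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

/* Shared counter both threads contend on. */
static atomic_long counter;
#define ITERS 100000000L

static void *hammer(void *arg)
{
        long cpu = (long)arg;
        cpu_set_t set;

        /* Pin this thread to the requested CPU. */
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (long i = 0; i < ITERS; i++)
                atomic_fetch_add(&counter, 1);
        return NULL;
}

int main(void)
{
        /* Placeholder CPUs: pick SMT siblings or L3-only neighbours here. */
        long cpus[2] = { 0, 4 };
        pthread_t t[2];

        for (int i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, hammer, (void *)cpus[i]);
        for (int i = 0; i < 2; i++)
                pthread_join(t[i], NULL);

        printf("final count: %ld\n", atomic_load(&counter));
        return 0;
}

Build with something like gcc -O2 -pthread and compare the wall-clock time of
the SMT-sibling pinning against the cross-core one.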


So while I see the point of tracking these numbers (for SMT>2), I don't
think its worth doing outside of the core, and then we still need some
powerpc (or any other architecture with abysmal atomics) tested.

So what we can do is make this counting crud conditional on SMT>2 and
possibly part of the topology flags such that an architecture can
opt-out.

Then select_idle_core can be augmented to remember the least-loaded core
it encounters in its traversal, and go with that.
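
(A minimal sketch of that last suggestion, assuming a per-core busy-sibling
counter like the sg->utilization field in the posted patch and a cached
per-cpu pointer to the core-level sched_group; neither exists in mainline
under these names, so this is illustrative only, not the actual
implementation.)

static int select_least_loaded_core(struct task_struct *p,
				    struct sched_domain *sd, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int cpu, best_cpu = -1;
	int best_spare = -1;

	cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);

	for_each_cpu_wrap(cpu, cpus, target) {
		/* Assumed cached pointer to this cpu's core-level group. */
		struct sched_group *core = cpu_rq(cpu)->core_group;
		int spare = core->group_weight - core->utilization;

		if (spare == core->group_weight)
			return cpu;		/* fully idle core, stop */

		if (spare > best_spare) {
			best_spare = spare;
			best_cpu = cpu;
		}
		/* Skip the remaining siblings of this core. */
		cpumask_andnot(cpus, cpus, sched_group_span(core));
	}

	/* No idle core found: fall back to the least loaded one seen. */
	return best_cpu;
}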


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-05 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 09:37:02AM -0800, Subhra Mazumdar wrote:
> In the scheme of SMT balance, if the idle cpu search is done _not_ in the
> last run core, then we need a random cpu to start from. If the idle cpu
> search is done in the last run core we can start the search from last run
> cpu. Since we need the random index for the first case I just did it for
> both.

That shouldn't be too hard to fix. I think we can simply transpose the
CPU number. That is, something like:

  cpu' = core'_id + (cpu - core_id)

should work for most sane cases. We don't give any guarantees this will
in fact work, but (almost) all actual CPU enumeration schemes I've seen
this should work for.

And if it doesn't work, we're not worse off than we are now.

I just couldn't readily find a place where we need to do this for cores
with the current code. But I think we have one place between LLCs where
it can be done:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b6535987500..eb8b8d0a026c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6109,7 +6109,7 @@ static int select_idle_smt(struct task_struct *p, struct 
sched_domain *sd, int t
if (!static_branch_likely(&sched_smt_present))
return -1;
 
-   for_each_cpu(cpu, cpu_smt_mask(target)) {
+   for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
if (idle_cpu(cpu))
@@ -6357,8 +6357,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
if (cpu == prev_cpu)
goto pick_cpu;
 
-   if (wake_affine(affine_sd, p, prev_cpu, sync))
-   new_cpu = cpu;
+   if (wake_affine(affine_sd, p, prev_cpu, sync)) {
+   /*
+* Transpose prev_cpu's offset into this cpu's
+* LLC domain to retain the 'random' search offset
+* for for_each_cpu_wrap().
+*/
+   new_cpu = per_cpu(sd_llc_id, cpu) +
+ (prev_cpu - per_cpu(sd_llc_id, prev_cpu));
+   if (unlikely(!cpus_share_cache(new_cpu, cpu)))
+   new_cpu = cpu;
+   }
}
 
if (sd && !(sd_flag & SD_BALANCE_FORK)) {


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Mike Galbraith
On Fri, 2018-02-02 at 13:34 -0500, Steven Sistare wrote:
> On 2/2/2018 12:39 PM, Steven Sistare wrote:
> > On 2/2/2018 12:21 PM, Peter Zijlstra wrote:
> >> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> >>> It might be interesting to add a tunable for the number of random choices 
> >>> to
> >>> make, and clamp it at the max nr computed from avg_cost in 
> >>> select_idle_cpu.
> >>
> >> This needs a fairly complicated PRNG for it would need to visit each
> >> possible CPU once before looping. A LFSR does that, but requires 2^n-1
> >> elements and we have topology masks that don't match that.. The trivial
> >> example is something with 6 cores.
> > 
> > Or keep it simple and accept the possibility of choosing the same candidate
> > more than once.
> > 
> >>> Or, choose a random starting point and then search for nr sequential 
> >>> candidates; possibly limited by a tunable.
> >>
> >> And this is basically what we already do. Except with the task-cpu
> >> instead of a per-cpu rotor.
> > 
> > Righto.  Disregard this suggestion.
> 
> Actually, I take back my take back.  I suspect the primary benefit
> of random selection is that it breaks up resonance states where
> CPUs that are busy tend to stay busy, and CPUs that are idle tend
> to stay idle, which is reinforced by starting the search at target =
> last cpu ran.

I suspect the primary benefit is reduction of bouncing.  The absolutely
maddening thing about SIS is that some stuff out there (like FB's load)
doesn't give a rats ass about anything other than absolute minimum
sched latency while other stuff notices cache going missing.  Joy.

-Mike


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/2/2018 3:04 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 01:34:58PM -0500, Steven Sistare wrote:
>> Actually, I take back my take back.  I suspect the primary benefit
>> of random selection is that it breaks up resonance states where
>> CPUs that are busy tend to stay busy, and CPUs that are idle tend
>> to stay idle, which is reinforced by starting the search at target =
>> last cpu ran.
> 
> Which, according to these here patches:
> 
>   https://lkml.kernel.org/r/20180130104555.4125-1-mgor...@techsingularity.net
> 
> is a good thing, because of power management.

Yes, but it's a bad thing if ready to run tasks pile on busy CPUs and idle 
CPUs go unused.  Stating the obvious, when the search for idle fails, the 
thread goes on a busy CPU. The existing logic that checks and uses the
initial target if it is idle reduces unnecessary spreading and is power
friendly (indeed, added in the patch you reference).  Subhra's patch does
not change that.

- Steve


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>> On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
>>> On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
 +  rcu_read_lock();
 +  sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
 +  if (util) {
 +  for_each_lower_domain(sd) {
 +  if (sd->level == 0)
 +  break;
>>>
>>> afaict you really only need this for the core, and here you're assuming
>>> everything below the LLC is cores. Would it not be much clearer if you
>>> introduce sd_core.
>>>
>>> As is, for_each_lower_domain includes the starting domain, sd->group
>>> then is the first core group for this cpu. But then you continue to the
>>> smt domain (on Intel, on other architectures there could be a cluster
>>> domain in between) and then you bail using that sd->level == 0 hack
>>> because otherwise things would go *bang*.
>>
>> Hi Peter,
>>
>> The code here and in smt_balance intentionally visits each level between
>> the llc and smt, including core-cluster on architectures that define it.
>> smt_balance thus has the chance to randomly pick a better cluster,
>> and then within that cluster randomly pick a better core.  It makes sense,
>> as resources are shared within a cluster, and choosing a less loaded cluster
>> should give better performance.  As you suggest in a few other places,
>> it would be nice to see performance results for this case.  We have
>> SPARC processors with core clusters.
>>
> 
> But then you get that atomic crud to contend on the cluster level, which
> is even worse than it contending on the core level.

True, but it can still be a net win if we make better scheduling decisions.
A saving grace is that the atomic counter is only updated if the cpu
makes a transition from idle to busy or vice versa.

We need data for this type of system, showing improvements for normal
workloads, and showing little downside for a high context switch rate
torture test.  

- Steve


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/2/2018 2:58 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 12:36:47PM -0500, Steven Sistare wrote:
>> On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
>>> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>  {
>> +int i, rand_index, rand_cpu;
>> +int this_cpu = smp_processor_id();
>>  
>> +rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>> +rand_cpu = sg->cp_array[rand_index];
>
> Right, so yuck.. I know why you need that, but that extra array and
> dereference is the reason I never went there.
>
> How much difference does it really make vs the 'normal' wrapping search
> from last CPU ?
>
> This really should be a separate patch with separate performance numbers
> on.

 For the benefit of other readers, if we always search and choose starting from
 the first CPU in a core, then later searches will often need to traverse the first
 N busy CPU's to find the first idle CPU.  Choosing a random starting point avoids
 such bias.  It is probably a win for processors with 4 to 8 CPUs per core, and
 a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
 to see performance data for this as a separate patch to decide.  We have SPARC
 systems with 8 CPUs per core.
>>>
>>> Which is why the current code already doesn't start from the first cpu
>>> in the mask. We start at whatever CPU the task ran last on, which is
>>> effectively 'random' if the system is busy.
>>>
>>> So how is a per-cpu rotor better than that?
>>
>> The current code is:
>> for_each_cpu(cpu, cpu_smt_mask(target)) {
>>
>> For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
>> 8 different tasks will traverse the mask in the same order.
> 
> Ooh, the SMT loop.. yes that can be improved. But look at the other
> ones, they do:
> 
>   for_each_cpu_wrap(cpu, sched_domain_span(), target)
> 
> so we look for an idle cpu in the LLC domain, and start iteration at
> @target, which will (on average) be different for different CPUs, and
> thus hopefully find different idle CPUs.
> 
> You could simply change the SMT loop to something like:
> 
>   for_each_cpu_wrap(cpu, cpu_smt_mask(target), target)
> 
> and see what that does.

Good idea - Steve



Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 01:34:58PM -0500, Steven Sistare wrote:
> Actually, I take back my take back.  I suspect the primary benefit
> of random selection is that it breaks up resonance states where
> CPUs that are busy tend to stay busy, and CPUs that are idle tend
> to stay idle, which is reinforced by starting the search at target =
> last cpu ran.

Which, according to these here patches:

  https://lkml.kernel.org/r/20180130104555.4125-1-mgor...@techsingularity.net

is a good thing, because of power management.



Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
> > On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> >> +  rcu_read_lock();
> >> +  sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> >> +  if (util) {
> >> +  for_each_lower_domain(sd) {
> >> +  if (sd->level == 0)
> >> +  break;
> > 
> > afaict you really only need this for the core, and here you're assuming
> > everything below the LLC is cores. Would it not be much clearer if you
> > introduce sd_core.
> > 
> > As is, for_each_lower_domain includes the starting domain, sd->group
> > then is the first core group for this cpu. But then you continue to the
> > smt domain (on Intel, on other architectures there could be a cluster
> > domain in between) and then you bail using that sd->level == 0 hack
> > because otherwise things would go *bang*.
> 
> Hi Peter,
> 
> The code here and in smt_balance intentionally visits each level between
> the llc and smt, including core-cluster on architectures that define it.
> smt_balance thus has the chance to randomly pick a better cluster,
> and then within that cluster randomly pick a better core.  It makes sense,
> as resources are shared within a cluster, and choosing a less loaded cluster
> should give better performance.  As you suggest in a few other places,
> it would be nice to see performance results for this case.  We have
> SPARC processors with core clusters.
> 

But then you get that atomic crud to contend on the cluster level, which
is even worse than it contending on the core level.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 12:36:47PM -0500, Steven Sistare wrote:
> On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
> > On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>  +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>   {
>  +int i, rand_index, rand_cpu;
>  +int this_cpu = smp_processor_id();
>   
>  +rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>  +rand_cpu = sg->cp_array[rand_index];
> >>>
> >>> Right, so yuck.. I know why you need that, but that extra array and
> >>> dereference is the reason I never went there.
> >>>
> >>> How much difference does it really make vs the 'normal' wrapping search
> >>> from last CPU ?
> >>>
> >>> This really should be a separate patch with separate performance numbers
> >>> on.
> >>
> >> For the benefit of other readers, if we always search and choose starting from
> >> the first CPU in a core, then later searches will often need to traverse the first
> >> N busy CPU's to find the first idle CPU.  Choosing a random starting point avoids
> >> such bias.  It is probably a win for processors with 4 to 8 CPUs per core, and
> >> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
> >> to see performance data for this as a separate patch to decide.  We have SPARC
> >> systems with 8 CPUs per core.
> > 
> > Which is why the current code already doesn't start from the first cpu
> > in the mask. We start at whatever CPU the task ran last on, which is
> > effectively 'random' if the system is busy.
> > 
> > So how is a per-cpu rotor better than that?
> 
> The current code is:
> for_each_cpu(cpu, cpu_smt_mask(target)) {
> 
> For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
> 8 different tasks will traverse the mask in the same order.

Ooh, the SMT loop.. yes that can be improved. But look at the other
ones, they do:

  for_each_cpu_wrap(cpu, sched_domain_span(), target)

so we look for an idle cpu in the LLC domain, and start iteration at
@target, which will (on average) be different for different CPUs, and
thus hopefully find different idle CPUs.

You could simply change the SMT loop to something like:

  for_each_cpu_wrap(cpu, cpu_smt_mask(target), target)

and see what that does.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/2/2018 12:39 PM, Steven Sistare wrote:
> On 2/2/2018 12:21 PM, Peter Zijlstra wrote:
>> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>> It might be interesting to add a tunable for the number of random choices to
>>> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
>>
>> This needs a fairly complicated PRNG for it would need to visit each
>> possible CPU once before looping. A LFSR does that, but requires 2^n-1
>> elements and we have topology masks that don't match that.. The trivial
>> example is something with 6 cores.
> 
> Or keep it simple and accept the possibility of choosing the same candidate
> more than once.
> 
>>> Or, choose a random starting point and then search for nr sequential 
>>> candidates; possibly limited by a tunable.
>>
>> And this is basically what we already do. Except with the task-cpu
>> instead of a per-cpu rotor.
> 
> Righto.  Disregard this suggestion.

Actually, I take back my take back.  I suspect the primary benefit
of random selection is that it breaks up resonance states where
CPUs that are busy tend to stay busy, and CPUs that are idle tend
to stay idle, which is reinforced by starting the search at target =
last cpu ran.

Or, a quantitative argument: if sampling a single random CPU
gives better results (and the data says it does), then sampling
a random cpu and searching nr from it should give better results,
since it has nr - 1 more chances to find an idle CPU.

- Steve
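
(As a rough illustration of that last argument: if each candidate CPU were
independently idle with probability p, a single random probe finds an idle CPU
with probability p, while scanning nr candidates succeeds with probability
1 - (1 - p)^nr. For p = 0.2 that is 20% for one probe versus 1 - 0.8^4 ≈ 59%
for nr = 4. The independence assumption is optimistic on a real machine, but
it shows why the extra nr - 1 chances help.)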


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
 +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
  {
 +  int i, rand_index, rand_cpu;
 +  int this_cpu = smp_processor_id();
  
 +  rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
 +  rand_cpu = sg->cp_array[rand_index];
>>>
>>> Right, so yuck.. I know why you need that, but that extra array and
>>> dereference is the reason I never went there.
>>>
>>> How much difference does it really make vs the 'normal' wrapping search
>>> from last CPU ?
>>>
>>> This really should be a separate patch with separate performance numbers
>>> on.
>>
>> For the benefit of other readers, if we always search and choose starting from
>> the first CPU in a core, then later searches will often need to traverse the first
>> N busy CPU's to find the first idle CPU.  Choosing a random starting point avoids
>> such bias.  It is probably a win for processors with 4 to 8 CPUs per core, and
>> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
>> to see performance data for this as a separate patch to decide.  We have SPARC
>> systems with 8 CPUs per core.
> 
> Which is why the current code already doesn't start from the first cpu
> in the mask. We start at whatever CPU the task ran last on, which is
> effectively 'random' if the system is busy.
> 
> So how is a per-cpu rotor better than that?

The current code is:
for_each_cpu(cpu, cpu_smt_mask(target)) {

For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
8 different tasks will traverse the mask in the same order.

- Steve


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/2/2018 12:21 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>> It might be interesting to add a tunable for the number of random choices to
>> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.
> 
> This needs a fairly complicated PRNG for it would need to visit each
> possible CPU once before looping. A LFSR does that, but requires 2^n-1
> elements and we have topology masks that don't match that.. The trivial
> example is something with 6 cores.

Or keep it simple and accept the possibility of choosing the same candidate
more than once.

>> Or, choose a random starting point and then search for nr sequential 
>> candidates; possibly limited by a tunable.
> 
> And this is basically what we already do. Except with the task-cpu
> instead of a per-cpu rotor.

Righto.  Disregard this suggestion.

- Steve


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Subhra Mazumdar



On 2/2/18 9:17 AM, Peter Zijlstra wrote:

On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:

+static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
  {
+   int i, rand_index, rand_cpu;
+   int this_cpu = smp_processor_id();
  
+   rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
+   rand_cpu = sg->cp_array[rand_index];

Right, so yuck.. I know why you need that, but that extra array and
dereference is the reason I never went there.

How much difference does it really make vs the 'normal' wrapping search
from last CPU ?

This really should be a separate patch with separate performance numbers
on.

For the benefit of other readers, if we always search and choose starting from
the first CPU in a core, then later searches will often need to traverse the first
N busy CPU's to find the first idle CPU.  Choosing a random starting point avoids
such bias.  It is probably a win for processors with 4 to 8 CPUs per core, and
a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
to see performance data for this as a separate patch to decide.  We have SPARC
systems with 8 CPUs per core.

Which is why the current code already doesn't start from the first cpu
in the mask. We start at whatever CPU the task ran last on, which is
effectively 'random' if the system is busy.

So how is a per-cpu rotor better than that?

In the scheme of SMT balance, if the idle cpu search is done _not_ in the
last run core, then we need a random cpu to start from. If the idle cpu
search is done in the last run core we can start the search from last run
cpu. Since we need the random index for the first case I just did it for
both.


Thanks,
Subhra



Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> >> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> >>  {
> >> +  int i, rand_index, rand_cpu;
> >> +  int this_cpu = smp_processor_id();
> >>  
> >> +  rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> >> +  rand_cpu = sg->cp_array[rand_index];
> > 
> > Right, so yuck.. I know why you need that, but that extra array and
> > dereference is the reason I never went there.
> > 
> > How much difference does it really make vs the 'normal' wrapping search
> > from last CPU ?
> > 
> > This really should be a separate patch with separate performance numbers
> > on.
> 
> For the benefit of other readers, if we always search and choose starting from
> the first CPU in a core, then later searches will often need to traverse the first
> N busy CPU's to find the first idle CPU.  Choosing a random starting point avoids
> such bias.  It is probably a win for processors with 4 to 8 CPUs per core, and
> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
> to see performance data for this as a separate patch to decide.  We have SPARC
> systems with 8 CPUs per core.

Which is why the current code already doesn't start from the first cpu
in the mask. We start at whatever CPU the task ran last on, which is
effectively 'random' if the system is busy.

So how is a per-cpu rotor better than that?


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Peter Zijlstra
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> It might be interesting to add a tunable for the number of random choices to
> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu.

This needs a fairly complicated PRNG for it would need to visit each
possible CPU once before looping. A LFSR does that, but requires 2^n-1
elements and we have topology masks that don't match that.. The trivial
example is something with 6 cores.
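
(Background on the LFSR remark: a maximal-length n-bit LFSR cycles through
every value in 1..2^n-1 exactly once per period, which is why it only maps
cleanly onto topologies whose size is one less than a power of two. A minimal
4-bit Galois LFSR, illustrative only:)

	/*
	 * 4-bit maximal-length Galois LFSR; toggle mask 0x9 corresponds to
	 * the primitive polynomial x^4 + x + 1, so successive calls visit
	 * all 15 non-zero states before repeating.  Fine for 15 CPUs, no
	 * good for, say, 6 cores -- which is the point above.
	 */
	static unsigned int lfsr4_next(unsigned int state)
	{
		unsigned int lsb = state & 1;

		state >>= 1;
		if (lsb)
			state ^= 0x9;
		return state;
	}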

> Or, choose a random starting point and then search for nr sequential 
> candidates; possibly limited by a tunable.

And this is basically what we already do. Except with the task-cpu
instead of a per-cpu rotor.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-02 Thread Steven Sistare
On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
>> +#ifdef CONFIG_SCHED_SMT
>> +
>> +/*
>> + * From sd_llc downward update the SMT utilization.
> 
> Please don't use utilization for this, we already use that word for
> something else.
> 
>> + * Skip the lowest level 0.
>> + */
>> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
>> +{
>> +struct sched_domain *sd;
>> +struct sched_group *sg_cpu;
>> +int this_cpu = rq->cpu;
>> +int util;
>> +
>> +if (rq->curr_util == UTIL_UNINITIALIZED)
>> +prev_busy = 0;
>> +
>> +util = next_busy - prev_busy;
>> +rcu_read_lock();
>> +sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
>> +if (util) {
>> +for_each_lower_domain(sd) {
>> +if (sd->level == 0)
>> +break;
> 
> afaict you really only need this for the core, and here you're assuming
> everything below the LLC is cores. Would it not be much clearer if you
> introduce sd_core.
> 
> As is, for_each_lower_domain includes the starting domain, sd->group
> then is the first core group for this cpu. But then you continue to the
> smt domain (on Intel, on other architectures there could be a cluster
> domain in between) and then you bail using that sd->level == 0 hack
> because otherwise things would go *bang*.

Hi Peter,

The code here and in smt_balance intentionally visits each level between
the llc and smt, including core-cluster on architectures that define it.
smt_balance thus has the chance to randomly pick a better cluster,
and then within that cluster randomly pick a better core.  It makes sense,
as resources are shared within a cluster, and choosing a less loaded cluster
should give better performance.  As you suggest in a few other places,
it would be nice to see performance results for this case.  We have
SPARC processors with core clusters.

> Really rather messy this.
> 
> I think you want to go allocate sched_domain_shared for the MC level and
> use that, much like sd_llc_shared.
> 
>> +sg_cpu = sd->groups;
>> +
>> +/* atomically add/subtract the util */
>> +if (util > 0)
>> +atomic_inc((atomic_t *)&sg_cpu->utilization);
>> +else
>> +atomic_dec((atomic_t *)&sg_cpu->utilization);
> 
> *sigh*, wth do you need that cast? You didn't get the memo that spurious
> casts are where bugs happen and are terrible style at the best of times?
> 
>> +}
>> +}
>> +
>> +if (sd)
>> +rq->curr_util = next_busy;
>> +rcu_read_unlock();
>> +}
> 
>> +smt_util(rq, 1, 0);
> 
>> +smt_util(rq, 0, 1);
> 
> That all seems like an overly complex way to write inc/dec. You turned
> what should be a boolean state space (2^1) into something like 2^64.
> 
> Also, I think if you count idle instead of busy, you'll do away with the
> need for that whole uninitialized thing.
> 
> 
>> +#define CPU_PSEUDO_RANDOM(cpu)  (cpu_rq(cpu)->rotor++)
> 
> That's a bit of a stretch calling that pseudo random..
> 
>> +/*
>> + * Find an idle cpu in the core starting search from random index.
>> + */
>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>  {
>> +int i, rand_index, rand_cpu;
>> +int this_cpu = smp_processor_id();
>>  
>> +rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>> +rand_cpu = sg->cp_array[rand_index];
> 
> Right, so yuck.. I know why you need that, but that extra array and
> dereference is the reason I never went there.
> 
> How much difference does it really make vs the 'normal' wrapping search
> from last CPU ?
> 
> This really should be a separate patch with separate performance numbers
> on.

For the benefit of other readers, if we always search and choose starting from
the first CPU in a core, then later searches will often need to traverse the first
N busy CPU's to find the first idle CPU.  Choosing a random starting point avoids
such bias.  It is probably a win for processors with 4 to 8 CPUs per core, and
a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
to see performance data for this as a separate patch to decide.  We have SPARC
systems with 8 CPUs per core.

>> +for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
>> +if (!cpumask_test_cpu(i, &p->cpus_allowed))
>> +continue;
>> +if (idle_cpu(i))
>> +return i;
>> +}
>>  
>> +return -1;
>>  }
>>  
>>  /*
>> + * Compare the SMT utilization of the two groups and select the one which
>> + * has more capacity left.
>>   */
>> +static int smt_should_migrate(struct sched_group *here,
>> +  struct sched_group *there, int self_util)
>>  {
>> +int this_cpu = smp_processor_id();
>> +int here_util = here->utilization, there_util = there->utilizat

Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-01 Thread Peter Zijlstra
On Thu, Feb 01, 2018 at 01:33:35PM +0100, Peter Zijlstra wrote:
> I think you want to go allocate sched_domain_shared for the MC level and
> use that, much like sd_llc_shared.

Also, you'd want to try and get performance numbers for something like
Power8, which has SMT8 and fairly expensive atomic ops.


Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance

2018-02-01 Thread Peter Zijlstra
On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> +#ifdef CONFIG_SCHED_SMT
> +
> +/*
> + * From sd_llc downward update the SMT utilization.

Please don't use utilization for this, we already use that word for
something else.

> + * Skip the lowest level 0.
> + */
> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg_cpu;
> + int this_cpu = rq->cpu;
> + int util;
> +
> + if (rq->curr_util == UTIL_UNINITIALIZED)
> + prev_busy = 0;
> +
> + util = next_busy - prev_busy;
> + rcu_read_lock();
> + sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> + if (util) {
> + for_each_lower_domain(sd) {
> + if (sd->level == 0)
> + break;

afaict you really only need this for the core, and here you're assuming
everything below the LLC is cores. Would it not be much clearer if you
introduce sd_core.

As is, for_each_lower_domain includes the starting domain, sd->group
then is the first core group for this cpu. But then you continue to the
smt domain (on Intel, on other architectures there could be a cluster
domain in between) and then you bail using that sd->level == 0 hack
because otherwise things would go *bang*.

Really rather messy this.

I think you want to go allocate sched_domain_shared for the MC level and
use that, much like sd_llc_shared.

> + sg_cpu = sd->groups;
> +
> + /* atomically add/subtract the util */
> + if (util > 0)
> + atomic_inc((atomic_t *)&sg_cpu->utilization);
> + else
> + atomic_dec((atomic_t *)&sg_cpu->utilization);

*sigh*, wth do you need that cast? You didn't get the memo that spurious
casts are where bugs happen and are terrible style at the best of times?

> + }
> + }
> +
> + if (sd)
> + rq->curr_util = next_busy;
> + rcu_read_unlock();
> +}

> + smt_util(rq, 1, 0);

> + smt_util(rq, 0, 1);

That all seems like an overly complex way to write inc/dec. You turned
what should be a boolean state space (2^1) into something like 2^64.

Also, I think if you count idle instead of busy, you'll do away with the
need for that whole uninitialized thing.
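
(A minimal sketch of the simplification being suggested, assuming a per-core
counter as in the patch; the field and helper names here are made up for
illustration, not from the posted code:)

	/*
	 * Count idle siblings with explicit inc/dec helpers instead of
	 * passing prev/next busy flags through one function.  If the
	 * counter starts at the group weight when the topology is built
	 * (all siblings idle), no UTIL_UNINITIALIZED sentinel is needed.
	 */
	static inline void core_idle_enter(struct sched_group *core_sg)
	{
		atomic_inc(&core_sg->idle_siblings);	/* assumed atomic_t field */
	}

	static inline void core_idle_exit(struct sched_group *core_sg)
	{
		atomic_dec(&core_sg->idle_siblings);
	}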


> +#define  CPU_PSEUDO_RANDOM(cpu)  (cpu_rq(cpu)->rotor++)

That's a bit of a stretch calling that pseudo random..

> +/*
> + * Find an idle cpu in the core starting search from random index.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>  {
> + int i, rand_index, rand_cpu;
> + int this_cpu = smp_processor_id();
>  
> + rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> + rand_cpu = sg->cp_array[rand_index];

Right, so yuck.. I know why you need that, but that extra array and
dereference is the reason I never went there.

How much difference does it really make vs the 'normal' wrapping search
from last CPU ?

This really should be a separate patch with separate performance numbers
on.

> + for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
> + if (!cpumask_test_cpu(i, &p->cpus_allowed))
> + continue;
> + if (idle_cpu(i))
> + return i;
> + }
>  
> + return -1;
>  }
>  
>  /*
> + * Compare the SMT utilization of the two groups and select the one which
> + * has more capacity left.
>   */
> +static int smt_should_migrate(struct sched_group *here,
> +   struct sched_group *there, int self_util)
>  {
> + int this_cpu = smp_processor_id();
> + int here_util = here->utilization, there_util = there->utilization;
>  
> + /* Discount self utilization if it belongs to here or there */
> + if (self_util > 0) {
> + if (cpumask_test_cpu(this_cpu, sched_group_span(here)))
> + here_util -= self_util;
> + else if (cpumask_test_cpu(this_cpu, sched_group_span(there)))
> + there_util -= self_util;
>   }
>  
> + /* Return true if other group has more capacity left */
> + return (there->group_weight - there_util >
> + here->group_weight - here_util);
>  }

OK, so this effectively picks the least-busiest/idlest SMT sibling of
two.

>  /*
> + * SMT balancing works by comparing the target cpu group with a random group
> + * in llc domain. It calls smt_should_migrate to decide which group has more
> + * capacity left. The balancing starts top down from sd_llc to SMT core level.
> + * Finally idle cpu search is only done in the core.
>   */
> +static int smt_balance(struct task_struct *p, struct sched_domain *sd, int cpu)
>  {
> + struct sched_group *sg, *src_sg, *start_sg, *tgt_sg;
> + struct cpumask *span;
> + int rand_idx, weight;
> + int cpu_orig = cpu;
> + int rand_cpu;
> + int this_cpu = smp_processor_id();