Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 02/07/2018 12:42 AM, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote:
> > I meant the SMT balance patch. That does comparison with only one other
> > random core and takes the decision in O(1). Any potential scan of all cores
> > or cpus is O(n) and doesn't scale and will only get worse in future. That
> > applies to both select_idle_core() and select_idle_cpu().
>
> We only do the full scan if we think to know there is indeed an idle core
> to be had, and if there are idle cores the machine isn't terribly busy.
>
> If there are no idle cores, we do not in fact scan everything. We limit
> the amount of scanning based on the average idle time with a minimum of 4.

This logic may not be working as well as you think. I had sent the cost of
select_idle_sibling() with and without my patch and there was a huge
difference. Following is the cost (in us) of select_idle_sibling() with
hackbench 16 groups:

function               baseline-rc6  %stdev  patch            %stdev
select_idle_sibling()  0.556         1.72    0.263 (-52.70%)  0.78

> (Except when you switch off things like SIS_PROP, then you scan
> unconditionally and reduce tail latency at the cost of throughput --
> like said, some people want this).
>
> O(1) sounds nice, but it has horrible worst case latencies.
>
> And like I said, I don't see how a rotor is particularly more random than
> the previous cpu the task you happen to be waking ran on.

The rotor randomness is not the main point here. I think the benchmark
improvements come from the fact that the cost of select_idle_sibling() has
dropped a lot. Reducing that cost while still maintaining a good spread of
threads is what the SMT balance scheme achieves; it requires a fast, decent
random number generator, and the rotor is just an easy way to get one.

> > Is there any reason this randomized approach is not acceptable even if
> > benchmarks show improvement? Are there other benchmarks I should try?
>
> Looking at just one other core has a fairly high chance of not finding
> idle. I really cannot remember all the benchmarks, but Mike did something
> a little less random but still O(1) a few years back and that crashed and
> burned.
>
> > Also your suggestion to keep the SMT utilization but still do a traversal
> > of cores in select_idle_core() while remembering the least loaded core
> > will still have the problem of potentially traversing all cores. I can
> > compare this with a core level only SMT balancing, is that useful to
> > decide? I will also test on SPARC machines with higher degree of SMT.
>
> Please learn to use your email client, that's near unreadable for
> whitespace damage, time is really too precious to try and untangle crap
> like that.

Sorry about that.

> > You had also mentioned to do it for only SMT >2, not sure I understand
> > why, as even for SMT=2 (Intel) benchmarks show improvement. This clearly
> > shows the scalability problem.
>
> For SMT2 you don't need the occupation counter with atomic crud, a simple
> !atomic core-idle works just fine. And your 'patch' had soo many moving
> parts it was too hard to say which aspect changed what.
>
> hackbench really isn't _that_ interesting a benchmark and its fairly easy
> to make it go faster, but doing so typically wrecks something else.

I also had uperf and Oracle DB TPC-C numbers in the patch which showed
improvements.

> There's a whole array of netperf benchmarks which you have to run at
> different utilization levels, there's the sysbench oltp stuff, which you
> can run on MariaDB and PostgreSQL, again at different utilization levels
> (both databases behave quite differently).
>
> There's ebizzy, tbench, dbench and others. There's also NUMA stuff, like
> NAS benchmarks and specJBB nonsense. There's the facebook schbench, which
> is a bit finnicky to set up.
>
> And I'm sure I'm forgetting half, look at the patches and regression
> reports for more clues, that's what Google is for. You could have figured
> all that out yourself, much of these are listed in the changelogs and if
> you'd bothered to find lkml regression reports you could find more.

Ok, will find out and run them.

Thanks,
Subhra
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Tue, Feb 06, 2018 at 04:30:03PM -0800, Subhra Mazumdar wrote: > I meant the SMT balance patch. That does comparison with only one other > random core and takes the decision in O(1). Any potential scan of all cores > or cpus is O(n) and doesn't scale and will only get worse in future. That > applies to both select_idle_core() and select_idle_cpu(). We only do the full scan if we think to know there is indeed an idle core to be had, and if there are idle cores the machine isn't terribly busy. If there are no idle cores, we do not in fact scan everything. We limit the amount of scanning based on the average idle time with a minimum of 4. (Except when you switch off things like SIS_PROP, then you scan unconditionally and reduce tail latency at the cost of throughput -- like said, some people want this). O(1) sounds nice, but it has horrible worst case latencies. And like I said, I don't see how a rotor is particularly more random than the previous cpu the task you happen to be waking ran on. > Is there any reason this randomized approach is not acceptable even if > benchmarks show improvement? Are there other benchmarks I should try? Looking at just one other core has a fairly high chance of not finding idle. I really cannot remember all the benchmarks, but Mike did something a little less random but still O(1) a few years back and that crashed and burned. > Also your suggestion to keep the SMT utilization but still do a traversal of > cores > in select_idle_core() while remembering the least loaded core will still > have > the problem of potentially traversing all cores. I can compare this with a > core > level only SMT balancing, is that useful to decide? I will also test on > SPARC > machines with higher degree of SMT. Please learn to use your email client, that's near unreadable for whitespace damage, time is really too precious to try and untangle crap like that. 
> You had also mentioned to do it for only SMT >2, not sure I understand why > as even for SMT=2 (intel) benchmarks show improvement. This clearly shows > the scalability problem. For SMT2 you don't need the occupation counter with atomic crud, a simple !atomic core-idle works just fine. And your 'patch' had soo many moving parts it was too hard to say which aspect changed what. hackbench really isn't _that_ interesting a benchmark and its fairly easy to make it go faster, but doing so typically wrecks something else. There's a whole array of netperf benchmarks which you have to run at different utilization levels, there's the sysbench oltp stuff, which you can run on MariaDB and PostgreSQL, again at different utilization levels (both databases behave quite differently). There's ebizzy, tbench, dbench and others. There's also NUMA stuff, like NAS benchmarks and specJBB nonsense. There's the facebook schbench, which is a bit finnicky to set up. And I'm sure I'm forgetting half, look at the patches and regression reports for more clues, that's what Google is for. You could have figured all that out yourself, much of these are listed in the changelogs and if you'd bothered to find lkml regression reports you could find more.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 02/06/2018 01:12 AM, Peter Zijlstra wrote:
> On Mon, Feb 05, 2018 at 02:09:11PM -0800, Subhra Mazumdar wrote:
> > The pseudo random is also used for choosing a random core to compare with,
> > how will transposing achieve that?
>
> Not entirely sure what your point is. Current code doesn't compare to
> just _one_ other core, and I don't think we'd ever want to do that.
>
> So currently select_idle_core() will, if there is an idle core, iterate
> the whole thing trying to find it. If it fails, it clears the
> 'have_idle_core' state.
>
> select_idle_cpu, which we'll fall back to, will limit the scanning based
> on the average idle time.
>
> The crucial point however, is that concurrent wakeups will not, on
> average, do the same iteration because of the target offset.

I meant the SMT balance patch. That does comparison with only one other
random core and takes the decision in O(1). Any potential scan of all cores
or cpus is O(n), doesn't scale, and will only get worse in future. That
applies to both select_idle_core() and select_idle_cpu().

Is there any reason this randomized approach is not acceptable even if
benchmarks show improvement? Are there other benchmarks I should try?

Also your suggestion to keep the SMT utilization but still do a traversal
of cores in select_idle_core() while remembering the least loaded core will
still have the problem of potentially traversing all cores. I can compare
this with a core level only SMT balancing, is that useful to decide? I will
also test on SPARC machines with higher degree of SMT.

You had also mentioned to do it for only SMT >2; not sure I understand why,
as even for SMT=2 (Intel) benchmarks show improvement. This clearly shows
the scalability problem.

Thanks,
Subhra
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Mon, Feb 05, 2018 at 02:09:11PM -0800, Subhra Mazumdar wrote: > The pseudo random is also used for choosing a random core to compare with, > how will transposing achieve that? Not entirely sure what your point is. Current code doesn't compare to just _one_ other core, and I don't think we'd ever want to do that. So currently select_idle_core() will, if there is an idle core, iterate the whole thing trying to find it. If it fails, it clears the 'have_idle_core' state. select_idle_cpu, which we'll fall back to, will limit the scanning based on the average idle time. The crucial point however, is that concurrent wakeups will not, on average, do the same iteration because of the target offset.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 02/05/2018 09:03 AM, Peter Zijlstra wrote:
> On Mon, Feb 05, 2018 at 01:48:54PM +0100, Peter Zijlstra wrote:
> > So while I see the point of tracking these numbers (for SMT>2), I don't
> > think its worth doing outside of the core, and then we still need some
> > powerpc (or any other architecture with abysmal atomics) tested.
>
> FWIW Power has another 'fun' feature, their cores have asymmetric SMT.
> Their cores have a static power level, based on _which_ SMT sibling is
> running, not how many. A single SMT2 runs (much) slower than a single
> SMT0. So that random selection stuff really doesn't work well for them.
>
> Now 'sadly' x86 can also have ASYM_PACKING set on its SMT domain, so I'm
> going to have to figure out what to do about all that.

Even the existing code doesn't handle that. The SMT balancing compares the
remaining SMT capacity, so it should work OK even with asymmetric cores.

Thanks,
Subhra
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 02/05/2018 04:19 AM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 09:37:02AM -0800, Subhra Mazumdar wrote:
> > In the scheme of SMT balance, if the idle cpu search is done _not_ in the
> > last run core, then we need a random cpu to start from. If the idle cpu
> > search is done in the last run core we can start the search from last run
> > cpu. Since we need the random index for the first case I just did it for
> > both.
>
> That shouldn't be too hard to fix. I think we can simply transpose the
> CPU number. That is, something like:
>
>   cpu' = core'_id + (cpu - core_id)
>
> should work for most sane cases. We don't give any guarantees this will
> in fact work, but (almost) all actual CPU enumeration schemes I've seen
> this should work for. And if it doesn't work, we're not worse off than we
> are now.
>
> I just couldn't readily find a place where we need to do this for cores
> with the current code. But I think we have one place between LLCs where
> it can be done:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7b6535987500..eb8b8d0a026c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6109,7 +6109,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
>  	if (!static_branch_likely(&sched_smt_present))
>  		return -1;
>
> -	for_each_cpu(cpu, cpu_smt_mask(target)) {
> +	for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
>  		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
>  			continue;
>  		if (idle_cpu(cpu))
> @@ -6357,8 +6357,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		if (cpu == prev_cpu)
>  			goto pick_cpu;
>
> -		if (wake_affine(affine_sd, p, prev_cpu, sync))
> -			new_cpu = cpu;
> +		if (wake_affine(affine_sd, p, prev_cpu, sync)) {
> +			/*
> +			 * Transpose prev_cpu's offset into this cpu's
> +			 * LLC domain to retain the 'random' search offset
> +			 * for for_each_cpu_wrap().
> +			 */
> +			new_cpu = per_cpu(sd_llc_id, cpu) +
> +				  (prev_cpu - per_cpu(sd_llc_id, prev_cpu));
> +			if (unlikely(!cpus_share_cache(new_cpu, cpu)))
> +				new_cpu = cpu;
> +		}
>  	}
>
>  	if (sd && !(sd_flag & SD_BALANCE_FORK)) {

The pseudo random is also used for choosing a random core to compare with;
how will transposing achieve that?

Thanks,
Subhra
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Mon, Feb 05, 2018 at 01:48:54PM +0100, Peter Zijlstra wrote: > So while I see the point of tracking these numbers (for SMT>2), I don't > think its worth doing outside of the core, and then we still need some > powerpc (or any other architecture with abysmal atomics) tested. FWIW Power has another 'fun' feature, their cores have asymmetric SMT. Their cores have a static power level, based on _which_ SMT sibling is running, not how many. A single SMT2 runs (much) slower than a single SMT0. So that random selection stuff really doesn't work well for them. Now 'sadly' x86 can also have ASYM_PACKING set on its SMT domain, so I'm going to have to figure out what to do about all that.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Mon, 2018-02-05 at 13:48 +0100, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> > On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> > > But then you get that atomic crud to contend on the cluster level, which
> > > is even worse than it contending on the core level.
> >
> > True, but it can still be a net win if we make better scheduling decisions.
> > A saving grace is that the atomic counter is only updated if the cpu
> > makes a transition from idle to busy or vice versa.
>
> Which can still be a very high rate for some workloads. I always forget
> which, but there are plenty workloads that have very frequent, very
> short idle times. Mike, do you remember what comes apart when we take
> out the sysctl_sched_migration_cost test in idle_balance()?

Used to be anything scheduling cross-core heftily suffered, i.e. pretty
much any localhost communication heavy load. I just tried disabling it in
4.13 though (pre pti cliff), tried tbench, and it made zip squat
difference. I presume that's due to the meanwhile added this_rq->rd->overload
and/or curr_cost checks. I don't recall the original cost details beyond it
having been "a sh*tload".

-Mike
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 04:06:32PM -0500, Steven Sistare wrote:
> On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> > But then you get that atomic crud to contend on the cluster level, which
> > is even worse than it contending on the core level.
>
> True, but it can still be a net win if we make better scheduling decisions.
> A saving grace is that the atomic counter is only updated if the cpu
> makes a transition from idle to busy or vice versa.

Which can still be a very high rate for some workloads. I always forget
which, but there are plenty of workloads that have very frequent, very
short idle times. Mike, do you remember what comes apart when we take out
the sysctl_sched_migration_cost test in idle_balance()?

> We need data for this type of system, showing improvements for normal
> workloads, and showing little downside for a high context switch rate
> torture test.

So core-wide atomics should, on architectures that can do atomics in L1,
be relatively fast. Once you leave L1, atomic contention goes up a fair
bit. And then there are architectures that simply don't do atomics in L1
(like Power).

Testing on my SKL desktop, atomics contending between SMT siblings is
basically free (weirdly enough my test says it's cheaper), atomics
contending on the L3 is 10x as expensive, and this is with only 4 cores.
If I test it on my 10 core IVB, I'm up to 20x, and I can't imagine that
getting any better with bigger core counts (my IVB does not show SMT
contention as lower, but not particularly more expensive either).

So while I see the point of tracking these numbers (for SMT>2), I don't
think it's worth doing outside of the core, and then we still need some
powerpc (or any other architecture with abysmal atomics) tested.

So what we can do is make this counting crud conditional on SMT>2 and
possibly part of the topology flags such that an architecture can opt out.
Then select_idle_core() can be augmented to remember the least-loaded core
it encounters in its traversal, and go with that.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 09:37:02AM -0800, Subhra Mazumdar wrote:
> In the scheme of SMT balance, if the idle cpu search is done _not_ in the
> last run core, then we need a random cpu to start from. If the idle cpu
> search is done in the last run core we can start the search from last run
> cpu. Since we need the random index for the first case I just did it for
> both.

That shouldn't be too hard to fix. I think we can simply transpose the CPU
number. That is, something like:

  cpu' = core'_id + (cpu - core_id)

should work for most sane cases. We don't give any guarantees this will in
fact work, but (almost) all actual CPU enumeration schemes I've seen this
should work for. And if it doesn't work, we're not worse off than we are
now.

I just couldn't readily find a place where we need to do this for cores
with the current code. But I think we have one place between LLCs where it
can be done:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b6535987500..eb8b8d0a026c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6109,7 +6109,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
 	if (!static_branch_likely(&sched_smt_present))
 		return -1;
 
-	for_each_cpu(cpu, cpu_smt_mask(target)) {
+	for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) {
 		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
 			continue;
 		if (idle_cpu(cpu))
@@ -6357,8 +6357,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu == prev_cpu)
 			goto pick_cpu;
 
-		if (wake_affine(affine_sd, p, prev_cpu, sync))
-			new_cpu = cpu;
+		if (wake_affine(affine_sd, p, prev_cpu, sync)) {
+			/*
+			 * Transpose prev_cpu's offset into this cpu's
+			 * LLC domain to retain the 'random' search offset
+			 * for for_each_cpu_wrap().
+			 */
+			new_cpu = per_cpu(sd_llc_id, cpu) +
+				  (prev_cpu - per_cpu(sd_llc_id, prev_cpu));
+			if (unlikely(!cpus_share_cache(new_cpu, cpu)))
+				new_cpu = cpu;
+		}
 	}
 
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, 2018-02-02 at 13:34 -0500, Steven Sistare wrote: > On 2/2/2018 12:39 PM, Steven Sistare wrote: > > On 2/2/2018 12:21 PM, Peter Zijlstra wrote: > >> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote: > >>> It might be interesting to add a tunable for the number of random choices > >>> to > >>> make, and clamp it at the max nr computed from avg_cost in > >>> select_idle_cpu. > >> > >> This needs a fairly complicated PRNG for it would need to visit each > >> possible CPU once before looping. A LFSR does that, but requires 2^n-1 > >> elements and we have topology masks that don't match that.. The trivial > >> example is something with 6 cores. > > > > Or keep it simple and accept the possibility of choosing the same candidate > > more than once. > > > >>> Or, choose a random starting point and then search for nr sequential > >>> candidates; possibly limited by a tunable. > >> > >> And this is basically what we already do. Except with the task-cpu > >> instead of a per-cpu rotor. > > > > Righto. Disregard this suggestion. > > Actually, I take back my take back. I suspect the primary benefit > of random selection is that it breaks up resonance states where > CPUs that are busy tend to stay busy, and CPUs that are idle tend > to stay idle, which is reinforced by starting the search at target = > last cpu ran. I suspect the primary benefit is reduction of bouncing. The absolutely maddening thing about SIS is that some stuff out there (like FB's load) doesn't give a rats ass about anything other than absolute minimum sched latency while other stuff notices cache going missing. Joy. -Mike
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/2018 3:04 PM, Peter Zijlstra wrote: > On Fri, Feb 02, 2018 at 01:34:58PM -0500, Steven Sistare wrote: >> Actually, I take back my take back. I suspect the primary benefit >> of random selection is that it breaks up resonance states where >> CPUs that are busy tend to stay busy, and CPUs that are idle tend >> to stay idle, which is reinforced by starting the search at target = >> last cpu ran. > > Which, according to there here patches: > > https://lkml.kernel.org/r/20180130104555.4125-1-mgor...@techsingularity.net > > is a good thing, because of power management. Yes, but it's a bad thing if ready to run tasks pile on busy CPUs and idle CPUs go unused. Stating the obvious, when the search for idle fails, the thread goes on a busy CPU. The existing logic that checks and uses the initial target if it is idle reduces unnecessary spreading and is power friendly (indeed, added in the patch you reference). Subhra's patch does not change that. - Steve
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/2018 2:59 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> > On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
> > > On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> > > > +	rcu_read_lock();
> > > > +	sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> > > > +	if (util) {
> > > > +		for_each_lower_domain(sd) {
> > > > +			if (sd->level == 0)
> > > > +				break;
> > >
> > > afaict you really only need this for the core, and here you're assuming
> > > everything below the LLC is cores. Would it not be much clearer if you
> > > introduce sd_core.
> > >
> > > As is, for_each_lower_domain includes the starting domain, sd->group
> > > then is the first core group for this cpu. But then you continue to the
> > > smt domain (on Intel, on other architectures there could be a cluster
> > > domain in between) and then you bail using that sd->level == 0 hack
> > > because otherwise things would go *bang*.
> >
> > Hi Peter,
> >
> > The code here and in smt_balance intentionally visits each level between
> > the llc and smt, including core-cluster on architectures that define it.
> > smt_balance thus has the chance to randomly pick a better cluster, and
> > then within that cluster randomly pick a better core. It makes sense, as
> > resources are shared within a cluster, and choosing a less loaded
> > cluster should give better performance. As you suggest in a few other
> > places, it would be nice to see performance results for this case. We
> > have SPARC processors with core clusters.
>
> But then you get that atomic crud to contend on the cluster level, which
> is even worse than it contending on the core level.

True, but it can still be a net win if we make better scheduling decisions.
A saving grace is that the atomic counter is only updated if the cpu makes
a transition from idle to busy or vice versa.

We need data for this type of system, showing improvements for normal
workloads, and showing little downside for a high context switch rate
torture test.

- Steve
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/2018 2:58 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 12:36:47PM -0500, Steven Sistare wrote:
> > On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
> > > On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> > > > > +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> > > > > +{
> > > > > +	int i, rand_index, rand_cpu;
> > > > > +	int this_cpu = smp_processor_id();
> > > > > +
> > > > > +	rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> > > > > +	rand_cpu = sg->cp_array[rand_index];
> > > >
> > > > Right, so yuck.. I know why you need that, but that extra array and
> > > > dereference is the reason I never went there.
> > > >
> > > > How much difference does it really make vs the 'normal' wrapping
> > > > search from last CPU ?
> > > >
> > > > This really should be a separate patch with separate performance
> > > > numbers on.
> > >
> > > For the benefit of other readers, if we always search and choose
> > > starting from the first CPU in a core, then later searches will often
> > > need to traverse the first N busy CPUs to find the first idle CPU.
> > > Choosing a random starting point avoids such bias. It is probably a
> > > win for processors with 4 to 8 CPUs per core, and a slight but
> > > hopefully negligible loss for 2 CPUs per core, and I agree we need to
> > > see performance data for this as a separate patch to decide. We have
> > > SPARC systems with 8 CPUs per core.
> >
> > Which is why the current code already doesn't start from the first cpu
> > in the mask. We start at whatever CPU the task ran last on, which is
> > effectively 'random' if the system is busy.
> >
> > So how is a per-cpu rotor better than that?
>
> > The current code is:
> >	for_each_cpu(cpu, cpu_smt_mask(target)) {
> >
> > For an 8-cpu/core processor, 8 values of target map to the same
> > cpu_smt_mask. 8 different tasks will traverse the mask in the same order.
>
> Ooh, the SMT loop.. yes that can be improved. But look at the other ones,
> they do:
>
>	for_each_cpu_wrap(cpu, sched_domain_span(), target)
>
> so we look for an idle cpu in the LLC domain, and start iteration at
> @target, which will (on average) be different for different CPUs, and
> thus hopefully find different idle CPUs.
>
> You could simply change the SMT loop to something like:
>
>	for_each_cpu_wrap(cpu, cpu_smt_mask(target), target)
>
> and see what that does.

Good idea

- Steve
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 01:34:58PM -0500, Steven Sistare wrote: > Actually, I take back my take back. I suspect the primary benefit > of random selection is that it breaks up resonance states where > CPUs that are busy tend to stay busy, and CPUs that are idle tend > to stay idle, which is reinforced by starting the search at target = > last cpu ran. Which, according to there here patches: https://lkml.kernel.org/r/20180130104555.4125-1-mgor...@techsingularity.net is a good thing, because of power management.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote: > On 2/1/2018 7:33 AM, Peter Zijlstra wrote: > > On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote: > >> + rcu_read_lock(); > >> + sd = rcu_dereference(per_cpu(sd_llc, this_cpu)); > >> + if (util) { > >> + for_each_lower_domain(sd) { > >> + if (sd->level == 0) > >> + break; > > > > afaict you really only need this for the core, and here you're assuming > > everything below the LLC is cores. Would it not be much clearer if you > > introduce sd_core. > > > > As is, for_each_lower_domain includes the starting domain, sd->group > > then is the first core group for this cpu. But then you continue to the > > smt domain (on Intel, on other architectures there could be a cluster > > domain in between) and then you bail using that sd->level == 0 hack > > because otherwise things would go *bang*. > > Hi Peter, > > The code here and in smt_balance intentionally visits each level between > the llc and smt, including core-cluster on architectures that define it. > smt_balance thus has the chance to randomly pick a better cluster, > and then within that cluster randomly pick a better core. It makes sense, > as resources are shared within a cluster, and choosing a less loaded cluster > should give better performance. As you suggest in a few other places, > it would be nice to see performance results for this case. We have > SPARC processors with core clusters. > But then you get that atomic crud to contend on the cluster level, which is even worse than it contending on the core level.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 12:36:47PM -0500, Steven Sistare wrote: > On 2/2/2018 12:17 PM, Peter Zijlstra wrote: > > On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote: > +static int select_idle_smt(struct task_struct *p, struct sched_group > *sg) > { > +int i, rand_index, rand_cpu; > +int this_cpu = smp_processor_id(); > > +rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight; > +rand_cpu = sg->cp_array[rand_index]; > >>> > >>> Right, so yuck.. I know why you need that, but that extra array and > >>> dereference is the reason I never went there. > >>> > >>> How much difference does it really make vs the 'normal' wrapping search > >>> from last CPU ? > >>> > >>> This really should be a separate patch with separate performance numbers > >>> on. > >> > >> For the benefit of other readers, if we always search and choose starting > >> from > >> the first CPU in a core, then later searches will often need to traverse > >> the first > >> N busy CPU's to find the first idle CPU. Choosing a random starting point > >> avoids > >> such bias. It is probably a win for processors with 4 to 8 CPUs per core, > >> and > >> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we > >> need > >> to see performance data for this as a separate patch to decide. We have > >> SPARC > >> systems with 8 CPUs per core. > > > > Which is why the current code already doesn't start from the first cpu > > in the mask. We start at whatever CPU the task ran last on, which is > > effectively 'random' if the system is busy. > > > > So how is a per-cpu rotor better than that? > > The current code is: > for_each_cpu(cpu, cpu_smt_mask(target)) { > > For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask. > 8 different tasks will traverse the mask in the same order. Ooh, the SMT loop.. yes that can be improved. 
But look at the other ones, they do: for_each_cpu_wrap(cpu, sched_domain_span(), target) so we look for an idle cpu in the LLC domain, and start iteration at @target, which will (on average) be different for different CPUs, and thus hopefully find different idle CPUs. You could simple change the SMT loop to something like: for_each_cpu_wrap(cpu, cpu_smt_mask(target), target) and see what that does.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/2018 12:39 PM, Steven Sistare wrote: > On 2/2/2018 12:21 PM, Peter Zijlstra wrote: >> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote: >>> It might be interesting to add a tunable for the number of random choices to >>> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu. >> >> This needs a fairly complicated PRNG for it would need to visit each >> possible CPU once before looping. A LFSR does that, but requires 2^n-1 >> elements and we have topology masks that don't match that.. The trivial >> example is something with 6 cores. > > Or keep it simple and accept the possibility of choosing the same candidate > more than once. > >>> Or, choose a random starting point and then search for nr sequential >>> candidates; possibly limited by a tunable. >> >> And this is basically what we already do. Except with the task-cpu >> instead of a per-cpu rotor. > > Righto. Disregard this suggestion. Actually, I take back my take back. I suspect the primary benefit of random selection is that it breaks up resonance states where CPUs that are busy tend to stay busy, and CPUs that are idle tend to stay idle, which is reinforced by starting the search at target = last cpu ran. Or, a quantitative argument: if sampling a single random CPU gives better results (and the data says it does), then sampling a random cpu and searching nr from it should give better results, since it has nr - 1 more chances to find an idle CPU. - Steve
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/2018 12:17 PM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>>> +{
>>>> +        int i, rand_index, rand_cpu;
>>>> +        int this_cpu = smp_processor_id();
>>>> +
>>>> +        rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>>>> +        rand_cpu = sg->cp_array[rand_index];
>>>
>>> Right, so yuck.. I know why you need that, but that extra array and
>>> dereference is the reason I never went there.
>>>
>>> How much difference does it really make vs the 'normal' wrapping search
>>> from last CPU ?
>>>
>>> This really should be a separate patch with separate performance numbers
>>> on.
>>
>> For the benefit of other readers, if we always search and choose starting from
>> the first CPU in a core, then later searches will often need to traverse the first
>> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
>> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
>> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
>> to see performance data for this as a separate patch to decide. We have SPARC
>> systems with 8 CPUs per core.
>
> Which is why the current code already doesn't start from the first cpu
> in the mask. We start at whatever CPU the task ran last on, which is
> effectively 'random' if the system is busy.
>
> So how is a per-cpu rotor better than that?

The current code is:

	for_each_cpu(cpu, cpu_smt_mask(target)) {

For an 8-cpu/core processor, 8 values of target map to the same cpu_smt_mask.
8 different tasks will traverse the mask in the same order.

- Steve
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/2018 12:21 PM, Peter Zijlstra wrote: > On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote: >> It might be interesting to add a tunable for the number of random choices to >> make, and clamp it at the max nr computed from avg_cost in select_idle_cpu. > > This needs a fairly complicated PRNG for it would need to visit each > possible CPU once before looping. A LFSR does that, but requires 2^n-1 > elements and we have topology masks that don't match that.. The trivial > example is something with 6 cores. Or keep it simple and accept the possibility of choosing the same candidate more than once. >> Or, choose a random starting point and then search for nr sequential >> candidates; possibly limited by a tunable. > > And this is basically what we already do. Except with the task-cpu > instead of a per-cpu rotor. Righto. Disregard this suggestion. - Steve
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/2/18 9:17 AM, Peter Zijlstra wrote:
> On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
>>>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>>>> +{
>>>> +        int i, rand_index, rand_cpu;
>>>> +        int this_cpu = smp_processor_id();
>>>> +
>>>> +        rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>>>> +        rand_cpu = sg->cp_array[rand_index];
>>>
>>> Right, so yuck.. I know why you need that, but that extra array and
>>> dereference is the reason I never went there.
>>>
>>> How much difference does it really make vs the 'normal' wrapping search
>>> from last CPU ?
>>>
>>> This really should be a separate patch with separate performance numbers
>>> on.
>>
>> For the benefit of other readers, if we always search and choose starting from
>> the first CPU in a core, then later searches will often need to traverse the first
>> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
>> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
>> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
>> to see performance data for this as a separate patch to decide. We have SPARC
>> systems with 8 CPUs per core.
>
> Which is why the current code already doesn't start from the first cpu
> in the mask. We start at whatever CPU the task ran last on, which is
> effectively 'random' if the system is busy.
>
> So how is a per-cpu rotor better than that?

In the scheme of SMT balance, if the idle cpu search is done _not_ in the
last run core, then we need a random cpu to start from. If the idle cpu
search is done in the last run core we can start the search from the last
run cpu. Since we need the random index for the first case I just did it
for both.

Thanks,
Subhra
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote:
> >> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> >> +{
> >> +        int i, rand_index, rand_cpu;
> >> +        int this_cpu = smp_processor_id();
> >> +
> >> +        rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> >> +        rand_cpu = sg->cp_array[rand_index];
> >
> > Right, so yuck.. I know why you need that, but that extra array and
> > dereference is the reason I never went there.
> >
> > How much difference does it really make vs the 'normal' wrapping search
> > from last CPU ?
> >
> > This really should be a separate patch with separate performance numbers
> > on.
>
> For the benefit of other readers, if we always search and choose starting from
> the first CPU in a core, then later searches will often need to traverse the first
> N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
> such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
> a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
> to see performance data for this as a separate patch to decide. We have SPARC
> systems with 8 CPUs per core.

Which is why the current code already doesn't start from the first cpu
in the mask. We start at whatever CPU the task ran last on, which is
effectively 'random' if the system is busy.

So how is a per-cpu rotor better than that?
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Fri, Feb 02, 2018 at 11:53:40AM -0500, Steven Sistare wrote: > It might be interesting to add a tunable for the number of random choices to > make, and clamp it at the max nr computed from avg_cost in select_idle_cpu. This needs a fairly complicated PRNG for it would need to visit each possible CPU once before looping. A LFSR does that, but requires 2^n-1 elements and we have topology masks that don't match that.. The trivial example is something with 6 cores. > Or, choose a random starting point and then search for nr sequential > candidates; possibly limited by a tunable. And this is basically what we already do. Except with the task-cpu instead of a per-cpu rotor.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On 2/1/2018 7:33 AM, Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
>> +#ifdef CONFIG_SCHED_SMT
>> +
>> +/*
>> + * From sd_llc downward update the SMT utilization.
>
> Please don't use utilization for this, we already use that word for
> something else.
>
>> + * Skip the lowest level 0.
>> + */
>> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
>> +{
>> +        struct sched_domain *sd;
>> +        struct sched_group *sg_cpu;
>> +        int this_cpu = rq->cpu;
>> +        int util;
>> +
>> +        if (rq->curr_util == UTIL_UNINITIALIZED)
>> +                prev_busy = 0;
>> +
>> +        util = next_busy - prev_busy;
>> +        rcu_read_lock();
>> +        sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
>> +        if (util) {
>> +                for_each_lower_domain(sd) {
>> +                        if (sd->level == 0)
>> +                                break;
>
> afaict you really only need this for the core, and here you're assuming
> everything below the LLC is cores. Would it not be much clearer if you
> introduce sd_core.
>
> As is, for_each_lower_domain includes the starting domain, sd->group
> then is the first core group for this cpu. But then you continue to the
> smt domain (on Intel, on other architectures there could be a cluster
> domain in between) and then you bail using that sd->level == 0 hack
> because otherwise things would go *bang*.

Hi Peter,

The code here and in smt_balance intentionally visits each level between
the llc and smt, including core-cluster on architectures that define it.
smt_balance thus has the chance to randomly pick a better cluster, and
then within that cluster randomly pick a better core. It makes sense, as
resources are shared within a cluster, and choosing a less loaded cluster
should give better performance. As you suggest in a few other places, it
would be nice to see performance results for this case. We have SPARC
processors with core clusters.

> Really rather messy this.
>
> I think you want to go allocate sched_domain_shared for the MC level and
> use that, much like sd_llc_shared.
>
>> +                        sg_cpu = sd->groups;
>> +
>> +                        /* atomically add/subtract the util */
>> +                        if (util > 0)
>> +                                atomic_inc((atomic_t *)&sg_cpu->utilization);
>> +                        else
>> +                                atomic_dec((atomic_t *)&sg_cpu->utilization);
>
> *sigh*, wth do you need that cast? You didn't get the memo that spurious
> casts are where bugs happen and are terrible style at the best of times?
>
>> +                }
>> +        }
>> +
>> +        if (sd)
>> +                rq->curr_util = next_busy;
>> +        rcu_read_unlock();
>> +}
>
>> +        smt_util(rq, 1, 0);
>
>> +        smt_util(rq, 0, 1);
>
> That all seems like an overly complex way to write inc/dec. You turned
> what should be a boolean state space (2^1) into something like 2^64.
>
> Also, I think if you count idle instead of busy, you'll do away with the
> need for that whole uninitialized thing.
>
>> +#define CPU_PSEUDO_RANDOM(cpu)        (cpu_rq(cpu)->rotor++)
>
> That's a bit of a stretch calling that pseudo random..
>
>> +/*
>> + * Find an idle cpu in the core starting search from random index.
>> + */
>> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
>> +{
>> +        int i, rand_index, rand_cpu;
>> +        int this_cpu = smp_processor_id();
>> +
>> +        rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
>> +        rand_cpu = sg->cp_array[rand_index];
>
> Right, so yuck.. I know why you need that, but that extra array and
> dereference is the reason I never went there.
>
> How much difference does it really make vs the 'normal' wrapping search
> from last CPU ?
>
> This really should be a separate patch with separate performance numbers
> on.

For the benefit of other readers, if we always search and choose starting from
the first CPU in a core, then later searches will often need to traverse the first
N busy CPU's to find the first idle CPU. Choosing a random starting point avoids
such bias. It is probably a win for processors with 4 to 8 CPUs per core, and
a slight but hopefully negligible loss for 2 CPUs per core, and I agree we need
to see performance data for this as a separate patch to decide. We have SPARC
systems with 8 CPUs per core.

>> +        for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
>> +                if (!cpumask_test_cpu(i, &p->cpus_allowed))
>> +                        continue;
>> +                if (idle_cpu(i))
>> +                        return i;
>> +        }
>> +
>> +        return -1;
>> +}
>> +
>> +/*
>> + * Compare the SMT utilization of the two groups and select the one which
>> + * has more capacity left.
>> + */
>> +static int smt_should_migrate(struct sched_group *here,
>> +                              struct sched_group *there, int self_util)
>> +{
>> +        int this_cpu = smp_processor_id();
>> +        int here_util = here->utilization, there_util = there->utilization;
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Thu, Feb 01, 2018 at 01:33:35PM +0100, Peter Zijlstra wrote: > I think you want to go allocate sched_domain_shared for the MC level and > use that, much like sd_llc_shared. Also, you'd want to try and get performance numbers for something like Power8, which has SMT8 and fairly expensive atomic ops.
Re: [RESEND RFC PATCH V3] sched: Improve scalability of select_idle_sibling using SMT balance
On Mon, Jan 29, 2018 at 03:31:02PM -0800, subhra mazumdar wrote:
> +#ifdef CONFIG_SCHED_SMT
> +
> +/*
> + * From sd_llc downward update the SMT utilization.

Please don't use utilization for this, we already use that word for
something else.

> + * Skip the lowest level 0.
> + */
> +void smt_util(struct rq *rq, int prev_busy, int next_busy)
> +{
> +        struct sched_domain *sd;
> +        struct sched_group *sg_cpu;
> +        int this_cpu = rq->cpu;
> +        int util;
> +
> +        if (rq->curr_util == UTIL_UNINITIALIZED)
> +                prev_busy = 0;
> +
> +        util = next_busy - prev_busy;
> +        rcu_read_lock();
> +        sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
> +        if (util) {
> +                for_each_lower_domain(sd) {
> +                        if (sd->level == 0)
> +                                break;

afaict you really only need this for the core, and here you're assuming
everything below the LLC is cores. Would it not be much clearer if you
introduce sd_core.

As is, for_each_lower_domain includes the starting domain, sd->group
then is the first core group for this cpu. But then you continue to the
smt domain (on Intel, on other architectures there could be a cluster
domain in between) and then you bail using that sd->level == 0 hack
because otherwise things would go *bang*.

Really rather messy this.

I think you want to go allocate sched_domain_shared for the MC level and
use that, much like sd_llc_shared.

> +                        sg_cpu = sd->groups;
> +
> +                        /* atomically add/subtract the util */
> +                        if (util > 0)
> +                                atomic_inc((atomic_t *)&sg_cpu->utilization);
> +                        else
> +                                atomic_dec((atomic_t *)&sg_cpu->utilization);

*sigh*, wth do you need that cast? You didn't get the memo that spurious
casts are where bugs happen and are terrible style at the best of times?

> +                }
> +        }
> +
> +        if (sd)
> +                rq->curr_util = next_busy;
> +        rcu_read_unlock();
> +}

> +        smt_util(rq, 1, 0);

> +        smt_util(rq, 0, 1);

That all seems like an overly complex way to write inc/dec. You turned
what should be a boolean state space (2^1) into something like 2^64.
Also, I think if you count idle instead of busy, you'll do away with the
need for that whole uninitialized thing.

> +#define CPU_PSEUDO_RANDOM(cpu)        (cpu_rq(cpu)->rotor++)

That's a bit of a stretch calling that pseudo random..

> +/*
> + * Find an idle cpu in the core starting search from random index.
> + */
> +static int select_idle_smt(struct task_struct *p, struct sched_group *sg)
> +{
> +        int i, rand_index, rand_cpu;
> +        int this_cpu = smp_processor_id();
> +
> +        rand_index = CPU_PSEUDO_RANDOM(this_cpu) % sg->group_weight;
> +        rand_cpu = sg->cp_array[rand_index];

Right, so yuck.. I know why you need that, but that extra array and
dereference is the reason I never went there.

How much difference does it really make vs the 'normal' wrapping search
from last CPU ?

This really should be a separate patch with separate performance numbers
on.

> +        for_each_cpu_wrap(i, sched_group_span(sg), rand_cpu) {
> +                if (!cpumask_test_cpu(i, &p->cpus_allowed))
> +                        continue;
> +                if (idle_cpu(i))
> +                        return i;
> +        }
> +
> +        return -1;
> +}
> +
> +/*
> + * Compare the SMT utilization of the two groups and select the one which
> + * has more capacity left.
> + */
> +static int smt_should_migrate(struct sched_group *here,
> +                              struct sched_group *there, int self_util)
> +{
> +        int this_cpu = smp_processor_id();
> +        int here_util = here->utilization, there_util = there->utilization;
> +
> +        /* Discount self utilization if it belongs to here or there */
> +        if (self_util > 0) {
> +                if (cpumask_test_cpu(this_cpu, sched_group_span(here)))
> +                        here_util -= self_util;
> +                else if (cpumask_test_cpu(this_cpu, sched_group_span(there)))
> +                        there_util -= self_util;
> +        }
> +
> +        /* Return true if other group has more capacity left */
> +        return (there->group_weight - there_util >
> +                here->group_weight - here_util);
> +}

OK, so this effectively picks the least-busiest/idlest SMT sibling of two.

> +/*
> + * SMT balancing works by comparing the target cpu group with a random group
> + * in llc domain. It calls smt_should_migrate to decide which group has more
> + * capacity left. The balancing starts top down from sd_llc to SMT core level.
> + * Finally idle cpu search is only done in the core.
> + */
> +static int smt_balance(struct task_struct *p, struct sched_domain *sd, int cpu)
> +{
> +        struct sched_group *sg, *src_sg, *start_sg, *tgt_sg;
> +        struct cpumask *span;
> +        int rand_idx, weight;
> +        int cpu_orig = cpu;
> +        int rand_cpu;
> +        int this_cpu = smp_processor_id();