Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On Thu, 2018-02-15 at 10:07 -0800, Rohit Jain wrote:
> >
> > Rohit is running more tests with a patch that deletes
> > sysctl_sched_migration_cost from idle_balance, and for his patch but
> > with the 5000 usec mistake corrected back to 500 usec.  So far both
> > give improvements over the baseline, but for different cases, so we
> > need to try more workloads before we draw any conclusions.
> >
> > Rohit, can you share your data so far?
>
> Results:
>
> In the following results, the "Domain based" approach is the one from the
> RFC with the values fixed (as pointed out by Mike).  "No check" is the
> patch where I just remove the check against sysctl_sched_migration_cost.
>
> 1) Hackbench results on 2 socket, 44 core and 88 threads Intel x86
> machine (lower is better):
>
> +-------+--------+------------------+----------+------------------+----------+------------------+----------+
> |       |        | Without Patch    |          | Domain Based     |          | No Check         |          |
> | Loops | Groups | Average          | %Std Dev | Average          | %Std Dev | Average          | %Std Dev |
> +-------+--------+------------------+----------+------------------+----------+------------------+----------+
> | 10    | 4      | 9.701            | 0.78     | 7.971  (+17.84%) | 1.34     | 8.919  (+8.07%)  | 1.07     |
> | 10    | 8      | 17.186           | 0.77     | 16.712 (+2.76%)  | 0.87     | 17.043 (+0.83%)  | 0.83     |
> | 10    | 16     | 30.378           | 0.55     | 29.780 (+1.97%)  | 0.38     | 29.565 (+2.67%)  | 0.29     |
> | 10    | 32     | 54.712           | 0.54     | 53.001 (+3.13%)  | 0.19     | 52.158 (+4.67%)  | 0.22     |
> +-------+--------+------------------+----------+------------------+----------+------------------+----------+

Previous numbers:

+-------+----+--------+------------------+----------+------------------+----------+
|       |    |        | Without patch    |          | With patch       |          |
| Loops | FD | Groups | Average          | %Std Dev | Average          | %Std Dev |
+-------+----+--------+------------------+----------+------------------+----------+
| 10    | 40 | 4      | 9.701            | 0.78     | 9.623  (+0.81%)  | 3.67     |
| 10    | 40 | 8      | 17.186           | 0.77     | 17.068 (+0.68%)  | 1.89     |
| 10    | 40 | 16     | 30.378           | 0.55     | 30.072 (+1.52%)  | 0.46     |
| 10    | 40 | 32     | 54.712           | 0.54     | 53.588 (+2.28%)  | 0.21     |
+-------+----+--------+------------------+----------+------------------+----------+

My take on this (not that you have to sell it to me, you don't) when I
squint at these together is: submit the one-liner, and take the rest back
to the drawing board.  You've got nothing but high std dev numbers in
(imo) way too finicky/unrealistic hackbench to sell these not so pretty
patches.

I bet you can easily sell that one-liner, because that removes an old
wart (me stealing migration_cost in the first place), instead of making
that wart a whole lot harder to intentionally not notice.

	-Mike
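The "one-liner" in question is presumably just the first hunk of the RFC,
dropping the old throttle and nothing else:

	-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
	-	    !this_rq->rd->overload) {
	+	if (!this_rq->rd->overload) {

in idle_balance(), with the per-sd cost machinery left on the drawing board.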
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On Thu, 2018-02-15 at 13:21 -0500, Steven Sistare wrote:
> On 2/15/2018 1:07 PM, Mike Galbraith wrote:
>
> >> Can you provide more details on the sysbench oltp test that motivated you
> >> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test
> >> it?
> >
> > The problem at that time was the cycle overhead of entering that LB
> > path at high frequency.  Dirt simple.
>
> I get that.  I meant please provide details on test parameters and config
> if you remember them.

Nope.  I doubt it would be relevant to here/now anyway.

	-Mike
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On 2/15/2018 1:07 PM, Mike Galbraith wrote:
> On Thu, 2018-02-15 at 11:35 -0500, Steven Sistare wrote:
>> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>>>         if (!(sd->flags & SD_LOAD_BALANCE))
>>>>>>                 continue;
>>>>>>
>>>>>> -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>>> +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>>> +           sd->sched_migration_cost) {
>>>>>>                 update_next_balance(sd, &next_balance);
>>>>>>                 break;
>>>>>>         }
>>>>>
>>>>> Ditto.
>>>>
>>>> The old code did not migrate if the expected costs exceeded the expected
>>>> idle time.  The new code just adds the sd-specific penalty (essentially
>>>> loss of cache footprint) to the costs.  The for_each_domain loop visits
>>>> smallest to largest sd's, hence visiting smallest to largest migration
>>>> costs (though the tunables do not enforce an ordering), and bails at the
>>>> first sd where the total cost is a lose.
>>>
>>> Hrm..
>>>
>>> You're now adding a hypothetical cost to the measured cost of running
>>> the LB machinery, which implies that the measurement is insufficient,
>>> but you still don't say why it is insufficient.  What happens if you
>>> don't do that?  I ask, because when I removed the...
>>>
>>>         this_rq->avg_idle < sysctl_sched_migration_cost
>>>
>>> ...bits to check removal effect for Peter, the original reason for it
>>> being added did not re-materialize, making me wonder why you need to
>>> make this cutoff more aggressive.
>>
>> The current code with sysctl_sched_migration_cost discourages migration
>> too much, per our test results.
>
> That's why I asked you what happens if you only whack the _apparently_
> (but maybe not) obsolete old throttle, it appeared likely that your win
> came from allowing a bit more migration than the simple throttle
> allowed, which if true, would obviate the need for anything more.
>
>> Can you provide more details on the sysbench oltp test that motivated you
>> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test
>> it?
>
> The problem at that time was the cycle overhead of entering that LB
> path at high frequency.  Dirt simple.

I get that.  I meant please provide details on test parameters and config
if you remember them.

- Steve
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On Thu, 2018-02-15 at 11:35 -0500, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
> > On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
> > > > > @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> > > > >         if (!(sd->flags & SD_LOAD_BALANCE))
> > > > >                 continue;
> > > > >
> > > > > -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > > > > +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> > > > > +           sd->sched_migration_cost) {
> > > > >                 update_next_balance(sd, &next_balance);
> > > > >                 break;
> > > > >         }
> > > >
> > > > Ditto.
> > >
> > > The old code did not migrate if the expected costs exceeded the expected
> > > idle time.  The new code just adds the sd-specific penalty (essentially
> > > loss of cache footprint) to the costs.  The for_each_domain loop visits
> > > smallest to largest sd's, hence visiting smallest to largest migration
> > > costs (though the tunables do not enforce an ordering), and bails at the
> > > first sd where the total cost is a lose.
> >
> > Hrm..
> >
> > You're now adding a hypothetical cost to the measured cost of running
> > the LB machinery, which implies that the measurement is insufficient,
> > but you still don't say why it is insufficient.  What happens if you
> > don't do that?  I ask, because when I removed the...
> >
> >         this_rq->avg_idle < sysctl_sched_migration_cost
> >
> > ...bits to check removal effect for Peter, the original reason for it
> > being added did not re-materialize, making me wonder why you need to
> > make this cutoff more aggressive.
>
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results.

That's why I asked you what happens if you only whack the _apparently_
(but maybe not) obsolete old throttle, it appeared likely that your win
came from allowing a bit more migration than the simple throttle
allowed, which if true, would obviate the need for anything more.

> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test
> it?

The problem at that time was the cycle overhead of entering that LB
path at high frequency.  Dirt simple.

	-Mike
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On 02/15/2018 08:35 AM, Steven Sistare wrote:
> On 2/10/2018 1:37 AM, Mike Galbraith wrote:
>> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>>         if (!(sd->flags & SD_LOAD_BALANCE))
>>>>>                 continue;
>>>>>
>>>>> -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>>> +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>>> +           sd->sched_migration_cost) {
>>>>>                 update_next_balance(sd, &next_balance);
>>>>>                 break;
>>>>>         }
>>>>
>>>> Ditto.
>>>
>>> The old code did not migrate if the expected costs exceeded the expected
>>> idle time.  The new code just adds the sd-specific penalty (essentially
>>> loss of cache footprint) to the costs.  The for_each_domain loop visits
>>> smallest to largest sd's, hence visiting smallest to largest migration
>>> costs (though the tunables do not enforce an ordering), and bails at the
>>> first sd where the total cost is a lose.
>>
>> Hrm..
>>
>> You're now adding a hypothetical cost to the measured cost of running
>> the LB machinery, which implies that the measurement is insufficient,
>> but you still don't say why it is insufficient.  What happens if you
>> don't do that?  I ask, because when I removed the...
>>
>>         this_rq->avg_idle < sysctl_sched_migration_cost
>>
>> ...bits to check removal effect for Peter, the original reason for it
>> being added did not re-materialize, making me wonder why you need to
>> make this cutoff more aggressive.
>
> The current code with sysctl_sched_migration_cost discourages migration
> too much, per our test results.  Deleting it entirely from idle_balance()
> may be the right solution, or it may allow too much migration and cause
> regressions due to loss of cache warmth on some workloads.  Rohit's patch
> deletes it and adds the sd->sched_migration_cost term to allow a migration
> rate that is somewhere in the middle, and is logically sound.  It
> discourages but does not prevent migration between nodes, and encourages
> but does not always allow migration between cores.  By contrast, setting
> relax_domain_level to disable SD_BALANCE_NEWIDLE at the SD_NUMA level is
> a big hammer.
>
> I would be perfectly happy if deleting sysctl_sched_migration_cost from
> idle_balance does the trick.  Last week in a different thread you
> mentioned it did not hurt tbench:
>
>>> Mike, do you remember what comes apart when we take
>>> out the sysctl_sched_migration_cost test in idle_balance()?
>>
>> Used to be anything scheduling cross-core heftily suffered, ie pretty
>> much any localhost communication heavy load.  I just tried disabling it
>> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
>> difference.  I presume that's due to the meanwhile added
>> this_rq->rd->overload and/or curr_cost checks.
>
> Can you provide more details on the sysbench oltp test that motivated you
> to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test
> it?
>
>     1b9508f6 sched: Rate-limit newidle
>
>     Rate limit newidle to migration_cost.  It's a win for all stages of
>     sysbench oltp tests.
>
> Rohit is running more tests with a patch that deletes
> sysctl_sched_migration_cost from idle_balance, and for his patch but
> with the 5000 usec mistake corrected back to 500 usec.  So far both
> give improvements over the baseline, but for different cases, so we
> need to try more workloads before we draw any conclusions.
>
> Rohit, can you share your data so far?

Results:

In the following results, the "Domain based" approach is the one from the
RFC with the values fixed (as pointed out by Mike).  "No check" is the
patch where I just remove the check against sysctl_sched_migration_cost.

1) Hackbench results on 2 socket, 44 core and 88 threads Intel x86
machine (lower is better):

+-------+--------+------------------+----------+------------------+----------+------------------+----------+
|       |        | Without Patch    |          | Domain Based     |          | No Check         |          |
| Loops | Groups | Average          | %Std Dev | Average          | %Std Dev | Average          | %Std Dev |
+-------+--------+------------------+----------+------------------+----------+------------------+----------+
| 10    | 4      | 9.701            | 0.78     | 7.971  (+17.84%) | 1.34     | 8.919  (+8.07%)  | 1.07     |
| 10    | 8      | 17.186           | 0.77     | 16.712 (+2.76%)  | 0.87     | 17.043 (+0.83%)  | 0.83     |
| 10    | 16     | 30.378           | 0.55     | 29.780 (+1.97%)  | 0.38     | 29.565 (+2.67%)  | 0.29     |
| 10    | 32     | 54.712           | 0.54     | 53.001 (+3.13%)  | 0.19     | 52.158 (+4.67%)  | 0.22     |
+-------+--------+------------------+----------+------------------+----------+------------------+----------+

2) Sysbench MySQL results on 2 socket, 44 core and 88 threads Intel x86
machine (higher is better):

+-------+------------------+------------------+------------------+
|       | Without Patch    | Domain based     | No check         |
+-------+------------------+------------------+------------------+
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On 2/10/2018 1:37 AM, Mike Galbraith wrote:
> On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
>>>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>>>         if (!(sd->flags & SD_LOAD_BALANCE))
>>>>                 continue;
>>>>
>>>> -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>>> +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>>>> +           sd->sched_migration_cost) {
>>>>                 update_next_balance(sd, &next_balance);
>>>>                 break;
>>>>         }
>>>
>>> Ditto.
>>
>> The old code did not migrate if the expected costs exceeded the expected
>> idle time.  The new code just adds the sd-specific penalty (essentially
>> loss of cache footprint) to the costs.  The for_each_domain loop visits
>> smallest to largest sd's, hence visiting smallest to largest migration
>> costs (though the tunables do not enforce an ordering), and bails at the
>> first sd where the total cost is a lose.
>
> Hrm..
>
> You're now adding a hypothetical cost to the measured cost of running
> the LB machinery, which implies that the measurement is insufficient,
> but you still don't say why it is insufficient.  What happens if you
> don't do that?  I ask, because when I removed the...
>
>         this_rq->avg_idle < sysctl_sched_migration_cost
>
> ...bits to check removal effect for Peter, the original reason for it
> being added did not re-materialize, making me wonder why you need to
> make this cutoff more aggressive.

The current code with sysctl_sched_migration_cost discourages migration
too much, per our test results.  Deleting it entirely from idle_balance()
may be the right solution, or it may allow too much migration and cause
regressions due to loss of cache warmth on some workloads.  Rohit's patch
deletes it and adds the sd->sched_migration_cost term to allow a migration
rate that is somewhere in the middle, and is logically sound.  It
discourages but does not prevent migration between nodes, and encourages
but does not always allow migration between cores.  By contrast, setting
relax_domain_level to disable SD_BALANCE_NEWIDLE at the SD_NUMA level is
a big hammer.

I would be perfectly happy if deleting sysctl_sched_migration_cost from
idle_balance does the trick.  Last week in a different thread you
mentioned it did not hurt tbench:

>> Mike, do you remember what comes apart when we take
>> out the sysctl_sched_migration_cost test in idle_balance()?
>
> Used to be anything scheduling cross-core heftily suffered, ie pretty
> much any localhost communication heavy load.  I just tried disabling it
> in 4.13 though (pre pti cliff), tried tbench, and it made zip squat
> difference.  I presume that's due to the meanwhile added
> this_rq->rd->overload and/or curr_cost checks.

Can you provide more details on the sysbench oltp test that motivated you
to add sysctl_sched_migration_cost to idle_balance, so Rohit can re-test
it?

    1b9508f6 sched: Rate-limit newidle

    Rate limit newidle to migration_cost.  It's a win for all stages of
    sysbench oltp tests.

Rohit is running more tests with a patch that deletes
sysctl_sched_migration_cost from idle_balance, and for his patch but
with the 5000 usec mistake corrected back to 500 usec.  So far both
give improvements over the baseline, but for different cases, so we
need to try more workloads before we draw any conclusions.

Rohit, can you share your data so far?

- Steve
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On Fri, 2018-02-09 at 11:08 -0500, Steven Sistare wrote:
> >> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
> >>         if (!(sd->flags & SD_LOAD_BALANCE))
> >>                 continue;
> >>
> >> -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >> +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> >> +           sd->sched_migration_cost) {
> >>                 update_next_balance(sd, &next_balance);
> >>                 break;
> >>         }
> >
> > Ditto.
>
> The old code did not migrate if the expected costs exceeded the expected
> idle time.  The new code just adds the sd-specific penalty (essentially
> loss of cache footprint) to the costs.  The for_each_domain loop visits
> smallest to largest sd's, hence visiting smallest to largest migration
> costs (though the tunables do not enforce an ordering), and bails at the
> first sd where the total cost is a lose.

Hrm..

You're now adding a hypothetical cost to the measured cost of running
the LB machinery, which implies that the measurement is insufficient,
but you still don't say why it is insufficient.  What happens if you
don't do that?  I ask, because when I removed the...

	this_rq->avg_idle < sysctl_sched_migration_cost

...bits to check removal effect for Peter, the original reason for it
being added did not re-materialize, making me wonder why you need to
make this cutoff more aggressive.

	-Mike
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On 2/8/2018 10:42 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch makes idle_balance more dynamic as the sched_migration_cost
>> is now accounted on a sched_domain level.  This in turn is done in
>> sd_init when we know what the topology relationships are.
>>
>> For introduction's sake, the cost of migration within the same core is
>> set to 0, across cores to 50 usec, and across sockets to 500 usec.
>> sysctls for these variables are introduced in patch 2.
>>
>> Signed-off-by: Rohit Jain
>> ---
>>  include/linux/sched/topology.h | 1 +
>>  kernel/sched/fair.c            | 6 +++---
>>  kernel/sched/topology.c        | 5 +++++
>>  3 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index cf257c2..bcb4db2 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -104,6 +104,7 @@ struct sched_domain {
>>         u64 max_newidle_lb_cost;
>>         unsigned long next_decay_max_lb_cost;
>>
>> +       u64 sched_migration_cost;
>>         u64 avg_scan_cost;      /* select_idle_sibling */
>>
>>  #ifdef CONFIG_SCHEDSTATS
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2fe3aa8..61d3508 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>          */
>>         rq_unpin_lock(this_rq, rf);
>>
>> -       if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> -           !this_rq->rd->overload) {
>> +       if (!this_rq->rd->overload) {
>>                 rcu_read_lock();
>>                 sd = rcu_dereference_check_sched_domain(this_rq->sd);
>>                 if (sd)
>
> Unexplained/unrelated change.

This could be explained better in the cover letter, but it is related;
this and the change below are the meat of the patch.  The deleted test
"this_rq->avg_idle < sysctl_sched_migration_cost" formerly bailed based
on a single global notion of migration cost, independent of sd.  Now the
cost is per-sd, evaluated in the sd loop below.  The other condition to
bail early, "!this_rq->rd->overload", is still relevant and remains.

>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>>         if (!(sd->flags & SD_LOAD_BALANCE))
>>                 continue;
>>
>> -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>> +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>> +           sd->sched_migration_cost) {
>>                 update_next_balance(sd, &next_balance);
>>                 break;
>>         }
>
> Ditto.

The old code did not migrate if the expected costs exceeded the expected
idle time.  The new code just adds the sd-specific penalty (essentially
loss of cache footprint) to the costs.  The for_each_domain loop visits
smallest to largest sd's, hence visiting smallest to largest migration
costs (though the tunables do not enforce an ordering), and bails at the
first sd where the total cost is a lose.

>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 034cbed..bcd8c64 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>>                 sd->flags |= SD_PREFER_SIBLING;
>>                 sd->imbalance_pct = 110;
>>                 sd->smt_gain = 1178;    /* ~15% */
>> +               sd->sched_migration_cost = 0;
>>
>>         } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>>                 sd->flags |= SD_PREFER_SIBLING;
>>                 sd->imbalance_pct = 117;
>>                 sd->cache_nice_tries = 1;
>>                 sd->busy_idx = 2;
>> +               sd->sched_migration_cost = 50UL;
>>
>>  #ifdef CONFIG_NUMA
>>         } else if (sd->flags & SD_NUMA) {
>> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>>                 sd->idle_idx = 2;
>>
>>                 sd->flags |= SD_SERIALIZE;
>> +               sd->sched_migration_cost = 500UL;
>
> That's not 500us.

Good catch, thanks.  It's 5000us but should be 500.  The latest version
of Rohit's patch lost a little performance vs the previous version, and
this might explain why.  Re-testing may bring good news.

- Steve
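A note on units, since patch 2 with its sysctl plumbing is not shown in
this thread: rq->avg_idle and sd->max_newidle_lb_cost are derived from
sched_clock() deltas and therefore count nanoseconds, and the new
sd->sched_migration_cost is compared directly against them, so a value
meant as 500 usec presumably needs to be scaled on assignment.  A minimal
sketch of what sd_init() might do under that assumption -- illustrative
only, not the actual re-spun patch:

	#include <linux/time64.h>	/* NSEC_PER_USEC */

	/* Cross-socket penalty: specify in usec, store in ns so it
	 * compares directly against rq->avg_idle in idle_balance(). */
	sd->sched_migration_cost = 500UL * NSEC_PER_USEC;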
Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
> This patch makes idle_balance more dynamic as the sched_migration_cost
> is now accounted on a sched_domain level.  This in turn is done in
> sd_init when we know what the topology relationships are.
>
> For introduction's sake, the cost of migration within the same core is
> set to 0, across cores to 50 usec, and across sockets to 500 usec.
> sysctls for these variables are introduced in patch 2.
>
> Signed-off-by: Rohit Jain
> ---
>  include/linux/sched/topology.h | 1 +
>  kernel/sched/fair.c            | 6 +++---
>  kernel/sched/topology.c        | 5 +++++
>  3 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index cf257c2..bcb4db2 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -104,6 +104,7 @@ struct sched_domain {
>         u64 max_newidle_lb_cost;
>         unsigned long next_decay_max_lb_cost;
>
> +       u64 sched_migration_cost;
>         u64 avg_scan_cost;      /* select_idle_sibling */
>
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2fe3aa8..61d3508 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>          */
>         rq_unpin_lock(this_rq, rf);
>
> -       if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> -           !this_rq->rd->overload) {
> +       if (!this_rq->rd->overload) {
>                 rcu_read_lock();
>                 sd = rcu_dereference_check_sched_domain(this_rq->sd);
>                 if (sd)

Unexplained/unrelated change.

> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>         if (!(sd->flags & SD_LOAD_BALANCE))
>                 continue;
>
> -       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> +       if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
> +           sd->sched_migration_cost) {
>                 update_next_balance(sd, &next_balance);
>                 break;
>         }

Ditto.

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 034cbed..bcd8c64 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>                 sd->flags |= SD_PREFER_SIBLING;
>                 sd->imbalance_pct = 110;
>                 sd->smt_gain = 1178;    /* ~15% */
> +               sd->sched_migration_cost = 0;
>
>         } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>                 sd->flags |= SD_PREFER_SIBLING;
>                 sd->imbalance_pct = 117;
>                 sd->cache_nice_tries = 1;
>                 sd->busy_idx = 2;
> +               sd->sched_migration_cost = 50UL;
>
>  #ifdef CONFIG_NUMA
>         } else if (sd->flags & SD_NUMA) {
> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                 sd->idle_idx = 2;
>
>                 sd->flags |= SD_SERIALIZE;
> +               sd->sched_migration_cost = 500UL;

That's not 500us.

	-Mike
[RFC 1/2] sched: reduce migration cost between faster caches for idle_balance
This patch makes idle_balance more dynamic as the sched_migration_cost
is now accounted on a sched_domain level.  This in turn is done in
sd_init when we know what the topology relationships are.

For introduction's sake, the cost of migration within the same core is
set to 0, across cores to 50 usec, and across sockets to 500 usec.
sysctls for these variables are introduced in patch 2.

Signed-off-by: Rohit Jain
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 6 +++---
 kernel/sched/topology.c        | 5 +++++
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index cf257c2..bcb4db2 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -104,6 +104,7 @@ struct sched_domain {
        u64 max_newidle_lb_cost;
        unsigned long next_decay_max_lb_cost;

+       u64 sched_migration_cost;
        u64 avg_scan_cost;      /* select_idle_sibling */

 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fe3aa8..61d3508 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
         */
        rq_unpin_lock(this_rq, rf);

-       if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-           !this_rq->rd->overload) {
+       if (!this_rq->rd->overload) {
                rcu_read_lock();
                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                if (sd)
@@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
                if (!(sd->flags & SD_LOAD_BALANCE))
                        continue;

-               if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
+               if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
+                   sd->sched_migration_cost) {
                        update_next_balance(sd, &next_balance);
                        break;
                }
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed..bcd8c64 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
                sd->flags |= SD_PREFER_SIBLING;
                sd->imbalance_pct = 110;
                sd->smt_gain = 1178;    /* ~15% */
+               sd->sched_migration_cost = 0;

        } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
                sd->flags |= SD_PREFER_SIBLING;
                sd->imbalance_pct = 117;
                sd->cache_nice_tries = 1;
                sd->busy_idx = 2;
+               sd->sched_migration_cost = 50UL;

 #ifdef CONFIG_NUMA
        } else if (sd->flags & SD_NUMA) {
@@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
                sd->idle_idx = 2;

                sd->flags |= SD_SERIALIZE;
+               sd->sched_migration_cost = 500UL;
                if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
                        sd->flags &= ~(SD_BALANCE_EXEC |
                                       SD_BALANCE_FORK |
@@ -1174,6 +1177,7 @@ sd_init(struct sched_domain_topology_level *tl,
                sd->cache_nice_tries = 1;
                sd->busy_idx = 2;
                sd->idle_idx = 1;
+               sd->sched_migration_cost = 500UL;
        }

        /*
@@ -1622,6 +1626,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_leve
                }
        }
+
        set_domain_attribute(sd, attr);

        return sd;
-- 
2.7.4
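To make the effect of the new check concrete, here is a small userspace
model of the patched gate.  The domain names, costs, and idle time below
are hypothetical stand-ins, not measured values, and the cost accrual is
simplified relative to the kernel's:

/*
 * Illustrative userspace model of the patched idle_balance() gate; NOT
 * kernel code.  The walk goes smallest to largest domain and stops at
 * the first level where the expected idle time cannot cover the
 * accumulated balance cost plus the per-domain migration penalty.
 */
#include <stdio.h>

struct sd_model {
        const char *name;
        unsigned long long max_newidle_lb_cost;  /* measured LB cost, ns */
        unsigned long long sched_migration_cost; /* per-sd penalty, ns */
};

int main(void)
{
        /* Smallest to largest, as for_each_domain() would visit them. */
        struct sd_model sds[] = {
                { "SMT",   3000,      0 },  /* same core: no penalty   */
                { "MC",    8000,  50000 },  /* cross core: 50 usec     */
                { "NUMA", 20000, 500000 },  /* cross socket: 500 usec  */
        };
        unsigned long long avg_idle = 120000; /* 120 usec expected idle */
        unsigned long long curr_cost = 0;
        size_t i;

        for (i = 0; i < sizeof(sds) / sizeof(sds[0]); i++) {
                if (avg_idle < curr_cost + sds[i].max_newidle_lb_cost +
                               sds[i].sched_migration_cost) {
                        printf("%-4s: stop, idle time too short\n",
                               sds[i].name);
                        break; /* larger domains only get more expensive */
                }
                printf("%-4s: newidle balance allowed\n", sds[i].name);
                curr_cost += sds[i].max_newidle_lb_cost; /* crude accrual */
        }
        return 0;
}

With these stand-in numbers the walk allows newidle balancing at the SMT
and MC levels but bails before NUMA: migration stays cheap nearby and
expensive across sockets, which is the behavior the per-domain costs are
meant to produce.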