Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On 08/02/21 16:29, Vincent Guittot wrote:
> On Fri, 5 Feb 2021 at 21:07, Valentin Schneider wrote:
>>
>> Perhaps I can still keep 5/8 with something like
>>
>>   if (!rq->misfit_task_load)
>>           return false;
>>
>>   do {
>>           if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
>>                   return true;
>>
>>           group = group->next;
>>   } while (group != sd->groups);
>
> I don't see what you want to achieve with this while loop compared to
> the original condition, which is: trigger a load_balance
> - if there is a CPU with higher original capacity
> - or if the capacity of this CPU has significantly reduced because of
>   pressure, and there are maybe others with more capacity, even if this
>   is the one with the highest original capacity

If we had a root-domain-wide (dynamic) capacity maximum, we could make
check_misfit_status() return false if the CPU *is* pressured but there
is no better alternative - e.g. if all other CPUs are pressured even
worse.

This isn't a correctness issue, as the nohz load-balance will just not
migrate the misfit task, but it would be nice to prevent the nohz kick
altogether. I might ditch this for now and revisit it later.

>>
>>   return false;
>>
>> This works somewhat well for big.LITTLE, but for DynamIQ systems under a
>> single L3 this ends up iterating over all the CPUs :/
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On Fri, 5 Feb 2021 at 21:07, Valentin Schneider wrote:
>
> On 05/02/21 18:17, Vincent Guittot wrote:
> > On Fri, 5 Feb 2021 at 18:00, Valentin Schneider wrote:
> >> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >> >>  {
> >> >>  	return rq->misfit_task_load &&
> >> >> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> >> >> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> >> >
> >> > Why do you add a margin here whereas there was no margin before?
> >> >
> >>
> >> Comparing capacities without any sort of filter can lead to ping-ponging
> >> tasks around (capacity values very easily fluctuate by +/- 1, if not more).
> >
> > max_cpu_capacity reflects the max of the cpu_capacity_orig values, which
> > don't aim to change and can be considered as static values.
> > It would be better to fix this rounding problem (if any) in
> > topology_get_cpu_scale instead of computing a margin every time it's
> > used.
>
> That's embarrassing, I was convinced we had something updating
> rd->max_cpu_capacity with actual rq->capacity values... But as you point
> out that's absolutely not the case, it's all based on rq->capacity_orig,
> which completely invalidates patch 5/8.
>
> Welp.
>
> Perhaps I can still keep 5/8 with something like
>
>   if (!rq->misfit_task_load)
>           return false;
>
>   do {
>           if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
>                   return true;
>
>           group = group->next;
>   } while (group != sd->groups);

I don't see what you want to achieve with this while loop compared to
the original condition, which is: trigger a load_balance
- if there is a CPU with higher original capacity
- or if the capacity of this CPU has significantly reduced because of
  pressure, and there are maybe others with more capacity, even if this
  is the one with the highest original capacity

>
>   return false;
>
> This works somewhat well for big.LITTLE, but for DynamIQ systems under a
> single L3 this ends up iterating over all the CPUs :/
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On Thu, 28 Jan 2021 at 19:32, Valentin Schneider wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>   group_smaller_max_cpu_capacity(<sg>, <ref>);

group_smaller_min_cpu_capacity and group_smaller_max_cpu_capacity are
removed in the next patch. Merge this and the next and directly remove
them.

>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
>
> Signed-off-by: Valentin Schneider
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
>  #endif
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
>  	return rq->misfit_task_load &&
> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||

Why do you add a margin here whereas there was no margin before?

>  		 check_cpu_capacity(rq, sd));
>  }
>
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> +	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> +	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  	 * average load.
>  	 */
>  	if (sd_has_asym_cpucapacity(env->sd) &&
> -	    capacity_of(env->dst_cpu) < capacity &&
> +	    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&

same here

>  	    nr_running == 1)
>  		continue;
>
> --
> 2.27.0
>
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On 05/02/21 18:17, Vincent Guittot wrote:
> On Fri, 5 Feb 2021 at 18:00, Valentin Schneider wrote:
>> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>> >>  {
>> >>  	return rq->misfit_task_load &&
>> >> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
>> >> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>> >
>> > Why do you add a margin here whereas there was no margin before?
>> >
>>
>> Comparing capacities without any sort of filter can lead to ping-ponging
>> tasks around (capacity values very easily fluctuate by +/- 1, if not more).
>
> max_cpu_capacity reflects the max of the cpu_capacity_orig values, which
> don't aim to change and can be considered as static values.
> It would be better to fix this rounding problem (if any) in
> topology_get_cpu_scale instead of computing a margin every time it's
> used.

That's embarrassing, I was convinced we had something updating
rd->max_cpu_capacity with actual rq->capacity values... But as you point
out that's absolutely not the case, it's all based on rq->capacity_orig,
which completely invalidates patch 5/8.

Welp.

Perhaps I can still keep 5/8 with something like

  if (!rq->misfit_task_load)
          return false;

  do {
          if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
                  return true;

          group = group->next;
  } while (group != sd->groups);

  return false;

This works somewhat well for big.LITTLE, but for DynamIQ systems under a
single L3 this ends up iterating over all the CPUs :/
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On 05/02/21 15:31, Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 19:32, Valentin Schneider wrote:
>>
>> During load-balance, groups classified as group_misfit_task are filtered
>> out if they do not pass
>>
>>   group_smaller_max_cpu_capacity(<sg>, <ref>);
>
> group_smaller_min_cpu_capacity and group_smaller_max_cpu_capacity are
> removed in the next patch. Merge this and the next and directly remove
> them.
>

OK.

>> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>>  {
>>  	return rq->misfit_task_load &&
>> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
>> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>
> Why do you add a margin here whereas there was no margin before?
>

Comparing capacities without any sort of filter can lead to ping-ponging
tasks around (capacity values very easily fluctuate by +/- 1, if not more).

I'm guilty of doing two things at once here: replace existing users, and
convert callsites that should be existing users. I can split the conversion
into a separate patch.
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On Fri, 5 Feb 2021 at 18:00, Valentin Schneider wrote:
>
> On 05/02/21 15:31, Vincent Guittot wrote:
> > On Thu, 28 Jan 2021 at 19:32, Valentin Schneider wrote:
> >>
> >> During load-balance, groups classified as group_misfit_task are filtered
> >> out if they do not pass
> >>
> >>   group_smaller_max_cpu_capacity(<sg>, <ref>);
> >
> > group_smaller_min_cpu_capacity and group_smaller_max_cpu_capacity are
> > removed in the next patch. Merge this and the next and directly remove
> > them.
> >
>
> OK.
>
> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >>  {
> >>  	return rq->misfit_task_load &&
> >> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> >> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> >
> > Why do you add a margin here whereas there was no margin before?
> >
>
> Comparing capacities without any sort of filter can lead to ping-ponging
> tasks around (capacity values very easily fluctuate by +/- 1, if not more).

max_cpu_capacity reflects the max of the cpu_capacity_orig values, which
don't aim to change and can be considered as static values.
It would be better to fix this rounding problem (if any) in
topology_get_cpu_scale instead of computing a margin every time it's
used.

> I'm guilty of doing two things at once here: replace existing users, and
> convert callsites that should be existing users. I can split the conversion
> into a separate patch.
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On 03/02/21 15:15, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:
>> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>>   */
>>  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>>
>> +/*
>> + * The margin used when comparing CPU capacities.
>> + * is 'cap' noticeably greater than 'ref'
>> + *
>> + * (default: ~5%)
>> + */
>> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
>
> nit: can we use cap1 and cap2 and make the implementation use '>' instead
> of '<'? ie:
>
>   #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>
> this is more intuitive to read IMHO. Especially a few lines below we have
>
>   return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>
> which passes 'ref->...' as cap, which can be confusing when looking at the
> function signature @ref.
>

Unfortunate naming indeed... And I suppose it can't hurt to follow the
argument "declaration" order.
Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
On 01/28/21 18:31, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>   group_smaller_max_cpu_capacity(<sg>, <ref>);
>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive in the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
>
> Signed-off-by: Valentin Schneider
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)

nit: can we use cap1 and cap2 and make the implementation use '>' instead
of '<'? ie:

  #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

this is more intuitive to read IMHO. Especially a few lines below we have

  return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);

which passes 'ref->...' as cap, which can be confusing when looking at the
function signature @ref.

Either way, this LGTM.

Reviewed-by: Qais Yousef

Thanks

--
Qais Yousef

>  #endif
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
>  	return rq->misfit_task_load &&
> -		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> +		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>  		 check_cpu_capacity(rq, sd));
>  }
>
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> +	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> +	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  	 * average load.
>  	 */
>  	if (sd_has_asym_cpucapacity(env->sd) &&
> -	    capacity_of(env->dst_cpu) < capacity &&
> +	    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>  	    nr_running == 1)
>  		continue;
>
> --
> 2.27.0
>
[PATCH 3/8] sched/fair: Tweak misfit-related capacity checks
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<sg>, <ref>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.

One may argue the 20% margin of fits_capacity() is excessive in the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
existing capacity checks. Note that check_cpu_capacity() uses yet another
margin (sd->imbalance_pct), and is left alone for now.

Signed-off-by: Valentin Schneider
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d01fa0bfc7e..58ce0b22fcb0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  */
 #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
 
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap' noticeably greater than 'ref'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
 {
 	return rq->misfit_task_load &&
-		(rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
+		(capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
 		 check_cpu_capacity(rq, sd));
 }
 
@@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-	return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
+	return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
 }
 
 /*
@@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 static inline bool
 group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-	return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
+	return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
 }
 
 static inline enum
@@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	 * average load.
 	 */
 	if (sd_has_asym_cpucapacity(env->sd) &&
-	    capacity_of(env->dst_cpu) < capacity &&
+	    !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
 	    nr_running == 1)
 		continue;
 
-- 
2.27.0