Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-08 Thread Valentin Schneider
On 08/02/21 16:29, Vincent Guittot wrote:
> On Fri, 5 Feb 2021 at 21:07, Valentin Schneider
>  wrote:
>>
>> Perhaps I can still keep 5/8 with something like
>>
>>   if (!rq->misfit_task_load)
>>   return false;
>>
>>   do {
>>   if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
>>   return true;
>>
>>   group = group->next;
>>   } while (group != sd->groups);
>
> I don't see what you want to achieve with this while loop compared
> to the original condition, which is: trigger a load_balance
> - if there is a CPU with higher original capacity
> - or if the capacity of this CPU has been significantly reduced
> because of pressure and there may be others with more capacity, even
> if this is the one with the highest original capacity
>

If we had a root-domain-wide (dynamic) capacity maximum, we could make
check_misfit_status() return false if the CPU *is* pressured but there is
no better alternative - e.g. if all other CPUs are pressured even worse.

This isn't a correctness issue as the nohz load-balance will just not
migrate the misfit task, but it would be nice to prevent the nohz kick
altogether.
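
A sketch of that idea, assuming a hypothetical variant of
rd->max_cpu_capacity that is kept up to date with the rqs' *pressured*
capacities (rq->cpu_capacity) rather than capacity_orig - no such
tracking exists today, which is precisely the problem discussed above:

  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
  {
          if (!rq->misfit_task_load)
                  return 0;

          /*
           * This CPU may be pressured, but if no CPU in the root domain
           * has noticeably more capacity left, a nohz kick won't find a
           * better home for the task anyway.
           */
          return capacity_greater(READ_ONCE(rq->rd->max_cpu_capacity),
                                  rq->cpu_capacity);
  }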

I might ditch this for now and revisit it later.

>>
>>   return false;
>>
>> This works somewhat well for big.LITTLE, but for DynamIQ systems under a
>> single L3 this ends up iterating over all the CPUs :/


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-08 Thread Vincent Guittot
On Fri, 5 Feb 2021 at 21:07, Valentin Schneider
 wrote:
>
> On 05/02/21 18:17, Vincent Guittot wrote:
> > On Fri, 5 Feb 2021 at 18:00, Valentin Schneider
> >> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >> >>  {
> >> >> return rq->misfit_task_load &&
> >> >> -   (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> >> >> +   (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> >> >
> >> > Why do you add a margin here whereas there was no margin before?
> >> >
> >>
> >> Comparing capacities without any sort of filter can lead to ping-ponging
> >> tasks around (capacity values very easily fluctuate by +/- 1, if not more).
> >
> > max_cpu_capacity reflects the max of the cpu_capacity_orig values,
> > which are not expected to change and can be considered static. It
> > would be better to fix this rounding problem (if any) in
> > topology_get_cpu_scale() instead of computing a margin every time
> > it's used.
> >
>
> That's embarrassing, I was convinced we had something updating
> rd->max_cpu_capacity with actual rq->capacity values... But as you point
> out that's absolutely not the case, it's all based on rq->capacity_orig,
> which completely invalidates patch 5/8.
>
> Welp.
>
> Perhaps I can still keep 5/8 with something like
>
>   if (!rq->misfit_task_load)
>   return false;
>
>   do {
>   if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
>   return true;
>
>   group = group->next;
>   } while (group != sd->groups);

I don't see what you want to achieve with this while loop compared
to the original condition, which is: trigger a load_balance
- if there is a CPU with higher original capacity
- or if the capacity of this CPU has been significantly reduced
because of pressure and there may be others with more capacity, even
if this is the one with the highest original capacity
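
For reference, the pre-patch check being compared against (taken from
the diff further down, with the two conditions above annotated):

  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
  {
          return rq->misfit_task_load &&
                 /* a CPU with higher original capacity exists... */
                 (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
                  /* ...or this CPU's capacity is significantly reduced */
                  check_cpu_capacity(rq, sd));
  }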

>
>   return false;
>
> This works somewhat well for big.LITTLE, but for DynamIQ systems under a
> single L3 this ends up iterating over all the CPUs :/


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-05 Thread Vincent Guittot
On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
 wrote:
>
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
>
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);

group_smaller_min_cpu_capacity() and group_smaller_max_cpu_capacity() are
removed in the next patch. Merge this patch with the next one and directly
remove them.

>
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
>
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
>
> One may argue the 20% margin of fits_capacity() is excessive with the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
>
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
>
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
>
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
>  #endif
>
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
> return rq->misfit_task_load &&
> -   (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> +   (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||

Why do you add a margin here whereas there was no margin before?

>  check_cpu_capacity(rq, sd));
>  }
>
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -   return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> +   return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> -   return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> +   return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  * average load.
>  */
> if (sd_has_asym_cpucapacity(env->sd) &&
> -   capacity_of(env->dst_cpu) < capacity &&
> +   !capacity_greater(capacity_of(env->dst_cpu), capacity) &&

same here

> nr_running == 1)
> continue;
>
> --
> 2.27.0
>


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-05 Thread Valentin Schneider
On 05/02/21 18:17, Vincent Guittot wrote:
> On Fri, 5 Feb 2021 at 18:00, Valentin Schneider
>> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>> >>  {
>> >> return rq->misfit_task_load &&
>> >> -   (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
>> >> +   (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>> >
>> > Why do you add a margin here whereas there was no margin before?
>> >
>>
>> Comparing capacities without any sort of filter can lead to ping-ponging
>> tasks around (capacity values very easily fluctuate by +/- 1, if not more).
>
> max_cpu_capacity reflects the max of the cpu_capacity_orig values,
> which are not expected to change and can be considered static. It
> would be better to fix this rounding problem (if any) in
> topology_get_cpu_scale() instead of computing a margin every time
> it's used.
>

That's embarrassing, I was convinced we had something updating
rd->max_cpu_capacity with actual rq->capacity values... But as you point
out that's absolutely not the case, it's all based on rq->capacity_orig,
which completely invalidates patch 5/8.

Welp.

Perhaps I can still keep 5/8 with something like

  if (!rq->misfit_task_load)
  return false;

  do {
  if (capacity_greater(group->sgc->max_capacity, rq->cpu_capacity))
  return true;

  group = group->next;
  } while (group != sd->groups);

  return false;

This works somewhat well for big.LITTLE, but for DynamIQ systems under a
single L3 this ends up iterating over all the CPUs :/
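
To see why (a toy model in plain userspace C, not kernel code: with a
single L3, the domain spanning all CPUs has one sched_group per CPU, so
the walk above inspects every CPU before giving up):

  #include <stdbool.h>
  #include <stdio.h>

  struct group { int max_capacity; struct group *next; };

  /* Same shape as the do/while above; margin arithmetic of capacity_greater() */
  static bool better_group_exists(struct group *head, int cpu_capacity)
  {
          struct group *g = head;

          do {
                  if (cpu_capacity * 1078 < g->max_capacity * 1024)
                          return true;
                  g = g->next;
          } while (g != head);

          return false;
  }

  int main(void)
  {
          struct group cpus[8]; /* e.g. 8 identical DynamIQ CPUs under one L3 */
          int i;

          for (i = 0; i < 8; i++) {
                  cpus[i].max_capacity = 1024;
                  cpus[i].next = &cpus[(i + 1) % 8];
          }

          /* No group is ~5% above this CPU, so all 8 groups get inspected */
          printf("%d\n", better_group_exists(&cpus[0], 1024));
          return 0;
  }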


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-05 Thread Valentin Schneider
On 05/02/21 15:31, Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
>  wrote:
>>
>> During load-balance, groups classified as group_misfit_task are filtered
>> out if they do not pass
>>
>>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
>
> group_smaller_min_cpu_capacity() and group_smaller_max_cpu_capacity()
> are removed in the next patch. Merge this patch with the next one and
> directly remove them.
>

OK.

>> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>>  {
>> return rq->misfit_task_load &&
>> -   (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
>> +   (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>
> Why do you add a margin here whereas there was no margin before?
>

Comparing capacities without any sort of filter can lead to ping-ponging
tasks around (capacity values very easily fluctuate by +/- 1, if not more).
I'm guilty of doing two things at once here: replace existing users, and
convert callsites that should be existing users. I can split the conversion
in a separate patch.
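
The ~5% margin does filter that jitter out; a quick userspace check of
the macro's arithmetic (the threshold is really ~5.3%, i.e. 1078/1024):

  #include <assert.h>

  #define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)

  int main(void)
  {
          assert(!capacity_greater(1024, 1023)); /* +/-1 jitter: ignored */
          assert(!capacity_greater(1074, 1024)); /* ~4.9% above: still ignored */
          assert( capacity_greater(1079, 1024)); /* ~5.4% above: noticeable */
          return 0;
  }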


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-05 Thread Vincent Guittot
On Fri, 5 Feb 2021 at 18:00, Valentin Schneider
 wrote:
>
> On 05/02/21 15:31, Vincent Guittot wrote:
> > On Thu, 28 Jan 2021 at 19:32, Valentin Schneider
> >  wrote:
> >>
> >> During load-balance, groups classified as group_misfit_task are filtered
> >> out if they do not pass
> >>
> >>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
> >
> > group_smaller_min_cpu_capacity() and group_smaller_max_cpu_capacity()
> > are removed in the next patch. Merge this patch with the next one and
> > directly remove them.
> >
>
> OK.
>
> >> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
> >>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
> >>  {
> >> return rq->misfit_task_load &&
> >> -   (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> >> +   (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
> >
> > Why do you add a margin here whereas there was no margin before?
> >
>
> Comparing capacities without any sort of filter can lead to ping-ponging
> tasks around (capacity values very easily fluctuate by +/- 1, if not more).

max_cpu_capacity reflects the max of the cpu_capacity_orig values,
which are not expected to change and can be considered static. It
would be better to fix this rounding problem (if any) in
topology_get_cpu_scale() instead of computing a margin every time
it's used.

> I'm guilty of doing two things at once here: replace existing users, and
> convert callsites that should be existing users. I can split the conversion
> in a separate patch.


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-03 Thread Valentin Schneider
On 03/02/21 15:15, Qais Yousef wrote:
> On 01/28/21 18:31, Valentin Schneider wrote:
>> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>>   */
>>  #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
>>  
>> +/*
>> + * The margin used when comparing CPU capacities.
>> + * is 'cap' noticeably greater than 'ref'
>> + *
>> + * (default: ~5%)
>> + */
>> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
>
> nit: can we use cap1 and cap2 and make the implementation use '>' instead of
> '<'? ie:
>
>   #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)
>
> this is more intuitive to read IMHO. Especially few lines below we have
>
>   return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>
> which passes 'ref->...' as cap, which can be confusing when looking at
> the function signature's @ref.
>

Unfortunate naming indeed... And I suppose it can't hurt to follow the
argument "declaration" order. 
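
Presumably the helper then ends up in the form Qais suggests (taken
from his review below; the actual respin isn't part of this excerpt):

  #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)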


Re: [PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-02-03 Thread Qais Yousef
On 01/28/21 18:31, Valentin Schneider wrote:
> During load-balance, groups classified as group_misfit_task are filtered
> out if they do not pass
> 
>   group_smaller_max_cpu_capacity(<candidate group>, <local group>);
> 
> which itself employs fits_capacity() to compare the sgc->max_capacity of
> both groups.
> 
> Due to the underlying margin, fits_capacity(X, 1024) will return false for
> any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
> {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
> CPUs, misfit migration will never intentionally upmigrate it to a CPU of
> higher capacity due to the aforementioned margin.
> 
> One may argue the 20% margin of fits_capacity() is excessive with the advent
> of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
> is that fits_capacity() is meant to compare a utilization value to a
> capacity value, whereas here it is being used to compare two capacity
> values. As CPU capacity and task utilization have different dynamics, a
> sensible approach here would be to add a new helper dedicated to comparing
> CPU capacities.
> 
> Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
> existing capacity checks. Note that check_cpu_capacity() uses yet another
> margin (sd->imbalance_pct), and is left alone for now.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7d01fa0bfc7e..58ce0b22fcb0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
>   */
>  #define fits_capacity(cap, max)  ((cap) * 1280 < (max) * 1024)
>  
> +/*
> + * The margin used when comparing CPU capacities.
> + * is 'cap' noticeably greater than 'ref'
> + *
> + * (default: ~5%)
> + */
> +#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)

nit: can we use cap1 and cap2 and make the implementation use '>' instead of
'<'? ie:

#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

this is more intuitive to read IMHO. Especially few lines below we have

return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);

which passes 'ref->...' as cap, which can be confusing when looking at
the function signature's @ref.

Either way, this LGTM

Reviewed-by: Qais Yousef 

Thanks

--
Qais Yousef

>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> @@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
>   return rq->misfit_task_load &&
> - (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
> + (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
>    check_cpu_capacity(rq, sd));
>  }
>  
> @@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
>  static inline bool
>  group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> - return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
> + return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
>  }
>  
>  /*
> @@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  static inline bool
>  group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>  {
> - return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
> + return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
>  }
>  
>  static inline enum
> @@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>* average load.
>*/
>   if (sd_has_asym_cpucapacity(env->sd) &&
> - capacity_of(env->dst_cpu) < capacity &&
> + !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
>   nr_running == 1)
>   continue;
>  
> -- 
> 2.27.0
> 


[PATCH 3/8] sched/fair: Tweak misfit-related capacity checks

2021-01-28 Thread Valentin Schneider
During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass

  group_smaller_max_cpu_capacity(<candidate group>, <local group>);

which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.

Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
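
Putting numbers on that (same arithmetic as fits_capacity(), Pixel 4
capacities from above; a standalone userspace check):

  #include <assert.h>

  #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

  int main(void)
  {
          /* 871 * 1280 = 1114880 >= 1024 * 1024 = 1048576 */
          assert(!fits_capacity(871, 1024)); /* "medium" never counts as smaller than "big" */
          assert( fits_capacity(819, 1024)); /* the cutoff sits at 819... */
          assert(!fits_capacity(820, 1024)); /* ...and 820+ is rejected */
          return 0;
  }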

One may argue the 20% margin of fits_capacity() is excessive with the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.

Introduce capacity_greater(), which uses a 5% margin. Use it to replace the
existing capacity checks. Note that check_cpu_capacity() uses yet another
margin (sd->imbalance_pct), and is left alone for now.
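
With the new helper, the Pixel 4 case above goes through ('cap' must
exceed 'ref' by roughly 5.3%, i.e. 1078/1024, to count as "noticeably
greater"; another standalone userspace check):

  #include <assert.h>

  #define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)

  int main(void)
  {
          /* 871 * 1078 = 938938 < 1024 * 1024 = 1048576 */
          assert( capacity_greater(1024, 871));  /* big is noticeably above medium */
          assert(!capacity_greater(1024, 1023)); /* +/-1 noise stays filtered */
          return 0;
  }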

Signed-off-by: Valentin Schneider 
---
 kernel/sched/fair.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7d01fa0bfc7e..58ce0b22fcb0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -113,6 +113,13 @@ int __weak arch_asym_cpu_priority(int cpu)
  */
 #define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
 
+/*
+ * The margin used when comparing CPU capacities.
+ * is 'cap' noticeably greater than 'ref'
+ *
+ * (default: ~5%)
+ */
+#define capacity_greater(cap, ref) ((ref) * 1078 < (cap) * 1024)
 #endif
 
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -8253,7 +8260,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
 {
return rq->misfit_task_load &&
-   (rq->cpu_capacity_orig < rq->rd->max_cpu_capacity ||
+   (capacity_greater(rq->rd->max_cpu_capacity, rq->cpu_capacity_orig) ||
 check_cpu_capacity(rq, sd));
 }
 
@@ -8352,7 +8359,7 @@ group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-   return fits_capacity(sg->sgc->min_capacity, ref->sgc->min_capacity);
+   return capacity_greater(ref->sgc->min_capacity, sg->sgc->min_capacity);
 }
 
 /*
@@ -8362,7 +8369,7 @@ group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 static inline bool
 group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
-   return fits_capacity(sg->sgc->max_capacity, ref->sgc->max_capacity);
+   return capacity_greater(ref->sgc->max_capacity, sg->sgc->max_capacity);
 }
 
 static inline enum
@@ -9421,7 +9428,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 * average load.
 */
if (sd_has_asym_cpucapacity(env->sd) &&
-   capacity_of(env->dst_cpu) < capacity &&
+   !capacity_greater(capacity_of(env->dst_cpu), capacity) &&
nr_running == 1)
continue;
 
-- 
2.27.0