Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-19 Thread Srikar Dronamraju
* Peter Zijlstra  [2021-05-19 11:59:48]:

> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> 
> > > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > > level, rather than some arch special. Wouldn't something like this be
> > > > more appropriate?
> 
> > > Thanks Peter for the quick review! This makes sense to me. The only
> > > reason we proposed arch_asym_check_smt_siblings() is because we were
> > > concerned about breaking powerpc (I need to study how they set priorities for SMT,
> > > if applicable). If you think this is not an issue I can post a
> > > v4 with this update.
> > 
> > As far as I can see, priorities in powerpc are set by the CPU number.
> > However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
> > SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
> > [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> > core would need to be busy before a new core is used.
> > 
> > Still, I think the issue described in the cover letter may be
> > reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> > tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> > able to help since CPU0 has the highest priority.
> > 
> > I am cc'ing the linuxppc list to get some feedback.
> 
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
> 
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it cannot reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
> 
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.
> 
> 

If you are referring to SD_ASYM_PACKING, then packing tasks to the lowest
SMT siblings was needed in the POWER7 timeframe. If there was only one
thread running, running it on the lowest sibling provided the best
performance, as if running in single-threaded mode.

However, recent chips like POWER8/POWER9/POWER10 don't need SD_ASYM_PACKING,
since the hardware itself does the switch. So even if a task is placed on a
higher sibling within the core, we don't need to move it to a lower sibling
for it to perform better. Now, with only one thread running on any sibling,
it is expected to run in single-threaded mode.
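
For completeness, the SMT sched-domain flags are where this gets wired up on
powerpc. A simplified sketch, from my reading of arch/powerpc/kernel/smp.c
(an approximation of the hook, not a verbatim copy):

    static int powerpc_smt_flags(void)
    {
            int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;

            /*
             * Only POWER7-class cores, which run a lone task faster on
             * their lower-numbered threads, ask for asym packing at the
             * SMT level.
             */
            if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
                    printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
                    flags |= SD_ASYM_PACKING;
            }

            return flags;
    }

So on POWER8 and later, where CPU_FTR_ASYM_SMT is not set, the SMT domain
never carries SD_ASYM_PACKING to begin with.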

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-19 Thread Nicholas Piggin
Excerpts from Peter Zijlstra's message of May 19, 2021 7:59 pm:
> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
>> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
>> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> 
>> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
>> > > level, rather than some arch special. Wouldn't something like this be
>> > > more appropriate?
> 
>> > Thanks Peter for the quick review! This makes sense to me. The only
>> > reason we proposed arch_asym_check_smt_siblings() is because we were
>> > concerned about breaking powerpc (I need to study how they set priorities for SMT,
>> > if applicable). If you think this is not an issue I can post a
>> > v4 with this update.
>> 
>> As far as I can see, priorities in powerpc are set by the CPU number.
>> However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
>> SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
>> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
>> core would need to be busy before a new core is used.
>> 
>> Still, I think the issue described in the cover letter may be
>> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
>> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
>> able to help since CPU0 has the highest priority.
>> 
>> I am cc'ing the linuxppc list to get some feedback.
> 
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
> 
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it cannot reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
> 
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.

That's correct.

Thanks,
Nick


Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-19 Thread Peter Zijlstra
On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:

> > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > level, rather than some arch special. Wouldn't something like this be
> > > more appropriate?

> > Thanks Peter for the quick review! This makes sense to me. The only
> > reason we proposed arch_asym_check_smt_siblings() is because we were
> > concerned about breaking powerpc (I need to study how they set priorities for SMT,
> > if applicable). If you think this is not an issue I can post a
> > v4 with this update.
> 
> As far as I can see, priorities in powerpc are set by the CPU number.
> However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
> SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
> [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> core would need to be busy before a new core is used.
> 
> Still, I think the issue described in the cover letter may be
> reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> able to help since CPU0 has the highest priority.
> 
> I am cc'ing the linuxppc list to get some feedback.

IIRC the concern with Power is that their Cores can go faster if the
higher SMT siblings are unused.

That is, suppose you have an SMT4 Core with only a single active task,
then if only SMT0 is used it can reach max performance, but if the
active sibling is SMT1 it cannot reach max performance, and if the only
active sibling is SMT2 it goes slower still.

So they need to pack the tasks to the lowest SMT siblings, and have the
highest SMT siblings idle (where possible) in order to increase
performance.
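
The mechanism behind that packing preference is just a priority comparison;
a minimal sketch of the default helpers (no arch override), where a lower
CPU number means a higher priority:

    /* Default: lower CPU number == higher asym-packing priority. */
    int __weak arch_asym_cpu_priority(int cpu)
    {
            return -cpu;
    }

    /* True if CPU a should be packed before CPU b. */
    static inline bool sched_asym_prefer(int a, int b)
    {
            return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
    }

With these defaults sched_asym_prefer(0, 1) is true, so load is steered
toward CPU0 first, then CPU1, and so on down the line.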




Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-18 Thread Ricardo Neri
On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> > On Thu, May 13, 2021 at 08:49:08AM -0700, Ricardo Neri wrote:
> > >  include/linux/sched/topology.h |   1 +
> > >  kernel/sched/fair.c            | 101 ++++++++++++++++++++++++++++++++
> > >  2 files changed, 102 insertions(+)
> > > 
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index 8f0f778b7c91..43bdb8b1e1df 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -57,6 +57,7 @@ static inline int cpu_numa_flags(void)
> > >  #endif
> > >  
> > >  extern int arch_asym_cpu_priority(int cpu);
> > > +extern bool arch_asym_check_smt_siblings(void);
> > >  
> > >  struct sched_domain_attr {
> > >   int relax_domain_level;
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c8b66a5d593e..3d6cc027e6e6 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -106,6 +106,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> > >   return -cpu;
> > >  }
> > >  
> > > +/*
> > > + * For asym packing, first check the state of SMT siblings before deciding to
> > > + * pull tasks.
> > > + */
> > > +bool __weak arch_asym_check_smt_siblings(void)
> > > +{
> > > + return false;
> > > +}
> > > +
> > >  /*
> > >   * The margin used when comparing utilization with CPU capacity.
> > >   *
> > 
> > > @@ -8458,6 +8550,9 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> > >   if (group == sds->local)
> > >   return false;
> > >  
> > > + if (arch_asym_check_smt_siblings())
> > > + return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> > > +
> > >   return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> > >  }
> > 
> > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > level, rather than some arch special. Wouldn't something like this be
> > more appropriate?
> > 
> > ---
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -57,7 +57,6 @@ static inline int cpu_numa_flags(void)
> >  #endif
> >  
> >  extern int arch_asym_cpu_priority(int cpu);
> > -extern bool arch_asym_check_smt_siblings(void);
> >  
> >  struct sched_domain_attr {
> > int relax_domain_level;
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -107,15 +107,6 @@ int __weak arch_asym_cpu_priority(int cpu)
> >  }
> >  
> >  /*
> > - * For asym packing, first check the state of SMT siblings before deciding to
> > - * pull tasks.
> > - */
> > -bool __weak arch_asym_check_smt_siblings(void)
> > -{
> > -   return false;
> > -}
> > -
> > -/*
> >   * The margin used when comparing utilization with CPU capacity.
> >   *
> >   * (default: ~20%)
> > @@ -8550,7 +8541,8 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
> > if (group == sds->local)
> > return false;
> >  
> > -   if (arch_asym_check_smt_siblings())
> > +   if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> > +   (group->flags & SD_SHARE_CPUCAPACITY))
> > return asym_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> 
> Thanks Peter for the quick review! This makes sense to me. The only
> reason we proposed arch_asym_check_smt_siblings() is because we were
> concerned about breaking powerpc (I need to study how they set priorities for SMT,
> if applicable). If you think this is not an issue I can post a
> v4 with this update.

As far as I can see, priorities in powerpc are set by the CPU number.
However, I am not sure how CPUs are enumerated. If the CPUs in brackets are
SMT siblings, does the enumeration look like A) [0, 1], [2, 3] or B) [0, 2],
[1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
core would need to be busy before a new core is used.

Still, I think the issue described in the cover letter may be
reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
able to help since CPU0 has the highest priority.

I am cc'ing the linuxppc list to get some feedback.
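
To make the enumeration question concrete, here is a toy user-space sketch
(purely illustrative; the 2-core/SMT2 layout is hypothetical, and the -cpu
priority is the default from the patch):

    #include <stdio.h>

    /* Default priority from the patch: lower CPU number wins. */
    static int arch_asym_cpu_priority(int cpu)
    {
            return -cpu;
    }

    int main(void)
    {
            /*
             * Packing follows descending priority: CPU0, CPU1, CPU2, CPU3.
             *
             * Enumeration A) cores = [0, 1], [2, 3]: both siblings of
             * core 0 become busy before core 1 runs anything.
             * Enumeration B) cores = [0, 2], [1, 3]: the primary sibling
             * of each core (0, then 1) fills first, then 2 and 3.
             */
            for (int cpu = 0; cpu < 4; cpu++)
                    printf("pack step %d: CPU%d (prio %d)\n",
                           cpu, cpu, arch_asym_cpu_priority(cpu));
            return 0;
    }

With enumeration B, once CPU3 is offlined and CPU0/CPU2 are busy, CPU1 is the
idle CPU with the next-highest priority, yet the priority comparison never
lets it pull from CPU0 -- which is exactly the scenario described above.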

Thanks and BR,
Ricardo