Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Hello Mel, On Mon, Apr 12, 2021 at 11:48:19AM +0100, Mel Gorman wrote: > On Mon, Apr 12, 2021 at 11:06:19AM +0100, Valentin Schneider wrote: > > On 12/04/21 10:37, Mel Gorman wrote: > > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote: > > >> * Gautham R. Shenoy [2021-04-02 11:07:54]: > > >> > > >> > > > >> > To remedy this, this patch proposes that the LLC be moved to the MC > > >> > level which is a group of cores in one half of the chip. > > >> > > > >> > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE > > >> > > > >> > > >> I think marking Hemisphere as a LLC in a P10 scenario is a good idea. > > >> > > >> > While there is no cache being shared at this level, this is still the > > >> > level where some amount of cache-snooping takes place and it is > > >> > relatively faster to access the data from the caches of the cores > > >> > within this domain. With this change, we no longer see regressions on > > >> > P10 for applications which require single threaded performance. > > >> > > >> Peter, Valentin, Vincent, Mel, etal > > >> > > >> On architectures where we have multiple levels of cache access latencies > > >> within a DIE, (For example: one within the current LLC or SMT core and > > >> the > > >> other at MC or Hemisphere, and finally across hemispheres), do you have > > >> any > > >> suggestions on how we could handle the same in the core scheduler? > > >> > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > allows within the node with the LLC CPUs masked out. > > > > I think that's actually a decent approach. 
Tying SD_SHARE_PKG_RESOURCES to > > something other than pure cache topology in a generic manner is tough (as > > it relies on murky, ill-defined hardware fabric properties). > > > > Agreed. The LLC->node scan idea has been on my TODO list to try for > a while. If you have any patches for these, I will be happy to test them on POWER10. Though, on POWER10, there will be an additional sd between the LLC and the DIE domain. > > > Last I tried thinking about that, I stopped at having a core-to-core > > latency matrix, building domains off of that, and having some knob > > specifying the highest distance value below which we'd set > > SD_SHARE_PKG_RESOURCES. There's a few things I 'hate' about that; for one > > it makes cpus_share_cache() somewhat questionable. > > > > And I thought about something like this too but worried it might get > complex, particularly on chiplets where we do not necessarily have > hardware info on latency depending on how it's wired up. It also might > lead to excessive cpumask manipulation in a fast path if we have to > traverse multiple distances with search cost exceeding gains from latency > reduction. Hence -- keeping it simple with two level only, LLC then node > within the allowed search depth and see what that gets us. It might be > "good enough" in most cases and would be a basis for comparison against > complex approaches. > > At minimum, I expect IBM can evaluate the POWER10 aspect and I can run > an evaluation on Zen generations. > > -- > Mel Gorman > SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 06:33:55PM +0200, Michal Suchánek wrote: > On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > > > On architectures where we have multiple levels of cache access > > > > > latencies > > > > > within a DIE, (For example: one within the current LLC or SMT core > > > > > and the > > > > > other at MC or Hemisphere, and finally across hemispheres), do you > > > > > have any > > > > > suggestions on how we could handle the same in the core scheduler? > > > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > > only rely on cache > > > > > > > From topology.c > > > > SD_SHARE_PKG_RESOURCES - describes shared caches > > > > I'm guessing here because I am not familiar with power10 but the central > > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > > cache and the core assumes the last-level-cache is the only relevant one. > > It does not seem to be the case according to original description: > > When the scheduler tries to wakeup a task, it chooses between the > waker-CPU and the wakee's previous-CPU. Suppose this choice is called > the "target", then in the target's LLC domain, the scheduler > > a) tries to find an idle core in the LLC. This helps exploit the > This is the same as (b) Should this be SMT^^^ ? On POWER10, without this patch, the LLC is at the SMT sched-domain. The difference between a) and b) is that a) searches for an idle core, while b) searches for an idle CPU. > SMT folding that the wakee task can benefit from. If an idle > core is found, the wakee is woken up on it. > > b) Failing to find an idle core, the scheduler tries to find an idle > CPU in the LLC. This helps minimise the wakeup latency for the > wakee since it gets to run on the CPU immediately. > > c) Failing this, it will wake it up on target CPU. 
> > Thus, with P9-sched topology, since the CACHE domain comprises two > SMT4 cores, there is a decent chance that we get an idle core, failing > which there is a relatively higher probability of finding an idle CPU > among the 8 threads in the domain. > > However, in P10-sched topology, since the SMT domain is the LLC and it > contains only a single SMT4 core, the probability that we find that > core to be idle is less. Furthermore, since there are only 4 CPUs to > search for an idle CPU, there is a lower probability that we can get an > idle CPU to wake up the task on. > > > > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > > unintended consequences for load balancing because load within a die may > > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > > the MC level. > > Not spreading load between SMT4 domains within MC is exactly what setting LLC > at MC level would address, wouldn't it? > > As in on P10 we have two relevant levels but the topology as is describes only > one, and moving the LLC level lower gives two levels the scheduler looks at > again. Or am I missing something? This is my current understanding as well, since with this patch we would then be able to move tasks quickly between the SMT4 cores, perhaps at the expense of losing out on cache-affinity, which is why it would be good to verify this using a test/benchmark. > > Thanks > > Michal > -- Thanks and Regards gautham.
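[Editor's illustration] The a) / b) / c) wakeup order discussed above can be modelled as a small userspace toy. The helper names (`wakeup_cpu()`, `core_is_idle()`), the 8-CPU/SMT4 layout and the `cpu_busy[]` array are all made up for this sketch; it is not the kernel's actual `select_idle_sibling()` path:

```c
#include <stdbool.h>

#define THREADS_PER_CORE 4          /* SMT4 */
#define NR_CPUS          8          /* two SMT4 cores in the LLC, for this toy */

/* A core counts as idle when every one of its SMT threads is idle. */
bool core_is_idle(const bool *cpu_busy, int core)
{
        for (int t = 0; t < THREADS_PER_CORE; t++)
                if (cpu_busy[core * THREADS_PER_CORE + t])
                        return false;
        return true;
}

/*
 * Toy version of the wakeup choice described in the mail:
 *   a) an idle core in the LLC,
 *   b) failing that, any idle CPU in the LLC,
 *   c) failing that, the target CPU itself.
 */
int wakeup_cpu(const bool *cpu_busy, int target)
{
        /* a) look for a fully idle core first (SMT folding) */
        for (int core = 0; core < NR_CPUS / THREADS_PER_CORE; core++)
                if (core_is_idle(cpu_busy, core))
                        return core * THREADS_PER_CORE;

        /* b) fall back to any idle CPU in the LLC */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (!cpu_busy[cpu])
                        return cpu;

        /* c) nothing idle: stack on the target */
        return target;
}
```

With a smaller LLC (one SMT4 core instead of two), step a) fails more often, which is the regression being described.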
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Hello Mel, On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote: > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > >From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > Yes, I was aware that this describes shared caches, but this current patch was the simplest way to achieve the effect, even though the cores in the MC domain on POWER10 do not share a cache. However, it is relatively faster to transfer data across the cores within the MC domain compared to the cores outside the MC domain in the Die. > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. > On POWER we have traditionally preferred to keep the LLC at the sched-domain comprising groups of CPUs that share the L2 (since L3 is a victim cache on POWER). On POWER9, the L2 was shared by the threads of a pair of SMT4 cores, while on POWER10, L2 is shared by threads of a single SMT4 core. Thus, the current task wake-up logic would have a lower probability of finding an idle core inside an LLC since it has only one core to search in the LLC. This is why moving the LLC to the parent domain (MC), consisting of a group of SMT4 cores among which snooping the cache-data is faster, is helpful for workloads that require single-threaded performance. 
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level. Since we are adding SD_SHARE_PKG_RESOURCES to the parent of the only sched-domain (which is an SMT4 domain) that currently has this flag set, would it cause issues in spreading the load between the SMT4 domains? Are there any tests/benchmarks that can help bring this out? It could be good to understand this. > > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > > We have done a lot of changes to reduce and optimize the fast path and > > I don't think re-adding another layer in the fast path makes sense as > > you will end up unrolling the for_each_domain behind some > > static_branches. > > > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. As no new domain > is added, it's not clear to me why for_each_domain would change. > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting an idle CPU. > Indeed, so on POWER10, the preference would be: 1) an idle core in the L2 domain, 2) an idle core in the MC domain, 3) an idle CPU in the L2 domain, 4) an idle CPU in the MC domain. 
This patch is able to achieve this *implicitly* because of the way the select_idle_cpu() and the select_idle_core() are currently coded, where in the presence of idle cores in the MC level, the select_idle_core() searches for the idle core starting with the core of the target-CPU. If I understand your proposal correctly it would be to make this explicit into a two-level search where we first search in the LLC domain, failing which, we carry on the search in the rest of the die (assuming that the LLC is not in the die). > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > idea what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache.
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, 12 Apr 2021 at 17:24, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level. But the SMT4 level is still present here with select_idle_core taking care of the spreading > > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > > We have done a lot of changes to reduce and optimize the fast path and > > I don't think re-adding another layer in the fast path makes sense as > > you will end up unrolling the for_each_domain behind some > > static_branches. > > > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. 
> As no new domain > is added, it's not clear to me why for_each_domain would change. What I mean is that you should directly do for_each_sched_domain in the fast path because that is what you are proposing at the end. It no longer looks like a fast path but like a traditional load balance. > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting an idle CPU. > > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > idea what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache. > > > SD_SHARE_PKG_RESOURCES should be set to the last level where we can > > efficiently move tasks between CPUs at wakeup > > > > The definition of "efficiently" varies. 
> Moving tasks between CPUs sharing > a cache is most efficient but moving the task to a CPU that at least has > local memory channels is a reasonable option if there are no idle CPUs > sharing cache and preferable to stacking. That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me. This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to be considered in the fast path > > > > allows within the node with the LLC CPUs masked out. While there would be > > > a latency hit because cache is not shared, it would still be a CPU local > > > to memory that is idle. That would potentially be beneficial on Zen* > > > as well without having to introduce new domains in the topology hierarchy. > > > > What is the current sched_domain topology description for zen ? > > > > The cache and NUMA topologies differ slightly between each generation > of Zen. The common pattern is that a single NUMA node can have multiple > L3 caches and at one point I thought it might be reasonable to allow > spillover to select a local idle CPU instead of stacking multiple tasks > on a CPU sharing cache. I never got as far as thinking how it could be > done in a way that multiple architectures would be happy with.
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote: > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. It does not seem to be the case according to original description: When the scheduler tries to wakeup a task, it chooses between the waker-CPU and the wakee's previous-CPU. Suppose this choice is called the "target", then in the target's LLC domain, the scheduler a) tries to find an idle core in the LLC. This helps exploit the This is the same as (b) Should this be SMT^^^ ? SMT folding that the wakee task can benefit from. If an idle core is found, the wakee is woken up on it. b) Failing to find an idle core, the scheduler tries to find an idle CPU in the LLC. This helps minimise the wakeup latency for the wakee since it gets to run on the CPU immediately. c) Failing this, it will wake it up on target CPU. Thus, with P9-sched topology, since the CACHE domain comprises of two SMT4 cores, there is a decent chance that we get an idle core, failing which there is a relatively higher probability of finding an idle CPU among the 8 threads in the domain. 
However, in P10-sched topology, since the SMT domain is the LLC and it contains only a single SMT4 core, the probability that we find that core to be idle is less. Furthermore, since there are only 4 CPUs to search for an idle CPU, there is a lower probability that we can get an idle CPU to wake up the task on. > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level. Not spreading load between SMT4 domains within MC is exactly what setting LLC at MC level would address, wouldn't it? As in on P10 we have two relevant levels but the topology as is describes only one, and moving the LLC level lower gives two levels the scheduler looks at again. Or am I missing something? Thanks Michal > > > > > Minimally I think it would be worth detecting when there are multiple > > LLCs per node and detecting that in generic code as a static branch. In > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > and if no idle CPU is found then taking a second pass if the search depth > > We have done a lot of changes to reduce and optimize the fast path and > I don't think re-adding another layer in the fast path makes sense as > you will end up unrolling the for_each_domain behind some > static_branches. > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. As no new domain > is added, it's not clear to me why for_each_domain would change. > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting an idle CPU. 
> > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > idea what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache.
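[Editor's illustration] The "only 4 CPUs to search" argument above can be made concrete with a back-of-the-envelope model: if each CPU is assumed (purely for illustration) to be independently busy with probability p, the chance of finding at least one idle CPU among n is 1 - p^n, so an 8-thread CACHE domain beats a 4-thread SMT domain. The function name and the numbers are invented for this sketch; they are not measurements:

```c
/*
 * P(at least one idle CPU among n), assuming each CPU is independently
 * busy with probability p.  A toy independence model, nothing more.
 */
double prob_any_idle(double p, int n)
{
        double pn = 1.0;

        for (int i = 0; i < n; i++)
                pn *= p;        /* compute p^n without needing libm */
        return 1.0 - pn;
}
```

For example, at a hypothetical 80% per-CPU busy rate, the 4-thread domain finds an idle CPU about 59% of the time and the 8-thread domain about 83% of the time.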
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > Peter, Valentin, Vincent, Mel, etal > > > > > > On architectures where we have multiple levels of cache access latencies > > > within a DIE, (For example: one within the current LLC or SMT core and the > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > any > > > suggestions on how we could handle the same in the core scheduler? > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > only rely on cache > From topology.c SD_SHARE_PKG_RESOURCES - describes shared caches I'm guessing here because I am not familiar with power10 but the central problem appears to be when to prefer selecting a CPU sharing L2 or L3 cache and the core assumes the last-level-cache is the only relevant one. For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have unintended consequences for load balancing because load within a die may not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at the MC level. > > > > Minimally I think it would be worth detecting when there are multiple > > LLCs per node and detecting that in generic code as a static branch. In > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > and if no idle CPU is found then taking a second pass if the search depth > > We have done a lot of changes to reduce and optimize the fast path and > I don't think re-adding another layer in the fast path makes sense as > you will end up unrolling the for_each_domain behind some > static_branches. > Searching the node would only happen if a) there was enough search depth left and b) there were no idle CPUs at the LLC level. As no new domain is added, it's not clear to me why for_each_domain would change. But still, your comment reminded me that different architectures have different requirements. Power 10 appears to prefer CPU selection sharing L2 cache but desires spillover to L3 when selecting an idle CPU. 
X86 varies, it might want the Power10 approach for some families and prefer L3 spilling over to a CPU on the same node in others. S390 cares about something called books and drawers although I've no idea what it means as such and whether it has any preferences on search order. ARM has similar requirements again according to "scheduler: expose the topology of clusters and add cluster scheduler" and that one *does* add another domain. I had forgotten about the ARM patches but remembered that they were interesting because they potentially help the Zen situation but I didn't get the chance to review them before they fell off my radar again. About all I recall is that I thought the "cluster" terminology was vague. The only commonality I thought might exist is that architectures may like to define the first domain to search for an idle CPU and a second domain. Alternatively, architectures could specify a domain to search primarily but also search the next domain in the hierarchy if search depth permits. The default would be the existing behaviour -- search CPUs sharing a last-level-cache. > SD_SHARE_PKG_RESOURCES should be set to the last level where we can > efficiently move tasks between CPUs at wakeup > The definition of "efficiently" varies. Moving tasks between CPUs sharing a cache is most efficient but moving the task to a CPU that at least has local memory channels is a reasonable option if there are no idle CPUs sharing cache and preferable to stacking. > > allows within the node with the LLC CPUs masked out. While there would be > > a latency hit because cache is not shared, it would still be a CPU local > > to memory that is idle. That would potentially be beneficial on Zen* > > as well without having to introduce new domains in the topology hierarchy. > > What is the current sched_domain topology description for zen ? > The cache and NUMA topologies differ slightly between each generation of Zen. 
The common pattern is that a single NUMA node can have multiple L3 caches and at one point I thought it might be reasonable to allow spillover to select a local idle CPU instead of stacking multiple tasks on a CPU sharing cache. I never got as far as thinking how it could be done in a way that multiple architectures would be happy with. -- Mel Gorman SUSE Labs
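[Editor's illustration] The "multiple L3 caches per NUMA node" situation Mel describes, and his earlier suggestion to detect it once in generic code, reduces to a simple check over the topology maps. This userspace sketch uses invented arrays (`node_of[]`, `llc_of[]`) in place of the kernel's per-CPU topology data; in the kernel the result would presumably be cached behind a static branch rather than recomputed:

```c
#include <stdbool.h>

#define NR_CPUS 8

/*
 * Toy detection of "more than one LLC per NUMA node": true when two
 * CPUs on the same node report different LLC ids.
 */
bool node_has_multiple_llcs(const int *node_of, const int *llc_of)
{
        for (int i = 0; i < NR_CPUS; i++)
                for (int j = i + 1; j < NR_CPUS; j++)
                        if (node_of[i] == node_of[j] && llc_of[i] != llc_of[j])
                                return true;
        return false;
}
```

A Zen-like layout (one node, two L3s) trips the check; a POWER9-like layout (one LLC spanning the node's search domain) does not.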
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, 12 Apr 2021 at 11:37, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote: > > * Gautham R. Shenoy [2021-04-02 11:07:54]: > > > > > > > > To remedy this, this patch proposes that the LLC be moved to the MC > > > level which is a group of cores in one half of the chip. > > > > > > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE > > > > > > > I think marking Hemisphere as a LLC in a P10 scenario is a good idea. > > > > > While there is no cache being shared at this level, this is still the > > > level where some amount of cache-snooping takes place and it is > > > relatively faster to access the data from the caches of the cores > > > within this domain. With this change, we no longer see regressions on > > > P10 for applications which require single threaded performance. > > > > Peter, Valentin, Vincent, Mel, etal > > > > On architectures where we have multiple levels of cache access latencies > > within a DIE, (For example: one within the current LLC or SMT core and the > > other at MC or Hemisphere, and finally across hemispheres), do you have any > > suggestions on how we could handle the same in the core scheduler? I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't only rely on cache > > > > Minimally I think it would be worth detecting when there are multiple > LLCs per node and detecting that in generic code as a static branch. In > select_idle_cpu, consider taking two passes -- first on the LLC domain > and if no idle CPU is found then taking a second pass if the search depth We have done a lot of changes to reduce and optimize the fast path and I don't think re-adding another layer in the fast path makes sense as you will end up unrolling the for_each_domain behind some static_branches. SD_SHARE_PKG_RESOURCES should be set to the last level where we can efficiently move tasks between CPUs at wakeup > allows within the node with the LLC CPUs masked out. 
While there would be > a latency hit because cache is not shared, it would still be a CPU local > to memory that is idle. That would potentially be beneficial on Zen* > as well without having to introduce new domains in the topology hierarchy. What is the current sched_domain topology description for zen ? > > -- > Mel Gorman > SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 11:06:19AM +0100, Valentin Schneider wrote: > On 12/04/21 10:37, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote: > >> * Gautham R. Shenoy [2021-04-02 11:07:54]: > >> > >> > > >> > To remedy this, this patch proposes that the LLC be moved to the MC > >> > level which is a group of cores in one half of the chip. > >> > > >> > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE > >> > > >> > >> I think marking Hemisphere as a LLC in a P10 scenario is a good idea. > >> > >> > While there is no cache being shared at this level, this is still the > >> > level where some amount of cache-snooping takes place and it is > >> > relatively faster to access the data from the caches of the cores > >> > within this domain. With this change, we no longer see regressions on > >> > P10 for applications which require single threaded performance. > >> > >> Peter, Valentin, Vincent, Mel, etal > >> > >> On architectures where we have multiple levels of cache access latencies > >> within a DIE, (For example: one within the current LLC or SMT core and the > >> other at MC or Hemisphere, and finally across hemispheres), do you have any > >> suggestions on how we could handle the same in the core scheduler? > >> > > > > Minimally I think it would be worth detecting when there are multiple > > LLCs per node and detecting that in generic code as a static branch. In > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > and if no idle CPU is found then taking a second pass if the search depth > > allows within the node with the LLC CPUs masked out. > > I think that's actually a decent approach. Tying SD_SHARE_PKG_RESOURCES to > something other than pure cache topology in a generic manner is tough (as > it relies on murky, ill-defined hardware fabric properties). > Agreed. The LLC->node scan idea has been on my TODO list to try for a while. 
> Last I tried thinking about that, I stopped at having a core-to-core > latency matrix, building domains off of that, and having some knob > specifying the highest distance value below which we'd set > SD_SHARE_PKG_RESOURCES. There's a few things I 'hate' about that; for one > it makes cpus_share_cache() somewhat questionable. > And I thought about something like this too but worried it might get complex, particularly on chiplets where we do not necessarily have hardware info on latency depending on how it's wired up. It also might lead to excessive cpumask manipulation in a fast path if we have to traverse multiple distances with search cost exceeding gains from latency reduction. Hence -- keeping it simple with two level only, LLC then node within the allowed search depth and see what that gets us. It might be "good enough" in most cases and would be a basis for comparison against complex approaches. At minimum, I expect IBM can evaluate the POWER10 aspect and I can run an evaluation on Zen generations. -- Mel Gorman SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On 12/04/21 10:37, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
>> * Gautham R. Shenoy [2021-04-02 11:07:54]:
>>
>>>
>>> To remedy this, this patch proposes that the LLC be moved to the MC
>>> level, which is a group of cores in one half of the chip.
>>>
>>> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>>>
>>
>> I think marking Hemisphere as an LLC in a P10 scenario is a good idea.
>>
>>> While there is no cache being shared at this level, this is still the
>>> level where some amount of cache snooping takes place, and it is
>>> relatively faster to access data from the caches of the cores
>>> within this domain. With this change, we no longer see regressions on
>>> P10 for applications which require single-threaded performance.
>>
>> Peter, Valentin, Vincent, Mel, et al.,
>>
>> On architectures where we have multiple levels of cache access latencies
>> within a DIE (for example, one within the current LLC or SMT core,
>> another at MC or Hemisphere, and finally across hemispheres), do you
>> have any suggestions on how we could handle the same in the core
>> scheduler?
>>
>
> Minimally, I think it would be worth detecting when there are multiple
> LLCs per node and tracking that in generic code as a static branch. In
> select_idle_cpu(), consider taking two passes -- first on the LLC domain,
> and if no idle CPU is found, then taking a second pass, if the search
> depth allows, within the node with the LLC CPUs masked out.

I think that's actually a decent approach. Tying SD_SHARE_PKG_RESOURCES to
something other than pure cache topology in a generic manner is tough (as
it relies on murky, ill-defined hardware fabric properties).

Last I tried thinking about that, I stopped at having a core-to-core
latency matrix, building domains off of that, and having some knob
specifying the highest distance value below which we'd set
SD_SHARE_PKG_RESOURCES. There are a few things I 'hate' about that; for
one, it makes cpus_share_cache() somewhat questionable.

> While there would be a latency hit because cache is not shared, it would
> still be a CPU local to memory that is idle. That would potentially be
> beneficial on Zen* as well, without having to introduce new domains in
> the topology hierarchy.
>
> --
> Mel Gorman
> SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy [2021-04-02 11:07:54]:
>>
>> To remedy this, this patch proposes that the LLC be moved to the MC
>> level, which is a group of cores in one half of the chip.
>>
>> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>>
>
> I think marking Hemisphere as an LLC in a P10 scenario is a good idea.
>
>> While there is no cache being shared at this level, this is still the
>> level where some amount of cache snooping takes place, and it is
>> relatively faster to access data from the caches of the cores
>> within this domain. With this change, we no longer see regressions on
>> P10 for applications which require single-threaded performance.
>
> Peter, Valentin, Vincent, Mel, et al.,
>
> On architectures where we have multiple levels of cache access latencies
> within a DIE (for example, one within the current LLC or SMT core,
> another at MC or Hemisphere, and finally across hemispheres), do you
> have any suggestions on how we could handle the same in the core
> scheduler?
>

Minimally, I think it would be worth detecting when there are multiple
LLCs per node and tracking that in generic code as a static branch. In
select_idle_cpu(), consider taking two passes -- first on the LLC domain,
and if no idle CPU is found, then taking a second pass, if the search
depth allows, within the node with the LLC CPUs masked out. While there
would be a latency hit because cache is not shared, it would still be a
CPU local to memory that is idle. That would potentially be beneficial on
Zen* as well, without having to introduce new domains in the topology
hierarchy.

--
Mel Gorman
SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
* Gautham R. Shenoy [2021-04-02 11:07:54]:
>
> To remedy this, this patch proposes that the LLC be moved to the MC
> level, which is a group of cores in one half of the chip.
>
> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>

I think marking Hemisphere as an LLC in a P10 scenario is a good idea.

> While there is no cache being shared at this level, this is still the
> level where some amount of cache snooping takes place, and it is
> relatively faster to access data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.

Peter, Valentin, Vincent, Mel, et al.,

On architectures where we have multiple levels of cache access latencies
within a DIE (for example, one within the current LLC or SMT core, another
at MC or Hemisphere, and finally across hemispheres), do you have any
suggestions on how we could handle the same in the core scheduler?

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
(Missed cc'ing Peter in the original posting.)

On Fri, Apr 02, 2021 at 11:07:54AM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy"
>
> On POWER10 systems, the L2 cache is at the SMT4 small-core level. The
> following commits ensure that the L2 cache gets correctly discovered and
> the Last-Level-Cache (LLC) domain is set to the SMT sched-domain.
>
> 790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
> 1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
> fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
>         it generic
> 538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
> 0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2
>         cache
>
> However, with the LLC now on the SMT sched-domain, we are seeing some
> regressions in the performance of applications that require
> single-threaded performance. The reason for this is as follows:
>
> Prior to the change (we call this P9-sched below), the sched-domain
> hierarchy was:
>
> SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE
>
> where the CACHE sched-domain is defined to be the Last Level Cache (LLC).
>
> On the upstream kernel, with the aforementioned commits (P10-sched),
> the sched-domain hierarchy is:
>
> SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE
>
> with the SMT sched-domain as the LLC.
>
> When the scheduler tries to wake up a task, it chooses between the
> waker CPU and the wakee's previous CPU. Suppose this choice is called
> the "target"; then, in the target's LLC domain, the scheduler
>
> a) tries to find an idle core in the LLC. This helps exploit the
>    SMT folding that the wakee task can benefit from. If an idle
>    core is found, the wakee is woken up on it.
>
> b) Failing to find an idle core, the scheduler tries to find an idle
>    CPU in the LLC. This helps minimise the wakeup latency for the
>    wakee, since it gets to run on the CPU immediately.
>
> c) Failing this, it will wake it up on the target CPU.
>
> Thus, with the P9-sched topology, since the CACHE domain comprises two
> SMT4 cores, there is a decent chance that we get an idle core, failing
> which there is a relatively higher probability of finding an idle CPU
> among the 8 threads in the domain.
>
> However, in the P10-sched topology, since the SMT domain is the LLC and
> it contains only a single SMT4 core, the probability that we find that
> core to be idle is lower. Furthermore, since there are only 4 CPUs to
> search, there is a lower probability that we can find an idle CPU to
> wake the task up on.
>
> Thus applications which require single-threaded performance will end
> up getting woken up on a potentially busy core, even though there are
> idle cores in the system.
>
> To remedy this, this patch proposes that the LLC be moved to the MC
> level, which is a group of cores in one half of the chip.
>
> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>
> While there is no cache being shared at this level, this is still the
> level where some amount of cache snooping takes place, and it is
> relatively faster to access data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.
>
> The patch also improves the tail latencies on schbench and the
> usecs/op on "perf bench sched pipe".
>
> On a 10-core P10 system with 80 CPUs:
>
> schbench
> (https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)
>
> Values: lower is better.
> The 99th percentile is the tail latency.
>
>                    99th percentile
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. messenger    5.12-rc4      5.12-rc4
> threads          P10-sched     MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                  70 us         85 us
> 2                  81 us        101 us
> 3                  92 us        107 us
> 4                  96 us        110 us
> 5                 103 us        123 us
> 6                3412 us        122 us
> 7                1490 us        136 us
> 8                6200 us       3572 us
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Hackbench
> (perf bench sched pipe)
> Values: lower is better.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. of           5.12-rc4      5.12-rc4
> parallel         P10-sched     MC-LLC
> instances
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                24.04 us/op   18.72 us/op
> 2                24.04 us/op   18.65 us/op
> 4                24.01 us/op   18.76 us/op
> 8                24.10 us/op   19.11 us/op
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Signed-off-by: Gautham R. Shenoy
> ---
>  arch/powerpc/kernel/smp.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5a4d59a..c75dbd4 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
[RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
From: "Gautham R. Shenoy"

On POWER10 systems, the L2 cache is at the SMT4 small-core level. The
following commits ensure that the L2 cache gets correctly discovered and
the Last-Level-Cache (LLC) domain is set to the SMT sched-domain.

790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
        it generic
538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2
        cache

However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:

Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:

SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE

where the CACHE sched-domain is defined to be the Last Level Cache (LLC).

On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:

SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE

with the SMT sched-domain as the LLC.

When the scheduler tries to wake up a task, it chooses between the
waker CPU and the wakee's previous CPU. Suppose this choice is called
the "target"; then, in the target's LLC domain, the scheduler

a) tries to find an idle core in the LLC. This helps exploit the
   SMT folding that the wakee task can benefit from. If an idle
   core is found, the wakee is woken up on it.

b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee, since it gets to run on the CPU immediately.

c) Failing this, it will wake it up on the target CPU.
Thus, with the P9-sched topology, since the CACHE domain comprises two
SMT4 cores, there is a decent chance that we get an idle core, failing
which there is a relatively higher probability of finding an idle CPU
among the 8 threads in the domain.

However, in the P10-sched topology, since the SMT domain is the LLC and
it contains only a single SMT4 core, the probability that we find that
core to be idle is lower. Furthermore, since there are only 4 CPUs to
search, there is a lower probability that we can find an idle CPU to
wake the task up on.

Thus applications which require single-threaded performance will end
up getting woken up on a potentially busy core, even though there are
idle cores in the system.

To remedy this, this patch proposes that the LLC be moved to the MC
level, which is a group of cores in one half of the chip.

SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE

While there is no cache being shared at this level, this is still the
level where some amount of cache snooping takes place, and it is
relatively faster to access data from the caches of the cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single-threaded performance.

The patch also improves the tail latencies on schbench and the
usecs/op on "perf bench sched pipe".

On a 10-core P10 system with 80 CPUs:

schbench
(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)

Values: lower is better.
The 99th percentile is the tail latency.

                   99th percentile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. messenger    5.12-rc4      5.12-rc4
threads          P10-sched     MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                  70 us         85 us
2                  81 us        101 us
3                  92 us        107 us
4                  96 us        110 us
5                 103 us        123 us
6                3412 us        122 us
7                1490 us        136 us
8                6200 us       3572 us
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hackbench
(perf bench sched pipe)
Values: lower is better.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. of           5.12-rc4      5.12-rc4
parallel         P10-sched     MC-LLC
instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                24.04 us/op   18.72 us/op
2                24.04 us/op   18.65 us/op
4                24.01 us/op   18.76 us/op
8                24.10 us/op   19.11 us/op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993,7 @@ static const struct cpumask *cpu_mc_mask(int cpu)