Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Hello Mel, On Mon, Apr 12, 2021 at 11:48:19AM +0100, Mel Gorman wrote: > On Mon, Apr 12, 2021 at 11:06:19AM +0100, Valentin Schneider wrote: > > On 12/04/21 10:37, Mel Gorman wrote: > > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote: > > >> * Gautham R. Shenoy [2021-04-02 11:07:54]: > > >> > > >> > > > >> > To remedy this, this patch proposes that the LLC be moved to the MC > > >> > level which is a group of cores in one half of the chip. > > >> > > > >> > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE > > >> > > > >> > > >> I think marking Hemisphere as a LLC in a P10 scenario is a good idea. > > >> > > >> > While there is no cache being shared at this level, this is still the > > >> > level where some amount of cache-snooping takes place and it is > > >> > relatively faster to access the data from the caches of the cores > > >> > within this domain. With this change, we no longer see regressions on > > >> > P10 for applications which require single threaded performance. > > >> > > >> Peter, Valentin, Vincent, Mel, etal > > >> > > >> On architectures where we have multiple levels of cache access latencies > > >> within a DIE, (For example: one within the current LLC or SMT core and > > >> the > > >> other at MC or Hemisphere, and finally across hemispheres), do you have > > >> any > > >> suggestions on how we could handle the same in the core scheduler? > > >> > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > allows within the node with the LLC CPUs masked out. > > > > I think that's actually a decent approach. 
Tying SD_SHARE_PKG_RESOURCES to > > something other than pure cache topology in a generic manner is tough (as > > it relies on murky, ill-defined hardware fabric properties). > > > > Agreed. The LLC->node scan idea has been on my TODO list to try for > a while. If you have any patches for these, I will be happy to test them on POWER10. Though, on POWER10, there will be an additional sd between the LLC and the DIE domain. > > > Last I tried thinking about that, I stopped at having a core-to-core > > latency matrix, building domains off of that, and having some knob > > specifying the highest distance value below which we'd set > > SD_SHARE_PKG_RESOURCES. There's a few things I 'hate' about that; for one > > it makes cpus_share_cache() somewhat questionable. > > > > And I thought about something like this too but worried it might get > complex, particularly on chiplets where we do not necessarily have > hardware info on latency depending on how it's wired up. It also might > lead to excessive cpumask manipulation in a fast path if we have to > traverse multiple distances with search cost exceeding gains from latency > reduction. Hence -- keeping it simple with two level only, LLC then node > within the allowed search depth and see what that gets us. It might be > "good enough" in most cases and would be a basis for comparison against > complex approaches. > > At minimum, I expect IBM can evaluate the POWER10 aspect and I can run > an evaluation on Zen generations. > > -- > Mel Gorman > SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 06:33:55PM +0200, Michal Suchánek wrote: > On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > > > On architectures where we have multiple levels of cache access > > > > > latencies > > > > > within a DIE, (For example: one within the current LLC or SMT core > > > > > and the > > > > > other at MC or Hemisphere, and finally across hemispheres), do you > > > > > have any > > > > > suggestions on how we could handle the same in the core scheduler? > > > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > > only rely on cache > > > > > > > From topology.c > > > > SD_SHARE_PKG_RESOURCES - describes shared caches > > > > I'm guessing here because I am not familiar with power10 but the central > > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > > cache and the core assumes the last-level-cache is the only relevant one. > > It does not seem to be the case according to original description: > > When the scheduler tries to wakeup a task, it chooses between the > waker-CPU and the wakee's previous-CPU. Suppose this choice is called > the "target", then in the target's LLC domain, the scheduler > > a) tries to find an idle core in the LLC. This helps exploit the > This is the same as (b) Should this be SMT^^^ ? On POWER10, without this patch, the LLC is at the SMT sched-domain. The difference between a) and b) is that a) searches for an idle core, while b) searches for an idle CPU. > SMT folding that the wakee task can benefit from. If an idle > core is found, the wakee is woken up on it. > > b) Failing to find an idle core, the scheduler tries to find an idle > CPU in the LLC. This helps minimise the wakeup latency for the > wakee since it gets to run on the CPU immediately. > > c) Failing this, it will wake it up on target CPU. 
> > Thus, with P9-sched topology, since the CACHE domain comprises two > SMT4 cores, there is a decent chance that we get an idle core, failing > which there is a relatively higher probability of finding an idle CPU > among the 8 threads in the domain. > > However, in P10-sched topology, since the SMT domain is the LLC and it > contains only a single SMT4 core, the probability that we find that > core to be idle is less. Furthermore, since there are only 4 CPUs to > search for an idle CPU, there is a lower probability that we can get an > idle CPU to wake up the task on. > > > > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > > unintended consequences for load balancing because load within a die may > > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > > the MC level. > > Not spreading load between SMT4 domains within MC is exactly what setting LLC > at MC level would address, wouldn't it? > > As in on P10 we have two relevant levels but the topology as is describes only > one, and moving the LLC level lower gives two levels the scheduler looks at > again. Or am I missing something? This is my current understanding as well, since with this patch we would then be able to move tasks quickly between the SMT4 cores, perhaps at the expense of losing out on cache-affinity, which is why it would be good to verify this using a test/benchmark. > > Thanks > > Michal > -- Thanks and Regards gautham.
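[Editor's illustration] The a) / b) / c) wakeup order discussed above can be modelled as a small userspace toy. The helper names (`wakeup_cpu()`, `core_is_idle()`), the 8-CPU/SMT4 layout and the `cpu_busy[]` array are all made up for this sketch; it is not the kernel's actual `select_idle_sibling()` path:

```c
#include <stdbool.h>

#define THREADS_PER_CORE 4          /* SMT4 */
#define NR_CPUS          8          /* two SMT4 cores in the LLC, for this toy */

/* A core counts as idle when every one of its SMT threads is idle. */
bool core_is_idle(const bool *cpu_busy, int core)
{
        for (int t = 0; t < THREADS_PER_CORE; t++)
                if (cpu_busy[core * THREADS_PER_CORE + t])
                        return false;
        return true;
}

/*
 * Toy version of the wakeup choice described in the mail:
 *   a) an idle core in the LLC,
 *   b) failing that, any idle CPU in the LLC,
 *   c) failing that, the target CPU itself.
 */
int wakeup_cpu(const bool *cpu_busy, int target)
{
        /* a) look for a fully idle core first (SMT folding) */
        for (int core = 0; core < NR_CPUS / THREADS_PER_CORE; core++)
                if (core_is_idle(cpu_busy, core))
                        return core * THREADS_PER_CORE;

        /* b) fall back to any idle CPU in the LLC */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (!cpu_busy[cpu])
                        return cpu;

        /* c) nothing idle: stack on the target */
        return target;
}
```

With a smaller LLC (one SMT4 core instead of two), step a) fails more often, which is the regression being described.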
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Hello Mel, On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote: > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > >From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > Yes, I was aware that this describes shared caches, but this current patch was the simplest way to achieve the effect, even though the cores in the MC domain on POWER10 do not share a cache. However, it is relatively faster to transfer data across the cores within the MC domain compared to the cores outside the MC domain in the Die. > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. > On POWER we have traditionally preferred to keep the LLC at the sched-domain comprising groups of CPUs that share the L2 (since L3 is a victim cache on POWER). On POWER9, the L2 was shared by the threads of a pair of SMT4 cores, while on POWER10, L2 is shared by threads of a single SMT4 core. Thus, the current task wake-up logic would have a lower probability of finding an idle core inside an LLC since it has only one core to search in the LLC. This is why moving the LLC to the parent domain (MC), consisting of a group of SMT4 cores among which snooping the cache-data is faster, is helpful for workloads that require single-threaded performance. 
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level. Since we are adding SD_SHARE_PKG_RESOURCES to the parent of the only sched-domain (which is an SMT4 domain) that currently has this flag set, would it cause issues in spreading the load between the SMT4 domains? Are there any tests/benchmarks that can help bring this out? It could be good to understand this. > > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > > We have done a lot of changes to reduce and optimize the fast path and > > I don't think re-adding another layer in the fast path makes sense as > > you will end up unrolling the for_each_domain behind some > > static_branches. > > > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. As no new domain > is added, it's not clear to me why for_each_domain would change. > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting an idle CPU. > Indeed, so on POWER10, the preference would be: 1) an idle core in the L2 domain, 2) an idle core in the MC domain, 3) an idle CPU in the L2 domain, 4) an idle CPU in the MC domain. 
This patch is able to achieve this *implicitly* because of the way the select_idle_cpu() and the select_idle_core() are currently coded, where in the presence of idle cores in the MC level, the select_idle_core() searches for the idle core starting with the core of the target-CPU. If I understand your proposal correctly it would be to make this explicit into a two-level search where we first search in the LLC domain, failing which, we carry on the search in the rest of the die (assuming that the LLC is not in the die). > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > idea what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache.
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, 12 Apr 2021 at 17:24, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level. But the SMT4 level is still present here with select_idle_core taking care of the spreading > > > > > > > Minimally I think it would be worth detecting when there are multiple > > > LLCs per node and detecting that in generic code as a static branch. In > > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > > and if no idle CPU is found then taking a second pass if the search depth > > > > We have done a lot of changes to reduce and optimize the fast path and > > I don't think re-adding another layer in the fast path makes sense as > > you will end up unrolling the for_each_domain behind some > > static_branches. > > > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. 
> As no new domain > is added, it's not clear to me why for_each_domain would change. What I mean is that you should directly do for_each_sched_domain in the fast path because that is what you are proposing at the end. It no longer looks like a fast path but like a traditional load balance. > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting an idle CPU. > > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > idea what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache. > > > SD_SHARE_PKG_RESOURCES should be set to the last level where we can > > efficiently move tasks between CPUs at wakeup > > > > The definition of "efficiently" varies. 
> Moving tasks between CPUs sharing > a cache is most efficient but moving the task to a CPU that at least has > local memory channels is a reasonable option if there are no idle CPUs > sharing cache and preferable to stacking. That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me. This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to be considered in the fast path > > > > allows within the node with the LLC CPUs masked out. While there would be > > > a latency hit because cache is not shared, it would still be a CPU local > > > to memory that is idle. That would potentially be beneficial on Zen* > > > as well without having to introduce new domains in the topology hierarchy. > > > > What is the current sched_domain topology description for zen ? > > > > The cache and NUMA topologies differ slightly between each generation > of Zen. The common pattern is that a single NUMA node can have multiple > L3 caches and at one point I thought it might be reasonable to allow > spillover to select a local idle CPU instead of stacking multiple tasks > on a CPU sharing cache. I never got as far as thinking how it could be > done in a way that multiple architectures would be happy with.
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote: > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > > Peter, Valentin, Vincent, Mel, etal > > > > > > > > On architectures where we have multiple levels of cache access latencies > > > > within a DIE, (For example: one within the current LLC or SMT core and > > > > the > > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > > any > > > > suggestions on how we could handle the same in the core scheduler? > > > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > > only rely on cache > > > > From topology.c > > SD_SHARE_PKG_RESOURCES - describes shared caches > > I'm guessing here because I am not familiar with power10 but the central > problem appears to be when to prefer selecting a CPU sharing L2 or L3 > cache and the core assumes the last-level-cache is the only relevant one. It does not seem to be the case according to original description: When the scheduler tries to wakeup a task, it chooses between the waker-CPU and the wakee's previous-CPU. Suppose this choice is called the "target", then in the target's LLC domain, the scheduler a) tries to find an idle core in the LLC. This helps exploit the This is the same as (b) Should this be SMT^^^ ? SMT folding that the wakee task can benefit from. If an idle core is found, the wakee is woken up on it. b) Failing to find an idle core, the scheduler tries to find an idle CPU in the LLC. This helps minimise the wakeup latency for the wakee since it gets to run on the CPU immediately. c) Failing this, it will wake it up on target CPU. Thus, with P9-sched topology, since the CACHE domain comprises of two SMT4 cores, there is a decent chance that we get an idle core, failing which there is a relatively higher probability of finding an idle CPU among the 8 threads in the domain. 
However, in P10-sched topology, since the SMT domain is the LLC and it contains only a single SMT4 core, the probability that we find that core to be idle is less. Furthermore, since there are only 4 CPUs to search for an idle CPU, there is a lower probability that we can get an idle CPU to wake up the task on. > > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have > unintended consequences for load balancing because load within a die may > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at > the MC level. Not spreading load between SMT4 domains within MC is exactly what setting LLC at MC level would address, wouldn't it? As in on P10 we have two relevant levels but the topology as is describes only one, and moving the LLC level lower gives two levels the scheduler looks at again. Or am I missing something? Thanks Michal > > > > > Minimally I think it would be worth detecting when there are multiple > > LLCs per node and detecting that in generic code as a static branch. In > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > and if no idle CPU is found then taking a second pass if the search depth > > We have done a lot of changes to reduce and optimize the fast path and > I don't think re-adding another layer in the fast path makes sense as > you will end up unrolling the for_each_domain behind some > static_branches. > > Searching the node would only happen if a) there was enough search depth > left and b) there were no idle CPUs at the LLC level. As no new domain > is added, it's not clear to me why for_each_domain would change. > > But still, your comment reminded me that different architectures have > different requirements > > Power 10 appears to prefer CPU selection sharing L2 cache but desires > spillover to L3 when selecting an idle CPU. 
> > X86 varies, it might want the Power10 approach for some families and prefer > L3 spilling over to a CPU on the same node in others. > > S390 cares about something called books and drawers although I've no > idea what it means as such and whether it has any preferences on > search order. > > ARM has similar requirements again according to "scheduler: expose the > topology of clusters and add cluster scheduler" and that one *does* > add another domain. > > I had forgotten about the ARM patches but remembered that they were > interesting because they potentially help the Zen situation but I didn't > get the chance to review them before they fell off my radar again. About > all I recall is that I thought the "cluster" terminology was vague. > > The only commonality I thought might exist is that architectures may > like to define the first domain to search for an idle CPU and a > second domain. Alternatively, architectures could specify a domain to > search primarily but also search the next domain in the hierarchy if > search depth permits. The default would be the existing behaviour -- > search CPUs sharing a last-level-cache.
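[Editor's illustration] The "only 4 CPUs to search" argument above can be made concrete with a back-of-the-envelope model: if each CPU is assumed (purely for illustration) to be independently busy with probability p, the chance of finding at least one idle CPU among n is 1 - p^n, so an 8-thread CACHE domain beats a 4-thread SMT domain. The function name and the numbers are invented for this sketch; they are not measurements:

```c
/*
 * P(at least one idle CPU among n), assuming each CPU is independently
 * busy with probability p.  A toy independence model, nothing more.
 */
double prob_any_idle(double p, int n)
{
        double pn = 1.0;

        for (int i = 0; i < n; i++)
                pn *= p;        /* compute p^n without needing libm */
        return 1.0 - pn;
}
```

For example, at a hypothetical 80% per-CPU busy rate, the 4-thread domain finds an idle CPU about 59% of the time and the 8-thread domain about 83% of the time.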
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote: > > > Peter, Valentin, Vincent, Mel, etal > > > > > > On architectures where we have multiple levels of cache access latencies > > > within a DIE, (For example: one within the current LLC or SMT core and the > > > other at MC or Hemisphere, and finally across hemispheres), do you have > > > any > > > suggestions on how we could handle the same in the core scheduler? > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't > only rely on cache > From topology.c SD_SHARE_PKG_RESOURCES - describes shared caches I'm guessing here because I am not familiar with power10 but the central problem appears to be when to prefer selecting a CPU sharing L2 or L3 cache and the core assumes the last-level-cache is the only relevant one. For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have unintended consequences for load balancing because load within a die may not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at the MC level. > > > > Minimally I think it would be worth detecting when there are multiple > > LLCs per node and detecting that in generic code as a static branch. In > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > and if no idle CPU is found then taking a second pass if the search depth > > We have done a lot of changes to reduce and optimize the fast path and > I don't think re-adding another layer in the fast path makes sense as > you will end up unrolling the for_each_domain behind some > static_branches. > Searching the node would only happen if a) there was enough search depth left and b) there were no idle CPUs at the LLC level. As no new domain is added, it's not clear to me why for_each_domain would change. But still, your comment reminded me that different architectures have different requirements. Power 10 appears to prefer CPU selection sharing L2 cache but desires spillover to L3 when selecting an idle CPU. 
X86 varies, it might want the Power10 approach for some families and prefer L3 spilling over to a CPU on the same node in others. S390 cares about something called books and drawers although I've no idea what it means as such and whether it has any preferences on search order. ARM has similar requirements again according to "scheduler: expose the topology of clusters and add cluster scheduler" and that one *does* add another domain. I had forgotten about the ARM patches but remembered that they were interesting because they potentially help the Zen situation but I didn't get the chance to review them before they fell off my radar again. About all I recall is that I thought the "cluster" terminology was vague. The only commonality I thought might exist is that architectures may like to define the first domain to search for an idle CPU and a second domain. Alternatively, architectures could specify a domain to search primarily but also search the next domain in the hierarchy if search depth permits. The default would be the existing behaviour -- search CPUs sharing a last-level-cache. > SD_SHARE_PKG_RESOURCES should be set to the last level where we can > efficiently move tasks between CPUs at wakeup > The definition of "efficiently" varies. Moving tasks between CPUs sharing a cache is most efficient but moving the task to a CPU that at least has local memory channels is a reasonable option if there are no idle CPUs sharing cache and preferable to stacking. > > allows within the node with the LLC CPUs masked out. While there would be > > a latency hit because cache is not shared, it would still be a CPU local > > to memory that is idle. That would potentially be beneficial on Zen* > > as well without having to introduce new domains in the topology hierarchy. > > What is the current sched_domain topology description for zen ? > The cache and NUMA topologies differ slightly between each generation of Zen. 
The common pattern is that a single NUMA node can have multiple L3 caches and at one point I thought it might be reasonable to allow spillover to select a local idle CPU instead of stacking multiple tasks on a CPU sharing cache. I never got as far as thinking how it could be done in a way that multiple architectures would be happy with. -- Mel Gorman SUSE Labs
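[Editor's illustration] The "multiple L3 caches per NUMA node" situation Mel describes, and his earlier suggestion to detect it once in generic code, reduces to a simple check over the topology maps. This userspace sketch uses invented arrays (`node_of[]`, `llc_of[]`) in place of the kernel's per-CPU topology data; in the kernel the result would presumably be cached behind a static branch rather than recomputed:

```c
#include <stdbool.h>

#define NR_CPUS 8

/*
 * Toy detection of "more than one LLC per NUMA node": true when two
 * CPUs on the same node report different LLC ids.
 */
bool node_has_multiple_llcs(const int *node_of, const int *llc_of)
{
        for (int i = 0; i < NR_CPUS; i++)
                for (int j = i + 1; j < NR_CPUS; j++)
                        if (node_of[i] == node_of[j] && llc_of[i] != llc_of[j])
                                return true;
        return false;
}
```

A Zen-like layout (one node, two L3s) trips the check; a POWER9-like layout (one LLC spanning the node's search domain) does not.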
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, 12 Apr 2021 at 11:37, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote: > > * Gautham R. Shenoy [2021-04-02 11:07:54]: > > > > > > > > To remedy this, this patch proposes that the LLC be moved to the MC > > > level which is a group of cores in one half of the chip. > > > > > > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE > > > > > > > I think marking Hemisphere as a LLC in a P10 scenario is a good idea. > > > > > While there is no cache being shared at this level, this is still the > > > level where some amount of cache-snooping takes place and it is > > > relatively faster to access the data from the caches of the cores > > > within this domain. With this change, we no longer see regressions on > > > P10 for applications which require single threaded performance. > > > > Peter, Valentin, Vincent, Mel, etal > > > > On architectures where we have multiple levels of cache access latencies > > within a DIE, (For example: one within the current LLC or SMT core and the > > other at MC or Hemisphere, and finally across hemispheres), do you have any > > suggestions on how we could handle the same in the core scheduler? I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't only rely on cache > > > > Minimally I think it would be worth detecting when there are multiple > LLCs per node and detecting that in generic code as a static branch. In > select_idle_cpu, consider taking two passes -- first on the LLC domain > and if no idle CPU is found then taking a second pass if the search depth We have done a lot of changes to reduce and optimize the fast path and I don't think re-adding another layer in the fast path makes sense as you will end up unrolling the for_each_domain behind some static_branches. SD_SHARE_PKG_RESOURCES should be set to the last level where we can efficiently move tasks between CPUs at wakeup > allows within the node with the LLC CPUs masked out. 
While there would be > a latency hit because cache is not shared, it would still be a CPU local > to memory that is idle. That would potentially be beneficial on Zen* > as well without having to introduce new domains in the topology hierarchy. What is the current sched_domain topology description for zen ? > > -- > Mel Gorman > SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 11:06:19AM +0100, Valentin Schneider wrote: > On 12/04/21 10:37, Mel Gorman wrote: > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote: > >> * Gautham R. Shenoy [2021-04-02 11:07:54]: > >> > >> > > >> > To remedy this, this patch proposes that the LLC be moved to the MC > >> > level which is a group of cores in one half of the chip. > >> > > >> > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE > >> > > >> > >> I think marking Hemisphere as a LLC in a P10 scenario is a good idea. > >> > >> > While there is no cache being shared at this level, this is still the > >> > level where some amount of cache-snooping takes place and it is > >> > relatively faster to access the data from the caches of the cores > >> > within this domain. With this change, we no longer see regressions on > >> > P10 for applications which require single threaded performance. > >> > >> Peter, Valentin, Vincent, Mel, etal > >> > >> On architectures where we have multiple levels of cache access latencies > >> within a DIE, (For example: one within the current LLC or SMT core and the > >> other at MC or Hemisphere, and finally across hemispheres), do you have any > >> suggestions on how we could handle the same in the core scheduler? > >> > > > > Minimally I think it would be worth detecting when there are multiple > > LLCs per node and detecting that in generic code as a static branch. In > > select_idle_cpu, consider taking two passes -- first on the LLC domain > > and if no idle CPU is found then taking a second pass if the search depth > > allows within the node with the LLC CPUs masked out. > > I think that's actually a decent approach. Tying SD_SHARE_PKG_RESOURCES to > something other than pure cache topology in a generic manner is tough (as > it relies on murky, ill-defined hardware fabric properties). > Agreed. The LLC->node scan idea has been on my TODO list to try for a while. 
> Last I tried thinking about that, I stopped at having a core-to-core > latency matrix, building domains off of that, and having some knob > specifying the highest distance value below which we'd set > SD_SHARE_PKG_RESOURCES. There's a few things I 'hate' about that; for one > it makes cpus_share_cache() somewhat questionable. > And I thought about something like this too but worried it might get complex, particularly on chiplets where we do not necessarily have hardware info on latency depending on how it's wired up. It also might lead to excessive cpumask manipulation in a fast path if we have to traverse multiple distances with search cost exceeding gains from latency reduction. Hence -- keeping it simple with two level only, LLC then node within the allowed search depth and see what that gets us. It might be "good enough" in most cases and would be a basis for comparison against complex approaches. At minimum, I expect IBM can evaluate the POWER10 aspect and I can run an evaluation on Zen generations. -- Mel Gorman SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On 12/04/21 10:37, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
>> * Gautham R. Shenoy [2021-04-02 11:07:54]:
>>
>>>
>>> To remedy this, this patch proposes that the LLC be moved to the MC
>>> level, which is a group of cores in one half of the chip.
>>>
>>> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>>>
>>
>> I think marking Hemisphere as an LLC in a P10 scenario is a good idea.
>>
>>> While there is no cache being shared at this level, this is still the
>>> level where some amount of cache snooping takes place, and it is
>>> relatively faster to access data from the caches of the cores
>>> within this domain. With this change, we no longer see regressions on
>>> P10 for applications which require single-threaded performance.
>>
>> Peter, Valentin, Vincent, Mel, et al.,
>>
>> On architectures where we have multiple levels of cache access latencies
>> within a DIE (for example, one within the current LLC or SMT core,
>> another at MC or Hemisphere, and finally across hemispheres), do you
>> have any suggestions on how we could handle the same in the core
>> scheduler?
>>
>
> Minimally, I think it would be worth detecting when there are multiple
> LLCs per node and tracking that in generic code as a static branch. In
> select_idle_cpu(), consider taking two passes -- first on the LLC domain,
> and if no idle CPU is found, then taking a second pass, if the search
> depth allows, within the node with the LLC CPUs masked out.

I think that's actually a decent approach. Tying SD_SHARE_PKG_RESOURCES to
something other than pure cache topology in a generic manner is tough (as
it relies on murky, ill-defined hardware fabric properties).

Last I tried thinking about that, I stopped at having a core-to-core
latency matrix, building domains off of that, and having some knob
specifying the highest distance value below which we'd set
SD_SHARE_PKG_RESOURCES. There are a few things I 'hate' about that; for
one, it makes cpus_share_cache() somewhat questionable.

> While there would be a latency hit because cache is not shared, it would
> still be a CPU local to memory that is idle. That would potentially be
> beneficial on Zen* as well, without having to introduce new domains in
> the topology hierarchy.
>
> --
> Mel Gorman
> SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy [2021-04-02 11:07:54]:
>>
>> To remedy this, this patch proposes that the LLC be moved to the MC
>> level, which is a group of cores in one half of the chip.
>>
>> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>>
>
> I think marking Hemisphere as an LLC in a P10 scenario is a good idea.
>
>> While there is no cache being shared at this level, this is still the
>> level where some amount of cache snooping takes place, and it is
>> relatively faster to access data from the caches of the cores
>> within this domain. With this change, we no longer see regressions on
>> P10 for applications which require single-threaded performance.
>
> Peter, Valentin, Vincent, Mel, et al.,
>
> On architectures where we have multiple levels of cache access latencies
> within a DIE (for example, one within the current LLC or SMT core,
> another at MC or Hemisphere, and finally across hemispheres), do you
> have any suggestions on how we could handle the same in the core
> scheduler?
>

Minimally, I think it would be worth detecting when there are multiple
LLCs per node and tracking that in generic code as a static branch. In
select_idle_cpu(), consider taking two passes -- first on the LLC domain,
and if no idle CPU is found, then taking a second pass, if the search
depth allows, within the node with the LLC CPUs masked out. While there
would be a latency hit because cache is not shared, it would still be a
CPU local to memory that is idle. That would potentially be beneficial on
Zen* as well, without having to introduce new domains in the topology
hierarchy.

--
Mel Gorman
SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
* Gautham R. Shenoy [2021-04-02 11:07:54]:
>
> To remedy this, this patch proposes that the LLC be moved to the MC
> level, which is a group of cores in one half of the chip.
>
> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>

I think marking Hemisphere as an LLC in a P10 scenario is a good idea.

> While there is no cache being shared at this level, this is still the
> level where some amount of cache snooping takes place, and it is
> relatively faster to access data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.

Peter, Valentin, Vincent, Mel, et al.,

On architectures where we have multiple levels of cache access latencies
within a DIE (for example, one within the current LLC or SMT core, another
at MC or Hemisphere, and finally across hemispheres), do you have any
suggestions on how we could handle the same in the core scheduler?

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
(Missed cc'ing Peter in the original posting.)

On Fri, Apr 02, 2021 at 11:07:54AM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy"
>
> On POWER10 systems, the L2 cache is at the SMT4 small-core level. The
> following commits ensure that the L2 cache gets correctly discovered and
> the Last-Level-Cache (LLC) domain is set to the SMT sched-domain.
>
> 790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
> 1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
> fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
>         it generic
> 538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
> 0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2
>         cache
>
> However, with the LLC now on the SMT sched-domain, we are seeing some
> regressions in the performance of applications that require
> single-threaded performance. The reason for this is as follows:
>
> Prior to the change (we call this P9-sched below), the sched-domain
> hierarchy was:
>
> SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE
>
> where the CACHE sched-domain is defined to be the Last Level Cache (LLC).
>
> On the upstream kernel, with the aforementioned commits (P10-sched),
> the sched-domain hierarchy is:
>
> SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE
>
> with the SMT sched-domain as the LLC.
>
> When the scheduler tries to wake up a task, it chooses between the
> waker CPU and the wakee's previous CPU. Suppose this choice is called
> the "target"; then, in the target's LLC domain, the scheduler
>
> a) tries to find an idle core in the LLC. This helps exploit the
>    SMT folding that the wakee task can benefit from. If an idle
>    core is found, the wakee is woken up on it.
>
> b) Failing to find an idle core, the scheduler tries to find an idle
>    CPU in the LLC. This helps minimise the wakeup latency for the
>    wakee, since it gets to run on the CPU immediately.
>
> c) Failing this, it will wake it up on the target CPU.
>
> Thus, with the P9-sched topology, since the CACHE domain comprises two
> SMT4 cores, there is a decent chance that we get an idle core, failing
> which there is a relatively higher probability of finding an idle CPU
> among the 8 threads in the domain.
>
> However, in the P10-sched topology, since the SMT domain is the LLC and
> it contains only a single SMT4 core, the probability that we find that
> core to be idle is lower. Furthermore, since there are only 4 CPUs to
> search, there is a lower probability that we can find an idle CPU to
> wake the task up on.
>
> Thus applications which require single-threaded performance will end
> up getting woken up on a potentially busy core, even though there are
> idle cores in the system.
>
> To remedy this, this patch proposes that the LLC be moved to the MC
> level, which is a group of cores in one half of the chip.
>
> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>
> While there is no cache being shared at this level, this is still the
> level where some amount of cache snooping takes place, and it is
> relatively faster to access data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.
>
> The patch also improves the tail latencies on schbench and the
> usecs/op on "perf bench sched pipe".
>
> On a 10-core P10 system with 80 CPUs:
>
> schbench
> (https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)
>
> Values: lower is better.
> The 99th percentile is the tail latency.
>
>                    99th percentile
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. messenger    5.12-rc4      5.12-rc4
> threads          P10-sched     MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                  70 us         85 us
> 2                  81 us        101 us
> 3                  92 us        107 us
> 4                  96 us        110 us
> 5                 103 us        123 us
> 6                3412 us        122 us
> 7                1490 us        136 us
> 8                6200 us       3572 us
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Hackbench
> (perf bench sched pipe)
> Values: lower is better.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. of           5.12-rc4      5.12-rc4
> parallel         P10-sched     MC-LLC
> instances
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                24.04 us/op   18.72 us/op
> 2                24.04 us/op   18.65 us/op
> 4                24.01 us/op   18.76 us/op
> 8                24.10 us/op   19.11 us/op
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Signed-off-by: Gautham R. Shenoy
> ---
>  arch/powerpc/kernel/smp.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 5a4d59a..c75dbd4 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
[RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
From: "Gautham R. Shenoy"

On POWER10 systems, the L2 cache is at the SMT4 small-core level. The
following commits ensure that the L2 cache gets correctly discovered and
the Last-Level-Cache (LLC) domain is set to the SMT sched-domain.

790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
        it generic
538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2
        cache

However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:

Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:

SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE

where the CACHE sched-domain is defined to be the Last Level Cache (LLC).

On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:

SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE

with the SMT sched-domain as the LLC.

When the scheduler tries to wake up a task, it chooses between the
waker CPU and the wakee's previous CPU. Suppose this choice is called
the "target"; then, in the target's LLC domain, the scheduler

a) tries to find an idle core in the LLC. This helps exploit the
   SMT folding that the wakee task can benefit from. If an idle
   core is found, the wakee is woken up on it.

b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee, since it gets to run on the CPU immediately.

c) Failing this, it will wake it up on the target CPU.
Thus, with the P9-sched topology, since the CACHE domain comprises two
SMT4 cores, there is a decent chance that we get an idle core, failing
which there is a relatively higher probability of finding an idle CPU
among the 8 threads in the domain.

However, in the P10-sched topology, since the SMT domain is the LLC and
it contains only a single SMT4 core, the probability that we find that
core to be idle is lower. Furthermore, since there are only 4 CPUs to
search, there is a lower probability that we can find an idle CPU to
wake the task up on.

Thus applications which require single-threaded performance will end
up getting woken up on a potentially busy core, even though there are
idle cores in the system.

To remedy this, this patch proposes that the LLC be moved to the MC
level, which is a group of cores in one half of the chip.

SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE

While there is no cache being shared at this level, this is still the
level where some amount of cache snooping takes place, and it is
relatively faster to access data from the caches of the cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single-threaded performance.

The patch also improves the tail latencies on schbench and the
usecs/op on "perf bench sched pipe".

On a 10-core P10 system with 80 CPUs:

schbench
(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)

Values: lower is better.
The 99th percentile is the tail latency.

                   99th percentile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. messenger    5.12-rc4      5.12-rc4
threads          P10-sched     MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                  70 us         85 us
2                  81 us        101 us
3                  92 us        107 us
4                  96 us        110 us
5                 103 us        123 us
6                3412 us        122 us
7                1490 us        136 us
8                6200 us       3572 us
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hackbench
(perf bench sched pipe)
Values: lower is better.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. of           5.12-rc4      5.12-rc4
parallel         P10-sched     MC-LLC
instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                24.04 us/op   18.72 us/op
2                24.04 us/op   18.65 us/op
4                24.01 us/op   18.76 us/op
8                24.10 us/op   19.11 us/op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993,7 @@ static const struct cpumask *cpu_mc_mask(int cpu)