Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
> > Also in the current P9 itself, two neighbouring core-pairs form a quad.
> > Cache latency within a quad is better than a latency to a distant core-pair.
> > Cache latency within a core pair is way better than latency within a quad.
> > So if we have only 4 threads running on a DIE all of them accessing the same
> > cache-lines, then we could probably benefit if all the tasks were to run
> > within the quad aka MC/Coregroup.
> >
>
> Did you test this? WRT load balance we do try to balance "load" over the
> different domain spans, so if you represent quads as their own MC domain,
> you would AFAICT end up spreading tasks over the quads (rather than packing
> them) when balancing at e.g. DIE level. The desired behaviour might be
> hackable with some more ASYM_PACKING, but I'm not sure I should be
> suggesting that :-)
>

Agree, load balance will try to spread the load across the quads. In my
hack, I was explicitly marking QUAD domains as !SD_PREFER_SIBLING and
relaxing a few load spreading rules when SD_PREFER_SIBLING was not set.
And this was on a slightly older kernel (without Vincent's recent load
balance overhaul).

> > I have found some benchmarks which are latency sensitive to benefit by
> > having a grouping a quad level (using kernel hacks and not backed by
> > firmware changes). Gautham also found similar results in his experiments
> > but he only used binding within the stock kernel.
> >
>
> IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
> binding thing.
>
> That's also where things get interesting (for me) because I experienced
> something similar on another arm64 platform (ThunderX1). This was more
> about cache bandwidth than cache latency, but IMO it's in the same bag of
> fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
> gave up on it given the TX1 was the only (arm64) platform where I could get
> both significant and reproducible results.
>
> Now, if you folks are seeing this on completely different hardware and have
> "real" workloads that truly benefit from this kind of domain partitioning,
> this might be another incentive to try and sort of generalize this. That's
> outside the scope of your series, but your findings give me some hope!
>
> I think what I had in mind back then was that if enough folks cared about
> it, we might get some bits added to the ACPI spec; something along the
> lines of proximity domains for the caches described in PPTT, IOW a cache
> distance matrix. I don't really know what it'll take to get there, but I
> figured I'd dump this in case someone's listening :-)
>

Very interesting.

> > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> > domain need not be LLC domain for Power.
>
> From what I understood your MC domain does seem to map to LLC; but in any
> case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
> your changes your sd_llc is gonna be SMT, and that's not going to be a very
> big mask. IMO you do want to correctly reflect your LLC situation via this
> flag to make cpus_share_cache() work properly.

I detect if the LLC is shared at BIGCORE, and if it is, then I dynamically
rename the domain as CACHE and enable SD_SHARE_PKG_RESOURCES in that
domain.

> [1]: https://linuxplumbersconf.org/event/4/contributions/484/

Thanks for the pointer.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi Srikar, Valentin,

On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider [2020-07-28 16:03:11]:
>
> [..snip..]
>
> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>
> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best results
for POWER9 in terms of cache benefits during wakeup.

On a POWER9 Boston machine, I ran a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c).

The test case creates two threads, one Producer and another Consumer. Both
work on a fairly large shared array of size 64M. In each iteration, the
Producer performs stores to 1024 random locations and wakes up the
Consumer. In its iteration, the Consumer loads from those exact 1024
locations.

We measure the number of Consumer iterations per second and the average
time for each Consumer iteration. The smaller the time, the better it is.

The following results are from pinning the Producer and Consumer to
different combinations of CPUs, to cover the small core, big core,
neighbouring big core, far-off core within the same chip, and across
chips. There is also a case where they are not affined anywhere, and we
let the scheduler wake them up correctly. We find the best results when
the Producer and Consumer are within the same L2 domain. These numbers are
also close to the numbers that we get when we let the scheduler wake them
up (where LLC is L2).

## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)
Consumer affined to CPU 3
Producer affined to CPU 1
4698 iterations, avg time: 20034 ns
4951 iterations, avg time: 20012 ns
4957 iterations, avg time: 19971 ns
4968 iterations, avg time: 19985 ns
4970 iterations, avg time: 19977 ns

## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)
Consumer affined to CPU 7
Producer affined to CPU 1
4580 iterations, avg time: 19403 ns
4851 iterations, avg time: 19373 ns
4849 iterations, avg time: 19394 ns
4856 iterations, avg time: 19394 ns
4867 iterations, avg time: 19353 ns

## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)
Producer affined to CPU 1
Consumer affined to CPU 11
4270 iterations, avg time: 24158 ns
4491 iterations, avg time: 24157 ns
4500 iterations, avg time: 24148 ns
4516 iterations, avg time: 24164 ns
4518 iterations, avg time: 24165 ns

## Any other Big-core from Same Chip (Shares L3)
Producer affined to CPU 1
Consumer affined to CPU 87
4176 iterations, avg time: 27953 ns
4417 iterations, avg time: 27925 ns
4415 iterations, avg time: 27934 ns
4417 iterations, avg time: 27983 ns
4430 iterations, avg time: 27958 ns

## Different Chips (No cache-sharing)
Consumer affined to CPU 175
Producer affined to CPU 1
3277 iterations, avg time: 50786 ns
3063 iterations, avg time: 50732 ns
2831 iterations, avg time: 50737 ns
2859 iterations, avg time: 50688 ns
2849 iterations, avg time: 50722 ns

## Without affining them (Let Scheduler wake them up appropriately)
Consumer affined to CPU 0-175
Producer affined to CPU 0-175
4821 iterations, avg time: 19412 ns
4863 iterations, avg time: 19435 ns
4855 iterations, avg time: 19381 ns
4811 iterations, avg time: 19458 ns
4892 iterations, avg time: 19429 ns

--
Thanks and Regards
gautham.
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
(+Cc Morten)

On 29/07/20 07:13, Srikar Dronamraju wrote:
> * Valentin Schneider [2020-07-28 16:03:11]:
>
> Hi Valentin,
>
> Thanks for looking into the patches.
>
>> On 27/07/20 06:32, Srikar Dronamraju wrote:
>> > Add percpu coregroup maps and masks to create coregroup domain.
>> > If a coregroup doesn't exist, the coregroup domain will be degenerated
>> > in favour of SMT/CACHE domain.
>> >
>>
>> So there's at least one arm64 platform out there with the same "pairs of
>> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
>> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
>> domain, and the whole system is covered by DIE.
>>
>> Now arguably it's not a perfect representation; DIE doesn't have
>> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
>> will impact all callsites using cpus_share_cache(): in the eMAG case, only
>> pairs of cores will be seen as sharing cache, even though *all* cores share
>> the same L3.
>>
>
> Okay, Its good to know that we have a chip which is similar to P9 in
> topology.
>
>> I'm trying to paint a picture of what the P9 topology looks like (the one
>> you showcase in your cover letter) to see if there are any similarities;
>> from what I gather in [1], wikichips and your cover letter, with P9 you can
>> have something like this in a single DIE (somewhat unsure about L3 setup;
>> it looks to be distributed?)
>>
>> +---------------------------------------------------------------------+
>> |                                  L3                                 |
>> +---------------+ +---------------+ +---------------+ +---------------+
>> |      L2       | |      L2       | |      L2       | |      L2       |
>> +-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
>> |  L1   |  L1   | |  L1   |  L1   | |  L1   |  L1   | |  L1   |  L1   |
>> +-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
>> |4 CPUs |4 CPUs | |4 CPUs |4 CPUs | |4 CPUs |4 CPUs | |4 CPUs |4 CPUs |
>> +-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
>>
>> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>>
>> NUMA    [                                                             ...
>> DIE     [                                                             ]
>> MC      [             ] [             ] [             ] [             ]
>> BIGCORE [             ] [             ] [             ] [             ]
>> SMT     [     ] [     ] [     ] [     ] [     ] [     ] [     ] [     ]
>>          00-03   04-07   08-11   12-15   16-19   20-23   24-27   28-31
>>
>
> What you have summed up is perfectly what a P9 topology looks like. I dont
> think I could have explained it better than this.
>

Yay!

>> This however has MC == BIGCORE; what makes it you can have different spans
>> for these two domains? If it's not too much to ask, I'd love to have a P9
>> topology diagram.
>>
>> [1]: 20200722081822.gg9...@linux.vnet.ibm.com
>
> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>

Right, that one's fair enough.

> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>

Did you test this? WRT load balance we do try to balance "load" over the
different domain spans, so if you represent quads as their own MC domain,
you would AFAICT end up spreading tasks over the quads (rather than packing
them) when balancing at e.g. DIE level. The desired behaviour might be
hackable with some more ASYM_PACKING, but I'm not sure I should be
suggesting that :-)

> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>

IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
binding thing.

That's also where things get interesting (for me) because I experienced
something similar on another arm64 platform (ThunderX1). This was more
about cache bandwidth than cache latency, but IMO it's in the same bag of
fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
gave up on it given the TX1 was the only (arm64) platform where I could get
both significant and reproducible results.

Now, if you folks are seeing this on completely different hardware and have
"real" workloads that truly benefit from this kind of domain partitioning,
this might be another incentive to try and sort of generalize this. That's
outside the scope of your series, but your findings give me some hope!

I think what I had in mind back then was that if enough folks cared about
it, we might get some bits added to the ACPI spec; something along the
lines of proximity domains for the caches described in PPTT, IOW a cache
distance matrix. I don't really know what it'll take to get there, but I
figured I'd dump this in case someone's listening :-)

[1]: https://linuxplumbersconf.org/event/4/contributions/484/
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
* Valentin Schneider [2020-07-28 16:03:11]:

Hi Valentin,

Thanks for looking into the patches.

> On 27/07/20 06:32, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
> >
>
> So there's at least one arm64 platform out there with the same "pairs of
> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
> domain, and the whole system is covered by DIE.
>
> Now arguably it's not a perfect representation; DIE doesn't have
> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
> will impact all callsites using cpus_share_cache(): in the eMAG case, only
> pairs of cores will be seen as sharing cache, even though *all* cores share
> the same L3.
>

Okay, Its good to know that we have a chip which is similar to P9 in
topology.

> I'm trying to paint a picture of what the P9 topology looks like (the one
> you showcase in your cover letter) to see if there are any similarities;
> from what I gather in [1], wikichips and your cover letter, with P9 you can
> have something like this in a single DIE (somewhat unsure about L3 setup;
> it looks to be distributed?)
>
> +---------------------------------------------------------------------+
> |                                  L3                                 |
> +---------------+ +---------------+ +---------------+ +---------------+
> |      L2       | |      L2       | |      L2       | |      L2       |
> +-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
> |  L1   |  L1   | |  L1   |  L1   | |  L1   |  L1   | |  L1   |  L1   |
> +-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
> |4 CPUs |4 CPUs | |4 CPUs |4 CPUs | |4 CPUs |4 CPUs | |4 CPUs |4 CPUs |
> +-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
>
> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>
> NUMA    [                                                             ...
> DIE     [                                                             ]
> MC      [             ] [             ] [             ] [             ]
> BIGCORE [             ] [             ] [             ] [             ]
> SMT     [     ] [     ] [     ] [     ] [     ] [     ] [     ] [     ]
>          00-03   04-07   08-11   12-15   16-19   20-23   24-27   28-31
>

What you have summed up is perfectly what a P9 topology looks like. I dont
think I could have explained it better than this.

> This however has MC == BIGCORE; what makes it you can have different spans
> for these two domains? If it's not too much to ask, I'd love to have a P9
> topology diagram.
>
> [1]: 20200722081822.gg9...@linux.vnet.ibm.com

At this time the current topology would be good enough i.e BIGCORE would
always be equal to a MC. However in future we could have chips that can have
lesser/larger number of CPUs in llc than in a BIGCORE or we could have
granular or split L3 caches within a DIE. In such a case BIGCORE != MC.

Also in the current P9 itself, two neighbouring core-pairs form a quad.
Cache latency within a quad is better than a latency to a distant core-pair.
Cache latency within a core pair is way better than latency within a quad.
So if we have only 4 threads running on a DIE all of them accessing the same
cache-lines, then we could probably benefit if all the tasks were to run
within the quad aka MC/Coregroup.

I have found some benchmarks which are latency sensitive to benefit by
having a grouping a quad level (using kernel hacks and not backed by
firmware changes). Gautham also found similar results in his experiments
but he only used binding within the stock kernel.

I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
domain need not be LLC domain for Power.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi,

On 27/07/20 06:32, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
>

So there's at least one arm64 platform out there with the same "pairs of
cores share L2" thing (Ampere eMAG), and that lives quite happily with the
default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
domain, and the whole system is covered by DIE.

Now arguably it's not a perfect representation; DIE doesn't have
SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
will impact all callsites using cpus_share_cache(): in the eMAG case, only
pairs of cores will be seen as sharing cache, even though *all* cores share
the same L3.

I'm trying to paint a picture of what the P9 topology looks like (the one
you showcase in your cover letter) to see if there are any similarities;
from what I gather in [1], wikichips and your cover letter, with P9 you can
have something like this in a single DIE (somewhat unsure about L3 setup;
it looks to be distributed?)

+---------------------------------------------------------------------+
|                                  L3                                 |
+---------------+ +---------------+ +---------------+ +---------------+
|      L2       | |      L2       | |      L2       | |      L2       |
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
|  L1   |  L1   | |  L1   |  L1   | |  L1   |  L1   | |  L1   |  L1   |
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+
|4 CPUs |4 CPUs | |4 CPUs |4 CPUs | |4 CPUs |4 CPUs | |4 CPUs |4 CPUs |
+-------+-------+ +-------+-------+ +-------+-------+ +-------+-------+

Which would lead to (ignoring the whole SMT CPU numbering shenanigans)

NUMA    [                                                             ...
DIE     [                                                             ]
MC      [             ] [             ] [             ] [             ]
BIGCORE [             ] [             ] [             ] [             ]
SMT     [     ] [     ] [     ] [     ] [     ] [     ] [     ] [     ]
         00-03   04-07   08-11   12-15   16-19   20-23   24-27   28-31

This however has MC == BIGCORE; what makes it you can have different spans
for these two domains? If it's not too much to ask, I'd love to have a P9
topology diagram.

[1]: 20200722081822.gg9...@linux.vnet.ibm.com
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi Srikar,

On Mon, Jul 27, 2020 at 11:02:29AM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
>
> Cc: linuxppc-dev
> Cc: LKML
> Cc: Michael Ellerman
> Cc: Nicholas Piggin
> Cc: Anton Blanchard
> Cc: Oliver O'Halloran
> Cc: Nathan Lynch
> Cc: Michael Neuling
> Cc: Gautham R Shenoy
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Valentin Schneider
> Cc: Jordan Niethe
> Signed-off-by: Srikar Dronamraju

This version looks good to me.

Reviewed-by: Gautham R. Shenoy

> ---
> Changelog v3 ->v4:
> 	if coregroup_support doesn't exist, update MC mask to the next
> 	smaller domain mask.
>
> Changelog v2 -> v3:
> 	Add optimization for mask updation under coregroup_support
>
> Changelog v1 -> v2:
> 	Moved coregroup topology fixup to fixup_topology (Gautham)
>
>  arch/powerpc/include/asm/topology.h | 10 +++
>  arch/powerpc/kernel/smp.c           | 44 +
>  arch/powerpc/mm/numa.c              |  5
>  3 files changed, 59 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
>  extern int find_and_online_cpu_nid(int cpu);
> +extern int cpu_to_coregroup_id(int cpu);
>  #else
>  static inline int find_and_online_cpu_nid(int cpu)
>  {
>  	return 0;
>  }
>
> +static inline int cpu_to_coregroup_id(int cpu)
> +{
> +#ifdef CONFIG_SMP
> +	return cpu_to_core_id(cpu);
> +#else
> +	return 0;
> +#endif
> +}
> +
>  #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
>
>  #include
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index dab96a1203ec..95f0bf72e283 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
> +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
>
>  EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> @@ -91,6 +92,7 @@ enum {
>  	smt_idx,
>  #endif
>  	bigcore_idx,
> +	mc_idx,
>  	die_idx,
>  };
>
> @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
>
> +static struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> +	return per_cpu(cpu_coregroup_map, cpu);
> +}
> +
> +static bool has_coregroup_support(void)
> +{
> +	return coregroup_enabled;
> +}
> +
> +static const struct cpumask *cpu_mc_mask(int cpu)
> +{
> +	return cpu_coregroup_mask(cpu);
> +}
> +
>  static const struct cpumask *cpu_bigcore_mask(int cpu)
>  {
>  	return per_cpu(cpu_sibling_map, cpu);
> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>  	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> +	{ cpu_mc_mask, SD_INIT_NAME(MC) },
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
> @@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  					GFP_KERNEL, cpu_to_node(cpu));
>  		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
>  					GFP_KERNEL, cpu_to_node(cpu));
> +		if (has_coregroup_support())
> +			zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
> +						GFP_KERNEL, cpu_to_node(cpu));
> +
>  #ifdef CONFIG_NEED_MULTIPLE_NODES
>  		/*
>  		 * numa_node_id() works after this.
> @@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
>  	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
>
> +	if (has_coregroup_support())
> +		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> +
>  	init_big_cores();
>  	if (has_big_cores) {
>  		cpumask_set_cpu(boot_cpuid,
> @@ -1233,6 +1258,8 @@ static void remove_cpu_from_masks(int cpu)
>  		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
>  		if (has_big_cores)
>  			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
> +		if (has_coregroup_support())
> +			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
>  	}
>  }
>  #endif
> @@ -1293,6 +1320,20 @@ static void add_cpu_to_masks(int cpu)
>  	add_cpu_to_smallcore_masks(cpu);
>  	update_mask_by_l2(cpu, cpu_l2_cache_mask);
[PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Jordan Niethe
Signed-off-by: Srikar Dronamraju
---
Changelog v3 ->v4:
	if coregroup_support doesn't exist, update MC mask to the next
	smaller domain mask.

Changelog v2 -> v3:
	Add optimization for mask updation under coregroup_support

Changelog v1 -> v2:
	Moved coregroup topology fixup to fixup_topology (Gautham)

 arch/powerpc/include/asm/topology.h | 10 +++
 arch/powerpc/kernel/smp.c           | 44 +
 arch/powerpc/mm/numa.c              |  5
 3 files changed, 59 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..6609174918ab 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)

 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
 extern int find_and_online_cpu_nid(int cpu);
+extern int cpu_to_coregroup_id(int cpu);
 #else
 static inline int find_and_online_cpu_nid(int cpu)
 {
 	return 0;
 }

+static inline int cpu_to_coregroup_id(int cpu)
+{
+#ifdef CONFIG_SMP
+	return cpu_to_core_id(cpu);
+#else
+	return 0;
+#endif
+}
+
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */

 #include
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index dab96a1203ec..95f0bf72e283 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
+DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);

 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
@@ -91,6 +92,7 @@ enum {
 	smt_idx,
 #endif
 	bigcore_idx,
+	mc_idx,
 	die_idx,
 };

@@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif

+static struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return per_cpu(cpu_coregroup_map, cpu);
+}
+
+static bool has_coregroup_support(void)
+{
+	return coregroup_enabled;
+}
+
+static const struct cpumask *cpu_mc_mask(int cpu)
+{
+	return cpu_coregroup_mask(cpu);
+}
+
 static const struct cpumask *cpu_bigcore_mask(int cpu)
 {
 	return per_cpu(cpu_sibling_map, cpu);
@@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
+	{ cpu_mc_mask, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		if (has_coregroup_support())
+			zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
+						GFP_KERNEL, cpu_to_node(cpu));
+
 #ifdef CONFIG_NEED_MULTIPLE_NODES
 		/*
 		 * numa_node_id() works after this.
@@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
 	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));

+	if (has_coregroup_support())
+		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+
 	init_big_cores();
 	if (has_big_cores) {
 		cpumask_set_cpu(boot_cpuid,
@@ -1233,6 +1258,8 @@ static void remove_cpu_from_masks(int cpu)
 		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
 		if (has_big_cores)
 			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
+		if (has_coregroup_support())
+			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
 	}
 }
 #endif
@@ -1293,6 +1320,20 @@ static void add_cpu_to_masks(int cpu)
 	add_cpu_to_smallcore_masks(cpu);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);

+	if (has_coregroup_support()) {
+		int coregroup_id = cpu_to_coregroup_id(cpu);
+
+		cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
+		for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
+			int fcpu = cpu_first_thread_sibling(i);
+