Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
* Gautham R. Shenoy [2021-04-02 11:07:54]:

> > To remedy this, this patch proposes that the LLC be moved to the MC
> > level, which is a group of cores in one half of the chip.
> >
> > SMT (SMT4) --> MC (Hemisphere) [LLC] --> DIE
>
> I think marking the Hemisphere as an LLC in a P10 scenario is a good
> idea. While there is no cache being shared at this level, this is still
> the level where some amount of cache-snooping takes place, and it is
> relatively faster to access the data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.

Peter, Valentin, Vincent, Mel, et al.,

On architectures where we have multiple levels of cache-access latencies
within a DIE (for example: one within the current LLC or SMT core,
another at MC or Hemisphere, and finally across hemispheres), do you
have any suggestions on how we could handle the same in the core
scheduler?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Vincent Guittot [2021-03-08 14:52:39]:

> On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:

Thanks Vincent for your review comments.

> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +	/* No need to compare, if both LLCs are fully loaded */
> > +	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > +		return nr_cpumask_bits;
> > +
> > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +		return this_cpu;
>
> Why have you chosen to favor this_cpu instead of prev_cpu, unlike for
> wake_idle ?

At this point, we know the waker running on this_cpu and the wakee,
which was running on prev_cpu, are affine to each other, and that
this_cpu and prev_cpu don't share cache. I chose to move them close to
each other to benefit from the cache sharing.

Based on feedback from Peter and Rik, I made the check more
conservative, i.e. tnr_busy <= tllc_size/smt_weight (where smt_weight
is the cpumask weight of the SMT domain for this_cpu), i.e. if we have
a free core in this LLC domain, choose this_cpu. select_idle_sibling()
should then pick an idle CPU/core/SMT within the LLC domain for
this_cpu.

Do you feel this may not be the correct option?

We are also experimenting with another option, where we call
prefer_idler_cpu after wa_weight, i.e.:

1. If wake_affine_weight chooses this_cpu, but the LLC of prev_cpu has
   an idle SMT/CPU while there is no idle SMT/CPU in this_cpu's LLC,
   then choose the idle SMT/CPU in prev_cpu.

2. If wake_affine_weight chooses nr_cpumask_bits (aka prev_cpu), but
   the LLC of this_cpu has an idle SMT/CPU while there is no idle
   SMT/CPU in prev_cpu's LLC, then choose the idle SMT/CPU in this_cpu.

> > +
> > +	/* For better wakeup latency, prefer idler LLC to cache affinity */
> > +	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > +	if (!diff)
> > +		return nr_cpumask_bits;
> > +	if (diff < 0)
> > +		return this_cpu;
> > +
> > +	return prev_cpu;
> > +}
> > +
> >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > 		        int this_cpu, int prev_cpu, int sync)
> >  {
> > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >  	if (sched_feat(WA_IDLE))
> >  		target = wake_affine_idle(this_cpu, prev_cpu, sync);
> >
> > +	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > +	    !cpus_share_cache(this_cpu, prev_cpu))
> > +		target = prefer_idler_llc(this_cpu, prev_cpu, sync);
>
> could you use the same naming convention as the other functions ?
> wake_affine_llc as an example

I guess you meant s/prefer_idler_llc/wake_affine_llc/
Sure. I can modify.

> > +
> >  	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> >  		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >
> > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >  	if (target == nr_cpumask_bits)
> >  		return prev_cpu;
> >
> > -	schedstat_inc(sd->ttwu_move_affine);
> > -	schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +	if (target == this_cpu) {
>
> How is this condition related to $subject ?

Before this change, wake_affine_weight and wake_affine_idle would
either return this_cpu or nr_cpumask_bits. Just before this check, we
check if target is nr_cpumask_bits and return prev_cpu. So the stats
were only incremented when target was this_cpu.

However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
nr_cpumask_bits. Now we only need to update the stats when we have
chosen to migrate the task to this_cpu. Hence I had this check.

If we use the slightly lazier approach, which is to check for wa_weight
first before wa_idler_llc, then we may not need this change at all.

--
Thanks and Regards
Srikar Dronamraju
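The cross-multiplication in the quoted hunk compares the busy fractions of the two LLCs without any division. As a standalone illustration (hypothetical helper name, plain ints in place of the scheduler's per-CPU data, `NO_PREF` playing the role of `nr_cpumask_bits`), the same arithmetic looks like:

```c
#include <assert.h>

/* Sentinel mirroring nr_cpumask_bits: "no preference, fall through". */
#define NO_PREF (-1)

/*
 * Compare tnr_busy/tllc_size against pnr_busy/pllc_size by
 * cross-multiplying, so no division or floating point is needed.
 * A sync wakeup nudges the comparison toward this_cpu, because the
 * waker is about to sleep and stop counting as busy.
 */
static int idler_llc(int this_cpu, int prev_cpu,
                     int tnr_busy, int tllc_size,
                     int pnr_busy, int pllc_size, int sync)
{
    int diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;

    if (diff == 0)
        return NO_PREF;   /* equally loaded: let wa_weight decide */
    return diff < 0 ? this_cpu : prev_cpu;
}
```

With equal LLC sizes this reduces to comparing busy counts directly; with unequal sizes it correctly compares the busy *fractions*, which is why the sizes appear cross-multiplied.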
[PATCH v2] sched/fair: Prefer idle CPU to cache affinity
On POWER8 and POWER9, the last level cache (L2) has been at the level
of a group of 8 threads (SMT8 on POWER8, a big core comprising a pair
of SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking of the LLC domain, the probability of finding an idle CPU in
the LLC domain of the target is lower on POWER10 than on
previous-generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache"), benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning 6-8 big cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.
2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy
but only one of their LLCs has an idle CPU, the scheduler may end up
picking an LLC with no idle CPUs. To mitigate this, add a new step
between 1 and 2 where the scheduler compares idle CPUs in the waker and
previous LLCs and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.

                         5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
8CORE/1JVM  80USERS
  throughput             6651.6        6716.3 (0.97%)    6670 (0.27%)
  sys/user:time          59.75/23.86   61.77/24.55       56.34/22.65
8CORE/2JVM  80USERS
  throughput             6425.4        6446.8 (0.33%)    6627.9 (3.15%)
  sys/user:time          70.59/24.25   72.28/23.77       67.50/23.67
8CORE/4JVM  80USERS
  throughput             5355.3        5551.2 (3.66%)    5417.3 (1.58%)
  sys/user:time          76.74/21.79   76.54/22.73       74.77/21.86
8CORE/8JVM  80USERS
  throughput             4420.6        4553.3 (3.00%)    4486.2 (1.48%)
  sys/user:time          79.13/20.32   78.76/21.01       78.14/20.19

Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Parth Shah
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Rik van Riel
Co-developed-by: Gautham R Shenoy
Signed-off-by: Gautham R Shenoy
Co-developed-by: Parth Shah
Signed-off-by: Parth Shah
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
v1: http://lore.kernel.org/lkml/20210226164029.122432-1-sri...@linux.vnet.ibm.com/t/#u
- Make WA_WAKER default (Suggested by Rik)
- Make WA_WAKER check more conservative (Suggested by Rik / Peter)
- s/pllc_size/tllc_size while checking for busy case (Pointed out by Dietmar)
- Add rcu_read_lock and check for validity of shared domains

 kernel/sched/fair.c     | 57 +++--
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 4 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..492ba07e4f51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,52 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }

+#ifdef CONFIG_NO_HZ_COMMON
+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+	struct sched_domain_shared *tsds, *psds;
+	int pnr_busy, pllc_size, tnr_busy, tllc_size;
+	unsigned int smt_size = per_cpu(smt_size, this_cpu);
+	int diff;
+
+	rcu_read_lock();
+	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+	if (!tsds || !psds) {
+		rcu_read_unlock();
+		return nr_cpumask_bits;
+	}
+
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	pnr_busy = atomic_read(&psds->nr_busy_cpus);
+	rcu_read_unlock();
+
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+	pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+	/* No need to compare, if both LLCs are fully loaded */
+	if (pnr_busy == pllc_size && tnr_busy == tllc_size)
+		return nr_cpumask_bits;
+
+	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size / smt_size)
+		return this_cpu;
+
+	/* For better wakeup latency, prefer idler LLC to cache affinity */
+	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+	if (!dif
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-02 10:10:32]:

> On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > > Oh, could be, I didn't grep :/ We could have core code keep track
> > > of the smt count I suppose.
> >
> > Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
>
> cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
> you get at least one more cache miss and then the bitmap might be
> really long.
>
> Best to compute the results once and use it later.

Oh okay. Noted.

--
Thanks and Regards
Srikar Dronamraju
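Peter's "compute once, use later" point can be illustrated outside the kernel. This is only a sketch with made-up names (plain arrays stand in for per-cpu variables, an `unsigned long` for a cpumask): the sibling count is popcounted once at topology-init time, so wakeup-path code never pays for the bit scan.

```c
#include <assert.h>

/*
 * Instead of calling an O(mask-size) popcount like cpumask_weight()
 * on every wakeup, cache the SMT sibling count per CPU once, at
 * topology-init time.
 */
#define NR_CPUS 8

static unsigned long smt_mask[NR_CPUS]; /* sibling bitmask per CPU */
static int smt_size[NR_CPUS];           /* cached weight            */

static int weight(unsigned long mask)
{
    int w = 0;

    for (; mask; mask &= mask - 1)      /* clear lowest set bit */
        w++;
    return w;
}

static void topology_init(void)
{
    /* In this sketch, CPUs {0,1}, {2,3}, ... are SMT2 siblings. */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        smt_mask[cpu] = 3UL << (cpu & ~1);
        smt_size[cpu] = weight(smt_mask[cpu]);  /* paid once */
    }
}
```

Hot paths then read `smt_size[cpu]` directly, which is a single load rather than a walk over a possibly off-stack bitmap.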
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Dietmar Eggemann [2021-03-02 10:53:06]:

> On 26/02/2021 17:40, Srikar Dronamraju wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b13634..d49bfcdc4a19 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> >  	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> >  }
> >
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +	/* No need to compare, if both LLCs are fully loaded */
> > +	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
>                                                ^
> shouldn't this be tllc_size ?

Yes, thanks for pointing that out.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 18:18:28]:

> On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra [2021-03-01 16:44:42]:
> >
> > > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > > >
> > > > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > > > something like that?
> > >
> > > How about:
> > >
> > > 	tnr_busy < tllc_size / topology_max_smt_threads()
> > >
> > > ?
> >
> > Isn't topology_max_smt_threads only for x86 as of today?
> > Or am I missing out?
>
> Oh, could be, I didn't grep :/ We could have core code keep track of
> the smt count I suppose.

Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:40:33]:

> On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
>
> nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
> too.

Thanks Peter, will take care of this along with other changes,
including calling it within rcu_read_lock and checking tsds and psds
after rcu_dereference.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:44:42]:

> On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> >
> > > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +		return this_cpu;
> >
> > I wonder if we need to use a slightly lower threshold on
> > very large LLCs, both to account for the fact that the
> > select_idle_cpu code may not find the single idle CPU
> > among a dozen busy ones, or because on a system with
> > hyperthreading we may often be better off picking another
> > LLC for HT contention issues?
> >
> > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > something like that?
>
> How about:
>
> 	tnr_busy < tllc_size / topology_max_smt_threads()
>
> ?

Isn't topology_max_smt_threads only for x86 as of today?
Or am I missing out?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Rik van Riel [2021-02-27 14:56:07]:

> > In the current situation, where the waker and previous CPUs are busy
> > but only one of their LLCs has an idle CPU, the scheduler may end up
> > picking an LLC with no idle CPUs. To mitigate this, add a new step
> > between 1 and 2 where the scheduler compares idle CPUs in the waker
> > and previous LLCs and picks the appropriate one.
>
> I like that idea a lot.

Thanks.

> That could also solve some of the issues sometimes observed on
> multi-node x86 systems, and probably on the newer AMD chips with
> several LLCs on chip.

Okay.

> > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +		return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> something like that?
>
> That way we will only try to find the last 5 idle CPUs
> in a 22 CPU LLC if the other LLC also has fewer than 6
> idle cores.
>
> That might increase our chances of finding an idle CPU
> with SIS_PROP enabled, and might allow WA_WAKER to be
> true by default.

Agree, we need to be conservative, especially if we want to make
WA_WAKER on by default. I would still like to hear from other people
whether they think it's okay to enable it by default. I wonder if
enabling it by default can cause some load imbalances, leading to more
active load balancing down the line. I haven't benchmarked with
WA_WAKER enabled.

Thanks Rik for your inputs.

--
Thanks and Regards
Srikar Dronamraju
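Both thresholds discussed above are occupancy cutoffs expressed in integer arithmetic. A small sketch of the two variants (hypothetical function names; the kernel versions operate on per-CPU scheduler state rather than plain ints):

```c
#include <assert.h>
#include <stdbool.h>

/* Rik's suggestion: "fewer than 75% of the LLC's CPUs are busy",
 * written as a cross-multiplication so no floating point is needed. */
static bool under_three_quarters(int nr_busy, int llc_size)
{
    return nr_busy * 4 < llc_size * 3;
}

/*
 * Stricter variant discussed elsewhere in the thread: treat this LLC
 * as a candidate only while it still has a whole idle core, i.e.
 * fewer busy CPUs than cores (llc_size / SMT threads per core).
 */
static bool has_free_core(int nr_busy, int llc_size, int smt_threads)
{
    return nr_busy < llc_size / smt_threads;
}
```

For Rik's 22-CPU example, `under_three_quarters(16, 22)` still passes (64 < 66) while `under_three_quarters(17, 22)` does not, which is the headroom he describes.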
[PATCH] sched/fair: Prefer idle CPU to cache affinity
On POWER8 and POWER9, the last level cache (L2) has been at the level
of a group of 8 threads (SMT8 on POWER8, a big core comprising a pair
of SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking of the LLC domain, the probability of finding an idle CPU in
the LLC domain of the target is lower on POWER10 than on
previous-generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache"), benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning 6-8 big cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.
2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy
but only one of their LLCs has an idle CPU, the scheduler may end up
picking an LLC with no idle CPUs. To mitigate this, add a new step
between 1 and 2 where the scheduler compares idle CPUs in the waker and
previous LLCs and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.

                         5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
8CORE/1JVM  80USERS
  throughput             6651.6        6716.3 (0.97%)    6940 (4.34%)
  sys/user:time          59.75/23.86   61.77/24.55       60/24
8CORE/2JVM  80USERS
  throughput             6425.4        6446.8 (0.33%)    6473.2 (0.74%)
  sys/user:time          70.59/24.25   72.28/23.77       70/24
8CORE/4JVM  80USERS
  throughput             5355.3        5551.2 (3.66%)    5586.6 (4.32%)
  sys/user:time          76.74/21.79   76.54/22.73       76/22
8CORE/8JVM  80USERS
  throughput             4420.6        4553.3 (3.00%)    4405.8 (-0.33%)
  sys/user:time          79.13/20.32   78.76/21.01       79/20

Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Parth Shah
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Co-developed-by: Gautham R Shenoy
Signed-off-by: Gautham R Shenoy
Co-developed-by: Parth Shah
Signed-off-by: Parth Shah
Signed-off-by: Srikar Dronamraju
---
 kernel/sched/fair.c     | 41 +++--
 kernel/sched/features.h |  2 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..d49bfcdc4a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }

+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+	struct sched_domain_shared *tsds, *psds;
+	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
+
+	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+	pnr_busy = atomic_read(&psds->nr_busy_cpus);
+	pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+	/* No need to compare, if both LLCs are fully loaded */
+	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
+		return nr_cpumask_bits;
+
+	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
+		return this_cpu;
+
+	/* For better wakeup latency, prefer idler LLC to cache affinity */
+	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+	if (!diff)
+		return nr_cpumask_bits;
+	if (diff < 0)
+		return this_cpu;
+
+	return prev_cpu;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);

+	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
+	    !cpus_share_cache(this_cpu, prev_cpu))
+
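Stepping back from the diff (which is truncated above), the order of preference described in the changelog can be sketched as a small decision cascade. This is only an illustration: the three stub helpers stand in for `wake_affine_idle()`, `prefer_idler_llc()` and `wake_affine_weight()`, and `NO_TARGET` plays the role of `nr_cpumask_bits`.

```c
#include <assert.h>

#define NO_TARGET (-1)  /* plays the role of nr_cpumask_bits */

/*
 * Stub pickers: each returns a chosen CPU or NO_TARGET to mean
 * "no opinion, fall through to the next step".
 */
static int idle_pick   = NO_TARGET;
static int llc_pick    = NO_TARGET;
static int weight_pick = NO_TARGET;

static int wa_idle(void)   { return idle_pick; }
static int wa_llc(void)    { return llc_pick; }
static int wa_weight(void) { return weight_pick; }

/*
 * 1. idle CPU first; 1.5 (the new step) idler LLC, only when the two
 * CPUs do not already share cache; 2. load comparison; finally fall
 * back to prev_cpu.
 */
static int wake_affine(int this_cpu, int prev_cpu, int share_cache)
{
    int target = wa_idle();

    if (target == NO_TARGET && !share_cache)
        target = wa_llc();
    if (target == NO_TARGET)
        target = wa_weight();
    return target == NO_TARGET ? prev_cpu : target;
}
```

The key property is that the new step only runs when step 1 found nothing and the CPUs are in different LLCs, so the common same-LLC wakeup path is unchanged.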
Re: [PATCH] perf bench numa: Fix the condition checks for max number of numa nodes
* Athira Rajeev [2021-02-25 11:50:02]:

> In systems having higher node numbers available, like node > 255, perf
> numa bench will fail with SIGABRT.
>
> <<>>
> perf: bench/numa.c:1416: init: Assertion `!(g->p.nr_nodes > 64 ||
> g->p.nr_nodes < 0)' failed.
> Aborted (core dumped)
> <<>>

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/topology: Switch to sched_debug() for conditional sched domain printk
* Yue Hu [2021-02-03 18:10:19]:

> On Wed, 3 Feb 2021 15:22:56 +0530
> Srikar Dronamraju wrote:
>
> > * Yue Hu [2021-02-03 12:20:10]:
> >
> > sched_debug() would only be present in CONFIG_SCHED_DEBUG. Right?
> > In which case there would be a build failure with your change in a
> > !CONFIG_SCHED_DEBUG config.
> >
> > Or am I missing something?
>
> sched_debug() is also defined for !CONFIG_SCHED_DEBUG, as below:
>
> static inline bool sched_debug(void)
> {
> 	return false;
> }

Ah, right. Somehow I missed this.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/topology: Switch to sched_debug() for conditional sched domain printk
* Yue Hu [2021-02-03 12:20:10]:

> From: Yue Hu
>
> Currently, the macro sched_debug_enabled has the same function as
> sched_debug(), with return false for !SCHED_DEBUG. And sched_debug()
> is a wrapper of the variable sched_debug_enabled for SCHED_DEBUG. We
> can use sched_debug() for all cases. So, let's remove the unnecessary
> macro, and also use sched_debug() in sched_domain_debug() for code
> consistency.
>
> Signed-off-by: Yue Hu
> ---
> kernel/sched/topology.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5d3675c..402138c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -131,7 +131,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
>  {
>  	int level = 0;
>
> -	if (!sched_debug_enabled)
> +	if (!sched_debug())
>  		return;
>

sched_debug() would only be present in CONFIG_SCHED_DEBUG. Right?
In which case there would be a build failure with your change in a
!CONFIG_SCHED_DEBUG config.

Or am I missing something?

>  	if (!sd) {
> @@ -152,7 +152,6 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
>  }
>  #else /* !CONFIG_SCHED_DEBUG */
>
> -# define sched_debug_enabled 0
>  # define sched_domain_debug(sd, cpu) do { } while (0)
>  static inline bool sched_debug(void)
>  {
> @@ -2113,7 +2112,7 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  	if (has_asym)
>  		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
>
> -	if (rq && sched_debug_enabled) {
> +	if (rq && sched_debug()) {

Same as above.

>  		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
>  			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>  	}
> --
> 1.9.1

--
Thanks and Regards
Srikar Dronamraju
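The reason the `!CONFIG_SCHED_DEBUG` build still compiles, as clarified in the follow-up mail, is the static-inline stub pattern. A minimal standalone sketch of that pattern (made-up names; `MY_DEBUG` mimics `CONFIG_SCHED_DEBUG`):

```c
#include <assert.h>

/*
 * One real implementation under the config option, one static inline
 * stub without it, so call sites need no #ifdef guards and the
 * compiler can drop the dead branch entirely in non-debug builds.
 */
#define MY_DEBUG 1

#if MY_DEBUG
static int debug_enabled = 1;  /* runtime toggle, like sched_debug_enabled */
static inline int my_debug(void) { return debug_enabled; }
#else
static inline int my_debug(void) { return 0; }  /* stub: constant false */
#endif

static int dump_count;

static void domain_debug(void)
{
    if (!my_debug())
        return;        /* compiles to nothing when the stub is used */
    dump_count++;      /* stands in for the diagnostic printout */
}
```

Because the stub returns a compile-time constant, `domain_debug()` shrinks to an empty function in the non-debug configuration, which is why replacing the `sched_debug_enabled` macro with `sched_debug()` is safe in both configs.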
Re: [PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties
* Gautham R. Shenoy [2020-12-10 16:08:55]:

> From: "Gautham R. Shenoy"
>
> The "ibm,thread-groups" device-tree property is an array that is used
> to indicate if groups of threads within a core share certain
> properties. It provides details of which property is being shared by
> which groups of threads. This array can encode information about
> multiple properties being shared by different thread-groups within
> the core.

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
* Gautham R. Shenoy [2020-12-10 16:08:59]:

> From: "Gautham R. Shenoy"
>
> On POWER platforms where only some groups of threads within a core
> share the L2-cache (indicated by the ibm,thread-groups device-tree
> property), we currently print the incorrect shared_cpu_map/list for
> L2-cache in the sysfs.
>
> This patch reports the correct shared_cpu_map/list on such platforms.
>
> Example:
> On a platform with "ibm,thread-groups" set to
> 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004
> 0002 0004 0006 0001 0003 0005 0007

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache
* Gautham R. Shenoy [2020-12-10 16:08:58]:

> From: "Gautham R. Shenoy"
>
> On POWER systems, groups of threads within a core sharing the
> L2-cache can be indicated by the "ibm,thread-groups" property array
> with the identifier "2".
>
> This patch adds support for detecting this and, when present,
> populating the cpu_l2_cache_mask of every CPU with the core-siblings
> which share L2 with the CPU, as specified by the "ibm,thread-groups"
> property array.
>
> On a platform with the following "ibm,thread-group" configuration
> 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004
> 0002 0004 00000006 0001 0003 0005 0007

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
* Gautham R. Shenoy [2020-12-10 16:08:56]:

> From: "Gautham R. Shenoy"
>
> On platforms which have the "ibm,thread-groups" property, the per-cpu
> variable cpu_l1_cache_map keeps track of which group of threads
> within the same core share the L1 cache, for both instruction and
> data flow.
>
> This patch renames the variable to "thread_group_l1_cache_map" to
> make it consistent with a subsequent patch which will introduce
> thread_group_l2_cache_map.
>
> This patch introduces no functional change.

Looks good to me.

Reviewed-by: Srikar Dronamraju

> Signed-off-by: Gautham R. Shenoy

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
* Gautham R. Shenoy [2020-12-10 16:08:57]:

> From: "Gautham R. Shenoy"
>
> init_thread_group_l1_cache_map() initializes the per-cpu cpumask
> thread_group_l1_cache_map with the core-siblings which share L1 cache
> with the CPU. Make this function generic to the cache-property (L1 or
> L2) and update a suitable mask. This is a preparatory patch for the
> next patch, where we will introduce discovery of thread-groups that
> share L2-cache.
>
> No functional change.

Looks good to me.

Reviewed-by: Srikar Dronamraju

> Signed-off-by: Gautham R. Shenoy

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
* Gautham R Shenoy [2020-12-08 23:12:37]:

> > For L2 we have thread_group_l2_cache_map to store the threads from
> > the thread group, but cpu_l2_cache_map for keeping track of tasks.
> >
> > I think we should do some renaming to keep the names consistent. I
> > would say we should probably move the current cpu_l2_cache_map to
> > cpu_llc_cache_map and move the new thread_group_l2_cache_map to
> > cpu_l2_cache_map, to be somewhat consistent.
>
> Hmm.. cpu_llc_cache_map is still very generic. We want to have
> something that defines the l2 map.
>
> I agree that we need to keep it consistent. How about renaming
> cpu_l1_cache_map to thread_groups_l1_cache_map ?
>
> That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
> refer to the corresponding L1 and L2 siblings as discovered from the
> ibm,thread-groups property.

I am fine with this.

> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > > +
> > > +		if (err)
> > > +			return err;
> > > +	}
> > > +
> > > +	thread_group_shares_l2 = true;
> >
> > Why do we need a separate loop? Why can't we merge this into the
> > above loop itself?
>
> No, there are platforms where THREAD_GROUP_SHARE_L1 exists while
> THREAD_GROUP_SHARE_L2 doesn't. It becomes easier if these are
> separately tracked. Also, what do we gain if we put this in the same
> loop? It will be (nr_possible_cpus * 2 * invocations of
> init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
> invocations of init_cpu_cache_map()). Isn't it ?

It's not about the number of invocations but the per-cpu thread-group
list, which would not need to be loaded again. Currently it would
probably be in the cache line, only to be dropped and loaded again in
the next loop. And we can still support platforms with only
THREAD_GROUP_SHARE_L1, since parse_thread_groups would have given us
how many levels of thread groups are supported on a platform.

> > > +	pr_info("Thread-groups in a core share L2-cache\n");
> >
> > Can this be moved to a pr_debug? Does it help any regular
> > users/admins to know if thread-groups share the L2 cache? In fact it
> > may confuse users about what thread groups are and which thread
> > groups don't share cache. I would prefer some other name than
> > thread_group_shares_l2, but I don't know any better alternatives,
> > and maybe my choices are even worse.
>
> Would you be ok with "L2 cache shared by threads of the small core" ?

Sounds better to me. I would still think pr_debug is better, since
regular admins/users may not derive much information from this.

> > Ah, this can be simplified to:
> >
> > 	if (thread_group_shares_l2) {
> > 		cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> >
> > 		for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> > 			if (cpu_online(i))
> > 				set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > 		}
>
> Don't we want to enforce that the siblings sharing L1 be a subset of
> the siblings sharing L2 ? Or do you recommend putting in a check for
> that somewhere ?

I didn't think about the case where the device tree could show L2 to
be a subset of L1. How about initializing thread_group_l2_cache_map
itself with cpu_l1_cache_map? It would be a simple one-time operation
and would reduce the overhead here on every CPU online. And it would
help in your subsequent patch too: we don't want the cacheinfo for L1
showing CPUs not present in L2.

--
Thanks and Regards
Srikar Dronamraju
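The seeding suggestion at the end of this mail guarantees the subset property by construction. A small sketch (made-up names; an `unsigned long` bitmask stands in for a cpumask, and `dt_l2_siblings` is a hypothetical stand-in for what the device tree reports):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static unsigned long l1_map[NR_CPUS];  /* L1 thread-group siblings */
static unsigned long l2_map[NR_CPUS];  /* L2 thread-group siblings */

/*
 * Seed the L2 map with the L1 siblings, then OR in what the device
 * tree reports for L2. The "L1 is a subset of L2" property then holds
 * by construction, whatever the firmware says.
 */
static void init_l2_map(int cpu, unsigned long dt_l2_siblings)
{
    l2_map[cpu] = l1_map[cpu] | dt_l2_siblings;
}

static bool l1_subset_of_l2(int cpu)
{
    return (l1_map[cpu] & ~l2_map[cpu]) == 0;
}
```

This one-time OR at init avoids having to validate the subset relation on every CPU-online event, which is the overhead reduction argued for above.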
Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
* Gautham R Shenoy [2020-12-08 23:26:47]:

> > The drawback of this is that even if CPUs 0,2,4,6 are released, the
> > L1 cache will not be released. Is this as expected?
>
> cacheinfo populates the cache->shared_cpu_map on the basis of which
> CPUs share the common device-tree node for a particular cache. There
> is one l1-cache object in the device-tree for a CPU node
> corresponding to a big-core. That the L1 is further split between the
> threads of the core is shown using ibm,thread-groups.

Yes.

> The ideal thing would be to add a "group_leader" field to "struct
> cache" so that we can create separate cache objects, one per thread
> group. I will take a stab at this in the v2.

I am not saying this needs to be done immediately. We could add a TODO
and get it done later. Your patch is not making it worse; it's just
that there is still something more left to be done.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
* Gautham R Shenoy [2020-12-08 22:55:40]:

> > NIT:
> > tglx mentions in one of his recent comments to try to keep a
> > reverse fir-tree ordering of variables where possible.
>
> I suppose you mean moving the longer local variable declarations to
> the top and shorter ones to the bottom. Thanks. Will fix this.

Yes.

> > > +	}
> > > +
> > > +	if (!tg)
> > > +		return -EINVAL;
> > > +
> > > +	cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> >
> > This whole hunk should be moved to a new function and called before
> > init_cpu_cache_map. It will simplify the logic to a great extent.
>
> I suppose you are referring to the part where we select the correct
> tg. Yeah, that can move to a different helper.

Yes, I would prefer if we could call this new helper outside
init_cpu_cache_map.

> > > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > > -				GFP_KERNEL, cpu_to_node(cpu));
> > > +	mask = &per_cpu(cpu_l1_cache_map, cpu);
> > > +
> > > +	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> >
> > This hunk (and the next hunk) should be moved to the next patch.
>
> The next patch is only about introducing THREAD_GROUP_SHARE_L2. Hence
> I put in any other code in this patch, since it seems to be a logical
> place to collate whatever we have in a generic form.

While I am fine with it, having a pointer that always points to the
same mask looks weird.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
* Gautham R. Shenoy [2020-12-04 10:18:47]: > From: "Gautham R. Shenoy" > > > Signed-off-by: Gautham R. Shenoy > --- > > +extern bool thread_group_shares_l2; > /* > * On big-core systems, each core has two groups of CPUs each of which > * has its own L1-cache. The thread-siblings which share l1-cache with > * @cpu can be obtained via cpu_smallcore_mask(). > + * > + * On some big-core systems, the L2 cache is shared only between some > + * groups of siblings. This is already parsed and encoded in > + * cpu_l2_cache_mask(). > */ > static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct > cache *cache) > { > if (cache->level == 1) > return cpu_smallcore_mask(cpu); > + if (cache->level == 2 && thread_group_shares_l2) > + return cpu_l2_cache_mask(cpu); > > return &cache->shared_cpu_map; As pointed out by l...@intel.org, we need to do this only under CONFIG_SMP, even for cache->level == 1. I agree that we are displaying shared_cpu_map correctly. Should we also have updated/cleared shared_cpu_map in the first place? For example:- For a P9 core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7 but we would display 0,2,4,6. The drawback of this is even if cpus 0,2,4,6 are released L1 cache will not be released. Is this as expected? -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
he_map(int cpu, unsigned int > cache_property) > if (!dn) > return -ENODATA; > > - if (!(cache_property == THREAD_GROUP_SHARE_L1)) > + if (!(cache_property == THREAD_GROUP_SHARE_L1 || > + cache_property == THREAD_GROUP_SHARE_L2)) > return -EINVAL; > > if (!cpu_tgl->nr_properties) { > @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int > cache_property) > goto out; > } > > - mask = &per_cpu(cpu_l1_cache_map, cpu); > + if (cache_property == THREAD_GROUP_SHARE_L1) > + mask = &per_cpu(cpu_l1_cache_map, cpu); > + else if (cache_property == THREAD_GROUP_SHARE_L2) > + mask = &per_cpu(thread_group_l2_cache_map, cpu); > > zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu)); > > @@ -973,6 +988,16 @@ static int init_big_cores(void) > } > > has_big_cores = true; > + > + for_each_possible_cpu(cpu) { > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2); > + > + if (err) > + return err; > + } > + > + thread_group_shares_l2 = true; Why do we need a separate loop? Why can't we merge this into the above loop itself? > + pr_info("Thread-groups in a core share L2-cache\n"); Can this be moved to a pr_debug? Does it help any regular user/admins to know if thread-groups share L2 cache? In fact it may confuse users on what thread groups are and which thread groups don't share cache. I would prefer some other name than thread_group_shares_l2 but don't know any better alternatives and maybe my choices are even worse. > return 0; > } > > @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t > *mask) > if (has_big_cores) > submask_fn = cpu_smallcore_mask; > > + NIT: extra blank line? > + /* > + * If the threads in a thread-group share L2 cache, then then > + * the L2-mask can be obtained from thread_group_l2_cache_map. 
> + */ > + if (thread_group_shares_l2) { > + /* Siblings that share L1 is a subset of siblings that share > L2.*/ > + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); > + if (*mask) { > + cpumask_andnot(*mask, > +per_cpu(thread_group_l2_cache_map, cpu), > +cpu_l2_cache_mask(cpu)); > + } else { > + mask = &per_cpu(thread_group_l2_cache_map, cpu); > + } > + > + for_each_cpu(i, *mask) { > + if (!cpu_online(i)) > + continue; > + set_cpus_related(i, cpu, cpu_l2_cache_mask); > + } > + > + return true; > + } > + Ah this can be simplified to: if (thread_group_shares_l2) { cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) { if (cpu_online(i)) set_cpus_related(i, cpu, cpu_l2_cache_mask); } } No? > l2_cache = cpu_to_l2cache(cpu); > if (!l2_cache || !*mask) { > /* Assume only core siblings share cache with this CPU */ -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
; > @@ -830,11 +867,12 @@ static int init_cpu_l1_cache_map(int cpu) > goto out; > } > > - zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu), > - GFP_KERNEL, cpu_to_node(cpu)); > + mask = &per_cpu(cpu_l1_cache_map, cpu); > + > + zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu)); > This hunk (and the next hunk) should be moved to next patch. > for (i = first_thread; i < first_thread + threads_per_core; i++) { > - int i_group_start = get_cpu_thread_group_start(i, &tg); > + int i_group_start = get_cpu_thread_group_start(i, tg); > > if (unlikely(i_group_start == -1)) { > WARN_ON_ONCE(1); > @@ -843,7 +881,7 @@ static int init_cpu_l1_cache_map(int cpu) > } > > if (i_group_start == cpu_group_start) > - cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu)); > + cpumask_set_cpu(i, *mask); > } > > out: > @@ -924,7 +962,7 @@ static int init_big_cores(void) > int cpu; > > for_each_possible_cpu(cpu) { > - int err = init_cpu_l1_cache_map(cpu); > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1); > > if (err) > return err; > -- > 1.9.4 > -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 2/3] x86/uprobes: Fix not using prefixes.nbytes for loop over prefixes.bytes
* Masami Hiramatsu [2020-12-02 17:51:16]: > Since the insn.prefixes.nbytes can be bigger than the size of > insn.prefixes.bytes[] when a same prefix is repeated, we have to > check whether the insn.prefixes.bytes[i] != 0 and i < 4 instead > of insn.prefixes.nbytes. > > Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove > uprobes breakpoints") > Cc: sta...@vger.kernel.org > Reported-by: Kees Cook > Signed-off-by: Masami Hiramatsu Looks good to me. Reviewed-by: Srikar Dronamraju > --- > arch/x86/kernel/uprobes.c |4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c > index 3fdaa042823d..bb3ea3705b99 100644 > --- a/arch/x86/kernel/uprobes.c > +++ b/arch/x86/kernel/uprobes.c > @@ -257,7 +257,7 @@ static bool is_prefix_bad(struct insn *insn) > { > int i; > > - for (i = 0; i < insn->prefixes.nbytes; i++) { > + for (i = 0; insn->prefixes.bytes[i] && i < 4; i++) { > insn_attr_t attr; > > attr = inat_get_opcode_attribute(insn->prefixes.bytes[i]); > @@ -746,7 +746,7 @@ static int branch_setup_xol_ops(struct arch_uprobe > *auprobe, struct insn *insn) >* Intel and AMD behavior differ in 64-bit mode: Intel ignores 66 > prefix. >* No one uses these insns, reject any branch insns with such prefix. >*/ > - for (i = 0; i < insn->prefixes.nbytes; i++) { > + for (i = 0; insn->prefixes.bytes[i] && i < 4; i++) { > if (insn->prefixes.bytes[i] == 0x66) > return -ENOTSUPP; > } > -- Thanks and Regards Srikar Dronamraju
[PATCH v2 3/4] powerpc: Reintroduce is_kvm_guest in a new avatar
Introduce a static branch that would be set during boot if the OS happens to be a KVM guest. Subsequent checks to see if we are on KVM will rely on this static branch. This static branch would be used in vcpu_is_preempted in a subsequent patch. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- arch/powerpc/include/asm/kvm_guest.h | 10 ++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 2 ++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index ba8291e02ba9..627ba272e781 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,8 +7,18 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +#include <linux/jump_label.h> + +DECLARE_STATIC_KEY_FALSE(kvm_guest); + +static inline bool is_kvm_guest(void) +{ + return static_branch_unlikely(&kvm_guest); +} + bool check_kvm_guest(void); #else +static inline bool is_kvm_guest(void) { return false; } static inline bool check_kvm_guest(void) { return false; } #endif diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 6fba06b6cfdb..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index 0aeb6a5b1a9e..28498fc573f2 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -22,6 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); 
#endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +DEFINE_STATIC_KEY_FALSE(kvm_guest); bool check_kvm_guest(void) { struct device_node *hyper_node; @@ -33,6 +34,7 @@ bool check_kvm_guest(void) if (!of_device_is_compatible(hyper_node, "linux,kvm")) return 0; + static_branch_enable(&kvm_guest); return 1; } #endif -- 2.18.4
[PATCH v2 4/4] powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
If it's a shared LPAR but not a KVM guest, then see if the vCPU is related to the calling vCPU. On PowerVM, only cores can be preempted. So if one vCPU is in a non-preempted state, we can decipher that all other vCPUs sharing the same core are in non-preempted state. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- arch/powerpc/include/asm/paravirt.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 9362c94fe3aa..edc08f04aef7 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -10,6 +10,9 @@ #endif #ifdef CONFIG_PPC_SPLPAR +#include <asm/cputhreads.h> +#include <asm/kvm_guest.h> + DECLARE_STATIC_KEY_FALSE(shared_processor); static inline bool is_shared_processor(void) @@ -74,6 +77,21 @@ static inline bool vcpu_is_preempted(int cpu) { if (!is_shared_processor()) return false; + +#ifdef CONFIG_PPC_SPLPAR + if (!is_kvm_guest()) { + int first_cpu = cpu_first_thread_sibling(smp_processor_id()); + + /* +* Preemption can only happen at core granularity. This CPU +* is not preempted if one of the CPU of this core is not +* preempted. +*/ + if (cpu_first_thread_sibling(cpu) == first_cpu) + return false; + } +#endif + if (yield_count_of(cpu) & 1) return true; return false; -- 2.18.4
[PATCH v2 1/4] powerpc: Refactor is_kvm_guest declaration to new header
Only code/declaration movement, in anticipation of doing a kvm-aware vcpu_is_preempted. No additional changes. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- Changelog: v1->v2: v1: https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u - Moved a hunk to fix a no previous prototype warning reported by: l...@intel.com https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/ arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 15 +++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 1 + arch/powerpc/platforms/pseries/smp.c | 1 + 5 files changed, 18 insertions(+), 7 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 0b295bdb201e..aa6a5ef5d483 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -134,12 +134,6 @@ extern int ibm_nmi_interlock_token; extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup; -#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); -#else -static inline bool is_kvm_guest(void) { return false; } -#endif - #ifdef CONFIG_PPC_PSERIES void pseries_probe_fw_features(void); #else diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h new file mode 100644 index ..c0ace884a0e8 --- /dev/null +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2020 IBM Corporation + */ + +#ifndef __POWERPC_KVM_GUEST_H__ +#define __POWERPC_KVM_GUEST_H__ + +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +bool 
is_kvm_guest(void); +#else +static inline bool is_kvm_guest(void) { return false; } +#endif + +#endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 744612054c94..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -8,7 +8,7 @@ #ifndef __POWERPC_KVM_PARA_H__ #define __POWERPC_KVM_PARA_H__ -#include <asm/firmware.h> +#include <asm/kvm_guest.h> #include <uapi/asm/kvm_para.h> diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index fe48d319d490..5f48e5ad24cd 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -14,6 +14,7 @@ #include #include +#include <asm/kvm_guest.h> #ifdef CONFIG_PPC64 unsigned long powerpc_firmware_features __read_mostly; diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index 92922491a81c..d578732c545d 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -42,6 +42,7 @@ #include #include #include +#include <asm/kvm_guest.h> #include "pseries.h" -- 2.18.4
[PATCH v2 2/4] powerpc: Rename is_kvm_guest to check_kvm_guest
is_kvm_guest() will be reused in subsequent patch in a new avatar. Hence rename is_kvm_guest to check_kvm_guest. No additional changes. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- arch/powerpc/include/asm/kvm_guest.h | 4 ++-- arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 2 +- arch/powerpc/platforms/pseries/smp.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index c0ace884a0e8..ba8291e02ba9 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,9 +7,9 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); +bool check_kvm_guest(void); #else -static inline bool is_kvm_guest(void) { return false; } +static inline bool check_kvm_guest(void) { return false; } #endif #endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index abe1b5e82547..6fba06b6cfdb 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index 5f48e5ad24cd..0aeb6a5b1a9e 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -22,7 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); #endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void) +bool check_kvm_guest(void) { struct device_node 
*hyper_node; diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index d578732c545d..c70b4be9f0a5 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -211,7 +211,7 @@ static __init void pSeries_smp_probe(void) if (!cpu_has_feature(CPU_FTR_SMT)) return; - if (is_kvm_guest()) { + if (check_kvm_guest()) { /* * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp * faults to the hypervisor which then reads the instruction -- 2.18.4
[PATCH v2 0/4] Powerpc: Better preemption for shared processor
Currently, vcpu_is_preempted will return the yield_count for shared_processor. On a PowerVM LPAR, Phyp schedules at SMT8 core boundary i.e all CPUs belonging to a core are either group scheduled in or group scheduled out. This can be used to better predict non-preempted CPUs on PowerVM shared LPARs. perf stat -r 5 -a perf bench sched pipe -l 1000 (lesser time is better) powerpc/next 35,107,951.20 msec cpu-clock # 255.898 CPUs utilized ( +- 0.31% ) 23,655,348 context-switches #0.674 K/sec ( +- 3.72% ) 14,465 cpu-migrations#0.000 K/sec ( +- 5.37% ) 82,463 page-faults #0.002 K/sec ( +- 8.40% ) 1,127,182,328,206 cycles#0.032 GHz ( +- 1.60% ) (66.67%) 78,587,300,622 stalled-cycles-frontend #6.97% frontend cycles idle ( +- 0.08% ) (50.01%) 654,124,218,432 stalled-cycles-backend# 58.03% backend cycles idle ( +- 1.74% ) (50.01%) 834,013,059,242 instructions #0.74 insn per cycle #0.78 stalled cycles per insn ( +- 0.73% ) (66.67%) 132,911,454,387 branches #3.786 M/sec ( +- 0.59% ) (50.00%) 2,890,882,143 branch-misses #2.18% of all branches ( +- 0.46% ) (50.00%) 137.195 +- 0.419 seconds time elapsed ( +- 0.31% ) powerpc/next + patchset 29,981,702.64 msec cpu-clock # 255.881 CPUs utilized ( +- 1.30% ) 40,162,456 context-switches #0.001 M/sec ( +- 0.01% ) 1,110 cpu-migrations#0.000 K/sec ( +- 5.20% ) 62,616 page-faults #0.002 K/sec ( +- 3.93% ) 1,430,030,626,037 cycles#0.048 GHz ( +- 1.41% ) (66.67%) 83,202,707,288 stalled-cycles-frontend #5.82% frontend cycles idle ( +- 0.75% ) (50.01%) 744,556,088,520 stalled-cycles-backend# 52.07% backend cycles idle ( +- 1.39% ) (50.01%) 940,138,418,674 instructions #0.66 insn per cycle #0.79 stalled cycles per insn ( +- 0.51% ) (66.67%) 146,452,852,283 branches #4.885 M/sec ( +- 0.80% ) (50.00%) 3,237,743,996 branch-misses #2.21% of all branches ( +- 1.18% ) (50.01%) 117.17 +- 1.52 seconds time elapsed ( +- 1.30% ) This is around 14.6% improvement in performance. 
Changelog: v1->v2: v1: https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u - Rebased to 27th Nov linuxppc/merge tree. - Moved a hunk to fix a no previous prototype warning reported by: l...@intel.com https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/ Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Srikar Dronamraju (4): powerpc: Refactor is_kvm_guest declaration to new header powerpc: Rename is_kvm_guest to check_kvm_guest powerpc: Reintroduce is_kvm_guest powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 25 + arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/include/asm/paravirt.h | 18 ++ arch/powerpc/kernel/firmware.c | 5 - arch/powerpc/platforms/pseries/smp.c | 3 ++- 6 files changed, 50 insertions(+), 9 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h -- 2.18.4
Re: powerpc64-linux-ld: arch/powerpc/mm/numa.o:undefined reference to `coregroup_enabled'
* Randy Dunlap [2020-11-19 15:51:29]: > On 11/19/20 10:26 AM, kernel test robot wrote: > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > master > > head: c2e7554e1b85935d962127efa3c2a76483b0b3b6 > > commit: f9f130ff2ec93c5949576bbfb168cc9530c23649 powerpc/numa: Detect > > support for coregroup > > date: 9 weeks ago > > config: powerpc64-randconfig-c003-20201104 (attached as .config) > > compiler: powerpc64-linux-gcc (GCC) 9.3.0 > > reproduce (this is a W=1 build): > > wget > > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O > > ~/bin/make.cross > > chmod +x ~/bin/make.cross > > # > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f9f130ff2ec93c5949576bbfb168cc9530c23649 > > git remote add linus > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > git fetch --no-tags linus master > > git checkout f9f130ff2ec93c5949576bbfb168cc9530c23649 > > # save the attached .config to linux build tree > > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross > > ARCH=powerpc64 > > > > If you fix the issue, kindly add following tag as appropriate > > Reported-by: kernel test robot > > > > All errors (new ones prefixed by >>): > > > > > > powerpc64-linux-ld: arch/powerpc/mm/numa.o:(.toc+0x0): undefined > > > > reference to `coregroup_enabled' > > powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x0): undefined reference to > > `node_reclaim_distance' > > powerpc64-linux-ld: mm/khugepaged.o:(.toc+0x0): undefined reference to > > `node_reclaim_distance' > > Hm, OK. > CONFIG_NUMA=y > # CONFIG_SMP is not set > > Michael, Gautham, does anyone care about this config combination? > I can add #ifdef CONFIG_SMP where coregroup_enabled is being accessed but I do feel CONFIG_NUMA but !CONFIG_SMP may not be a valid combination. > > Thanks. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 0/4] Powerpc: Better preemption for shared processor
* Waiman Long [2020-10-28 20:01:30]: > > Srikar Dronamraju (4): > >powerpc: Refactor is_kvm_guest declaration to new header > >powerpc: Rename is_kvm_guest to check_kvm_guest > >powerpc: Reintroduce is_kvm_guest > >powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted > > > > arch/powerpc/include/asm/firmware.h | 6 -- > > arch/powerpc/include/asm/kvm_guest.h | 25 + > > arch/powerpc/include/asm/kvm_para.h | 2 +- > > arch/powerpc/include/asm/paravirt.h | 18 ++ > > arch/powerpc/kernel/firmware.c | 5 - > > arch/powerpc/platforms/pseries/smp.c | 3 ++- > > 6 files changed, 50 insertions(+), 9 deletions(-) > > create mode 100644 arch/powerpc/include/asm/kvm_guest.h > > > This patch series looks good to me and the performance is nice too. > > Acked-by: Waiman Long Thank you. > > Just curious, is the performance mainly from the use of static_branch > (patches 1 - 3) or from reducing call to yield_count_of(). Because of the reduced call to yield_count > > Cheers, > Longman > -- Thanks and Regards Srikar Dronamraju
[PATCH 2/4] powerpc: Rename is_kvm_guest to check_kvm_guest
is_kvm_guest() will be reused in subsequent patch in a new avatar. Hence rename is_kvm_guest to check_kvm_guest. No additional changes. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/kvm_guest.h | 4 ++-- arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 2 +- arch/powerpc/platforms/pseries/smp.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index c0ace884a0e8..ba8291e02ba9 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,9 +7,9 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); +bool check_kvm_guest(void); #else -static inline bool is_kvm_guest(void) { return false; } +static inline bool check_kvm_guest(void) { return false; } #endif #endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index abe1b5e82547..6fba06b6cfdb 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index fe48d319d490..61243267d4cf 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -21,7 +21,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); #endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void) +bool check_kvm_guest(void) { struct device_node *hyper_node; diff 
--git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index d578732c545d..c70b4be9f0a5 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -211,7 +211,7 @@ static __init void pSeries_smp_probe(void) if (!cpu_has_feature(CPU_FTR_SMT)) return; - if (is_kvm_guest()) { + if (check_kvm_guest()) { /* * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp * faults to the hypervisor which then reads the instruction -- 2.18.4
[PATCH 1/4] powerpc: Refactor is_kvm_guest declaration to new header
Only code/declaration movement, in anticipation of doing a kvm-aware vcpu_is_preempted. No additional changes. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 15 +++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/platforms/pseries/smp.c | 1 + 4 files changed, 17 insertions(+), 7 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 0b295bdb201e..aa6a5ef5d483 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -134,12 +134,6 @@ extern int ibm_nmi_interlock_token; extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup; -#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); -#else -static inline bool is_kvm_guest(void) { return false; } -#endif - #ifdef CONFIG_PPC_PSERIES void pseries_probe_fw_features(void); #else diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h new file mode 100644 index ..c0ace884a0e8 --- /dev/null +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2020 IBM Corporation + */ + +#ifndef __POWERPC_KVM_GUEST_H__ +#define __POWERPC_KVM_GUEST_H__ + +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +bool is_kvm_guest(void); +#else +static inline bool is_kvm_guest(void) { return false; } +#endif + +#endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 744612054c94..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -8,7 +8,7 @@ #ifndef 
__POWERPC_KVM_PARA_H__ #define __POWERPC_KVM_PARA_H__ -#include <asm/firmware.h> +#include <asm/kvm_guest.h> #include <uapi/asm/kvm_para.h> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index 92922491a81c..d578732c545d 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -42,6 +42,7 @@ #include #include #include +#include <asm/kvm_guest.h> #include "pseries.h" -- 2.18.4
[PATCH 4/4] powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
If its a shared lpar but not a KVM guest, then see if the vCPU is related to the calling vCPU. On PowerVM, only cores can be preempted. So if one vCPU is a non-preempted state, we can decipher that all other vCPUs sharing the same core are in non-preempted state. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/paravirt.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 9362c94fe3aa..edc08f04aef7 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -10,6 +10,9 @@ #endif #ifdef CONFIG_PPC_SPLPAR +#include +#include + DECLARE_STATIC_KEY_FALSE(shared_processor); static inline bool is_shared_processor(void) @@ -74,6 +77,21 @@ static inline bool vcpu_is_preempted(int cpu) { if (!is_shared_processor()) return false; + +#ifdef CONFIG_PPC_SPLPAR + if (!is_kvm_guest()) { + int first_cpu = cpu_first_thread_sibling(smp_processor_id()); + + /* +* Preemption can only happen at core granularity. This CPU +* is not preempted if one of the CPU of this core is not +* preempted. +*/ + if (cpu_first_thread_sibling(cpu) == first_cpu) + return false; + } +#endif + if (yield_count_of(cpu) & 1) return true; return false; -- 2.18.4
[PATCH 3/4] powerpc: Reintroduce is_kvm_guest in a new avatar
Introduce a static branch that would be set during boot if the OS happens to be a KVM guest. Subsequent checks to see if we are on KVM will rely on this static branch. This static branch would be used in vcpu_is_preempted in a subsequent patch. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/kvm_guest.h | 10 ++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index ba8291e02ba9..627ba272e781 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,8 +7,18 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +#include + +DECLARE_STATIC_KEY_FALSE(kvm_guest); + +static inline bool is_kvm_guest(void) +{ + return static_branch_unlikely(_guest); +} + bool check_kvm_guest(void); #else +static inline bool is_kvm_guest(void) { return false; } static inline bool check_kvm_guest(void) { return false; } #endif diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 6fba06b6cfdb..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index 61243267d4cf..28498fc573f2 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -14,6 +14,7 @@ #include #include +#include #ifdef CONFIG_PPC64 unsigned long 
powerpc_firmware_features __read_mostly; @@ -21,6 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); #endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +DEFINE_STATIC_KEY_FALSE(kvm_guest); bool check_kvm_guest(void) { struct device_node *hyper_node; @@ -32,6 +34,7 @@ bool check_kvm_guest(void) if (!of_device_is_compatible(hyper_node, "linux,kvm")) return 0; + static_branch_enable(&kvm_guest); return 1; } #endif -- 2.18.4
[PATCH 0/4] Powerpc: Better preemption for shared processor
Currently, vcpu_is_preempted will return the yield_count for shared_processor. On a PowerVM LPAR, Phyp schedules at SMT8 core boundary i.e all CPUs belonging to a core are either group scheduled in or group scheduled out. This can be used to better predict non-preempted CPUs on PowerVM shared LPARs. perf stat -r 5 -a perf bench sched pipe -l 1000 (lesser time is better) powerpc/next 35,107,951.20 msec cpu-clock # 255.898 CPUs utilized ( +- 0.31% ) 23,655,348 context-switches #0.674 K/sec ( +- 3.72% ) 14,465 cpu-migrations#0.000 K/sec ( +- 5.37% ) 82,463 page-faults #0.002 K/sec ( +- 8.40% ) 1,127,182,328,206 cycles#0.032 GHz ( +- 1.60% ) (66.67%) 78,587,300,622 stalled-cycles-frontend #6.97% frontend cycles idle ( +- 0.08% ) (50.01%) 654,124,218,432 stalled-cycles-backend# 58.03% backend cycles idle ( +- 1.74% ) (50.01%) 834,013,059,242 instructions #0.74 insn per cycle #0.78 stalled cycles per insn ( +- 0.73% ) (66.67%) 132,911,454,387 branches #3.786 M/sec ( +- 0.59% ) (50.00%) 2,890,882,143 branch-misses #2.18% of all branches ( +- 0.46% ) (50.00%) 137.195 +- 0.419 seconds time elapsed ( +- 0.31% ) powerpc/next + patchset 29,981,702.64 msec cpu-clock # 255.881 CPUs utilized ( +- 1.30% ) 40,162,456 context-switches #0.001 M/sec ( +- 0.01% ) 1,110 cpu-migrations#0.000 K/sec ( +- 5.20% ) 62,616 page-faults #0.002 K/sec ( +- 3.93% ) 1,430,030,626,037 cycles#0.048 GHz ( +- 1.41% ) (66.67%) 83,202,707,288 stalled-cycles-frontend #5.82% frontend cycles idle ( +- 0.75% ) (50.01%) 744,556,088,520 stalled-cycles-backend# 52.07% backend cycles idle ( +- 1.39% ) (50.01%) 940,138,418,674 instructions #0.66 insn per cycle #0.79 stalled cycles per insn ( +- 0.51% ) (66.67%) 146,452,852,283 branches #4.885 M/sec ( +- 0.80% ) (50.00%) 3,237,743,996 branch-misses #2.21% of all branches ( +- 1.18% ) (50.01%) 117.17 +- 1.52 seconds time elapsed ( +- 1.30% ) This is around 14.6% improvement in performance. 
Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Srikar Dronamraju (4): powerpc: Refactor is_kvm_guest declaration to new header powerpc: Rename is_kvm_guest to check_kvm_guest powerpc: Reintroduce is_kvm_guest powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 25 + arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/include/asm/paravirt.h | 18 ++ arch/powerpc/kernel/firmware.c | 5 - arch/powerpc/platforms/pseries/smp.c | 3 ++- 6 files changed, 50 insertions(+), 9 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h -- 2.18.4
[PATCH 2/2] powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
Qian Cai reported a regression where CPU hotplug fails with the latest powerpc/next:

BUG: sleeping function called from invalid context at mm/slab.h:494
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
no locks held by swapper/88/0.
irq event stamp: 18074448
hardirqs last enabled at (18074447): [] tick_nohz_idle_enter+0x9c/0x110
hardirqs last disabled at (18074448): [] do_idle+0x138/0x3b0
do_idle at kernel/sched/idle.c:253 (discriminator 1)
softirqs last enabled at (18074440): [] irq_enter_rcu+0x94/0xa0
softirqs last disabled at (18074439): [] irq_enter_rcu+0x70/0xa0
CPU: 88 PID: 0 Comm: swapper/88 Tainted: G W 5.9.0-rc8-next-20201007 #1
Call Trace:
[c0002a4bfcf0] [c0649e98] dump_stack+0xec/0x144 (unreliable)
[c0002a4bfd30] [c00f6c34] ___might_sleep+0x2f4/0x310
[c0002a4bfdb0] [c0354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
[c0002a4bfe00] [c035e9e8] __kmalloc_node+0x88/0x3a0
slab_alloc_node at mm/slub.c:2817
(inlined by) __kmalloc_node at mm/slub.c:4013
[c0002a4bfe80] [c06494d8] alloc_cpumask_var_node+0x38/0x80
kmalloc_node at include/linux/slab.h:577
(inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
[c0002a4bfef0] [c003eedc] start_secondary+0x27c/0x800
update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
(inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
(inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
[c0002a4bff90] [c000c468] start_secondary_resume+0x10/0x14

Allocating a temporary mask while performing a CPU hotplug operation with CONFIG_CPUMASK_OFFSTACK enabled leads to calling a sleepable function from an atomic context. Fix this by allocating the temporary mask with the GFP_ATOMIC flag. Also, instead of having to allocate twice, allocate the mask in the caller so that we only have to allocate once. If the allocation fails, assume the mask to be the same as the sibling mask, which will cause the scheduler to drop this domain for this CPU.
Fixes: 70a94089d7f7 ("powerpc/smp: Optimize update_coregroup_mask")
Fixes: 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2")
Reported-by: Qian Cai
Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
Changelog v1->v2:
https://lore.kernel.org/linuxppc-dev/20201008034240.34059-1-sri...@linux.vnet.ibm.com/t/#u
Updated 2nd patch based on comments from Michael Ellerman
- Remove the WARN_ON.
- Handle allocation failures in a more subtle fashion.
- Allocate in the caller so that we only allocate once.

 arch/powerpc/kernel/smp.c | 57 +--
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a864b9b3228c..028479e9b66b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1257,38 +1257,33 @@ static struct device_node *cpu_to_l2cache(int cpu)
 	return cache;
 }

-static bool update_mask_by_l2(int cpu)
+static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
 {
	struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
	struct device_node *l2_cache, *np;
-	cpumask_var_t mask;
	int i;

	if (has_big_cores)
		submask_fn = cpu_smallcore_mask;

	l2_cache = cpu_to_l2cache(cpu);
-	if (!l2_cache) {
-		/*
-		 * If no l2cache for this CPU, assume all siblings to share
-		 * cache with this CPU.
-		 */
+	if (!l2_cache || !*mask) {
+		/* Assume only core siblings share cache with this CPU */
		for_each_cpu(i, submask_fn(cpu))
			set_cpus_related(cpu, i, cpu_l2_cache_mask);

		return false;
	}

-	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
-	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
+	cpumask_and(*mask, cpu_online_mask, cpu_cpu_mask(cpu));

	/* Update l2-cache mask with all the CPUs that are part of submask */
	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);

	/* Skip all CPUs already part of current CPU l2-cache mask */
-	cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu));
+	cpumask_andnot(*mask, *mask, cpu_l2_cache_mask(cpu));

-	for_each_cpu(i, mask) {
+	for_each_cpu(i, *mask) {
		/*
		 * when updating the marks the current CPU has not been marked
		 * online, but we need to update the cache masks
@@ -1298,15 +1293,14 @@ static bool update_mask_by_l2(int cpu)
		/* Skip all CPUs already part of current CPU l2-cache */
		if (np == l2_cache) {
			or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask);
-			cpumask_andnot(m
[PATCH 1/2] powerpc/smp: Remove unnecessary variable
Commit 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2") introduced submask_fn in update_mask_by_l2 to track the right submask. However, commit f6606cfdfbcd ("powerpc/smp: Dont assume l2-cache to be superset of sibling") introduced sibling_mask in update_mask_by_l2 to track the same submask. Remove sibling_mask in favour of submask_fn.

Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d1c401f4617..a864b9b3228c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1264,18 +1264,16 @@ static bool update_mask_by_l2(int cpu)
	cpumask_var_t mask;
	int i;

+	if (has_big_cores)
+		submask_fn = cpu_smallcore_mask;
+
	l2_cache = cpu_to_l2cache(cpu);
	if (!l2_cache) {
-		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
-
		/*
		 * If no l2cache for this CPU, assume all siblings to share
		 * cache with this CPU.
		 */
-		if (has_big_cores)
-			sibling_mask = cpu_smallcore_mask;
-
-		for_each_cpu(i, sibling_mask(cpu))
+		for_each_cpu(i, submask_fn(cpu))
			set_cpus_related(cpu, i, cpu_l2_cache_mask);

		return false;
@@ -1284,9 +1282,6 @@ static bool update_mask_by_l2(int cpu)
	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));

-	if (has_big_cores)
-		submask_fn = cpu_smallcore_mask;
-
	/* Update l2-cache mask with all the CPUs that are part of submask */
	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
--
2.18.2
[PATCH v2 0/2] Fixes for coregroup
These patches fix problems introduced by the coregroup patches. The first patch removes a redundant variable. The second patch allows booting with CONFIG_CPUMASK_OFFSTACK enabled.

Changelog v1->v2:
https://lore.kernel.org/linuxppc-dev/20201008034240.34059-1-sri...@linux.vnet.ibm.com/t/#u
1. 1st patch was not part of previous posting.
2. Updated 2nd patch based on comments from Michael Ellerman

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai

Srikar Dronamraju (2):
  powerpc/smp: Remove unnecessary variable
  powerpc/smp: Use GFP_ATOMIC while allocating tmp mask

 arch/powerpc/kernel/smp.c | 70 +++
 1 file changed, 35 insertions(+), 35 deletions(-)

--
2.18.2
[PATCH] powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
Qian Cai reported a regression where CPU hotplug fails with the latest powerpc/next:

BUG: sleeping function called from invalid context at mm/slab.h:494
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
no locks held by swapper/88/0.
irq event stamp: 18074448
hardirqs last enabled at (18074447): [] tick_nohz_idle_enter+0x9c/0x110
hardirqs last disabled at (18074448): [] do_idle+0x138/0x3b0
do_idle at kernel/sched/idle.c:253 (discriminator 1)
softirqs last enabled at (18074440): [] irq_enter_rcu+0x94/0xa0
softirqs last disabled at (18074439): [] irq_enter_rcu+0x70/0xa0
CPU: 88 PID: 0 Comm: swapper/88 Tainted: G W 5.9.0-rc8-next-20201007 #1
Call Trace:
[c0002a4bfcf0] [c0649e98] dump_stack+0xec/0x144 (unreliable)
[c0002a4bfd30] [c00f6c34] ___might_sleep+0x2f4/0x310
[c0002a4bfdb0] [c0354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
[c0002a4bfe00] [c035e9e8] __kmalloc_node+0x88/0x3a0
slab_alloc_node at mm/slub.c:2817
(inlined by) __kmalloc_node at mm/slub.c:4013
[c0002a4bfe80] [c06494d8] alloc_cpumask_var_node+0x38/0x80
kmalloc_node at include/linux/slab.h:577
(inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
[c0002a4bfef0] [c003eedc] start_secondary+0x27c/0x800
update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
(inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
(inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
[c0002a4bff90] [c000c468] start_secondary_resume+0x10/0x14

Allocating a temporary mask while performing a CPU hotplug operation with CONFIG_CPUMASK_OFFSTACK enabled leads to calling a sleepable function from an atomic context. Fix this by allocating the temporary mask with the GFP_ATOMIC flag. If there is a failure to allocate a mask, the scheduler is going to observe that this CPU's topology is broken. Instead of having to speculate about why the topology is broken, add a WARN_ON_ONCE.
Fixes: 70a94089d7f7 ("powerpc/smp: Optimize update_coregroup_mask")
Fixes: 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2")
Reported-by: Qian Cai
Suggested-by: Qian Cai
Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0dc1b85..1268558 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1264,7 +1264,8 @@ static bool update_mask_by_l2(int cpu)
		return false;
	}

-	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	WARN_ON_ONCE(!alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu)));
	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));

	if (has_big_cores)
@@ -1344,7 +1345,8 @@ static void update_coregroup_mask(int cpu)
	int coregroup_id = cpu_to_coregroup_id(cpu);
	int i;

-	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	WARN_ON_ONCE(!alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu)));
	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));

	if (shared_caches)
--
1.8.3.1
[PATCH v3 06/11] powerpc/smp: Stop passing mask to update_mask_by_l2
update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask as parameter. Instead of passing cpu_l2_cache_mask, use it directly in update_mask_by_l2. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c860c4950c9f..441c9c64b1e3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1218,7 +1218,7 @@ static struct device_node *cpu_to_l2cache(int cpu) return cache; } -static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) +static bool update_mask_by_l2(int cpu) { struct device_node *l2_cache, *np; int i; @@ -1240,7 +1240,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) return false; } - cpumask_set_cpu(cpu, mask_fn(cpu)); + cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked @@ -1251,7 +1251,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) continue; if (np == l2_cache) - set_cpus_related(cpu, i, mask_fn); + set_cpus_related(cpu, i, cpu_l2_cache_mask); of_node_put(np); } @@ -1315,7 +1315,7 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - update_mask_by_l2(cpu, cpu_l2_cache_mask); + update_mask_by_l2(cpu); if (has_coregroup_support()) { int coregroup_id = cpu_to_coregroup_id(cpu); -- 2.17.1
[PATCH v3 11/11] powerpc/smp: Optimize update_coregroup_mask
All threads of a SMT4/SMT8 core can either be part of this CPU's coregroup mask or outside the coregroup. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same coregroup.

Use a temporary mask to iterate through the CPUs that may share the coregroup mask. Also, instead of setting one CPU at a time into cpu_coregroup_mask, copy the SMT4/SMT8/submask in one shot.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Signed-off-by: Srikar Dronamraju
---
Changelog v2->v3:
	Use GFP_ATOMIC instead of GFP_KERNEL since allocations need to
	be atomic at the time of CPU hotplug.
	Reported by Qian Cai

 arch/powerpc/kernel/smp.c | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b48ae4e306d3..bbaea93dc558 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1340,19 +1340,34 @@ static inline void add_cpu_to_smallcore_masks(int cpu)

 static void update_coregroup_mask(int cpu)
 {
-	int first_thread = cpu_first_thread_sibling(cpu);
+	struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
+	cpumask_var_t mask;
	int coregroup_id = cpu_to_coregroup_id(cpu);
	int i;

-	cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
-	for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
-		int fcpu = cpu_first_thread_sibling(i);
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu));
+	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
+
+	if (shared_caches)
+		submask_fn = cpu_l2_cache_mask;

-		if (fcpu == first_thread)
-			set_cpus_related(cpu, i, cpu_coregroup_mask);
-		else if (coregroup_id == cpu_to_coregroup_id(i))
-			set_cpus_related(cpu, i, cpu_coregroup_mask);
+	/* Update coregroup mask with all the CPUs that are part of submask */
+	or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask);
+
+	/* Skip all CPUs already part of coregroup mask */
+	cpumask_andnot(mask, mask, cpu_coregroup_mask(cpu));
+
+	for_each_cpu(i, mask) {
+		/* Skip all CPUs not part of this coregroup */
+		if (coregroup_id == cpu_to_coregroup_id(i)) {
+			or_cpumasks_related(cpu, i, submask_fn, cpu_coregroup_mask);
+			cpumask_andnot(mask, mask, submask_fn(i));
+		} else {
+			cpumask_andnot(mask, mask, cpu_coregroup_mask(i));
+		}
	}
+	free_cpumask_var(mask);
 }

 static void add_cpu_to_masks(int cpu)
--
2.17.1
[PATCH v3 08/11] powerpc/smp: Check for duplicate topologies and consolidate
CACHE and COREGROUP domains are now part of the default topology. However, on systems that don't support CACHE or COREGROUP, these domains will eventually be degenerated. The degeneration happens per CPU. Note that the current fixup_topology() logic ensures that the mask of a domain that is not supported on the current platform is set to the previous domain's mask. Instead of waiting for the scheduler to degenerate them, try to consolidate the levels based on their masks and sd_flags. This is done just before setting the scheduler topology.

Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index aeb219a4bf7a..6f866e6b12f8 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1401,6 +1401,8 @@ int setup_profiling_timer(unsigned int multiplier)

 static void fixup_topology(void)
 {
+	int i;
+
 #ifdef CONFIG_SCHED_SMT
	if (has_big_cores) {
		pr_info("Big cores detected but using small core scheduling\n");
@@ -1410,6 +1412,30 @@ static void fixup_topology(void)

	if (!has_coregroup_support())
		powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask;
+
+	/*
+	 * Try to consolidate topology levels here instead of
+	 * allowing scheduler to degenerate.
+	 * - Dont consolidate if masks are different.
+	 * - Dont consolidate if sd_flags exists and are different.
+	 */
+	for (i = 1; i <= die_idx; i++) {
+		if (powerpc_topology[i].mask != powerpc_topology[i - 1].mask)
+			continue;
+
+		if (powerpc_topology[i].sd_flags && powerpc_topology[i - 1].sd_flags &&
+		    powerpc_topology[i].sd_flags != powerpc_topology[i - 1].sd_flags)
+			continue;
+
+		if (!powerpc_topology[i - 1].sd_flags)
+			powerpc_topology[i - 1].sd_flags = powerpc_topology[i].sd_flags;
+
+		powerpc_topology[i].mask = powerpc_topology[i + 1].mask;
+		powerpc_topology[i].sd_flags = powerpc_topology[i + 1].sd_flags;
+#ifdef CONFIG_SCHED_DEBUG
+		powerpc_topology[i].name = powerpc_topology[i + 1].name;
+#endif
+	}
 }

 void __init smp_cpus_done(unsigned int max_cpus)
--
2.17.1
[PATCH v3 09/11] powerpc/smp: Optimize update_mask_by_l2
All threads of a SMT4 core can either be part of this CPU's l2-cache mask or not related to this CPU's l2-cache mask. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same l2-cache.

Use a temporary mask to iterate through the CPUs that may share the l2_cache mask. Also, instead of setting one CPU at a time into cpu_l2_cache_mask, copy the SMT4/sub mask in one shot.

Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
Changelog v2->v3:
	Use GFP_ATOMIC instead of GFP_KERNEL since allocations need to
	be atomic at the time of CPU hotplug.
	Reported by Qian Cai

 arch/powerpc/kernel/smp.c | 52 +--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6f866e6b12f8..17e90c2414af 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -671,6 +671,28 @@ static void set_cpus_unrelated(int i, int j,
 #endif

 /*
+ * Extends set_cpus_related. Instead of setting one CPU at a time in
+ * dstmask, set srcmask at oneshot. dstmask should be super set of srcmask.
+ */
+static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
+				struct cpumask *(*dstmask)(int))
+{
+	struct cpumask *mask;
+	int k;
+
+	mask = srcmask(j);
+	for_each_cpu(k, srcmask(i))
+		cpumask_or(dstmask(k), dstmask(k), mask);
+
+	if (i == j)
+		return;
+
+	mask = srcmask(i);
+	for_each_cpu(k, srcmask(j))
+		cpumask_or(dstmask(k), dstmask(k), mask);
+}
+
+/*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *                      property for the CPU device node @dn and stores
  *                      the parsed output in the thread_groups
@@ -1220,7 +1242,9 @@ static struct device_node *cpu_to_l2cache(int cpu)

 static bool update_mask_by_l2(int cpu)
 {
+	struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
	struct device_node *l2_cache, *np;
+	cpumask_var_t mask;
	int i;

	l2_cache = cpu_to_l2cache(cpu);
@@ -1240,22 +1264,38 @@ static bool update_mask_by_l2(int cpu)
		return false;
	}

-	cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
-	for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu));
+	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
+
+	if (has_big_cores)
+		submask_fn = cpu_smallcore_mask;
+
+	/* Update l2-cache mask with all the CPUs that are part of submask */
+	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
+
+	/* Skip all CPUs already part of current CPU l2-cache mask */
+	cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu));
+
+	for_each_cpu(i, mask) {
		/*
		 * when updating the marks the current CPU has not been marked
		 * online, but we need to update the cache masks
		 */
		np = cpu_to_l2cache(i);
-		if (!np)
-			continue;

-		if (np == l2_cache)
-			set_cpus_related(cpu, i, cpu_l2_cache_mask);
+		/* Skip all CPUs already part of current CPU l2-cache */
+		if (np == l2_cache) {
+			or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask);
+			cpumask_andnot(mask, mask, submask_fn(i));
+		} else {
+			cpumask_andnot(mask, mask, cpu_l2_cache_mask(i));
+		}

		of_node_put(np);
	}
	of_node_put(l2_cache);
+	free_cpumask_var(mask);

	return true;
 }
--
2.17.1
[PATCH v3 07/11] powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
Currently on hotplug/hotunplug, CPU iterates through all the CPUs in its core to find threads in its thread group. However this info is already captured in cpu_l1_cache_map. Hence reduce iterations and cleanup add_cpu_to_smallcore_masks function. Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 441c9c64b1e3..aeb219a4bf7a 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1285,16 +1285,15 @@ static void remove_cpu_from_masks(int cpu) static inline void add_cpu_to_smallcore_masks(int cpu) { - struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu); - int i, first_thread = cpu_first_thread_sibling(cpu); + int i; if (!has_big_cores) return; cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu)); - for (i = first_thread; i < first_thread + threads_per_core; i++) { - if (cpu_online(i) && cpumask_test_cpu(i, this_l1_cache_map)) + for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) { + if (cpu_online(i)) set_cpus_related(i, cpu, cpu_smallcore_mask); } } -- 2.17.1
[PATCH v3 10/11] powerpc/smp: Move coregroup mask updation to a new function
Move the logic for updating the coregroup mask of a CPU to its own function. This will help in reworking the updation of coregroup mask in subsequent patch. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 32 +++- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 17e90c2414af..b48ae4e306d3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1338,6 +1338,23 @@ static inline void add_cpu_to_smallcore_masks(int cpu) } } +static void update_coregroup_mask(int cpu) +{ + int first_thread = cpu_first_thread_sibling(cpu); + int coregroup_id = cpu_to_coregroup_id(cpu); + int i; + + cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { + int fcpu = cpu_first_thread_sibling(i); + + if (fcpu == first_thread) + set_cpus_related(cpu, i, cpu_coregroup_mask); + else if (coregroup_id == cpu_to_coregroup_id(i)) + set_cpus_related(cpu, i, cpu_coregroup_mask); + } +} + static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); @@ -1356,19 +1373,8 @@ static void add_cpu_to_masks(int cpu) add_cpu_to_smallcore_masks(cpu); update_mask_by_l2(cpu); - if (has_coregroup_support()) { - int coregroup_id = cpu_to_coregroup_id(cpu); - - cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { - int fcpu = cpu_first_thread_sibling(i); - - if (fcpu == first_thread) - set_cpus_related(cpu, i, cpu_coregroup_mask); - else if (coregroup_id == cpu_to_coregroup_id(i)) - set_cpus_related(cpu, i, cpu_coregroup_mask); - } - } + if (has_coregroup_support()) + update_coregroup_mask(cpu); } /* Activate a 
secondary processor. */ -- 2.17.1
[PATCH v3 03/11] powerpc/smp: Remove get_physical_package_id
Now that cpu_core_mask has been removed and topology_core_cpumask has been updated to use cpu_cpu_mask, we no longer need get_physical_package_id.

Signed-off-by: Srikar Dronamraju
Tested-by: Satheesh Rajendran
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/include/asm/topology.h |  5 -
 arch/powerpc/kernel/smp.c           | 20
 2 files changed, 25 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e0f232533c9d..e45219f74be0 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -114,12 +114,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #ifdef CONFIG_PPC64
 #include

-#ifdef CONFIG_PPC_SPLPAR
-int get_physical_package_id(int cpu);
-#define topology_physical_package_id(cpu)	(get_physical_package_id(cpu))
-#else
 #define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
-#endif

 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(cpu_cpu_mask(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec41491beca4..8c095fe237b2 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1292,26 +1292,6 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 	}
 }

-int get_physical_package_id(int cpu)
-{
-	int pkg_id = cpu_to_chip_id(cpu);
-
-	/*
-	 * If the platform is PowerNV or Guest on KVM, ibm,chip-id is
-	 * defined. Hence we would return the chip-id as the result of
-	 * get_physical_package_id.
-	 */
-	if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) &&
-	    IS_ENABLED(CONFIG_PPC_SPLPAR)) {
-		struct device_node *np = of_get_cpu_node(cpu, NULL);
-		pkg_id = of_node_to_nid(np);
-		of_node_put(np);
-	}
-
-	return pkg_id;
-}
-EXPORT_SYMBOL_GPL(get_physical_package_id);
-
 static void add_cpu_to_masks(int cpu)
 {
 	int first_thread = cpu_first_thread_sibling(cpu);
--
2.17.1
[PATCH v3 04/11] powerpc/smp: Optimize remove_cpu_from_masks
While offlining a CPU, the system currently iterates through all the CPUs in the DIE to clear the sibling, l2_cache and smallcore maps. However, if there are more cores in a DIE, the system can end up spending more time iterating through CPUs which are completely unrelated. Optimize this by only iterating through a smaller but relevant cpumask: if shared_caches is set, cpu_l2_cache_map is the relevant map, else cpu_sibling_map is.

Signed-off-by: Srikar Dronamraju
Tested-by: Satheesh Rajendran
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c095fe237b2..2e61a81aad88 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1263,14 +1263,21 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 #ifdef CONFIG_HOTPLUG_CPU
 static void remove_cpu_from_masks(int cpu)
 {
+	struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
	int i;

-	for_each_cpu(i, cpu_cpu_mask(cpu)) {
+	if (shared_caches)
+		mask_fn = cpu_l2_cache_mask;
+
+	for_each_cpu(i, mask_fn(cpu)) {
		set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
		if (has_big_cores)
			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
-		if (has_coregroup_support())
+	}
+
+	if (has_coregroup_support()) {
+		for_each_cpu(i, cpu_coregroup_mask(cpu))
			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
	}
 }
--
2.17.1
[PATCH v3 02/11] powerpc/smp: Stop updating cpu_core_mask
Anton Blanchard reported that his 4096-vCPU KVM guest took around 30 minutes to boot. He also traced it to the time taken to iterate while setting the cpu_core_mask.

Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power. However, updating cpu_core_mask took forever since it is a per-CPU cpumask variable, whereas cpu_cpu_mask is a per-node/per-DIE cpumask shared by all the respective CPUs.

Also, cpu_cpu_mask is needed from a scheduler perspective. However, cpu_core_map is an exported symbol. Hence stop updating cpu_core_map and make it point to cpu_cpu_mask.

Signed-off-by: Srikar Dronamraju
Tested-by: Satheesh Rajendran
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/include/asm/smp.h |  5 -
 arch/powerpc/kernel/smp.c      | 33 +++++++--------
 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 041f0b97c45b..40e121dd16af 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -119,11 +119,6 @@ static inline struct cpumask *cpu_sibling_mask(int cpu)
 	return per_cpu(cpu_sibling_map, cpu);
 }

-static inline struct cpumask *cpu_core_mask(int cpu)
-{
-	return per_cpu(cpu_core_map, cpu);
-}
-
 static inline struct cpumask *cpu_l2_cache_mask(int cpu)
 {
 	return per_cpu(cpu_l2_cache_map, cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3d96752d6570..ec41491beca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -953,12 +953,17 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
				local_memory_node(numa_cpu_lookup_table[cpu]));
		}
 #endif
+		/*
+		 * cpu_core_map is no longer updated and exists only because
+		 * it has long been exported. It will only hold a snapshot
+		 * of cpu_cpu_mask.
+		 */
+		cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
	}

	/* Init the cpumasks so the boot CPU is related to itself */
	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
-	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));

	if (has_coregroup_support())
		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
@@ -1260,9 +1265,7 @@ static void remove_cpu_from_masks(int cpu)
 {
	int i;

-	/* NB: cpu_core_mask is a superset of the others */
-	for_each_cpu(i, cpu_core_mask(cpu)) {
-		set_cpus_unrelated(cpu, i, cpu_core_mask);
+	for_each_cpu(i, cpu_cpu_mask(cpu)) {
		set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
		if (has_big_cores)
@@ -1312,7 +1315,6 @@ EXPORT_SYMBOL_GPL(get_physical_package_id);
 static void add_cpu_to_masks(int cpu)
 {
	int first_thread = cpu_first_thread_sibling(cpu);
-	int pkg_id = get_physical_package_id(cpu);
	int i;

	/*
@@ -1320,7 +1322,6 @@ static void add_cpu_to_masks(int cpu)
	 * add it to it's own thread sibling mask.
	 */
	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
-	cpumask_set_cpu(cpu, cpu_core_mask(cpu));

	for (i = first_thread; i < first_thread + threads_per_core; i++)
		if (cpu_online(i))
@@ -1342,26 +1343,6 @@ static void add_cpu_to_masks(int cpu)
			set_cpus_related(cpu, i, cpu_coregroup_mask);
		}
	}
-
-	if (pkg_id == -1) {
-		struct cpumask *(*mask)(int) = cpu_sibling_mask;
-
-		/*
-		 * Copy the sibling mask into core sibling mask and
-		 * mark any CPUs on the same chip as this CPU.
-		 */
-		if (shared_caches)
-			mask = cpu_l2_cache_mask;
-
-		for_each_cpu(i, mask(cpu))
-			set_cpus_related(cpu, i, cpu_core_mask);
-
-		return;
-	}
-
-	for_each_cpu(i, cpu_online_mask)
-		if (get_physical_package_id(i) == pkg_id)
-			set_cpus_related(cpu, i, cpu_core_mask);
 }

 /* Activate a secondary processor. */
--
2.17.1
[PATCH v3 00/11] Optimization to improve CPU online/offline on Powerpc
h: 113 99.5000th: 104 99.9000th: 159 99.9000th: 129 min=0, max=15221 min=0, max=7666

100 iterations of ppc64_cpu --smt=1 / ppc64_cpu --smt=8
Units: seconds; lesser is better

ppc64_cpu --smt=1
kernel        N    Min    Max    Median  Avg      Stddev
powerpc/next  100  13.39  17.55  14.71   14.7658  0.69184745
+patchset     100  13.3   16.27  14.33   14.4179  0.5427433

ppc64_cpu --smt=8
kernel        N    Min    Max    Median  Avg      Stddev
powerpc/next  100  21.65  26.17  23.71   23.7111  0.8589786
+patchset     100  21.88  25.79  23.16   23.2945  0.86394839

Observations: Performance of ebizzy / perf_sched_bench / schbench remains the same with and without the patchset.

Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai

Srikar Dronamraju (11):
  powerpc/topology: Update topology_core_cpumask
  powerpc/smp: Stop updating cpu_core_mask
  powerpc/smp: Remove get_physical_package_id
  powerpc/smp: Optimize remove_cpu_from_masks
  powerpc/smp: Limit CPUs traversed to within a node.
  powerpc/smp: Stop passing mask to update_mask_by_l2
  powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
  powerpc/smp: Check for duplicate topologies and consolidate
  powerpc/smp: Optimize update_mask_by_l2
  powerpc/smp: Move coregroup mask updation to a new function
  powerpc/smp: Optimize update_coregroup_mask

 arch/powerpc/include/asm/smp.h      |   5 -
 arch/powerpc/include/asm/topology.h |   7 +-
 arch/powerpc/kernel/smp.c           | 188 +++-
 3 files changed, 122 insertions(+), 78 deletions(-)

-- 2.17.1
[PATCH v3 01/11] powerpc/topology: Update topology_core_cpumask
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs. cpu_cpu_mask is needed by scheduler, hence look at deprecating cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user is moved to cpu_cpu_mask. Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/include/asm/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 6609174918ab..e0f232533c9d 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -122,7 +122,7 @@ int get_physical_package_id(int cpu); #endif #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) -#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) +#define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu)) #define topology_core_id(cpu) (cpu_to_core_id(cpu)) #endif -- 2.17.1
[PATCH v3 05/11] powerpc/smp: Limit CPUs traversed to within a node.
All the arch specific topology cpumasks are within a node/DIE. However when setting these per CPU cpumasks, system traverses through all the online CPUs. This is redundant. Reduce the traversal to only CPUs that are online in the node to which the CPU belongs to. Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 2e61a81aad88..c860c4950c9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1241,7 +1241,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) } cpumask_set_cpu(cpu, mask_fn(cpu)); - for_each_cpu(i, cpu_online_mask) { + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked * online, but we need to update the cache masks -- 2.17.1
Re: [PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2
* Qian Cai [2020-10-07 09:05:42]: Hi Qian, Thanks for testing and reporting the failure.

> On Mon, 2020-09-21 at 15:26 +0530, Srikar Dronamraju wrote: > > All threads of a SMT4 core can either be part of this CPU's l2-cache > > mask or not related to this CPU l2-cache mask. Use this relation to > > reduce the number of iterations needed to find all the CPUs that share > > the same l2-cache. > > > > Use a temporary mask to iterate through the CPUs that may share l2_cache > > mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask, > > copy the SMT4/sub mask at one shot. > > > ... > > static bool update_mask_by_l2(int cpu) > > { > > + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask; > > struct device_node *l2_cache, *np; > > + cpumask_var_t mask; > > int i; > > > > l2_cache = cpu_to_l2cache(cpu); > > @@ -1240,22 +1264,37 @@ static bool update_mask_by_l2(int cpu) > > return false; > > } > > > > - cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); > > - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { > > + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu)); > > Shouldn't this be GFP_ATOMIC? Otherwise, during the CPU hotplugging, we have,

Can you confirm if CONFIG_CPUMASK_OFFSTACK is enabled in your config? Because if !CONFIG_CPUMASK_OFFSTACK, then alloc_cpumask_var_node would do nothing but return true. Regarding CONFIG_CPUMASK_OFFSTACK, I am not sure how much powerpc has been tested with that config enabled. Please refer to http://lore.kernel.org/lkml/87o8nv51bg@mpe.ellerman.id.au/t/#u And we do have an issue, https://github.com/linuxppc/issues/issues/321, to track enabling / testing / verifying that CONFIG_CPUMASK_OFFSTACK works. I also don't see any powerpc kconfig enabling this. I do agree with your suggestion that we could substitute GFP_KERNEL with GFP_ATOMIC.
> > (irqs were disabled in do_idle()) > > [ 335.420001][T0] BUG: sleeping function called from invalid context at > mm/slab.h:494 > [ 335.420003][T0] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: > 0, name: swapper/88 > [ 335.420005][T0] no locks held by swapper/88/0. > [ 335.420007][T0] irq event stamp: 18074448 > [ 335.420015][T0] hardirqs last enabled at (18074447): > [] tick_nohz_idle_enter+0x9c/0x110 > [ 335.420019][T0] hardirqs last disabled at (18074448): > [] do_idle+0x138/0x3b0 > do_idle at kernel/sched/idle.c:253 (discriminator 1) > [ 335.420023][T0] softirqs last enabled at (18074440): > [] irq_enter_rcu+0x94/0xa0 > [ 335.420026][T0] softirqs last disabled at (18074439): > [] irq_enter_rcu+0x70/0xa0 > [ 335.420030][T0] CPU: 88 PID: 0 Comm: swapper/88 Tainted: GW > 5.9.0-rc8-next-20201007 #1 > [ 335.420032][T0] Call Trace: > [ 335.420037][T0] [c0002a4bfcf0] [c0649e98] > dump_stack+0xec/0x144 (unreliable) > [ 335.420043][T0] [c0002a4bfd30] [c00f6c34] > ___might_sleep+0x2f4/0x310 > [ 335.420048][T0] [c0002a4bfdb0] [c0354f94] > slab_pre_alloc_hook.constprop.82+0x124/0x190 > [ 335.420051][T0] [c0002a4bfe00] [c035e9e8] > __kmalloc_node+0x88/0x3a0 > slab_alloc_node at mm/slub.c:2817 > (inlined by) __kmalloc_node at mm/slub.c:4013 > [ 335.420054][T0] [c0002a4bfe80] [c06494d8] > alloc_cpumask_var_node+0x38/0x80 > kmalloc_node at include/linux/slab.h:577 > (inlined by) alloc_cpumask_var_node at lib/cpumask.c:116 > [ 335.420060][T0] [c0002a4bfef0] [c003eedc] > start_secondary+0x27c/0x800 > update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267 > (inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387 > (inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420 > [ 335.420063][T0] [c0002a4bff90] [c000c468] > start_secondary_resume+0x10/0x14 > > > + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu)); > > + > > + if (has_big_cores) > > + submask_fn = cpu_smallcore_mask; > > + > > + /* Update l2-cache mask with all the CPUs that are part of submask 
*/ > > + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); > > + > > + /* Skip all CPUs already part of current CPU l2-cache mask */ > > + cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu)); > > + > > + for_each_cpu(i, mask) { > > /* > > * when updating the marks the current CPU has not been marked > > * online, but we need to update the cache masks > > */ > > np = cpu_to_l2cache(i); > > - if (!np) > > -
[PATCH v2 06/11] powerpc/smp: Stop passing mask to update_mask_by_l2
update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask as parameter. Instead of passing cpu_l2_cache_mask, use it directly in update_mask_by_l2. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c860c4950c9f..441c9c64b1e3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1218,7 +1218,7 @@ static struct device_node *cpu_to_l2cache(int cpu) return cache; } -static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) +static bool update_mask_by_l2(int cpu) { struct device_node *l2_cache, *np; int i; @@ -1240,7 +1240,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) return false; } - cpumask_set_cpu(cpu, mask_fn(cpu)); + cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked @@ -1251,7 +1251,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) continue; if (np == l2_cache) - set_cpus_related(cpu, i, mask_fn); + set_cpus_related(cpu, i, cpu_l2_cache_mask); of_node_put(np); } @@ -1315,7 +1315,7 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - update_mask_by_l2(cpu, cpu_l2_cache_mask); + update_mask_by_l2(cpu); if (has_coregroup_support()) { int coregroup_id = cpu_to_coregroup_id(cpu); -- 2.17.1
[PATCH v2 10/11] powerpc/smp: Move coregroup mask updation to a new function
Move the logic for updating the coregroup mask of a CPU to its own function. This will help in reworking the updation of coregroup mask in subsequent patch. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 32 +++- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 17e90c2414af..b48ae4e306d3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1337,6 +1337,23 @@ static inline void add_cpu_to_smallcore_masks(int cpu) } } +static void update_coregroup_mask(int cpu) +{ + int first_thread = cpu_first_thread_sibling(cpu); + int coregroup_id = cpu_to_coregroup_id(cpu); + int i; + + cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { + int fcpu = cpu_first_thread_sibling(i); + + if (fcpu == first_thread) + set_cpus_related(cpu, i, cpu_coregroup_mask); + else if (coregroup_id == cpu_to_coregroup_id(i)) + set_cpus_related(cpu, i, cpu_coregroup_mask); + } +} + static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); @@ -1355,19 +1372,8 @@ static void add_cpu_to_masks(int cpu) add_cpu_to_smallcore_masks(cpu); update_mask_by_l2(cpu); - if (has_coregroup_support()) { - int coregroup_id = cpu_to_coregroup_id(cpu); - - cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { - int fcpu = cpu_first_thread_sibling(i); - - if (fcpu == first_thread) - set_cpus_related(cpu, i, cpu_coregroup_mask); - else if (coregroup_id == cpu_to_coregroup_id(i)) - set_cpus_related(cpu, i, cpu_coregroup_mask); - } - } + if (has_coregroup_support()) + update_coregroup_mask(cpu); } /* Activate a secondary processor. 
*/ -- 2.17.1
[PATCH v2 05/11] powerpc/smp: Limit CPUs traversed to within a node.
All the arch specific topology cpumasks are within a node/DIE. However when setting these per CPU cpumasks, system traverses through all the online CPUs. This is redundant. Reduce the traversal to only CPUs that are online in the node to which the CPU belongs to. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 2e61a81aad88..c860c4950c9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1241,7 +1241,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) } cpumask_set_cpu(cpu, mask_fn(cpu)); - for_each_cpu(i, cpu_online_mask) { + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked * online, but we need to update the cache masks -- 2.17.1
[PATCH v2 11/11] powerpc/smp: Optimize update_coregroup_mask
All threads of a SMT4/SMT8 core can either be part of this CPU's coregroup mask or outside the coregroup. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same coregroup. Use a temporary mask to iterate through the CPUs that may share coregroup mask. Also instead of setting one CPU at a time into cpu_coregroup_mask, copy the SMT4/SMT8 submask at one shot.

Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 30 ++ 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index b48ae4e306d3..bbaea93dc558 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1339,19 +1339,33 @@ static inline void add_cpu_to_smallcore_masks(int cpu) static void update_coregroup_mask(int cpu) { - int first_thread = cpu_first_thread_sibling(cpu); + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask; + cpumask_var_t mask; int coregroup_id = cpu_to_coregroup_id(cpu); int i; - cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { - int fcpu = cpu_first_thread_sibling(i); + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu)); + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu)); + + if (shared_caches) + submask_fn = cpu_l2_cache_mask; + + /* Update coregroup mask with all the CPUs that are part of submask */ + or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask); + + /* Skip all CPUs already part of coregroup mask */ + cpumask_andnot(mask, mask, cpu_coregroup_mask(cpu)); - if (fcpu == first_thread) - set_cpus_related(cpu, i, cpu_coregroup_mask); - else if (coregroup_id == cpu_to_coregroup_id(i)) - set_cpus_related(cpu, i, cpu_coregroup_mask); + for_each_cpu(i, mask) { + /* Skip all CPUs not part of this coregroup */ + if (coregroup_id == cpu_to_coregroup_id(i)) { + or_cpumasks_related(cpu, i, submask_fn, cpu_coregroup_mask); + cpumask_andnot(mask, mask, submask_fn(i)); + } else { + cpumask_andnot(mask, mask, cpu_coregroup_mask(i)); + } } + free_cpumask_var(mask); } static void add_cpu_to_masks(int cpu) -- 2.17.1
[PATCH v2 04/11] powerpc/smp: Optimize remove_cpu_from_masks
While offlining a CPU, system currently iterate through all the CPUs in the DIE to clear sibling, l2_cache and smallcore maps. However if there are more cores in a DIE, system can end up spending more time iterating through CPUs which are completely unrelated. Optimize this by only iterating through smaller but relevant cpumap. If shared_cache is set, cpu_l2_cache_map should be relevant else cpu_sibling_map would be relevant. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8c095fe237b2..2e61a81aad88 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1263,14 +1263,21 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) #ifdef CONFIG_HOTPLUG_CPU static void remove_cpu_from_masks(int cpu) { + struct cpumask *(*mask_fn)(int) = cpu_sibling_mask; int i; - for_each_cpu(i, cpu_cpu_mask(cpu)) { + if (shared_caches) + mask_fn = cpu_l2_cache_mask; + + for_each_cpu(i, mask_fn(cpu)) { set_cpus_unrelated(cpu, i, cpu_l2_cache_mask); set_cpus_unrelated(cpu, i, cpu_sibling_mask); if (has_big_cores) set_cpus_unrelated(cpu, i, cpu_smallcore_mask); - if (has_coregroup_support()) + } + + if (has_coregroup_support()) { + for_each_cpu(i, cpu_coregroup_mask(cpu)) set_cpus_unrelated(cpu, i, cpu_coregroup_mask); } } -- 2.17.1
[PATCH v2 07/11] powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
Currently on hotplug/hotunplug, CPU iterates through all the CPUs in its core to find threads in its thread group. However this info is already captured in cpu_l1_cache_map. Hence reduce iterations and cleanup add_cpu_to_smallcore_masks function. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 441c9c64b1e3..aeb219a4bf7a 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1285,16 +1285,15 @@ static void remove_cpu_from_masks(int cpu) static inline void add_cpu_to_smallcore_masks(int cpu) { - struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu); - int i, first_thread = cpu_first_thread_sibling(cpu); + int i; if (!has_big_cores) return; cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu)); - for (i = first_thread; i < first_thread + threads_per_core; i++) { - if (cpu_online(i) && cpumask_test_cpu(i, this_l1_cache_map)) + for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) { + if (cpu_online(i)) set_cpus_related(i, cpu, cpu_smallcore_mask); } } -- 2.17.1
[PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2
All threads of a SMT4 core can either be part of this CPU's l2-cache mask or not related to this CPU l2-cache mask. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same l2-cache. Use a temporary mask to iterate through the CPUs that may share l2_cache mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask, copy the SMT4/sub mask at one shot. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 51 ++- 1 file changed, 45 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 6f866e6b12f8..17e90c2414af 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -670,6 +670,28 @@ static void set_cpus_unrelated(int i, int j, } #endif +/* + * Extends set_cpus_related. Instead of setting one CPU at a time in + * dstmask, set srcmask at oneshot. dstmask should be super set of srcmask. 
+ */ +static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), + struct cpumask *(*dstmask)(int)) +{ + struct cpumask *mask; + int k; + + mask = srcmask(j); + for_each_cpu(k, srcmask(i)) + cpumask_or(dstmask(k), dstmask(k), mask); + + if (i == j) + return; + + mask = srcmask(i); + for_each_cpu(k, srcmask(j)) + cpumask_or(dstmask(k), dstmask(k), mask); +} + /* * parse_thread_groups: Parses the "ibm,thread-groups" device tree * property for the CPU device node @dn and stores @@ -1220,7 +1242,9 @@ static struct device_node *cpu_to_l2cache(int cpu) static bool update_mask_by_l2(int cpu) { + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask; struct device_node *l2_cache, *np; + cpumask_var_t mask; int i; l2_cache = cpu_to_l2cache(cpu); @@ -1240,22 +1264,37 @@ static bool update_mask_by_l2(int cpu) return false; } - cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu)); + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu)); + + if (has_big_cores) + submask_fn = cpu_smallcore_mask; + + /* Update l2-cache mask with all the CPUs that are part of submask */ + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); + + /* Skip all CPUs already part of current CPU l2-cache mask */ + cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu)); + + for_each_cpu(i, mask) { /* * when updating the marks the current CPU has not been marked * online, but we need to update the cache masks */ np = cpu_to_l2cache(i); - if (!np) - continue; - if (np == l2_cache) - set_cpus_related(cpu, i, cpu_l2_cache_mask); + /* Skip all CPUs already part of current CPU l2-cache */ + if (np == l2_cache) { + or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask); + cpumask_andnot(mask, mask, submask_fn(i)); + } else { + cpumask_andnot(mask, mask, cpu_l2_cache_mask(i)); + } of_node_put(np); } of_node_put(l2_cache); + free_cpumask_var(mask); return true; } -- 2.17.1
[PATCH v2 08/11] powerpc/smp: Check for duplicate topologies and consolidate
CACHE and COREGROUP domains are now part of default topology. However on systems that don't support CACHE or COREGROUP, these domains will eventually be degenerated. The degeneration happens per CPU. Do note the current fixup_topology() logic ensures that mask of a domain that is not supported on the current platform is set to the previous domain. Instead of waiting for the scheduler to degenerated try to consolidate based on their masks and sd_flags. This is done just before setting the scheduler topology. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index aeb219a4bf7a..6f866e6b12f8 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1401,6 +1401,8 @@ int setup_profiling_timer(unsigned int multiplier) static void fixup_topology(void) { + int i; + #ifdef CONFIG_SCHED_SMT if (has_big_cores) { pr_info("Big cores detected but using small core scheduling\n"); @@ -1410,6 +1412,30 @@ static void fixup_topology(void) if (!has_coregroup_support()) powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask; + + /* +* Try to consolidate topology levels here instead of +* allowing scheduler to degenerate. +* - Dont consolidate if masks are different. +* - Dont consolidate if sd_flags exists and are different. 
+*/ + for (i = 1; i <= die_idx; i++) { + if (powerpc_topology[i].mask != powerpc_topology[i - 1].mask) + continue; + + if (powerpc_topology[i].sd_flags && powerpc_topology[i - 1].sd_flags && + powerpc_topology[i].sd_flags != powerpc_topology[i - 1].sd_flags) + continue; + + if (!powerpc_topology[i - 1].sd_flags) + powerpc_topology[i - 1].sd_flags = powerpc_topology[i].sd_flags; + + powerpc_topology[i].mask = powerpc_topology[i + 1].mask; + powerpc_topology[i].sd_flags = powerpc_topology[i + 1].sd_flags; +#ifdef CONFIG_SCHED_DEBUG + powerpc_topology[i].name = powerpc_topology[i + 1].name; +#endif + } } void __init smp_cpus_done(unsigned int max_cpus) -- 2.17.1
[PATCH v2 01/11] powerpc/topology: Update topology_core_cpumask
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs. cpu_cpu_mask is needed by scheduler, hence look at deprecating cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user is moved to cpu_cpu_mask. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/include/asm/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 6609174918ab..e0f232533c9d 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -122,7 +122,7 @@ int get_physical_package_id(int cpu); #endif #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) -#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) +#define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu)) #define topology_core_id(cpu) (cpu_to_core_id(cpu)) #endif -- 2.17.1
[PATCH v2 03/11] powerpc/smp: Remove get_physical_package_id
Now that cpu_core_mask has been removed and topology_core_cpumask has been updated to use cpu_cpu_mask, we no more need get_physical_package_id. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/include/asm/topology.h | 5 - arch/powerpc/kernel/smp.c | 20 2 files changed, 25 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index e0f232533c9d..e45219f74be0 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -114,12 +114,7 @@ static inline int cpu_to_coregroup_id(int cpu) #ifdef CONFIG_PPC64 #include -#ifdef CONFIG_PPC_SPLPAR -int get_physical_package_id(int cpu); -#define topology_physical_package_id(cpu) (get_physical_package_id(cpu)) -#else #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) -#endif #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) #define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu)) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec41491beca4..8c095fe237b2 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1292,26 +1292,6 @@ static inline void add_cpu_to_smallcore_masks(int cpu) } } -int get_physical_package_id(int cpu) -{ - int pkg_id = cpu_to_chip_id(cpu); - - /* -* If the platform is PowerNV or Guest on KVM, ibm,chip-id is -* defined. Hence we would return the chip-id as the result of -* get_physical_package_id. 
-*/ - if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) && - IS_ENABLED(CONFIG_PPC_SPLPAR)) { - struct device_node *np = of_get_cpu_node(cpu, NULL); - pkg_id = of_node_to_nid(np); - of_node_put(np); - } - - return pkg_id; -} -EXPORT_SYMBOL_GPL(get_physical_package_id); - static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); -- 2.17.1
[PATCH v2 00/11] Optimization to improve CPU online/offline on Powerpc
0.8589786 +patchset 100 21.88 25.79 23.16 23.2945 0.86394839 Observations: Performance of ebizzy/ perf_sched_bench / schbench remain the same with and without the patchset. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Changelog v1->v2: v1 link: https://lore.kernel.org/linuxppc-dev/20200727075532.30058-1-sri...@linux.vnet.ibm.com/t/#u Added five more patches on top of Seven. Rebased to 19th Sept 2020 powerpc/next (based on v5.9-rc2) Srikar Dronamraju (11): powerpc/topology: Update topology_core_cpumask powerpc/smp: Stop updating cpu_core_mask powerpc/smp: Remove get_physical_package_id powerpc/smp: Optimize remove_cpu_from_masks powerpc/smp: Limit CPUs traversed to within a node. powerpc/smp: Stop passing mask to update_mask_by_l2 powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs powerpc/smp: Check for duplicate topologies and consolidate powerpc/smp: Optimize update_mask_by_l2 powerpc/smp: Move coregroup mask updation to a new function powerpc/smp: Optimize update_coregroup_mask arch/powerpc/include/asm/smp.h | 5 - arch/powerpc/include/asm/topology.h | 7 +- arch/powerpc/kernel/smp.c | 186 ++-- 3 files changed, 120 insertions(+), 78 deletions(-) -- 2.17.1
[PATCH v2 02/11] powerpc/smp: Stop updating cpu_core_mask
Anton Blanchard reported that his 4096 vcpu KVM guest took around 30 minutes to boot. He also traced it down to the time taken to iterate while setting the cpu_core_mask. Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power. However updating cpu_core_mask took forever as it is a per-CPU cpumask variable, whereas cpu_cpu_mask is a per-node/per-DIE cpumask shared by all the respective CPUs. Also cpu_cpu_mask is needed from a scheduler perspective. However cpu_core_map is an exported symbol. Hence stop updating cpu_core_map and make it point to cpu_cpu_mask.

Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/include/asm/smp.h | 5 - arch/powerpc/kernel/smp.c | 33 +++-- 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 041f0b97c45b..40e121dd16af 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -119,11 +119,6 @@ static inline struct cpumask *cpu_sibling_mask(int cpu) return per_cpu(cpu_sibling_map, cpu); } -static inline struct cpumask *cpu_core_mask(int cpu) -{ - return per_cpu(cpu_core_map, cpu); -} - static inline struct cpumask *cpu_l2_cache_mask(int cpu) { return per_cpu(cpu_l2_cache_map, cpu); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 3d96752d6570..ec41491beca4 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -953,12 +953,17 @@ void __init smp_prepare_cpus(unsigned int max_cpus) local_memory_node(numa_cpu_lookup_table[cpu])); } #endif + /* +* cpu_core_map is no more updated and exists only since +* it has been exported for long. It will only have a snapshot +* of cpu_cpu_mask. +*/ + cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu)); } /* Init the cpumasks so the boot CPU is related to itself */ cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid)); cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid)); - cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid)); if (has_coregroup_support()) cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid)); @@ -1260,9 +1265,7 @@ static void remove_cpu_from_masks(int cpu) { int i; - /* NB: cpu_core_mask is a superset of the others */ - for_each_cpu(i, cpu_core_mask(cpu)) { - set_cpus_unrelated(cpu, i, cpu_core_mask); + for_each_cpu(i, cpu_cpu_mask(cpu)) { set_cpus_unrelated(cpu, i, cpu_l2_cache_mask); set_cpus_unrelated(cpu, i, cpu_sibling_mask); if (has_big_cores) @@ -1312,7 +1315,6 @@ EXPORT_SYMBOL_GPL(get_physical_package_id); static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); - int pkg_id = get_physical_package_id(cpu); int i; /* @@ -1320,7 +1322,6 @@ static void add_cpu_to_masks(int cpu) * add it to it's own thread sibling mask. */ cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); - cpumask_set_cpu(cpu, cpu_core_mask(cpu)); for (i = first_thread; i < first_thread + threads_per_core; i++) if (cpu_online(i)) @@ -1342,26 +1343,6 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(cpu, i, cpu_coregroup_mask); } } - - if (pkg_id == -1) { - struct cpumask *(*mask)(int) = cpu_sibling_mask; - - /* -* Copy the sibling mask into core sibling mask and -* mark any CPUs on the same chip as this CPU. -*/ - if (shared_caches) - mask = cpu_l2_cache_mask; - - for_each_cpu(i, mask(cpu)) - set_cpus_related(cpu, i, cpu_core_mask); - - return; - } - - for_each_cpu(i, cpu_online_mask) - if (get_physical_package_id(i) == pkg_id) - set_cpus_related(cpu, i, cpu_core_mask); } /* Activate a secondary processor. */ -- 2.17.1
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
Fix to make it work where CPUs don't have an l2-cache element. >8-8<- >From b25d47b01b7195b1df19083a4043fa6a87a901a3 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Thu, 9 Jul 2020 13:33:38 +0530 Subject: [PATCH v5.2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling Current code assumes that cpumask of cpus sharing a l2-cache mask will always be a superset of cpu_sibling_mask. Let's stop that assumption. cpu_l2_cache_mask is a superset of cpu_sibling_mask if and only if shared_caches is set. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v1 -> v2: Set cpumask after verifying l2-cache. (Gautham) Changelog v5 -> v5.2: If cpu has no l2-cache set cpumask as per its sibling mask. (Michael Ellerman) arch/powerpc/kernel/smp.c | 43 +-- 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 9f4333d..168532e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1186,9 +1186,23 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) int i; l2_cache = cpu_to_l2cache(cpu); - if (!l2_cache) + if (!l2_cache) { + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; + + /* + * If no l2cache for this CPU, assume all siblings share the + * cache with this CPU.
+ */ + if (has_big_cores) + sibling_mask = cpu_smallcore_mask; + + for_each_cpu(i, sibling_mask(cpu)) + set_cpus_related(cpu, i, cpu_l2_cache_mask); + return false; + } + cpumask_set_cpu(cpu, mask_fn(cpu)); for_each_cpu(i, cpu_online_mask) { /* * when updating the marks the current CPU has not been marked @@ -1271,29 +1285,30 @@ static void add_cpu_to_masks(int cpu) * add it to it's own thread sibling mask. */ cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); for (i = first_thread; i < first_thread + threads_per_core; i++) if (cpu_online(i)) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - /* - * Copy the thread sibling mask into the cache sibling mask - * and mark any CPUs that share an L2 with this CPU. - */ - for_each_cpu(i, cpu_sibling_mask(cpu)) - set_cpus_related(cpu, i, cpu_l2_cache_mask); update_mask_by_l2(cpu, cpu_l2_cache_mask); - /* - * Copy the cache sibling mask into core sibling mask and mark - * any CPUs on the same chip as this CPU. - */ - for_each_cpu(i, cpu_l2_cache_mask(cpu)) - set_cpus_related(cpu, i, cpu_core_mask); + if (pkg_id == -1) { + struct cpumask *(*mask)(int) = cpu_sibling_mask; + + /* + * Copy the sibling mask into core sibling mask and + * mark any CPUs on the same chip as this CPU. + */ + if (shared_caches) + mask = cpu_l2_cache_mask; + + for_each_cpu(i, mask(cpu)) + set_cpus_related(cpu, i, cpu_core_mask); - if (pkg_id == -1) return; + } for_each_cpu(i, cpu_online_mask) if (get_physical_package_id(i) == pkg_id) -- 2.17.1
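The invariant this v5.2 fix restores can be modelled with plain bitmask words (a hypothetical sketch; real kernel cpumasks and the sched-domain checks are more involved): every topology level must span the level below it, and when the device tree has no l2-cache node, falling back to the sibling mask keeps the CACHE level covering the SMT level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t mask_t;	/* stand-in for a kernel cpumask */

/* The scheduler requires e.g. SMT to be a subset of CACHE; when that is
 * violated it prints "BUG: arch topology borken". */
static bool is_subset(mask_t child, mask_t parent)
{
	return (child & ~parent) == 0;
}

/* Sketch of the fallback: with no l2-cache node, use the sibling mask
 * as the cache mask instead of leaving only the CPU's own bit set. */
static mask_t l2_cache_mask(bool has_l2_node, mask_t l2_from_dt, mask_t sibling_mask)
{
	return has_l2_node ? l2_from_dt : sibling_mask;
}
```

With an SMT4 sibling mask of 0xF, the broken case left the cache mask as a single bit, failing the subset test; the fallback makes it pass again.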
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
* Michael Ellerman [2020-09-13 11:46:41]: > Srikar Dronamraju writes: > > * Michael Ellerman [2020-09-11 21:55:23]: > > > >> Srikar Dronamraju writes: > >> > Current code assumes that cpumask of cpus sharing a l2-cache mask will > >> > always be a superset of cpu_sibling_mask. > >> > > >> > Lets stop that assumption. cpu_l2_cache_mask is a superset of > >> > cpu_sibling_mask if and only if shared_caches is set. > >> > >> I'm seeing oopses with this: > >> > > The patch fixes qemu, and I don't see the crash on mambo, but I still > see: > [0.010536] smp: Bringing up secondary CPUs ... > [0.019189] smp: Brought up 2 nodes, 8 CPUs > [0.019210] numa: Node 0 CPUs: 0-3 > [0.019235] numa: Node 1 CPUs: 4-7 > [0.02] the CACHE domain not a subset of the MC domain > [0.024505] BUG: arch topology borken > [0.024527] the SMT domain not a subset of the CACHE domain > [0.024563] BUG: arch topology borken > [0.024584] the CACHE domain not a subset of the MC domain > [0.024645] BUG: arch topology borken > [0.024666] the SMT domain not a subset of the CACHE domain > [0.024702] BUG: arch topology borken > [0.024723] the CACHE domain not a subset of the MC domain > > That's the p9 mambo model, using skiboot.tcl from skiboot, with CPUS=2, > THREADS=4 and MAMBO_NUMA=1. > I was able to reproduce with qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \ -kernel build~/vmlinux \ -m 2G,slots=2,maxmem=4G \ -object memory-backend-ram,size=1G,id=m0 \ -object memory-backend-ram,size=1G,id=m1 \ -numa node,nodeid=0,memdev=m0 \ -numa node,nodeid=1,memdev=m1 \ -smp 8,threads=4,sockets=2,maxcpus=8 \ If the CPU doesn't have an l2-cache element, then the CPU not only has to set itself in cpu_l2_cache_mask but also its siblings. Otherwise the sibling mask ends up with 4 CPUs set while the cache domain has just one CPU set, leading to this BUG message. Patch follows this mail.
> Node layout is: > > [0.00] Early memory node ranges > [0.00] node 0: [mem 0x-0x] > [0.00] node 1: [mem 0x2000-0x2000] > [0.00] Initmem setup node 0 [mem > 0x-0x] > [0.00] On node 0 totalpages: 65536 > [0.00] Initmem setup node 1 [mem > 0x2000-0x2000] > [ 0.00] On node 1 totalpages: 65536 > > > There aren't any l2-cache properties in the device-tree under cpus. > > I'll try and have a closer look tonight. > > cheers -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
* Michael Ellerman [2020-09-11 21:55:23]: > Srikar Dronamraju writes: > > Current code assumes that cpumask of cpus sharing a l2-cache mask will > > always be a superset of cpu_sibling_mask. > > > > Lets stop that assumption. cpu_l2_cache_mask is a superset of > > cpu_sibling_mask if and only if shared_caches is set. > > I'm seeing oopses with this: > > [0.117392][T1] smp: Bringing up secondary CPUs ... > [0.156515][T1] smp: Brought up 2 nodes, 2 CPUs > [0.158265][T1] numa: Node 0 CPUs: 0 > [0.158520][T1] numa: Node 1 CPUs: 1 > [0.167453][T1] BUG: Unable to handle kernel data access on read at > 0x800041228298 > [0.167992][T1] Faulting instruction address: 0xc018c128 > [0.168817][T1] Oops: Kernel access of bad area, sig: 11 [#1] > [0.168964][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > [0.169417][T1] Modules linked in: > [0.170047][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 5.9.0-rc2-00095-g7430ad5aa700 #209 > [0.170305][T1] NIP: c018c128 LR: c018c0cc CTR: > c004dce0 > [0.170498][T1] REGS: c0007e343880 TRAP: 0380 Not tainted > (5.9.0-rc2-00095-g7430ad5aa700) > [0.170602][T1] MSR: 82009033 CR: > 4400 XER: > [0.170985][T1] CFAR: c018c288 IRQMASK: 0 > [0.170985][T1] GPR00: c0007e343b10 > c173e400 4000 > [0.170985][T1] GPR04: 0800 > 0800 > [0.170985][T1] GPR08: c122c298 > c0003fffc000 c0007fd05ce8 > [0.170985][T1] GPR12: c0007e0119f8 c193 > 8ade > [0.170985][T1] GPR16: c0007e3c0640 0917 > c0007e3c0658 0008 > [0.170985][T1] GPR20: c15d0bb8 8ade > c0f57400 c1817c28 > [0.170985][T1] GPR24: c176dc80 c0007e3c0890 > c0007e3cfe00 > [0.170985][T1] GPR28: c1772310 c0007e011900 > c0007e3c0800 0001 > [0.172750][T1] NIP [c018c128] build_sched_domains+0x808/0x14b0 > [0.172900][T1] LR [c018c0cc] build_sched_domains+0x7ac/0x14b0 > [0.173186][T1] Call Trace: > [0.173484][T1] [c0007e343b10] [c018bfe8] > build_sched_domains+0x6c8/0x14b0 (unreliable) > [0.173821][T1] [c0007e343c50] [c018dcdc] > sched_init_domains+0xec/0x130 > [0.174037][T1] [c0007e343ca0] [c10d59d8] 
> sched_init_smp+0x50/0xc4 > [0.174207][T1] [c0007e343cd0] [c10b45c4] > kernel_init_freeable+0x1b4/0x378 > [0.174378][T1] [c0007e343db0] [c00129fc] > kernel_init+0x24/0x158 > [0.174740][T1] [c0007e343e20] [c000d9d0] > ret_from_kernel_thread+0x5c/0x6c > [0.175050][T1] Instruction dump: > [0.175626][T1] 554905ee 71480040 7d2907b4 4182016c 2c29 3920006e > 913e002c 41820034 > [0.175841][T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> > f93e0080 7d404828 314a0001 > [0.178340][T1] ---[ end trace 6876b88dd1d4b3bb ]--- > [0.178512][T1] > [1.180458][T1] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x000b > > That's qemu: > > qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \ > -kernel build~/vmlinux \ > -m 2G,slots=2,maxmem=4G \ > -object memory-backend-ram,size=1G,id=m0 \ > -object memory-backend-ram,size=1G,id=m1 \ > -numa node,nodeid=0,memdev=m0 \ > -numa node,nodeid=1,memdev=m1 \ > -smp 2,sockets=2,maxcpus=2 \ > Thanks Michael for the report, for identifying the offending patch, and for giving an easy reproducer. That made my task easy. (My only problem was that all my PowerKVM hosts had an old compiler that refused to compile newer kernels.) So in this setup, the CPU doesn't have an l2-cache, and in that scenario we miss updating the l2-cache domain. Actually the initial patch had this exact code. However, it was my mistake: I should have reassessed it before making the changes suggested by Gautham. Patch below. Do let me know if you want me to send the patch separately. > > On mambo I get: > > [0.005069][T1] smp: Bringing up secondary CPUs ... > [0.011656][T1] smp: Brought up 2 nodes, 8 CPUs > [0.011682][T1] numa: Node 0 CPUs: 0-3 > [0.011709][T1] numa: Node 1 CPUs: 4-7 > [0.012015][T1] BUG: arch topology borken > [0.012040][T1] the SMT domain not a subset of the CACHE domain > [
Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
> > > > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote: > > > Something is wrong. In find_busiest_group(), we are checking if src has > > > higher load, however, in task_numa_find_cpu(), we are checking if dst > > > will have higher load after balancing. It seems it is not sensible to > > > check src. > > > It maybe cause wrong imbalance value, for example, if > > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and > > > src_running = env->src_stats.nr_running - 1 results in 1; > > > The current code is thinking imbalance as 0 since src_running is smaller > > > than 2. > > > This is inconsistent with load balancer. > > > I have observed behaviour similar to what Barry Song has documented, with a simple ebizzy run with fewer threads on a 2-node system ebizzy -t 6 -S 100 We see a couple of ebizzy threads moving back and forth between the 2 nodes because the numa balancer and the load balancer try to do the exact opposite. However with Barry's patch, a couple of tests regress heavily (any numa workload that has shared numa faults). For example: perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c I also don't understand the rationale behind checking for dst_running in the numa balancer path. This almost means no numa balancing in lightly loaded scenarios. So I agree with Mel that we should probably test more scenarios before we accept this patch. > > > > It checks the conditions if the move was to happen. Have you evaluated > > this for a NUMA balancing load and confirmed it a) balances properly and > > b) does not increase the scan rate trying to "fix" the problem? > > I think the original code was trying to check if the numa migration > would lead to new imbalance in load balancer. In case src is A, dst is B, and > both of them have nr_running as 2. A moves one task to B, then A > will have 1, B will have 3. In load balancer, A will try to pull task > from B since B's nr_running is larger than min_imbalance.
But the code > is saying imbalance=0 by finding A's nr_running is smaller than > min_imbalance. > > Will share more test data if you need. > > > > > -- > > Mel Gorman > > SUSE Labs > > Thanks > Barry -- Thanks and Regards Srikar Dronamraju
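To make the src-versus-dst disagreement concrete, here is a heavily condensed sketch of the small-imbalance check under discussion (my own simplification; the real logic in task_numa_find_cpu() takes more inputs, and the threshold of 2 mirrors the constant in Barry's example):

```c
#include <assert.h>

/* Condensed form of the check: treat a node with at most 2 running
 * tasks as having room, i.e. report no imbalance for it. */
static int numa_imbalance(int imbalance, int nr_running)
{
	if (nr_running <= 2)
		return 0;
	return imbalance;
}
```

In Barry's example, feeding it src_running = 1 hides an imbalance of 2, while feeding it dst_running = 3 reports it, which is exactly the inconsistency with the load balancer being debated.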
Re: [RFC v2] perf/core: Fixes hung issue on perf stat command during cpu hotplug
* Kajol Jain [2020-08-26 20:24:11]: > Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") > added assignment of ret value as -EAGAIN in case function > call to 'smp_call_function_single' fails. > For non-zero ret value, it did > 'ret = !ret ? data.ret : -EAGAIN;', which always > assigns -EAGAIN to ret and makes the second if condition useless. > > In scenarios like when executing a perf stat with --per-thread option, and > if any of the monitored CPUs goes offline, the 'smp_call_function_single' > function could return -ENXIO, and with the above check, > task_function_call hangs and increases CPU > usage (because of repeated 'smp_call_function_single()') > > Recreation scenario: > # perf stat -a --per-thread && (offline a CPU ) > > Patch here removes the ternary condition added as part of that > commit and adds a check for NULL and -EAGAIN. > > Fixes: 2ed6edd33a21("perf: Add cond_resched() to task_function_call()") > Signed-off-by: Kajol Jain > Reported-by: Srikar Dronamraju Tested-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH] sched/fair: Fix wrong cpu selecting from isolated domain
* xunlei [2020-08-25 10:11:24]: > On 2020/8/24 PM9:38, Srikar Dronamraju wrote: > > * Xunlei Pang [2020-08-24 20:30:19]: > > > >> We've met problems that occasionally tasks with full cpumask > >> (e.g. by putting it into a cpuset or setting to full affinity) > >> were migrated to our isolated cpus in production environment. > >> > >> After some analysis, we found that it is due to the current > >> select_idle_smt() not considering the sched_domain mask. > >> > >> Fix it by checking the valid domain mask in select_idle_smt(). > >> > >> Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve > >> select_idle_siblings()) > >> Reported-by: Wetp Zhang > >> Signed-off-by: Xunlei Pang > >> --- > >> kernel/sched/fair.c | 9 + > >> 1 file changed, 5 insertions(+), 4 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 1a68a05..fa942c4 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, > >> struct sched_domain *sd, int > >> /* > >> * Scan the local SMT mask for idle CPUs. > >> */ > >> -static int select_idle_smt(struct task_struct *p, int target) > >> +static int select_idle_smt(struct task_struct *p, struct sched_domain > >> *sd, int target) > >> { > >>int cpu; > >> > >> @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, > >> int target) > >>return -1; > >> > >>for_each_cpu(cpu, cpu_smt_mask(target)) { > >> - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) > >> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || > >> + !cpumask_test_cpu(cpu, sched_domain_span(sd))) > >>continue; > > > > Don't think this is right thing to do. What if this task had set a cpumask > > that doesn't cover all the cpus in this sched_domain_span(sd) ah, right I missed the 'or' part. 
> > It doesn't matter, without this patch, it selects an idle cpu from: > "cpu_smt_mask(target) and p->cpus_ptr" > > with this patch, it selects an idle cpu from: > "cpu_smt_mask(target) and p->cpus_ptr and sched_domain_span(sd)" > > > > > cpu_smt_mask(target) would already limit to the sched_domain_span(sd) so I > > am not sure how this can help? > > > > > > Here is an example: > CPU0 and CPU16 are hyper-thread pair, CPU16 is domain isolated. So its > sd_llc doesn't contain CPU16, and cpu_smt_mask(0) is 0 and 16. > > Then we have @target is 0, select_idle_smt() may return the isolated(and > idle) CPU16 without this patch. Okay. -- Thanks and Regards Srikar Dronamraju
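Xunlei's CPU0/CPU16 example can be reproduced with a simplified model of the fixed loop (bitmasks stand in for kernel cpumasks; pick_idle_smt() is an illustrative analogue of select_idle_smt(), not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mask_t;

/* A sibling qualifies only if the task's affinity mask AND the sched
 * domain span allow it; without the span test, an idle but isolated
 * sibling could be returned. */
static int pick_idle_smt(mask_t smt_mask, mask_t cpus_allowed,
			 mask_t sd_span, mask_t idle)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		mask_t bit = (mask_t)1 << cpu;

		if (!(smt_mask & bit))
			continue;
		if (!(cpus_allowed & bit) || !(sd_span & bit))
			continue;
		if (idle & bit)
			return cpu;
	}
	return -1;
}
```

With CPU0 and CPU16 as hyper-thread siblings, CPU16 isolated (outside the domain span) and idle, the span test makes the scan fail cleanly instead of handing the task the isolated CPU16.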
Re: [PATCH] sched/fair: Fix wrong cpu selecting from isolated domain
* Xunlei Pang [2020-08-24 20:30:19]: > We've met problems that occasionally tasks with full cpumask > (e.g. by putting it into a cpuset or setting to full affinity) > were migrated to our isolated cpus in production environment. > > After some analysis, we found that it is due to the current > select_idle_smt() not considering the sched_domain mask. > > Fix it by checking the valid domain mask in select_idle_smt(). > > Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) > Reported-by: Wetp Zhang > Signed-off-by: Xunlei Pang > --- > kernel/sched/fair.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 1a68a05..fa942c4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, > struct sched_domain *sd, int > /* > * Scan the local SMT mask for idle CPUs. > */ > -static int select_idle_smt(struct task_struct *p, int target) > +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, > int target) > { > int cpu; > > @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int > target) > return -1; > > for_each_cpu(cpu, cpu_smt_mask(target)) { > - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) > + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || > + !cpumask_test_cpu(cpu, sched_domain_span(sd))) > continue; Don't think this is right thing to do. What if this task had set a cpumask that doesn't cover all the cpus in this sched_domain_span(sd) cpu_smt_mask(target) would already limit to the sched_domain_span(sd) so I am not sure how this can help? -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
* Michal Hocko [2020-08-18 09:37:12]: > On Tue 18-08-20 09:32:52, David Hildenbrand wrote: > > On 12.08.20 08:01, Srikar Dronamraju wrote: > > > Hi Andrew, Michal, David > > > > > > * Andrew Morton [2020-08-06 21:32:11]: > > > > > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju > > >> wrote: > > >> > > >>>> The memory hotplug changes that somehow because you can hotremove numa > > >>>> nodes and therefore make the nodemask sparse but that is not a common > > >>>> case. I am not sure what would happen if a completely new node was > > >>>> added > > >>>> and its corresponding node was already used by the renumbered one > > >>>> though. It would likely conflate the two I am afraid. But I am not sure > > >>>> this is really possible with x86 and a lack of a bug report would > > >>>> suggest that nobody is doing that at least. > > >>>> > > >> So... do we merge this patch or not? Seems that the overall view is > > >> "risky but nobody is likely to do anything better any time soon"? > > > > > > Can we decide on this one way or the other? > > > > Hmm, not sure who's the person to decide. I tend to prefer doing the > > node renaming, handling this in ppc code; > > Agreed. That would be a safer option. Okay, will send arch specific v6 version. > -- > Michal Hocko > SUSE Labs -- Thanks and Regards Srikar Dronamraju
Re: [PATCH] uprobes: __replace_page() avoid BUG in munlock_vma_page()
* Hugh Dickins [2020-08-16 13:44:25]: > syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), > when called from uprobes __replace_page(). Which of many ways to fix it? > Settled on not calling when PageCompound (since Head and Tail are equals > in this context, PageCompound the usual check in uprobes.c, and the prior > use of FOLL_SPLIT_PMD will have cleared PageMlocked already). > > Reported-by: syzbot > Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") > Signed-off-by: Hugh Dickins > Cc: sta...@vger.kernel.org # v5.4+ > --- > This one is not a 5.9-rc regression, but still good to fix. > > kernel/events/uprobes.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- v5.9-rc/kernel/events/uprobes.c 2020-08-12 19:46:50.851196584 -0700 > +++ linux/kernel/events/uprobes.c 2020-08-16 13:18:35.292821674 -0700 > @@ -205,7 +205,7 @@ static int __replace_page(struct vm_area > try_to_free_swap(old_page); > page_vma_mapped_walk_done(); > > - if (vma->vm_flags & VM_LOCKED) > + if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page)) > munlock_vma_page(old_page); > put_page(old_page); > Looks good to me. Reviewed-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v3 1/2] perf bench numa: use numa_node_to_cpus() to bind tasks to nodes
* Alexander Gordeev [2020-08-13 13:32:48]: > It is currently assumed that each node contains at most > nr_cpus/nr_nodes CPUs and nodes' CPU ranges do not overlap. > That assumption is generally incorrect as there are archs > where a CPU number does not depend on to its node number. > > This update removes the described assumption by simply calling > numa_node_to_cpus() interface and using the returned mask for > binding CPUs to nodes. > > Also, variable types and names made consistent in functions > using cpumask. > > Cc: Satheesh Rajendran > Cc: Srikar Dronamraju > Cc: Naveen N. Rao > Cc: Balamuruhan S > Cc: Peter Zijlstra > Cc: Ingo Molnar > Cc: Arnaldo Carvalho de Melo > Cc: Mark Rutland > Cc: Alexander Shishkin > Cc: Jiri Olsa > Cc: Namhyung Kim > Signed-off-by: Alexander Gordeev > --- > @@ -310,13 +306,16 @@ static cpu_set_t bind_to_node(int target_node) > for (cpu = 0; cpu < g->p.nr_cpus; cpu++) > CPU_SET(cpu, ); > } else { > - int cpu_start = (target_node + 0) * cpus_per_node; > - int cpu_stop = (target_node + 1) * cpus_per_node; > - > - BUG_ON(cpu_stop > g->p.nr_cpus); > + struct bitmask *cpumask = numa_allocate_cpumask(); > > - for (cpu = cpu_start; cpu < cpu_stop; cpu++) > - CPU_SET(cpu, ); > + BUG_ON(!cpumask); > + if (!numa_node_to_cpus(target_node, cpumask)) { > + for (cpu = 0; cpu < (int)cpumask->size; cpu++) { > + if (numa_bitmask_isbitset(cpumask, cpu)) > + CPU_SET(cpu, ); > + } > + } > + numa_free_cpumask(cpumask); > } > > ret = sched_setaffinity(0, sizeof(mask), ); > -- > 1.8.3.1 > Looks good to me. Reviewed-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v3 1/2] perf bench numa: fix cpumask memory leak in node_has_cpus()
* Alexander Gordeev [2020-08-13 13:30:42]: > Couple numa_allocate_cpumask() and numa_free_cpumask() functions > > Signed-off-by: Alexander Gordeev > --- > tools/perf/bench/numa.c | 13 + > 1 file changed, 9 insertions(+), 4 deletions(-) > > diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c > index 31e2601..9066511 100644 > --- a/tools/perf/bench/numa.c > +++ b/tools/perf/bench/numa.c > @@ -248,16 +248,21 @@ static int is_node_present(int node) > static bool node_has_cpus(int node) > { > struct bitmask *cpu = numa_allocate_cpumask(); > + bool ret = false; /* fall back to nocpus */ > unsigned int i; > > - if (cpu && !numa_node_to_cpus(node, cpu)) { > + BUG_ON(!cpu); > + if (!numa_node_to_cpus(node, cpu)) { > for (i = 0; i < cpu->size; i++) { > - if (numa_bitmask_isbitset(cpu, i)) > - return true; > + if (numa_bitmask_isbitset(cpu, i)) { > + ret = true; > + break; > + } > } > } > + numa_free_cpumask(cpu); > > - return false; /* lets fall back to nocpus safely */ > + return ret; > } > > static cpu_set_t bind_to_cpu(int target_cpu) > -- > 1.8.3.1 > Looks good to me. Reviewed-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
Hi Andrew, Michal, David * Andrew Morton [2020-08-06 21:32:11]: > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju > wrote: > > > > The memory hotplug changes that somehow because you can hotremove numa > > > nodes and therefore make the nodemask sparse but that is not a common > > > case. I am not sure what would happen if a completely new node was added > > > and its corresponding node was already used by the renumbered one > > > though. It would likely conflate the two I am afraid. But I am not sure > > > this is really possible with x86 and a lack of a bug report would > > > suggest that nobody is doing that at least. > > > > > > > JFYI, > > Satheesh copied in this mailchain had opened a bug a year on crash with vcpu > > hotplug on memoryless node. > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202187 > > So... do we merge this patch or not? Seems that the overall view is > "risky but nobody is likely to do anything better any time soon"? Can we decide on this one way or the other? -- Thanks and Regards Srikar Dronamraju
[PATCH v5 02/10] powerpc/smp: Merge Power9 topology with Power topology
A new sched_domain_topology_level was added just for Power9. However, the same can be achieved by merging powerpc_topology with power9_topology, which makes the code simpler, especially when adding a new sched domain. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v1 -> v2: Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu) since cpu_smt_mask is only defined under CONFIG_SCHED_SMT arch/powerpc/kernel/smp.c | 25 +++-- 1 file changed, 3 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index edf94ca64eea..08da765b91f1 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier) } #ifdef CONFIG_SCHED_SMT -/* cpumask of CPUs with asymetric SMT dependancy */ +/* cpumask of CPUs with asymmetric SMT dependency */ static int powerpc_smt_flags(void) { int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; @@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void) } #endif -static struct sched_domain_topology_level powerpc_topology[] = { -#ifdef CONFIG_SCHED_SMT - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, -#endif - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, - { NULL, }, -}; - /* * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
* This topology makes it *much* cheaper to migrate tasks between adjacent cores @@ -1361,7 +1353,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu) } #endif -static struct sched_domain_topology_level power9_topology[] = { +static struct sched_domain_topology_level powerpc_topology[] = { #ifdef CONFIG_SCHED_SMT { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, #endif @@ -1386,21 +1378,10 @@ void __init smp_cpus_done(unsigned int max_cpus) #ifdef CONFIG_SCHED_SMT if (has_big_cores) { pr_info("Big cores detected but using small core scheduling\n"); - power9_topology[0].mask = smallcore_smt_mask; powerpc_topology[0].mask = smallcore_smt_mask; } #endif - /* -* If any CPU detects that it's sharing a cache with another CPU then -* use the deeper topology that is aware of this sharing. -*/ - if (shared_caches) { - pr_info("Using shared cache scheduler topology\n"); - set_sched_topology(power9_topology); - } else { - pr_info("Using standard scheduler topology\n"); - set_sched_topology(powerpc_topology); - } + set_sched_topology(powerpc_topology); } #ifdef CONFIG_HOTPLUG_CPU -- 2.18.2
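The design choice in this patch, a single topology table whose SMT mask function is overridden at boot rather than two near-identical tables, can be sketched as follows (types, names and mask encodings here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long (*mask_fn)(int cpu);

/* Illustrative masks: 4 threads per core; "small core" mode exposes
 * only every other thread as a sibling. */
static unsigned long smt_mask(int cpu)       { return 0xFUL << (cpu & ~3); }
static unsigned long smallcore_mask(int cpu) { return 0x5UL << (cpu & ~3); }

struct topo_level {
	mask_fn mask;
	const char *name;
};

static struct topo_level topology[] = {
	{ smt_mask, "SMT" },
	{ NULL, "DIE" },
};

/* Patch the one shared table in place instead of selecting between two
 * arrays at boot, mirroring the smallcore override in the patch. */
static void fixup_topology(int has_big_cores)
{
	if (has_big_cores)
		topology[0].mask = smallcore_mask;
}
```

Keeping one table means a future level (such as the MC/coregroup level added later in this series) only has to be inserted once.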
[PATCH v5 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
Fix a build warning seen when CONFIG_NEED_MULTIPLE_NODES is not set: "error: 'numa_cpu_lookup_table' undeclared" Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v2 -> v3: Removed node caching part. Rewrote the Commit msg (Michael Ellerman) Renamed to powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES arch/powerpc/kernel/smp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 73199470c265..edf94ca64eea 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -860,6 +860,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus) GFP_KERNEL, cpu_to_node(cpu)); zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu), GFP_KERNEL, cpu_to_node(cpu)); +#ifdef CONFIG_NEED_MULTIPLE_NODES /* * numa_node_id() works after this. */ @@ -868,6 +869,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus) set_cpu_numa_mem(cpu, local_memory_node(numa_cpu_lookup_table[cpu])); } +#endif } /* Init the cpumasks so the boot CPU is related to itself */ -- 2.18.2
[PATCH v5 03/10] powerpc/smp: Move powerpc_topology above
Move the powerpc_topology description above in the file. This helps in using functions defined within this file without needing forward declarations. No other functional changes. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 104 +++--- 1 file changed, 52 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 08da765b91f1..39224a042468 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -818,6 +818,58 @@ static int init_cpu_l1_cache_map(int cpu) return err; } +static bool shared_caches; + +#ifdef CONFIG_SCHED_SMT +/* cpumask of CPUs with asymmetric SMT dependency */ +static int powerpc_smt_flags(void) +{ + int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; + + if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { + printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); + flags |= SD_ASYM_PACKING; + } + return flags; +} +#endif + +/* + * P9 has a slightly odd architecture where pairs of cores share an L2 cache. + * This topology makes it *much* cheaper to migrate tasks between adjacent cores + * since the migrated task remains cache hot. We want to take advantage of this + * at the scheduler level so an extra topology level is required. + */ +static int powerpc_shared_cache_flags(void) +{ + return SD_SHARE_PKG_RESOURCES; +} + +/* + * We can't just pass cpu_l2_cache_mask() directly because + * returns a non-const pointer and the compiler barfs on that.
+ */ +static const struct cpumask *shared_cache_mask(int cpu) +{ + return cpu_l2_cache_mask(cpu); +} + +#ifdef CONFIG_SCHED_SMT +static const struct cpumask *smallcore_smt_mask(int cpu) +{ + return cpu_smallcore_mask(cpu); +} +#endif + +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, +#endif + { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + static int init_big_cores(void) { int cpu; @@ -1247,8 +1299,6 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(cpu, i, cpu_core_mask); } -static bool shared_caches; - /* Activate a secondary processor. */ void start_secondary(void *unused) { @@ -1312,56 +1362,6 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } -#ifdef CONFIG_SCHED_SMT -/* cpumask of CPUs with asymmetric SMT dependency */ -static int powerpc_smt_flags(void) -{ - int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; - - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { - printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); - flags |= SD_ASYM_PACKING; - } - return flags; -} -#endif - -/* - * P9 has a slightly odd architecture where pairs of cores share an L2 cache. - * This topology makes it *much* cheaper to migrate tasks between adjacent cores - * since the migrated task remains cache hot. We want to take advantage of this - * at the scheduler level so an extra topology level is required. - */ -static int powerpc_shared_cache_flags(void) -{ - return SD_SHARE_PKG_RESOURCES; -} - -/* - * We can't just pass cpu_l2_cache_mask() directly because - * returns a non-const pointer and the compiler barfs on that. 
- */ -static const struct cpumask *shared_cache_mask(int cpu) -{ - return cpu_l2_cache_mask(cpu); -} - -#ifdef CONFIG_SCHED_SMT -static const struct cpumask *smallcore_smt_mask(int cpu) -{ - return cpu_smallcore_mask(cpu); -} -#endif - -static struct sched_domain_topology_level powerpc_topology[] = { -#ifdef CONFIG_SCHED_SMT - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, -#endif - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, - { NULL, }, -}; - void __init smp_cpus_done(unsigned int max_cpus) { /* -- 2.18.2
[PATCH v5 00/10] Coregroup support on Powerpc
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 ,,,0055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ,,,00ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 ,,,00aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ,,,00ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After patchset --

$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
MC
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 ,,,0055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ,,,00ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain4 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 ,,,00aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Nick Piggin
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Anton Blanchard
Cc: Gautham R Shenoy
Cc: Vaidyanathan Srinivasan
Cc: Jordan Niethe

Srikar Dronamraju (10):
  powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
  powerpc/smp: Merge Power9 topology with Power topology
  powerpc/smp: Move powerpc_topology above
  powerpc/smp: Move topology fixups into a new function
  powerpc/smp: Dont assume l2-cache to be superset of sibling
  powerpc/smp: Optimize start_secondary
  powerpc/numa: Detect support for coregroup
  powerpc/smp: Allocate cpumask only after searching thread group
  powerpc/smp: Create coregroup domain
  powerpc/smp: Implement cpu_to_coregroup_id

 arch/powerpc/include/asm/smp.h      |   1 +
 arch/powerpc/include/asm/topology.h |  10 ++
 arch/powerpc/kernel/smp.c           | 235 +---
 arch/powerpc/mm/numa.c              |  59 +--
 4 files changed, 198 insertions(+), 107 deletions(-)

--
2.18.2
[PATCH v5 09/10] powerpc/smp: Create coregroup domain
Add percpu coregroup maps and masks to create coregroup domain. If a coregroup doesn't exist, the coregroup domain will be degenerated in favour of SMT/CACHE domain. Do note this patch is only creating stubs for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation would be in a subsequent patch. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R. Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v4 ->v5: Updated commit msg to specify actual implementation of cpu_to_coregroup_id is in a subsequent patch (Michael Ellerman) Changelog v3 ->v4: if coregroup_support doesn't exist, update MC mask to the next smaller domain mask. Changelog v2 -> v3: Add optimization for mask updation under coregroup_support Changelog v1 -> v2: Moved coregroup topology fixup to fixup_topology (Gautham) arch/powerpc/include/asm/topology.h | 10 ++ arch/powerpc/kernel/smp.c | 54 - arch/powerpc/mm/numa.c | 5 +++ 3 files changed, 68 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index f0b6300e7dd3..6609174918ab 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) extern int find_and_online_cpu_nid(int cpu); +extern int cpu_to_coregroup_id(int cpu); #else static inline int find_and_online_cpu_nid(int cpu) { return 0; } +static inline int cpu_to_coregroup_id(int cpu) +{ +#ifdef CONFIG_SMP + return cpu_to_core_id(cpu); +#else + return 0; +#endif +} + #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */ #include diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 
0536ac06876b..566e3accac3e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -80,12 +80,22 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map); DEFINE_PER_CPU(cpumask_var_t, cpu_core_map); +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); EXPORT_SYMBOL_GPL(has_big_cores); +enum { +#ifdef CONFIG_SCHED_SMT + smt_idx, +#endif + cache_idx, + mc_idx, + die_idx, +}; + #define MAX_THREAD_LIST_SIZE 8 #define THREAD_GROUP_SHARE_L1 1 struct thread_groups { @@ -861,11 +871,27 @@ static const struct cpumask *smallcore_smt_mask(int cpu) } #endif +static struct cpumask *cpu_coregroup_mask(int cpu) +{ + return per_cpu(cpu_coregroup_map, cpu); +} + +static bool has_coregroup_support(void) +{ + return coregroup_enabled; +} + +static const struct cpumask *cpu_mc_mask(int cpu) +{ + return cpu_coregroup_mask(cpu); +} + static struct sched_domain_topology_level powerpc_topology[] = { #ifdef CONFIG_SCHED_SMT { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, #endif { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, + { cpu_mc_mask, SD_INIT_NAME(MC) }, { cpu_cpu_mask, SD_INIT_NAME(DIE) }, { NULL, }, }; @@ -912,6 +938,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus) GFP_KERNEL, cpu_to_node(cpu)); zalloc_cpumask_var_node(_cpu(cpu_core_map, cpu), GFP_KERNEL, cpu_to_node(cpu)); + if (has_coregroup_support()) + zalloc_cpumask_var_node(_cpu(cpu_coregroup_map, cpu), + GFP_KERNEL, cpu_to_node(cpu)); + #ifdef CONFIG_NEED_MULTIPLE_NODES /* * numa_node_id() works after this. 
@@ -929,6 +959,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid)); cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid)); + if (has_coregroup_support()) + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid)); + init_big_cores(); if (has_big_cores) { cpumask_set_cpu(boot_cpuid, @@ -1220,6 +1253,8 @@ static void remove_cpu_from_masks(int cpu) set_cpus_unrelated(cpu, i, cpu_sibling_mask); if (has_big_cores) set_cpus_unrelated(cpu, i, cpu_smallcore_mask); + if (has_coregroup_support()) + set_cpus_unrelate
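The changelog above notes that when no coregroup exists, the MC domain is degenerated in favour of the SMT/CACHE domain. The rule the scheduler applies can be illustrated with a small standalone model. This is a hypothetical userspace sketch, not kernel code: cpumasks are modeled as 64-bit words and the level names and spans are made up; it only demonstrates that a topology level whose span matches the level below it is redundant and dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One topology level as seen from a single CPU; the span is the set of
 * CPUs the level covers, modeled here as a 64-bit mask. */
struct topo_level {
	const char *name;
	uint64_t span;
};

/* Keep a level only when it covers more CPUs than the level below it;
 * a level whose span equals its child's adds nothing and is dropped. */
size_t degenerate(const struct topo_level *in, size_t n, const char **out)
{
	size_t kept = 0;
	uint64_t prev = 0;

	for (size_t i = 0; i < n; i++) {
		if (in[i].span != prev)
			out[kept++] = in[i].name;
		prev = in[i].span;
	}
	return kept;
}
```

With a distinct coregroup mask all four levels survive; when the coregroup mask collapses to the cache mask, the MC level disappears.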
[PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
Current code assumes that cpumask of cpus sharing a l2-cache mask will always be a superset of cpu_sibling_mask. Lets stop that assumption. cpu_l2_cache_mask is a superset of cpu_sibling_mask if and only if shared_caches is set. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v1 -> v2: Set cpumask after verifying l2-cache. (Gautham) arch/powerpc/kernel/smp.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index b13161a5ffc3..0c960ce3be42 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1188,6 +1188,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) if (!l2_cache) return false; + cpumask_set_cpu(cpu, mask_fn(cpu)); for_each_cpu(i, cpu_online_mask) { /* * when updating the marks the current CPU has not been marked @@ -1270,29 +1271,30 @@ static void add_cpu_to_masks(int cpu) * add it to it's own thread sibling mask. */ cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); for (i = first_thread; i < first_thread + threads_per_core; i++) if (cpu_online(i)) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - /* -* Copy the thread sibling mask into the cache sibling mask -* and mark any CPUs that share an L2 with this CPU. -*/ - for_each_cpu(i, cpu_sibling_mask(cpu)) - set_cpus_related(cpu, i, cpu_l2_cache_mask); update_mask_by_l2(cpu, cpu_l2_cache_mask); - /* -* Copy the cache sibling mask into core sibling mask and mark -* any CPUs on the same chip as this CPU. 
-*/ - for_each_cpu(i, cpu_l2_cache_mask(cpu)) - set_cpus_related(cpu, i, cpu_core_mask); + if (pkg_id == -1) { + struct cpumask *(*mask)(int) = cpu_sibling_mask; + + /* +* Copy the sibling mask into core sibling mask and +* mark any CPUs on the same chip as this CPU. +*/ + if (shared_caches) + mask = cpu_l2_cache_mask; + + for_each_cpu(i, mask(cpu)) + set_cpus_related(cpu, i, cpu_core_mask); - if (pkg_id == -1) return; + } for_each_cpu(i, cpu_online_mask) if (get_physical_package_id(i) == pkg_id) -- 2.18.2
[PATCH v5 08/10] powerpc/smp: Allocate cpumask only after searching thread group
If allocated earlier and the search fails, then cpu_l1_cache_map cpumask is unnecessarily cleared. However cpu_l1_cache_map can be allocated / cleared after we search thread group. Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask allocated by zalloc_cpumask_var_node is never freed. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v4 ->v5: Updated commit msg on why cpumask need not be freed. (Michael Ellerman) arch/powerpc/kernel/smp.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 7403fdcf3821..0536ac06876b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -789,10 +789,6 @@ static int init_cpu_l1_cache_map(int cpu) if (err) goto out; - zalloc_cpumask_var_node(_cpu(cpu_l1_cache_map, cpu), - GFP_KERNEL, - cpu_to_node(cpu)); - cpu_group_start = get_cpu_thread_group_start(cpu, ); if (unlikely(cpu_group_start == -1)) { @@ -801,6 +797,9 @@ static int init_cpu_l1_cache_map(int cpu) goto out; } + zalloc_cpumask_var_node(_cpu(cpu_l1_cache_map, cpu), + GFP_KERNEL, cpu_to_node(cpu)); + for (i = first_thread; i < first_thread + threads_per_core; i++) { int i_group_start = get_cpu_thread_group_start(i, ); -- 2.18.2
[PATCH v5 07/10] powerpc/numa: Detect support for coregroup
Add support for grouping cores based on the device-tree classification.

- The last domain in the associativity domains always refers to the core.
- If the primary reference domain happens to be the penultimate domain in
  the associativity-domains device-tree property, then there are no
  coregroups. However, if it's not the penultimate domain, then there are
  coregroups. There can be more than one coregroup. For now we would be
  interested in the last or the smallest coregroup, i.e. one sub-group
  per DIE.

Currently no firmware exposes this grouping. Hence allow the basis for
grouping to be abstract. Once the firmware starts using this grouping,
code would be added to detect the type of grouping and adjust the sd
domain flags accordingly.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Jordan Niethe
Cc: Vaidyanathan Srinivasan
Reviewed-by: Gautham R.
Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v4->v5: Updated commit msg with current abstract nature of the coregroups (Michael Ellerman) Changelog v1 -> v2: Explained Coregroup in commit msg (Michael Ellerman) arch/powerpc/include/asm/smp.h | 1 + arch/powerpc/kernel/smp.c | 1 + arch/powerpc/mm/numa.c | 34 +- 3 files changed, 23 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 49a25e2400f2..5bdc17a7049f 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -28,6 +28,7 @@ extern int boot_cpuid; extern int spinning_secondaries; extern u32 *cpu_to_phys_id; +extern bool coregroup_enabled; extern void cpu_die(void); extern int cpu_to_chip_id(int cpu); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 91cf5d05e7ec..7403fdcf3821 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 }; struct task_struct *secondary_current; bool has_big_cores; +bool coregroup_enabled; DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 2298899a0f0a..51cb672f113b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn) static void __init find_possible_nodes(void) { struct device_node *rtas; - u32 numnodes, i; + const __be32 *domains; + int prop_length, max_nodes; + u32 i; if (!numa_enabled) return; @@ -895,25 +897,31 @@ static void __init find_possible_nodes(void) if (!rtas) return; - if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains", - min_common_depth, )) { - /* -* ibm,current-associativity-domains is a fairly recent -* property. If it doesn't exist, then fallback on -* ibm,max-associativity-domains. 
Current denotes what the -* platform can support compared to max which denotes what the -* Hypervisor can support. -*/ - if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains", - min_common_depth, )) + /* +* ibm,current-associativity-domains is a fairly recent property. If +* it doesn't exist, then fallback on ibm,max-associativity-domains. +* Current denotes what the platform can support compared to max +* which denotes what the Hypervisor can support. +*/ + domains = of_get_property(rtas, "ibm,current-associativity-domains", + _length); + if (!domains) { + domains = of_get_property(rtas, "ibm,max-associativity-domains", + _length); + if (!domains) goto out; } - for (i = 0; i < numnodes; i++) { + max_nodes = of_read_number([min_common_depth], 1); + for (i = 0; i < max_nodes; i++) { if (!node_possible(i)) node_set(i, node_possible_map); } + prop_length /= sizeof(int); + if (prop_length > min_common_depth + 2) + coregroup_enabled = 1; + out: of_node_put(rtas); } -- 2.18.2
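The detection rule in the diff above boils down to a length check on the associativity-domains property: entries up to min_common_depth lead to the primary (NUMA) domain, the last entry identifies the core itself, and only when the property extends past that penultimate slot is there room for a coregroup level. A minimal standalone sketch of that rule (illustrative, not the kernel code, which works on big-endian device-tree cells):

```c
#include <assert.h>
#include <stdbool.h>

/* Coregroups exist only if the primary reference domain is NOT the
 * penultimate entry, i.e. the property holds more entries than
 * min_common_depth + 2 (primary domain index + core entry). */
bool has_coregroups(int nr_domain_entries, int min_common_depth)
{
	return nr_domain_entries > min_common_depth + 2;
}
```

For example, with min_common_depth of 4, a property with 7 entries leaves one slot between the primary domain and the core, so a coregroup exists; with 6 entries it does not.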
[PATCH v5 06/10] powerpc/smp: Optimize start_secondary
In start_secondary, even if shared_cache was already set, system does a redundant match for cpumask. This redundant check can be removed by checking if shared_cache is already set. While here, localize the sibling_mask variable to within the if condition. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Signed-off-by: Srikar Dronamraju --- Changelog v4 ->v5: Retain cache domain, no need for generalization (Michael Ellerman, Peter Zijlstra, Valentin Schneider, Gautham R. Shenoy) Changelog v1 -> v2: Moved shared_cache topology fixup to fixup_topology (Gautham) arch/powerpc/kernel/smp.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 0c960ce3be42..91cf5d05e7ec 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -851,7 +851,7 @@ static int powerpc_shared_cache_flags(void) */ static const struct cpumask *shared_cache_mask(int cpu) { - return cpu_l2_cache_mask(cpu); + return per_cpu(cpu_l2_cache_map, cpu); } #ifdef CONFIG_SCHED_SMT @@ -1305,7 +1305,6 @@ static void add_cpu_to_masks(int cpu) void start_secondary(void *unused) { unsigned int cpu = smp_processor_id(); - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; mmgrab(_mm); current->active_mm = _mm; @@ -1331,14 +1330,20 @@ void start_secondary(void *unused) /* Update topology CPU masks */ add_cpu_to_masks(cpu); - if (has_big_cores) - sibling_mask = cpu_smallcore_mask; /* * Check for any shared caches. Note that this must be done on a * per-core basis because one core in the pair might be disabled. 
*/ - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu))) - shared_caches = true; + if (!shared_caches) { + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; + struct cpumask *mask = cpu_l2_cache_mask(cpu); + + if (has_big_cores) + sibling_mask = cpu_smallcore_mask; + + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))) + shared_caches = true; + } set_numa_node(numa_cpu_lookup_table[cpu]); set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); -- 2.18.2
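The patch above switches the shared-caches test from a mask-equality check to a weight comparison, which stays correct once the L2 mask is no longer guaranteed to be a superset of the sibling mask. A standalone sketch of the comparison, with cpumasks modeled as 64-bit words (names and values here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* cpumask_weight() stand-in: population count of a 64-bit mask */
int mask_weight(uint64_t mask)
{
	int w = 0;

	while (mask) {
		mask &= mask - 1;	/* clear lowest set bit */
		w++;
	}
	return w;
}

/* A cache is shared beyond the core when the L2 mask spans more CPUs
 * than the sibling (SMT) mask. Comparing weights, rather than testing
 * equality, does not rely on one mask containing the other. */
bool detect_shared_caches(uint64_t l2_mask, uint64_t sibling_mask)
{
	return mask_weight(l2_mask) > mask_weight(sibling_mask);
}
```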
[PATCH v5 10/10] powerpc/smp: Implement cpu_to_coregroup_id
Lookup the coregroup id from the associativity array. If unable to
detect the coregroup id, fall back on the core id. This way, ensure the
sched_domain degenerates and an extra sched domain is not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However, if it's implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then the kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Jordan Niethe
Cc: Vaidyanathan Srinivasan
Reviewed-by: Gautham R. Shenoy
Signed-off-by: Srikar Dronamraju
---
Changelog v1 -> v2:
	Move coregroup_enabled before getting associativity (Gautham)

 arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0d57779e7942..8b3b3ec7fcc4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
 
 int cpu_to_coregroup_id(int cpu)
 {
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+	int index;
+
+	if (cpu < 0 || cpu > nr_cpu_ids)
+		return -1;
+
+	if (!coregroup_enabled)
+		goto out;
+
+	if (!firmware_has_feature(FW_FEATURE_VPHN))
+		goto out;
+
+	if (vphn_get_associativity(cpu, associativity))
+		goto out;
+
+	index = of_read_number(associativity, 1);
+	if (index > min_common_depth + 1)
+		return of_read_number(&associativity[index - 1], 1);
+
+out:
 	return cpu_to_core_id(cpu);
 }

--
2.18.2
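The lookup logic can be modeled in a few lines of standalone C. This is a simplified, hypothetical model (plain ints instead of big-endian `__be32` cells, and no VPHN call): the first element holds the number of entries that follow, and when the array reaches past the primary (NUMA) domain, the penultimate entry is the coregroup id; otherwise we fall back to the core id.

```c
#include <assert.h>

/* associativity[0] = number of entries that follow; the final entry
 * identifies the core, so the coregroup id (when present) sits in the
 * penultimate slot, associativity[index - 1]. */
int coregroup_id(const int *associativity, int min_common_depth,
		 int fallback_core_id)
{
	int index = associativity[0];

	if (index > min_common_depth + 1)
		return associativity[index - 1];
	return fallback_core_id;
}
```

With six entries and min_common_depth of 4, the penultimate entry is returned; bump min_common_depth to 5 and the fallback core id is used, letting the extra sched domain degenerate.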
[PATCH v5 04/10] powerpc/smp: Move topology fixups into a new function
Move topology fixup based on the platform attributes into its own function which is called just before set_sched_topology. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v2 -> v3: Rewrote changelog (Gautham) Renamed to powerpc/smp: Move topology fixups into a new function arch/powerpc/kernel/smp.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 39224a042468..b13161a5ffc3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1362,6 +1362,16 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } +static void fixup_topology(void) +{ +#ifdef CONFIG_SCHED_SMT + if (has_big_cores) { + pr_info("Big cores detected but using small core scheduling\n"); + powerpc_topology[0].mask = smallcore_smt_mask; + } +#endif +} + void __init smp_cpus_done(unsigned int max_cpus) { /* @@ -1375,12 +1385,7 @@ void __init smp_cpus_done(unsigned int max_cpus) dump_numa_cpu_topology(); -#ifdef CONFIG_SCHED_SMT - if (has_big_cores) { - pr_info("Big cores detected but using small core scheduling\n"); - powerpc_topology[0].mask = smallcore_smt_mask; - } -#endif + fixup_topology(); set_sched_topology(powerpc_topology); } -- 2.18.2
Re: [PATCH v2] sched/fair: ignore cache hotness for SMT migration
* Josh Don [2020-08-04 12:34:13]:

> SMT siblings share caches, so cache hotness should be irrelevant for
> cross-sibling migration.
>
> Proposed-by: Venkatesh Pallipadi
> Signed-off-by: Josh Don
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a0536add..abdb54e2339f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7402,6 +7402,10 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>  	if (unlikely(task_has_idle_policy(p)))
>  		return 0;
>
> +	/* SMT siblings share cache */
> +	if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> +		return 0;
> +

If this is for retaining cache hotness, should we look at
SD_SHARE_PKG_RESOURCES instead of SD_SHARE_CPUCAPACITY?

>  	/*
>  	 * Buddy candidates are cache hot:
>  	 */
> --
> 2.28.0.163.g6104cc2f0b6-goog

--
Thanks and Regards
Srikar Dronamraju
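The short-circuit being proposed can be sketched as a standalone model. The flag values and the recency check below are illustrative stand-ins, not the kernel's actual constants or task_hot() logic: within a capacity-sharing (SMT) domain the siblings share every cache level, so a migrating task can never be "hot" relative to a sibling, and the hotness check is skipped entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative flag values; the kernel's differ */
#define SD_SHARE_CPUCAPACITY	0x1	/* SMT siblings of one core */
#define SD_SHARE_PKG_RESOURCES	0x2	/* CPUs sharing the LLC */

/* Returns true when migrating the task away would lose cache warmth. */
bool task_hot_model(int sd_flags, long ns_since_last_ran,
		    long migration_cost_ns)
{
	if (sd_flags & SD_SHARE_CPUCAPACITY)
		return false;	/* siblings share all cache levels */
	/* stand-in for the kernel's recency/buddy checks */
	return ns_since_last_ran < migration_cost_ns;
}
```

Srikar's question amounts to: should the early-out test `SD_SHARE_PKG_RESOURCES` (same LLC) rather than `SD_SHARE_CPUCAPACITY` (same core), widening the set of domains where hotness is ignored.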
[PATCH v2 2/2] powerpc/topology: Override cpu_smt_mask
	1182315.7	60018.733

 1018433 - 1047037 : ##      3% (3)
 1047037 - 1075640 :         4% (4)
 1075640 - 1104244 :         4% (4)
 1104244 - 1132848 : ###     7% (7)
 1132848 - 1161452 :        17% (17)
 1161452 - 1190055 : ##     12% (12)
 1190055 - 1218659 : #      21% (21)
 1218659 - 1247263 : ##     23% (23)
 1247263 - 1275866 :         4% (4)
 1275866 - 1304470 :         4% (4)

with patch
    N       Min      Max       Median     Avg        Stddev
  100    967014    1292938    1208819    1185281.8   69815.851

  967014 -  999606 : ##      1% (1)
  999606 - 1032199 : ##      1% (1)
 1032199 - 1064791 :         6% (6)
 1064791 - 1097384 : ##      5% (5)
 1097384 - 1129976 : ##      9% (9)
 1129976 - 1162568 :        10% (10)
 1162568 - 1195161 : ##     13% (13)
 1195161 - 1227753 :        22% (22)
 1227753 - 1260346 : ##     25% (25)
 1260346 - 1292938 : ##      7% (7)

Observations: Not much changes, ebizzy is not much impacted.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
	Modified commit msg as per mailing list discussion.
Added performance numbers arch/powerpc/include/asm/cputhreads.h | 1 - arch/powerpc/include/asm/smp.h| 13 + 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h index deb99fd6e060..98c8bd155bf9 100644 --- a/arch/powerpc/include/asm/cputhreads.h +++ b/arch/powerpc/include/asm/cputhreads.h @@ -23,7 +23,6 @@ extern int threads_per_core; extern int threads_per_subcore; extern int threads_shift; -extern bool has_big_cores; extern cpumask_t threads_core_mask; #else #define threads_per_core 1 diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9cd0765633c5..bb06aa875131 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -131,6 +131,19 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu) extern int cpu_to_core_id(int cpu); +extern bool has_big_cores; + +#define cpu_smt_mask cpu_smt_mask +#ifdef CONFIG_SCHED_SMT +static inline const struct cpumask *cpu_smt_mask(int cpu) +{ + if (has_big_cores) + return per_cpu(cpu_smallcore_map, cpu); + + return per_cpu(cpu_sibling_map, cpu); +} +#endif /* CONFIG_SCHED_SMT */ + /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers. * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up -- 2.18.2
[PATCH v2 1/2] sched/topology: Allow archs to override cpu_smt_mask
cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
most architectures. One of the users of cpu_smt_mask() would be to
identify idle-cores.

On Power9, a pair of SMT4 cores can be presented by the firmware as an
SMT8 core for backward compatibility reasons. Powerpc allows LPARs to
be live migrated from Power8 to Power9. Do note Power8 had only SMT8
cores. Existing software which has been developed/configured for Power8
would expect to see an SMT8 core. Maintaining the illusion of an SMT8
core is a requirement to make that work.

In order to maintain the above userspace backward compatibility with
previous versions of the processor, from Power9 onwards there is an
option for the firmware to advertise a pair of SMT4 cores as a fused
core, aka an SMT8 core. On Power9 this pair shares the L2 cache as
well. However, from the scheduler's point of view, a core should be
determined by SMT4, since it's a completely independent unit of
compute. Hence allow the PowerPC architecture to override the default
cpu_smt_mask() to point to the SMT4 cores in SMT8 mode. This will
ensure the scheduler is always given the right information.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
	Update the commit msg based on the discussion in community esp with
	Peter Zijlstra and Michael Ellerman.
 include/linux/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 608fa4aadf0e..ad03df1cc266 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -198,7 +198,7 @@ static inline int cpu_to_mem(int cpu)
 #define topology_die_cpumask(cpu)	cpumask_of(cpu)
 #endif
 
-#ifdef CONFIG_SCHED_SMT
+#if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask)
 static inline const struct cpumask *cpu_smt_mask(int cpu)
 {
 	return topology_sibling_cpumask(cpu);

--
2.18.2
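The override mechanism in this patch is the classic header trick: the arch defines the function and then a macro with the same name, so the generic header's `!defined(cpu_smt_mask)` guard compiles out the default. A compilable miniature of the pattern, with the cpumask reduced to a plain int (the "mask" here is just the first CPU of the group, an illustrative stand-in):

```c
#include <assert.h>

/* "arch" side: provide the function, then define a same-named macro so
 * the generic header can detect that an override exists. */
int cpu_smt_mask(int cpu)
{
	return cpu & ~3;	/* first CPU of the SMT4 group */
}
#define cpu_smt_mask cpu_smt_mask

/* "generic" side: compiled out here because the macro is defined. */
#if !defined(cpu_smt_mask)
int cpu_smt_mask(int cpu)
{
	return cpu & ~7;	/* default: first CPU of the SMT8 group */
}
#endif
```

Because an object-like macro that expands to its own name is not re-expanded, every caller of `cpu_smt_mask()` resolves to the arch version while the preprocessor guard in the generic header still sees the symbol as defined.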
Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
* pet...@infradead.org [2020-08-06 10:54:29]: > On Thu, Aug 06, 2020 at 03:32:25PM +1000, Michael Ellerman wrote: > > > That brings with it a bunch of problems, such as existing software that > > has been developed/configured for Power8 and expects to see SMT8. > > > > We also allow LPARs to be live migrated from Power8 to Power9 (and back), so > > maintaining the illusion of SMT8 is considered a requirement to make that > > work. > > So how does that work if the kernel booted on P9 and demuxed the SMT8 > into 2xSMT4? If you migrate that state onto a P8 with actual SMT8 you're > toast again. > To add to what Michael already said, the reason we don't expose the demux of SMT8 into 2xSMT4 to userspace, is to make the userspace believe they are on a SMT8. When the kernel is live migrated from P8 to P9, till the time of reboot they would only have the older P8 topology. After reboot the kernel topology would change, but the userspace is made to believe that they are running on SMT8 core by way of keeping the sibling_cpumask at SMT8 core level. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
* pet...@infradead.org [2020-08-06 15:15:47]:

> > But my understanding is most LPARs don't get migrated back and forth,
> > they'll start life on a P8 and only get migrated to a P9 once when the
> > customer gets a P9. They might then run for a long time (months to
> > years) on the P9 in P8 compat mode, not because they ever want to
> > migrate back to a real P8, but because the software in the LPAR is still
> > expecting to be on a P8.
> >
> > I'm not a real expert on all the Enterprisey stuff though, so someone
> > else might be able to give us a better picture.
> >
> > But the point of mentioning the migration stuff was mainly just to
> > explain why we feel we need to present SMT8 to userspace even on P9.
>
> OK, fair enough. The patch wasn't particularly onerous, I was just
> wondering why etc..
>
> The case of starting on a P8 and being migrated to a P9 makes sense to
> me; in that case you'd like to rebuild your sched domains, but can't go
> about changing user visible topology information.
>
> I suppose:
>
> Acked-by: Peter Zijlstra (Intel)
>
> An updated Changelog that recaps some of this discussion might also be
> nice.

Okay, will surely do the needful.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
* pet...@infradead.org [2020-08-04 12:45:20]:

> On Tue, Aug 04, 2020 at 09:03:06AM +0530, Srikar Dronamraju wrote:
> > cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
> > most architectures. One of the users of cpu_smt_mask(), would be to
> > identify idle-cores. On Power9, a pair of cores can be presented by the
> > firmware as a big-core for backward compatibility reasons.
> >
> > In order to maintain userspace backward compatibility with previous
> > versions of processor, (since Power8 had SMT8 cores), Power9 onwards there
> > is option to the firmware to advertise a pair of SMT4 cores as a fused
> > cores (referred to as the big_core mode in the Linux Kernel). On Power9
> > this pair shares the L2 cache as well. However, from the scheduler's point
> > of view, a core should be determined by SMT4. The load-balancer already
> > does this. Hence allow PowerPc architecture to override the default
> > cpu_smt_mask() to point to the SMT4 cores in a big_core mode.
>
> I'm utterly confused.
>
> Why can't you set your regular siblings mask to the smt4 thing? Who
> cares about the compat stuff, I thought that was an LPAR/AIX thing.

There are no technical challenges in setting the sibling mask to SMT4.
This is for Linux running on PowerVM, where these Power9 boxes are
sold/marketed as X-core boxes (where X stands for the number of SMT8
cores). Since in the PowerVM world everything is in SMT8 mode, the
device-tree properties still mark the system as running 8-thread cores.
There are a number of utilities like ppc64_cpu that directly read from
the device-tree. They would get a core count and thread count which are
SMT8 based. If the sibling_mask is set to the small core, then the same
user, when looking at the output of lscpu and other utilities that read
sysfs, will start seeing 2x the number of cores compared to what he had
provisioned and what the device-tree-based utilities show. This can
confuse users.

So to keep the device-tree properties, utilities depending on the
device-tree, sysfs and utilities depending on sysfs on the same page,
userspace is only exposed to SMT8.

--
Thanks and Regards
Srikar Dronamraju
[PATCH 2/2] powerpc/topology: Override cpu_smt_mask
On Power9, a pair of cores can be presented by the firmware as a big-core for backward compatibility reasons, with 4 threads per (small) core and 8 threads per big-core. cpu_smt_mask() should generally point to the cpumask of the (small) core. However, in order to maintain userspace backward compatibility (with Power8 chips in the case of Power9) in enterprise Linux systems, topology_sibling_cpumask has to be set to the big-core. Hence override the default cpu_smt_mask() with a powerpc-specific version, allowing for better scheduling behaviour on Power.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Signed-off-by: Srikar Dronamraju
---
 arch/powerpc/include/asm/cputhreads.h |  1 -
 arch/powerpc/include/asm/smp.h        | 13 +++++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
index deb99fd6e060..98c8bd155bf9 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -23,7 +23,6 @@
 extern int threads_per_core;
 extern int threads_per_subcore;
 extern int threads_shift;
-extern bool has_big_cores;
 extern cpumask_t threads_core_mask;
 #else
 #define threads_per_core	1
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9cd0765633c5..d4bc28accb28 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -131,6 +131,19 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu)
 
 extern int cpu_to_core_id(int cpu);
 
+extern bool has_big_cores;
+
+#define cpu_smt_mask cpu_smt_mask
+#ifdef CONFIG_SCHED_SMT
+static inline const struct cpumask *cpu_smt_mask(int cpu)
+{
+	if (has_big_cores)
+		return per_cpu(cpu_smallcore_map, cpu);
+
+	return per_cpu(cpu_sibling_map, cpu);
+}
+#endif /* CONFIG_SCHED_SMT */
+
 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
-- 
2.18.2
[PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
cpu_smt_mask tracks topology_sibling_cpumask. This is fine for most architectures. One of the uses of cpu_smt_mask() is to identify idle cores. On Power9, a pair of cores can be presented by the firmware as a big-core for backward compatibility reasons. In order to maintain userspace backward compatibility with previous versions of the processor (Power8 had SMT8 cores), from Power9 onwards the firmware has the option to advertise a pair of SMT4 cores as a fused core (referred to as big_core mode in the Linux kernel). On Power9 this pair also shares the L2 cache. However, from the scheduler's point of view, a core should be the SMT4 small core, and the load-balancer already treats it that way. Hence allow the powerpc architecture to override the default cpu_smt_mask() to point to the SMT4 core in big_core mode.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Signed-off-by: Srikar Dronamraju
---
 include/linux/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 608fa4aadf0e..ad03df1cc266 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -198,7 +198,7 @@ static inline int cpu_to_mem(int cpu)
 #define topology_die_cpumask(cpu)	cpumask_of(cpu)
 #endif
 
-#ifdef CONFIG_SCHED_SMT
+#if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask)
 static inline const struct cpumask *cpu_smt_mask(int cpu)
 {
 	return topology_sibling_cpumask(cpu);
-- 
2.18.2
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
> > Also in the current P9 itself, two neighbouring core-pairs form a quad.
> > Cache latency within a quad is better than latency to a distant core-pair.
> > Cache latency within a core-pair is way better than latency within a quad.
> > So if we have only 4 threads running on a DIE, all of them accessing the
> > same cache-lines, then we could probably benefit if all the tasks were to
> > run within the quad aka MC/Coregroup.
> >
> 
> Did you test this? WRT load balance we do try to balance "load" over the
> different domain spans, so if you represent quads as their own MC domain,
> you would AFAICT end up spreading tasks over the quads (rather than packing
> them) when balancing at e.g. DIE level. The desired behaviour might be
> hackable with some more ASYM_PACKING, but I'm not sure I should be
> suggesting that :-)

Agree, load balance will try to spread the load across the quads. In my
hack, I was explicitly marking QUAD domains as !SD_PREFER_SIBLING, plus
relaxing a few load-spreading rules when SD_PREFER_SIBLING was not set. And
this was on a slightly older kernel (without Vincent's recent load-balance
overhaul).

> > I have found some benchmarks which are latency sensitive and benefit from
> > grouping at the quad level (using kernel hacks and not backed by firmware
> > changes). Gautham also found similar results in his experiments, but he
> > only used binding within the stock kernel.
> >
> 
> IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
> binding thing.
> 
> That's also where things get interesting (for me) because I experienced
> something similar on another arm64 platform (ThunderX1). This was more
> about cache bandwidth than cache latency, but IMO it's in the same bag of
> fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
> gave up on it given the TX1 was the only (arm64) platform where I could get
> both significant and reproducible results.
> 
> Now, if you folks are seeing this on completely different hardware and have
> "real" workloads that truly benefit from this kind of domain partitioning,
> this might be another incentive to try and sort of generalize this. That's
> outside the scope of your series, but your findings give me some hope!
> 
> I think what I had in mind back then was that if enough folks cared about
> it, we might get some bits added to the ACPI spec; something along the
> lines of proximity domains for the caches described in PPTT, IOW a cache
> distance matrix. I don't really know what it'll take to get there, but I
> figured I'd dump this in case someone's listening :-)

Very interesting.

> > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as the
> > MC domain need not be the LLC domain for Power.
> 
> From what I understood your MC domain does seem to map to LLC; but in any
> case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
> your changes your sd_llc is gonna be SMT, and that's not going to be a very
> big mask. IMO you do want to correctly reflect your LLC situation via this
> flag to make cpus_share_cache() work properly.

I detect whether the LLC is shared at BIGCORE; if it is, I dynamically
rename the domain to CACHE and enable SD_SHARE_PKG_RESOURCES in that
domain.

> [1]: https://linuxplumbersconf.org/event/4/contributions/484/

Thanks for the pointer.

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
* Michael Ellerman [2020-07-31 18:02:21]:

> Srikar Dronamraju writes:
> > Lookup the coregroup id from the associativity array.

Thanks Michael for all your comments and inputs.

> It's slightly strange that this is called in patch 9, but only properly
> implemented here in patch 10.
> 
> I'm not saying you have to squash them together, but it would be good if
> the change log for patch 9 mentioned that a subsequent commit will
> complete the implementation and how that affects the behaviour.

I was probably influenced by a few LKML community members who always add a
stub and implement the gory details in a subsequent patch. I will add a
note to the change log of patch 9 about the subsequent patches.

> cheers

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
* Michael Ellerman [2020-07-31 17:52:15]:

> Srikar Dronamraju writes:
> > If allocated earlier and the search fails, then the cpumask needs to be
> > freed. However, cpu_l1_cache_map can be allocated after we search the
> > thread group.
> 
> It's not freed anywhere AFAICS?

Yes, it's never freed. In fact we never even check whether
zalloc_cpumask_var_node fails. It's not just this cpumask: historically,
none of the other cpumasks in arch/powerpc/kernel/smp.c are freed or
checked either.

I did dig into this a bit and it appears that... (please do correct me if I
am wrong!!)

Powerpc uses cpumask_var_t for all of these percpu variables, and it
doesn't seem to enable CONFIG_CPUMASK_OFFSTACK even in the MAXSMP config.
So from include/linux/cpumask.h:

	typedef struct cpumask cpumask_var_t[1];

and zalloc_cpumask_var_node ends up being cpumask_clear.

So historically we seem to have assumed we are always !CPUMASK_OFFSTACK and
hence don't need to check the return value or free the mask. I would look
forward to your comments on how we should handle this going forward, but I
would keep this the same for this patchset.

One of the questions that I have: if we are most likely always in
!CONFIG_CPUMASK_OFFSTACK, should we migrate to cpumask_t for the percpu
variables? The reason being we already end up with a full NR_CPUS-bit
cpumask for each percpu cpumask variable rather than just a pointer to one.

> And even after this change there's still an error path that doesn't free
> it, isn't there?
> 
> cheers
> 
> > Cc: linuxppc-dev
> > Cc: LKML
> > Cc: Michael Ellerman
> > Cc: Nicholas Piggin
> > Cc: Anton Blanchard
> > Cc: Oliver O'Halloran
> > Cc: Nathan Lynch
> > Cc: Michael Neuling
> > Cc: Gautham R Shenoy
> > Cc: Ingo Molnar
> > Cc: Peter Zijlstra
> > Cc: Valentin Schneider
> > Cc: Jordan Niethe
> > Reviewed-by: Gautham R. Shenoy
> > Signed-off-by: Srikar Dronamraju
> > ---
> >  arch/powerpc/kernel/smp.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 698000c7f76f..dab96a1203ec 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
> >  	if (err)
> >  		goto out;
> >  
> > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -				GFP_KERNEL,
> > -				cpu_to_node(cpu));
> > -
> >  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> >  
> >  	if (unlikely(cpu_group_start == -1)) {
> > @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		goto out;
> >  	}
> >  
> > +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > +				GFP_KERNEL, cpu_to_node(cpu));
> > +
> >  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> >  		int i_group_start = get_cpu_thread_group_start(i, &tg);
> > 
> > -- 
> > 2.17.1

-- 
Thanks and Regards
Srikar Dronamraju