Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
* Gautham R. Shenoy [2021-04-02 11:07:54]:

> > To remedy this, this patch proposes that the LLC be moved to the MC
> > level, which is a group of cores in one half of the chip.
> >
> > SMT (SMT4) --> MC (Hemisphere) [LLC] --> DIE
>
> I think marking the Hemisphere as an LLC in a P10 scenario is a good
> idea. While there is no cache being shared at this level, this is still
> the level where some amount of cache-snooping takes place, and it is
> relatively faster to access the data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.

Peter, Valentin, Vincent, Mel, et al.,

On architectures where we have multiple levels of cache-access latencies
within a DIE (for example: one within the current LLC or SMT core,
another at MC or Hemisphere, and finally across hemispheres), do you
have any suggestions on how we could handle the same in the core
scheduler?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Vincent Guittot [2021-03-08 14:52:39]:

> On Fri, 26 Feb 2021 at 17:41, Srikar Dronamraju wrote:

Thanks Vincent for your review comments.

> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +	/* No need to compare, if both LLCs are fully loaded */
> > +	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
> > +		return nr_cpumask_bits;
> > +
> > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +		return this_cpu;
>
> Why have you chosen to favor this_cpu instead of prev_cpu, unlike for
> wake_idle ?

At this point, we know the waker running on this_cpu and the wakee,
which was running on prev_cpu, are affine to each other, and that
this_cpu and prev_cpu don't share cache. I chose to move them close to
each other to benefit from the cache sharing.

Based on feedback from Peter and Rik, I made the check more
conservative, i.e. tnr_busy <= tllc_size/smt_weight (where smt_weight
is the cpumask weight of the SMT domain for this_cpu), i.e. if we have
a free core in this LLC domain, choose this_cpu. select_idle_sibling()
should then pick an idle CPU/core/SMT within the LLC domain for
this_cpu.

Do you feel this may not be the correct option?

We are also experimenting with another option, where we call
prefer_idler_cpu after wa_weight, i.e.:

1. If wake_affine_weight chooses this_cpu, but the LLC of prev_cpu has
   an idle SMT/CPU while there is no idle SMT/CPU in this_cpu's LLC,
   then choose the idle SMT/CPU in prev_cpu.

2. If wake_affine_weight chooses nr_cpumask_bits (aka prev_cpu), but
   the LLC of this_cpu has an idle SMT/CPU while there is no idle
   SMT/CPU in prev_cpu's LLC, then choose the idle SMT/CPU in this_cpu.

> > +
> > +	/* For better wakeup latency, prefer idler LLC to cache affinity */
> > +	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
> > +	if (!diff)
> > +		return nr_cpumask_bits;
> > +	if (diff < 0)
> > +		return this_cpu;
> > +
> > +	return prev_cpu;
> > +}
> > +
> >  static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> > 		        int this_cpu, int prev_cpu, int sync)
> >  {
> > @@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >  	if (sched_feat(WA_IDLE))
> >  		target = wake_affine_idle(this_cpu, prev_cpu, sync);
> >
> > +	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
> > +	    !cpus_share_cache(this_cpu, prev_cpu))
> > +		target = prefer_idler_llc(this_cpu, prev_cpu, sync);
>
> could you use the same naming convention as the other functions ?
> wake_affine_llc as an example

I guess you meant s/prefer_idler_llc/wake_affine_llc/
Sure. I can modify.

> > +
> >  	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
> >  		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
> >
> > @@ -5884,8 +5918,11 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> >  	if (target == nr_cpumask_bits)
> >  		return prev_cpu;
> >
> > -	schedstat_inc(sd->ttwu_move_affine);
> > -	schedstat_inc(p->se.statistics.nr_wakeups_affine);
> > +	if (target == this_cpu) {
>
> How is this condition related to $subject ?

Before this change, wake_affine_weight and wake_affine_idle would
either return this_cpu or nr_cpumask_bits. Just before this check, we
check if target is nr_cpumask_bits and return prev_cpu. So the stats
were only incremented when target was this_cpu.

However, with prefer_idler_llc, we may return this_cpu, prev_cpu or
nr_cpumask_bits. Now we only need to update the stats when we have
chosen to migrate the task to this_cpu. Hence I had this check.

If we use the slightly lazier approach, which is to check for wa_weight
first before wa_idler_llc, then we may not need this change at all.

--
Thanks and Regards
Srikar Dronamraju
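The cross-multiplication in the quoted hunk compares the busy fractions of the two LLCs without any division. As a standalone illustration (hypothetical helper name, plain ints in place of the scheduler's per-CPU data, `NO_PREF` playing the role of `nr_cpumask_bits`), the same arithmetic looks like:

```c
#include <assert.h>

/* Sentinel mirroring nr_cpumask_bits: "no preference, fall through". */
#define NO_PREF (-1)

/*
 * Compare tnr_busy/tllc_size against pnr_busy/pllc_size by
 * cross-multiplying, so no division or floating point is needed.
 * A sync wakeup nudges the comparison toward this_cpu, because the
 * waker is about to sleep and stop counting as busy.
 */
static int idler_llc(int this_cpu, int prev_cpu,
                     int tnr_busy, int tllc_size,
                     int pnr_busy, int pllc_size, int sync)
{
    int diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;

    if (diff == 0)
        return NO_PREF;   /* equally loaded: let wa_weight decide */
    return diff < 0 ? this_cpu : prev_cpu;
}
```

With equal LLC sizes this reduces to comparing busy counts directly; with unequal sizes it correctly compares the busy *fractions*, which is why the sizes appear cross-multiplied.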
[PATCH v2] sched/fair: Prefer idle CPU to cache affinity
On POWER8 and POWER9, the last level cache (L2) has been at the level
of a group of 8 threads (SMT8 on POWER8, a big core comprising a pair
of SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking of the LLC domain, the probability of finding an idle CPU in
the LLC domain of the target is lower on POWER10 than on
previous-generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache"), benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning 6-8 big cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.
2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy
but only one of their LLCs has an idle CPU, the scheduler may end up
picking an LLC with no idle CPUs. To mitigate this, add a new step
between 1 and 2 where the scheduler compares idle CPUs in the waker and
previous LLCs and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.

                         5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
8CORE/1JVM  80USERS
  throughput             6651.6        6716.3 (0.97%)    6670 (0.27%)
  sys/user:time          59.75/23.86   61.77/24.55       56.34/22.65
8CORE/2JVM  80USERS
  throughput             6425.4        6446.8 (0.33%)    6627.9 (3.15%)
  sys/user:time          70.59/24.25   72.28/23.77       67.50/23.67
8CORE/4JVM  80USERS
  throughput             5355.3        5551.2 (3.66%)    5417.3 (1.58%)
  sys/user:time          76.74/21.79   76.54/22.73       74.77/21.86
8CORE/8JVM  80USERS
  throughput             4420.6        4553.3 (3.00%)    4486.2 (1.48%)
  sys/user:time          79.13/20.32   78.76/21.01       78.14/20.19

Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Parth Shah
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Rik van Riel
Co-developed-by: Gautham R Shenoy
Signed-off-by: Gautham R Shenoy
Co-developed-by: Parth Shah
Signed-off-by: Parth Shah
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
v1: http://lore.kernel.org/lkml/20210226164029.122432-1-sri...@linux.vnet.ibm.com/t/#u
- Make WA_WAKER default (Suggested by Rik)
- Make WA_WAKER check more conservative (Suggested by Rik / Peter)
- s/pllc_size/tllc_size while checking for busy case (Pointed out by Dietmar)
- Add rcu_read_lock and check for validity of shared domains

 kernel/sched/fair.c     | 57 +++--
 kernel/sched/features.h |  2 ++
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 4 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..492ba07e4f51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,52 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }

+#ifdef CONFIG_NO_HZ_COMMON
+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+	struct sched_domain_shared *tsds, *psds;
+	int pnr_busy, pllc_size, tnr_busy, tllc_size;
+	unsigned int smt_size = per_cpu(smt_size, this_cpu);
+	int diff;
+
+	rcu_read_lock();
+	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+	if (!tsds || !psds) {
+		rcu_read_unlock();
+		return nr_cpumask_bits;
+	}
+
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	pnr_busy = atomic_read(&psds->nr_busy_cpus);
+	rcu_read_unlock();
+
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+	pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+	/* No need to compare, if both LLCs are fully loaded */
+	if (pnr_busy == pllc_size && tnr_busy == tllc_size)
+		return nr_cpumask_bits;
+
+	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size / smt_size)
+		return this_cpu;
+
+	/* For better wakeup latency, prefer idler LLC to cache affinity */
+	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+	if (!dif
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-02 10:10:32]:

> On Tue, Mar 02, 2021 at 01:09:46PM +0530, Srikar Dronamraju wrote:
> > > Oh, could be, I didn't grep :/ We could have core code keep track
> > > of the smt count I suppose.
> >
> > Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?
>
> cpumask_weight() is potentially super expensive. With CPUMASK_OFFSTACK
> you get at least one more cache miss and then the bitmap might be
> really long.
>
> Best to compute the results once and use it later.

Oh okay. Noted.

--
Thanks and Regards
Srikar Dronamraju
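Peter's "compute once, use later" point can be illustrated outside the kernel. This is only a sketch with made-up names (plain arrays stand in for per-cpu variables, an `unsigned long` for a cpumask): the sibling count is popcounted once at topology-init time, so wakeup-path code never pays for the bit scan.

```c
#include <assert.h>

/*
 * Instead of calling an O(mask-size) popcount like cpumask_weight()
 * on every wakeup, cache the SMT sibling count per CPU once, at
 * topology-init time.
 */
#define NR_CPUS 8

static unsigned long smt_mask[NR_CPUS]; /* sibling bitmask per CPU */
static int smt_size[NR_CPUS];           /* cached weight            */

static int weight(unsigned long mask)
{
    int w = 0;

    for (; mask; mask &= mask - 1)      /* clear lowest set bit */
        w++;
    return w;
}

static void topology_init(void)
{
    /* In this sketch, CPUs {0,1}, {2,3}, ... are SMT2 siblings. */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        smt_mask[cpu] = 3UL << (cpu & ~1);
        smt_size[cpu] = weight(smt_mask[cpu]);  /* paid once */
    }
}
```

Hot paths then read `smt_size[cpu]` directly, which is a single load rather than a walk over a possibly off-stack bitmap.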
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Dietmar Eggemann [2021-03-02 10:53:06]:

> On 26/02/2021 17:40, Srikar Dronamraju wrote:
>
> [...]
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a8bd7b13634..d49bfcdc4a19 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> >  	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> >  }
> >
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
> > +	/* No need to compare, if both LLCs are fully loaded */
> > +	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
>                                                ^
> shouldn't this be tllc_size ?

Yes, thanks for pointing that out.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 18:18:28]:

> On Mon, Mar 01, 2021 at 10:36:01PM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra [2021-03-01 16:44:42]:
> >
> > > On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > > > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> > > >
> > > > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > > > something like that?
> > >
> > > How about:
> > >
> > > 	tnr_busy < tllc_size / topology_max_smt_threads()
> > >
> > > ?
> >
> > Isn't topology_max_smt_threads only for x86 as of today?
> > Or am I missing out?
>
> Oh, could be, I didn't grep :/ We could have core code keep track of
> the smt count I suppose.

Could we use cpumask_weight(cpu_smt_mask(this_cpu)) instead?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:40:33]:

> On Fri, Feb 26, 2021 at 10:10:29PM +0530, Srikar Dronamraju wrote:
> > +static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
> > +{
> > +	struct sched_domain_shared *tsds, *psds;
> > +	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
> > +
> > +	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
> > +	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
> > +	tllc_size = per_cpu(sd_llc_size, this_cpu);
> > +
> > +	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
> > +	pnr_busy = atomic_read(&psds->nr_busy_cpus);
> > +	pllc_size = per_cpu(sd_llc_size, prev_cpu);
> > +
>
> nr_busy_cpus is NO_HZ_COMMON, so this code that consumes it should be
> too.

Thanks Peter, will take care of this along with other changes,
including calling it within rcu_read_lock and checking tsds and psds
after rcu_dereference.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Peter Zijlstra [2021-03-01 16:44:42]:

> On Sat, Feb 27, 2021 at 02:56:07PM -0500, Rik van Riel wrote:
> > On Fri, 2021-02-26 at 22:10 +0530, Srikar Dronamraju wrote:
> >
> > > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > > +		return this_cpu;
> >
> > I wonder if we need to use a slightly lower threshold on
> > very large LLCs, both to account for the fact that the
> > select_idle_cpu code may not find the single idle CPU
> > among a dozen busy ones, or because on a system with
> > hyperthreading we may often be better off picking another
> > LLC for HT contention issues?
> >
> > Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> > something like that?
>
> How about:
>
> 	tnr_busy < tllc_size / topology_max_smt_threads()
>
> ?

Isn't topology_max_smt_threads only for x86 as of today?
Or am I missing out?

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/fair: Prefer idle CPU to cache affinity
* Rik van Riel [2021-02-27 14:56:07]:

> > In the current situation, where the waker and previous CPUs are busy
> > but only one of their LLCs has an idle CPU, the scheduler may end up
> > picking an LLC with no idle CPUs. To mitigate this, add a new step
> > between 1 and 2 where the scheduler compares idle CPUs in the waker
> > and previous LLCs and picks the appropriate one.
>
> I like that idea a lot.

Thanks.

> That could also solve some of the issues sometimes observed on
> multi-node x86 systems, and probably on the newer AMD chips with
> several LLCs on chip.

Okay.

> > +	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
> > +		return this_cpu;
>
> I wonder if we need to use a slightly lower threshold on
> very large LLCs, both to account for the fact that the
> select_idle_cpu code may not find the single idle CPU
> among a dozen busy ones, or because on a system with
> hyperthreading we may often be better off picking another
> LLC for HT contention issues?
>
> Maybe we could use "tnr_busy * 4 < tllc_size * 3" or
> something like that?
>
> That way we will only try to find the last 5 idle CPUs
> in a 22 CPU LLC if the other LLC also has fewer than 6
> idle cores.
>
> That might increase our chances of finding an idle CPU
> with SIS_PROP enabled, and might allow WA_WAKER to be
> true by default.

Agree, we need to be conservative, especially if we want to make
WA_WAKER on by default. I would still like to hear from other people
whether they think it's okay to enable it by default. I wonder if
enabling it by default can cause some load imbalances, leading to more
active load balancing down the line. I haven't benchmarked with
WA_WAKER enabled.

Thanks Rik for your inputs.

--
Thanks and Regards
Srikar Dronamraju
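Both thresholds discussed above are occupancy cutoffs expressed in integer arithmetic. A small sketch of the two variants (hypothetical function names; the kernel versions operate on per-CPU scheduler state rather than plain ints):

```c
#include <assert.h>
#include <stdbool.h>

/* Rik's suggestion: "fewer than 75% of the LLC's CPUs are busy",
 * written as a cross-multiplication so no floating point is needed. */
static bool under_three_quarters(int nr_busy, int llc_size)
{
    return nr_busy * 4 < llc_size * 3;
}

/*
 * Stricter variant discussed elsewhere in the thread: treat this LLC
 * as a candidate only while it still has a whole idle core, i.e.
 * fewer busy CPUs than cores (llc_size / SMT threads per core).
 */
static bool has_free_core(int nr_busy, int llc_size, int smt_threads)
{
    return nr_busy < llc_size / smt_threads;
}
```

For Rik's 22-CPU example, `under_three_quarters(16, 22)` still passes (64 < 66) while `under_three_quarters(17, 22)` does not, which is the headroom he describes.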
[PATCH] sched/fair: Prefer idle CPU to cache affinity
On POWER8 and POWER9, the last level cache (L2) has been at the level
of a group of 8 threads (SMT8 on POWER8, a big core comprising a pair
of SMT4 cores on POWER9). However, on POWER10, the LLC domain is at the
level of a group of SMT4 threads within the SMT8 core. Due to the
shrinking of the LLC domain, the probability of finding an idle CPU in
the LLC domain of the target is lower on POWER10 than on
previous-generation processors.

With commit 9538abee18cc ("powerpc/smp: Add support detecting
thread-groups sharing L2 cache"), benchmarks such as Daytrader
(https://github.com/WASdev/sample.daytrader7) show a drop in throughput
in a configuration consisting of 1 JVM spanning 6-8 big cores on
POWER10. Analysis showed that this was because more wakeups were
happening on busy CPUs when the utilization was 60-70%. This drop in
throughput also shows up as a drop in CPU utilization. However, most
other benchmarks benefit from detecting the thread-groups that share L2
cache.

Current order of preference to pick an LLC while waking a wake-affine
task:
1. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is idle.
2. Between the waker CPU and previous CPU, prefer the LLC of the CPU
   that is less heavily loaded.

In the current situation, where the waker and previous CPUs are busy
but only one of their LLCs has an idle CPU, the scheduler may end up
picking an LLC with no idle CPUs. To mitigate this, add a new step
between 1 and 2 where the scheduler compares idle CPUs in the waker and
previous LLCs and picks the appropriate one.

The other alternative is to search for an idle CPU in the other LLC if
the current select_idle_sibling() is unable to find an idle CPU in the
preferred LLC. But that may increase the time to select a CPU.

                         5.11-rc6      5.11-rc6+revert   5.11-rc6+patch
8CORE/1JVM  80USERS
  throughput             6651.6        6716.3 (0.97%)    6940 (4.34%)
  sys/user:time          59.75/23.86   61.77/24.55       60/24
8CORE/2JVM  80USERS
  throughput             6425.4        6446.8 (0.33%)    6473.2 (0.74%)
  sys/user:time          70.59/24.25   72.28/23.77       70/24
8CORE/4JVM  80USERS
  throughput             5355.3        5551.2 (3.66%)    5586.6 (4.32%)
  sys/user:time          76.74/21.79   76.54/22.73       76/22
8CORE/8JVM  80USERS
  throughput             4420.6        4553.3 (3.00%)    4405.8 (-0.33%)
  sys/user:time          79.13/20.32   78.76/21.01       79/20

Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Parth Shah
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Co-developed-by: Gautham R Shenoy
Signed-off-by: Gautham R Shenoy
Co-developed-by: Parth Shah
Signed-off-by: Parth Shah
Signed-off-by: Srikar Dronamraju
---
 kernel/sched/fair.c     | 41 +++--
 kernel/sched/features.h |  2 ++
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a8bd7b13634..d49bfcdc4a19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5869,6 +5869,36 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }

+static int prefer_idler_llc(int this_cpu, int prev_cpu, int sync)
+{
+	struct sched_domain_shared *tsds, *psds;
+	int pnr_busy, pllc_size, tnr_busy, tllc_size, diff;
+
+	tsds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	tnr_busy = atomic_read(&tsds->nr_busy_cpus);
+	tllc_size = per_cpu(sd_llc_size, this_cpu);
+
+	psds = rcu_dereference(per_cpu(sd_llc_shared, prev_cpu));
+	pnr_busy = atomic_read(&psds->nr_busy_cpus);
+	pllc_size = per_cpu(sd_llc_size, prev_cpu);
+
+	/* No need to compare, if both LLCs are fully loaded */
+	if (pnr_busy == pllc_size && tnr_busy == pllc_size)
+		return nr_cpumask_bits;
+
+	if (sched_feat(WA_WAKER) && tnr_busy < tllc_size)
+		return this_cpu;
+
+	/* For better wakeup latency, prefer idler LLC to cache affinity */
+	diff = tnr_busy * pllc_size - sync - pnr_busy * tllc_size;
+	if (!diff)
+		return nr_cpumask_bits;
+	if (diff < 0)
+		return this_cpu;
+
+	return prev_cpu;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5877,6 +5907,10 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (sched_feat(WA_IDLE))
 		target = wake_affine_idle(this_cpu, prev_cpu, sync);

+	if (sched_feat(WA_IDLER_LLC) && target == nr_cpumask_bits &&
+	    !cpus_share_cache(this_cpu, prev_cpu))
+
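Stepping back from the diff (which is truncated above), the order of preference described in the changelog can be sketched as a small decision cascade. This is only an illustration: the three stub helpers stand in for `wake_affine_idle()`, `prefer_idler_llc()` and `wake_affine_weight()`, and `NO_TARGET` plays the role of `nr_cpumask_bits`.

```c
#include <assert.h>

#define NO_TARGET (-1)  /* plays the role of nr_cpumask_bits */

/*
 * Stub pickers: each returns a chosen CPU or NO_TARGET to mean
 * "no opinion, fall through to the next step".
 */
static int idle_pick   = NO_TARGET;
static int llc_pick    = NO_TARGET;
static int weight_pick = NO_TARGET;

static int wa_idle(void)   { return idle_pick; }
static int wa_llc(void)    { return llc_pick; }
static int wa_weight(void) { return weight_pick; }

/*
 * 1. idle CPU first; 1.5 (the new step) idler LLC, only when the two
 * CPUs do not already share cache; 2. load comparison; finally fall
 * back to prev_cpu.
 */
static int wake_affine(int this_cpu, int prev_cpu, int share_cache)
{
    int target = wa_idle();

    if (target == NO_TARGET && !share_cache)
        target = wa_llc();
    if (target == NO_TARGET)
        target = wa_weight();
    return target == NO_TARGET ? prev_cpu : target;
}
```

The key property is that the new step only runs when step 1 found nothing and the CPUs are in different LLCs, so the common same-LLC wakeup path is unchanged.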
Re: [PATCH] perf bench numa: Fix the condition checks for max number of numa nodes
* Athira Rajeev [2021-02-25 11:50:02]:

> In systems having higher node numbers available, like node > 255, perf
> numa bench will fail with SIGABRT.
>
> <<>>
> perf: bench/numa.c:1416: init: Assertion `!(g->p.nr_nodes > 64 ||
> g->p.nr_nodes < 0)' failed.
> Aborted (core dumped)
> <<>>

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/topology: Switch to sched_debug() for conditional sched domain printk
* Yue Hu [2021-02-03 18:10:19]:

> On Wed, 3 Feb 2021 15:22:56 +0530
> Srikar Dronamraju wrote:
>
> > * Yue Hu [2021-02-03 12:20:10]:
> >
> > sched_debug() would only be present in CONFIG_SCHED_DEBUG. Right?
> > In which case there would be a build failure with your change in a
> > !CONFIG_SCHED_DEBUG config.
> >
> > Or am I missing something?
>
> sched_debug() is also defined for !CONFIG_SCHED_DEBUG, as below:
>
> static inline bool sched_debug(void)
> {
> 	return false;
> }

Ah, right. Somehow I missed this.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH] sched/topology: Switch to sched_debug() for conditional sched domain printk
* Yue Hu [2021-02-03 12:20:10]:

> From: Yue Hu
>
> Currently, the macro sched_debug_enabled has the same function as
> sched_debug(), with return false for !SCHED_DEBUG. And sched_debug()
> is a wrapper of the variable sched_debug_enabled for SCHED_DEBUG. We
> can use sched_debug() for all cases. So, let's remove the unnecessary
> macro, and also use sched_debug() in sched_domain_debug() for code
> consistency.
>
> Signed-off-by: Yue Hu
> ---
> kernel/sched/topology.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 5d3675c..402138c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -131,7 +131,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
>  {
>  	int level = 0;
>
> -	if (!sched_debug_enabled)
> +	if (!sched_debug())
>  		return;
>

sched_debug() would only be present in CONFIG_SCHED_DEBUG. Right?
In which case there would be a build failure with your change in a
!CONFIG_SCHED_DEBUG config.

Or am I missing something?

>  	if (!sd) {
> @@ -152,7 +152,6 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
>  }
>  #else /* !CONFIG_SCHED_DEBUG */
>
> -# define sched_debug_enabled 0
>  # define sched_domain_debug(sd, cpu) do { } while (0)
>  static inline bool sched_debug(void)
>  {
> @@ -2113,7 +2112,7 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>  	if (has_asym)
>  		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
>
> -	if (rq && sched_debug_enabled) {
> +	if (rq && sched_debug()) {

Same as above.

>  		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
>  			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>  	}
> --
> 1.9.1

--
Thanks and Regards
Srikar Dronamraju
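The reason the `!CONFIG_SCHED_DEBUG` build still compiles, as clarified in the follow-up mail, is the static-inline stub pattern. A minimal standalone sketch of that pattern (made-up names; `MY_DEBUG` mimics `CONFIG_SCHED_DEBUG`):

```c
#include <assert.h>

/*
 * One real implementation under the config option, one static inline
 * stub without it, so call sites need no #ifdef guards and the
 * compiler can drop the dead branch entirely in non-debug builds.
 */
#define MY_DEBUG 1

#if MY_DEBUG
static int debug_enabled = 1;  /* runtime toggle, like sched_debug_enabled */
static inline int my_debug(void) { return debug_enabled; }
#else
static inline int my_debug(void) { return 0; }  /* stub: constant false */
#endif

static int dump_count;

static void domain_debug(void)
{
    if (!my_debug())
        return;        /* compiles to nothing when the stub is used */
    dump_count++;      /* stands in for the diagnostic printout */
}
```

Because the stub returns a compile-time constant, `domain_debug()` shrinks to an empty function in the non-debug configuration, which is why replacing the `sched_debug_enabled` macro with `sched_debug()` is safe in both configs.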
Re: [PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties
* Gautham R. Shenoy [2020-12-10 16:08:55]:

> From: "Gautham R. Shenoy"
>
> The "ibm,thread-groups" device-tree property is an array that is used
> to indicate if groups of threads within a core share certain
> properties. It provides details of which property is being shared by
> which groups of threads. This array can encode information about
> multiple properties being shared by different thread-groups within
> the core.

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
* Gautham R. Shenoy [2020-12-10 16:08:59]:

> From: "Gautham R. Shenoy"
>
> On POWER platforms where only some groups of threads within a core
> share the L2-cache (indicated by the ibm,thread-groups device-tree
> property), we currently print the incorrect shared_cpu_map/list for
> L2-cache in the sysfs.
>
> This patch reports the correct shared_cpu_map/list on such platforms.
>
> Example:
> On a platform with "ibm,thread-groups" set to
> 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004
> 0002 0004 0006 0001 0003 0005 0007

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache
* Gautham R. Shenoy [2020-12-10 16:08:58]:

> From: "Gautham R. Shenoy"
>
> On POWER systems, groups of threads within a core sharing the
> L2-cache can be indicated by the "ibm,thread-groups" property array
> with the identifier "2".
>
> This patch adds support for detecting this and, when present,
> populating the cpu_l2_cache_mask of every CPU with the core-siblings
> which share L2 with the CPU, as specified by the "ibm,thread-groups"
> property array.
>
> On a platform with the following "ibm,thread-group" configuration
> 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004
> 0002 0004 00000006 0001 0003 0005 0007

Looks good to me.

Reviewed-by: Srikar Dronamraju

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
* Gautham R. Shenoy [2020-12-10 16:08:56]:

> From: "Gautham R. Shenoy"
>
> On platforms which have the "ibm,thread-groups" property, the per-cpu
> variable cpu_l1_cache_map keeps track of which group of threads
> within the same core share the L1 cache, for both instruction and
> data flow.
>
> This patch renames the variable to "thread_group_l1_cache_map" to
> make it consistent with a subsequent patch which will introduce
> thread_group_l2_cache_map.
>
> This patch introduces no functional change.

Looks good to me.

Reviewed-by: Srikar Dronamraju

> Signed-off-by: Gautham R. Shenoy

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v3 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
* Gautham R. Shenoy [2020-12-10 16:08:57]:

> From: "Gautham R. Shenoy"
>
> init_thread_group_l1_cache_map() initializes the per-cpu cpumask
> thread_group_l1_cache_map with the core-siblings which share L1 cache
> with the CPU. Make this function generic to the cache-property (L1 or
> L2) and update a suitable mask. This is a preparatory patch for the
> next patch, where we will introduce discovery of thread-groups that
> share L2-cache.
>
> No functional change.

Looks good to me.

Reviewed-by: Srikar Dronamraju

> Signed-off-by: Gautham R. Shenoy

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
* Gautham R Shenoy [2020-12-08 23:12:37]:

> > For L2 we have thread_group_l2_cache_map to store the threads from
> > the thread group, but cpu_l2_cache_map for keeping track of tasks.
> >
> > I think we should do some renaming to keep the names consistent. I
> > would say we should probably move the current cpu_l2_cache_map to
> > cpu_llc_cache_map and move the new thread_group_l2_cache_map to
> > cpu_l2_cache_map, to be somewhat consistent.
>
> Hmm.. cpu_llc_cache_map is still very generic. We want to have
> something that defines the l2 map.
>
> I agree that we need to keep it consistent. How about renaming
> cpu_l1_cache_map to thread_groups_l1_cache_map ?
>
> That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
> refer to the corresponding L1 and L2 siblings as discovered from the
> ibm,thread-groups property.

I am fine with this.

> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > > +
> > > +		if (err)
> > > +			return err;
> > > +	}
> > > +
> > > +	thread_group_shares_l2 = true;
> >
> > Why do we need a separate loop? Why can't we merge this into the
> > above loop itself?
>
> No, there are platforms where THREAD_GROUP_SHARE_L1 exists while
> THREAD_GROUP_SHARE_L2 doesn't. It becomes easier if these are
> separately tracked. Also, what do we gain if we put this in the same
> loop? It will be (nr_possible_cpus * 2 * invocations of
> init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
> invocations of init_cpu_cache_map()). Isn't it ?

It's not about the number of invocations but the per-cpu thread-group
list, which would not need to be loaded again. Currently it would
probably be in the cache line, only to be dropped and loaded again in
the next loop. And we can still support platforms with only
THREAD_GROUP_SHARE_L1, since parse_thread_groups would have given us
how many levels of thread groups are supported on a platform.

> > > +	pr_info("Thread-groups in a core share L2-cache\n");
> >
> > Can this be moved to a pr_debug? Does it help any regular
> > users/admins to know if thread-groups share the L2 cache? In fact it
> > may confuse users about what thread groups are and which thread
> > groups don't share cache. I would prefer some other name than
> > thread_group_shares_l2, but I don't know any better alternatives,
> > and maybe my choices are even worse.
>
> Would you be ok with "L2 cache shared by threads of the small core" ?

Sounds better to me. I would still think pr_debug is better, since
regular admins/users may not derive much information from this.

> > Ah, this can be simplified to:
> >
> > 	if (thread_group_shares_l2) {
> > 		cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> >
> > 		for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> > 			if (cpu_online(i))
> > 				set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > 		}
>
> Don't we want to enforce that the siblings sharing L1 be a subset of
> the siblings sharing L2 ? Or do you recommend putting in a check for
> that somewhere ?

I didn't think about the case where the device tree could show L2 to
be a subset of L1. How about initializing thread_group_l2_cache_map
itself with cpu_l1_cache_map? It would be a simple one-time operation
and would reduce the overhead here on every CPU online. And it would
help in your subsequent patch too: we don't want the cacheinfo for L1
showing CPUs not present in L2.

--
Thanks and Regards
Srikar Dronamraju
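The seeding suggestion at the end of this mail guarantees the subset property by construction. A small sketch (made-up names; an `unsigned long` bitmask stands in for a cpumask, and `dt_l2_siblings` is a hypothetical stand-in for what the device tree reports):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

static unsigned long l1_map[NR_CPUS];  /* L1 thread-group siblings */
static unsigned long l2_map[NR_CPUS];  /* L2 thread-group siblings */

/*
 * Seed the L2 map with the L1 siblings, then OR in what the device
 * tree reports for L2. The "L1 is a subset of L2" property then holds
 * by construction, whatever the firmware says.
 */
static void init_l2_map(int cpu, unsigned long dt_l2_siblings)
{
    l2_map[cpu] = l1_map[cpu] | dt_l2_siblings;
}

static bool l1_subset_of_l2(int cpu)
{
    return (l1_map[cpu] & ~l2_map[cpu]) == 0;
}
```

This one-time OR at init avoids having to validate the subset relation on every CPU-online event, which is the overhead reduction argued for above.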
Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
* Gautham R Shenoy [2020-12-08 23:26:47]:

> > The drawback of this is that even if CPUs 0,2,4,6 are released, the
> > L1 cache will not be released. Is this as expected?
>
> cacheinfo populates the cache->shared_cpu_map on the basis of which
> CPUs share the common device-tree node for a particular cache. There
> is one l1-cache object in the device-tree for a CPU node
> corresponding to a big-core. That the L1 is further split between the
> threads of the core is shown using ibm,thread-groups.

Yes.

> The ideal thing would be to add a "group_leader" field to "struct
> cache" so that we can create separate cache objects, one per thread
> group. I will take a stab at this in the v2.

I am not saying this needs to be done immediately. We could add a TODO
and get it done later. Your patch is not making it worse; it's just
that there is still something more left to be done.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
* Gautham R Shenoy [2020-12-08 22:55:40]:

> > NIT:
> > tglx mentions in one of his recent comments to try to keep a
> > reverse fir-tree ordering of variables where possible.
>
> I suppose you mean moving the longer local variable declarations to
> the top and shorter ones to the bottom. Thanks. Will fix this.

Yes.

> > > +	}
> > > +
> > > +	if (!tg)
> > > +		return -EINVAL;
> > > +
> > > +	cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> >
> > This whole hunk should be moved to a new function and called before
> > init_cpu_cache_map. It will simplify the logic to a great extent.
>
> I suppose you are referring to the part where we select the correct
> tg. Yeah, that can move to a different helper.

Yes, I would prefer if we could call this new helper outside
init_cpu_cache_map.

> > > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > > -				GFP_KERNEL, cpu_to_node(cpu));
> > > +	mask = &per_cpu(cpu_l1_cache_map, cpu);
> > > +
> > > +	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> >
> > This hunk (and the next hunk) should be moved to the next patch.
>
> The next patch is only about introducing THREAD_GROUP_SHARE_L2. Hence
> I put in any other code in this patch, since it seems to be a logical
> place to collate whatever we have in a generic form.

While I am fine with it, having a pointer that always points to the
same mask looks weird.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
* Gautham R. Shenoy [2020-12-04 10:18:47]: > From: "Gautham R. Shenoy" > > > Signed-off-by: Gautham R. Shenoy > --- > > +extern bool thread_group_shares_l2; > /* > * On big-core systems, each core has two groups of CPUs each of which > * has its own L1-cache. The thread-siblings which share l1-cache with > * @cpu can be obtained via cpu_smallcore_mask(). > + * > + * On some big-core systems, the L2 cache is shared only between some > + * groups of siblings. This is already parsed and encoded in > + * cpu_l2_cache_mask(). > */ > static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct > cache *cache) > { > if (cache->level == 1) > return cpu_smallcore_mask(cpu); > + if (cache->level == 2 && thread_group_shares_l2) > + return cpu_l2_cache_mask(cpu); > > return &cache->shared_cpu_map; As pointed out by l...@intel.org, we need to do this only under CONFIG_SMP, even for cache->level == 1. I agree that we are displaying shared_cpu_map correctly. Should we also have updated/cleared shared_cpu_map in the first place? For example:- For a P9 core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7 but we would display 0,2,4,6. The drawback of this is even if cpus 0,2,4,6 are released L1 cache will not be released. Is this as expected? -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
he_map(int cpu, unsigned int > cache_property) > if (!dn) > return -ENODATA; > > - if (!(cache_property == THREAD_GROUP_SHARE_L1)) > + if (!(cache_property == THREAD_GROUP_SHARE_L1 || > + cache_property == THREAD_GROUP_SHARE_L2)) > return -EINVAL; > > if (!cpu_tgl->nr_properties) { > @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int > cache_property) > goto out; > } > > - mask = &per_cpu(cpu_l1_cache_map, cpu); > + if (cache_property == THREAD_GROUP_SHARE_L1) > + mask = &per_cpu(cpu_l1_cache_map, cpu); > + else if (cache_property == THREAD_GROUP_SHARE_L2) > + mask = &per_cpu(thread_group_l2_cache_map, cpu); > > zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu)); > > @@ -973,6 +988,16 @@ static int init_big_cores(void) > } > > has_big_cores = true; > + > + for_each_possible_cpu(cpu) { > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2); > + > + if (err) > + return err; > + } > + > + thread_group_shares_l2 = true; Why do we need a separate loop? Why can't we merge this into the above loop itself? > + pr_info("Thread-groups in a core share L2-cache\n"); Can this be moved to a pr_debug? Does it help any regular user/admins to know if thread-groups share L2 cache? In fact it may confuse users on what thread groups are and which thread groups don't share cache. I would prefer some other name than thread_group_shares_l2 but don't know any better alternatives and maybe my choices are even worse. > return 0; > } > > @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t > *mask) > if (has_big_cores) > submask_fn = cpu_smallcore_mask; > > + NIT: extra blank line? > + /* > + * If the threads in a thread-group share L2 cache, then then > + * the L2-mask can be obtained from thread_group_l2_cache_map. 
> + */ > + if (thread_group_shares_l2) { > + /* Siblings that share L1 is a subset of siblings that share > L2.*/ > + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); > + if (*mask) { > + cpumask_andnot(*mask, > +per_cpu(thread_group_l2_cache_map, cpu), > +cpu_l2_cache_mask(cpu)); > + } else { > + mask = &per_cpu(thread_group_l2_cache_map, cpu); > + } > + > + for_each_cpu(i, *mask) { > + if (!cpu_online(i)) > + continue; > + set_cpus_related(i, cpu, cpu_l2_cache_mask); > + } > + > + return true; > + } > + Ah this can be simplified to: if (thread_group_shares_l2) { cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) { if (cpu_online(i)) set_cpus_related(i, cpu, cpu_l2_cache_mask); } } No? > l2_cache = cpu_to_l2cache(cpu); > if (!l2_cache || !*mask) { > /* Assume only core siblings share cache with this CPU */ -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
; > @@ -830,11 +867,12 @@ static int init_cpu_l1_cache_map(int cpu) > goto out; > } > > - zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu), > - GFP_KERNEL, cpu_to_node(cpu)); > + mask = &per_cpu(cpu_l1_cache_map, cpu); > + > + zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu)); > This hunk (and the next hunk) should be moved to next patch. > for (i = first_thread; i < first_thread + threads_per_core; i++) { > - int i_group_start = get_cpu_thread_group_start(i, &tg); > + int i_group_start = get_cpu_thread_group_start(i, tg); > > if (unlikely(i_group_start == -1)) { > WARN_ON_ONCE(1); > @@ -843,7 +881,7 @@ static int init_cpu_l1_cache_map(int cpu) > } > > if (i_group_start == cpu_group_start) > - cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu)); > + cpumask_set_cpu(i, *mask); > } > > out: > @@ -924,7 +962,7 @@ static int init_big_cores(void) > int cpu; > > for_each_possible_cpu(cpu) { > - int err = init_cpu_l1_cache_map(cpu); > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1); > > if (err) > return err; > -- > 1.9.4 > -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 2/3] x86/uprobes: Fix not using prefixes.nbytes for loop over prefixes.bytes
* Masami Hiramatsu [2020-12-02 17:51:16]: > Since the insn.prefixes.nbytes can be bigger than the size of > insn.prefixes.bytes[] when a same prefix is repeated, we have to > check whether the insn.prefixes.bytes[i] != 0 and i < 4 instead > of insn.prefixes.nbytes. > > Fixes: 2b1444983508 ("uprobes, mm, x86: Add the ability to install and remove > uprobes breakpoints") > Cc: sta...@vger.kernel.org > Reported-by: Kees Cook > Signed-off-by: Masami Hiramatsu Looks good to me. Reviewed-by: Srikar Dronamraju > --- > arch/x86/kernel/uprobes.c |4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c > index 3fdaa042823d..bb3ea3705b99 100644 > --- a/arch/x86/kernel/uprobes.c > +++ b/arch/x86/kernel/uprobes.c > @@ -257,7 +257,7 @@ static bool is_prefix_bad(struct insn *insn) > { > int i; > > - for (i = 0; i < insn->prefixes.nbytes; i++) { > + for (i = 0; insn->prefixes.bytes[i] && i < 4; i++) { > insn_attr_t attr; > > attr = inat_get_opcode_attribute(insn->prefixes.bytes[i]); > @@ -746,7 +746,7 @@ static int branch_setup_xol_ops(struct arch_uprobe > *auprobe, struct insn *insn) >* Intel and AMD behavior differ in 64-bit mode: Intel ignores 66 > prefix. >* No one uses these insns, reject any branch insns with such prefix. >*/ > - for (i = 0; i < insn->prefixes.nbytes; i++) { > + for (i = 0; insn->prefixes.bytes[i] && i < 4; i++) { > if (insn->prefixes.bytes[i] == 0x66) > return -ENOTSUPP; > } > -- Thanks and Regards Srikar Dronamraju
[PATCH v2 3/4] powerpc: Reintroduce is_kvm_guest in a new avatar
Introduce a static branch that would be set during boot if the OS happens to be a KVM guest. Subsequent checks to see if we are on KVM will rely on this static branch. This static branch would be used in vcpu_is_preempted in a subsequent patch. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- arch/powerpc/include/asm/kvm_guest.h | 10 ++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 2 ++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index ba8291e02ba9..627ba272e781 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,8 +7,18 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +#include <linux/jump_label.h> + +DECLARE_STATIC_KEY_FALSE(kvm_guest); + +static inline bool is_kvm_guest(void) +{ + return static_branch_unlikely(&kvm_guest); +} + bool check_kvm_guest(void); #else +static inline bool is_kvm_guest(void) { return false; } static inline bool check_kvm_guest(void) { return false; } #endif diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 6fba06b6cfdb..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index 0aeb6a5b1a9e..28498fc573f2 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -22,6 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); 
#endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +DEFINE_STATIC_KEY_FALSE(kvm_guest); bool check_kvm_guest(void) { struct device_node *hyper_node; @@ -33,6 +34,7 @@ bool check_kvm_guest(void) if (!of_device_is_compatible(hyper_node, "linux,kvm")) return 0; + static_branch_enable(&kvm_guest); return 1; } #endif -- 2.18.4
[PATCH v2 4/4] powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
If it's a shared LPAR but not a KVM guest, then see if the vCPU is related to the calling vCPU. On PowerVM, only cores can be preempted. So if one vCPU is in a non-preempted state, we can decipher that all other vCPUs sharing the same core are in non-preempted state. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- arch/powerpc/include/asm/paravirt.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 9362c94fe3aa..edc08f04aef7 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -10,6 +10,9 @@ #endif #ifdef CONFIG_PPC_SPLPAR +#include <asm/cputhreads.h> +#include <asm/kvm_guest.h> + DECLARE_STATIC_KEY_FALSE(shared_processor); static inline bool is_shared_processor(void) @@ -74,6 +77,21 @@ static inline bool vcpu_is_preempted(int cpu) { if (!is_shared_processor()) return false; + +#ifdef CONFIG_PPC_SPLPAR + if (!is_kvm_guest()) { + int first_cpu = cpu_first_thread_sibling(smp_processor_id()); + + /* +* Preemption can only happen at core granularity. This CPU +* is not preempted if one of the CPU of this core is not +* preempted. +*/ + if (cpu_first_thread_sibling(cpu) == first_cpu) + return false; + } +#endif + if (yield_count_of(cpu) & 1) return true; return false; -- 2.18.4
[PATCH v2 1/4] powerpc: Refactor is_kvm_guest declaration to new header
Only code/declaration movement, in anticipation of doing a kvm-aware vcpu_is_preempted. No additional changes. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- Changelog: v1->v2: v1: https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u - Moved a hunk to fix a no previous prototype warning reported by: l...@intel.com https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/ arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 15 +++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 1 + arch/powerpc/platforms/pseries/smp.c | 1 + 5 files changed, 18 insertions(+), 7 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 0b295bdb201e..aa6a5ef5d483 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -134,12 +134,6 @@ extern int ibm_nmi_interlock_token; extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup; -#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); -#else -static inline bool is_kvm_guest(void) { return false; } -#endif - #ifdef CONFIG_PPC_PSERIES void pseries_probe_fw_features(void); #else diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h new file mode 100644 index ..c0ace884a0e8 --- /dev/null +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2020 IBM Corporation + */ + +#ifndef __POWERPC_KVM_GUEST_H__ +#define __POWERPC_KVM_GUEST_H__ + +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +bool 
is_kvm_guest(void); +#else +static inline bool is_kvm_guest(void) { return false; } +#endif + +#endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 744612054c94..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -8,7 +8,7 @@ #ifndef __POWERPC_KVM_PARA_H__ #define __POWERPC_KVM_PARA_H__ -#include <asm/firmware.h> +#include <asm/kvm_guest.h> #include <uapi/asm/kvm_para.h> diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index fe48d319d490..5f48e5ad24cd 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -14,6 +14,7 @@ #include #include +#include <asm/kvm_guest.h> #ifdef CONFIG_PPC64 unsigned long powerpc_firmware_features __read_mostly; diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index 92922491a81c..d578732c545d 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -42,6 +42,7 @@ #include #include #include +#include <asm/kvm_guest.h> #include "pseries.h" -- 2.18.4
[PATCH v2 2/4] powerpc: Rename is_kvm_guest to check_kvm_guest
is_kvm_guest() will be reused in subsequent patch in a new avatar. Hence rename is_kvm_guest to check_kvm_guest. No additional changes. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Acked-by: Waiman Long Signed-off-by: Srikar Dronamraju --- arch/powerpc/include/asm/kvm_guest.h | 4 ++-- arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 2 +- arch/powerpc/platforms/pseries/smp.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index c0ace884a0e8..ba8291e02ba9 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,9 +7,9 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); +bool check_kvm_guest(void); #else -static inline bool is_kvm_guest(void) { return false; } +static inline bool check_kvm_guest(void) { return false; } #endif #endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index abe1b5e82547..6fba06b6cfdb 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index 5f48e5ad24cd..0aeb6a5b1a9e 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -22,7 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); #endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void) +bool check_kvm_guest(void) { struct device_node 
*hyper_node; diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index d578732c545d..c70b4be9f0a5 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -211,7 +211,7 @@ static __init void pSeries_smp_probe(void) if (!cpu_has_feature(CPU_FTR_SMT)) return; - if (is_kvm_guest()) { + if (check_kvm_guest()) { /* * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp * faults to the hypervisor which then reads the instruction -- 2.18.4
[PATCH v2 0/4] Powerpc: Better preemption for shared processor
Currently, vcpu_is_preempted will return the yield_count for shared_processor. On a PowerVM LPAR, Phyp schedules at SMT8 core boundary i.e all CPUs belonging to a core are either group scheduled in or group scheduled out. This can be used to better predict non-preempted CPUs on PowerVM shared LPARs. perf stat -r 5 -a perf bench sched pipe -l 1000 (lesser time is better) powerpc/next 35,107,951.20 msec cpu-clock # 255.898 CPUs utilized ( +- 0.31% ) 23,655,348 context-switches #0.674 K/sec ( +- 3.72% ) 14,465 cpu-migrations#0.000 K/sec ( +- 5.37% ) 82,463 page-faults #0.002 K/sec ( +- 8.40% ) 1,127,182,328,206 cycles#0.032 GHz ( +- 1.60% ) (66.67%) 78,587,300,622 stalled-cycles-frontend #6.97% frontend cycles idle ( +- 0.08% ) (50.01%) 654,124,218,432 stalled-cycles-backend# 58.03% backend cycles idle ( +- 1.74% ) (50.01%) 834,013,059,242 instructions #0.74 insn per cycle #0.78 stalled cycles per insn ( +- 0.73% ) (66.67%) 132,911,454,387 branches #3.786 M/sec ( +- 0.59% ) (50.00%) 2,890,882,143 branch-misses #2.18% of all branches ( +- 0.46% ) (50.00%) 137.195 +- 0.419 seconds time elapsed ( +- 0.31% ) powerpc/next + patchset 29,981,702.64 msec cpu-clock # 255.881 CPUs utilized ( +- 1.30% ) 40,162,456 context-switches #0.001 M/sec ( +- 0.01% ) 1,110 cpu-migrations#0.000 K/sec ( +- 5.20% ) 62,616 page-faults #0.002 K/sec ( +- 3.93% ) 1,430,030,626,037 cycles#0.048 GHz ( +- 1.41% ) (66.67%) 83,202,707,288 stalled-cycles-frontend #5.82% frontend cycles idle ( +- 0.75% ) (50.01%) 744,556,088,520 stalled-cycles-backend# 52.07% backend cycles idle ( +- 1.39% ) (50.01%) 940,138,418,674 instructions #0.66 insn per cycle #0.79 stalled cycles per insn ( +- 0.51% ) (66.67%) 146,452,852,283 branches #4.885 M/sec ( +- 0.80% ) (50.00%) 3,237,743,996 branch-misses #2.21% of all branches ( +- 1.18% ) (50.01%) 117.17 +- 1.52 seconds time elapsed ( +- 1.30% ) This is around 14.6% improvement in performance. 
Changelog: v1->v2: v1: https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u - Rebased to 27th Nov linuxppc/merge tree. - Moved a hunk to fix a no previous prototype warning reported by: l...@intel.com https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/ Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Srikar Dronamraju (4): powerpc: Refactor is_kvm_guest declaration to new header powerpc: Rename is_kvm_guest to check_kvm_guest powerpc: Reintroduce is_kvm_guest powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 25 + arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/include/asm/paravirt.h | 18 ++ arch/powerpc/kernel/firmware.c | 5 - arch/powerpc/platforms/pseries/smp.c | 3 ++- 6 files changed, 50 insertions(+), 9 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h -- 2.18.4
Re: powerpc64-linux-ld: arch/powerpc/mm/numa.o:undefined reference to `coregroup_enabled'
* Randy Dunlap [2020-11-19 15:51:29]: > On 11/19/20 10:26 AM, kernel test robot wrote: > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > master > > head: c2e7554e1b85935d962127efa3c2a76483b0b3b6 > > commit: f9f130ff2ec93c5949576bbfb168cc9530c23649 powerpc/numa: Detect > > support for coregroup > > date: 9 weeks ago > > config: powerpc64-randconfig-c003-20201104 (attached as .config) > > compiler: powerpc64-linux-gcc (GCC) 9.3.0 > > reproduce (this is a W=1 build): > > wget > > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O > > ~/bin/make.cross > > chmod +x ~/bin/make.cross > > # > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f9f130ff2ec93c5949576bbfb168cc9530c23649 > > git remote add linus > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > > git fetch --no-tags linus master > > git checkout f9f130ff2ec93c5949576bbfb168cc9530c23649 > > # save the attached .config to linux build tree > > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross > > ARCH=powerpc64 > > > > If you fix the issue, kindly add following tag as appropriate > > Reported-by: kernel test robot > > > > All errors (new ones prefixed by >>): > > > > > > powerpc64-linux-ld: arch/powerpc/mm/numa.o:(.toc+0x0): undefined > > > > reference to `coregroup_enabled' > > powerpc64-linux-ld: mm/page_alloc.o:(.toc+0x0): undefined reference to > > `node_reclaim_distance' > > powerpc64-linux-ld: mm/khugepaged.o:(.toc+0x0): undefined reference to > > `node_reclaim_distance' > > Hm, OK. > CONFIG_NUMA=y > # CONFIG_SMP is not set > > Michael, Gautham, does anyone care about this config combination? > I can add #ifdef CONFIG_SMP where coregroup_enabled is being accessed but I do feel CONFIG_NUMA but !CONFIG_SMP may not be a valid combination. > > Thanks. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 0/4] Powerpc: Better preemption for shared processor
* Waiman Long [2020-10-28 20:01:30]: > > Srikar Dronamraju (4): > >powerpc: Refactor is_kvm_guest declaration to new header > >powerpc: Rename is_kvm_guest to check_kvm_guest > >powerpc: Reintroduce is_kvm_guest > >powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted > > > > arch/powerpc/include/asm/firmware.h | 6 -- > > arch/powerpc/include/asm/kvm_guest.h | 25 + > > arch/powerpc/include/asm/kvm_para.h | 2 +- > > arch/powerpc/include/asm/paravirt.h | 18 ++ > > arch/powerpc/kernel/firmware.c | 5 - > > arch/powerpc/platforms/pseries/smp.c | 3 ++- > > 6 files changed, 50 insertions(+), 9 deletions(-) > > create mode 100644 arch/powerpc/include/asm/kvm_guest.h > > > This patch series looks good to me and the performance is nice too. > > Acked-by: Waiman Long Thank you. > > Just curious, is the performance mainly from the use of static_branch > (patches 1 - 3) or from reducing call to yield_count_of(). Because of the reduced call to yield_count > > Cheers, > Longman > -- Thanks and Regards Srikar Dronamraju
[PATCH 2/4] powerpc: Rename is_kvm_guest to check_kvm_guest
is_kvm_guest() will be reused in subsequent patch in a new avatar. Hence rename is_kvm_guest to check_kvm_guest. No additional changes. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/kvm_guest.h | 4 ++-- arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 2 +- arch/powerpc/platforms/pseries/smp.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index c0ace884a0e8..ba8291e02ba9 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,9 +7,9 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); +bool check_kvm_guest(void); #else -static inline bool is_kvm_guest(void) { return false; } +static inline bool check_kvm_guest(void) { return false; } #endif #endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index abe1b5e82547..6fba06b6cfdb 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index fe48d319d490..61243267d4cf 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -21,7 +21,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); #endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void) +bool check_kvm_guest(void) { struct device_node *hyper_node; diff 
--git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index d578732c545d..c70b4be9f0a5 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -211,7 +211,7 @@ static __init void pSeries_smp_probe(void) if (!cpu_has_feature(CPU_FTR_SMT)) return; - if (is_kvm_guest()) { + if (check_kvm_guest()) { /* * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp * faults to the hypervisor which then reads the instruction -- 2.18.4
[PATCH 1/4] powerpc: Refactor is_kvm_guest declaration to new header
Only code/declaration movement, in anticipation of doing a kvm-aware vcpu_is_preempted. No additional changes. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 15 +++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/platforms/pseries/smp.c | 1 + 4 files changed, 17 insertions(+), 7 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 0b295bdb201e..aa6a5ef5d483 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -134,12 +134,6 @@ extern int ibm_nmi_interlock_token; extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup; -#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) -bool is_kvm_guest(void); -#else -static inline bool is_kvm_guest(void) { return false; } -#endif - #ifdef CONFIG_PPC_PSERIES void pseries_probe_fw_features(void); #else diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h new file mode 100644 index ..c0ace884a0e8 --- /dev/null +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -0,0 +1,15 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Copyright (C) 2020 IBM Corporation + */ + +#ifndef __POWERPC_KVM_GUEST_H__ +#define __POWERPC_KVM_GUEST_H__ + +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +bool is_kvm_guest(void); +#else +static inline bool is_kvm_guest(void) { return false; } +#endif + +#endif /* __POWERPC_KVM_GUEST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 744612054c94..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -8,7 +8,7 @@ #ifndef 
__POWERPC_KVM_PARA_H__ #define __POWERPC_KVM_PARA_H__ -#include <asm/firmware.h> +#include <asm/kvm_guest.h> #include <uapi/asm/kvm_para.h> diff --git a/arch/powerpc/platforms/pseries/smp.c b/arch/powerpc/platforms/pseries/smp.c index 92922491a81c..d578732c545d 100644 --- a/arch/powerpc/platforms/pseries/smp.c +++ b/arch/powerpc/platforms/pseries/smp.c @@ -42,6 +42,7 @@ #include #include #include +#include <asm/kvm_guest.h> #include "pseries.h" -- 2.18.4
[PATCH 4/4] powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted
If its a shared lpar but not a KVM guest, then see if the vCPU is related to the calling vCPU. On PowerVM, only cores can be preempted. So if one vCPU is a non-preempted state, we can decipher that all other vCPUs sharing the same core are in non-preempted state. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/paravirt.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 9362c94fe3aa..edc08f04aef7 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -10,6 +10,9 @@ #endif #ifdef CONFIG_PPC_SPLPAR +#include +#include + DECLARE_STATIC_KEY_FALSE(shared_processor); static inline bool is_shared_processor(void) @@ -74,6 +77,21 @@ static inline bool vcpu_is_preempted(int cpu) { if (!is_shared_processor()) return false; + +#ifdef CONFIG_PPC_SPLPAR + if (!is_kvm_guest()) { + int first_cpu = cpu_first_thread_sibling(smp_processor_id()); + + /* +* Preemption can only happen at core granularity. This CPU +* is not preempted if one of the CPU of this core is not +* preempted. +*/ + if (cpu_first_thread_sibling(cpu) == first_cpu) + return false; + } +#endif + if (yield_count_of(cpu) & 1) return true; return false; -- 2.18.4
[PATCH 3/4] powerpc: Reintroduce is_kvm_guest in a new avatar
Introduce a static branch that would be set during boot if the OS happens to be a KVM guest. Subsequent checks to see if we are on KVM will rely on this static branch. This static branch would be used in vcpu_is_preempted in a subsequent patch. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld --- arch/powerpc/include/asm/kvm_guest.h | 10 ++ arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/kernel/firmware.c | 3 +++ 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/kvm_guest.h b/arch/powerpc/include/asm/kvm_guest.h index ba8291e02ba9..627ba272e781 100644 --- a/arch/powerpc/include/asm/kvm_guest.h +++ b/arch/powerpc/include/asm/kvm_guest.h @@ -7,8 +7,18 @@ #define __POWERPC_KVM_GUEST_H__ #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +#include + +DECLARE_STATIC_KEY_FALSE(kvm_guest); + +static inline bool is_kvm_guest(void) +{ + return static_branch_unlikely(_guest); +} + bool check_kvm_guest(void); #else +static inline bool is_kvm_guest(void) { return false; } static inline bool check_kvm_guest(void) { return false; } #endif diff --git a/arch/powerpc/include/asm/kvm_para.h b/arch/powerpc/include/asm/kvm_para.h index 6fba06b6cfdb..abe1b5e82547 100644 --- a/arch/powerpc/include/asm/kvm_para.h +++ b/arch/powerpc/include/asm/kvm_para.h @@ -14,7 +14,7 @@ static inline int kvm_para_available(void) { - return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest(); + return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest(); } static inline unsigned int kvm_arch_para_features(void) diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c index 61243267d4cf..28498fc573f2 100644 --- a/arch/powerpc/kernel/firmware.c +++ b/arch/powerpc/kernel/firmware.c @@ -14,6 +14,7 @@ #include #include +#include #ifdef CONFIG_PPC64 unsigned long 
powerpc_firmware_features __read_mostly; @@ -21,6 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features); #endif #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST) +DEFINE_STATIC_KEY_FALSE(kvm_guest); bool check_kvm_guest(void) { struct device_node *hyper_node; @@ -32,6 +34,7 @@ bool check_kvm_guest(void) if (!of_device_is_compatible(hyper_node, "linux,kvm")) return 0; + static_branch_enable(&kvm_guest); return 1; } #endif -- 2.18.4
[PATCH 0/4] Powerpc: Better preemption for shared processor
Currently, vcpu_is_preempted will return the yield_count for shared_processor. On a PowerVM LPAR, Phyp schedules at SMT8 core boundary i.e all CPUs belonging to a core are either group scheduled in or group scheduled out. This can be used to better predict non-preempted CPUs on PowerVM shared LPARs. perf stat -r 5 -a perf bench sched pipe -l 1000 (lesser time is better) powerpc/next 35,107,951.20 msec cpu-clock # 255.898 CPUs utilized ( +- 0.31% ) 23,655,348 context-switches #0.674 K/sec ( +- 3.72% ) 14,465 cpu-migrations#0.000 K/sec ( +- 5.37% ) 82,463 page-faults #0.002 K/sec ( +- 8.40% ) 1,127,182,328,206 cycles#0.032 GHz ( +- 1.60% ) (66.67%) 78,587,300,622 stalled-cycles-frontend #6.97% frontend cycles idle ( +- 0.08% ) (50.01%) 654,124,218,432 stalled-cycles-backend# 58.03% backend cycles idle ( +- 1.74% ) (50.01%) 834,013,059,242 instructions #0.74 insn per cycle #0.78 stalled cycles per insn ( +- 0.73% ) (66.67%) 132,911,454,387 branches #3.786 M/sec ( +- 0.59% ) (50.00%) 2,890,882,143 branch-misses #2.18% of all branches ( +- 0.46% ) (50.00%) 137.195 +- 0.419 seconds time elapsed ( +- 0.31% ) powerpc/next + patchset 29,981,702.64 msec cpu-clock # 255.881 CPUs utilized ( +- 1.30% ) 40,162,456 context-switches #0.001 M/sec ( +- 0.01% ) 1,110 cpu-migrations#0.000 K/sec ( +- 5.20% ) 62,616 page-faults #0.002 K/sec ( +- 3.93% ) 1,430,030,626,037 cycles#0.048 GHz ( +- 1.41% ) (66.67%) 83,202,707,288 stalled-cycles-frontend #5.82% frontend cycles idle ( +- 0.75% ) (50.01%) 744,556,088,520 stalled-cycles-backend# 52.07% backend cycles idle ( +- 1.39% ) (50.01%) 940,138,418,674 instructions #0.66 insn per cycle #0.79 stalled cycles per insn ( +- 0.51% ) (66.67%) 146,452,852,283 branches #4.885 M/sec ( +- 0.80% ) (50.00%) 3,237,743,996 branch-misses #2.21% of all branches ( +- 1.18% ) (50.01%) 117.17 +- 1.52 seconds time elapsed ( +- 1.30% ) This is around 14.6% improvement in performance. 
Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Nathan Lynch Cc: Gautham R Shenoy Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Juri Lelli Cc: Waiman Long Cc: Phil Auld Srikar Dronamraju (4): powerpc: Refactor is_kvm_guest declaration to new header powerpc: Rename is_kvm_guest to check_kvm_guest powerpc: Reintroduce is_kvm_guest powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted arch/powerpc/include/asm/firmware.h | 6 -- arch/powerpc/include/asm/kvm_guest.h | 25 + arch/powerpc/include/asm/kvm_para.h | 2 +- arch/powerpc/include/asm/paravirt.h | 18 ++ arch/powerpc/kernel/firmware.c | 5 - arch/powerpc/platforms/pseries/smp.c | 3 ++- 6 files changed, 50 insertions(+), 9 deletions(-) create mode 100644 arch/powerpc/include/asm/kvm_guest.h -- 2.18.4
[PATCH 2/2] powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
Qian Cai reported a regression where CPU hotplug fails with the latest powerpc/next:

BUG: sleeping function called from invalid context at mm/slab.h:494
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
no locks held by swapper/88/0.
irq event stamp: 18074448
hardirqs last enabled at (18074447): [] tick_nohz_idle_enter+0x9c/0x110
hardirqs last disabled at (18074448): [] do_idle+0x138/0x3b0
do_idle at kernel/sched/idle.c:253 (discriminator 1)
softirqs last enabled at (18074440): [] irq_enter_rcu+0x94/0xa0
softirqs last disabled at (18074439): [] irq_enter_rcu+0x70/0xa0
CPU: 88 PID: 0 Comm: swapper/88 Tainted: G W 5.9.0-rc8-next-20201007 #1
Call Trace:
[c0002a4bfcf0] [c0649e98] dump_stack+0xec/0x144 (unreliable)
[c0002a4bfd30] [c00f6c34] ___might_sleep+0x2f4/0x310
[c0002a4bfdb0] [c0354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
[c0002a4bfe00] [c035e9e8] __kmalloc_node+0x88/0x3a0
slab_alloc_node at mm/slub.c:2817
(inlined by) __kmalloc_node at mm/slub.c:4013
[c0002a4bfe80] [c06494d8] alloc_cpumask_var_node+0x38/0x80
kmalloc_node at include/linux/slab.h:577
(inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
[c0002a4bfef0] [c003eedc] start_secondary+0x27c/0x800
update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
(inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
(inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
[c0002a4bff90] [c000c468] start_secondary_resume+0x10/0x14

Allocating a temporary mask while performing a CPU hotplug operation with CONFIG_CPUMASK_OFFSTACK enabled leads to calling a sleepable function from an atomic context. Fix this by allocating the temporary mask with the GFP_ATOMIC flag. Also, instead of having to allocate twice, allocate the mask in the caller so that we only have to allocate once. If the allocation fails, assume the mask to be the same as the sibling mask, which will cause the scheduler to drop this domain for this CPU.
Fixes: 70a94089d7f7 ("powerpc/smp: Optimize update_coregroup_mask")
Fixes: 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2")
Reported-by: Qian Cai
Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
Changelog v1->v2:
https://lore.kernel.org/linuxppc-dev/20201008034240.34059-1-sri...@linux.vnet.ibm.com/t/#u
Updated 2nd patch based on comments from Michael Ellerman
- Remove the WARN_ON.
- Handle allocation failures in a more subtle fashion.
- Allocate in the caller so that we only allocate once.

 arch/powerpc/kernel/smp.c | 57 +--
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a864b9b3228c..028479e9b66b 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1257,38 +1257,33 @@ static struct device_node *cpu_to_l2cache(int cpu)
 	return cache;
 }

-static bool update_mask_by_l2(int cpu)
+static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
 {
	struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
	struct device_node *l2_cache, *np;
-	cpumask_var_t mask;
	int i;

	if (has_big_cores)
		submask_fn = cpu_smallcore_mask;

	l2_cache = cpu_to_l2cache(cpu);
-	if (!l2_cache) {
-		/*
-		 * If no l2cache for this CPU, assume all siblings to share
-		 * cache with this CPU.
-		 */
+	if (!l2_cache || !*mask) {
+		/* Assume only core siblings share cache with this CPU */
		for_each_cpu(i, submask_fn(cpu))
			set_cpus_related(cpu, i, cpu_l2_cache_mask);

		return false;
	}

-	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
-	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
+	cpumask_and(*mask, cpu_online_mask, cpu_cpu_mask(cpu));

	/* Update l2-cache mask with all the CPUs that are part of submask */
	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);

	/* Skip all CPUs already part of current CPU l2-cache mask */
-	cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu));
+	cpumask_andnot(*mask, *mask, cpu_l2_cache_mask(cpu));

-	for_each_cpu(i, mask) {
+	for_each_cpu(i, *mask) {
		/*
		 * when updating the marks the current CPU has not been marked
		 * online, but we need to update the cache masks
@@ -1298,15 +1293,14 @@ static bool update_mask_by_l2(int cpu)
		/* Skip all CPUs already part of current CPU l2-cache */
		if (np == l2_cache) {
			or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask);
-			cpumask_andnot(m
[PATCH 1/2] powerpc/smp: Remove unnecessary variable
Commit 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2") introduced submask_fn in update_mask_by_l2 to track the right submask. However, commit f6606cfdfbcd ("powerpc/smp: Dont assume l2-cache to be superset of sibling") introduced sibling_mask in update_mask_by_l2 to track the same submask. Remove sibling_mask in favour of submask_fn.

Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d1c401f4617..a864b9b3228c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1264,18 +1264,16 @@ static bool update_mask_by_l2(int cpu)
	cpumask_var_t mask;
	int i;

+	if (has_big_cores)
+		submask_fn = cpu_smallcore_mask;
+
	l2_cache = cpu_to_l2cache(cpu);
	if (!l2_cache) {
-		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
-
		/*
		 * If no l2cache for this CPU, assume all siblings to share
		 * cache with this CPU.
		 */
-		if (has_big_cores)
-			sibling_mask = cpu_smallcore_mask;
-
-		for_each_cpu(i, sibling_mask(cpu))
+		for_each_cpu(i, submask_fn(cpu))
			set_cpus_related(cpu, i, cpu_l2_cache_mask);

		return false;
@@ -1284,9 +1282,6 @@ static bool update_mask_by_l2(int cpu)
	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));

-	if (has_big_cores)
-		submask_fn = cpu_smallcore_mask;
-
	/* Update l2-cache mask with all the CPUs that are part of submask */
	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
--
2.18.2
[PATCH v2 0/2] Fixes for coregroup
These patches fix problems introduced by the coregroup patches. The first patch removes a redundant variable. The second patch allows booting with CONFIG_CPUMASK_OFFSTACK enabled.

Changelog v1->v2:
https://lore.kernel.org/linuxppc-dev/20201008034240.34059-1-sri...@linux.vnet.ibm.com/t/#u
1. 1st patch was not part of previous posting.
2. Updated 2nd patch based on comments from Michael Ellerman

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai

Srikar Dronamraju (2):
  powerpc/smp: Remove unnecessary variable
  powerpc/smp: Use GFP_ATOMIC while allocating tmp mask

 arch/powerpc/kernel/smp.c | 70 +++
 1 file changed, 35 insertions(+), 35 deletions(-)

--
2.18.2
[PATCH] powerpc/smp: Use GFP_ATOMIC while allocating tmp mask
Qian Cai reported a regression where CPU hotplug fails with the latest powerpc/next:

BUG: sleeping function called from invalid context at mm/slab.h:494
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/88
no locks held by swapper/88/0.
irq event stamp: 18074448
hardirqs last enabled at (18074447): [] tick_nohz_idle_enter+0x9c/0x110
hardirqs last disabled at (18074448): [] do_idle+0x138/0x3b0
do_idle at kernel/sched/idle.c:253 (discriminator 1)
softirqs last enabled at (18074440): [] irq_enter_rcu+0x94/0xa0
softirqs last disabled at (18074439): [] irq_enter_rcu+0x70/0xa0
CPU: 88 PID: 0 Comm: swapper/88 Tainted: G W 5.9.0-rc8-next-20201007 #1
Call Trace:
[c0002a4bfcf0] [c0649e98] dump_stack+0xec/0x144 (unreliable)
[c0002a4bfd30] [c00f6c34] ___might_sleep+0x2f4/0x310
[c0002a4bfdb0] [c0354f94] slab_pre_alloc_hook.constprop.82+0x124/0x190
[c0002a4bfe00] [c035e9e8] __kmalloc_node+0x88/0x3a0
slab_alloc_node at mm/slub.c:2817
(inlined by) __kmalloc_node at mm/slub.c:4013
[c0002a4bfe80] [c06494d8] alloc_cpumask_var_node+0x38/0x80
kmalloc_node at include/linux/slab.h:577
(inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
[c0002a4bfef0] [c003eedc] start_secondary+0x27c/0x800
update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
(inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
(inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
[c0002a4bff90] [c000c468] start_secondary_resume+0x10/0x14

Allocating a temporary mask while performing a CPU hotplug operation with CONFIG_CPUMASK_OFFSTACK enabled leads to calling a sleepable function from an atomic context. Fix this by allocating the temporary mask with the GFP_ATOMIC flag. If there is a failure to allocate a mask, the scheduler is going to observe that this CPU's topology is broken. Instead of having to speculate about why the topology is broken, add a WARN_ON_ONCE.
Fixes: 70a94089d7f7 ("powerpc/smp: Optimize update_coregroup_mask")
Fixes: 3ab33d6dc3e9 ("powerpc/smp: Optimize update_mask_by_l2")
Reported-by: Qian Cai
Suggested-by: Qian Cai
Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nathan Lynch
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 0dc1b85..1268558 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1264,7 +1264,8 @@ static bool update_mask_by_l2(int cpu)
		return false;
	}

-	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	WARN_ON_ONCE(!alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu)));
	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));

	if (has_big_cores)
@@ -1344,7 +1345,8 @@ static void update_coregroup_mask(int cpu)
	int coregroup_id = cpu_to_coregroup_id(cpu);
	int i;

-	alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	WARN_ON_ONCE(!alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu)));
	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));

	if (shared_caches)
--
1.8.3.1
[PATCH v3 06/11] powerpc/smp: Stop passing mask to update_mask_by_l2
update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask as parameter. Instead of passing cpu_l2_cache_mask, use it directly in update_mask_by_l2. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c860c4950c9f..441c9c64b1e3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1218,7 +1218,7 @@ static struct device_node *cpu_to_l2cache(int cpu) return cache; } -static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) +static bool update_mask_by_l2(int cpu) { struct device_node *l2_cache, *np; int i; @@ -1240,7 +1240,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) return false; } - cpumask_set_cpu(cpu, mask_fn(cpu)); + cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked @@ -1251,7 +1251,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) continue; if (np == l2_cache) - set_cpus_related(cpu, i, mask_fn); + set_cpus_related(cpu, i, cpu_l2_cache_mask); of_node_put(np); } @@ -1315,7 +1315,7 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - update_mask_by_l2(cpu, cpu_l2_cache_mask); + update_mask_by_l2(cpu); if (has_coregroup_support()) { int coregroup_id = cpu_to_coregroup_id(cpu); -- 2.17.1
[PATCH v3 11/11] powerpc/smp: Optimize update_coregroup_mask
All threads of a SMT4/SMT8 core can either be part of this CPU's coregroup mask or outside the coregroup. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same coregroup.

Use a temporary mask to iterate through the CPUs that may share the coregroup mask. Also, instead of setting one CPU at a time into cpu_coregroup_mask, copy the SMT4/SMT8/submask in one shot.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Signed-off-by: Srikar Dronamraju
---
Changelog v2->v3:
	Use GFP_ATOMIC instead of GFP_KERNEL since allocations need to
	be atomic at the time of CPU hotplug.
	Reported by Qian Cai

 arch/powerpc/kernel/smp.c | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b48ae4e306d3..bbaea93dc558 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1340,19 +1340,34 @@ static inline void add_cpu_to_smallcore_masks(int cpu)

 static void update_coregroup_mask(int cpu)
 {
-	int first_thread = cpu_first_thread_sibling(cpu);
+	struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
+	cpumask_var_t mask;
	int coregroup_id = cpu_to_coregroup_id(cpu);
	int i;

-	cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
-	for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
-		int fcpu = cpu_first_thread_sibling(i);
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu));
+	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
+
+	if (shared_caches)
+		submask_fn = cpu_l2_cache_mask;

-		if (fcpu == first_thread)
-			set_cpus_related(cpu, i, cpu_coregroup_mask);
-		else if (coregroup_id == cpu_to_coregroup_id(i))
-			set_cpus_related(cpu, i, cpu_coregroup_mask);
+	/* Update coregroup mask with all the CPUs that are part of submask */
+	or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask);
+
+	/* Skip all CPUs already part of coregroup mask */
+	cpumask_andnot(mask, mask, cpu_coregroup_mask(cpu));
+
+	for_each_cpu(i, mask) {
+		/* Skip all CPUs not part of this coregroup */
+		if (coregroup_id == cpu_to_coregroup_id(i)) {
+			or_cpumasks_related(cpu, i, submask_fn, cpu_coregroup_mask);
+			cpumask_andnot(mask, mask, submask_fn(i));
+		} else {
+			cpumask_andnot(mask, mask, cpu_coregroup_mask(i));
+		}
	}
+	free_cpumask_var(mask);
 }

 static void add_cpu_to_masks(int cpu)
--
2.17.1
[PATCH v3 08/11] powerpc/smp: Check for duplicate topologies and consolidate
CACHE and COREGROUP domains are now part of the default topology. However, on systems that don't support CACHE or COREGROUP, these domains will eventually be degenerated. The degeneration happens per CPU. Note that the current fixup_topology() logic ensures that the mask of a domain that is not supported on the current platform is set to the previous domain's mask. Instead of waiting for the scheduler to degenerate them, try to consolidate the levels based on their masks and sd_flags. This is done just before setting the scheduler topology.

Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index aeb219a4bf7a..6f866e6b12f8 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1401,6 +1401,8 @@ int setup_profiling_timer(unsigned int multiplier)

 static void fixup_topology(void)
 {
+	int i;
+
 #ifdef CONFIG_SCHED_SMT
	if (has_big_cores) {
		pr_info("Big cores detected but using small core scheduling\n");
@@ -1410,6 +1412,30 @@ static void fixup_topology(void)

	if (!has_coregroup_support())
		powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask;
+
+	/*
+	 * Try to consolidate topology levels here instead of
+	 * allowing scheduler to degenerate.
+	 * - Dont consolidate if masks are different.
+	 * - Dont consolidate if sd_flags exists and are different.
+	 */
+	for (i = 1; i <= die_idx; i++) {
+		if (powerpc_topology[i].mask != powerpc_topology[i - 1].mask)
+			continue;
+
+		if (powerpc_topology[i].sd_flags && powerpc_topology[i - 1].sd_flags &&
+		    powerpc_topology[i].sd_flags != powerpc_topology[i - 1].sd_flags)
+			continue;
+
+		if (!powerpc_topology[i - 1].sd_flags)
+			powerpc_topology[i - 1].sd_flags = powerpc_topology[i].sd_flags;
+
+		powerpc_topology[i].mask = powerpc_topology[i + 1].mask;
+		powerpc_topology[i].sd_flags = powerpc_topology[i + 1].sd_flags;
+#ifdef CONFIG_SCHED_DEBUG
+		powerpc_topology[i].name = powerpc_topology[i + 1].name;
+#endif
+	}
 }

 void __init smp_cpus_done(unsigned int max_cpus)
--
2.17.1
[PATCH v3 09/11] powerpc/smp: Optimize update_mask_by_l2
All threads of a SMT4 core can either be part of this CPU's l2-cache mask or not related to this CPU's l2-cache mask. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same l2-cache.

Use a temporary mask to iterate through the CPUs that may share the l2_cache mask. Also, instead of setting one CPU at a time into cpu_l2_cache_mask, copy the SMT4/sub mask in one shot.

Signed-off-by: Srikar Dronamraju
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
Changelog v2->v3:
	Use GFP_ATOMIC instead of GFP_KERNEL since allocations need to
	be atomic at the time of CPU hotplug.
	Reported by Qian Cai

 arch/powerpc/kernel/smp.c | 52 +--
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6f866e6b12f8..17e90c2414af 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -671,6 +671,28 @@ static void set_cpus_unrelated(int i, int j,
 #endif

 /*
+ * Extends set_cpus_related. Instead of setting one CPU at a time in
+ * dstmask, set srcmask at oneshot. dstmask should be super set of srcmask.
+ */
+static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
+				struct cpumask *(*dstmask)(int))
+{
+	struct cpumask *mask;
+	int k;
+
+	mask = srcmask(j);
+	for_each_cpu(k, srcmask(i))
+		cpumask_or(dstmask(k), dstmask(k), mask);
+
+	if (i == j)
+		return;
+
+	mask = srcmask(i);
+	for_each_cpu(k, srcmask(j))
+		cpumask_or(dstmask(k), dstmask(k), mask);
+}
+
+/*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *                      property for the CPU device node @dn and stores
  *                      the parsed output in the thread_groups
@@ -1220,7 +1242,9 @@ static struct device_node *cpu_to_l2cache(int cpu)

 static bool update_mask_by_l2(int cpu)
 {
+	struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
	struct device_node *l2_cache, *np;
+	cpumask_var_t mask;
	int i;

	l2_cache = cpu_to_l2cache(cpu);
@@ -1240,22 +1264,38 @@ static bool update_mask_by_l2(int cpu)
		return false;
	}

-	cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
-	for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
+	/* In CPU-hotplug path, hence use GFP_ATOMIC */
+	alloc_cpumask_var_node(&mask, GFP_ATOMIC, cpu_to_node(cpu));
+	cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
+
+	if (has_big_cores)
+		submask_fn = cpu_smallcore_mask;
+
+	/* Update l2-cache mask with all the CPUs that are part of submask */
+	or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
+
+	/* Skip all CPUs already part of current CPU l2-cache mask */
+	cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu));
+
+	for_each_cpu(i, mask) {
		/*
		 * when updating the marks the current CPU has not been marked
		 * online, but we need to update the cache masks
		 */
		np = cpu_to_l2cache(i);
-		if (!np)
-			continue;

-		if (np == l2_cache)
-			set_cpus_related(cpu, i, cpu_l2_cache_mask);
+		/* Skip all CPUs already part of current CPU l2-cache */
+		if (np == l2_cache) {
+			or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask);
+			cpumask_andnot(mask, mask, submask_fn(i));
+		} else {
+			cpumask_andnot(mask, mask, cpu_l2_cache_mask(i));
+		}

		of_node_put(np);
	}
	of_node_put(l2_cache);
+	free_cpumask_var(mask);

	return true;
 }
--
2.17.1
[PATCH v3 07/11] powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
Currently on hotplug/hotunplug, CPU iterates through all the CPUs in its core to find threads in its thread group. However this info is already captured in cpu_l1_cache_map. Hence reduce iterations and cleanup add_cpu_to_smallcore_masks function. Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 441c9c64b1e3..aeb219a4bf7a 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1285,16 +1285,15 @@ static void remove_cpu_from_masks(int cpu) static inline void add_cpu_to_smallcore_masks(int cpu) { - struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu); - int i, first_thread = cpu_first_thread_sibling(cpu); + int i; if (!has_big_cores) return; cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu)); - for (i = first_thread; i < first_thread + threads_per_core; i++) { - if (cpu_online(i) && cpumask_test_cpu(i, this_l1_cache_map)) + for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) { + if (cpu_online(i)) set_cpus_related(i, cpu, cpu_smallcore_mask); } } -- 2.17.1
[PATCH v3 10/11] powerpc/smp: Move coregroup mask updation to a new function
Move the logic for updating the coregroup mask of a CPU to its own function. This will help in reworking the updation of coregroup mask in subsequent patch. Signed-off-by: Srikar Dronamraju Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 32 +++- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 17e90c2414af..b48ae4e306d3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1338,6 +1338,23 @@ static inline void add_cpu_to_smallcore_masks(int cpu) } } +static void update_coregroup_mask(int cpu) +{ + int first_thread = cpu_first_thread_sibling(cpu); + int coregroup_id = cpu_to_coregroup_id(cpu); + int i; + + cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { + int fcpu = cpu_first_thread_sibling(i); + + if (fcpu == first_thread) + set_cpus_related(cpu, i, cpu_coregroup_mask); + else if (coregroup_id == cpu_to_coregroup_id(i)) + set_cpus_related(cpu, i, cpu_coregroup_mask); + } +} + static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); @@ -1356,19 +1373,8 @@ static void add_cpu_to_masks(int cpu) add_cpu_to_smallcore_masks(cpu); update_mask_by_l2(cpu); - if (has_coregroup_support()) { - int coregroup_id = cpu_to_coregroup_id(cpu); - - cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { - int fcpu = cpu_first_thread_sibling(i); - - if (fcpu == first_thread) - set_cpus_related(cpu, i, cpu_coregroup_mask); - else if (coregroup_id == cpu_to_coregroup_id(i)) - set_cpus_related(cpu, i, cpu_coregroup_mask); - } - } + if (has_coregroup_support()) + update_coregroup_mask(cpu); } /* Activate a 
secondary processor. */ -- 2.17.1
[PATCH v3 03/11] powerpc/smp: Remove get_physical_package_id
Now that cpu_core_mask has been removed and topology_core_cpumask has been updated to use cpu_cpu_mask, we no longer need get_physical_package_id.

Signed-off-by: Srikar Dronamraju
Tested-by: Satheesh Rajendran
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/include/asm/topology.h |  5 -
 arch/powerpc/kernel/smp.c           | 20
 2 files changed, 25 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e0f232533c9d..e45219f74be0 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -114,12 +114,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #ifdef CONFIG_PPC64
 #include

-#ifdef CONFIG_PPC_SPLPAR
-int get_physical_package_id(int cpu);
-#define topology_physical_package_id(cpu)	(get_physical_package_id(cpu))
-#else
 #define topology_physical_package_id(cpu)	(cpu_to_chip_id(cpu))
-#endif

 #define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu)	(cpu_cpu_mask(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec41491beca4..8c095fe237b2 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1292,26 +1292,6 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 	}
 }

-int get_physical_package_id(int cpu)
-{
-	int pkg_id = cpu_to_chip_id(cpu);
-
-	/*
-	 * If the platform is PowerNV or Guest on KVM, ibm,chip-id is
-	 * defined. Hence we would return the chip-id as the result of
-	 * get_physical_package_id.
-	 */
-	if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) &&
-	    IS_ENABLED(CONFIG_PPC_SPLPAR)) {
-		struct device_node *np = of_get_cpu_node(cpu, NULL);
-		pkg_id = of_node_to_nid(np);
-		of_node_put(np);
-	}
-
-	return pkg_id;
-}
-EXPORT_SYMBOL_GPL(get_physical_package_id);
-
 static void add_cpu_to_masks(int cpu)
 {
 	int first_thread = cpu_first_thread_sibling(cpu);
--
2.17.1
[PATCH v3 04/11] powerpc/smp: Optimize remove_cpu_from_masks
While offlining a CPU, the system currently iterates through all the CPUs in the DIE to clear the sibling, l2_cache and smallcore maps. However, if there are more cores in a DIE, the system can end up spending more time iterating through CPUs which are completely unrelated. Optimize this by only iterating through a smaller but relevant cpumask: if shared_caches is set, cpu_l2_cache_map is the relevant map, else cpu_sibling_map is.

Signed-off-by: Srikar Dronamraju
Tested-by: Satheesh Rajendran
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/kernel/smp.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c095fe237b2..2e61a81aad88 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1263,14 +1263,21 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 #ifdef CONFIG_HOTPLUG_CPU
 static void remove_cpu_from_masks(int cpu)
 {
+	struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
	int i;

-	for_each_cpu(i, cpu_cpu_mask(cpu)) {
+	if (shared_caches)
+		mask_fn = cpu_l2_cache_mask;
+
+	for_each_cpu(i, mask_fn(cpu)) {
		set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
		if (has_big_cores)
			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
-		if (has_coregroup_support())
+	}
+
+	if (has_coregroup_support()) {
+		for_each_cpu(i, cpu_coregroup_mask(cpu))
			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
	}
 }
--
2.17.1
[PATCH v3 02/11] powerpc/smp: Stop updating cpu_core_mask
Anton Blanchard reported that his 4096-vCPU KVM guest took around 30 minutes to boot. He also traced it to the time taken to iterate while setting the cpu_core_mask.

Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power. However, updating cpu_core_mask took forever since it is a per-CPU cpumask variable, whereas cpu_cpu_mask is a per-node/per-DIE cpumask shared by all the respective CPUs.

Also, cpu_cpu_mask is needed from a scheduler perspective. However, cpu_core_map is an exported symbol. Hence stop updating cpu_core_map and make it point to cpu_cpu_mask.

Signed-off-by: Srikar Dronamraju
Tested-by: Satheesh Rajendran
Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Qian Cai
---
 arch/powerpc/include/asm/smp.h |  5 -
 arch/powerpc/kernel/smp.c      | 33 +++++++--------
 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 041f0b97c45b..40e121dd16af 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -119,11 +119,6 @@ static inline struct cpumask *cpu_sibling_mask(int cpu)
 	return per_cpu(cpu_sibling_map, cpu);
 }

-static inline struct cpumask *cpu_core_mask(int cpu)
-{
-	return per_cpu(cpu_core_map, cpu);
-}
-
 static inline struct cpumask *cpu_l2_cache_mask(int cpu)
 {
 	return per_cpu(cpu_l2_cache_map, cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3d96752d6570..ec41491beca4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -953,12 +953,17 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
				local_memory_node(numa_cpu_lookup_table[cpu]));
		}
 #endif
+		/*
+		 * cpu_core_map is no longer updated and exists only because
+		 * it has long been exported. It will only hold a snapshot
+		 * of cpu_cpu_mask.
+		 */
+		cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
	}

	/* Init the cpumasks so the boot CPU is related to itself */
	cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
-	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));

	if (has_coregroup_support())
		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
@@ -1260,9 +1265,7 @@ static void remove_cpu_from_masks(int cpu)
 {
	int i;

-	/* NB: cpu_core_mask is a superset of the others */
-	for_each_cpu(i, cpu_core_mask(cpu)) {
-		set_cpus_unrelated(cpu, i, cpu_core_mask);
+	for_each_cpu(i, cpu_cpu_mask(cpu)) {
		set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
		if (has_big_cores)
@@ -1312,7 +1315,6 @@ EXPORT_SYMBOL_GPL(get_physical_package_id);
 static void add_cpu_to_masks(int cpu)
 {
	int first_thread = cpu_first_thread_sibling(cpu);
-	int pkg_id = get_physical_package_id(cpu);
	int i;

	/*
@@ -1320,7 +1322,6 @@ static void add_cpu_to_masks(int cpu)
	 * add it to it's own thread sibling mask.
	 */
	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
-	cpumask_set_cpu(cpu, cpu_core_mask(cpu));

	for (i = first_thread; i < first_thread + threads_per_core; i++)
		if (cpu_online(i))
@@ -1342,26 +1343,6 @@ static void add_cpu_to_masks(int cpu)
			set_cpus_related(cpu, i, cpu_coregroup_mask);
		}
	}
-
-	if (pkg_id == -1) {
-		struct cpumask *(*mask)(int) = cpu_sibling_mask;
-
-		/*
-		 * Copy the sibling mask into core sibling mask and
-		 * mark any CPUs on the same chip as this CPU.
-		 */
-		if (shared_caches)
-			mask = cpu_l2_cache_mask;
-
-		for_each_cpu(i, mask(cpu))
-			set_cpus_related(cpu, i, cpu_core_mask);
-
-		return;
-	}
-
-	for_each_cpu(i, cpu_online_mask)
-		if (get_physical_package_id(i) == pkg_id)
-			set_cpus_related(cpu, i, cpu_core_mask);
 }

 /* Activate a secondary processor. */
--
2.17.1
[PATCH v3 00/11] Optimization to improve CPU online/offline on Powerpc
h: 113 99.5000th: 104 99.9000th: 159 99.9000th: 129 min=0, max=15221 min=0, max=7666

100 iterations of ppc64_cpu --smt=1 / ppc64_cpu --smt=8
Units: seconds; lesser is better

ppc64_cpu --smt=1
kernel        N    Min    Max    Median  Avg      Stddev
powerpc/next  100  13.39  17.55  14.71   14.7658  0.69184745
+patchset     100  13.3   16.27  14.33   14.4179  0.5427433

ppc64_cpu --smt=8
kernel        N    Min    Max    Median  Avg      Stddev
powerpc/next  100  21.65  26.17  23.71   23.7111  0.8589786
+patchset     100  21.88  25.79  23.16   23.2945  0.86394839

Observations: Performance of ebizzy / perf_sched_bench / schbench remains the same with and without the patchset.

Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai

Srikar Dronamraju (11):
  powerpc/topology: Update topology_core_cpumask
  powerpc/smp: Stop updating cpu_core_mask
  powerpc/smp: Remove get_physical_package_id
  powerpc/smp: Optimize remove_cpu_from_masks
  powerpc/smp: Limit CPUs traversed to within a node.
  powerpc/smp: Stop passing mask to update_mask_by_l2
  powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
  powerpc/smp: Check for duplicate topologies and consolidate
  powerpc/smp: Optimize update_mask_by_l2
  powerpc/smp: Move coregroup mask updation to a new function
  powerpc/smp: Optimize update_coregroup_mask

 arch/powerpc/include/asm/smp.h      |   5 -
 arch/powerpc/include/asm/topology.h |   7 +-
 arch/powerpc/kernel/smp.c           | 188 +++-
 3 files changed, 122 insertions(+), 78 deletions(-)

-- 2.17.1
[PATCH v3 01/11] powerpc/topology: Update topology_core_cpumask
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs. cpu_cpu_mask is needed by scheduler, hence look at deprecating cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user is moved to cpu_cpu_mask. Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/include/asm/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 6609174918ab..e0f232533c9d 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -122,7 +122,7 @@ int get_physical_package_id(int cpu); #endif #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) -#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) +#define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu)) #define topology_core_id(cpu) (cpu_to_core_id(cpu)) #endif -- 2.17.1
[PATCH v3 05/11] powerpc/smp: Limit CPUs traversed to within a node.
All the arch specific topology cpumasks are within a node/DIE. However when setting these per CPU cpumasks, system traverses through all the online CPUs. This is redundant. Reduce the traversal to only CPUs that are online in the node to which the CPU belongs to. Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Qian Cai --- arch/powerpc/kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 2e61a81aad88..c860c4950c9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1241,7 +1241,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) } cpumask_set_cpu(cpu, mask_fn(cpu)); - for_each_cpu(i, cpu_online_mask) { + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked * online, but we need to update the cache masks -- 2.17.1
Re: [PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2
* Qian Cai [2020-10-07 09:05:42]: Hi Qian, Thanks for testing and reporting the failure.

> On Mon, 2020-09-21 at 15:26 +0530, Srikar Dronamraju wrote: > > All threads of a SMT4 core can either be part of this CPU's l2-cache > > mask or not related to this CPU l2-cache mask. Use this relation to > > reduce the number of iterations needed to find all the CPUs that share > > the same l2-cache. > > > > Use a temporary mask to iterate through the CPUs that may share l2_cache > > mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask, > > copy the SMT4/sub mask at one shot. > > > ... > > static bool update_mask_by_l2(int cpu) > > { > > + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask; > > struct device_node *l2_cache, *np; > > + cpumask_var_t mask; > > int i; > > > > l2_cache = cpu_to_l2cache(cpu); > > @@ -1240,22 +1264,37 @@ static bool update_mask_by_l2(int cpu) > > return false; > > } > > > > - cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); > > - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { > > + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu)); > > Shouldn't this be GFP_ATOMIC? Otherwise, during the CPU hotplugging, we have,

Can you confirm if CONFIG_CPUMASK_OFFSTACK is enabled in your config? Because if !CONFIG_CPUMASK_OFFSTACK, then alloc_cpumask_var_node would do nothing but return true. Regarding CONFIG_CPUMASK_OFFSTACK, I am not sure how much powerpc has been tested with that config enabled. Please refer to http://lore.kernel.org/lkml/87o8nv51bg@mpe.ellerman.id.au/t/#u And we do have an issue, https://github.com/linuxppc/issues/issues/321, to track enabling / testing / verifying that CONFIG_CPUMASK_OFFSTACK works. I also don't see any powerpc kconfig enabling this. I do agree with your suggestion that we could substitute GFP_KERNEL with GFP_ATOMIC.
> > (irqs were disabled in do_idle()) > > [ 335.420001][T0] BUG: sleeping function called from invalid context at > mm/slab.h:494 > [ 335.420003][T0] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: > 0, name: swapper/88 > [ 335.420005][T0] no locks held by swapper/88/0. > [ 335.420007][T0] irq event stamp: 18074448 > [ 335.420015][T0] hardirqs last enabled at (18074447): > [] tick_nohz_idle_enter+0x9c/0x110 > [ 335.420019][T0] hardirqs last disabled at (18074448): > [] do_idle+0x138/0x3b0 > do_idle at kernel/sched/idle.c:253 (discriminator 1) > [ 335.420023][T0] softirqs last enabled at (18074440): > [] irq_enter_rcu+0x94/0xa0 > [ 335.420026][T0] softirqs last disabled at (18074439): > [] irq_enter_rcu+0x70/0xa0 > [ 335.420030][T0] CPU: 88 PID: 0 Comm: swapper/88 Tainted: GW > 5.9.0-rc8-next-20201007 #1 > [ 335.420032][T0] Call Trace: > [ 335.420037][T0] [c0002a4bfcf0] [c0649e98] > dump_stack+0xec/0x144 (unreliable) > [ 335.420043][T0] [c0002a4bfd30] [c00f6c34] > ___might_sleep+0x2f4/0x310 > [ 335.420048][T0] [c0002a4bfdb0] [c0354f94] > slab_pre_alloc_hook.constprop.82+0x124/0x190 > [ 335.420051][T0] [c0002a4bfe00] [c035e9e8] > __kmalloc_node+0x88/0x3a0 > slab_alloc_node at mm/slub.c:2817 > (inlined by) __kmalloc_node at mm/slub.c:4013 > [ 335.420054][T0] [c0002a4bfe80] [c06494d8] > alloc_cpumask_var_node+0x38/0x80 > kmalloc_node at include/linux/slab.h:577 > (inlined by) alloc_cpumask_var_node at lib/cpumask.c:116 > [ 335.420060][T0] [c0002a4bfef0] [c003eedc] > start_secondary+0x27c/0x800 > update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267 > (inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387 > (inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420 > [ 335.420063][T0] [c0002a4bff90] [c000c468] > start_secondary_resume+0x10/0x14 > > > + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu)); > > + > > + if (has_big_cores) > > + submask_fn = cpu_smallcore_mask; > > + > > + /* Update l2-cache mask with all the CPUs that are part of submask 
*/ > > + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); > > + > > + /* Skip all CPUs already part of current CPU l2-cache mask */ > > + cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu)); > > + > > + for_each_cpu(i, mask) { > > /* > > * when updating the marks the current CPU has not been marked > > * online, but we need to update the cache masks > > */ > > np = cpu_to_l2cache(i); > > - if (!np) > > -
[PATCH v2 06/11] powerpc/smp: Stop passing mask to update_mask_by_l2
update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask as parameter. Instead of passing cpu_l2_cache_mask, use it directly in update_mask_by_l2. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index c860c4950c9f..441c9c64b1e3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1218,7 +1218,7 @@ static struct device_node *cpu_to_l2cache(int cpu) return cache; } -static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) +static bool update_mask_by_l2(int cpu) { struct device_node *l2_cache, *np; int i; @@ -1240,7 +1240,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) return false; } - cpumask_set_cpu(cpu, mask_fn(cpu)); + cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked @@ -1251,7 +1251,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) continue; if (np == l2_cache) - set_cpus_related(cpu, i, mask_fn); + set_cpus_related(cpu, i, cpu_l2_cache_mask); of_node_put(np); } @@ -1315,7 +1315,7 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - update_mask_by_l2(cpu, cpu_l2_cache_mask); + update_mask_by_l2(cpu); if (has_coregroup_support()) { int coregroup_id = cpu_to_coregroup_id(cpu); -- 2.17.1
[PATCH v2 10/11] powerpc/smp: Move coregroup mask updation to a new function
Move the logic for updating the coregroup mask of a CPU to its own function. This will help in reworking the updation of coregroup mask in subsequent patch. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 32 +++- 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 17e90c2414af..b48ae4e306d3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1337,6 +1337,23 @@ static inline void add_cpu_to_smallcore_masks(int cpu) } } +static void update_coregroup_mask(int cpu) +{ + int first_thread = cpu_first_thread_sibling(cpu); + int coregroup_id = cpu_to_coregroup_id(cpu); + int i; + + cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { + int fcpu = cpu_first_thread_sibling(i); + + if (fcpu == first_thread) + set_cpus_related(cpu, i, cpu_coregroup_mask); + else if (coregroup_id == cpu_to_coregroup_id(i)) + set_cpus_related(cpu, i, cpu_coregroup_mask); + } +} + static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); @@ -1355,19 +1372,8 @@ static void add_cpu_to_masks(int cpu) add_cpu_to_smallcore_masks(cpu); update_mask_by_l2(cpu); - if (has_coregroup_support()) { - int coregroup_id = cpu_to_coregroup_id(cpu); - - cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { - int fcpu = cpu_first_thread_sibling(i); - - if (fcpu == first_thread) - set_cpus_related(cpu, i, cpu_coregroup_mask); - else if (coregroup_id == cpu_to_coregroup_id(i)) - set_cpus_related(cpu, i, cpu_coregroup_mask); - } - } + if (has_coregroup_support()) + update_coregroup_mask(cpu); } /* Activate a secondary processor. 
*/ -- 2.17.1
[PATCH v2 05/11] powerpc/smp: Limit CPUs traversed to within a node.
All the arch specific topology cpumasks are within a node/DIE. However when setting these per CPU cpumasks, system traverses through all the online CPUs. This is redundant. Reduce the traversal to only CPUs that are online in the node to which the CPU belongs to. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 2e61a81aad88..c860c4950c9f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1241,7 +1241,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) } cpumask_set_cpu(cpu, mask_fn(cpu)); - for_each_cpu(i, cpu_online_mask) { + for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { /* * when updating the marks the current CPU has not been marked * online, but we need to update the cache masks -- 2.17.1
[PATCH v2 11/11] powerpc/smp: Optimize update_coregroup_mask
All threads of a SMT4/SMT8 core can either be part of this CPU's coregroup mask or outside the coregroup. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same coregroup. Use a temporary mask to iterate through the CPUs that may share coregroup mask. Also instead of setting one CPU at a time into cpu_coregroup_mask, copy the SMT4/SMT8 submask at one shot.

Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 30 ++ 1 file changed, 22 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index b48ae4e306d3..bbaea93dc558 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1339,19 +1339,33 @@ static inline void add_cpu_to_smallcore_masks(int cpu) static void update_coregroup_mask(int cpu) { - int first_thread = cpu_first_thread_sibling(cpu); + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask; + cpumask_var_t mask; int coregroup_id = cpu_to_coregroup_id(cpu); int i; - cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { - int fcpu = cpu_first_thread_sibling(i); + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu)); + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu)); + + if (shared_caches) + submask_fn = cpu_l2_cache_mask; + + /* Update coregroup mask with all the CPUs that are part of submask */ + or_cpumasks_related(cpu, cpu, submask_fn, cpu_coregroup_mask); + + /* Skip all CPUs already part of coregroup mask */ + cpumask_andnot(mask, mask, cpu_coregroup_mask(cpu)); - if (fcpu == first_thread) - set_cpus_related(cpu, i, cpu_coregroup_mask); - else if (coregroup_id == cpu_to_coregroup_id(i)) - set_cpus_related(cpu, i, cpu_coregroup_mask); + for_each_cpu(i, mask) { + /* Skip all CPUs not part of this coregroup */ + if (coregroup_id == cpu_to_coregroup_id(i)) { + or_cpumasks_related(cpu, i, submask_fn, cpu_coregroup_mask); + cpumask_andnot(mask, mask, submask_fn(i)); + } else { + cpumask_andnot(mask, mask, cpu_coregroup_mask(i)); + } } + free_cpumask_var(mask); } static void add_cpu_to_masks(int cpu) -- 2.17.1
[PATCH v2 04/11] powerpc/smp: Optimize remove_cpu_from_masks
While offlining a CPU, system currently iterate through all the CPUs in the DIE to clear sibling, l2_cache and smallcore maps. However if there are more cores in a DIE, system can end up spending more time iterating through CPUs which are completely unrelated. Optimize this by only iterating through smaller but relevant cpumap. If shared_cache is set, cpu_l2_cache_map should be relevant else cpu_sibling_map would be relevant. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8c095fe237b2..2e61a81aad88 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1263,14 +1263,21 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) #ifdef CONFIG_HOTPLUG_CPU static void remove_cpu_from_masks(int cpu) { + struct cpumask *(*mask_fn)(int) = cpu_sibling_mask; int i; - for_each_cpu(i, cpu_cpu_mask(cpu)) { + if (shared_caches) + mask_fn = cpu_l2_cache_mask; + + for_each_cpu(i, mask_fn(cpu)) { set_cpus_unrelated(cpu, i, cpu_l2_cache_mask); set_cpus_unrelated(cpu, i, cpu_sibling_mask); if (has_big_cores) set_cpus_unrelated(cpu, i, cpu_smallcore_mask); - if (has_coregroup_support()) + } + + if (has_coregroup_support()) { + for_each_cpu(i, cpu_coregroup_mask(cpu)) set_cpus_unrelated(cpu, i, cpu_coregroup_mask); } } -- 2.17.1
[PATCH v2 07/11] powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
Currently on hotplug/hotunplug, CPU iterates through all the CPUs in its core to find threads in its thread group. However this info is already captured in cpu_l1_cache_map. Hence reduce iterations and cleanup add_cpu_to_smallcore_masks function. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/kernel/smp.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 441c9c64b1e3..aeb219a4bf7a 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1285,16 +1285,15 @@ static void remove_cpu_from_masks(int cpu) static inline void add_cpu_to_smallcore_masks(int cpu) { - struct cpumask *this_l1_cache_map = per_cpu(cpu_l1_cache_map, cpu); - int i, first_thread = cpu_first_thread_sibling(cpu); + int i; if (!has_big_cores) return; cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu)); - for (i = first_thread; i < first_thread + threads_per_core; i++) { - if (cpu_online(i) && cpumask_test_cpu(i, this_l1_cache_map)) + for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) { + if (cpu_online(i)) set_cpus_related(i, cpu, cpu_smallcore_mask); } } -- 2.17.1
[PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2
All threads of a SMT4 core can either be part of this CPU's l2-cache mask or not related to this CPU l2-cache mask. Use this relation to reduce the number of iterations needed to find all the CPUs that share the same l2-cache. Use a temporary mask to iterate through the CPUs that may share l2_cache mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask, copy the SMT4/sub mask at one shot. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 51 ++- 1 file changed, 45 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 6f866e6b12f8..17e90c2414af 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -670,6 +670,28 @@ static void set_cpus_unrelated(int i, int j, } #endif +/* + * Extends set_cpus_related. Instead of setting one CPU at a time in + * dstmask, set srcmask at oneshot. dstmask should be super set of srcmask. 
+ */ +static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), + struct cpumask *(*dstmask)(int)) +{ + struct cpumask *mask; + int k; + + mask = srcmask(j); + for_each_cpu(k, srcmask(i)) + cpumask_or(dstmask(k), dstmask(k), mask); + + if (i == j) + return; + + mask = srcmask(i); + for_each_cpu(k, srcmask(j)) + cpumask_or(dstmask(k), dstmask(k), mask); +} + /* * parse_thread_groups: Parses the "ibm,thread-groups" device tree * property for the CPU device node @dn and stores @@ -1220,7 +1242,9 @@ static struct device_node *cpu_to_l2cache(int cpu) static bool update_mask_by_l2(int cpu) { + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask; struct device_node *l2_cache, *np; + cpumask_var_t mask; int i; l2_cache = cpu_to_l2cache(cpu); @@ -1240,22 +1264,37 @@ static bool update_mask_by_l2(int cpu) return false; } - cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu)); - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) { + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu)); + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu)); + + if (has_big_cores) + submask_fn = cpu_smallcore_mask; + + /* Update l2-cache mask with all the CPUs that are part of submask */ + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask); + + /* Skip all CPUs already part of current CPU l2-cache mask */ + cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu)); + + for_each_cpu(i, mask) { /* * when updating the marks the current CPU has not been marked * online, but we need to update the cache masks */ np = cpu_to_l2cache(i); - if (!np) - continue; - if (np == l2_cache) - set_cpus_related(cpu, i, cpu_l2_cache_mask); + /* Skip all CPUs already part of current CPU l2-cache */ + if (np == l2_cache) { + or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask); + cpumask_andnot(mask, mask, submask_fn(i)); + } else { + cpumask_andnot(mask, mask, cpu_l2_cache_mask(i)); + } of_node_put(np); } of_node_put(l2_cache); + free_cpumask_var(mask); return true; } -- 2.17.1
[PATCH v2 08/11] powerpc/smp: Check for duplicate topologies and consolidate
CACHE and COREGROUP domains are now part of default topology. However on systems that don't support CACHE or COREGROUP, these domains will eventually be degenerated. The degeneration happens per CPU. Do note the current fixup_topology() logic ensures that mask of a domain that is not supported on the current platform is set to the previous domain. Instead of waiting for the scheduler to degenerated try to consolidate based on their masks and sd_flags. This is done just before setting the scheduler topology. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index aeb219a4bf7a..6f866e6b12f8 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1401,6 +1401,8 @@ int setup_profiling_timer(unsigned int multiplier) static void fixup_topology(void) { + int i; + #ifdef CONFIG_SCHED_SMT if (has_big_cores) { pr_info("Big cores detected but using small core scheduling\n"); @@ -1410,6 +1412,30 @@ static void fixup_topology(void) if (!has_coregroup_support()) powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask; + + /* +* Try to consolidate topology levels here instead of +* allowing scheduler to degenerate. +* - Dont consolidate if masks are different. +* - Dont consolidate if sd_flags exists and are different. 
+*/ + for (i = 1; i <= die_idx; i++) { + if (powerpc_topology[i].mask != powerpc_topology[i - 1].mask) + continue; + + if (powerpc_topology[i].sd_flags && powerpc_topology[i - 1].sd_flags && + powerpc_topology[i].sd_flags != powerpc_topology[i - 1].sd_flags) + continue; + + if (!powerpc_topology[i - 1].sd_flags) + powerpc_topology[i - 1].sd_flags = powerpc_topology[i].sd_flags; + + powerpc_topology[i].mask = powerpc_topology[i + 1].mask; + powerpc_topology[i].sd_flags = powerpc_topology[i + 1].sd_flags; +#ifdef CONFIG_SCHED_DEBUG + powerpc_topology[i].name = powerpc_topology[i + 1].name; +#endif + } } void __init smp_cpus_done(unsigned int max_cpus) -- 2.17.1
[PATCH v2 01/11] powerpc/topology: Update topology_core_cpumask
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs. cpu_cpu_mask is needed by scheduler, hence look at deprecating cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user is moved to cpu_cpu_mask. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/include/asm/topology.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 6609174918ab..e0f232533c9d 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -122,7 +122,7 @@ int get_physical_package_id(int cpu); #endif #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) -#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu)) +#define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu)) #define topology_core_id(cpu) (cpu_to_core_id(cpu)) #endif -- 2.17.1
[PATCH v2 03/11] powerpc/smp: Remove get_physical_package_id
Now that cpu_core_mask has been removed and topology_core_cpumask has been updated to use cpu_cpu_mask, we no more need get_physical_package_id. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/include/asm/topology.h | 5 - arch/powerpc/kernel/smp.c | 20 2 files changed, 25 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index e0f232533c9d..e45219f74be0 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -114,12 +114,7 @@ static inline int cpu_to_coregroup_id(int cpu) #ifdef CONFIG_PPC64 #include -#ifdef CONFIG_PPC_SPLPAR -int get_physical_package_id(int cpu); -#define topology_physical_package_id(cpu) (get_physical_package_id(cpu)) -#else #define topology_physical_package_id(cpu) (cpu_to_chip_id(cpu)) -#endif #define topology_sibling_cpumask(cpu) (per_cpu(cpu_sibling_map, cpu)) #define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu)) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec41491beca4..8c095fe237b2 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1292,26 +1292,6 @@ static inline void add_cpu_to_smallcore_masks(int cpu) } } -int get_physical_package_id(int cpu) -{ - int pkg_id = cpu_to_chip_id(cpu); - - /* -* If the platform is PowerNV or Guest on KVM, ibm,chip-id is -* defined. Hence we would return the chip-id as the result of -* get_physical_package_id. 
-*/ - if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) && - IS_ENABLED(CONFIG_PPC_SPLPAR)) { - struct device_node *np = of_get_cpu_node(cpu, NULL); - pkg_id = of_node_to_nid(np); - of_node_put(np); - } - - return pkg_id; -} -EXPORT_SYMBOL_GPL(get_physical_package_id); - static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); -- 2.17.1
[PATCH v2 00/11] Optimization to improve CPU online/offline on Powerpc
0.8589786 +patchset 100 21.88 25.79 23.16 23.2945 0.86394839 Observations: Performance of ebizzy/ perf_sched_bench / schbench remain the same with and without the patchset. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Changelog v1->v2: v1 link: https://lore.kernel.org/linuxppc-dev/20200727075532.30058-1-sri...@linux.vnet.ibm.com/t/#u Added five more patches on top of Seven. Rebased to 19th Sept 2020 powerpc/next (based on v5.9-rc2) Srikar Dronamraju (11): powerpc/topology: Update topology_core_cpumask powerpc/smp: Stop updating cpu_core_mask powerpc/smp: Remove get_physical_package_id powerpc/smp: Optimize remove_cpu_from_masks powerpc/smp: Limit CPUs traversed to within a node. powerpc/smp: Stop passing mask to update_mask_by_l2 powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs powerpc/smp: Check for duplicate topologies and consolidate powerpc/smp: Optimize update_mask_by_l2 powerpc/smp: Move coregroup mask updation to a new function powerpc/smp: Optimize update_coregroup_mask arch/powerpc/include/asm/smp.h | 5 - arch/powerpc/include/asm/topology.h | 7 +- arch/powerpc/kernel/smp.c | 186 ++-- 3 files changed, 120 insertions(+), 78 deletions(-) -- 2.17.1
[PATCH v2 02/11] powerpc/smp: Stop updating cpu_core_mask
Anton Blanchard reported that his 4096 vcpu KVM guest took around 30 minutes to boot. He also traced it down to the time taken to iterate while setting the cpu_core_mask. Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU would be equal on Power. However updating cpu_core_mask took forever as it is a per-CPU cpumask variable, whereas cpu_cpu_mask is a per-node/per-DIE cpumask shared by all the respective CPUs. Also cpu_cpu_mask is needed from a scheduler perspective. However cpu_core_map is an exported symbol. Hence stop updating cpu_core_map and make it point to cpu_cpu_mask.

Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Satheesh Rajendran Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Signed-off-by: Srikar Dronamraju Tested-by: Satheesh Rajendran --- arch/powerpc/include/asm/smp.h | 5 - arch/powerpc/kernel/smp.c | 33 +++-- 2 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 041f0b97c45b..40e121dd16af 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -119,11 +119,6 @@ static inline struct cpumask *cpu_sibling_mask(int cpu) return per_cpu(cpu_sibling_map, cpu); } -static inline struct cpumask *cpu_core_mask(int cpu) -{ - return per_cpu(cpu_core_map, cpu); -} - static inline struct cpumask *cpu_l2_cache_mask(int cpu) { return per_cpu(cpu_l2_cache_map, cpu); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 3d96752d6570..ec41491beca4 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -953,12 +953,17 @@ void __init smp_prepare_cpus(unsigned int max_cpus) local_memory_node(numa_cpu_lookup_table[cpu])); } #endif + /* +* cpu_core_map is no more updated and exists only since +* it has been exported for long. It will only have a snapshot +* of cpu_cpu_mask. +*/ + cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu)); } /* Init the cpumasks so the boot CPU is related to itself */ cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid)); cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid)); - cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid)); if (has_coregroup_support()) cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid)); @@ -1260,9 +1265,7 @@ static void remove_cpu_from_masks(int cpu) { int i; - /* NB: cpu_core_mask is a superset of the others */ - for_each_cpu(i, cpu_core_mask(cpu)) { - set_cpus_unrelated(cpu, i, cpu_core_mask); + for_each_cpu(i, cpu_cpu_mask(cpu)) { set_cpus_unrelated(cpu, i, cpu_l2_cache_mask); set_cpus_unrelated(cpu, i, cpu_sibling_mask); if (has_big_cores) @@ -1312,7 +1315,6 @@ EXPORT_SYMBOL_GPL(get_physical_package_id); static void add_cpu_to_masks(int cpu) { int first_thread = cpu_first_thread_sibling(cpu); - int pkg_id = get_physical_package_id(cpu); int i; /* @@ -1320,7 +1322,6 @@ static void add_cpu_to_masks(int cpu) * add it to it's own thread sibling mask. */ cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); - cpumask_set_cpu(cpu, cpu_core_mask(cpu)); for (i = first_thread; i < first_thread + threads_per_core; i++) if (cpu_online(i)) @@ -1342,26 +1343,6 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(cpu, i, cpu_coregroup_mask); } } - - if (pkg_id == -1) { - struct cpumask *(*mask)(int) = cpu_sibling_mask; - - /* -* Copy the sibling mask into core sibling mask and -* mark any CPUs on the same chip as this CPU. -*/ - if (shared_caches) - mask = cpu_l2_cache_mask; - - for_each_cpu(i, mask(cpu)) - set_cpus_related(cpu, i, cpu_core_mask); - - return; - } - - for_each_cpu(i, cpu_online_mask) - if (get_physical_package_id(i) == pkg_id) - set_cpus_related(cpu, i, cpu_core_mask); } /* Activate a secondary processor. */ -- 2.17.1
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
Fix to make it work where CPUs don't have an l2-cache element. >8-8<- >From b25d47b01b7195b1df19083a4043fa6a87a901a3 Mon Sep 17 00:00:00 2001 From: Srikar Dronamraju Date: Thu, 9 Jul 2020 13:33:38 +0530 Subject: [PATCH v5.2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling Current code assumes that cpumask of cpus sharing a l2-cache mask will always be a superset of cpu_sibling_mask. Let's stop that assumption. cpu_l2_cache_mask is a superset of cpu_sibling_mask if and only if shared_caches is set. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v1 -> v2: Set cpumask after verifying l2-cache. (Gautham) Changelog v5 -> v5.2: If cpu has no l2-cache set cpumask as per its sibling mask. (Michael Ellerman) arch/powerpc/kernel/smp.c | 43 +-- 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 9f4333d..168532e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1186,9 +1186,23 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) int i; l2_cache = cpu_to_l2cache(cpu); - if (!l2_cache) + if (!l2_cache) { + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; + + /* + * If no l2cache for this CPU, assume all siblings share the + * cache with this CPU.
+ */ + if (has_big_cores) + sibling_mask = cpu_smallcore_mask; + + for_each_cpu(i, sibling_mask(cpu)) + set_cpus_related(cpu, i, cpu_l2_cache_mask); + return false; + } + cpumask_set_cpu(cpu, mask_fn(cpu)); for_each_cpu(i, cpu_online_mask) { /* * when updating the marks the current CPU has not been marked @@ -1271,29 +1285,30 @@ static void add_cpu_to_masks(int cpu) * add it to it's own thread sibling mask. */ cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); for (i = first_thread; i < first_thread + threads_per_core; i++) if (cpu_online(i)) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - /* - * Copy the thread sibling mask into the cache sibling mask - * and mark any CPUs that share an L2 with this CPU. - */ - for_each_cpu(i, cpu_sibling_mask(cpu)) - set_cpus_related(cpu, i, cpu_l2_cache_mask); update_mask_by_l2(cpu, cpu_l2_cache_mask); - /* - * Copy the cache sibling mask into core sibling mask and mark - * any CPUs on the same chip as this CPU. - */ - for_each_cpu(i, cpu_l2_cache_mask(cpu)) - set_cpus_related(cpu, i, cpu_core_mask); + if (pkg_id == -1) { + struct cpumask *(*mask)(int) = cpu_sibling_mask; + + /* + * Copy the sibling mask into core sibling mask and + * mark any CPUs on the same chip as this CPU. + */ + if (shared_caches) + mask = cpu_l2_cache_mask; + + for_each_cpu(i, mask(cpu)) + set_cpus_related(cpu, i, cpu_core_mask); - if (pkg_id == -1) return; + } for_each_cpu(i, cpu_online_mask) if (get_physical_package_id(i) == pkg_id) -- 2.17.1
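The invariant this v5.2 fix restores can be modelled with plain bitmask words (a hypothetical sketch; real kernel cpumasks and the sched-domain checks are more involved): every topology level must span the level below it, and when the device tree has no l2-cache node, falling back to the sibling mask keeps the CACHE level covering the SMT level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t mask_t;	/* stand-in for a kernel cpumask */

/* The scheduler requires e.g. SMT to be a subset of CACHE; when that is
 * violated it prints "BUG: arch topology borken". */
static bool is_subset(mask_t child, mask_t parent)
{
	return (child & ~parent) == 0;
}

/* Sketch of the fallback: with no l2-cache node, use the sibling mask
 * as the cache mask instead of leaving only the CPU's own bit set. */
static mask_t l2_cache_mask(bool has_l2_node, mask_t l2_from_dt, mask_t sibling_mask)
{
	return has_l2_node ? l2_from_dt : sibling_mask;
}
```

With an SMT4 sibling mask of 0xF, the broken case left the cache mask as a single bit, failing the subset test; the fallback makes it pass again.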
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
* Michael Ellerman [2020-09-13 11:46:41]: > Srikar Dronamraju writes: > > * Michael Ellerman [2020-09-11 21:55:23]: > > > >> Srikar Dronamraju writes: > >> > Current code assumes that cpumask of cpus sharing a l2-cache mask will > >> > always be a superset of cpu_sibling_mask. > >> > > >> > Lets stop that assumption. cpu_l2_cache_mask is a superset of > >> > cpu_sibling_mask if and only if shared_caches is set. > >> > >> I'm seeing oopses with this: > >> > > The patch fixes qemu, and I don't see the crash on mambo, but I still > see: > [0.010536] smp: Bringing up secondary CPUs ... > [0.019189] smp: Brought up 2 nodes, 8 CPUs > [0.019210] numa: Node 0 CPUs: 0-3 > [0.019235] numa: Node 1 CPUs: 4-7 > [0.02] the CACHE domain not a subset of the MC domain > [0.024505] BUG: arch topology borken > [0.024527] the SMT domain not a subset of the CACHE domain > [0.024563] BUG: arch topology borken > [0.024584] the CACHE domain not a subset of the MC domain > [0.024645] BUG: arch topology borken > [0.024666] the SMT domain not a subset of the CACHE domain > [0.024702] BUG: arch topology borken > [0.024723] the CACHE domain not a subset of the MC domain > > That's the p9 mambo model, using skiboot.tcl from skiboot, with CPUS=2, > THREADS=4 and MAMBO_NUMA=1. > I was able to reproduce with qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \ -kernel build~/vmlinux \ -m 2G,slots=2,maxmem=4G \ -object memory-backend-ram,size=1G,id=m0 \ -object memory-backend-ram,size=1G,id=m1 \ -numa node,nodeid=0,memdev=m0 \ -numa node,nodeid=1,memdev=m1 \ -smp 8,threads=4,sockets=2,maxcpus=8 \ If the CPU doesn't have an l2-cache element, then the CPU not only has to set itself in cpu_l2_cache_mask but also its siblings. Otherwise the sibling mask ends up with 4 CPUs set while the cache domain has just one CPU set, leading to this BUG message. Patch follows this mail.
> Node layout is: > > [0.00] Early memory node ranges > [0.00] node 0: [mem 0x-0x] > [0.00] node 1: [mem 0x2000-0x2000] > [0.00] Initmem setup node 0 [mem > 0x-0x] > [0.00] On node 0 totalpages: 65536 > [0.00] Initmem setup node 1 [mem > 0x2000-0x2000] > [ 0.00] On node 1 totalpages: 65536 > > > There aren't any l2-cache properties in the device-tree under cpus. > > I'll try and have a closer look tonight. > > cheers -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
* Michael Ellerman [2020-09-11 21:55:23]: > Srikar Dronamraju writes: > > Current code assumes that cpumask of cpus sharing a l2-cache mask will > > always be a superset of cpu_sibling_mask. > > > > Lets stop that assumption. cpu_l2_cache_mask is a superset of > > cpu_sibling_mask if and only if shared_caches is set. > > I'm seeing oopses with this: > > [0.117392][T1] smp: Bringing up secondary CPUs ... > [0.156515][T1] smp: Brought up 2 nodes, 2 CPUs > [0.158265][T1] numa: Node 0 CPUs: 0 > [0.158520][T1] numa: Node 1 CPUs: 1 > [0.167453][T1] BUG: Unable to handle kernel data access on read at > 0x800041228298 > [0.167992][T1] Faulting instruction address: 0xc018c128 > [0.168817][T1] Oops: Kernel access of bad area, sig: 11 [#1] > [0.168964][T1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > [0.169417][T1] Modules linked in: > [0.170047][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted > 5.9.0-rc2-00095-g7430ad5aa700 #209 > [0.170305][T1] NIP: c018c128 LR: c018c0cc CTR: > c004dce0 > [0.170498][T1] REGS: c0007e343880 TRAP: 0380 Not tainted > (5.9.0-rc2-00095-g7430ad5aa700) > [0.170602][T1] MSR: 82009033 CR: > 4400 XER: > [0.170985][T1] CFAR: c018c288 IRQMASK: 0 > [0.170985][T1] GPR00: c0007e343b10 > c173e400 4000 > [0.170985][T1] GPR04: 0800 > 0800 > [0.170985][T1] GPR08: c122c298 > c0003fffc000 c0007fd05ce8 > [0.170985][T1] GPR12: c0007e0119f8 c193 > 8ade > [0.170985][T1] GPR16: c0007e3c0640 0917 > c0007e3c0658 0008 > [0.170985][T1] GPR20: c15d0bb8 8ade > c0f57400 c1817c28 > [0.170985][T1] GPR24: c176dc80 c0007e3c0890 > c0007e3cfe00 > [0.170985][T1] GPR28: c1772310 c0007e011900 > c0007e3c0800 0001 > [0.172750][T1] NIP [c018c128] build_sched_domains+0x808/0x14b0 > [0.172900][T1] LR [c018c0cc] build_sched_domains+0x7ac/0x14b0 > [0.173186][T1] Call Trace: > [0.173484][T1] [c0007e343b10] [c018bfe8] > build_sched_domains+0x6c8/0x14b0 (unreliable) > [0.173821][T1] [c0007e343c50] [c018dcdc] > sched_init_domains+0xec/0x130 > [0.174037][T1] [c0007e343ca0] [c10d59d8] 
> sched_init_smp+0x50/0xc4 > [0.174207][T1] [c0007e343cd0] [c10b45c4] > kernel_init_freeable+0x1b4/0x378 > [0.174378][T1] [c0007e343db0] [c00129fc] > kernel_init+0x24/0x158 > [0.174740][T1] [c0007e343e20] [c000d9d0] > ret_from_kernel_thread+0x5c/0x6c > [0.175050][T1] Instruction dump: > [0.175626][T1] 554905ee 71480040 7d2907b4 4182016c 2c29 3920006e > 913e002c 41820034 > [0.175841][T1] 7c6307b4 e9300020 78631f24 7d58182a <7d2a482a> > f93e0080 7d404828 314a0001 > [0.178340][T1] ---[ end trace 6876b88dd1d4b3bb ]--- > [0.178512][T1] > [1.180458][T1] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x000b > > That's qemu: > > qemu-system-ppc64 -nographic -vga none -M pseries -cpu POWER8 \ > -kernel build~/vmlinux \ > -m 2G,slots=2,maxmem=4G \ > -object memory-backend-ram,size=1G,id=m0 \ > -object memory-backend-ram,size=1G,id=m1 \ > -numa node,nodeid=0,memdev=m0 \ > -numa node,nodeid=1,memdev=m1 \ > -smp 2,sockets=2,maxcpus=2 \ > Thanks Michael for the report, for identifying the offending patch, and for giving an easy reproducer. That made my task easy. (My only problem was that all my PowerKVM hosts had an old compiler that refused to compile newer kernels.) So in this setup, the CPU doesn't have an l2-cache, and in that scenario we miss updating the l2-cache domain. Actually the initial patch had this exact code. However, it was my mistake: I should have reassessed it before making the changes suggested by Gautham. Patch below. Do let me know if you want me to send the patch separately. > > On mambo I get: > > [0.005069][T1] smp: Bringing up secondary CPUs ... > [0.011656][T1] smp: Brought up 2 nodes, 8 CPUs > [0.011682][T1] numa: Node 0 CPUs: 0-3 > [0.011709][T1] numa: Node 1 CPUs: 4-7 > [0.012015][T1] BUG: arch topology borken > [0.012040][T1] the SMT domain not a subset of the CACHE domain > [
Re: [PATCH] sched/fair: use dst group while checking imbalance for NUMA balancer
> > > > On Mon, Sep 07, 2020 at 07:27:08PM +1200, Barry Song wrote: > > > Something is wrong. In find_busiest_group(), we are checking if src has > > > higher load, however, in task_numa_find_cpu(), we are checking if dst > > > will have higher load after balancing. It seems it is not sensible to > > > check src. > > > It maybe cause wrong imbalance value, for example, if > > > dst_running = env->dst_stats.nr_running + 1 results in 3 or above, and > > > src_running = env->src_stats.nr_running - 1 results in 1; > > > The current code is thinking imbalance as 0 since src_running is smaller > > > than 2. > > > This is inconsistent with load balancer. > > > I have observed behaviour similar to what Barry Song has documented, with a simple ebizzy run with fewer threads on a 2-node system ebizzy -t 6 -S 100 We see a couple of ebizzy threads moving back and forth between the 2 nodes because the numa balancer and the load balancer try to do the exact opposite. However with Barry's patch, a couple of tests regress heavily (any numa workload that has shared numa faults). For example: perf bench numa mem --no-data_rand_walk -p 1 -t 6 -G 0 -P 3072 -T 0 -l 50 -c I also don't understand the rationale behind checking for dst_running in the numa balancer path. This almost means no numa balancing in lightly loaded scenarios. So I agree with Mel that we should probably test more scenarios before we accept this patch. > > > > It checks the conditions if the move was to happen. Have you evaluated > > this for a NUMA balancing load and confirmed it a) balances properly and > > b) does not increase the scan rate trying to "fix" the problem? > > I think the original code was trying to check if the numa migration > would lead to new imbalance in load balancer. In case src is A, dst is B, and > both of them have nr_running as 2. A moves one task to B, then A > will have 1, B will have 3. In load balancer, A will try to pull task > from B since B's nr_running is larger than min_imbalance.
But the code > is saying imbalance=0 by finding A's nr_running is smaller than > min_imbalance. > > Will share more test data if you need. > > > > > -- > > Mel Gorman > > SUSE Labs > > Thanks > Barry -- Thanks and Regards Srikar Dronamraju
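To make the src-versus-dst disagreement concrete, here is a heavily condensed sketch of the small-imbalance check under discussion (my own simplification; the real logic in task_numa_find_cpu() takes more inputs, and the threshold of 2 mirrors the constant in Barry's example):

```c
#include <assert.h>

/* Condensed form of the check: treat a node with at most 2 running
 * tasks as having room, i.e. report no imbalance for it. */
static int numa_imbalance(int imbalance, int nr_running)
{
	if (nr_running <= 2)
		return 0;
	return imbalance;
}
```

In Barry's example, feeding it src_running = 1 hides an imbalance of 2, while feeding it dst_running = 3 reports it, which is exactly the inconsistency with the load balancer being debated.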
Re: [RFC v2] perf/core: Fixes hung issue on perf stat command during cpu hotplug
* Kajol Jain [2020-08-26 20:24:11]: > Commit 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") > added assignment of ret value as -EAGAIN in case function > call to 'smp_call_function_single' fails. > For non-zero ret value, it did > 'ret = !ret ? data.ret : -EAGAIN;', which always > assigns -EAGAIN to ret and makes the second if condition useless. > > In scenarios like when executing a perf stat with --per-thread option, and > if any of the monitored CPUs goes offline, the 'smp_call_function_single' > function could return -ENXIO, and with the above check, > task_function_call hangs and increases CPU > usage (because of repeated 'smp_call_function_single()') > > Recreation scenario: > # perf stat -a --per-thread && (offline a CPU ) > > Patch here removes the ternary condition added as part of that > commit and adds a check for NULL and -EAGAIN. > > Fixes: 2ed6edd33a21("perf: Add cond_resched() to task_function_call()") > Signed-off-by: Kajol Jain > Reported-by: Srikar Dronamraju Tested-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH] sched/fair: Fix wrong cpu selecting from isolated domain
* xunlei [2020-08-25 10:11:24]: > On 2020/8/24 PM9:38, Srikar Dronamraju wrote: > > * Xunlei Pang [2020-08-24 20:30:19]: > > > >> We've met problems that occasionally tasks with full cpumask > >> (e.g. by putting it into a cpuset or setting to full affinity) > >> were migrated to our isolated cpus in production environment. > >> > >> After some analysis, we found that it is due to the current > >> select_idle_smt() not considering the sched_domain mask. > >> > >> Fix it by checking the valid domain mask in select_idle_smt(). > >> > >> Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve > >> select_idle_siblings()) > >> Reported-by: Wetp Zhang > >> Signed-off-by: Xunlei Pang > >> --- > >> kernel/sched/fair.c | 9 + > >> 1 file changed, 5 insertions(+), 4 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 1a68a05..fa942c4 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, > >> struct sched_domain *sd, int > >> /* > >> * Scan the local SMT mask for idle CPUs. > >> */ > >> -static int select_idle_smt(struct task_struct *p, int target) > >> +static int select_idle_smt(struct task_struct *p, struct sched_domain > >> *sd, int target) > >> { > >>int cpu; > >> > >> @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, > >> int target) > >>return -1; > >> > >>for_each_cpu(cpu, cpu_smt_mask(target)) { > >> - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) > >> + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || > >> + !cpumask_test_cpu(cpu, sched_domain_span(sd))) > >>continue; > > > > Don't think this is right thing to do. What if this task had set a cpumask > > that doesn't cover all the cpus in this sched_domain_span(sd) ah, right I missed the 'or' part. 
> > It doesn't matter, without this patch, it selects an idle cpu from: > "cpu_smt_mask(target) and p->cpus_ptr" > > with this patch, it selects an idle cpu from: > "cpu_smt_mask(target) and p->cpus_ptr and sched_domain_span(sd)" > > > > > cpu_smt_mask(target) would already limit to the sched_domain_span(sd) so I > > am not sure how this can help? > > > > > > Here is an example: > CPU0 and CPU16 are hyper-thread pair, CPU16 is domain isolated. So its > sd_llc doesn't contain CPU16, and cpu_smt_mask(0) is 0 and 16. > > Then we have @target is 0, select_idle_smt() may return the isolated(and > idle) CPU16 without this patch. Okay. -- Thanks and Regards Srikar Dronamraju
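Xunlei's CPU0/CPU16 example can be reproduced with a simplified model of the fixed loop (bitmasks stand in for kernel cpumasks; pick_idle_smt() is an illustrative analogue of select_idle_smt(), not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t mask_t;

/* A sibling qualifies only if the task's affinity mask AND the sched
 * domain span allow it; without the span test, an idle but isolated
 * sibling could be returned. */
static int pick_idle_smt(mask_t smt_mask, mask_t cpus_allowed,
			 mask_t sd_span, mask_t idle)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		mask_t bit = (mask_t)1 << cpu;

		if (!(smt_mask & bit))
			continue;
		if (!(cpus_allowed & bit) || !(sd_span & bit))
			continue;
		if (idle & bit)
			return cpu;
	}
	return -1;
}
```

With CPU0 and CPU16 as hyper-thread siblings, CPU16 isolated (outside the domain span) and idle, the span test makes the scan fail cleanly instead of handing the task the isolated CPU16.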
Re: [PATCH] sched/fair: Fix wrong cpu selecting from isolated domain
* Xunlei Pang [2020-08-24 20:30:19]: > We've met problems that occasionally tasks with full cpumask > (e.g. by putting it into a cpuset or setting to full affinity) > were migrated to our isolated cpus in production environment. > > After some analysis, we found that it is due to the current > select_idle_smt() not considering the sched_domain mask. > > Fix it by checking the valid domain mask in select_idle_smt(). > > Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) > Reported-by: Wetp Zhang > Signed-off-by: Xunlei Pang > --- > kernel/sched/fair.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 1a68a05..fa942c4 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6075,7 +6075,7 @@ static int select_idle_core(struct task_struct *p, > struct sched_domain *sd, int > /* > * Scan the local SMT mask for idle CPUs. > */ > -static int select_idle_smt(struct task_struct *p, int target) > +static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, > int target) > { > int cpu; > > @@ -6083,7 +6083,8 @@ static int select_idle_smt(struct task_struct *p, int > target) > return -1; > > for_each_cpu(cpu, cpu_smt_mask(target)) { > - if (!cpumask_test_cpu(cpu, p->cpus_ptr)) > + if (!cpumask_test_cpu(cpu, p->cpus_ptr) || > + !cpumask_test_cpu(cpu, sched_domain_span(sd))) > continue; Don't think this is right thing to do. What if this task had set a cpumask that doesn't cover all the cpus in this sched_domain_span(sd) cpu_smt_mask(target) would already limit to the sched_domain_span(sd) so I am not sure how this can help? -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
* Michal Hocko [2020-08-18 09:37:12]: > On Tue 18-08-20 09:32:52, David Hildenbrand wrote: > > On 12.08.20 08:01, Srikar Dronamraju wrote: > > > Hi Andrew, Michal, David > > > > > > * Andrew Morton [2020-08-06 21:32:11]: > > > > > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju > > >> wrote: > > >> > > >>>> The memory hotplug changes that somehow because you can hotremove numa > > >>>> nodes and therefore make the nodemask sparse but that is not a common > > >>>> case. I am not sure what would happen if a completely new node was > > >>>> added > > >>>> and its corresponding node was already used by the renumbered one > > >>>> though. It would likely conflate the two I am afraid. But I am not sure > > >>>> this is really possible with x86 and a lack of a bug report would > > >>>> suggest that nobody is doing that at least. > > >>>> > > >> So... do we merge this patch or not? Seems that the overall view is > > >> "risky but nobody is likely to do anything better any time soon"? > > > > > > Can we decide on this one way or the other? > > > > Hmm, not sure who's the person to decide. I tend to prefer doing the > > node renaming, handling this in ppc code; > > Agreed. That would be a safer option. Okay, will send arch specific v6 version. > -- > Michal Hocko > SUSE Labs -- Thanks and Regards Srikar Dronamraju
Re: [PATCH] uprobes: __replace_page() avoid BUG in munlock_vma_page()
* Hugh Dickins [2020-08-16 13:44:25]: > syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), > when called from uprobes __replace_page(). Which of many ways to fix it? > Settled on not calling when PageCompound (since Head and Tail are equals > in this context, PageCompound the usual check in uprobes.c, and the prior > use of FOLL_SPLIT_PMD will have cleared PageMlocked already). > > Reported-by: syzbot > Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") > Signed-off-by: Hugh Dickins > Cc: sta...@vger.kernel.org # v5.4+ > --- > This one is not a 5.9-rc regression, but still good to fix. > > kernel/events/uprobes.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- v5.9-rc/kernel/events/uprobes.c 2020-08-12 19:46:50.851196584 -0700 > +++ linux/kernel/events/uprobes.c 2020-08-16 13:18:35.292821674 -0700 > @@ -205,7 +205,7 @@ static int __replace_page(struct vm_area > try_to_free_swap(old_page); > page_vma_mapped_walk_done(); > > - if (vma->vm_flags & VM_LOCKED) > + if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page)) > munlock_vma_page(old_page); > put_page(old_page); > Looks good to me. Reviewed-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v3 1/2] perf bench numa: use numa_node_to_cpus() to bind tasks to nodes
* Alexander Gordeev [2020-08-13 13:32:48]: > It is currently assumed that each node contains at most > nr_cpus/nr_nodes CPUs and nodes' CPU ranges do not overlap. > That assumption is generally incorrect as there are archs > where a CPU number does not depend on to its node number. > > This update removes the described assumption by simply calling > numa_node_to_cpus() interface and using the returned mask for > binding CPUs to nodes. > > Also, variable types and names made consistent in functions > using cpumask. > > Cc: Satheesh Rajendran > Cc: Srikar Dronamraju > Cc: Naveen N. Rao > Cc: Balamuruhan S > Cc: Peter Zijlstra > Cc: Ingo Molnar > Cc: Arnaldo Carvalho de Melo > Cc: Mark Rutland > Cc: Alexander Shishkin > Cc: Jiri Olsa > Cc: Namhyung Kim > Signed-off-by: Alexander Gordeev > --- > @@ -310,13 +306,16 @@ static cpu_set_t bind_to_node(int target_node) > for (cpu = 0; cpu < g->p.nr_cpus; cpu++) > CPU_SET(cpu, ); > } else { > - int cpu_start = (target_node + 0) * cpus_per_node; > - int cpu_stop = (target_node + 1) * cpus_per_node; > - > - BUG_ON(cpu_stop > g->p.nr_cpus); > + struct bitmask *cpumask = numa_allocate_cpumask(); > > - for (cpu = cpu_start; cpu < cpu_stop; cpu++) > - CPU_SET(cpu, ); > + BUG_ON(!cpumask); > + if (!numa_node_to_cpus(target_node, cpumask)) { > + for (cpu = 0; cpu < (int)cpumask->size; cpu++) { > + if (numa_bitmask_isbitset(cpumask, cpu)) > + CPU_SET(cpu, ); > + } > + } > + numa_free_cpumask(cpumask); > } > > ret = sched_setaffinity(0, sizeof(mask), ); > -- > 1.8.3.1 > Looks good to me. Reviewed-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v3 1/2] perf bench numa: fix cpumask memory leak in node_has_cpus()
* Alexander Gordeev [2020-08-13 13:30:42]: > Couple numa_allocate_cpumask() and numa_free_cpumask() functions > > Signed-off-by: Alexander Gordeev > --- > tools/perf/bench/numa.c | 13 + > 1 file changed, 9 insertions(+), 4 deletions(-) > > diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c > index 31e2601..9066511 100644 > --- a/tools/perf/bench/numa.c > +++ b/tools/perf/bench/numa.c > @@ -248,16 +248,21 @@ static int is_node_present(int node) > static bool node_has_cpus(int node) > { > struct bitmask *cpu = numa_allocate_cpumask(); > + bool ret = false; /* fall back to nocpus */ > unsigned int i; > > - if (cpu && !numa_node_to_cpus(node, cpu)) { > + BUG_ON(!cpu); > + if (!numa_node_to_cpus(node, cpu)) { > for (i = 0; i < cpu->size; i++) { > - if (numa_bitmask_isbitset(cpu, i)) > - return true; > + if (numa_bitmask_isbitset(cpu, i)) { > + ret = true; > + break; > + } > } > } > + numa_free_cpumask(cpu); > > - return false; /* lets fall back to nocpus safely */ > + return ret; > } > > static cpu_set_t bind_to_cpu(int target_cpu) > -- > 1.8.3.1 > Looks good to me. Reviewed-by: Srikar Dronamraju -- Thanks and Regards Srikar Dronamraju
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
Hi Andrew, Michal, David * Andrew Morton [2020-08-06 21:32:11]: > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju > wrote: > > > > The memory hotplug changes that somehow because you can hotremove numa > > > nodes and therefore make the nodemask sparse but that is not a common > > > case. I am not sure what would happen if a completely new node was added > > > and its corresponding node was already used by the renumbered one > > > though. It would likely conflate the two I am afraid. But I am not sure > > > this is really possible with x86 and a lack of a bug report would > > > suggest that nobody is doing that at least. > > > > > > > JFYI, > > Satheesh copied in this mailchain had opened a bug a year on crash with vcpu > > hotplug on memoryless node. > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202187 > > So... do we merge this patch or not? Seems that the overall view is > "risky but nobody is likely to do anything better any time soon"? Can we decide on this one way or the other? -- Thanks and Regards Srikar Dronamraju
[PATCH v5 02/10] powerpc/smp: Merge Power9 topology with Power topology
A new sched_domain_topology_level was added just for Power9. However, the same can be achieved by merging powerpc_topology with power9_topology, which makes the code simpler, especially when adding a new sched domain. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v1 -> v2: Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu) since cpu_smt_mask is only defined under CONFIG_SCHED_SMT arch/powerpc/kernel/smp.c | 25 +++-- 1 file changed, 3 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index edf94ca64eea..08da765b91f1 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier) } #ifdef CONFIG_SCHED_SMT -/* cpumask of CPUs with asymetric SMT dependancy */ +/* cpumask of CPUs with asymmetric SMT dependency */ static int powerpc_smt_flags(void) { int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; @@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void) } #endif -static struct sched_domain_topology_level powerpc_topology[] = { -#ifdef CONFIG_SCHED_SMT - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, -#endif - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, - { NULL, }, -}; - /* * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
* This topology makes it *much* cheaper to migrate tasks between adjacent cores @@ -1361,7 +1353,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu) } #endif -static struct sched_domain_topology_level power9_topology[] = { +static struct sched_domain_topology_level powerpc_topology[] = { #ifdef CONFIG_SCHED_SMT { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, #endif @@ -1386,21 +1378,10 @@ void __init smp_cpus_done(unsigned int max_cpus) #ifdef CONFIG_SCHED_SMT if (has_big_cores) { pr_info("Big cores detected but using small core scheduling\n"); - power9_topology[0].mask = smallcore_smt_mask; powerpc_topology[0].mask = smallcore_smt_mask; } #endif - /* -* If any CPU detects that it's sharing a cache with another CPU then -* use the deeper topology that is aware of this sharing. -*/ - if (shared_caches) { - pr_info("Using shared cache scheduler topology\n"); - set_sched_topology(power9_topology); - } else { - pr_info("Using standard scheduler topology\n"); - set_sched_topology(powerpc_topology); - } + set_sched_topology(powerpc_topology); } #ifdef CONFIG_HOTPLUG_CPU -- 2.18.2
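The design choice in this patch, a single topology table whose SMT mask function is overridden at boot rather than two near-identical tables, can be sketched as follows (types, names and mask encodings here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long (*mask_fn)(int cpu);

/* Illustrative masks: 4 threads per core; "small core" mode exposes
 * only every other thread as a sibling. */
static unsigned long smt_mask(int cpu)       { return 0xFUL << (cpu & ~3); }
static unsigned long smallcore_mask(int cpu) { return 0x5UL << (cpu & ~3); }

struct topo_level {
	mask_fn mask;
	const char *name;
};

static struct topo_level topology[] = {
	{ smt_mask, "SMT" },
	{ NULL, "DIE" },
};

/* Patch the one shared table in place instead of selecting between two
 * arrays at boot, mirroring the smallcore override in the patch. */
static void fixup_topology(int has_big_cores)
{
	if (has_big_cores)
		topology[0].mask = smallcore_mask;
}
```

Keeping one table means a future level (such as the MC/coregroup level added later in this series) only has to be inserted once.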
[PATCH v5 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
Fix a build warning seen when CONFIG_NEED_MULTIPLE_NODES is not set: "error: 'numa_cpu_lookup_table' undeclared" Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v2 -> v3: Removed node caching part. Rewrote the Commit msg (Michael Ellerman) Renamed to powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES arch/powerpc/kernel/smp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 73199470c265..edf94ca64eea 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -860,6 +860,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus) GFP_KERNEL, cpu_to_node(cpu)); zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu), GFP_KERNEL, cpu_to_node(cpu)); +#ifdef CONFIG_NEED_MULTIPLE_NODES /* * numa_node_id() works after this. */ @@ -868,6 +869,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus) set_cpu_numa_mem(cpu, local_memory_node(numa_cpu_lookup_table[cpu])); } +#endif } /* Init the cpumasks so the boot CPU is related to itself */ -- 2.18.2
[PATCH v5 03/10] powerpc/smp: Move powerpc_topology above
Move the powerpc_topology description above in the file. This helps in using functions defined within this file without needing forward declarations. No other functional changes. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- arch/powerpc/kernel/smp.c | 104 +++--- 1 file changed, 52 insertions(+), 52 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 08da765b91f1..39224a042468 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -818,6 +818,58 @@ static int init_cpu_l1_cache_map(int cpu) return err; } +static bool shared_caches; + +#ifdef CONFIG_SCHED_SMT +/* cpumask of CPUs with asymmetric SMT dependency */ +static int powerpc_smt_flags(void) +{ + int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; + + if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { + printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); + flags |= SD_ASYM_PACKING; + } + return flags; +} +#endif + +/* + * P9 has a slightly odd architecture where pairs of cores share an L2 cache. + * This topology makes it *much* cheaper to migrate tasks between adjacent cores + * since the migrated task remains cache hot. We want to take advantage of this + * at the scheduler level so an extra topology level is required. + */ +static int powerpc_shared_cache_flags(void) +{ + return SD_SHARE_PKG_RESOURCES; +} + +/* + * We can't just pass cpu_l2_cache_mask() directly because + * returns a non-const pointer and the compiler barfs on that.
+ */ +static const struct cpumask *shared_cache_mask(int cpu) +{ + return cpu_l2_cache_mask(cpu); +} + +#ifdef CONFIG_SCHED_SMT +static const struct cpumask *smallcore_smt_mask(int cpu) +{ + return cpu_smallcore_mask(cpu); +} +#endif + +static struct sched_domain_topology_level powerpc_topology[] = { +#ifdef CONFIG_SCHED_SMT + { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, +#endif + { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, + { cpu_cpu_mask, SD_INIT_NAME(DIE) }, + { NULL, }, +}; + static int init_big_cores(void) { int cpu; @@ -1247,8 +1299,6 @@ static void add_cpu_to_masks(int cpu) set_cpus_related(cpu, i, cpu_core_mask); } -static bool shared_caches; - /* Activate a secondary processor. */ void start_secondary(void *unused) { @@ -1312,56 +1362,6 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } -#ifdef CONFIG_SCHED_SMT -/* cpumask of CPUs with asymmetric SMT dependency */ -static int powerpc_smt_flags(void) -{ - int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; - - if (cpu_has_feature(CPU_FTR_ASYM_SMT)) { - printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n"); - flags |= SD_ASYM_PACKING; - } - return flags; -} -#endif - -/* - * P9 has a slightly odd architecture where pairs of cores share an L2 cache. - * This topology makes it *much* cheaper to migrate tasks between adjacent cores - * since the migrated task remains cache hot. We want to take advantage of this - * at the scheduler level so an extra topology level is required. - */ -static int powerpc_shared_cache_flags(void) -{ - return SD_SHARE_PKG_RESOURCES; -} - -/* - * We can't just pass cpu_l2_cache_mask() directly because - * returns a non-const pointer and the compiler barfs on that. 
- */ -static const struct cpumask *shared_cache_mask(int cpu) -{ - return cpu_l2_cache_mask(cpu); -} - -#ifdef CONFIG_SCHED_SMT -static const struct cpumask *smallcore_smt_mask(int cpu) -{ - return cpu_smallcore_mask(cpu); -} -#endif - -static struct sched_domain_topology_level powerpc_topology[] = { -#ifdef CONFIG_SCHED_SMT - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, -#endif - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, - { NULL, }, -}; - void __init smp_cpus_done(unsigned int max_cpus) { /* -- 2.18.2
[PATCH v5 00/10] Coregroup support on Powerpc
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 ,,,0055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ,,,00ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 ,,,00aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ,,,00ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After patchset --

$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
MC
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 ,,,0055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 ,,,00ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain4 ,,, 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 ,,,00aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Nick Piggin
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Anton Blanchard
Cc: Gautham R Shenoy
Cc: Vaidyanathan Srinivasan
Cc: Jordan Niethe

Srikar Dronamraju (10):
  powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
  powerpc/smp: Merge Power9 topology with Power topology
  powerpc/smp: Move powerpc_topology above
  powerpc/smp: Move topology fixups into a new function
  powerpc/smp: Dont assume l2-cache to be superset of sibling
  powerpc/smp: Optimize start_secondary
  powerpc/numa: Detect support for coregroup
  powerpc/smp: Allocate cpumask only after searching thread group
  powerpc/smp: Create coregroup domain
  powerpc/smp: Implement cpu_to_coregroup_id

 arch/powerpc/include/asm/smp.h      |   1 +
 arch/powerpc/include/asm/topology.h |  10 ++
 arch/powerpc/kernel/smp.c           | 235 +---
 arch/powerpc/mm/numa.c              |  59 +--
 4 files changed, 198 insertions(+), 107 deletions(-)

--
2.18.2
[PATCH v5 09/10] powerpc/smp: Create coregroup domain
Add percpu coregroup maps and masks to create coregroup domain. If a coregroup doesn't exist, the coregroup domain will be degenerated in favour of SMT/CACHE domain. Do note this patch is only creating stubs for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation would be in a subsequent patch. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R. Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v4 ->v5: Updated commit msg to specify actual implementation of cpu_to_coregroup_id is in a subsequent patch (Michael Ellerman) Changelog v3 ->v4: if coregroup_support doesn't exist, update MC mask to the next smaller domain mask. Changelog v2 -> v3: Add optimization for mask updation under coregroup_support Changelog v1 -> v2: Moved coregroup topology fixup to fixup_topology (Gautham) arch/powerpc/include/asm/topology.h | 10 ++ arch/powerpc/kernel/smp.c | 54 - arch/powerpc/mm/numa.c | 5 +++ 3 files changed, 68 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index f0b6300e7dd3..6609174918ab 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc) #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) extern int find_and_online_cpu_nid(int cpu); +extern int cpu_to_coregroup_id(int cpu); #else static inline int find_and_online_cpu_nid(int cpu) { return 0; } +static inline int cpu_to_coregroup_id(int cpu) +{ +#ifdef CONFIG_SMP + return cpu_to_core_id(cpu); +#else + return 0; +#endif +} + #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */ #include diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 
0536ac06876b..566e3accac3e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -80,12 +80,22 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map); DEFINE_PER_CPU(cpumask_var_t, cpu_core_map); +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map); EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map); EXPORT_PER_CPU_SYMBOL(cpu_core_map); EXPORT_SYMBOL_GPL(has_big_cores); +enum { +#ifdef CONFIG_SCHED_SMT + smt_idx, +#endif + cache_idx, + mc_idx, + die_idx, +}; + #define MAX_THREAD_LIST_SIZE 8 #define THREAD_GROUP_SHARE_L1 1 struct thread_groups { @@ -861,11 +871,27 @@ static const struct cpumask *smallcore_smt_mask(int cpu) } #endif +static struct cpumask *cpu_coregroup_mask(int cpu) +{ + return per_cpu(cpu_coregroup_map, cpu); +} + +static bool has_coregroup_support(void) +{ + return coregroup_enabled; +} + +static const struct cpumask *cpu_mc_mask(int cpu) +{ + return cpu_coregroup_mask(cpu); +} + static struct sched_domain_topology_level powerpc_topology[] = { #ifdef CONFIG_SCHED_SMT { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, #endif { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, + { cpu_mc_mask, SD_INIT_NAME(MC) }, { cpu_cpu_mask, SD_INIT_NAME(DIE) }, { NULL, }, }; @@ -912,6 +938,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus) GFP_KERNEL, cpu_to_node(cpu)); zalloc_cpumask_var_node(_cpu(cpu_core_map, cpu), GFP_KERNEL, cpu_to_node(cpu)); + if (has_coregroup_support()) + zalloc_cpumask_var_node(_cpu(cpu_coregroup_map, cpu), + GFP_KERNEL, cpu_to_node(cpu)); + #ifdef CONFIG_NEED_MULTIPLE_NODES /* * numa_node_id() works after this. 
@@ -929,6 +959,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid)); cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid)); + if (has_coregroup_support()) + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid)); + init_big_cores(); if (has_big_cores) { cpumask_set_cpu(boot_cpuid, @@ -1220,6 +1253,8 @@ static void remove_cpu_from_masks(int cpu) set_cpus_unrelated(cpu, i, cpu_sibling_mask); if (has_big_cores) set_cpus_unrelated(cpu, i, cpu_smallcore_mask); + if (has_coregroup_support()) + set_cpus_unrelate
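The changelog above notes that when no coregroup exists, the MC domain is degenerated in favour of the SMT/CACHE domain. The rule the scheduler applies can be illustrated with a small standalone model. This is a hypothetical userspace sketch, not kernel code: cpumasks are modeled as 64-bit words and the level names and spans are made up; it only demonstrates that a topology level whose span matches the level below it is redundant and dropped.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* One topology level as seen from a single CPU; the span is the set of
 * CPUs the level covers, modeled here as a 64-bit mask. */
struct topo_level {
	const char *name;
	uint64_t span;
};

/* Keep a level only when it covers more CPUs than the level below it;
 * a level whose span equals its child's adds nothing and is dropped. */
size_t degenerate(const struct topo_level *in, size_t n, const char **out)
{
	size_t kept = 0;
	uint64_t prev = 0;

	for (size_t i = 0; i < n; i++) {
		if (in[i].span != prev)
			out[kept++] = in[i].name;
		prev = in[i].span;
	}
	return kept;
}
```

With a distinct coregroup mask all four levels survive; when the coregroup mask collapses to the cache mask, the MC level disappears.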
[PATCH v5 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
Current code assumes that cpumask of cpus sharing a l2-cache mask will always be a superset of cpu_sibling_mask. Lets stop that assumption. cpu_l2_cache_mask is a superset of cpu_sibling_mask if and only if shared_caches is set. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v1 -> v2: Set cpumask after verifying l2-cache. (Gautham) arch/powerpc/kernel/smp.c | 28 +++- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index b13161a5ffc3..0c960ce3be42 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1188,6 +1188,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int)) if (!l2_cache) return false; + cpumask_set_cpu(cpu, mask_fn(cpu)); for_each_cpu(i, cpu_online_mask) { /* * when updating the marks the current CPU has not been marked @@ -1270,29 +1271,30 @@ static void add_cpu_to_masks(int cpu) * add it to it's own thread sibling mask. */ cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); for (i = first_thread; i < first_thread + threads_per_core; i++) if (cpu_online(i)) set_cpus_related(i, cpu, cpu_sibling_mask); add_cpu_to_smallcore_masks(cpu); - /* -* Copy the thread sibling mask into the cache sibling mask -* and mark any CPUs that share an L2 with this CPU. -*/ - for_each_cpu(i, cpu_sibling_mask(cpu)) - set_cpus_related(cpu, i, cpu_l2_cache_mask); update_mask_by_l2(cpu, cpu_l2_cache_mask); - /* -* Copy the cache sibling mask into core sibling mask and mark -* any CPUs on the same chip as this CPU. 
-*/ - for_each_cpu(i, cpu_l2_cache_mask(cpu)) - set_cpus_related(cpu, i, cpu_core_mask); + if (pkg_id == -1) { + struct cpumask *(*mask)(int) = cpu_sibling_mask; + + /* +* Copy the sibling mask into core sibling mask and +* mark any CPUs on the same chip as this CPU. +*/ + if (shared_caches) + mask = cpu_l2_cache_mask; + + for_each_cpu(i, mask(cpu)) + set_cpus_related(cpu, i, cpu_core_mask); - if (pkg_id == -1) return; + } for_each_cpu(i, cpu_online_mask) if (get_physical_package_id(i) == pkg_id) -- 2.18.2
[PATCH v5 08/10] powerpc/smp: Allocate cpumask only after searching thread group
If allocated earlier and the search fails, then cpu_l1_cache_map cpumask is unnecessarily cleared. However cpu_l1_cache_map can be allocated / cleared after we search thread group. Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask allocated by zalloc_cpumask_var_node is never freed. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v4 ->v5: Updated commit msg on why cpumask need not be freed. (Michael Ellerman) arch/powerpc/kernel/smp.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 7403fdcf3821..0536ac06876b 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -789,10 +789,6 @@ static int init_cpu_l1_cache_map(int cpu) if (err) goto out; - zalloc_cpumask_var_node(_cpu(cpu_l1_cache_map, cpu), - GFP_KERNEL, - cpu_to_node(cpu)); - cpu_group_start = get_cpu_thread_group_start(cpu, ); if (unlikely(cpu_group_start == -1)) { @@ -801,6 +797,9 @@ static int init_cpu_l1_cache_map(int cpu) goto out; } + zalloc_cpumask_var_node(_cpu(cpu_l1_cache_map, cpu), + GFP_KERNEL, cpu_to_node(cpu)); + for (i = first_thread; i < first_thread + threads_per_core; i++) { int i_group_start = get_cpu_thread_group_start(i, ); -- 2.18.2
[PATCH v5 07/10] powerpc/numa: Detect support for coregroup
Add support for grouping cores based on the device-tree classification.

- The last domain in the associativity domains always refers to the core.
- If the primary reference domain happens to be the penultimate domain in
  the associativity-domains device-tree property, then there are no
  coregroups. However, if it's not the penultimate domain, then there are
  coregroups. There can be more than one coregroup. For now we would be
  interested in the last or the smallest coregroup, i.e. one sub-group
  per DIE.

Currently no firmware exposes this grouping. Hence allow the basis for
grouping to be abstract. Once the firmware starts using this grouping,
code would be added to detect the type of grouping and adjust the sd
domain flags accordingly.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Jordan Niethe
Cc: Vaidyanathan Srinivasan
Reviewed-by: Gautham R.
Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v4->v5: Updated commit msg with current abstract nature of the coregroups (Michael Ellerman) Changelog v1 -> v2: Explained Coregroup in commit msg (Michael Ellerman) arch/powerpc/include/asm/smp.h | 1 + arch/powerpc/kernel/smp.c | 1 + arch/powerpc/mm/numa.c | 34 +- 3 files changed, 23 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 49a25e2400f2..5bdc17a7049f 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -28,6 +28,7 @@ extern int boot_cpuid; extern int spinning_secondaries; extern u32 *cpu_to_phys_id; +extern bool coregroup_enabled; extern void cpu_die(void); extern int cpu_to_chip_id(int cpu); diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 91cf5d05e7ec..7403fdcf3821 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 }; struct task_struct *secondary_current; bool has_big_cores; +bool coregroup_enabled; DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 2298899a0f0a..51cb672f113b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn) static void __init find_possible_nodes(void) { struct device_node *rtas; - u32 numnodes, i; + const __be32 *domains; + int prop_length, max_nodes; + u32 i; if (!numa_enabled) return; @@ -895,25 +897,31 @@ static void __init find_possible_nodes(void) if (!rtas) return; - if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains", - min_common_depth, )) { - /* -* ibm,current-associativity-domains is a fairly recent -* property. If it doesn't exist, then fallback on -* ibm,max-associativity-domains. 
Current denotes what the -* platform can support compared to max which denotes what the -* Hypervisor can support. -*/ - if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains", - min_common_depth, )) + /* +* ibm,current-associativity-domains is a fairly recent property. If +* it doesn't exist, then fallback on ibm,max-associativity-domains. +* Current denotes what the platform can support compared to max +* which denotes what the Hypervisor can support. +*/ + domains = of_get_property(rtas, "ibm,current-associativity-domains", + _length); + if (!domains) { + domains = of_get_property(rtas, "ibm,max-associativity-domains", + _length); + if (!domains) goto out; } - for (i = 0; i < numnodes; i++) { + max_nodes = of_read_number([min_common_depth], 1); + for (i = 0; i < max_nodes; i++) { if (!node_possible(i)) node_set(i, node_possible_map); } + prop_length /= sizeof(int); + if (prop_length > min_common_depth + 2) + coregroup_enabled = 1; + out: of_node_put(rtas); } -- 2.18.2
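The detection rule in the diff above boils down to a length check on the associativity-domains property: entries up to min_common_depth lead to the primary (NUMA) domain, the last entry identifies the core itself, and only when the property extends past that penultimate slot is there room for a coregroup level. A minimal standalone sketch of that rule (illustrative, not the kernel code, which works on big-endian device-tree cells):

```c
#include <assert.h>
#include <stdbool.h>

/* Coregroups exist only if the primary reference domain is NOT the
 * penultimate entry, i.e. the property holds more entries than
 * min_common_depth + 2 (primary domain index + core entry). */
bool has_coregroups(int nr_domain_entries, int min_common_depth)
{
	return nr_domain_entries > min_common_depth + 2;
}
```

For example, with min_common_depth of 4, a property with 7 entries leaves one slot between the primary domain and the core, so a coregroup exists; with 6 entries it does not.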
[PATCH v5 06/10] powerpc/smp: Optimize start_secondary
In start_secondary, even if shared_cache was already set, system does a redundant match for cpumask. This redundant check can be removed by checking if shared_cache is already set. While here, localize the sibling_mask variable to within the if condition. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Signed-off-by: Srikar Dronamraju --- Changelog v4 ->v5: Retain cache domain, no need for generalization (Michael Ellerman, Peter Zijlstra, Valentin Schneider, Gautham R. Shenoy) Changelog v1 -> v2: Moved shared_cache topology fixup to fixup_topology (Gautham) arch/powerpc/kernel/smp.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 0c960ce3be42..91cf5d05e7ec 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -851,7 +851,7 @@ static int powerpc_shared_cache_flags(void) */ static const struct cpumask *shared_cache_mask(int cpu) { - return cpu_l2_cache_mask(cpu); + return per_cpu(cpu_l2_cache_map, cpu); } #ifdef CONFIG_SCHED_SMT @@ -1305,7 +1305,6 @@ static void add_cpu_to_masks(int cpu) void start_secondary(void *unused) { unsigned int cpu = smp_processor_id(); - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; mmgrab(_mm); current->active_mm = _mm; @@ -1331,14 +1330,20 @@ void start_secondary(void *unused) /* Update topology CPU masks */ add_cpu_to_masks(cpu); - if (has_big_cores) - sibling_mask = cpu_smallcore_mask; /* * Check for any shared caches. Note that this must be done on a * per-core basis because one core in the pair might be disabled. 
*/ - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu))) - shared_caches = true; + if (!shared_caches) { + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; + struct cpumask *mask = cpu_l2_cache_mask(cpu); + + if (has_big_cores) + sibling_mask = cpu_smallcore_mask; + + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))) + shared_caches = true; + } set_numa_node(numa_cpu_lookup_table[cpu]); set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); -- 2.18.2
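The patch above switches the shared-caches test from a mask-equality check to a weight comparison, which stays correct once the L2 mask is no longer guaranteed to be a superset of the sibling mask. A standalone sketch of the comparison, with cpumasks modeled as 64-bit words (names and values here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* cpumask_weight() stand-in: population count of a 64-bit mask */
int mask_weight(uint64_t mask)
{
	int w = 0;

	while (mask) {
		mask &= mask - 1;	/* clear lowest set bit */
		w++;
	}
	return w;
}

/* A cache is shared beyond the core when the L2 mask spans more CPUs
 * than the sibling (SMT) mask. Comparing weights, rather than testing
 * equality, does not rely on one mask containing the other. */
bool detect_shared_caches(uint64_t l2_mask, uint64_t sibling_mask)
{
	return mask_weight(l2_mask) > mask_weight(sibling_mask);
}
```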
[PATCH v5 10/10] powerpc/smp: Implement cpu_to_coregroup_id
Lookup the coregroup id from the associativity array. If unable to
detect the coregroup id, fall back on the core id. This way, ensure the
sched_domain degenerates and an extra sched domain is not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However, if it's implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then the kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Nicholas Piggin
Cc: Anton Blanchard
Cc: Oliver O'Halloran
Cc: Nathan Lynch
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Jordan Niethe
Cc: Vaidyanathan Srinivasan
Reviewed-by: Gautham R. Shenoy
Signed-off-by: Srikar Dronamraju
---
Changelog v1 -> v2:
	Move coregroup_enabled before getting associativity (Gautham)

 arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0d57779e7942..8b3b3ec7fcc4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
 
 int cpu_to_coregroup_id(int cpu)
 {
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+	int index;
+
+	if (cpu < 0 || cpu > nr_cpu_ids)
+		return -1;
+
+	if (!coregroup_enabled)
+		goto out;
+
+	if (!firmware_has_feature(FW_FEATURE_VPHN))
+		goto out;
+
+	if (vphn_get_associativity(cpu, associativity))
+		goto out;
+
+	index = of_read_number(associativity, 1);
+	if (index > min_common_depth + 1)
+		return of_read_number(&associativity[index - 1], 1);
+
+out:
 	return cpu_to_core_id(cpu);
 }

--
2.18.2
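The lookup logic can be modeled in a few lines of standalone C. This is a simplified, hypothetical model (plain ints instead of big-endian `__be32` cells, and no VPHN call): the first element holds the number of entries that follow, and when the array reaches past the primary (NUMA) domain, the penultimate entry is the coregroup id; otherwise we fall back to the core id.

```c
#include <assert.h>

/* associativity[0] = number of entries that follow; the final entry
 * identifies the core, so the coregroup id (when present) sits in the
 * penultimate slot, associativity[index - 1]. */
int coregroup_id(const int *associativity, int min_common_depth,
		 int fallback_core_id)
{
	int index = associativity[0];

	if (index > min_common_depth + 1)
		return associativity[index - 1];
	return fallback_core_id;
}
```

With six entries and min_common_depth of 4, the penultimate entry is returned; bump min_common_depth to 5 and the fallback core id is used, letting the extra sched domain degenerate.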
[PATCH v5 04/10] powerpc/smp: Move topology fixups into a new function
Move topology fixup based on the platform attributes into its own function which is called just before set_sched_topology. Cc: linuxppc-dev Cc: LKML Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Anton Blanchard Cc: Oliver O'Halloran Cc: Nathan Lynch Cc: Michael Neuling Cc: Gautham R Shenoy Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Valentin Schneider Cc: Jordan Niethe Cc: Vaidyanathan Srinivasan Reviewed-by: Gautham R. Shenoy Signed-off-by: Srikar Dronamraju --- Changelog v2 -> v3: Rewrote changelog (Gautham) Renamed to powerpc/smp: Move topology fixups into a new function arch/powerpc/kernel/smp.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 39224a042468..b13161a5ffc3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1362,6 +1362,16 @@ int setup_profiling_timer(unsigned int multiplier) return 0; } +static void fixup_topology(void) +{ +#ifdef CONFIG_SCHED_SMT + if (has_big_cores) { + pr_info("Big cores detected but using small core scheduling\n"); + powerpc_topology[0].mask = smallcore_smt_mask; + } +#endif +} + void __init smp_cpus_done(unsigned int max_cpus) { /* @@ -1375,12 +1385,7 @@ void __init smp_cpus_done(unsigned int max_cpus) dump_numa_cpu_topology(); -#ifdef CONFIG_SCHED_SMT - if (has_big_cores) { - pr_info("Big cores detected but using small core scheduling\n"); - powerpc_topology[0].mask = smallcore_smt_mask; - } -#endif + fixup_topology(); set_sched_topology(powerpc_topology); } -- 2.18.2
Re: [PATCH v2] sched/fair: ignore cache hotness for SMT migration
* Josh Don [2020-08-04 12:34:13]:

> SMT siblings share caches, so cache hotness should be irrelevant for
> cross-sibling migration.
>
> Proposed-by: Venkatesh Pallipadi
> Signed-off-by: Josh Don
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a0536add..abdb54e2339f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7402,6 +7402,10 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>  	if (unlikely(task_has_idle_policy(p)))
>  		return 0;
>
> +	/* SMT siblings share cache */
> +	if (env->sd->flags & SD_SHARE_CPUCAPACITY)
> +		return 0;
> +

If this is for retaining cache hotness, should we look at
SD_SHARE_PKG_RESOURCES instead of SD_SHARE_CPUCAPACITY?

>  	/*
>  	 * Buddy candidates are cache hot:
>  	 */
> --
> 2.28.0.163.g6104cc2f0b6-goog

--
Thanks and Regards
Srikar Dronamraju
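The short-circuit being proposed can be sketched as a standalone model. The flag values and the recency check below are illustrative stand-ins, not the kernel's actual constants or task_hot() logic: within a capacity-sharing (SMT) domain the siblings share every cache level, so a migrating task can never be "hot" relative to a sibling, and the hotness check is skipped entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative flag values; the kernel's differ */
#define SD_SHARE_CPUCAPACITY	0x1	/* SMT siblings of one core */
#define SD_SHARE_PKG_RESOURCES	0x2	/* CPUs sharing the LLC */

/* Returns true when migrating the task away would lose cache warmth. */
bool task_hot_model(int sd_flags, long ns_since_last_ran,
		    long migration_cost_ns)
{
	if (sd_flags & SD_SHARE_CPUCAPACITY)
		return false;	/* siblings share all cache levels */
	/* stand-in for the kernel's recency/buddy checks */
	return ns_since_last_ran < migration_cost_ns;
}
```

Srikar's question amounts to: should the early-out test `SD_SHARE_PKG_RESOURCES` (same LLC) rather than `SD_SHARE_CPUCAPACITY` (same core), widening the set of domains where hotness is ignored.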
[PATCH v2 2/2] powerpc/topology: Override cpu_smt_mask
	1182315.7	60018.733

 1018433 - 1047037 : ##      3% (3)
 1047037 - 1075640 :         4% (4)
 1075640 - 1104244 :         4% (4)
 1104244 - 1132848 : ###     7% (7)
 1132848 - 1161452 :        17% (17)
 1161452 - 1190055 : ##     12% (12)
 1190055 - 1218659 : #      21% (21)
 1218659 - 1247263 : ##     23% (23)
 1247263 - 1275866 :         4% (4)
 1275866 - 1304470 :         4% (4)

with patch
    N       Min      Max       Median     Avg        Stddev
  100    967014    1292938    1208819    1185281.8   69815.851

  967014 -  999606 : ##      1% (1)
  999606 - 1032199 : ##      1% (1)
 1032199 - 1064791 :         6% (6)
 1064791 - 1097384 : ##      5% (5)
 1097384 - 1129976 : ##      9% (9)
 1129976 - 1162568 :        10% (10)
 1162568 - 1195161 : ##     13% (13)
 1195161 - 1227753 :        22% (22)
 1227753 - 1260346 : ##     25% (25)
 1260346 - 1292938 : ##      7% (7)

Observations: Not much changes, ebizzy is not much impacted.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
	Modified commit msg as per mailing list discussion.
Added performance numbers arch/powerpc/include/asm/cputhreads.h | 1 - arch/powerpc/include/asm/smp.h| 13 + 2 files changed, 13 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h index deb99fd6e060..98c8bd155bf9 100644 --- a/arch/powerpc/include/asm/cputhreads.h +++ b/arch/powerpc/include/asm/cputhreads.h @@ -23,7 +23,6 @@ extern int threads_per_core; extern int threads_per_subcore; extern int threads_shift; -extern bool has_big_cores; extern cpumask_t threads_core_mask; #else #define threads_per_core 1 diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index 9cd0765633c5..bb06aa875131 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -131,6 +131,19 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu) extern int cpu_to_core_id(int cpu); +extern bool has_big_cores; + +#define cpu_smt_mask cpu_smt_mask +#ifdef CONFIG_SCHED_SMT +static inline const struct cpumask *cpu_smt_mask(int cpu) +{ + if (has_big_cores) + return per_cpu(cpu_smallcore_map, cpu); + + return per_cpu(cpu_sibling_map, cpu); +} +#endif /* CONFIG_SCHED_SMT */ + /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers. * * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up -- 2.18.2
[PATCH v2 1/2] sched/topology: Allow archs to override cpu_smt_mask
cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
most architectures. One of the users of cpu_smt_mask() would be to
identify idle-cores.

On Power9, a pair of SMT4 cores can be presented by the firmware as an
SMT8 core for backward compatibility reasons. Powerpc allows LPARs to
be live migrated from Power8 to Power9. Do note Power8 had only SMT8
cores. Existing software which has been developed/configured for Power8
would expect to see an SMT8 core. Maintaining the illusion of an SMT8
core is a requirement to make that work.

In order to maintain the above userspace backward compatibility with
previous versions of the processor, from Power9 onwards there is an
option for the firmware to advertise a pair of SMT4 cores as a fused
core, aka an SMT8 core. On Power9 this pair shares the L2 cache as
well. However, from the scheduler's point of view, a core should be
determined by SMT4, since it's a completely independent unit of
compute. Hence allow the PowerPC architecture to override the default
cpu_smt_mask() to point to the SMT4 cores in SMT8 mode. This will
ensure the scheduler is always given the right information.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Srikar Dronamraju
---
Changelog v1->v2:
	Update the commit msg based on the discussion in community esp with
	Peter Zijlstra and Michael Ellerman.
 include/linux/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 608fa4aadf0e..ad03df1cc266 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -198,7 +198,7 @@ static inline int cpu_to_mem(int cpu)
 #define topology_die_cpumask(cpu)	cpumask_of(cpu)
 #endif
 
-#ifdef CONFIG_SCHED_SMT
+#if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask)
 static inline const struct cpumask *cpu_smt_mask(int cpu)
 {
 	return topology_sibling_cpumask(cpu);

--
2.18.2
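The override mechanism in this patch is the classic header trick: the arch defines the function and then a macro with the same name, so the generic header's `!defined(cpu_smt_mask)` guard compiles out the default. A compilable miniature of the pattern, with the cpumask reduced to a plain int (the "mask" here is just the first CPU of the group, an illustrative stand-in):

```c
#include <assert.h>

/* "arch" side: provide the function, then define a same-named macro so
 * the generic header can detect that an override exists. */
int cpu_smt_mask(int cpu)
{
	return cpu & ~3;	/* first CPU of the SMT4 group */
}
#define cpu_smt_mask cpu_smt_mask

/* "generic" side: compiled out here because the macro is defined. */
#if !defined(cpu_smt_mask)
int cpu_smt_mask(int cpu)
{
	return cpu & ~7;	/* default: first CPU of the SMT8 group */
}
#endif
```

Because an object-like macro that expands to its own name is not re-expanded, every caller of `cpu_smt_mask()` resolves to the arch version while the preprocessor guard in the generic header still sees the symbol as defined.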
Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
* pet...@infradead.org [2020-08-06 10:54:29]: > On Thu, Aug 06, 2020 at 03:32:25PM +1000, Michael Ellerman wrote: > > > That brings with it a bunch of problems, such as existing software that > > has been developed/configured for Power8 and expects to see SMT8. > > > > We also allow LPARs to be live migrated from Power8 to Power9 (and back), so > > maintaining the illusion of SMT8 is considered a requirement to make that > > work. > > So how does that work if the kernel booted on P9 and demuxed the SMT8 > into 2xSMT4? If you migrate that state onto a P8 with actual SMT8 you're > toast again. > To add to what Michael already said, the reason we don't expose the demux of SMT8 into 2xSMT4 to userspace, is to make the userspace believe they are on a SMT8. When the kernel is live migrated from P8 to P9, till the time of reboot they would only have the older P8 topology. After reboot the kernel topology would change, but the userspace is made to believe that they are running on SMT8 core by way of keeping the sibling_cpumask at SMT8 core level. -- Thanks and Regards Srikar Dronamraju
Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
* pet...@infradead.org [2020-08-06 15:15:47]:

> > But my understanding is most LPARs don't get migrated back and forth,
> > they'll start life on a P8 and only get migrated to a P9 once when the
> > customer gets a P9. They might then run for a long time (months to
> > years) on the P9 in P8 compat mode, not because they ever want to
> > migrate back to a real P8, but because the software in the LPAR is still
> > expecting to be on a P8.
> >
> > I'm not a real expert on all the Enterprisey stuff though, so someone
> > else might be able to give us a better picture.
> >
> > But the point of mentioning the migration stuff was mainly just to
> > explain why we feel we need to present SMT8 to userspace even on P9.
>
> OK, fair enough. The patch wasn't particularly onerous, I was just
> wondering why etc..
>
> The case of starting on a P8 and being migrated to a P9 makes sense to
> me; in that case you'd like to rebuild your sched domains, but can't go
> about changing user visible topology information.
>
> I suppose:
>
> Acked-by: Peter Zijlstra (Intel)
>
> An updated Changelog that recaps some of this discussion might also be
> nice.

Okay, will surely do the needful.

--
Thanks and Regards
Srikar Dronamraju
Re: [PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
* pet...@infradead.org [2020-08-04 12:45:20]:

> On Tue, Aug 04, 2020 at 09:03:06AM +0530, Srikar Dronamraju wrote:
> > cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
> > most architectures. One of the users of cpu_smt_mask(), would be to
> > identify idle-cores. On Power9, a pair of cores can be presented by the
> > firmware as a big-core for backward compatibility reasons.
> >
> > In order to maintain userspace backward compatibility with previous
> > versions of processor, (since Power8 had SMT8 cores), Power9 onwards there
> > is option to the firmware to advertise a pair of SMT4 cores as a fused
> > cores (referred to as the big_core mode in the Linux Kernel). On Power9
> > this pair shares the L2 cache as well. However, from the scheduler's point
> > of view, a core should be determined by SMT4. The load-balancer already
> > does this. Hence allow PowerPc architecture to override the default
> > cpu_smt_mask() to point to the SMT4 cores in a big_core mode.
>
> I'm utterly confused.
>
> Why can't you set your regular siblings mask to the smt4 thing? Who
> cares about the compat stuff, I thought that was an LPAR/AIX thing.

There are no technical challenges in setting the sibling mask to SMT4.
This is for Linux running on PowerVM, where these Power9 boxes are
sold/marketed as X-core boxes (where X stands for the number of SMT8
cores). Since in the PowerVM world everything is in SMT8 mode, the
device-tree properties still mark the system as running 8-thread cores.
There are a number of utilities like ppc64_cpu that directly read from
the device-tree. They would get a core count and thread count which are
SMT8 based. If the sibling_mask is set to the small core, then the same
user, when looking at the output of lscpu and other utilities that read
sysfs, will start seeing 2x the number of cores compared to what he had
provisioned and what the device-tree-based utilities show. This can
confuse users.

So to keep the device-tree properties, utilities depending on the
device-tree, sysfs and utilities depending on sysfs on the same page,
userspace is only exposed to SMT8.

--
Thanks and Regards
Srikar Dronamraju
[PATCH 2/2] powerpc/topology: Override cpu_smt_mask
On Power9, a pair of cores can be presented by the firmware as a big-core for backward compatibility reasons, with 4 threads per (small) core and 8 threads per big-core. cpu_smt_mask() should generally point to the cpumask of the (small) core. However, in order to maintain userspace backward compatibility (with Power8 chips in the case of Power9) in enterprise Linux systems, topology_sibling_cpumask has to be set to the big-core. Hence override the default cpu_smt_mask() with a powerpc-specific version, allowing for better scheduling behaviour on Power.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Signed-off-by: Srikar Dronamraju
---
 arch/powerpc/include/asm/cputhreads.h |  1 -
 arch/powerpc/include/asm/smp.h        | 13 +++++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
index deb99fd6e060..98c8bd155bf9 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -23,7 +23,6 @@
 extern int threads_per_core;
 extern int threads_per_subcore;
 extern int threads_shift;
-extern bool has_big_cores;
 extern cpumask_t threads_core_mask;
 #else
 #define threads_per_core	1
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 9cd0765633c5..d4bc28accb28 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -131,6 +131,19 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu)
 
 extern int cpu_to_core_id(int cpu);
 
+extern bool has_big_cores;
+
+#define cpu_smt_mask cpu_smt_mask
+#ifdef CONFIG_SCHED_SMT
+static inline const struct cpumask *cpu_smt_mask(int cpu)
+{
+	if (has_big_cores)
+		return per_cpu(cpu_smallcore_map, cpu);
+
+	return per_cpu(cpu_sibling_map, cpu);
+}
+#endif /* CONFIG_SCHED_SMT */
+
 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
-- 
2.18.2
[PATCH 1/2] sched/topology: Allow archs to override cpu_smt_mask
cpu_smt_mask tracks topology_sibling_cpumask. This is fine for most architectures. One of the uses of cpu_smt_mask() is to identify idle cores. On Power9, a pair of cores can be presented by the firmware as a big-core for backward compatibility reasons. In order to maintain userspace backward compatibility with previous versions of the processor (Power8 had SMT8 cores), from Power9 onwards the firmware has the option to advertise a pair of SMT4 cores as a fused core (referred to as big_core mode in the Linux kernel). On Power9 this pair also shares the L2 cache. However, from the scheduler's point of view, a core should be the SMT4 small core, and the load-balancer already treats it that way. Hence allow the powerpc architecture to override the default cpu_smt_mask() to point to the SMT4 core in big_core mode.

Cc: linuxppc-dev
Cc: LKML
Cc: Michael Ellerman
Cc: Michael Neuling
Cc: Gautham R Shenoy
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Dietmar Eggemann
Cc: Mel Gorman
Cc: Vincent Guittot
Cc: Vaidyanathan Srinivasan
Signed-off-by: Srikar Dronamraju
---
 include/linux/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 608fa4aadf0e..ad03df1cc266 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -198,7 +198,7 @@ static inline int cpu_to_mem(int cpu)
 #define topology_die_cpumask(cpu)	cpumask_of(cpu)
 #endif
 
-#ifdef CONFIG_SCHED_SMT
+#if defined(CONFIG_SCHED_SMT) && !defined(cpu_smt_mask)
 static inline const struct cpumask *cpu_smt_mask(int cpu)
 {
 	return topology_sibling_cpumask(cpu);
-- 
2.18.2
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
> > Also in the current P9 itself, two neighbouring core-pairs form a quad.
> > Cache latency within a quad is better than latency to a distant core-pair.
> > Cache latency within a core-pair is way better than latency within a quad.
> > So if we have only 4 threads running on a DIE, all of them accessing the
> > same cache-lines, then we could probably benefit if all the tasks were to
> > run within the quad aka MC/Coregroup.
> >
> 
> Did you test this? WRT load balance we do try to balance "load" over the
> different domain spans, so if you represent quads as their own MC domain,
> you would AFAICT end up spreading tasks over the quads (rather than packing
> them) when balancing at e.g. DIE level. The desired behaviour might be
> hackable with some more ASYM_PACKING, but I'm not sure I should be
> suggesting that :-)

Agree, load balance will try to spread the load across the quads. In my
hack, I was explicitly marking QUAD domains as !SD_PREFER_SIBLING, plus
relaxing a few load-spreading rules when SD_PREFER_SIBLING was not set. And
this was on a slightly older kernel (without Vincent's recent load-balance
overhaul).

> > I have found some benchmarks which are latency sensitive and benefit from
> > grouping at the quad level (using kernel hacks and not backed by firmware
> > changes). Gautham also found similar results in his experiments, but he
> > only used binding within the stock kernel.
> >
> 
> IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
> binding thing.
> 
> That's also where things get interesting (for me) because I experienced
> something similar on another arm64 platform (ThunderX1). This was more
> about cache bandwidth than cache latency, but IMO it's in the same bag of
> fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
> gave up on it given the TX1 was the only (arm64) platform where I could get
> both significant and reproducible results.
> 
> Now, if you folks are seeing this on completely different hardware and have
> "real" workloads that truly benefit from this kind of domain partitioning,
> this might be another incentive to try and sort of generalize this. That's
> outside the scope of your series, but your findings give me some hope!
> 
> I think what I had in mind back then was that if enough folks cared about
> it, we might get some bits added to the ACPI spec; something along the
> lines of proximity domains for the caches described in PPTT, IOW a cache
> distance matrix. I don't really know what it'll take to get there, but I
> figured I'd dump this in case someone's listening :-)

Very interesting.

> > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as the
> > MC domain need not be the LLC domain for Power.
> 
> From what I understood your MC domain does seem to map to LLC; but in any
> case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
> your changes your sd_llc is gonna be SMT, and that's not going to be a very
> big mask. IMO you do want to correctly reflect your LLC situation via this
> flag to make cpus_share_cache() work properly.

I detect whether the LLC is shared at BIGCORE; if it is, I dynamically
rename the domain to CACHE and enable SD_SHARE_PKG_RESOURCES in that
domain.

> [1]: https://linuxplumbersconf.org/event/4/contributions/484/

Thanks for the pointer.

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
* Michael Ellerman [2020-07-31 18:02:21]:

> Srikar Dronamraju writes:
> > Lookup the coregroup id from the associativity array.

Thanks Michael for all your comments and inputs.

> It's slightly strange that this is called in patch 9, but only properly
> implemented here in patch 10.
> 
> I'm not saying you have to squash them together, but it would be good if
> the change log for patch 9 mentioned that a subsequent commit will
> complete the implementation and how that affects the behaviour.

I was probably influenced by a few LKML community members who always add a
stub and implement the gory details in a subsequent patch. I will add a
note to the change log of patch 9 about the subsequent patches.

> cheers

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
* Michael Ellerman [2020-07-31 17:52:15]:

> Srikar Dronamraju writes:
> > If allocated earlier and the search fails, then the cpumask needs to be
> > freed. However, cpu_l1_cache_map can be allocated after we search the
> > thread group.
> 
> It's not freed anywhere AFAICS?

Yes, it's never freed. In fact we never even check whether
zalloc_cpumask_var_node fails. It's not just this cpumask: historically,
none of the other cpumasks in arch/powerpc/kernel/smp.c are freed or
checked either.

I did dig into this a bit and it appears that... (please do correct me if I
am wrong!!)

Powerpc uses cpumask_var_t for all of these percpu variables, and it
doesn't seem to enable CONFIG_CPUMASK_OFFSTACK even in the MAXSMP config.
So from include/linux/cpumask.h:

	typedef struct cpumask cpumask_var_t[1];

and zalloc_cpumask_var_node ends up being cpumask_clear.

So historically we seem to have assumed we are always !CPUMASK_OFFSTACK and
hence don't need to check the return value or free the mask. I would look
forward to your comments on how we should handle this going forward, but I
would keep this the same for this patchset.

One of the questions that I have: if we are most likely always in
!CONFIG_CPUMASK_OFFSTACK, should we migrate to cpumask_t for the percpu
variables? The reason being we already end up with a full NR_CPUS-bit
cpumask for each percpu cpumask variable rather than just a pointer to one.

> And even after this change there's still an error path that doesn't free
> it, isn't there?
> 
> cheers
> 
> > Cc: linuxppc-dev
> > Cc: LKML
> > Cc: Michael Ellerman
> > Cc: Nicholas Piggin
> > Cc: Anton Blanchard
> > Cc: Oliver O'Halloran
> > Cc: Nathan Lynch
> > Cc: Michael Neuling
> > Cc: Gautham R Shenoy
> > Cc: Ingo Molnar
> > Cc: Peter Zijlstra
> > Cc: Valentin Schneider
> > Cc: Jordan Niethe
> > Reviewed-by: Gautham R. Shenoy
> > Signed-off-by: Srikar Dronamraju
> > ---
> >  arch/powerpc/kernel/smp.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 698000c7f76f..dab96a1203ec 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
> >  	if (err)
> >  		goto out;
> >  
> > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -				GFP_KERNEL,
> > -				cpu_to_node(cpu));
> > -
> >  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> >  
> >  	if (unlikely(cpu_group_start == -1)) {
> > @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		goto out;
> >  	}
> >  
> > +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > +				GFP_KERNEL, cpu_to_node(cpu));
> > +
> >  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> >  		int i_group_start = get_cpu_thread_group_start(i, &tg);
> > 
> > -- 
> > 2.17.1

-- 
Thanks and Regards
Srikar Dronamraju