Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-19 Thread Gautham R Shenoy
Hello Mel,

On Mon, Apr 12, 2021 at 11:48:19AM +0100, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 11:06:19AM +0100, Valentin Schneider wrote:
> > On 12/04/21 10:37, Mel Gorman wrote:
> > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
> > >> * Gautham R. Shenoy  [2021-04-02 11:07:54]:
> > >>
> > >> >
> > >> > To remedy this, this patch proposes that the LLC be moved to the MC
> > >> > level which is a group of cores in one half of the chip.
> > >> >
> > >> >   SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
> > >> >
> > >>
> > >> I think marking Hemisphere as a LLC in a P10 scenario is a good idea.
> > >>
> > >> > While there is no cache being shared at this level, this is still the
> > >> > level where some amount of cache-snooping takes place and it is
> > >> > relatively faster to access the data from the caches of the cores
> > >> > within this domain. With this change, we no longer see regressions on
> > >> > P10 for applications which require single threaded performance.
> > >>
> > >> Peter, Valentin, Vincent, Mel, etal
> > >>
> > >> On architectures where we have multiple levels of cache access latencies
> > >> within a DIE, (For example: one within the current LLC or SMT core and 
> > >> the
> > >> other at MC or Hemisphere, and finally across hemispheres), do you have 
> > >> any
> > >> suggestions on how we could handle the same in the core scheduler?
> > >>
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> > > allows within the node with the LLC CPUs masked out.
> > 
> > I think that's actually a decent approach. Tying SD_SHARE_PKG_RESOURCES to
> > something other than pure cache topology in a generic manner is tough (as
> > it relies on murky, ill-defined hardware fabric properties).
> > 
> 
> Agreed. The LLC->node scan idea has been on my TODO list to try for
> a while.

If you have any patches for these, I will be happy to test them on
POWER10. Though, on POWER10, there will be an additional sd between
the LLC and the DIE domain. 

> 
> > Last I tried thinking about that, I stopped at having a core-to-core
> > latency matrix, building domains off of that, and having some knob
> > specifying the highest distance value below which we'd set
> > SD_SHARE_PKG_RESOURCES. There's a few things I 'hate' about that; for one
> > it makes cpus_share_cache() somewhat questionable.
> > 
> 
> And I thought about something like this too but worried it might get
> complex, particularly on chiplets where we do not necessarily have
> hardware info on latency depending on how it's wired up. It also might
> lead to excessive cpumask manipulation in a fast path if we have to
> traverse multiple distances with search cost exceeding gains from latency
> reduction. Hence -- keeping it simple with two level only, LLC then node
> within the allowed search depth and see what that gets us. It might be
> "good enough" in most cases and would be a basis for comparison against
> complex approaches.
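
For illustration, Valentin's matrix idea boils down to something like
the following; core_latency[][], NR_CORES and the cutoff knob are all
invented for this sketch and exist nowhere in the kernel:

static unsigned int core_latency[NR_CORES][NR_CORES];	/* ns, probed at boot */
static unsigned int llc_latency_cutoff = 100;		/* the proposed knob */

static bool cores_would_share_pkg_resources(int c1, int c2)
{
	/* Two cores land in one SD_SHARE_PKG_RESOURCES domain iff "close". */
	return core_latency[c1][c2] <= llc_latency_cutoff;
}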


> 
> At minimum, I expect IBM can evaluate the POWER10 aspect and I can run
> an evaluation on Zen generations.


> 
> -- 
> Mel Gorman
> SUSE Labs


Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-14 Thread Gautham R Shenoy
On Mon, Apr 12, 2021 at 06:33:55PM +0200, Michal Suchánek wrote:
> On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > > Peter, Valentin, Vincent, Mel, etal
> > > > >
> > > > > On architectures where we have multiple levels of cache access 
> > > > > latencies
> > > > > within a DIE, (For example: one within the current LLC or SMT core 
> > > > > and the
> > > > > other at MC or Hemisphere, and finally across hemispheres), do you 
> > > > > have any
> > > > > suggestions on how we could handle the same in the core scheduler?
> > >
> > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > > only rely on cache
> > >
> >
> > From topology.c
> >
> > SD_SHARE_PKG_RESOURCES - describes shared caches
> >
> > I'm guessing here because I am not familiar with power10 but the central
> > problem appears to be when to prefer selecting a CPU sharing L2 or L3
> > cache and the core assumes the last-level-cache is the only relevant one.
> 
> It does not seem to be the case according to original description:
> 
>  When the scheduler tries to wakeup a task, it chooses between the
>  waker-CPU and the wakee's previous-CPU. Suppose this choice is called
>  the "target", then in the target's LLC domain, the scheduler
>  
>  a) tries to find an idle core in the LLC. This helps exploit the
> This is the same as (b). Should this be SMT? ^^^

On POWER10, without this patch, the LLC is the SMT sched-domain. The
difference between a) and b) is that a) searches for an idle core,
while b) searches for an idle CPU.
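
In other words, a core counts as idle only when every one of its SMT
siblings is idle. A sketch, close in spirit to the kernel's
select_idle_core() but not a copy of it:

static bool core_is_idle(int core)
{
	int sibling;

	/* An idle core: all SMT siblings idle. An idle CPU: one idle thread. */
	for_each_cpu(sibling, cpu_smt_mask(core)) {
		if (!available_idle_cpu(sibling))
			return false;
	}
	return true;
}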


> SMT folding that the wakee task can benefit from. If an idle
> core is found, the wakee is woken up on it.
>  
>  b) Failing to find an idle core, the scheduler tries to find an idle
> CPU in the LLC. This helps minimise the wakeup latency for the
> wakee since it gets to run on the CPU immediately.
>  
>  c) Failing this, it will wake it up on target CPU.
>  
>  Thus, with P9-sched topology, since the CACHE domain comprises two
>  SMT4 cores, there is a decent chance that we get an idle core, failing
>  which there is a relatively higher probability of finding an idle CPU
>  among the 8 threads in the domain.
>  
>  However, in P10-sched topology, since the SMT domain is the LLC and it
>  contains only a single SMT4 core, the probability that we find that
>  core to be idle is less. Furthermore, since there are only 4 CPUs to
>  search for an idle CPU, there is lower probability that we can get an
>  idle CPU to wake up the task on.
> 
> >
> > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> > unintended consequences for load balancing because load within a die may
> > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> > the MC level.
> 
> Not spreading load between SMT4 domains within MC is exactly what setting LLC
> at MC level would address, wouldn't it?
>
> As in on P10 we have two relevant levels but the topology as is describes only
> one, and moving the LLC level lower gives two levels the scheduler looks at
> again. Or am I missing something?

This is my current understanding as well: with this patch we would
be able to move tasks quickly between the SMT4 cores, perhaps at the
expense of losing out on cache-affinity, which is why it would be
good to verify this using a test/benchmark.


> 
> Thanks
> 
> Michal
> 

--
Thanks and Regards
gautham.


Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-14 Thread Gautham R Shenoy
Hello Mel,

On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and 
> > > > the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have 
> > > > any
> > > > suggestions on how we could handle the same in the core scheduler?
> > 
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> > 
> 
> From topology.c
> 
>   SD_SHARE_PKG_RESOURCES - describes shared caches
>

Yes, I was aware that this flag describes shared caches, but this
patch was the simplest way to achieve the desired effect, even though
the cores in the MC domain on POWER10 do not share a cache. However,
it is relatively faster to transfer data between the cores within the
MC domain than to the cores outside the MC domain in the die.
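
For context, the flag is also what the scheduler uses to derive its
notion of the LLC; abridged from update_top_cache_domain() in
kernel/sched/topology.c:

static void update_top_cache_domain(int cpu)
{
	struct sched_domain *sd;

	/* The LLC is the highest domain carrying SD_SHARE_PKG_RESOURCES... */
	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	/* ...and cpus_share_cache()/select_idle_sibling() key off sd_llc. */
}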


> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
>

On POWER we have traditionally preferred to keep the LLC at the
sched-domain comprising groups of CPUs that share the L2 (since the
L3 is a victim cache on POWER).

On POWER9, the L2 was shared by the threads of a pair of SMT4 cores,
while on POWER10, L2 is shared by threads of a single SMT4 core.

Thus, the current task wake-up logic has a lower probability of
finding an idle core inside the LLC, since there is only one core to
search in the LLC. This is why moving the LLC to the parent domain
(MC), a group of SMT4 cores among which cache-snooping is faster,
helps workloads that require single-threaded performance.


> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.


Since we are adding SD_SHARE_PKG_RESOURCES to the parent of the only
sched-domain (the SMT4 domain) which currently has this flag set,
would it cause issues in spreading the load between the SMT4 domains?

Are there any tests/benchmarks that can help bring this out? It could
be good to understand this.

> 
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> > 
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re-adding another layer in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_branches.
> > 
> 
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.
> 
> But still, your comment reminded me that different architectures have
> different requirements
> 
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
>   spillover to L3 when selecting an idle CPU.
>

Indeed, so on POWER10, the preference would be
1) idle core in the L2 domain.
2) idle core in the MC domain.
3) idle CPU  in the L2 domain
4) idle CPU  in the MC domain.

This patch is able to achieve this *implicitly* because of the way the
select_idle_cpu() and the select_idle_core() is currently coded, where
in the presence of idle cores in the MC level, the select_idle_core()
searches for the idle core starting with the core of the target-CPU.

If I understand your proposal correctly, it would be to make this an
explicit two-level search: first search in the LLC domain and, failing
that, carry on the search in the rest of the die (assuming that the
LLC is smaller than the die).
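
Spelled out as code, the preference order would be something like the
sketch below; all four helpers are hypothetical:

static int p10_preferred_idle(int target)
{
	int cpu;

	cpu = find_idle_core(l2_mask(target));		/* 1) idle core in L2 */
	if (cpu < 0)
		cpu = find_idle_core(mc_mask(target));	/* 2) idle core in MC */
	if (cpu < 0)
		cpu = find_idle_cpu(l2_mask(target));	/* 3) idle CPU in L2  */
	if (cpu < 0)
		cpu = find_idle_cpu(mc_mask(target));	/* 4) idle CPU in MC  */
	return cpu >= 0 ? cpu : target;
}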


> X86 varies, it might want the Power10 approach for some families and prefer
>   L3 spilling over to a CPU on the same node in others.
> 
> S390 cares about something called books and drawers although I've no
>   idea what it means as such and whether it has any preferences on
>   search order.
> 
> ARM has similar requirements again according to "scheduler: expose the
>   topology of clusters and add cluster scheduler" and that one *does*
>   add another domain.
> 
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
> 
> The only 

Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-02 Thread Gautham R Shenoy
(Missed cc'ing Peter in the original posting)

On Fri, Apr 02, 2021 at 11:07:54AM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> On POWER10 systems, the L2 cache is at the SMT4 small core level. The
> following commits ensure that L2 cache gets correctly discovered and
> the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.
> 
> 790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
> 1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
> fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
>  it generic
> 538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
> 0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 
> cache
> 
> However, with the LLC now on the SMT sched-domain, we are seeing some
> regressions in the performance of applications that require
> single-threaded performance. The reason for this is as follows:
> 
> Prior to the change (we call this P9-sched below), the sched-domain
> hierarchy was:
> 
> SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE
> 
> where the CACHE sched-domain is defined to be the Last Level Cache (LLC).
> 
> On the upstream kernel, with the aforementioned commits (P10-sched),
> the sched-domain hierarchy is:
> 
> SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE
> 
> with the SMT sched-domain as the LLC.
> 
> When the scheduler tries to wakeup a task, it chooses between the
> waker-CPU and the wakee's previous-CPU. Suppose this choice is called
> the "target", then in the target's LLC domain, the scheduler
> 
> a) tries to find an idle core in the LLC. This helps exploit the
>SMT folding that the wakee task can benefit from. If an idle
>core is found, the wakee is woken up on it.
> 
> b) Failing to find an idle core, the scheduler tries to find an idle
>CPU in the LLC. This helps minimise the wakeup latency for the
>wakee since it gets to run on the CPU immediately.
> 
> c) Failing this, it will wake it up on target CPU.
> 
> Thus, with P9-sched topology, since the CACHE domain comprises two
> SMT4 cores, there is a decent chance that we get an idle core, failing
> which there is a relatively higher probability of finding an idle CPU
> among the 8 threads in the domain.
> 
> However, in P10-sched topology, since the SMT domain is the LLC and it
> contains only a single SMT4 core, the probability that we find that
> core to be idle is less. Furthermore, since there are only 4 CPUs to
> search for an idle CPU, there is lower probability that we can get an
> idle CPU to wake up the task on.
> 
> Thus applications which require single threaded performance will end
> up getting woken up on a potentially busy core, even though there are
> idle cores in the system.
> 
> To remedy this, this patch proposes that the LLC be moved to the MC
> level which is a group of cores in one half of the chip.
> 
>   SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
> 
> While there is no cache being shared at this level, this is still the
> level where some amount of cache-snooping takes place and it is
> relatively faster to access the data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single threaded performance.
> 
> The patch also improves the tail latencies on schbench and the
> usecs/op on "perf bench sched pipe"
> 
> On a 10 core P10 system with 80 CPUs,
> 
> schbench
> 
> (https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)
> 
> Values : Lower the better.
> 99th percentile is the tail latency.
> 
> 
> 99th percentile
> ~~~~~~~~~~~~~~~
> No. messenger
> threads    5.12-rc4    5.12-rc4
>            P10-sched   MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1           70 us       85 us
> 2           81 us      101 us
> 3           92 us      107 us
> 4           96 us      110 us
> 5          103 us      123 us
> 6         3412 us      122 us
> 7         1490 us      136 us
> 8         6200 us     3572 us
> 
> 
> Hackbench
> 
> (perf bench sched pipe)
> values: lower the better
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. of
> parallel
> instances   5.12-rc4      5.12-rc4
>             P10-sched     MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1           24.04 us/op   18.72 us/op
> 2           24.04 us/op   18.65 us/op
> 4           24.01 us/op   18.76 us/op
> 8           24.10 us/op   19.11 us/op

[RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-02 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER10 systems, the L2 cache is at the SMT4 small core level. The
following commits ensure that L2 cache gets correctly discovered and
the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.

790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:

Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:

  SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE

where the CACHE sched-domain is defined to be the Last Level Cache (LLC).

On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:

  SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE

with the SMT sched-domain as the LLC.

When the scheduler tries to wakeup a task, it chooses between the
waker-CPU and the wakee's previous-CPU. Suppose this choice is called
the "target", then in the target's LLC domain, the scheduler

a) tries to find an idle core in the LLC. This helps exploit the
   SMT folding that the wakee task can benefit from. If an idle
   core is found, the wakee is woken up on it.

b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee since it gets to run on the CPU immediately.

c) Failing this, it will wake it up on target CPU.
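
The three steps above, as a sketch: the real logic lives in
select_idle_sibling()/select_idle_cpu() in kernel/sched/fair.c and is
considerably more involved, and the helpers here are illustrative only.

static int wakeup_target(struct task_struct *p, int target)
{
	int cpu;

	cpu = find_idle_core_in_llc(p, target);		/* (a) SMT folding    */
	if (cpu < 0)
		cpu = find_idle_cpu_in_llc(p, target);	/* (b) wakeup latency */
	if (cpu < 0)
		cpu = target;				/* (c) fall back      */
	return cpu;
}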

Thus, with P9-sched topology, since the CACHE domain comprises two
SMT4 cores, there is a decent chance that we get an idle core, failing
which there is a relatively higher probability of finding an idle CPU
among the 8 threads in the domain.

However, in P10-sched topology, since the SMT domain is the LLC and it
contains only a single SMT4 core, the probability that we find that
core to be idle is less. Furthermore, since there are only 4 CPUs to
search for an idle CPU, there is lower probability that we can get an
idle CPU to wake up the task on.

Thus applications which require single threaded performance will end
up getting woken up on a potentially busy core, even though there are
idle cores in the system.

To remedy this, this patch proposes that the LLC be moved to the MC
level which is a group of cores in one half of the chip.

  SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE

While there is no cache being shared at this level, this is still the
level where some amount of cache-snooping takes place and it is
relatively faster to access the data from the caches of the cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single threaded performance.

The patch also improves the tail latencies on schbench and the
usecs/op on "perf bench sched pipe"

On a 10 core P10 system with 80 CPUs,

schbench

(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)

Values : Lower the better.
99th percentile is the tail latency.


99th percentile
~~~~~~~~~~~~~~~
No. messenger
threads    5.12-rc4    5.12-rc4
           P10-sched   MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1           70 us       85 us
2           81 us      101 us
3           92 us      107 us
4           96 us      110 us
5          103 us      123 us
6         3412 us      122 us
7         1490 us      136 us
8         6200 us     3572 us


Hackbench

(perf bench sched pipe)
values: lower the better

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. of
parallel
instances   5.12-rc4      5.12-rc4
            P10-sched     MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1           24.04 us/op   18.72 us/op
2           24.04 us/op   18.65 us/op
4           24.01 us/op   18.76 us/op
8           24.10 us/op   19.11 us/op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993

Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-20 Thread Gautham R Shenoy


On Wed, Jan 20, 2021 at 09:54:20AM +, Mel Gorman wrote:
> On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> > On Wed, 20 Jan 2021 at 10:12, Mel Gorman  
> > wrote:
> > >
> > > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct 
> > > > > *p, struct sched_domain *sd, int t
> > > > > }
> > > > >
> > > > > for_each_cpu_wrap(cpu, cpus, target) {
> > > > > -   if (!--nr)
> > > > > -   return -1;
> > > > > -   if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > > -   break;
> > > > > +   if (smt) {
> > > > > +   i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > > +   if ((unsigned int)i < nr_cpumask_bits)
> > > > > +   return i;
> > > > > +
> > > > > +   } else {
> > > > > +   if (!--nr)
> > > > > +   return -1;
> > > > > +   i = __select_idle_cpu(cpu);
> > > > > +   if ((unsigned int)i < nr_cpumask_bits) {
> > > > > +   idle_cpu = i;
> > > > > +   break;
> > > > > +   }
> > > > > +   }
> > > > > }
> > > > >
> > > > > -   if (sched_feat(SIS_PROP)) {
> > > > > +   if (smt)
> > > > > +   set_idle_cores(this, false);
> > > >
> > > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > > core in the LLC ?
> > > >
> > >
> > > That would involve rechecking the cpumask bits that have not been
> > > scanned to see if any of them are an idle core. As the existance of idle
> > > cores can change very rapidly, it's not worth the cost.
> > 
> > But don't we reach this point only if we scanned all CPUs and didn't
> > find an idle core ?

Indeed, I missed the part where we return as soon as we find an idle
core in the for_each_cpu_wrap() loop above. So here we clear
"has_idle_cores" when there are no longer any idle cores. Sorry for
the noise!


> 
> Yes, but my understanding of Gauthams suggestion was to check if an
> idle core found was the last idle core available and set has_idle_cores
> to false in that case.

That would have been nice, but since we do not keep a count of idle
cores, it is probably not worth the effort as you note below.

> I think this would be relatively expensive and
> possibly futile as returning the last idle core for this wakeup does not
> mean there will be no idle core on the next wakeup as other cores may go
> idle between wakeups.
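
For reference, the "set" side of this hint, abridged from
__update_idle_core() in kernel/sched/fair.c (the real function also
rechecks the flag under rcu_read_lock()): when a CPU enters idle and
all of its SMT siblings are already idle, the LLC is marked as having
idle cores; the "clear" side is the set_idle_cores(this, false) above,
after a full scan found no idle core.

void __update_idle_core(struct rq *rq)
{
	int core = cpu_of(rq);
	int cpu;

	for_each_cpu(cpu, cpu_smt_mask(core)) {
		if (cpu != core && !available_idle_cpu(cpu))
			return;		/* a sibling is still busy */
	}
	set_idle_cores(core, true);	/* whole core is now idle */
}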


> 
> -- 
> Mel Gorman
> SUSE Labs

--
Thanks and Regards
gautham.


Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()

2021-01-20 Thread Gautham R Shenoy
Hello Mel, Peter,

On Tue, Jan 19, 2021 at 11:22:11AM +, Mel Gorman wrote:
> From: Peter Zijlstra (Intel) 
> 
> Both select_idle_core() and select_idle_cpu() do a loop over the same
> cpumask. Observe that by clearing the already visited CPUs, we can
> fold the iteration and iterate a core at a time.
> 
> All we need to do is remember any non-idle CPU we encountered while
> scanning for an idle core. This way we'll only iterate every CPU once.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Mel Gorman 
> ---
>  kernel/sched/fair.c | 101 ++--
>  1 file changed, 61 insertions(+), 40 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12e08da90024..822851b39b65 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[..snip..]


> @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, int t
>   }
> 
>   for_each_cpu_wrap(cpu, cpus, target) {
> - if (!--nr)
> - return -1;
> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> - break;
> + if (smt) {
> + i = select_idle_core(p, cpu, cpus, &idle_cpu);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> +
> + } else {
> + if (!--nr)
> + return -1;
> + i = __select_idle_cpu(cpu);
> + if ((unsigned int)i < nr_cpumask_bits) {
> + idle_cpu = i;
> + break;
> + }
> + }
>   }
> 
> - if (sched_feat(SIS_PROP)) {
> + if (smt)
> + set_idle_cores(this, false);

Shouldn't we set_idle_cores(false) only if this was the last idle
core in the LLC ? 

--
Thanks and Regards
gautham.


[PATCH 1/2] powerpc/cacheinfo: Lookup cache by dt node and thread-group id

2021-01-18 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is device-tree node
corresponding to that cache. On some of the POWER server platforms
thread-groups within the core share different sets of caches (Eg: On
SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms, there is a single device-tree node corresponding to that
cache and the cache-configuration within the threads of the core is
indicated via "ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementioned systems, cacheinfo code still treats
all the threads in the core to be sharing the cache because of the
single device-tree node (In the earlier example, the cacheinfo code
would say CPUs 0-7 share L1 cache).

In this patch, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We indexe the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPUX, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
of cache which are not represented in "ibm,thread-groups" property,
the thread-group id is -1.
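
A sketch of that keying rule, close in spirit to the patch but
simplified; it assumes the per-cpu thread_group_*_cache_map masks
introduced by the earlier series:

static int cache_group_id(int cpu, int level)
{
	/* Group id: first CPU of the "ibm,thread-groups" group holding @cpu. */
	if (level == 1)
		return cpumask_first(per_cpu(thread_group_l1_cache_map, cpu));
	if (level == 2)
		return cpumask_first(per_cpu(thread_group_l2_cache_map, cpu));
	return -1;	/* level not described by "ibm,thread-groups" */
}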

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/smp.h  |  3 ++
 arch/powerpc/kernel/cacheinfo.c | 80 +
 2 files changed, 61 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index c4e2d53..39de24b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -32,6 +32,9 @@
 
 extern int cpu_to_chip_id(int cpu);
 
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 #ifdef CONFIG_SMP
 
 struct smp_ops_t {
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 6f903e9a..5a6925d 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -120,6 +120,7 @@ struct cache {
struct cpumask shared_cpu_map; /* online CPUs using this cache */
int type;  /* split cache disambiguation */
int level; /* level not explicit in device tree */
+	int group_id;		   /* id of the group of threads that share this cache */
struct list_head list; /* global list of cache objects */
struct cache *next_local;  /* next cache of >= level */
 };
@@ -142,22 +143,24 @@ static const char *cache_type_string(const struct cache *cache)
 }
 
 static void cache_init(struct cache *cache, int type, int level,
-  struct device_node *ofnode)
+  struct device_node *ofnode, int group_id)
 {
cache->type = type;
cache->level = level;
cache->ofnode = of_node_get(ofnode);
+   cache->group_id = group_id;
 	INIT_LIST_HEAD(&cache->list);
 	list_add(&cache->list, &cache_list);
 }
 
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
+static struct cache *new_cache(int type, int level,
+  struct device_node *ofnode, int group_id)
 {
struct cache *cache;
 
cache = kzalloc(sizeof(*cache), GFP_KERNEL);
if (cache)
-   cache_init(cache, type, level, ofnode);
+   cache_init(cache, type, level, ofnode, group_id);
 
return cache;
 }
@@ -309,20 +312,24 @@ static struct cache *cache_find_first_sibling(struct cache *cache)
return cache;
 
 	list_for_each_entry(iter, &cache_list, list)
-   if (iter->ofnode == cache->ofnode && iter->next_local == cache)
+   if (iter->ofnode == cache->ofnode &&
+   iter->group_id == cache->group_id &&
+   iter->next_local == cache)
return iter;
 
return cache;
 }
 
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
+/* return the first cache on a local list matching node and thread-group id */
+static struct cache *cache_lookup_by_node_group(const struct device_node *node,
+   int group_id)
 {
struct cache *cache = NULL;
struct cache *iter;
 
 	list_for_each_entry(iter, &cache_list, list) {
-   if (iter->ofnode != node)
+   if (iter->ofnode != node ||
+   iter->group_id != group_id)
continue;
cache = cache_find_first_sibling(iter);
break;
@@ -352,1

[PATCH 2/2] powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

2021-01-18 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The helper function get_shared_cpu_map() was added in

'commit 500fe5f550ec ("powerpc/cacheinfo: Report the correct
shared_cpu_map on big-cores")'

and subsequently expanded upon in

'commit 0be47634db0b ("powerpc/cacheinfo: Print correct cache-sibling
map/list for L2 cache")'

in order to help report the correct groups of threads sharing these caches
on big-core systems where groups of threads within a core can share
different sets of caches.

Now that powerpc/cacheinfo is aware of "ibm,thread-groups" property,
cache->shared_cpu_map contains the correct set of thread-siblings
sharing the cache. Hence we no longer need the function
get_shared_cpu_map(). This patch removes it. We also remove
the helper function index_dir_to_cpu() which was only called by
get_shared_cpu_map().

With these functions removed, we can still see the correct
cache-sibling map/list for L1 and L2 caches on systems with L1 and L2
caches distributed among groups of threads in a core.

With this patch, on a SMT8 POWER10 system where the L1 and L2 caches
are split between the two groups of threads in a core, for CPUs 8,9,
the L1-Data, L1-Instruction, L2, L3 cache CPU sibling list is as
follows:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/cacheinfo.c | 41 +----------------------------------------
 1 file changed, 1 insertion(+), 40 deletions(-)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 5a6925d..20d9169 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -675,45 +675,6 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
 static struct kobj_attribute cache_level_attr =
__ATTR(level, 0444, level_show, NULL);
 
-static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
-{
-	struct kobject *index_dir_kobj = &index->kobj;
-   struct kobject *cache_dir_kobj = index_dir_kobj->parent;
-   struct kobject *cpu_dev_kobj = cache_dir_kobj->parent;
-   struct device *dev = kobj_to_dev(cpu_dev_kobj);
-
-   return dev->id;
-}
-
-/*
- * On big-core systems, each core has two groups of CPUs each of which
- * has its own L1-cache. The thread-siblings which share l1-cache with
- * @cpu can be obtained via cpu_smallcore_mask().
- *
- * On some big-core systems, the L2 cache is shared only between some
- * groups of siblings. This is already parsed and encoded in
- * cpu_l2_cache_mask().
- *
- * TODO: cache_lookup_or_instantiate() needs to be made aware of the
- *   "ibm,thread-groups" property so that cache->shared_cpu_map
- *   reflects the correct siblings on platforms that have this
- *   device-tree property. This helper function is only a stop-gap
- *   solution so that we report the

[PATCH 0/2] powerpc/cacheinfo: Add "ibm,thread-groups" awareness

2021-01-18 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is device-tree node
corresponding to that cache. On some of the POWER server platforms
thread-groups within the core share different sets of caches (Eg: On
SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms, there is a single device-tree node corresponding to that
cache and the cache-configuration within the threads of the core is
indicated via "ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementioned systems, cacheinfo code still treats
all the threads in the core to be sharing the cache because of the
single device-tree node (In the earlier example, the cacheinfo code
would say CPUs 0-7 share L1 cache).

In this patch series, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We indexe the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPUX, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
of cache which are not represented in "ibm,thread-groups" property,
the thread-group id is -1.

We can now remove the helper functions get_shared_cpu_map() and
index_dir_to_cpu() since the cache->shared_cpu_map contains the
correct state of the thread-siblings sharing the cache.

This has been tested on a SMT8 POWER9 system where L1 cache is split
between groups of threads in the core and on an SMT8 POWER10 system
where L1 and L2 caches are split between groups of threads in a core.

With this patch series, on POWER10 SMT8 system, we see the following
reported via sysfs:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Gautham R. Shenoy (2):
  powerpc/cacheinfo: Lookup cache by dt node and thread-group id
  powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

 arch/powerpc/include/asm/smp.h  |   3 +
 arch/powerpc/kernel/cacheinfo.c | 121 
 2 files changed, 62 insertions(+), 62 deletions(-)

-- 
1.9.4



[PATCH v3 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map

2020-12-10 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On platforms which have the "ibm,thread-groups" property, the per-cpu
variable cpu_l1_cache_map keeps track of which group of threads
within the same core share the L1 cache, Instruction and Data flow.

This patch renames the variable to "thread_group_l1_cache_map" to make
it consistent with a subsequent patch which will introduce
thread_group_l2_cache_map.

This patch introduces no functional change.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 88d88ad..f3290d5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -116,10 +116,10 @@ struct thread_groups_list {
 
 static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
- * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
+ * On big-cores system, thread_group_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
  */
-DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
@@ -866,7 +866,7 @@ static struct thread_groups *__init get_thread_groups(int cpu,
return tg;
 }
 
-static int init_cpu_l1_cache_map(int cpu)
+static int init_thread_group_l1_cache_map(int cpu)
 
 {
int first_thread = cpu_first_thread_sibling(cpu);
@@ -885,7 +885,7 @@ static int init_cpu_l1_cache_map(int cpu)
return -ENODATA;
}
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+	zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));
 
for (i = first_thread; i < first_thread + threads_per_core; i++) {
@@ -897,7 +897,7 @@ static int init_cpu_l1_cache_map(int cpu)
}
 
if (i_group_start == cpu_group_start)
-   cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
+   cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, 
cpu));
}
 
return 0;
@@ -976,7 +976,7 @@ static int init_big_cores(void)
int cpu;
 
for_each_possible_cpu(cpu) {
-   int err = init_cpu_l1_cache_map(cpu);
+   int err = init_thread_group_l1_cache_map(cpu);
 
if (err)
return err;
@@ -1372,7 +1372,7 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 
cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu));
 
-   for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) {
+   for_each_cpu(i, per_cpu(thread_group_l1_cache_map, cpu)) {
if (cpu_online(i))
set_cpus_related(i, cpu, cpu_smallcore_mask);
}
-- 
1.9.4



[PATCH v3 0/5] Extend Parsing "ibm,thread-groups" for Shared-L2 information

2020-12-10 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This is the v3 of the patchset to extend parsing of "ibm,thread-groups" property
to discover the Shared-L2 cache information.

The previous versions can be found here:

v2 : 
https://lore.kernel.org/linuxppc-dev/1607533700-5546-1-git-send-email-...@linux.vnet.ibm.com/T/#m043ea15d3832658527fca94765202b9cbefd330d

v1 : 
https://lore.kernel.org/linuxppc-dev/1607057327-29822-1-git-send-email-...@linux.vnet.ibm.com/T/#m0fabffa1ea1a2807b362f25c849bb19415216520


Changes from v2-->v3:
 * Fixed the build errors reported by the Kernel Test Robot for Patches 4 and 5.

Changes from v1-->v2:
Incorporate the review comments from Srikar and
fix a build error on !PPC64 configs reported by the kernel bot.

 * Split Patch 1 into three patches
   * First patch ensure that parse_thread_groups() is made generic to
 support more than one property.
   * Second patch renames cpu_l1_cache_map as
 thread_group_l1_cache_map for consistency. No functional impact.
   * The third patch makes init_thread_group_l1_cache_map()
 generic. No functional impact.

* Patch 2 (Now patch 4): Incorporates the review comments from Srikar
   simplifying the changes to update_mask_by_l2()

* Patch 3 (Now patch 5): Fix build errors for 32-bit configs
   reported by the kernel build bot.

Description of the Patchset
===========================
The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads each. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" is indicative of
   the threads in the group sharing L1 cache, translation cache and
   Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
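
A compact sketch of this decomposition (the field and function names
here are invented for illustration; the real parser is
parse_thread_groups() in arch/powerpc/kernel/smp.c):

struct tg_record {
	unsigned int property, nr_groups, threads_per_group;
	const unsigned int *cpu_list;	/* nr_groups * threads_per_group ids */
};

static int decompose_thread_groups(const unsigned int *prop, int len,
				   struct tg_record *out, int max_records)
{
	int i = 0, n = 0;

	/* Walk consecutive [property, nr_groups, threads, cpu-list] records. */
	while (i + 3 <= len && n < max_records) {
		out[n].property          = prop[i];
		out[n].nr_groups         = prop[i + 1];
		out[n].threads_per_group = prop[i + 2];
		out[n].cpu_list          = &prop[i + 3];
		i += 3 + out[n].nr_groups * out[n].threads_per_group;
		n++;
	}
	return n;	/* number of records decoded */
}

On the example above, this yields two records: (1, 2, 4,
{8,10,12,14,9,11,13,15}) and (2, 2, 4, {8,10,12,14,9,11,13,15}).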
   
The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, currently on platforms where groups of threads share L2
cache, we incorrectly create an extra CACHE level sched-domain that
maps to all the threads of the core.

For example, if "ibm,thread-groups" is 
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

then, the sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire
core which is incorrect on such platforms.

This patchset remedies these issues by extending the parsing support
for "ibm,thread-groups" to discover information about multiple
properties being shared by the corresponding groups of threads. In
particular we can now detect if the groups of threads within a core
share the L2-cache. On such platf

[PATCH v3 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-10 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this and, when present,
populating the cpu_l2_cache_mask of every CPU with the core-siblings
which share the L2 cache with that CPU, as specified by the
"ibm,thread-groups" property array.

On a platform with the following "ibm,thread-groups" configuration
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.
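
The degeneration happens because a parent domain that spans exactly
the same CPUs as its child (and adds no behaviour) is collapsed at
build time; heavily abridged from sd_parent_degenerate() in
kernel/sched/topology.c, whose real test also compares domain flags:

static bool parent_degenerates(struct sched_domain *sd,
			       struct sched_domain *parent)
{
	/* The parent adds nothing if it covers exactly the child's span. */
	return cpumask_equal(sched_domain_span(sd), sched_domain_span(parent));
}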

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/smp.h |  2 ++
 arch/powerpc/kernel/smp.c  | 58 ++
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index b2035b2..035459c 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -134,6 +134,7 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu)
 extern int cpu_to_core_id(int cpu);
 
 extern bool has_big_cores;
+extern bool thread_group_shares_l2;
 
 #define cpu_smt_mask cpu_smt_mask
 #ifdef CONFIG_SCHED_SMT
@@ -187,6 +188,7 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 /* for UP */
 #define hard_smp_processor_id()	get_hard_smp_processor_id(0)
 #define smp_setup_cpu_maps()
+#define thread_group_shares_l2  0
 static inline void inhibit_secondary_onlining(void) {}
 static inline void uninhibit_secondary_onlining(void) {}
 static inline const struct cpumask *cpu_sibling_mask(int cpu)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9078b5b5..2b9b1bb 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@
 struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
+bool thread_group_shares_l2;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE   8
 #define THREAD_GROUP_SHARE_L1   1
+#define THREAD_GROUP_SHARE_L2   2
 struct thread_groups {
unsigned int property;
unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
 };
 
 /* Maximum number of properties that groups of threads within a core can share */
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2
 
 struct thread_groups_list {
unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
  */
 DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
 
@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  *
  * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, trans

[PATCH v3 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-10 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the ibm,thread-groups device-tree
property), we currently print the incorrect shared_cpu_map/list for
L2-cache in the sysfs.

This patch reports the correct shared_cpu_map/list on such platforms.

Example:
On a platform with "ibm,thread-groups" set to
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

This indicates that threads {0,2,4,6} in the core share the L2-cache
and threads {1,3,5,7} in the core share the L2 cache.

However, without the patch, the shared_cpu_map/list for L2 for CPUs 0,
1 is reported in the sysfs as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff

With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa
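
(As a sanity check, the maps decode back to the lists: CPUs {0,2,4,6}
set bits 0, 2, 4 and 6, i.e. 0b01010101 = 0x55, while CPUs {1,3,5,7}
set bits 1, 3, 5 and 7, i.e. 0b10101010 = 0xaa.)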

This patch also defines cpu_l2_cache_mask() for the !CONFIG_SMP case.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/smp.h  |  4 ++++
 arch/powerpc/kernel/cacheinfo.c | 30 ++++++++++++++++++++----------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 035459c..c4e2d53 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -201,6 +201,10 @@ static inline const struct cpumask *cpu_smallcore_mask(int 
cpu)
return cpumask_of(cpu);
 }
 
+static inline const struct cpumask *cpu_l2_cache_mask(int cpu)
+{
+   return cpumask_of(cpu);
+}
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 65ab9fc..6f903e9a 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -655,11 +655,27 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
  * On big-core systems, each core has two groups of CPUs each of which
  * has its own L1-cache. The thread-siblings which share l1-cache with
  * @cpu can be obtained via cpu_smallcore_mask().
+ *
+ * On some big-core systems, the L2 cache is shared only between some
+ * groups of siblings. This is already parsed and encoded in
+ * cpu_l2_cache_mask().
+ *
+ * TODO: cache_lookup_or_instantiate() needs to be made aware of the
+ *   "ibm,thread-groups" property so that cache->shared_cpu_map
+ *   reflects the correct siblings on platforms that have this
+ *   device-tree property. This helper function is only a stop-gap
+ *   solution so that we report the correct siblings to the
+ *   userspace via sysfs.
  */
-static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
+static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, struct cache *cache)
 {
-   if (cache->level == 1)
-   return cpu_smallcore_mask(cpu);
+   if (has_big_cores) {
+   int cpu = index_dir_to_cpu(index);
+   if (cache->level == 1)
+   return cpu_smallcore_mask(cpu);
+   if (cache->level == 2 && thread_group_shares_l2)
+   return cpu_l2_cache_mask(cpu);
+   }
 
 	return &cache->shared_cpu_map;
 }
@@ -670,17 +686,11 @@ static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *
struct cache_index_dir *index;
struct cache *cache;
const struct cpumask *mask;
-   int cpu;
 
index = kobj_to_cache_index_dir(k);
cache = index->cache;
 
-   if (has_big_cores) {
-   cpu = index_dir_to_cpu(index);
-   mask = get_big_core_shared_cpu_map(cpu, cache);
-   } else {
-		mask = &cache->shared_cpu_map;
-   }
+   mask = get_shared_cpu_map(index, cache);
 
return cpumap_print_to_pagebuf(list, buf, mask);
 }
-- 
1.9.4



[PATCH v3 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic

2020-12-10 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

init_thread_group_l1_cache_map() initializes the per-cpu cpumask
thread_group_l1_cache_map with the core-siblings which share L1 cache
with the CPU. Make this function generic to the cache-property (L1 or
L2) and update a suitable mask. This is a preparatory patch for the
next patch where we will introduce discovery of thread-groups that
share L2-cache.

No functional change.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f3290d5..9078b5b5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -866,15 +866,18 @@ static struct thread_groups *__init get_thread_groups(int cpu,
return tg;
 }
 
-static int init_thread_group_l1_cache_map(int cpu)
+static int __init init_thread_group_cache_map(int cpu, int cache_property)
 
 {
int first_thread = cpu_first_thread_sibling(cpu);
int i, cpu_group_start = -1, err = 0;
struct thread_groups *tg = NULL;
+   cpumask_var_t *mask;
 
-	tg = get_thread_groups(cpu, THREAD_GROUP_SHARE_L1,
-			       &err);
+   if (cache_property != THREAD_GROUP_SHARE_L1)
+   return -EINVAL;
+
+	tg = get_thread_groups(cpu, cache_property, &err);
if (!tg)
return err;
 
@@ -885,8 +888,8 @@ static int init_thread_group_l1_cache_map(int cpu)
return -ENODATA;
}
 
-	zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
-				GFP_KERNEL, cpu_to_node(cpu));
+	mask = &per_cpu(thread_group_l1_cache_map, cpu);
+   zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
for (i = first_thread; i < first_thread + threads_per_core; i++) {
int i_group_start = get_cpu_thread_group_start(i, tg);
@@ -897,7 +900,7 @@ static int init_thread_group_l1_cache_map(int cpu)
}
 
if (i_group_start == cpu_group_start)
-			cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
+   cpumask_set_cpu(i, *mask);
}
 
return 0;
@@ -976,7 +979,7 @@ static int init_big_cores(void)
int cpu;
 
for_each_possible_cpu(cpu) {
-   int err = init_thread_group_l1_cache_map(cpu);
+   int err = init_thread_group_cache_map(cpu, THREAD_GROUP_SHARE_L1);
 
if (err)
return err;
-- 
1.9.4



[PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-10 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that
   the threads in the group share the L1 cache, translation cache and
   Instruction and Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
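
To make the stride arithmetic concrete, the decomposition can be
reproduced with a small standalone C sketch (illustrative only, not
part of the patch; the array holds the example values above, and the
walk mirrors what the parsing code does):

#include <stdio.h>

int main(void)
{
	unsigned int tg[] = { 1, 2, 4, 8, 10, 12, 14, 9, 11, 13, 15,
			      2, 2, 4, 8, 10, 12, 14, 9, 11, 13, 15 };
	unsigned int n = sizeof(tg) / sizeof(tg[0]);
	unsigned int i = 0;

	while (i + 3 <= n) {
		unsigned int property = tg[i];
		unsigned int nr_groups = tg[i + 1];
		unsigned int threads_per_group = tg[i + 2];
		unsigned int g, t;

		printf("property %u: %u groups of %u threads\n",
		       property, nr_groups, threads_per_group);
		for (g = 0; g < nr_groups; g++) {
			printf("  group %u:", g);
			for (t = 0; t < threads_per_group; t++)
				printf(" %u",
				       tg[i + 3 + g * threads_per_group + t]);
			printf("\n");
		}
		/* the next property record starts after this record's threads */
		i += 3 + nr_groups * threads_per_group;
	}
	return 0;
}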

The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support on platforms which encode
information about multiple properties being shared by the
corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 174 ++
 1 file changed, 113 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..88d88ad 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
unsigned int thread_list[MAX_THREAD_LIST_SIZE];
 };
 
+/* Maximum number of properties that groups of threads within a core can share */
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+   unsigned int nr_properties;
+   struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
  * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
@@ -695,81 +704,98 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
 /*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *  property for the CPU device node @dn and stores
- *  the parsed output in the thread_groups
- *  structure @tg if the ibm,thread-groups[0]
- *  matches @property.
+ *  the parsed output in the thread_groups_list
+ *  structure @tglp.
  *
  * @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
  *  output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- *interested in.
  *
  * ibm,thread-groups[0..N-1] array defines which group of threads in
  * the CPU-device node can be grouped together based on the property.
  *
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
  * that the threads in the same group share L1, translation cache.
  *
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
 * property ibm,thread-groups[i].
  *
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
  * group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
  *
- * ibm,thread-groups[3..N-1] is the list of threads identified by
 * ibm,thread-groups[i+3..i+k+2] is the list of threads identified by
  * "ibm,ppc-interrupt-server#s" arranged as per their membership in
  * the grouping.
  *
- * Example: If ibm,thread-gro

[PATCH v2 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map

2020-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On platforms which have the "ibm,thread-groups" property, the per-cpu
variable cpu_l1_cache_map keeps track of which group of threads
within the same core share the L1 cache, Instruction and Data flow.

This patch renames the variable to "thread_group_l1_cache_map" to make
it consistent with a subsequent patch which will introduce
thread_group_l2_cache_map.

This patch introduces no functional change.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 88d88ad..f3290d5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -116,10 +116,10 @@ struct thread_groups_list {
 
 static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
- * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
+ * On big-cores system, thread_group_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
  */
-DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
@@ -866,7 +866,7 @@ static struct thread_groups *__init get_thread_groups(int 
cpu,
return tg;
 }
 
-static int init_cpu_l1_cache_map(int cpu)
+static int init_thread_group_l1_cache_map(int cpu)
 
 {
int first_thread = cpu_first_thread_sibling(cpu);
@@ -885,7 +885,7 @@ static int init_cpu_l1_cache_map(int cpu)
return -ENODATA;
}
 
-   zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+   zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));
 
for (i = first_thread; i < first_thread + threads_per_core; i++) {
@@ -897,7 +897,7 @@ static int init_cpu_l1_cache_map(int cpu)
}
 
if (i_group_start == cpu_group_start)
-   cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
+   cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
}
 
return 0;
@@ -976,7 +976,7 @@ static int init_big_cores(void)
int cpu;
 
for_each_possible_cpu(cpu) {
-   int err = init_cpu_l1_cache_map(cpu);
+   int err = init_thread_group_l1_cache_map(cpu);
 
if (err)
return err;
@@ -1372,7 +1372,7 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 
cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu));
 
-   for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) {
+   for_each_cpu(i, per_cpu(thread_group_l1_cache_map, cpu)) {
if (cpu_online(i))
set_cpus_related(i, cpu, cpu_smallcore_mask);
}
-- 
1.9.4



[PATCH v2 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the ibm,thread-groups device-tree
property), we currently print the incorrect shared_cpu_map/list for
L2-cache in the sysfs.

This patch reports the correct shared_cpu_map/list on such platforms.

Example:
On a platform with "ibm,thread-groups" set to
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

This indicates that threads {0,2,4,6} in the core share the L2-cache
and threads {1,3,5,7} in the core share the L2 cache.

However, without the patch, the shared_cpu_map/list for L2 for CPUs 0,
1 is reported in the sysfs as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff

With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa
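
As an aside, the masks above are plain bitmaps over the CPU ids; a
minimal standalone sketch (illustrative only) of how {0,2,4,6} and
{1,3,5,7} yield the 0x55 and 0xaa patterns:

#include <stdio.h>

int main(void)
{
	int even[] = { 0, 2, 4, 6 }, odd[] = { 1, 3, 5, 7 };
	unsigned long mask_even = 0, mask_odd = 0;
	int i;

	for (i = 0; i < 4; i++) {
		mask_even |= 1UL << even[i];	/* CPUs 0,2,4,6 -> 0x55 */
		mask_odd  |= 1UL << odd[i];	/* CPUs 1,3,5,7 -> 0xaa */
	}
	printf("%#lx %#lx\n", mask_even, mask_odd);
	return 0;
}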

This patch also adds CONFIG_PPC64 guards around these helpers to
ensure that 32-bit configs build correctly.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/cacheinfo.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 65ab9fc..cb87b68 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -641,6 +641,7 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
 static struct kobj_attribute cache_level_attr =
__ATTR(level, 0444, level_show, NULL);
 
+#ifdef CONFIG_PPC64
 static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
 {
	struct kobject *index_dir_kobj = &index->kobj;
@@ -650,16 +651,35 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
 
return dev->id;
 }
+#endif
 
 /*
  * On big-core systems, each core has two groups of CPUs each of which
  * has its own L1-cache. The thread-siblings which share l1-cache with
  * @cpu can be obtained via cpu_smallcore_mask().
+ *
+ * On some big-core systems, the L2 cache is shared only between some
+ * groups of siblings. This is already parsed and encoded in
+ * cpu_l2_cache_mask().
+ *
+ * TODO: cache_lookup_or_instantiate() needs to be made aware of the
+ *   "ibm,thread-groups" property so that cache->shared_cpu_map
+ *   reflects the correct siblings on platforms that have this
+ *   device-tree property. This helper function is only a stop-gap
+ *   solution so that we report the correct siblings to the
+ *   userspace via sysfs.
  */
-static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
+static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, struct cache *cache)
 {
-   if (cache->level == 1)
-   return cpu_smallcore_mask(cpu);
+#ifdef CONFIG_PPC64
+   if (has_big_cores) {
+   int cpu = index_dir_to_cpu(index);
+   if (cache->level == 1)
+   return cpu_smallcore_mask(cpu);
+   if (cache->level == 2 && thread_group_shares_l2)
+   return cpu_l2_cache_mask(cpu);
+   }
+#endif
 
	return &cache->shared_cpu_map;
 }
@@ -670,17 +690,11 @@ static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *
struct cache_index_dir *index;
struct cache *cache;
const struct cpumask *mask;
-   int cpu;
 
index = kobj_to_cache_index_dir(k);
cache = index->cache;
 
-   if (has_big_cores) {
-   cpu = index_dir_to_cpu(index);
-   mask = get_big_core_shared_cpu_map(cpu, cache);
-   } else {
-   mask  = &cache->shared_cpu_map;
-   }
+   mask = get_shared_cpu_map(index, cache);
 
return cpumap_print_to_pagebuf(list, buf, mask);
 }
-- 
1.9.4



[PATCH v2 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this, and when present,
populates the cpu_l2_cache_mask of every CPU with the core-siblings
which share L2 with the CPU, as specified by the
"ibm,thread-groups" property array.

On a platform with the following "ibm,thread-groups" configuration
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/smp.c  | 56 +++---
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index b2035b2..8d3d081 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -134,6 +134,7 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu)
 extern int cpu_to_core_id(int cpu);
 
 extern bool has_big_cores;
+extern bool thread_group_shares_l2;
 
 #define cpu_smt_mask cpu_smt_mask
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9078b5b5..a46cf3f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@
 struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
+bool thread_group_shares_l2;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE   8
 #define THREAD_GROUP_SHARE_L1   1
+#define THREAD_GROUP_SHARE_L2   2
 struct thread_groups {
unsigned int property;
unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
 };
 
 /* Maximum number of properties that groups of threads within a core can share */
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2
 
 struct thread_groups_list {
unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
  */
 DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
 
@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  *
  * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, translation cache.
+ * that the threads in the same group share L1, translation cache. If
+ * the value is 2, it implies that the threads in the same group share
+ * the same L2 cache.
  *
  * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
  * property ibm,thread-groups[i]
@@ -874,7 +885,8 @@ static int __init init_thread_group_cache_map(int cpu, int cache_property)
struct thr

[PATCH v2 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that
   the threads in the group share the L1 cache, translation cache and
   Instruction and Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.

The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support on platforms which encode
information about multiple properties being shared by the
corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 174 ++
 1 file changed, 113 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..88d88ad 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
unsigned int thread_list[MAX_THREAD_LIST_SIZE];
 };
 
+/* Maximum number of properties that groups of threads within a core can share */
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+   unsigned int nr_properties;
+   struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
  * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
@@ -695,81 +704,98 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
 /*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *  property for the CPU device node @dn and stores
- *  the parsed output in the thread_groups
- *  structure @tg if the ibm,thread-groups[0]
- *  matches @property.
+ *  the parsed output in the thread_groups_list
+ *  structure @tglp.
  *
  * @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
  *  output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- *interested in.
  *
  * ibm,thread-groups[0..N-1] array defines which group of threads in
  * the CPU-device node can be grouped together based on the property.
  *
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
  * that the threads in the same group share L1, translation cache.
  *
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
 * property ibm,thread-groups[i].
  *
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
  * group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
  *
- * ibm,thread-groups[3..N-1] is the list of threads identified by
 * ibm,thread-groups[i+3..i+k+2] is the list of threads identified by
  * "ibm,ppc-interrupt-server#s" arranged as per their membership in
  * the grouping.
  *
- * Example: If ibm,thread-gro

[PATCH v2 0/5] Extend Parsing "ibm,thread-groups" for Shared-L2 information

2020-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This is v2 of the patchset to extend parsing of the "ibm,thread-groups" property
to discover the Shared-L2 cache information.

The v1 can be found here :
https://lore.kernel.org/linuxppc-dev/1607057327-29822-1-git-send-email-...@linux.vnet.ibm.com/T/#m0fabffa1ea1a2807b362f25c849bb19415216520

The key changes from v1, which incorporate the review comments from
Srikar and fix a build error on !PPC64 configs reported by the kernel
test robot, are as follows:

 * Split Patch 1 into three patches
   * First patch ensures that parse_thread_groups() is made generic to
 support more than one property.
   * Second patch renames cpu_l1_cache_map as
 thread_group_l1_cache_map for consistency. No functional impact.
   * The third patch makes init_thread_group_l1_cache_map()
 generic. No functional impact.

* Patch 2 (Now patch 4): Incorporates the review comments from Srikar,
   simplifying the changes to update_mask_by_l2().

* Patch 3 (Now patch 5): Fixes build errors for 32-bit configs
   reported by the kernel build bot.
 
Description of the Patchset
===========================
The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that
   the threads in the group share the L1 cache, translation cache and
   Instruction and Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).
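
For reference, once the records are parsed into a thread_groups_list,
the per-property lookup done by get_thread_groups() in this series
amounts to a linear scan over the parsed records. A standalone sketch,
reusing the structures from patch 1 (find_property() is a hypothetical
name used only for this illustration):

#include <stdio.h>

#define MAX_THREAD_LIST_SIZE 8
#define MAX_THREAD_GROUP_PROPERTIES 2

struct thread_groups {
	unsigned int property;
	unsigned int nr_groups;
	unsigned int threads_per_group;
	unsigned int thread_list[MAX_THREAD_LIST_SIZE];
};

struct thread_groups_list {
	unsigned int nr_properties;
	struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
};

/* Return the record for @property, or NULL if the platform lacks it. */
static struct thread_groups *find_property(struct thread_groups_list *tglp,
					   unsigned int property)
{
	unsigned int i;

	for (i = 0; i < tglp->nr_properties; i++)
		if (tglp->property_tgs[i].property == property)
			return &tglp->property_tgs[i];
	return NULL;
}

int main(void)
{
	struct thread_groups_list l = {
		.nr_properties = 2,
		.property_tgs = { { .property = 1 }, { .property = 2 } },
	};

	printf("property 2 present: %s\n", find_property(&l, 2) ? "yes" : "no");
	return 0;
}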

Furthermore, currently on platforms where groups of threads share L2
cache, we incorrectly create an extra CACHE level sched-domain that
maps to all the threads of the core.

For example, if "ibm,thread-groups" is 
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

then, the sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire
core which is incorrect on such platforms.

This patchset remedies these issues by extending the parsing support
for "ibm,thread-groups" to discover information about multiple
properties being shared by the corresponding groups of threads. In
particular, we can now detect if the groups of threads within a core
share the L2-cache. On such platforms, we populate the
cpu_l2_cache_mask of every CPU with the core-siblings which share L2
with the CPU, as specified by the "ibm,thread-groups" property
array.

With the patchset, the sched-domain hierarchy is correctly
reported. F

[PATCH v2 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic

2020-12-09 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

init_thread_group_l1_cache_map() initializes the per-cpu cpumask
thread_group_l1_cache_map with the core-siblings which share L1 cache
with the CPU. Make this function generic to the cache-property (L1 or
L2) so that it updates the corresponding mask. This is a preparatory
patch for the next patch, where we will introduce discovery of
thread-groups that share L2-cache.

No functional change.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f3290d5..9078b5b5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -866,15 +866,18 @@ static struct thread_groups *__init get_thread_groups(int cpu,
return tg;
 }
 
-static int init_thread_group_l1_cache_map(int cpu)
+static int __init init_thread_group_cache_map(int cpu, int cache_property)
 
 {
int first_thread = cpu_first_thread_sibling(cpu);
int i, cpu_group_start = -1, err = 0;
struct thread_groups *tg = NULL;
+   cpumask_var_t *mask;
 
-   tg = get_thread_groups(cpu, THREAD_GROUP_SHARE_L1,
-  &err);
+   if (cache_property != THREAD_GROUP_SHARE_L1)
+   return -EINVAL;
+
+   tg = get_thread_groups(cpu, cache_property, );
if (!tg)
return err;
 
@@ -885,8 +888,8 @@ static int init_thread_group_l1_cache_map(int cpu)
return -ENODATA;
}
 
-   zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
-   GFP_KERNEL, cpu_to_node(cpu));
+   mask = &per_cpu(thread_group_l1_cache_map, cpu);
+   zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
for (i = first_thread; i < first_thread + threads_per_core; i++) {
int i_group_start = get_cpu_thread_group_start(i, tg);
@@ -897,7 +900,7 @@ static int init_thread_group_l1_cache_map(int cpu)
}
 
if (i_group_start == cpu_group_start)
-   cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
+   cpumask_set_cpu(i, *mask);
}
 
return 0;
@@ -976,7 +979,7 @@ static int init_big_cores(void)
int cpu;
 
for_each_possible_cpu(cpu) {
-   int err = init_thread_group_l1_cache_map(cpu);
+   int err = init_thread_group_cache_map(cpu, THREAD_GROUP_SHARE_L1);
 
if (err)
return err;
-- 
1.9.4



Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-09 Thread Gautham R Shenoy
On Wed, Dec 09, 2020 at 02:09:21PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy  [2020-12-08 23:26:47]:
> 
> > > The drawback of this is even if cpus 0,2,4,6 are released L1 cache will 
> > > not
> > > be released. Is this as expected?
> > 
> > cacheinfo populates the cache->shared_cpu_map on the basis of which
> > CPUs share the common device-tree node for a particular cache.  There
> > is one l1-cache object in the device-tree for a CPU node corresponding
> > to a big-core. That the L1 is further split between the threads of the
> > core is shown using ibm,thread-groups.
> > 
> 
> Yes.
> 
> > The ideal thing would be to add a "group_leader" field to "struct
> > cache" so that we can create separate cache objects , one per thread
> > group. I will take a stab at this in the v2.
> > 
> 
> I am not saying this needs to be done immediately. We could add a TODO and
> get it done later. Your patch is not making it worse. It's just that there is
> still something more left to be done.

Yeah, it needs to be fixed but it may not be a 5.11 target. For now I
will fix this patch to take care of the build errors on !PPC64 !SMT
configs. I will post a separate series for making cacheinfo.c aware of
thread-groups at the time of construction of the cache-chain.

> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-09 Thread Gautham R Shenoy
On Wed, Dec 09, 2020 at 02:05:41PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy  [2020-12-08 22:55:40]:
> 
> > > 
> > > NIT:
> > > tglx mentions in one of his recent comments to try keep a reverse fir tree
> > > ordering of variables where possible.
> > 
> > I suppose you mean moving the longer local variable declarations to
> > the top and shorter ones to the bottom. Thanks. Will fix this.
> > 
> 
> Yes.
> 
> > > > +   }
> > > > +
> > > > +   if (!tg)
> > > > +   return -EINVAL;
> > > > +
> > > > +   cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> > > 
> > > This whole hunk should be moved to a new function and called before
> > > init_cpu_cache_map. It will simplify the logic to great extent.
> > 
> > I suppose you are referring to the part where we select the correct
> > tg. Yeah, that can move to a different helper.
> > 
> 
> Yes, I would prefer if we could call this new helper outside
> init_cpu_cache_map.
> 
> > > > 
> > > > -   zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > > > -   GFP_KERNEL, cpu_to_node(cpu));
> > > > +   mask = &per_cpu(cpu_l1_cache_map, cpu);
> > > > +
> > > > +   zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > > > 
> > > 
> > > This hunk (and the next hunk) should be moved to next patch.
> > >
> > 
> > The next patch is only about introducing  THREAD_GROUP_SHARE_L2. Hence
> > I put in any other code in this patch, since it seems to be a logical
> > place to collate whatever we have in a generic form.
> > 
> 
> While I am fine with it, having a pointer that always points to the same
> mask looks weird.

Sure. Moving some of this to a separate preparatory patch.

> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-08 Thread Gautham R Shenoy
On Mon, Dec 07, 2020 at 06:41:38PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy  [2020-12-04 10:18:47]:
> 
> > From: "Gautham R. Shenoy" 
> > 
> > 
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> > 
> > +extern bool thread_group_shares_l2;
> >  /*
> >   * On big-core systems, each core has two groups of CPUs each of which
> >   * has its own L1-cache. The thread-siblings which share l1-cache with
> >   * @cpu can be obtained via cpu_smallcore_mask().
> > + *
> > + * On some big-core systems, the L2 cache is shared only between some
> > + * groups of siblings. This is already parsed and encoded in
> > + * cpu_l2_cache_mask().
> >   */
> >  static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
> >  {
> > if (cache->level == 1)
> > return cpu_smallcore_mask(cpu);
> > +   if (cache->level == 2 && thread_group_shares_l2)
> > +   return cpu_l2_cache_mask(cpu);
> > 
> > return &cache->shared_cpu_map;
> 
> As pointed with l...@intel.org, we need to do this only with #CONFIG_SMP,
> even for cache->level = 1 too.

Yes, I have fixed that in the next version.

> 
> I agree that we are displaying shared_cpu_map correctly. Should we have also
> update /clear shared_cpu_map in the first place. For example:- If for a P9
> core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7 but
> would display 0,2,4,6.
> 
> The drawback of this is even if cpus 0,2,4,6 are released L1 cache will not
> be released. Is this as expected?

cacheinfo populates the cache->shared_cpu_map on the basis of which
CPUs share the common device-tree node for a particular cache.  There
is one l1-cache object in the device-tree for a CPU node corresponding
to a big-core. That the L1 is further split between the threads of the
core is shown using ibm,thread-groups.

The ideal thing would be to add a "group_leader" field to "struct
cache" so that we can create separate cache objects , one per thread
group. I will take a stab at this in the v2.

Thanks for the review comments.



> 
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-08 Thread Gautham R Shenoy
Hello Srikar,

On Mon, Dec 07, 2020 at 06:10:39PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy  [2020-12-04 10:18:46]:
> 
> > From: "Gautham R. Shenoy" 
> > 
> > On POWER systems, groups of threads within a core sharing the L2-cache
> > can be indicated by the "ibm,thread-groups" property array with the
> > identifier "2".
> > 
> > This patch adds support for detecting this, and when present,
> > populates the cpu_l2_cache_mask of every CPU with the core-siblings
> > which share L2 with the CPU, as specified by the
> > "ibm,thread-groups" property array.
> > 
> > On a platform with the following "ibm,thread-groups" configuration
> >  00000001 00000002 00000004 00000000
> >  00000002 00000004 00000006 00000001
> >  00000003 00000005 00000007 00000002
> >  00000002 00000004 00000000 00000002
> >  00000004 00000006 00000001 00000003
> >  00000005 00000007
> > 
> > Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
> > CPU0 attaching sched-domain(s):
> > domain-0: span=0,2,4,6 level=SMT
> > domain-1: span=0-7 level=CACHE
> > domain-2: span=0-15,24-39,48-55 level=MC
> > domain-3: span=0-55 level=DIE
> > 
> > CPU1 attaching sched-domain(s):
> > domain-0: span=1,3,5,7 level=SMT
> > domain-1: span=0-7 level=CACHE
> > domain-2: span=0-15,24-39,48-55 level=MC
> > domain-3: span=0-55 level=DIE
> > 
> > The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
> > sub-array
> > [00000002 00000002 00000004
> >  00000000 00000002 00000004 00000006
> >  00000001 00000003 00000005 00000007]
> > indicates that L2 (Property "2") is shared only between the threads of a 
> > single
> > group. There are "2" groups of threads where each group contains "4"
> > threads each. The groups being {0,2,4,6} and {1,3,5,7}.
> > 
> > With this patch, the sched-domain hierarchy for CPUs 0,1 would be
> > CPU0 attaching sched-domain(s):
> > domain-0: span=0,2,4,6 level=SMT
> > domain-1: span=0-15,24-39,48-55 level=MC
> > domain-2: span=0-55 level=DIE
> > 
> > CPU1 attaching sched-domain(s):
> > domain-0: span=1,3,5,7 level=SMT
> > domain-1: span=0-15,24-39,48-55 level=MC
> > domain-2: span=0-55 level=DIE
> > 
> > The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
> > resp.) gets degenerated into the SMT domain. Furthermore, the
> > last-level-cache domain gets correctly set to the SMT sched-domain.
> > 
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >  arch/powerpc/kernel/smp.c | 66 +--
> >  1 file changed, 58 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 6a242a3..a116d2d 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -76,6 +76,7 @@
> >  struct task_struct *secondary_current;
> >  bool has_big_cores;
> >  bool coregroup_enabled;
> > +bool thread_group_shares_l2;
> 
> Either keep this as static in this patch or add its declaration
>

This will be used in Patch 3 in kernel/cacheinfo.c, but not any other
place. Hence I am not making it static here.


> > 
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> > @@ -99,6 +100,7 @@ enum {
> > 
> >  #define MAX_THREAD_LIST_SIZE   8
> >  #define THREAD_GROUP_SHARE_L1   1
> > +#define THREAD_GROUP_SHARE_L2   2
> >  struct thread_groups {
> > unsigned int property;
> > unsigned int nr_groups;
> > @@ -107,7 +109,7 @@ struct thread_groups {
> >  };
> > 
> >  /* Maximum number of properties that groups of threads within a core can 
> > share */
> > -#define MAX_THREAD_GROUP_PROPERTIES 1
> > +#define MAX_THREAD_GROUP_PROPERTIES 2
> > 
> >  struct thread_groups_list {
> > unsigned int nr_properties;
> > @@ -121,6 +123,13 @@ struct thread_groups_list {
> >   */
> >  DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
> > 
> > +/*
> > + * On some big-cores system, thread_group_l2_cache_map for each CPU
> > + * corresponds to the set its siblings within the core that share the
> > + * L2-cache.
> > + */
> > +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);

Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-08 Thread Gautham R Shenoy
Hello Srikar,

Thanks for taking a look at the patch.

On Mon, Dec 07, 2020 at 05:40:42PM +0530, Srikar Dronamraju wrote:
> * Gautham R. Shenoy  [2020-12-04 10:18:45]:
> 
> > From: "Gautham R. Shenoy" 
> 
> 
> 
> > 
> >  static int parse_thread_groups(struct device_node *dn,
> > -  struct thread_groups *tg,
> > -  unsigned int property)
> > +  struct thread_groups_list *tglp)
> >  {
> > -   int i;
> > -   u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE];
> > +   int i = 0;
> > +   u32 *thread_group_array;
> > u32 *thread_list;
> > size_t total_threads;
> > -   int ret;
> > +   int ret = 0, count;
> > +   unsigned int property_idx = 0;
> 
> NIT:
> tglx mentions in one of his recent comments to try keep a reverse fir tree
> ordering of variables where possible.

I suppose you mean moving the longer local variable declarations to
the top and shorter ones to the bottom. Thanks. Will fix this.


> 
> > 
> > +   count = of_property_count_u32_elems(dn, "ibm,thread-groups");
> > +   thread_group_array = kcalloc(count, sizeof(u32), GFP_KERNEL);
> > ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> > -thread_group_array, 3);
> > +thread_group_array, count);
> > if (ret)
> > -   return ret;
> > -
> > -   tg->property = thread_group_array[0];
> > -   tg->nr_groups = thread_group_array[1];
> > -   tg->threads_per_group = thread_group_array[2];
> > -   if (tg->property != property ||
> > -   tg->nr_groups < 1 ||
> > -   tg->threads_per_group < 1)
> > -   return -ENODATA;
> > +   goto out_free;
> > 
> > -   total_threads = tg->nr_groups * tg->threads_per_group;
> > +   while (i < count && property_idx < MAX_THREAD_GROUP_PROPERTIES) {
> > +   int j;
> > +   struct thread_groups *tg = &tglp->property_tgs[property_idx++];
> 
> NIT: same as above.

Ok.
> 
> > 
> > -   ret = of_property_read_u32_array(dn, "ibm,thread-groups",
> > -thread_group_array,
> > -3 + total_threads);
> > -   if (ret)
> > -   return ret;
> > +   tg->property = thread_group_array[i];
> > +   tg->nr_groups = thread_group_array[i + 1];
> > +   tg->threads_per_group = thread_group_array[i + 2];
> > +   total_threads = tg->nr_groups * tg->threads_per_group;
> > +
> > +   thread_list = &thread_group_array[i + 3];
> > 
> > -   thread_list = &thread_group_array[3];
> > +   for (j = 0; j < total_threads; j++)
> > +   tg->thread_list[j] = thread_list[j];
> > +   i = i + 3 + total_threads;
> 
>   Can't we simply use memcpy instead?

We could. But this one makes it more explicit.
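
For completeness, the memcpy form being suggested would look like the
following standalone sketch (illustrative only; the array layout
matches the patch, and the demo values are the Property "2" record
from the example):

#include <stdio.h>
#include <string.h>

typedef unsigned int u32;

int main(void)
{
	u32 thread_group_array[] = { 2, 2, 4, 0, 2, 4, 6, 1, 3, 5, 7 };
	u32 thread_list[8];
	int i = 0, total_threads = 8;

	/* one memcpy instead of the element-by-element loop */
	memcpy(thread_list, &thread_group_array[i + 3],
	       total_threads * sizeof(u32));

	printf("%u ... %u\n", thread_list[0], thread_list[7]);
	return 0;
}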


> 
> > +   }
> > 
> > -   for (i = 0 ; i < total_threads; i++)
> > -   tg->thread_list[i] = thread_list[i];
> > +   tglp->nr_properties = property_idx;
> > 
> > -   return 0;
> > +out_free:
> > +   kfree(thread_group_array);
> > +   return ret;
> >  }
> > 
> >  /*
> > @@ -805,24 +827,39 @@ static int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
> > return -1;
> >  }
> > 
> > -static int init_cpu_l1_cache_map(int cpu)
> > +static int init_cpu_cache_map(int cpu, unsigned int cache_property)
> > 
> >  {
> > struct device_node *dn = of_get_cpu_node(cpu, NULL);
> > -   struct thread_groups tg = {.property = 0,
> > -  .nr_groups = 0,
> > -  .threads_per_group = 0};
> > +   struct thread_groups *tg = NULL;
> > int first_thread = cpu_first_thread_sibling(cpu);
> > int i, cpu_group_start = -1, err = 0;
> > +   cpumask_var_t *mask;
> > +   struct thread_groups_list *cpu_tgl = &tgl[cpu];
> 
> NIT: same as 1st comment.

Sure, will fix this.

> 
> > 
> > if (!dn)
> > return -ENODATA;
> > 
> > -   err = parse_thread_groups(dn, &tg, THREAD_GROUP_SHARE_L1);
> > -   if (err)
> > -   goto out;
> > +   if (!(cache_property == THREAD_GROUP_SHARE_L1))
> > +   return -EINVAL;
> > 
> > -   cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> > +   if (!cpu_tgl->n

[PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that
   the threads in the group share the L1 cache, translation cache and
   Instruction and Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support on platforms which encode
information about multiple properties being shared by the
corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 146 +-
 1 file changed, 92 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..6a242a3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
unsigned int thread_list[MAX_THREAD_LIST_SIZE];
 };
 
+/* Maximum number of properties that groups of threads within a core can share */
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+   unsigned int nr_properties;
+   struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
  * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
@@ -695,81 +704,94 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
 /*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *  property for the CPU device node @dn and stores
- *  the parsed output in the thread_groups
- *  structure @tg if the ibm,thread-groups[0]
- *  matches @property.
+ *  the parsed output in the thread_groups_list
+ *  structure @tglp.
  *
  * @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
  *  output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- *interested in.
  *
  * ibm,thread-groups[0..N-1] array defines which group of threads in
  * the CPU-device node can be grouped together based on the property.
  *
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
  * that the threads in the same group share L1, translation cache.
  *
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
+ * property ibm,thread-groups[i]
  *
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
  * group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
  *
- * ibm,thread-groups[3..N-1] is the list of threads identified by
 * ibm,thread-groups[i+3..i+k+2] is the list of threads identified by
  * "ibm,ppc-interrupt-server#s" arranged as per their membership in
  * the grouping.
  *
- * Example: If ibm,thread-gr

[PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER systems, groups of threads within a core sharing the L2-cache
can be indicated by the "ibm,thread-groups" property array with the
identifier "2".

This patch adds support for detecting this, and when present,
populates the cpu_l2_cache_mask of every CPU with the core-siblings
which share L2 with the CPU, as specified by the
"ibm,thread-groups" property array.

On a platform with the following "ibm,thread-groups" configuration
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups
sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/smp.c | 66 +--
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6a242a3..a116d2d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@
 struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
+bool thread_group_shares_l2;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE   8
 #define THREAD_GROUP_SHARE_L1   1
+#define THREAD_GROUP_SHARE_L2   2
 struct thread_groups {
unsigned int property;
unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
 };
 
 /* Maximum number of properties that groups of threads within a core can share */
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2
 
 struct thread_groups_list {
unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
  */
 DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
 
+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
 
@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  *
  * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, translation cache.
+ * that the threads in the same group share L1, translation cache. If
+ * the value is 2, it implies that the threads in the same group share
+ * the same L2 cache.
  *
  * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
  * property ibm,thread-groups[i]
@@ -745,10 +756,10 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  * 12}.
  *
  * b) there are 2 groups of 4 threads each, where each group of
- *threads share some property indicated by the first value 2. The
- *"ibm,ppc-interrupt-server#s" of the first group is {5,7,9,11}
- *and the "ibm,ppc-interrupt-server#s" of the second group is
- *{6,8,10,12} structure
+ *threads share some property indicated by the first value 2 (L2
+ *cache). The "ibm,ppc-interrupt-server#s" of the first group is
+ *{5,7,9,

[PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The "ibm,thread-groups" device-tree property is an array that is used
to indicate if groups of threads within a core share certain
properties. It provides details of which property is being shared by
which groups of threads. This array can encode information about
multiple properties being shared by different thread-groups within the
core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where in,

a) provides information of Property "1" being shared by "2" groups,
   each with "4" threads. The "ibm,ppc-interrupt-server#s" of the
   first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of
   the second group is {9,11,13,15}. Property "1" indicates that
   the threads in the group share the L1 cache, translation cache and
   Instruction and Data flow.

b) provides information of Property "2" being shared by "2" groups,
   each group with "4" threads. The "ibm,ppc-interrupt-server#s" of
   the first group is {8,10,12,14} and the
   "ibm,ppc-interrupt-server#s" of the second group is
   {9,11,13,15}. Property "2" indicates that the threads in each group
   share the L2-cache.
   
The existing code assumes that the "ibm,thread-groups" encodes
information about only one property. Hence even on platforms which
encode information about multiple properties being shared by the
corresponding groups of threads, the current code will only pick the
first one. (In the above example, it will only consider
[1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, currently on platforms where groups of threads share L2
cache, we incorrectly create an extra CACHE level sched-domain that
maps to all the threads of the core.

For example, if "ibm,thread-groups" is 
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

then, the sub-array
[00000002 00000002 00000004
 00000000 00000002 00000004 00000006
 00000001 00000003 00000005 00000007]
indicates that L2 (Property "2") is shared only between the threads of a single
group. There are "2" groups of threads where each group contains "4"
threads each. The groups being {0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is
CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire
core which is incorrect on such platforms.


This patchset remedies these issues by extending the parsing support
for "ibm,thread-groups" to discover information about multiple
properties being shared by the corresponding groups of threads. In
particular, we can now detect if the groups of threads within a core
share the L2-cache. On such platforms, we populate the
cpu_l2_cache_mask of every CPU with the core-siblings which share L2
with the CPU, as specified by the "ibm,thread-groups" property
array.

With the patchset, the sched-domain hierarchy is correctly
reported. For eg for CPUs 0,1, with the patchset

CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1
resp.) gets degenerated into the SMT domain. Furthermore, the
last-level-cache domain gets correctly set to the SMT sched-domain.
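
A toy sketch (not the kernel's actual code) of the degeneration rule
at work here: a parent sched-domain that spans exactly the same CPUs
as its child adds no information, so the scheduler collapses it.

#include <stdio.h>

int main(void)
{
	unsigned long smt_span   = 0x55;   /* CPUs {0,2,4,6} */
	unsigned long cache_span = 0x55;   /* L2 siblings == SMT siblings */

	if (cache_span == smt_span)
		printf("CACHE spans the same CPUs as SMT: it degenerates\n");
	return 0;
}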

Finally, this patchset reports the correct shared_cpu_map/list in the
sysfs for L2 cache on such platforms. With the patchset for CPUs0, 1,
for L2 cache we see the correct shared_cpu_map/list

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa

The patchset has been tested on older platform

[PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On POWER platforms where only some groups of threads within a core
share the L2-cache (indicated by the ibm,thread-groups device-tree
property), we currently print the incorrect shared_cpu_map/list for
L2-cache in the sysfs.

This patch reports the correct shared_cpu_map/list on such platforms.

Example:
On a platform with "ibm,thread-groups" set to
 00000001 00000002 00000004 00000000
 00000002 00000004 00000006 00000001
 00000003 00000005 00000007 00000002
 00000002 00000004 00000000 00000002
 00000004 00000006 00000001 00000003
 00000005 00000007

This indicates that threads {0,2,4,6} in the core share the L2-cache
and threads {1,3,5,7} in the core share the L2 cache.

However, without the patch, the shared_cpu_map/list for L2 for CPUs 0,
1 is reported in the sysfs as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff

With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is
correctly reported as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055

/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa
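
To spot-check a running system, the reported list can be read back
directly; a minimal standalone sketch (not part of the patch) using
the sysfs path shown above:

#include <stdio.h>

int main(void)
{
	const char *path =
		"/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list";
	char buf[64];
	FILE *f = fopen(path, "r");

	if (f) {
		if (fgets(buf, sizeof(buf), f))
			printf("cpu0 L2 shared_cpu_list: %s", buf);
		fclose(f);
	}
	return 0;
}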

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/kernel/cacheinfo.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 65ab9fc..1cc8f37 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -651,15 +651,22 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
return dev->id;
 }
 
+extern bool thread_group_shares_l2;
 /*
  * On big-core systems, each core has two groups of CPUs each of which
  * has its own L1-cache. The thread-siblings which share l1-cache with
  * @cpu can be obtained via cpu_smallcore_mask().
+ *
+ * On some big-core systems, the L2 cache is shared only between some
+ * groups of siblings. This is already parsed and encoded in
+ * cpu_l2_cache_mask().
  */
static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
 {
if (cache->level == 1)
return cpu_smallcore_mask(cpu);
+   if (cache->level == 2 && thread_group_shares_l2)
+   return cpu_l2_cache_mask(cpu);
 
	return &cache->shared_cpu_map;
 }
-- 
1.9.4



Re: [PATCH 20/33] docs: ABI: testing: make the files compatible with ReST output

2020-11-02 Thread Gautham R Shenoy
> +   max frequency is throttled due to 'Power Supply Failure'.
> 
>   - overcurrent : This file gives the total number of times the
> - max frequency is throttled due to 'Overcurrent'.
> +   max frequency is throttled due to 'Overcurrent'.
> 
>   - occ_reset : This file gives the total number of times the max
> - frequency is throttled due to 'OCC Reset'.
> +   frequency is throttled due to 'OCC Reset'.
> 
>   The sysfs attributes representing different throttle reasons 
> like
>   powercap, overtemp, supply_fault, overcurrent and occ_reset map 
> to


These hunks for the powernv cpufreq driver look good to me.
For these two hunks,

Reviewed-by: Gautham R. Shenoy 




Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement

2020-09-14 Thread Gautham R Shenoy
On Wed, Sep 02, 2020 at 05:15:06PM +0530, Pratik Rajesh Sampat wrote:
> Measure cpuidle latencies on wakeup and compare them with the
> advertised wakeup latencies for each idle state.
> 
> Cpuidle wakeup latencies are determined for IPIs and Timer events and
> can help determine any deviations from what is advertised by the
> hardware.
> 
> A baseline measurement for each case of IPI and timers is taken at
> 100 percent CPU usage to quantify the kernel-userspace overhead
> during execution.
> 
> Signed-off-by: Pratik Rajesh Sampat 
> ---
>  tools/testing/selftests/Makefile  |   1 +
>  tools/testing/selftests/cpuidle/Makefile  |   7 +
>  tools/testing/selftests/cpuidle/cpuidle.c | 616 ++
>  tools/testing/selftests/cpuidle/settings  |   1 +
>  4 files changed, 625 insertions(+)
>  create mode 100644 tools/testing/selftests/cpuidle/Makefile
>  create mode 100644 tools/testing/selftests/cpuidle/cpuidle.c
>  create mode 100644 tools/testing/selftests/cpuidle/settings
> 
> diff --git a/tools/testing/selftests/Makefile 
> b/tools/testing/selftests/Makefile
> index 9018f45d631d..2bb0e87f76fd 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -8,6 +8,7 @@ TARGETS += cgroup
>  TARGETS += clone3
>  TARGETS += core
>  TARGETS += cpufreq
> +TARGETS += cpuidle
>  TARGETS += cpu-hotplug
>  TARGETS += drivers/dma-buf
>  TARGETS += efivarfs
> diff --git a/tools/testing/selftests/cpuidle/Makefile 
> b/tools/testing/selftests/cpuidle/Makefile
> new file mode 100644
> index ..d332485e1bc5
> --- /dev/null
> +++ b/tools/testing/selftests/cpuidle/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +TEST_GEN_PROGS := cpuidle
> +
> +CFLAGS += -O2
> +LDLIBS += -lpthread
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/cpuidle/cpuidle.c 
> b/tools/testing/selftests/cpuidle/cpuidle.c
> new file mode 100644
> index ..4b1e7a91f75c
> --- /dev/null
> +++ b/tools/testing/selftests/cpuidle/cpuidle.c
> @@ -0,0 +1,616 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Cpuidle latency measurement microbenchmark
> + *
> + * A mechanism to measure wakeup latency for IPI and Timer based interrupts
> + * Results of this microbenchmark can be used to check and validate against 
> the
> + * advertised latencies for each cpuidle state
> + *
> + * IPIs (using pipes) and Timers are used to wake the CPU up and measure the
> + * time difference
> + *
> + * Usage:
> + *   ./cpuidle --mode <full/quick/num_cpus> --output <output file>
> + *
> + * Copyright (C) 2020 Pratik Rajesh Sampat , IBM
> + */
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define READ 0
> +#define WRITE	1
> +#define TIMEOUT_US   50
> +
> +static int pipe_fd[2];
> +static int *cpu_list;
> +static int cpus;
> +static int idle_states;
> +static uint64_t *latency_list;
> +static uint64_t *residency_list;
> +
> +static char *log_file = "cpuidle.log";
> +
> +static int get_online_cpus(int *online_cpu_list, int total_cpus)
> +{
> + char filename[80];
> + int i, index = 0;
> + FILE *fptr;
> +
> + for (i = 0; i < total_cpus; i++) {
> + char status;
> +
> + sprintf(filename, "/sys/devices/system/cpu/cpu");
> + sprintf(filename + strlen(filename), "%d%s", i, "/online");
> + fptr = fopen(filename, "r");
> + if (!fptr)
> + continue;
> + assert(fscanf(fptr, "%c", ) != EOF);
> + if (status == '1')
> + online_cpu_list[index++] = i;
> + fclose(fptr);
> + }
> + return index;
> +}
> +
> +static uint64_t us_to_ns(uint64_t val)
> +{
> + return val * 1000;
> +}
> +
> +static void get_latency(int cpu)
> +{
> + char filename[80];
> + uint64_t latency;
> + FILE *fptr;
> + int state;
> +
> + for (state = 0; state < idle_states; state++) {
> + sprintf(filename, "%s%d%s%d%s", "/sys/devices/system/cpu/cpu",
> + cpu, "/cpuidle/state",
> + state, "/latency");
> + fptr = fopen(filename, "r");
> + assert(fptr);
> +
> + assert(fscanf(fptr, "%ld", ) != EOF);
> + latency_list[state] = latency;
> + fclose(fptr);
> + }
> +}
> +
> +static void get_residency(int cpu)
> +{
> + uint64_t residency;
> + char filename[80];
> + FILE *fptr;
> + int state;
> +
> + for (state = 0; state < idle_states; state++) {
> + sprintf(filename, "%s%d%s%d%s", "/sys/devices/system/cpu/cpu",
> + cpu, "/cpuidle/state",
> + state, "/residency");
> + fptr = fopen(filename, "r");
> + assert(fptr);
> +
> + assert(fscanf(fptr, "%ld", ) != EOF);

Re: [RFC v4 0/1] Selftest for cpuidle latency measurement

2020-09-14 Thread Gautham R Shenoy
On Wed, Sep 02, 2020 at 05:15:05PM +0530, Pratik Rajesh Sampat wrote:
> Changelog v3-->v4:
> 1. Overhaul in implementation from kernel module to a userspace selftest 
> ---
> 
> The patch series introduces a mechanism to measure wakeup latency for
> IPI and timer based interrupts.
> The motivation behind this series is to find significant deviations
> from the advertised latency and residency values.
> 
> To achieve this in userspace, IPI latencies are calculated by
> sending information through pipes and inducing a wakeup; similarly,
> alarm events are set up to calculate timer-based wakeup latencies.

> 
> To account for delays from kernel-userspace interactions, baseline
> observations are taken on a 100% busy CPU and subsequent observations
> must be considered relative to that.
> 
> In theory, wakeups induced by IPI and Timers should have similar
> wakeup latencies, however in practice there may be deviations which may
> need to be captured.
> 
> One downside of the userspace approach, in contrast to the kernel
> implementation, is that the run-to-run variance can turn out to be high,
> of the order of ms, which at times is the scale of the quantities measured.
> 
> Another downside of the userspace approach is that it takes much longer
> to run, hence command-line options "quick" and "full" are added to make
> sure quick one-CPU tests can be carried out when needed, while otherwise
> a comprehensive full-system test can be run.
> 
> Usage
> ---
> ./cpuidle --mode <full/quick/num_cpus> --output <output file>
> full: runs on all CPUS
> quick: run on a random CPU
> num_cpus: Limit the number of CPUS to run on
> 
> Sample output snippet
> -
> --IPI Latency Test---
> SRC_CPU   DEST_CPU IPI_Latency(ns)
> ...
>   0  5   256178
>   0  6   478161
>   0  7   285445
>   0  8   273553
> Expected IPI latency(ns): 10
> Observed Average IPI latency(ns): 248334

I suppose by run-to-run variance you are referring to the outliers in
the above sequence (like 478161)? Or is it that each time you run
your test program you observe a completely different series of values?

If it is the former, then perhaps we could discard the outliers for
the purpose of average latency computation and print the max, min and
the corrected-average values above.
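
For instance, a rough standalone sketch of that correction
(hypothetical code, not part of the selftest; it drops samples beyond
1.5x the median, which would discard 478161 in the sequence above):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/* Average after discarding samples larger than 1.5x the median. */
static uint64_t corrected_avg(uint64_t *s, int n, uint64_t *min, uint64_t *max)
{
	uint64_t sum = 0, median;
	int i, kept = 0;

	qsort(s, n, sizeof(*s), cmp_u64);
	median = s[n / 2];
	*min = s[0];
	*max = s[n - 1];
	for (i = 0; i < n; i++) {
		if (2 * s[i] > 3 * median)
			break;	/* sorted, so the rest are outliers too */
		sum += s[i];
		kept++;
	}
	return sum / kept;
}

int main(void)
{
	/* IPI latency samples (ns) from the output above */
	uint64_t lat[] = { 256178, 478161, 285445, 273553 };
	uint64_t min, max;
	uint64_t avg = corrected_avg(lat, 4, &min, &max);

	printf("min=%llu max=%llu corrected-avg=%llu ns\n",
	       (unsigned long long)min, (unsigned long long)max,
	       (unsigned long long)avg);
	return 0;
}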



> 
> --Timeout Latency Test--
> --Baseline Timeout Latency measurement: CPU Busy--
> Wakeup_src Baseline_delay(ns)
> ...
>  32  972405
>  33 1004287
>  34  986663
>  35  994022
> Expected timeout(ns): 1000000
> Observed Average timeout diff(ns): 991844
>

It would be good to see a complete sample output, perhaps for the
--mode=10 so that it is easy to discern if there are cases when the
observed timeouts/IPI latencies for the busy case are larger than the
idle-case.



> Pratik Rajesh Sampat (1):
>   selftests/cpuidle: Add support for cpuidle latency measurement
> 
>  tools/testing/selftests/Makefile  |   1 +
>  tools/testing/selftests/cpuidle/Makefile  |   7 +
>  tools/testing/selftests/cpuidle/cpuidle.c | 616 ++
>  tools/testing/selftests/cpuidle/settings  |   1 +
>  4 files changed, 625 insertions(+)
>  create mode 100644 tools/testing/selftests/cpuidle/Makefile
>  create mode 100644 tools/testing/selftests/cpuidle/cpuidle.c
>  create mode 100644 tools/testing/selftests/cpuidle/settings
> 
> -- 
> 2.26.2
> 

--
Thanks and Regards
gautham.


[PATCH v2] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-03 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform. The values
advertised by the platform are in timebase ticks. However the cpuidle
framework requires the latency values in microseconds.

If the tb-ticks value advertised by the platform corresponds to a value
smaller than 1us, during the conversion from tb-ticks to microseconds,
in the current code, the result becomes zero. This is incorrect as it
puts a CEDE state on par with the snooze state.

This patch fixes this by rounding up the result obtained while
converting the latency value from tb-ticks to microseconds. It also
prints a warning in case we discover an extended-cede state with a
wakeup latency of 0. In such a case, we ensure that CEDE(0) has a
non-zero wakeup latency.
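
To illustrate the truncation with a concrete number, here is a
hypothetical standalone sketch (it assumes the usual 512 MHz POWER
timebase, so tb_to_ns(x) is x * 1000 / 512):

#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_USEC		1000ULL
#define DIV_ROUND_UP_ULL(x, y)	(((x) + (y) - 1) / (y))

/* 512 MHz timebase: one tick is 1000/512 ns */
static uint64_t tb_to_ns(uint64_t tb)
{
	return tb * 1000 / 512;
}

int main(void)
{
	uint64_t ticks = 256;	/* an advertised wakeup latency below 1us */

	/* 256 ticks = 500 ns; 500 / 1000 truncates to 0us */
	printf("truncated : %llu us\n",
	       (unsigned long long)(tb_to_ns(ticks) / NSEC_PER_USEC));
	/* DIV_ROUND_UP_ULL(500, 1000) yields 1us instead */
	printf("rounded up: %llu us\n",
	       (unsigned long long)DIV_ROUND_UP_ULL(tb_to_ns(ticks),
						    NSEC_PER_USEC));
	return 0;
}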

Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)")

Signed-off-by: Gautham R. Shenoy 
---
v1-->v2: Added a warning if a CEDE state has 0 wakeup latency (Suggested by Joel Stanley).
 Also added code to ensure that CEDE(0) has a non-zero wakeup latency.  
 
 drivers/cpuidle/cpuidle-pseries.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index ff6d99e..a2b5c6f 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -361,7 +361,10 @@ static void __init fixup_cede0_latency(void)
for (i = 0; i < nr_xcede_records; i++) {
	struct xcede_latency_record *record = &payload->records[i];
u64 latency_tb = be64_to_cpu(record->latency_ticks);
-   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+	u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC);
+
+   if (latency_us == 0)
+   pr_warn("cpuidle: xcede record %d has an unrealistic 
latency of 0us.\n", i);
 
if (latency_us < min_latency_us)
min_latency_us = latency_us;
@@ -378,10 +381,14 @@ static void __init fixup_cede0_latency(void)
 * Perform the fix-up.
 */
if (min_latency_us < dedicated_states[1].exit_latency) {
-   u64 cede0_latency = min_latency_us - 1;
+   /*
+* We set a minimum of 1us wakeup latency for cede0 to
+* distinguish it from snooze
+*/
+   u64 cede0_latency = 1;
 
-   if (cede0_latency <= 0)
-   cede0_latency = min_latency_us;
+   if (min_latency_us > cede0_latency)
+   cede0_latency = min_latency_us - 1;
 
dedicated_states[1].exit_latency = cede0_latency;
dedicated_states[1].target_residency = 10 * (cede0_latency);
-- 
1.9.4



Re: [PATCH] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-02 Thread Gautham R Shenoy
Hello Joel,

On Wed, Sep 02, 2020 at 01:08:35AM +, Joel Stanley wrote:
> On Tue, 1 Sep 2020 at 14:09, Gautham R. Shenoy  
> wrote:
> >
> > From: "Gautham R. Shenoy" 
> >
> > commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform. The values
> > advertised by the platform are in timebase ticks. However the cpuidle
> > framework requires the latency values in microseconds.
> >
> > If the tb-ticks value advertised by the platform corresponds to a value
> > smaller than 1us, during the conversion from tb-ticks to microseconds,
> > in the current code, the result becomes zero. This is incorrect as it
> > puts a CEDE state on par with the snooze state.
> >
> > This patch fixes this by rounding up the result obtained while
> > converting the latency value from tb-ticks to microseconds.
> >
> > Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)")
> >
> > Signed-off-by: Gautham R. Shenoy 
> 
> Reviewed-by: Joel Stanley 
>

Thanks for reviewing the fix.

> Should you check for the zero case and print a warning?

Yes, that would be better. I will post a v2 with that.

> 
> > ---
> >  drivers/cpuidle/cpuidle-pseries.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> > b/drivers/cpuidle/cpuidle-pseries.c
> > index ff6d99e..9043358 100644
> > --- a/drivers/cpuidle/cpuidle-pseries.c
> > +++ b/drivers/cpuidle/cpuidle-pseries.c
> > @@ -361,7 +361,7 @@ static void __init fixup_cede0_latency(void)
> > for (i = 0; i < nr_xcede_records; i++) {
> > 		struct xcede_latency_record *record = &payload->records[i];
> > u64 latency_tb = be64_to_cpu(record->latency_ticks);
> > -   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
> > +	u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC);
> >
> > if (latency_us < min_latency_us)
> > min_latency_us = latency_us;
> > --
> > 1.9.4
> >


[PATCH] cpuidle-pseries: Fix CEDE latency conversion from tb to us

2020-09-01 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform. The values
advertised by the platform are in timebase ticks. However the cpuidle
framework requires the latency values in microseconds.

If the tb-ticks value advertised by the platform corresponds to a value
smaller than 1us, during the conversion from tb-ticks to microseconds,
in the current code, the result becomes zero. This is incorrect as it
puts a CEDE state on par with the snooze state.

This patch fixes this by rounding up the result obtained while
converting the latency value from tb-ticks to microseconds.

Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)")

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index ff6d99e..9043358 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -361,7 +361,7 @@ static void __init fixup_cede0_latency(void)
for (i = 0; i < nr_xcede_records; i++) {
	struct xcede_latency_record *record = &payload->records[i];
u64 latency_tb = be64_to_cpu(record->latency_ticks);
-   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+	u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC);
 
if (latency_us < min_latency_us)
min_latency_us = latency_us;
-- 
1.9.4



Re: [PATCH v5 06/10] powerpc/smp: Optimize start_secondary

2020-08-11 Thread Gautham R Shenoy
Hi Srikar,

On Mon, Aug 10, 2020 at 12:48:30PM +0530, Srikar Dronamraju wrote:
> In start_secondary, even if shared_cache was already set, the system
> does a redundant cpumask match. This redundant check can be removed by
> first checking whether shared_cache is already set.
> 
> While here, localize the sibling_mask variable to within the if
> condition.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Cc: Vaidyanathan Srinivasan 
> Signed-off-by: Srikar Dronamraju 

The change looks good to me.

Reviewed-by: Gautham R. Shenoy 

> ---
> Changelog v4 ->v5:
>   Retain cache domain, no need for generalization
>        (Michael Ellerman, Peter Zijlstra,
>Valentin Schneider, Gautham R. Shenoy)
> 
> Changelog v1 -> v2:
>   Moved shared_cache topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 0c960ce3be42..91cf5d05e7ec 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -851,7 +851,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> - return cpu_l2_cache_mask(cpu);
> + return per_cpu(cpu_l2_cache_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -1305,7 +1305,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>   unsigned int cpu = smp_processor_id();
> - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> 
> mmgrab(&init_mm);
> current->active_mm = &init_mm;
> @@ -1331,14 +1330,20 @@ void start_secondary(void *unused)
>   /* Update topology CPU masks */
>   add_cpu_to_masks(cpu);
> 
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
>   /*
>* Check for any shared caches. Note that this must be done on a
>* per-core basis because one core in the pair might be disabled.
>*/
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;
> + }
> 
>   set_numa_node(numa_cpu_lookup_table[cpu]);
>   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> -- 
> 2.18.2
> 


Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-31 Thread Gautham R Shenoy
Hi Srikar, Valentin,

On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider  [2020-07-28 16:03:11]:
>

[..snip..]

> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
> 
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
> 
> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
> 
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best
results for POWER9 in terms of cache-benefits during wakeup.  On a
POWER9 Boston machine, running a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c)

The test case creates two threads, one Producer and another
Consumer. Both work on a fairly large shared array of size 64M. In an
iteration, the Producer performs stores to 1024 random locations and
wakes up the Consumer. In its iteration, the Consumer loads from those
exact 1024 locations.

We measure the number of Consumer iterations per second and the
average time for each Consumer iteration. The smaller the time, the
better it is.
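
The core of the test is a store/wake/load ping-pong between the two
threads. A condensed, hypothetical sketch of that pattern (using a
mutex/condvar handshake rather than the exact synchronization of the
linked program):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_LONGS	(64UL * 1024 * 1024 / sizeof(long))	/* 64M array */
#define TOUCHES		1024
#define ITERS		5000

static long *shared;
static int idx[TOUCHES];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready, done = 1;

static void *consumer(void *arg)
{
	struct timespec t0, t1;
	long long total_ns = 0;
	volatile long sum = 0;
	int it, i;

	(void)arg;
	for (it = 0; it < ITERS; it++) {
		pthread_mutex_lock(&lock);
		while (!ready)
			pthread_cond_wait(&cond, &lock);	/* producer wakes us */
		ready = 0;
		pthread_mutex_unlock(&lock);

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < TOUCHES; i++)
			sum += shared[idx[i]];	/* load what the producer stored */
		clock_gettime(CLOCK_MONOTONIC, &t1);
		total_ns += (t1.tv_sec - t0.tv_sec) * 1000000000LL +
			    (t1.tv_nsec - t0.tv_nsec);

		pthread_mutex_lock(&lock);
		done = 1;
		pthread_cond_broadcast(&cond);	/* let the producer continue */
		pthread_mutex_unlock(&lock);
	}
	printf("avg consumer iteration: %lld ns\n", total_ns / ITERS);
	return NULL;
}

int main(void)
{
	pthread_t tid;
	int it, i;

	shared = calloc(ARRAY_LONGS, sizeof(long));
	pthread_create(&tid, NULL, consumer, NULL);

	for (it = 0; it < ITERS; it++) {
		pthread_mutex_lock(&lock);
		while (!done)
			pthread_cond_wait(&cond, &lock);	/* consumer finished */
		done = 0;
		pthread_mutex_unlock(&lock);

		for (i = 0; i < TOUCHES; i++) {
			idx[i] = rand() % ARRAY_LONGS;
			shared[idx[i]] = it;	/* producer stores */
		}

		pthread_mutex_lock(&lock);
		ready = 1;
		pthread_cond_broadcast(&cond);	/* wake the consumer */
		pthread_mutex_unlock(&lock);
	}
	pthread_join(tid, NULL);
	return 0;
}

Pinning the two threads to the CPU pairs listed below is then just a
matter of sched_setaffinity()/taskset on top of this skeleton.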

The following results are when I pinned the Producer and Consumer to
different combinations of CPUs to cover Small core, Big-core,
Neighbouring Big-core, Far-off core within the same chip, and across
chips. There is also a case where they are not affined anywhere, and
we let the scheduler wake them up correctly.

We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the numbers that we
get when we let the Scheduler wake them up (where LLC is L2).

## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)

Consumer affined to  CPU 3
Producer affined to  CPU 1
4698 iterations, avg time: 20034 ns
4951 iterations, avg time: 20012 ns
4957 iterations, avg time: 19971 ns
4968 iterations, avg time: 19985 ns
4970 iterations, avg time: 19977 ns


## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)

Consumer affined to  CPU 7
Producer affined to  CPU 1
4580 iterations, avg time: 19403 ns
4851 iterations, avg time: 19373 ns
4849 iterations, avg time: 19394 ns
4856 iterations, avg time: 19394 ns
4867 iterations, avg time: 19353 ns

## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)

Producer affined to  CPU 1
Consumer affined to  CPU 11
4270 iterations, avg time: 24158 ns
4491 iterations, avg time: 24157 ns
4500 iterations, avg time: 24148 ns
4516 iterations, avg time: 24164 ns
4518 iterations, avg time: 24165 ns


## Any other Big-core from Same Chip (Shares L3)

Producer affined to  CPU 1
Consumer affined to  CPU 87
4176 iterations, avg time: 27953 ns
4417 iterations, avg time: 27925 ns
4415 iterations, avg time: 27934 ns
4417 iterations, avg time: 27983 ns
4430 iterations, avg time: 27958 ns


## Different Chips (No cache-sharing)

Consumer affined to  CPU 175
Producer affined to  CPU 1
3277 iterations, avg time: 50786 ns
3063 iterations, avg time: 50732 ns
2831 iterations, avg time: 50737 ns
2859 iterations, avg time: 50688 ns
2849 iterations, avg time: 50722 ns

## Without affining them (Let Scheduler wake-them up appropriately)
Consumer affined to  CPU 0-175
Producer affined to  CPU 0-175
4821 iterations, avg time: 19412 ns
4863 iterations, avg time: 19435 ns
4855 iterations, avg time: 19381 ns
4811 iterations, avg time: 19458 ns
4892 iterations, avg time: 19429 ns


--
Thanks and Regards
gautham.


Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-29 Thread Gautham R Shenoy
On Mon, Jul 27, 2020 at 11:02:26AM +0530, Srikar Dronamraju wrote:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could be either a "CACHE" domain or a "BIGCORE" domain.
> 
> While setting up the "CACHE" domain, check if shared_cache is already
> set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

Reviewed-by: Gautham R. Shenoy 

> ---
> Changelog v1 -> v2:
>   Moved shared_cache topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 48 +++
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index d997c7411664..3c5ccf6d2b1c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
> 
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> + smt_idx,
> +#endif
> + bigcore_idx,
> + die_idx,
> +};
> +
>  #define MAX_THREAD_LIST_SIZE 8
>  #define THREAD_GROUP_SHARE_L1   1
>  struct thread_groups {
> @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> - if (shared_caches)
> - return cpu_l2_cache_mask(cpu);
> -
> - if (has_big_cores)
> - return cpu_smallcore_mask(cpu);
> -
> - return per_cpu(cpu_sibling_map, cpu);
> + return per_cpu(cpu_l2_cache_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static const struct cpumask *cpu_bigcore_mask(int cpu)
> +{
> + return per_cpu(cpu_sibling_map, cpu);
> +}
> +
>  static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> + { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };
> @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>   unsigned int cpu = smp_processor_id();
> - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> 
> mmgrab(&init_mm);
> current->active_mm = &init_mm;
> @@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
>   /* Update topology CPU masks */
>   add_cpu_to_masks(cpu);
> 
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
>   /*
>* Check for any shared caches. Note that this must be done on a
>* per-core basis because one core in the pair might be disabled.
>*/
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;
> + }
> 
>   set_numa_node(numa_cpu_lookup_table[cpu]);
>   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1375,9 +1387,17 @@ static void fixup_topology(void)
>  #ifdef CONFIG_SCHED_SMT
>   if (has_big_cores) {
>   pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>   }
>  #endif
> + if (shared_caches) {
> + pr_info("Using shared cache scheduler topology\n");
> + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> + powerpc_topology[bigcore_idx].sd_flags = 
> powerpc_shared_cache_flags;
> +#ifdef CONFIG_SCHED_DEBUG
> + powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> + }
>  }
> 
>  void __init smp_cpus_done(unsigned int max_cpus)
> -- 
> 2.17.1
> 


[PATCH v3 3/3] cpuidle-pseries : Fixup exit latency for CEDE(0)

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

We are currently assuming that CEDE(0) has exit latency 10us, since
there is no way for us to query from the platform.  However, if the
wakeup latency of an Extended CEDE state is smaller than 10us, then we
can be sure that the exit latency of CEDE(0) cannot be more than that.

In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us.

Benchmark results:

On POWER8, this patch does not have any impact since the advertised
latency of Extended CEDE(1) is 30us, which is higher than the default
latency of CEDE(0), which is 10us.

On POWER9 we see an improvement in the single-threaded performance of ebizzy,
and no regression in the wakeup latency or the number of
context-switches.

ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
* with_patch
    N       Min       Max    Median       Avg     Stddev
x  10   2491089   5834307   5398375   4244335 1596244.9
*  10   2893813   5834474   5832448 5327281.3 1055941.4

context_switch2 :
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
* with_patch
    N     Min     Max  Median        Avg     Stddev
x 500  348872  362236  354712  354745.69   2711.827
* 500  349422  361452  353942   354215.4  2576.9258
Difference at 99.0% confidence
-530.288 +/- 430.963
-0.149484% +/- 0.121485%
(Student's t, pooled s = 2645.24)

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
* with_patch
    N     Min     Max  Median        Avg     Stddev
x 500  287956  294940  288896  288977.23  646.59295
* 500  288300  294646  289582  290064.76  1161.9992
Difference at 99.0% confidence
1087.53 +/- 153.194
0.376337% +/- 0.0530125%
(Student's t, pooled s = 940.299)

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

Signed-off-by: Gautham R. Shenoy 
---
v2-->v3 : Made notation consistent with first two patches.
 drivers/cpuidle/cpuidle-pseries.c | 41 +--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index f528da7..8d19820 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -350,13 +350,50 @@ static int pseries_cpuidle_driver_init(void)
return 0;
 }
 
-static void __init parse_xcede_idle_states(void)
+static void __init fixup_cede0_latency(void)
 {
+   int i;
+	u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */
+   struct xcede_latency_payload *payload;
+
if (parse_cede_parameters())
return;
 
pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
nr_xcede_records);
+
+	payload = &xcede_latency_parameter.payload;
+   for (i = 0; i < nr_xcede_records; i++) {
+		struct xcede_latency_record *record = &payload->records[i];
+   u64 latency_tb = be64_to_cpu(record->latency_ticks);
+   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+
+   if (latency_us < min_latency_us)
+   min_latency_us = latency_us;
+   }
+
+   /*
+* By default, we assume that CEDE(0) has exit latency 10us,
+* since there is no way for us to query from the platform.
+*
+* However, if the wakeup latency of an Extended CEDE state is
+* smaller than 10us, then we can be sure that CEDE(0)
+* requires no more than that.
+*
+* Perform the fix-up.
+*/
+   if (min_latency_us < dedicated_states[1].exit_latency) {
+   u64 cede0_latency = min_latency_us - 1;
+
+   if (cede0_latency <= 0)
+   cede0_latency = min_latency_us;
+
+   dedicated_states[1].exit_latency = cede0_latency;
+   dedicated_states[1].target_residency = 10 * (cede0_latency);
+   pr_i

[PATCH v3 0/3] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This is a v3 of the patch series to parse the extended CEDE
information in the pseries-cpuidle driver.

The previous two versions of the patches can be found here:

v2: 
https://lore.kernel.org/lkml/1596005254-25753-1-git-send-email-...@linux.vnet.ibm.com/

v1: 
https://lore.kernel.org/linuxppc-dev/1594120299-31389-1-git-send-email-...@linux.vnet.ibm.com/

The change from v2 --> v3 :

 * Patch 1: Got rid of some #define-s which were needed mainly for Patches 4
   and 5 of v1, but were retained in v2.

 * Patch 2:

* Based on feedback from Michael Ellerman, rewrote the
  function to parse the extended idle states by explicitly
  defining the structure of the object that is returned by
  ibm,get-system-parameters(CEDE_LATENCY_TOKEN) rtas-call. In
  the previous versions we were passing a character array and
  subsequently parsing the individual elements which can be
  bug-prone. This also gets rid of the excessive (cast *)ing
  that was in the previous versions.

  * Marked some of the functions static and annotated some of
  the functions with __init and data with __initdata. This
  makes Sparse happy.

  * Added comments for CEDE_LATENCY_TOKEN.

  * Renamed add_pseries_idle_states() to
parse_xcede_idle_states(). Again, this is because Patch 4
and 5 from v1 are no longer there.

 * Patch 3: No functional changes, but minor changes to be consistent
   with Patch 1 and 2 of this series.
 
I have additionally tested the code on a POWER8 dedicated LPAR and
found that it has no impact, since the wakeup latency of CEDE(1) is
30us, which is greater than the default latency that we are assuming
for CEDE(0). So we do not need to fix up the CEDE(0) latency on POWER8.
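
(For reference: 0x3c00 tb-ticks is 15360 ticks, and assuming the usual
512 MHz POWER timebase that is 15360 / 512 = 30us, matching the 30us
quoted above.)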

Vaidy, I have removed your Reviewed-by for v1, since the code has
changed a little bit.

Gautham R. Shenoy (3):
  cpuidle-pseries: Set the latency-hint before entering CEDE
  cpuidle-pseries: Add function to parse extended CEDE records
  cpuidle-pseries : Fixup exit latency for CEDE(0)

 drivers/cpuidle/cpuidle-pseries.c | 190 +-
 1 file changed, 188 insertions(+), 2 deletions(-)

-- 
1.9.4



[PATCH v3 2/3] cpuidle-pseries: Add function to parse extended CEDE records

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.

The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.

This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.

dmesg on a POWER8 and POWER9 LPAR, demonstrating the output of parsing
the extended CEDE latency parameters are as follows

POWER8
[   10.093279] xcede : xcede_record_size = 10
[   10.093285] xcede : Record 0 : hint = 1, latency = 0x3c00 tb ticks, Wake-on-irq = 1
[   10.093291] xcede : Record 1 : hint = 2, latency = 0x4e2000 tb ticks, Wake-on-irq = 0
[   10.093297] cpuidle : Skipping the 2 Extended CEDE idle states

POWER9
[5.913180] xcede : xcede_record_size = 10
[5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb ticks, Wake-on-irq = 1
[5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb ticks, Wake-on-irq = 0
[5.913193] cpuidle : Skipping the 2 Extended CEDE idle states

Signed-off-by: Gautham R. Shenoy 
---
v2-->v3 : Cleaned up parse_cede_parameters(). Silenced some sparse warnings.
drivers/cpuidle/cpuidle-pseries.c | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index f5865a2..f528da7 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
@@ -87,6 +88,137 @@ static void check_and_cede_processor(void)
 }
 
#define NR_DEDICATED_STATES	2 /* snooze, CEDE */
+/*
+ * XCEDE : Extended CEDE states discovered through the
+ * "ibm,get-systems-parameter" rtas-call with the token
+ * CEDE_LATENCY_TOKEN
+ */
+#define MAX_XCEDE_STATES   4
+#define XCEDE_LATENCY_RECORD_SIZE	10
+#define XCEDE_LATENCY_PARAM_MAX_LENGTH	(2 + 2 + \
+			(MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
+
+/*
+ * Section 7.3.16 System Parameters Option of PAPR version 2.8.1 has a
+ * table with all the parameters to ibm,get-system-parameters.
+ * CEDE_LATENCY_TOKEN corresponds to the token value for Cede Latency
+ * Settings Information.
+ */
+#define CEDE_LATENCY_TOKEN 45
+
+/*
+ * If the platform supports the cede latency settings
+ * information system parameter it must provide the following
+ * information in the NULL terminated parameter string:
+ *
+ * a. The first byte is the length “N” of each cede
+ *latency setting record minus one (zero indicates a length
+ *of 1 byte).
+ *
+ * b. For each supported cede latency setting a cede latency
+ *setting record consisting of the first “N” bytes as per
+ *the following table.
+ *
+ * -----------------------------
+ * | Field           | Field   |
+ * | Name            | Length  |
+ * -----------------------------
+ * | Cede Latency    | 1 Byte  |
+ * | Specifier Value |         |
+ * -----------------------------
+ * | Maximum wakeup  |         |
+ * | latency in      | 8 Bytes |
+ * | tb-ticks        |         |
+ * -----------------------------
+ * | Responsive to   |         |
+ * | external        | 1 Byte  |
+ * | interrupts      |         |
+ * -----------------------------
+ *
+ * This version has cede latency record size = 10.
+ *
+ * The structure xcede_latency_payload represents a) and b) with
+ * xcede_latency_record representing the table in b).
+ *
+ * xcede_latency_parameter is what gets returned by
+ * ibm,get-systems-parameter rtas-call when made with
+ * CEDE_LATENCY_TOKEN.
+ *
+ * These structures are only used to represent the data obtained
+ * by the rtas-call. The data is in Big-Endian.
+ */
+struct xcede_latency_record {
+   u8  hint;
+   __be64  latency_ticks;
+   u8  wake_on_irqs;
+} __packed;
+
+struct xcede_latency_payload {
+   u8 record_size;
+   struct xcede_latency_record records[MAX_XCEDE_STATES];
+} __packed;
+
+struct xcede_latency_parameter {
+   __be16  payload_size;
+   struct xcede_latency_payload payload;
+   u8 null_char;
+} __packed;
+
+static unsigned int nr_xcede_records;
+static struct xcede_latency_parameter xcede_latency_parameter __initdata;
+
+static int __init parse_cede_parameters(void)
+{
+   int ret, i;
+   u16 payload_size;
+   u8 xcede_record_size;
+   u32 total_xcede_records_size;
+   struct xcede_latency_payload *pa

[PATCH v3 1/3] cpuidle-pseries: Set the latency-hint before entering CEDE

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

As per the PAPR, each H_CEDE call is associated with a latency-hint to
be passed in the VPA field "cede_latency_hint". The CEDE states that
we were implicitly entering so far is CEDE with latency-hint = 0.

This patch explicitly sets the latency hint corresponding to the CEDE
state that we are currently entering. While at it, we save the
previous hint, to be restored once we wakeup from CEDE. This will be
required in the future when we expose extended-cede states through the
cpuidle framework, where each of them will have a different
cede-latency hint.

Signed-off-by: Gautham R. Shenoy 
---
v2-->v3 : Got rid of the unused NR_CEDE_STATES definition

 drivers/cpuidle/cpuidle-pseries.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 3e058ad2..f5865a2 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -86,19 +86,26 @@ static void check_and_cede_processor(void)
}
 }
 
+#define NR_DEDICATED_STATES	2 /* snooze, CEDE */
+
+u8 cede_latency_hint[NR_DEDICATED_STATES];
 static int dedicated_cede_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
+   u8 old_latency_hint;
 
pseries_idle_prolog();
get_lppaca()->donate_dedicated_cpu = 1;
+   old_latency_hint = get_lppaca()->cede_latency_hint;
+   get_lppaca()->cede_latency_hint = cede_latency_hint[index];
 
HMT_medium();
check_and_cede_processor();
 
local_irq_disable();
get_lppaca()->donate_dedicated_cpu = 0;
+   get_lppaca()->cede_latency_hint = old_latency_hint;
 
pseries_idle_epilog();
 
@@ -130,7 +137,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state dedicated_states[] = {
+static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
@@ -233,7 +240,7 @@ static int pseries_idle_probe(void)
max_idle_state = ARRAY_SIZE(shared_states);
} else {
cpuidle_state_table = dedicated_states;
-   max_idle_state = ARRAY_SIZE(dedicated_states);
+   max_idle_state = NR_DEDICATED_STATES;
}
} else
return -ENODEV;
-- 
1.9.4



[PATCH v2 1/3] cpuidle-pseries: Set the latency-hint before entering CEDE

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

As per the PAPR, each H_CEDE call is associated with a latency-hint to
be passed in the VPA field "cede_latency_hint". The CEDE states that
we were implicitly entering so far is CEDE with latency-hint = 0.

This patch explicitly sets the latency hint corresponding to the CEDE
state that we are currently entering. While at it, we save the
previous hint, to be restored once we wakeup from CEDE. This will be
required in the future when we expose extended-cede states through the
cpuidle framework, where each of them will have a different
cede-latency hint.

Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 3e058ad2..88e71c3 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -86,19 +86,27 @@ static void check_and_cede_processor(void)
}
 }
 
+#define NR_CEDE_STATES		1  /* CEDE with latency-hint 0 */
+#define NR_DEDICATED_STATES	(NR_CEDE_STATES + 1) /* Includes snooze */
+
+u8 cede_latency_hint[NR_DEDICATED_STATES];
 static int dedicated_cede_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
+   u8 old_latency_hint;
 
pseries_idle_prolog();
get_lppaca()->donate_dedicated_cpu = 1;
+   old_latency_hint = get_lppaca()->cede_latency_hint;
+   get_lppaca()->cede_latency_hint = cede_latency_hint[index];
 
HMT_medium();
check_and_cede_processor();
 
local_irq_disable();
get_lppaca()->donate_dedicated_cpu = 0;
+   get_lppaca()->cede_latency_hint = old_latency_hint;
 
pseries_idle_epilog();
 
@@ -130,7 +138,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state dedicated_states[] = {
+static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
-- 
1.9.4



[PATCH v2 3/3] cpuidle-pseries : Fixup exit latency for CEDE(0)

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

We are currently assuming that CEDE(0) has exit latency 10us, since
there is no way for us to query from the platform.  However, if the
wakeup latency of an Extended CEDE state is smaller than 10us, then we
can be sure that the exit latency of CEDE(0) cannot be more than that.

In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us.

Benchmark results:

ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
+ with_patch
    N       Min       Max    Median       Avg     Stddev
x  10   2491089   5834307   5398375   4244335 1596244.9
+  10   2893813   5834474   5832448 5327281.3 1055941.4

context_switch2 :
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
+ with_patch
    N     Min     Max  Median        Avg     Stddev
x 500  348872  362236  354712  354745.69   2711.827
+ 500  349422  361452  353942   354215.4  2576.9258
Difference at 99.0% confidence
-530.288 +/- 430.963
-0.149484% +/- 0.121485%
(Student's t, pooled s = 2645.24)

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
+ with_patch
    N     Min     Max  Median        Avg     Stddev
x 500  287956  294940  288896  288977.23  646.59295
+ 500  288300  294646  289582  290064.76  1161.9992
Difference at 99.0% confidence
1087.53 +/- 153.194
0.376337% +/- 0.0530125%
(Student's t, pooled s = 940.299)

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index b1dc24d..0b2f115 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -334,12 +334,42 @@ static int pseries_cpuidle_driver_init(void)
 static int add_pseries_idle_states(void)
 {
int nr_states = 2; /* By default we have snooze, CEDE */
+   int i;
+	u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */
 
if (parse_cede_parameters())
return nr_states;
 
-   pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
-   nr_xcede_records);
+   for (i = 0; i < nr_xcede_records; i++) {
+   u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
+   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+
+   if (latency_us < min_latency_us)
+   min_latency_us = latency_us;
+   }
+
+   /*
+* We are currently assuming that CEDE(0) has exit latency
+* 10us, since there is no way for us to query from the
+* platform.
+*
+* However, if the wakeup latency of an Extended CEDE state is
+* smaller than 10us, then we can be sure that CEDE(0)
+* requires no more than that.
+*
+* Perform the fix-up.
+*/
+   if (min_latency_us < dedicated_states[1].exit_latency) {
+   u64 cede0_latency = min_latency_us - 1;
+
+   if (cede0_latency <= 0)
+   cede0_latency = min_latency_us;
+
+   dedicated_states[1].exit_latency = cede0_latency;
+   dedicated_states[1].target_residency = 10 * (cede0_latency);
+   pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n",
+   cede0_latency);
+   }
 
return nr_states;
 }
-- 
1.9.4



[PATCH v2 2/3] cpuidle-pseries: Add function to parse extended CEDE records

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.

The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.

This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.

dmesg on a POWER9 LPAR, demonstrating the output of parsing the
extended CEDE latency parameters.

[5.913180] xcede : xcede_record_size = 10
[5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb-ticks, Wake-on-irq = 1
[5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb-ticks, Wake-on-irq = 0
[5.913193] cpuidle : Skipping the 2 Extended CEDE idle states

Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 129 +-
 1 file changed, 127 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 88e71c3..b1dc24d 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
@@ -86,9 +87,120 @@ static void check_and_cede_processor(void)
}
 }
 
-#define NR_CEDE_STATES 1  /* CEDE with latency-hint 0 */
+struct xcede_latency_records {
+   u8  latency_hint;
+   u64 wakeup_latency_tb_ticks;
+   u8  responsive_to_irqs;
+};
+
+/*
+ * XCEDE : Extended CEDE states discovered through the
+ * "ibm,get-systems-parameter" rtas-call with the token
+ * CEDE_LATENCY_TOKEN
+ */
+#define MAX_XCEDE_STATES   4
+#define XCEDE_LATENCY_RECORD_SIZE	10
+#define XCEDE_LATENCY_PARAM_MAX_LENGTH	(2 + 2 + \
+			(MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
+
+#define CEDE_LATENCY_TOKEN 45
+
+#define NR_CEDE_STATES	(MAX_XCEDE_STATES + 1) /* CEDE with latency-hint 0 */
 #define NR_DEDICATED_STATES	(NR_CEDE_STATES + 1) /* Includes snooze */
 
+struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
+unsigned int nr_xcede_records;
+char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
+
+static int parse_cede_parameters(void)
+{
+   int ret = -1, i;
+   u16 payload_length;
+   u8 xcede_record_size;
+   u32 total_xcede_records_size;
+   char *payload;
+
+   memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+   ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+   NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
+   XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+   if (ret) {
+   pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
+   return ret;
+   }
+
+	payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
+	payload = &xcede_parameters[2];
+
+   /*
+* If the platform supports the cede latency settings
+* information system parameter it must provide the following
+* information in the NULL terminated parameter string:
+*
+* a. The first byte is the length “N” of each cede
+*latency setting record minus one (zero indicates a length
+*of 1 byte).
+*
+* b. For each supported cede latency setting a cede latency
+*setting record consisting of the first “N” bytes as per
+*the following table.
+*
+*  -----------------------------
+*  | Field           | Field   |
+*  | Name            | Length  |
+*  -----------------------------
+*  | Cede Latency    | 1 Byte  |
+*  | Specifier Value |         |
+*  -----------------------------
+*  | Maximum wakeup  |         |
+*  | latency in      | 8 Bytes |
+*  | tb-ticks        |         |
+*  -----------------------------
+*  | Responsive to   |         |
+*  | external        | 1 Byte  |
+*  | interrupts      |         |
+*  -----------------------------
+*
+* This version has cede latency record size = 10.
+*/
+   xcede_record_size = (u8)payload[0] + 1;
+
+   if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) {
+   pr_err("xcede : Expected record-size %d. Observed size %d.\n",
+  XCEDE_LATENCY_RECORD_SIZE, xcede_record_size);
+   return -EINVAL;
+

[PATCH v2 0/3] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This is a v2 of the patch series to parse the extended CEDE
information in the pseries-cpuidle driver.

The v1 of this patchset can be found here :
https://lore.kernel.org/linuxppc-dev/1594120299-31389-1-git-send-email-...@linux.vnet.ibm.com/

The change from v1 --> v2 :

 * Dropped Patches 4 and 5 which would expose extended idle-states
   that wake up on external interrupts to the cpuidle framework. These
   were RFC patches in v1. Dropped them because currently the only
   extended CEDE state that wakes up on external interrupts is CEDE(1),
   which adds no significant value over CEDE(0).
   
 * Rebased the patches onto powerpc/merge.
 
 * No changes in code for Patches 1-3.

Motivation:
===
On pseries Dedicated Linux LPARs, apart from the polling snooze idle
state, we currently have the CEDE idle state which cedes the CPU to
the hypervisor with latency-hint = 0.

However, the PowerVM hypervisor supports additional extended CEDE
states, which can be queried through the "ibm,get-systems-parameter"
rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these
extended CEDE states to appropriate platform idle-states in order to
provide energy-savings as well as shifting power to the active
units. On existing pseries LPARs today we have extended CEDE with
latency-hints {1,2} supported.

The patches in this patchset add code to parse the CEDE latency
records provided by the hypervisor. We use this information to
determine the wakeup latency of the regular CEDE (which we have so
far been hardcoding to 10us, while experimentally it is much lower,
~1us), by looking at the wakeup latency provided by the hypervisor for
Extended CEDE states. Since the platform currently advertises Extended
CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup
latency of the regular CEDE is no more than this.
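
(For reference, the 2us figure follows from the advertised 0x400
tb-ticks for Extended CEDE 1: assuming the usual 512 MHz POWER
timebase, 1024 ticks / 512 ticks-per-us = 2us.)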

With Patches 1-3, we see an improvement in the single-threaded
performance on ebizzy.

2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s (higher is better) with patches 1-3.
x without_patches
* with_patches
    N       Min       Max    Median       Avg     Stddev
x  10   2491089   5834307   5398375   4244335 1596244.9
*  10   2893813   5834474   5832448 5327281.3 1055941.4

We do not observe any major regression in either the context_switch2
benchmark or the schbench benchmark

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
* with_patch
    N     Min     Max  Median        Avg     Stddev
x 500  348872  362236  354712  354745.69   2711.827
* 500  349422  361452  353942   354215.4  2576.9258

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
* with_patch
    N     Min     Max  Median        Avg     Stddev
x 500  287956  294940  288896  288977.23  646.59295
* 500  288300  294646  289582  290064.76  1161.9992

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
    min=0, max=29812

Gautham R. Shenoy (3):
  cpuidle-pseries: Set the latency-hint before entering CEDE
  cpuidle-pseries: Add function to parse extended CEDE records
  cpuidle-pseries : Fixup exit latency for CEDE(0)

 drivers/cpuidle/cpuidle-pseries.c | 167 +-
 1 file changed, 165 insertions(+), 2 deletions(-)

-- 
1.9.4



Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-27 Thread Gautham R Shenoy
Hello Rafael,

On Mon, Jul 27, 2020 at 04:14:12PM +0200, Rafael J. Wysocki wrote:
> On Tue, Jul 7, 2020 at 1:32 PM Gautham R Shenoy  
> wrote:
> >
> > Hi,
> >
> > On Tue, Jul 07, 2020 at 04:41:34PM +0530, Gautham R. Shenoy wrote:
> > > From: "Gautham R. Shenoy" 
> > >
> > > Hi,
> > >
> > >
> > >
> > >
> > > Gautham R. Shenoy (5):
> > >   cpuidle-pseries: Set the latency-hint before entering CEDE
> > >   cpuidle-pseries: Add function to parse extended CEDE records
> > >   cpuidle-pseries : Fixup exit latency for CEDE(0)
> > >   cpuidle-pseries : Include extended CEDE states in cpuidle framework
> > >   cpuidle-pseries: Block Extended CEDE(1) which adds no additional
> > > value.
> >
> > Forgot to mention that these patches are on top of Nathan's series to
> > remove extended CEDE offline and bogus topology update code :
> > https://lore.kernel.org/linuxppc-dev/20200612051238.1007764-1-nath...@linux.ibm.com/
> 
> OK, so this is targeted at the powerpc maintainers, isn't it?

Yes, the code is powerpc specific.

Also, I noticed that Nathan's patches have been merged by Michael
Ellerman in the powerpc/merge tree. I will rebase and post a v2 of
this patch series.

--
Thanks and Regards
gautham.


Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-27 Thread Gautham R Shenoy
Hi Srikar,

On Mon, Jul 27, 2020 at 11:02:29AM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

This version looks good to me.

Reviewed-by: Gautham R. Shenoy 


> ---
> Changelog v3 ->v4:
>   if coregroup_support doesn't exist, update MC mask to the next
>   smaller domain mask.
> 
> Changelog v2 -> v3:
>   Add optimization for mask updation under coregroup_support
> 
> Changelog v1 -> v2:
>   Moved coregroup topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/include/asm/topology.h | 10 +++
>  arch/powerpc/kernel/smp.c   | 44 +
>  arch/powerpc/mm/numa.c  |  5 
>  3 files changed, 59 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 
> *cpu2_assoc)
> 
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
>  extern int find_and_online_cpu_nid(int cpu);
> +extern int cpu_to_coregroup_id(int cpu);
>  #else
>  static inline int find_and_online_cpu_nid(int cpu)
>  {
>   return 0;
>  }
> 
> +static inline int cpu_to_coregroup_id(int cpu)
> +{
> +#ifdef CONFIG_SMP
> + return cpu_to_core_id(cpu);
> +#else
> + return 0;
> +#endif
> +}
> +
>  #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
> 
>  #include 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index dab96a1203ec..95f0bf72e283 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
> +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
> 
>  EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> @@ -91,6 +92,7 @@ enum {
>   smt_idx,
>  #endif
>   bigcore_idx,
> + mc_idx,
>   die_idx,
>  };
> 
> @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> + return per_cpu(cpu_coregroup_map, cpu);
> +}
> +
> +static bool has_coregroup_support(void)
> +{
> + return coregroup_enabled;
> +}
> +
> +static const struct cpumask *cpu_mc_mask(int cpu)
> +{
> + return cpu_coregroup_mask(cpu);
> +}
> +
>  static const struct cpumask *cpu_bigcore_mask(int cpu)
>  {
>   return per_cpu(cpu_sibling_map, cpu);
> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level 
> powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>   { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> + { cpu_mc_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };
> @@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   GFP_KERNEL, cpu_to_node(cpu));
> >   zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
>   GFP_KERNEL, cpu_to_node(cpu));
> + if (has_coregroup_support())
> > + zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
> + GFP_KERNEL, cpu_to_node(cpu));
> +
>  #ifdef CONFIG_NEED_MULTIPLE_NODES
>   /*
>* numa_node_id() works after this.
> @@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>   cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
>   cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
> 
> + if (has_coregroup_support())
> + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> +
>   init_big_cores();
>   if (has_big_cores) {
>   cpumask_set_cpu(boot_cpuid,
> @@ -1233,6 +1258,8 @@ static void remove_

Re: [PATCH v3 09/10] powerpc/smp: Create coregroup domain

2020-07-26 Thread Gautham R Shenoy
On Thu, Jul 23, 2020 at 02:21:15PM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v2 -> v3:
>   Add optimization for mask updation under coregroup_support
> 
> Changelog v1 -> v2:
>   Moved coregroup topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/include/asm/topology.h | 10 +++
>  arch/powerpc/kernel/smp.c   | 44 +
>  arch/powerpc/mm/numa.c  |  5 
>  3 files changed, 59 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 
> *cpu2_assoc)
> 
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
>  extern int find_and_online_cpu_nid(int cpu);
> +extern int cpu_to_coregroup_id(int cpu);
>  #else
>  static inline int find_and_online_cpu_nid(int cpu)
>  {
>   return 0;
>  }
> 
> +static inline int cpu_to_coregroup_id(int cpu)
> +{
> +#ifdef CONFIG_SMP
> + return cpu_to_core_id(cpu);
> +#else
> + return 0;
> +#endif
> +}
> +
>  #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
> 
>  #include 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 7d8d44cbab11..1faedde3e406 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
> +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
> 
>  EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> @@ -91,6 +92,7 @@ enum {
>   smt_idx,
>  #endif
>   bigcore_idx,
> + mc_idx,
>   die_idx,
>  };
> 
> @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> + return per_cpu(cpu_coregroup_map, cpu);
> +}
> +
> +static bool has_coregroup_support(void)
> +{
> + return coregroup_enabled;
> +}
> +
> +static const struct cpumask *cpu_mc_mask(int cpu)
> +{
> + return cpu_coregroup_mask(cpu);
> +}
> +
>  static const struct cpumask *cpu_bigcore_mask(int cpu)
>  {
>   return per_cpu(cpu_sibling_map, cpu);
> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level 
> powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>   { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> + { cpu_mc_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };

[..snip..]

> @@ -1384,6 +1425,9 @@ int setup_profiling_timer(unsigned int multiplier)
> 
>  static void fixup_topology(void)
>  {
> + if (!has_coregroup_support())
> + powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> +
>   if (shared_caches) {
>   pr_info("Using shared cache scheduler topology\n");
>   powerpc_topology[bigcore_idx].mask = shared_cache_mask;


Suppose we consider a topology which does not have coregroup_support,
but has shared_caches. In that case, we would want our coregroup
domain to degenerate.

From the above code, after the fixup, our topology will look as
follows:

static struct sched_domain_topology_level powerpc_topology[] = {
{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
{ cpu_bigcore_mask, SD_INIT_NAME(MC) },
{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
{ NULL, },
};

So, in this case, the core-group domain (identified by MC) will
degenerate only if cpu_bigcore_mask() and shared_cache_mask() return
the same value. This may work for existing platforms, because either
shared_caches don't exist, or when they do, cpu_bigcore_mask and
shared_cache_mask return the same set of CPUs. But this may or may not
continue to hold good in the future.
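
One way to get that (an untested sketch of the reordering, reusing the
fixup_topology(), mc_idx and bigcore_idx names from the hunk above)
would be to apply the coregroup fallback after the shared-cache fixup:

	static void fixup_topology(void)
	{
		if (shared_caches) {
			pr_info("Using shared cache scheduler topology\n");
			powerpc_topology[bigcore_idx].mask = shared_cache_mask;
		}

		/*
		 * Fall back only after the CACHE fixup above, so that MC
		 * tracks whatever mask the BIGCORE/CACHE level ended up
		 * with and the MC domain degenerates cleanly.
		 */
		if (!has_coregroup_support())
			powerpc_topology[mc_idx].mask =
				powerpc_topology[bigcore_idx].mask;
	}

That way, with shared_caches set but no coregroup support, CACHE and MC
always end up with identical masks, independent of whether
cpu_bigcore_mask() and shared_cache_mask() return the same CPUs.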

Re: [PATCH v3 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-07-24 Thread Gautham R Shenoy


On Thu, Jul 23, 2020 at 02:21:11PM +0530, Srikar Dronamraju wrote:
> Current code assumes that cpumask of cpus sharing a l2-cache mask will
> always be a superset of cpu_sibling_mask.
> 
> Lets stop that assumption. cpu_l2_cache_mask is a superset of
> cpu_sibling_mask if and only if shared_caches is set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

Reviewed-by: Gautham R. Shenoy 

> ---
> Changelog v1 -> v2:
>   Set cpumask after verifying l2-cache. (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index da27f6909be1..d997c7411664 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1194,6 +1194,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
> *(*mask_fn)(int))
>   if (!l2_cache)
>   return false;
> 
> + cpumask_set_cpu(cpu, mask_fn(cpu));
>   for_each_cpu(i, cpu_online_mask) {
>   /*
>* when updating the marks the current CPU has not been marked
> @@ -1276,29 +1277,30 @@ static void add_cpu_to_masks(int cpu)
>* add it to it's own thread sibling mask.
>*/
>   cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> + cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
>   for (i = first_thread; i < first_thread + threads_per_core; i++)
>   if (cpu_online(i))
>   set_cpus_related(i, cpu, cpu_sibling_mask);
> 
>   add_cpu_to_smallcore_masks(cpu);
> - /*
> -  * Copy the thread sibling mask into the cache sibling mask
> -  * and mark any CPUs that share an L2 with this CPU.
> -  */
> - for_each_cpu(i, cpu_sibling_mask(cpu))
> - set_cpus_related(cpu, i, cpu_l2_cache_mask);
>   update_mask_by_l2(cpu, cpu_l2_cache_mask);
> 
> - /*
> -  * Copy the cache sibling mask into core sibling mask and mark
> -  * any CPUs on the same chip as this CPU.
> -  */
> - for_each_cpu(i, cpu_l2_cache_mask(cpu))
> - set_cpus_related(cpu, i, cpu_core_mask);
> + if (pkg_id == -1) {
> + struct cpumask *(*mask)(int) = cpu_sibling_mask;
> +
> + /*
> +  * Copy the sibling mask into core sibling mask and
> +  * mark any CPUs on the same chip as this CPU.
> +  */
> + if (shared_caches)
> + mask = cpu_l2_cache_mask;
> +
> + for_each_cpu(i, mask(cpu))
> + set_cpus_related(cpu, i, cpu_core_mask);
> 
> - if (pkg_id == -1)
>   return;
> + }
> 
>   for_each_cpu(i, cpu_online_mask)
>   if (get_physical_package_id(i) == pkg_id)
> -- 
> 2.18.2
> 


Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-07-24 Thread Gautham R Shenoy
On Wed, Jul 22, 2020 at 12:27:47PM +0530, Srikar Dronamraju wrote:
> * Gautham R Shenoy  [2020-07-22 11:51:14]:
> 
> > Hi Srikar,
> > 
> > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > > index 72f16dc0cb26..57468877499a 100644
> > > --- a/arch/powerpc/kernel/smp.c
> > > +++ b/arch/powerpc/kernel/smp.c
> > > @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct 
> > > cpumask *(*mask_fn)(int))
> > >   if (!l2_cache)
> > >   return false;
> > > 
> > > + cpumask_set_cpu(cpu, mask_fn(cpu));
> > 
> > 
> > Ok, we need to do this because "cpu" is not yet set in the
> > cpu_online_mask. Prior to your patch the "cpu" was getting set in
> > cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in
> > the patch.
> > 
> 
> Right.
> 
> > 
> > >   for_each_cpu(i, cpu_online_mask) {
> > >   /*
> > >* when updating the marks the current CPU has not been marked
> > > @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
> > >* add it to it's own thread sibling mask.
> > >*/
> > >   cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> > > + cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
> Note: Above, we are explicitly setting the cpu_core_mask.

You are right. I missed this.

> 
> > > 
> > >   for (i = first_thread; i < first_thread + threads_per_core; i++)
> > >   if (cpu_online(i))
> > >   set_cpus_related(i, cpu, cpu_sibling_mask);
> > > 
> > >   add_cpu_to_smallcore_masks(cpu);
> > > - /*
> > > -  * Copy the thread sibling mask into the cache sibling mask
> > > -  * and mark any CPUs that share an L2 with this CPU.
> > > -  */
> > > - for_each_cpu(i, cpu_sibling_mask(cpu))
> > > - set_cpus_related(cpu, i, cpu_l2_cache_mask);
> > >   update_mask_by_l2(cpu, cpu_l2_cache_mask);
> > > 
> > > - /*
> > > -  * Copy the cache sibling mask into core sibling mask and mark
> > > -  * any CPUs on the same chip as this CPU.
> > > -  */
> > > - for_each_cpu(i, cpu_l2_cache_mask(cpu))
> > > - set_cpus_related(cpu, i, cpu_core_mask);
> > > + if (pkg_id == -1) {
> > 
> > I suppose this "if" condition is an optimization, since if pkg_id != -1,
> > we anyway set these CPUs in the cpu_core_mask below.
> > 
> > However...
> 
> This is not just an optimization.
> The hunk removed would only work if cpu_l2_cache_mask is bigger than
> cpu_sibling_mask. (this was the previous assumption that we want to break)
> If the cpu_sibling_mask is bigger than cpu_l2_cache_mask and pkg_id is -1,
> then setting only cpu_l2_cache_mask in cpu_core_mask will result in a broken 
> topology.
> 
> > 
> > > + struct cpumask *(*mask)(int) = cpu_sibling_mask;
> > > +
> > > + /*
> > > +  * Copy the sibling mask into core sibling mask and
> > > +  * mark any CPUs on the same chip as this CPU.
> > > +  */
> > > + if (shared_caches)
> > > + mask = cpu_l2_cache_mask;
> > > +
> > > + for_each_cpu(i, mask(cpu))
> > > + set_cpus_related(cpu, i, cpu_core_mask);
> > > 
> > > - if (pkg_id == -1)
> > >   return;
> > > + }
> > 
> > 
> > ... since "cpu" is not yet set in the cpu_online_mask, do we not miss 
> > setting
> > "cpu" in the cpu_core_mask(cpu) in the for-loop below ?
> > 
> > 
> 
> As noted above, we are setting it before. So we don't miss the cpu, and
> hence it is no different from before.


Fair enough.

> 
> > --
> > Thanks and Regards
> > gautham.
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju


Re: [PATCH v3 04/10] powerpc/smp: Move topology fixups into a new function

2020-07-24 Thread Gautham R Shenoy
On Thu, Jul 23, 2020 at 02:21:10PM +0530, Srikar Dronamraju wrote:
> Move topology fixup based on the platform attributes into its own
> function which is called just before set_sched_topology.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 


Reviewed-by: Gautham R. Shenoy 
> ---
> Changelog v2 -> v3:
>   Rewrote changelog (Gautham)
>   Renamed to powerpc/smp: Move topology fixups into  a new function
> 
>  arch/powerpc/kernel/smp.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index a685915e5941..da27f6909be1 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1368,6 +1368,16 @@ int setup_profiling_timer(unsigned int multiplier)
>   return 0;
>  }
> 
> +static void fixup_topology(void)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + if (has_big_cores) {
> + pr_info("Big cores detected but using small core scheduling\n");
> + powerpc_topology[0].mask = smallcore_smt_mask;
> + }
> +#endif
> +}
> +
>  void __init smp_cpus_done(unsigned int max_cpus)
>  {
>   /*
> @@ -1381,12 +1391,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
> 
>   dump_numa_cpu_topology();
> 
> -#ifdef CONFIG_SCHED_SMT
> - if (has_big_cores) {
> - pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> - }
> -#endif
> + fixup_topology();
>   set_sched_topology(powerpc_topology);
>  }
> 
> -- 
> 2.18.2
> 


Re: [PATCH v3 02/10] powerpc/smp: Merge Power9 topology with Power topology

2020-07-24 Thread Gautham R Shenoy
On Thu, Jul 23, 2020 at 02:21:08PM +0530, Srikar Dronamraju wrote:
> A new sched_domain_topology_level was added just for Power9. However the
> same can be achieved by merging powerpc_topology with power9_topology
> and makes the code more simpler especially when adding a new sched
> domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

LGTM.


Reviewed-by: Gautham R. Shenoy 

> ---
> Changelog v1 -> v2:
>   Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
>   since cpu_smt_mask is only defined under CONFIG_SCHED_SMT
> 
>  arch/powerpc/kernel/smp.c | 33 ++---
>  1 file changed, 10 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index edf94ca64eea..283a04e54f52 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier)
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> -/* cpumask of CPUs with asymetric SMT dependancy */
> +/* cpumask of CPUs with asymmetric SMT dependency */
>  static int powerpc_smt_flags(void)
>  {
>   int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
> @@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void)
>  }
>  #endif
> 
> -static struct sched_domain_topology_level powerpc_topology[] = {
> -#ifdef CONFIG_SCHED_SMT
> - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
> -#endif
> - { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> - { NULL, },
> -};
> -
>  /*
>   * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
>   * This topology makes it *much* cheaper to migrate tasks between adjacent 
> cores
> @@ -1351,7 +1343,13 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> - return cpu_l2_cache_mask(cpu);
> + if (shared_caches)
> + return cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + return cpu_smallcore_mask(cpu);
> +
> + return per_cpu(cpu_sibling_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -1361,7 +1359,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> -static struct sched_domain_topology_level power9_topology[] = {
> +static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> @@ -1386,21 +1384,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
>  #ifdef CONFIG_SCHED_SMT
>   if (has_big_cores) {
>   pr_info("Big cores detected but using small core scheduling\n");
> - power9_topology[0].mask = smallcore_smt_mask;
>   powerpc_topology[0].mask = smallcore_smt_mask;
>   }
>  #endif
> - /*
> -  * If any CPU detects that it's sharing a cache with another CPU then
> -  * use the deeper topology that is aware of this sharing.
> -  */
> - if (shared_caches) {
> - pr_info("Using shared cache scheduler topology\n");
> - set_sched_topology(power9_topology);
> - } else {
> - pr_info("Using standard scheduler topology\n");
> - set_sched_topology(powerpc_topology);
> - }
> + set_sched_topology(powerpc_topology);
>  }
> 
>  #ifdef CONFIG_HOTPLUG_CPU
> -- 
> 2.18.2
> 


Re: [PATCH v2 10/10] powerpc/smp: Implement cpu_to_coregroup_id

2020-07-22 Thread Gautham R Shenoy
On Tue, Jul 21, 2020 at 05:08:14PM +0530, Srikar Dronamraju wrote:
> Lookup the coregroup id from the associativity array.
> 
> If unable to detect the coregroup id, fallback on the core id.
> This way, we ensure the sched_domain degenerates and an extra sched domain is
> not created.
> 
> Ideally this function should have been implemented in
> arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> don't need to find the primary domain again.
> 
> If the device-tree mentions more than one coregroup, then kernel
> implements only the last or the smallest coregroup, which currently
> corresponds to the penultimate domain in the device-tree.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

Looks good to me.

Reviewed-by : Gautham R. Shenoy 


> ---
> Changelog v1 -> v2:
> powerpc/smp: Implement cpu_to_coregroup_id
>   Move coregroup_enabled before getting associativity (Gautham)
> 
>  arch/powerpc/mm/numa.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index ef8aa580da21..ae57b68beaee 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
> 
>  int cpu_to_coregroup_id(int cpu)
>  {
> + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> + int index;
> +
> + if (cpu < 0 || cpu > nr_cpu_ids)
> + return -1;
> +
> + if (!coregroup_enabled)
> + goto out;
> +
> + if (!firmware_has_feature(FW_FEATURE_VPHN))
> + goto out;
> +
> + if (vphn_get_associativity(cpu, associativity))
> + goto out;
> +
> + index = of_read_number(associativity, 1);
> + if (index > min_common_depth + 1)
> + return of_read_number(&associativity[index - 1], 1);
> +
> +out:
>   return cpu_to_core_id(cpu);
>  }
> 
> -- 
> 2.17.1
> 
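
To make the "penultimate domain" indexing above concrete, consider a
purely hypothetical associativity array (the values are invented for
illustration; only the indexing logic comes from the patch):

	/*
	 * associativity[0] = 4    number of domain entries that follow
	 * associativity[1] = node id        (most general)
	 * associativity[2] = chip id
	 * associativity[3] = coregroup id   (penultimate domain)
	 * associativity[4] = core id        (most specific)
	 *
	 * index = of_read_number(associativity, 1) = 4, and provided
	 * index > min_common_depth + 1, the function returns
	 * of_read_number(&associativity[index - 1], 1), i.e.
	 * associativity[3], the coregroup id one level above the core.
	 */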


Re: [PATCH v2 09/10] Powerpc/smp: Create coregroup domain

2020-07-22 Thread Gautham R Shenoy
Hi Srikar,

On Tue, Jul 21, 2020 at 05:08:13PM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

A query below.

> ---
> Changelog v1 -> v2:
> Powerpc/smp: Create coregroup domain
>   Moved coregroup topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/include/asm/topology.h | 10 
>  arch/powerpc/kernel/smp.c   | 38 +
>  arch/powerpc/mm/numa.c  |  5 
>  3 files changed, 53 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h

[..snip..]

> @@ -91,6 +92,7 @@ enum {
>   smt_idx,
>  #endif
>   bigcore_idx,
> + mc_idx,
>   die_idx,
>  };
> 


[..snip..]

> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level 
> powerpc_topology[] = {
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>   { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> + { cpu_mc_mask, SD_INIT_NAME(MC) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };


[..snip..]

> @@ -1386,6 +1421,9 @@ int setup_profiling_timer(unsigned int multiplier)
> 
>  static void fixup_topology(void)
>  {
> + if (!has_coregroup_support())
> + powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
> +

Shouldn't we move this condition after doing the fixup for shared
caches ? Because if we have shared_caches, but not core_group, then we
want the coregroup domain to degenerate correctly.


>   if (shared_caches) {
>   pr_info("Using shared cache scheduler topology\n");
>   powerpc_topology[bigcore_idx].mask = shared_cache_mask;


--
Thanks and regards
gautham.


Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-22 Thread Gautham R Shenoy
Hello Srikar,

On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.
> 
> While setting up the "CACHE" domain, check if shared_cache is already
> set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
> powerpc/smp: Generalize 2nd sched domain
>   Moved shared_cache topology fixup to fixup_topology (Gautham)
>

Just one comment below.

>  arch/powerpc/kernel/smp.c | 49 ---
>  1 file changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 57468877499a..933ebdf97432 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
> 
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> + smt_idx,
> +#endif
> + bigcore_idx,
> + die_idx,
> +};
> +


[..snip..]

> @@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
>   /* Update topology CPU masks */
>   add_cpu_to_masks(cpu);
> 
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
>   /*
>* Check for any shared caches. Note that this must be done on a
>* per-core basis because one core in the pair might be disabled.
>*/
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;

At the risk of repeating my comment to the v1 version of the patch, we
have shared caches only if l2_cache_mask(cpu) is a strict superset of
sibling_mask(cpu).

"cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not
capture this.

Could we please use

  if (!cpumask_equal(sibling_mask(cpu), mask) &&
  cpumask_subset(sibling_mask(cpu), mask)) {
  }

?


> + }
> 
>   set_numa_node(numa_cpu_lookup_table[cpu]);
>   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1374,10 +1386,19 @@ int setup_profiling_timer(unsigned int multiplier)
> 
>  static void fixup_topology(void)
>  {
> + if (shared_caches) {
> + pr_info("Using shared cache scheduler topology\n");
> + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> +#ifdef CONFIG_SCHED_DEBUG
> + powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> + powerpc_topology[bigcore_idx].sd_flags = 
> powerpc_shared_cache_flags;
> + }
> +
>  #ifdef CONFIG_SCHED_SMT
>   if (has_big_cores) {
>   pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>   }
>  #endif


Otherwise the patch looks good to me.

--
Thanks and Regards
gautham.


Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling

2020-07-22 Thread Gautham R Shenoy
Hi Srikar,

On Tue, Jul 21, 2020 at 05:08:09PM +0530, Srikar Dronamraju wrote:
> Current code assumes that cpumask of cpus sharing a l2-cache mask will
> always be a superset of cpu_sibling_mask.
> 
> Lets stop that assumption. cpu_l2_cache_mask is a superset of
> cpu_sibling_mask if and only if shared_caches is set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
> powerpc/smp: Dont assume l2-cache to be superset of sibling
>   Set cpumask after verifying l2-cache. (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 72f16dc0cb26..57468877499a 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask 
> *(*mask_fn)(int))
>   if (!l2_cache)
>   return false;
> 
> + cpumask_set_cpu(cpu, mask_fn(cpu));


Ok, we need to do this because "cpu" is not yet set in the
cpu_online_mask. Prior to your patch the "cpu" was getting set in
cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in
the patch.


>   for_each_cpu(i, cpu_online_mask) {
>   /*
>* when updating the marks the current CPU has not been marked
> @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu)
>* add it to it's own thread sibling mask.
>*/
>   cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> + cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> 
>   for (i = first_thread; i < first_thread + threads_per_core; i++)
>   if (cpu_online(i))
>   set_cpus_related(i, cpu, cpu_sibling_mask);
> 
>   add_cpu_to_smallcore_masks(cpu);
> - /*
> -  * Copy the thread sibling mask into the cache sibling mask
> -  * and mark any CPUs that share an L2 with this CPU.
> -  */
> - for_each_cpu(i, cpu_sibling_mask(cpu))
> - set_cpus_related(cpu, i, cpu_l2_cache_mask);
>   update_mask_by_l2(cpu, cpu_l2_cache_mask);
> 
> - /*
> -  * Copy the cache sibling mask into core sibling mask and mark
> -  * any CPUs on the same chip as this CPU.
> -  */
> - for_each_cpu(i, cpu_l2_cache_mask(cpu))
> - set_cpus_related(cpu, i, cpu_core_mask);
> + if (pkg_id == -1) {

I suppose this "if" condition is an optimization, since if pkg_id != -1,
we anyway set these CPUs in the cpu_core_mask below.

However...

> + struct cpumask *(*mask)(int) = cpu_sibling_mask;
> +
> + /*
> +  * Copy the sibling mask into core sibling mask and
> +  * mark any CPUs on the same chip as this CPU.
> +  */
> + if (shared_caches)
> + mask = cpu_l2_cache_mask;
> +
> + for_each_cpu(i, mask(cpu))
> + set_cpus_related(cpu, i, cpu_core_mask);
> 
> - if (pkg_id == -1)
>   return;
> + }


... since "cpu" is not yet set in the cpu_online_mask, do we not miss setting
"cpu" in the cpu_core_mask(cpu) in the for-loop below ?


> 
>   for_each_cpu(i, cpu_online_mask)
>   if (get_physical_package_id(i) == pkg_id)


Before this patch it was unconditionally getting set in
cpu_core_mask(cpu) because of the fact that it was set in
cpu_l2_cache_mask(cpu) and we were unconditionally setting all the
CPUs in cpu_l2_cache_mask(cpu) in cpu_core_mask(cpu).

What am I missing ?

> -- 
> 2.17.1
>

--
Thanks and Regards
gautham.


Re: [PATCH v2 04/10] powerpc/smp: Enable small core scheduling sooner

2020-07-22 Thread Gautham R Shenoy
Hello Srikar,

On Tue, Jul 21, 2020 at 05:08:08PM +0530, Srikar Dronamraju wrote:
> Enable small core scheduling as soon as we detect that we are in a
> system that supports thread group. Doing so would avoid a redundant
> check.

The patch looks good to me. However, I think the commit message still
reflects the v1 code where we were moving the functionality from
smp_cpus_done() to init_big_cores().

In this version we are moving it to a helper function that collates all
topology fixups.

> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
> powerpc/smp: Enable small core scheduling sooner
>   Restored the previous info msg (Jordan)
>   Moved big core topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 1ce95da00cb6..72f16dc0cb26 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1370,6 +1370,16 @@ int setup_profiling_timer(unsigned int multiplier)
>   return 0;
>  }
> 
> +static void fixup_topology(void)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + if (has_big_cores) {
> + pr_info("Big cores detected but using small core scheduling\n");
> + powerpc_topology[0].mask = smallcore_smt_mask;
> + }
> +#endif
> +}
> +
>  void __init smp_cpus_done(unsigned int max_cpus)
>  {
>   /*
> @@ -1383,12 +1393,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
> 
>   dump_numa_cpu_topology();
> 
> -#ifdef CONFIG_SCHED_SMT
> - if (has_big_cores) {
> - pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> - }
> -#endif
> + fixup_topology();
>   set_sched_topology(powerpc_topology);
>  }
> 
> -- 
> 2.17.1
> 


Re: [PATCH v2 02/10] powerpc/smp: Merge Power9 topology with Power topology

2020-07-21 Thread Gautham R Shenoy
On Tue, Jul 21, 2020 at 05:08:06PM +0530, Srikar Dronamraju wrote:
> A new sched_domain_topology_level was added just for Power9. However the
> same can be achieved by merging powerpc_topology with power9_topology
> and makes the code more simpler especially when adding a new sched
> domain.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Nick Piggin 
> Cc: Oliver OHalloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Anton Blanchard 
> Cc: Gautham R Shenoy 
> Cc: Vaidyanathan Srinivasan 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
> powerpc/smp: Merge Power9 topology with Power topology
>   Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
>   since cpu_smt_mask is only defined under CONFIG_SCHED_SMT
> 
>  arch/powerpc/kernel/smp.c | 33 ++---
>  1 file changed, 10 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 680c0edcc59d..0e0b118d9b6e 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1315,7 +1315,7 @@ int setup_profiling_timer(unsigned int multiplier)
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> -/* cpumask of CPUs with asymetric SMT dependancy */
> +/* cpumask of CPUs with asymmetric SMT dependency */
>  static int powerpc_smt_flags(void)
>  {
>   int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
> @@ -1328,14 +1328,6 @@ static int powerpc_smt_flags(void)
>  }
>  #endif
> 
> -static struct sched_domain_topology_level powerpc_topology[] = {
> -#ifdef CONFIG_SCHED_SMT
> - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
> -#endif
> - { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> - { NULL, },
> -};
> -
>  /*
>   * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
>   * This topology makes it *much* cheaper to migrate tasks between adjacent 
> cores
> @@ -1353,7 +1345,13 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> - return cpu_l2_cache_mask(cpu);
> + if (shared_caches)
> + return cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + return cpu_smallcore_mask(cpu);
> +
> + return per_cpu(cpu_sibling_map, cpu);
>  }


It might be helpful to enumerate the consequences of this change:

With this patch, on POWER7 and POWER8

   SMT and CACHE domains' cpumasks will both be
   per_cpu(cpu_sibling_map, cpu).

   On POWER7 SMT level flags has the following
   (SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING)

   On POWER8 SMT level flags has the following
   (SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES).

   On both POWER7 and POWER8, CACHE level flags only has
   SD_SHARE_PKG_RESOURCES

   Thus, on both POWER7 and POWER8, since the SMT and CACHE cpumasks
   are the same and since CACHE has no additional flags which SMT does
   not, the parent domain CACHE will be degenerated.

   Hence we will have SMT --> DIE --> NUMA as before without the
   patch. So the patch introduces no behavioural change. Only change
   is an additional degeneration of the CACHE domain.

On POWER9 : Baremetal.
   SMT level cpumask = per_cpu(cpu_sibling_map, cpu)

   Since the caches are shared for a pair of two cores,
   CACHE level cpumask = cpu_l2_cache_mask(cpu)

   Thus, we will have SMT --> CACHE --> DIE --> NUMA as before.  No
   behavioural change.

On POWER9 : LPAR
   SMT level cpumask = cpu_smallcore_mask(cpu).

   Since the caches are shared,
   CACHE level cpumask = cpu_l2_cache_mask(cpu).

   Thus, we will have SMT --> CACHE --> DIE --> NUMA as before.  Again
   no change in behaviour.
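
The degeneration relied upon throughout the above is the core
scheduler's usual rule (sd_degenerate()/sd_parent_degenerate() in
kernel/sched/topology.c). In simplified form (a sketch, not the actual
kernel code), a parent domain collapses when:

	static bool parent_degenerates(struct sched_domain *sd,
				       struct sched_domain *parent)
	{
		/* parent must span exactly the same CPUs as its child... */
		if (!cpumask_equal(sched_domain_span(sd),
				   sched_domain_span(parent)))
			return false;

		/* ...and add no scheduling flags the child lacks */
		return !(parent->flags & ~sd->flags);
	}

This is why CACHE collapses on POWER7/POWER8 above: its span equals the
SMT span, and SD_SHARE_PKG_RESOURCES is already present at the SMT
level.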

Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.


Re: [PATCH v4 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

2020-07-21 Thread Gautham R Shenoy
On Tue, Jul 21, 2020 at 09:07:07PM +0530, Pratik Rajesh Sampat wrote:
> Replace the variable name from using "pnv_first_spr_loss_level" to
> "deep_spr_loss_state".
> 
> pnv_first_spr_loss_level is supposed to be the earliest state that
> has OPAL_PM_LOSE_FULL_CONTEXT set, in other places the kernel uses the
> "deep" states as terminology. Hence renaming the variable to be coherent
> to its semantics.
> 
> Signed-off-by: Pratik Rajesh Sampat 

Acked-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/platforms/powernv/idle.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 642abf0b8329..28462d59a8e1 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -48,7 +48,7 @@ static bool default_stop_found;
>   * First stop state levels when SPR and TB loss can occur.
>   */
>  static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
> -static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
> +static u64 deep_spr_loss_state = MAX_STOP_STATE + 1;
> 
>  /*
>   * psscr value and mask of the deepest stop idle state.
> @@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long 
> psscr, bool mmu_on)
> */
>   mmcr0   = mfspr(SPRN_MMCR0);
>   }
> - if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
> + if ((psscr & PSSCR_RL_MASK) >= deep_spr_loss_state) {
>   sprs.lpcr   = mfspr(SPRN_LPCR);
>   sprs.hfscr  = mfspr(SPRN_HFSCR);
>   sprs.fscr   = mfspr(SPRN_FSCR);
> @@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long 
> psscr, bool mmu_on)
>* just always test PSSCR for SPR/TB state loss.
>*/
>   pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
> - if (likely(pls < pnv_first_spr_loss_level)) {
> + if (likely(pls < deep_spr_loss_state)) {
>   if (sprs_saved)
>   atomic_stop_thread_idle();
>   goto out;
> @@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void)
>* the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
>*/
>   pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
> - pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
> + deep_spr_loss_state = MAX_STOP_STATE + 1;
>   for (i = 0; i < nr_pnv_idle_states; i++) {
>   int err;
>   struct pnv_idle_states_t *state = &pnv_idle_states[i];
> @@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void)
>   pnv_first_tb_loss_level = psscr_rl;
> 
>   if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
> -  (pnv_first_spr_loss_level > psscr_rl))
> - pnv_first_spr_loss_level = psscr_rl;
> +  (deep_spr_loss_state > psscr_rl))
> + deep_spr_loss_state = psscr_rl;
> 
>   /*
>* The idle code does not deal with TB loss occurring
> @@ -,8 +,8 @@ static void __init pnv_power9_idle_init(void)
>* compatibility.
>*/
>   if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
> -  (pnv_first_spr_loss_level > psscr_rl))
> - pnv_first_spr_loss_level = psscr_rl;
> +  (deep_spr_loss_state > psscr_rl))
> + deep_spr_loss_state = psscr_rl;
> 
>   err = validate_psscr_val_mask(&state->psscr_val,
> &state->psscr_mask,
> @@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void)
>   }
> 
>   pr_info("cpuidle-powernv: First stop level that may lose SPRs = 
> 0x%llx\n",
> - pnv_first_spr_loss_level);
> + deep_spr_loss_state);
> 
>   pr_info("cpuidle-powernv: First stop level that may lose timebase = 
> 0x%llx\n",
>   pnv_first_tb_loss_level);
> -- 
> 2.25.4
> 


Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

2020-07-21 Thread Gautham R Shenoy
Hi,

On Wed, Jul 22, 2020 at 12:37:41AM +1000, Nicholas Piggin wrote:
> Excerpts from Pratik Sampat's message of July 21, 2020 8:29 pm:
> > 
> > 
> > On 20/07/20 5:27 am, Nicholas Piggin wrote:
> >> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am:
> >>> Replace the variable name from using "pnv_first_spr_loss_level" to
> >>> "pnv_first_fullstate_loss_level".
> >>>
> >>> As pnv_first_spr_loss_level is supposed to be the earliest state that
> >>> has OPAL_PM_LOSE_FULL_CONTEXT set, however as shallow states too lose
> >>> SPR values, rendering an incorrect terminology.
> >> It also doesn't lose "full" state at this loss level though. From the
> >> architecture it could be called "hv state loss level", but in POWER10
> >> even that is not strictly true.
> >>
> > Right. Just discovered that deep stop states won't lose full state
> > P10 onwards.
> > Would it be better if we rename it as "pnv_all_spr_loss_state" instead
> > so that it stays generic enough while being semantically coherent?
> 
> It doesn't lose all SPRs. It does physically, but for Linux it appears 
> at least timebase SPRs are retained and that's mostly how it's 
> documented.
> 
> Maybe there's no really good name for it, but we do call it "deep" stop 
> in other places, you could call it deep_spr_loss maybe. I don't mind too 
> much though, whatever Gautham is happy with.

Nick's suggestion is fine by me. We can call it deep_spr_loss_state.

> 
> Thanks,
> Nick

--
Thanks and Regards
gautham.


Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement

2020-07-19 Thread Gautham R Shenoy
Hi Pratik,


On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote:
> This patch adds support to trace IPI based and timer based wakeup
> latency from idle states
> 
> Latches onto the test-cpuidle_latency kernel module using the debugfs
> interface to send IPIs or schedule a timer based event, which in-turn
> populates the debugfs with the latency measurements.
> 
> Currently for the IPI and timer tests; first disable all idle states
> and then test for latency measurements incrementally enabling each state
> 
> Signed-off-by: Pratik Rajesh Sampat 

A few comments below.

> ---
>  tools/testing/selftests/Makefile   |   1 +
>  tools/testing/selftests/cpuidle/Makefile   |   6 +
>  tools/testing/selftests/cpuidle/cpuidle.sh | 257 +
>  tools/testing/selftests/cpuidle/settings   |   1 +
>  4 files changed, 265 insertions(+)
>  create mode 100644 tools/testing/selftests/cpuidle/Makefile
>  create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
>  create mode 100644 tools/testing/selftests/cpuidle/settings
> 

[..skip..]

> +
> +ins_mod()
> +{
> + if [ ! -f "$MODULE" ]; then
> + printf "$MODULE module does not exist. Exitting\n"

If the module has been compiled into the kernel (due to a
localyesconfig, for instance), then it is unlikely that we will find
it in /lib/modules. Perhaps you want to check if the debugfs
directories created by the module exist, and if so, print a message
saying that the module is already loaded or some such?

> + exit $ksft_skip
> + fi
> + printf "Inserting $MODULE module\n\n"
> + insmod $MODULE
> + if [ $? != 0 ]; then
> + printf "Insmod $MODULE failed\n"
> + exit $ksft_skip
> + fi
> +}
> +
> +compute_average()
> +{
> + arr=("$@")
> + sum=0
> + size=${#arr[@]}
> + for i in "${arr[@]}"
> + do
> + sum=$((sum + i))
> + done
> + avg=$((sum/size))

It would be good to assert that "size" isn't 0 here.

> +}
> +
> +# Disable all stop states
> +disable_idle()
> +{
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
> + for ((state=0; state<NUM_STATES; state++))
> + do
> + echo 1 > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable

So, on offlined CPUs, we won't see
/sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You
should probably perform this operation only on online CPUs.


> + done
> + done
> +}
> +
> +# Perform operation on each CPU for the given state
> +# $1 - Operation: enable (0) / disable (1)
> +# $2 - State to enable
> +op_state()
> +{
> + for ((cpu=0; cpu + do
> + echo $1 > 
> /sys/devices/system/cpu/cpu$cpu/cpuidle/state$2/disable


Ditto

> + done
> +}

This is a helper function. For better readability of the main code you
can define the following wrappers and use them.


cpuidle_enable_state()
{
state=$1
op_state 0 $state
}

cpuidle_disable_state()
{
state=$1
op_state 1 $state
}


> +

[..snip..]

> +run_ipi_tests()
> +{
> +extract_latency
> +disable_idle
> +declare -a avg_arr
> +echo -e "--IPI Latency Test---" >> $LOG
> +
> + echo -e "--Baseline IPI Latency measurement: CPU Busy--" >> $LOG
> + printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" 
> >> $LOG
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do
> + ipi_test_once "baseline" $cpu
> + printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> $LOG
> + avg_arr+=($ipi_latency)
> + done
> + compute_average "${avg_arr[@]}"
> + echo -e "Baseline Average IPI latency(ns): $avg" >> $LOG
> +
> +for ((state=0; state<NUM_STATES; state++))
> +do
> + unset avg_arr
> + echo -e "---Enabling state: $state---" >> $LOG
> + op_state 0 $state
> + printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" 
> "IPI_Latency(ns)" >> $LOG
> + for ((cpu=0; cpu<NUM_CPUS; cpu++))
> + do

If a CPU is offline, then we should skip it here.

> + # Running IPI test and logging results
> + sleep 1
> + ipi_test_once "test" $cpu
> + printf "%-3s %10s %12s\n" $src_cpu $cpu 
> $ipi_latency >> $LOG
> + avg_arr+=($ipi_latency)
> + done
> + compute_average "${avg_arr[@]}"
> + echo -e "Expected IPI latency(ns): 
> ${latency_arr[$state]}" >> $LOG
> + echo -e "Observed Average IPI latency(ns): $avg" >> $LOG
> + op_state 1 $state
> +done
> +}
> +
> +# Extract the residency in microseconds and convert to nanoseconds.
> +# Add 100 ns so that the timer stays for a little longer than the residency
> +extract_residency()
> +{
> 

Re: [PATCH v2 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states

2020-07-19 Thread Gautham R Shenoy
On Fri, Jul 17, 2020 at 02:48:00PM +0530, Pratik Rajesh Sampat wrote:
> Fire directed smp_call_function_single IPIs from a specified source
> CPU to the specified target CPU to reduce the noise we have to wade
> through in the trace log.
> The module is based on the idea written by Srivatsa Bhat and maintained
> by Vaidyanathan Srinivasan internally.
> 
> Queue HR timer and measure jitter. Wakeup latency measurement for idle
> states using hrtimer.  Echo a value in ns to timer_test_function and
> watch trace. A HRtimer will be queued and when it fires the expected
wakeup vs actual wakeup is computed and the delay printed in ns.
> 
> Implemented as a module which utilizes debugfs so that it can be
> integrated with selftests.
> 
> To include the module, check option and include as module
> kernel hacking -> Cpuidle latency selftests
> 
> [srivatsa.b...@linux.vnet.ibm.com: Initial implementation in
>  cpidle/sysfs]
> 
> [sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer
>  and fix some of the time calculation]
> 
> [e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and
>  increase the resolution of IPI wakeup]
> 
> Signed-off-by: Pratik Rajesh Sampat 


The debugfs module looks good to me.

Reviewed-by: Gautham R. Shenoy 


> ---
>  drivers/cpuidle/Makefile   |   1 +
>  drivers/cpuidle/test-cpuidle_latency.c | 150 +
>  lib/Kconfig.debug  |  10 ++
>  3 files changed, 161 insertions(+)
>  create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
> 
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index f07800cbb43f..2ae05968078c 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
>  obj-$(CONFIG_DT_IDLE_STATES)   += dt_idle_states.o
>  obj-$(CONFIG_ARCH_HAS_CPU_RELAX)   += poll_state.o
>  obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o
> +obj-$(CONFIG_IDLE_LATENCY_SELFTEST)   += test-cpuidle_latency.o
> 
>  
> ##
>  # ARM SoC drivers
> diff --git a/drivers/cpuidle/test-cpuidle_latency.c 
> b/drivers/cpuidle/test-cpuidle_latency.c
> new file mode 100644
> index ..61574665e972
> --- /dev/null
> +++ b/drivers/cpuidle/test-cpuidle_latency.c
> @@ -0,0 +1,150 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Module-based API test facility for cpuidle latency using IPIs and timers
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +/* IPI based wakeup latencies */
> +struct latency {
> + unsigned int src_cpu;
> + unsigned int dest_cpu;
> + ktime_t time_start;
> + ktime_t time_end;
> + u64 latency_ns;
> +} ipi_wakeup;
> +
> +static void measure_latency(void *info)
> +{
> + struct latency *v;
> + ktime_t time_diff;
> +
> + v = (struct latency *)info;
> + v->time_end = ktime_get();
> + time_diff = ktime_sub(v->time_end, v->time_start);
> + v->latency_ns = ktime_to_ns(time_diff);
> +}
> +
> +void run_smp_call_function_test(unsigned int cpu)
> +{
> + ipi_wakeup.src_cpu = smp_processor_id();
> + ipi_wakeup.dest_cpu = cpu;
> + ipi_wakeup.time_start = ktime_get();
> + smp_call_function_single(cpu, measure_latency, &ipi_wakeup, 1);
> +}
> +
> +/* Timer based wakeup latencies */
> +struct timer_data {
> + unsigned int src_cpu;
> + u64 timeout;
> + ktime_t time_start;
> + ktime_t time_end;
> + struct hrtimer timer;
> + u64 timeout_diff_ns;
> +} timer_wakeup;
> +
> +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer)
> +{
> + struct timer_data *w;
> + ktime_t time_diff;
> +
> + w = container_of(hrtimer, struct timer_data, timer);
> + w->time_end = ktime_get();
> +
> + time_diff = ktime_sub(w->time_end, w->time_start);
> + time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout));
> + w->timeout_diff_ns = ktime_to_ns(time_diff);
> + return HRTIMER_NORESTART;
> +}
> +
> +static void run_timer_test(unsigned int ns)
> +{
> + hrtimer_init(&timer_wakeup.timer, CLOCK_MONOTONIC,
> +  HRTIMER_MODE_REL);
> + timer_wakeup.timer.function = timer_called;
> + timer_wakeup.time_start = ktime_get();
> + timer_wakeup.src_cpu = smp_processor_id();
> + timer_wakeup.timeout = ns;
> +
> + hrtimer_start(&timer_wakeup.timer, ns_to_ktime(ns),
> +   HRTIMER_MODE_REL_PINNED);
> +}
> +
> +static struct dentry *dir;
> +
> +static 

Re: [PATCH v2 0/3] Power10 basic energy management

2020-07-13 Thread Gautham R Shenoy
On Mon, Jul 13, 2020 at 03:23:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
> > Changelog v1 --> v2:
> > 1. Save-restore DAWR and DAWRX unconditionally as they are lost in
> > shallow idle states too
> > 2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to
> > correct naming terminology
> > 
> > Pratik Rajesh Sampat (3):
> >   powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
> >   powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
> >   powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
> > 
> >  arch/powerpc/platforms/powernv/idle.c | 34 +--
> >  1 file changed, 22 insertions(+), 12 deletions(-)
> 
> These look okay to me, but the CPU_FTR_ARCH_300 test for 
> pnv_power9_idle_init() is actually wrong, it should be a PVR test 
> because idle is not completely architected (not even shallow stop 
> states, unfortunately).
> 
> It doesn't look like we support POWER10 idle correctly yet, and on older
> kernels it wouldn't work even if we fixed newer, so ideally the PVR 
> check would be backported as a fix in the front of the series.
> 
> Sadly, we have no OPAL idle driver yet. Hopefully we will before the
> next processor shows up :P

Abhishek posted a version recently :
https://patchwork.ozlabs.org/project/skiboot/patch/20200706043533.76539-1-hunt...@linux.vnet.ibm.com/
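
For reference, the PVR-based gating being suggested would look roughly
like the following (a sketch only, assuming the standard PVR_VER() and
PVR_POWER9 macros from asm/reg.h; this is not code from any posted
series):

	unsigned int pvr = mfspr(SPRN_PVR);

	/*
	 * Gate on the actual processor version rather than on ISA v3.0
	 * support, since stop-state behaviour is implementation specific.
	 */
	if (PVR_VER(pvr) == PVR_POWER9)
		pnv_power9_idle_init();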


> 
> Thanks,
> Nick

--
Thanks and Regards
gautham.


Re: [PATCH 1/2] powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above

2020-07-09 Thread Gautham R Shenoy
On Fri, Jul 03, 2020 at 06:16:39PM +0530, Pratik Rajesh Sampat wrote:
> POWER9 onwards, support for the registers HID1, HID4 and HID5 has been
> removed.
> Although mfspr on the above registers worked on POWER9, on the POWER10
> simulator they are unrecognized. Move their assignment under the
> check for machines older than POWER9.
> 
> Signed-off-by: Pratik Rajesh Sampat 

Nice catch.

Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/platforms/powernv/idle.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 2dd467383a88..19d94d021357 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
>*/
>   uint64_t lpcr_val   = mfspr(SPRN_LPCR);
>   uint64_t hid0_val   = mfspr(SPRN_HID0);
> - uint64_t hid1_val   = mfspr(SPRN_HID1);
> - uint64_t hid4_val   = mfspr(SPRN_HID4);
> - uint64_t hid5_val   = mfspr(SPRN_HID5);
>   uint64_t hmeer_val  = mfspr(SPRN_HMEER);
>   uint64_t msr_val = MSR_IDLE;
>   uint64_t psscr_val = pnv_deepest_stop_psscr_val;
> @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
> 
>   /* Only p8 needs to set extra HID registers */
>   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
> + uint64_t hid1_val = mfspr(SPRN_HID1);
> + uint64_t hid4_val = mfspr(SPRN_HID4);
> + uint64_t hid5_val = mfspr(SPRN_HID5);
> 
>   rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>   if (rc != 0)
> -- 
> 2.25.4
> 
--
Thanks and Regards
gautham.


Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-09 Thread Gautham R Shenoy
On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.

Adding Ravi Bangoria  to the cc.

> Therefore save the values of these SPRs before entering a  "stop"
> state and restore their values on wakeup.
> 
> Signed-off-by: Pratik Rajesh Sampat 


The saving and restoration looks good to me. 
> ---
>  arch/powerpc/platforms/powernv/idle.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 19d94d021357..471d4a65b1fa 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -600,6 +600,8 @@ struct p9_sprs {
>   u64 iamr;
>   u64 amor;
>   u64 uamor;
> + u64 dawr0;
> + u64 dawrx0;
>  };
> 
>  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long 
> psscr, bool mmu_on)
>   sprs.tscr   = mfspr(SPRN_TSCR);
>   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>   sprs.ldbar = mfspr(SPRN_LDBAR);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + sprs.dawr0 = mfspr(SPRN_DAWR0);
> + sprs.dawrx0 = mfspr(SPRN_DAWRX0);
> + }
>


But this is within the if condition which says

if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level)

This if condition is meant for stop4 and stop5 since these are stop
levels that have OPAL_PM_LOSE_HYP_CONTEXT set.

Since we can lose DAWR*, on states that lose limited hypervisor
context, such as stop0-2, we need to unconditionally save them
like AMR, IAMR etc.
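
Concretely, the save would move out of the deep-state branch and sit
with the unconditionally saved SPRs, roughly like this (an untested
sketch of the suggestion, not code from the series):

	sprs.amr	= mfspr(SPRN_AMR);
	sprs.iamr	= mfspr(SPRN_IAMR);
	sprs.uamor	= mfspr(SPRN_UAMOR);
	/*
	 * DAWR0/DAWRX0 can be lost even in shallow stop states on P10,
	 * so save them alongside AMR/IAMR rather than only for deep stop.
	 */
	if (cpu_has_feature(CPU_FTR_ARCH_31)) {
		sprs.dawr0 = mfspr(SPRN_DAWR0);
		sprs.dawrx0 = mfspr(SPRN_DAWRX0);
	}

	if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
		/* ... deep-state-only saves as before ... */
	}

with the matching mtspr()s likewise moved out of the deep-state branch
on the restore side.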


>   sprs_saved = true;
> 
> @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long 
> psscr, bool mmu_on)
>   mtspr(SPRN_MMCR2,   sprs.mmcr2);
>   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>   mtspr(SPRN_LDBAR, sprs.ldbar);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + mtspr(SPRN_DAWR0, sprs.dawr0);
> + mtspr(SPRN_DAWRX0, sprs.dawrx0);
> + }


Likewise, we need to unconditionally restore these SPRs.


> 
>   mtspr(SPRN_SPRG3,   local_paca->sprg_vdso);
> 
> -- 
> 2.25.4
> 


Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-07 Thread Gautham R Shenoy
Hi,

On Tue, Jul 07, 2020 at 04:41:34PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> Hi,
> 
> 
> 
> 
> Gautham R. Shenoy (5):
>   cpuidle-pseries: Set the latency-hint before entering CEDE
>   cpuidle-pseries: Add function to parse extended CEDE records
>   cpuidle-pseries : Fixup exit latency for CEDE(0)
>   cpuidle-pseries : Include extended CEDE states in cpuidle framework
>   cpuidle-pseries: Block Extended CEDE(1) which adds no additional
> value.

Forgot to mention that these patches are on top of Nathan's series to
remove extended CEDE offline and bogus topology update code :
https://lore.kernel.org/linuxppc-dev/20200612051238.1007764-1-nath...@linux.ibm.com/

> 
>  drivers/cpuidle/cpuidle-pseries.c | 268 
> +-
>  1 file changed, 266 insertions(+), 2 deletions(-)
> 
> -- 
> 1.9.4
> 


[PATCH 5/5] cpuidle-pseries: Block Extended CEDE(1) which adds no additional value.

2020-07-07 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The Extended CEDE state with latency-hint = 1 is only different from
normal CEDE (with latency-hint = 0) in that a CPU in Extended CEDE(1)
does not wake up on timer events. Both CEDE and Extended CEDE(1) map to
the same hardware idle state. Since we already get SMT folding from
the normal CEDE, the Extended CEDE(1) doesn't provide any additional
value. This patch blocks Extended CEDE(1).

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 57 ---
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 6f893cd..be0b8b2 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -350,6 +350,43 @@ static int pseries_cpuidle_driver_init(void)
return 0;
 }
 
+#define XCEDE1_HINT	1
+#define ERR_NO_VALUE_ADD   (-1)
+#define ERR_NO_EE_WAKEUP   (-2)
+
+/*
+ * Returns 0 if the Extended CEDE state with @hint is not blocked in
+ * the cpuidle framework.
+ *
+ * Returns ERR_NO_EE_WAKEUP if the Extended CEDE state is blocked due
+ * to not being responsive to external interrupts.
+ *
+ * Returns ERR_NO_VALUE_ADD if the Extended CEDE state does not provide
+ * any added value over the normal CEDE.
+ */
+static int cpuidle_xcede_blocked(u8 hint, u64 latency_us, u8 
responsive_to_irqs)
+{
+
+   /*
+* We will only allow extended CEDE states that are responsive
+* to irqs and do not require an H_PROD to be woken up.
+*/
+   if (!responsive_to_irqs)
+   return ERR_NO_EE_WAKEUP;
+
+   /*
+* We already obtain SMT folding benefits from CEDE (which is
+* CEDE with hint 0). Furthermore, CEDE is also responsive to
+* timer-events, while XCEDE1 requires an external
+* interrupt/H_PROD to be woken up. Hence, block XCEDE1 since
+* it adds no further value.
+*/
+   if (hint == XCEDE1_HINT)
+   return ERR_NO_VALUE_ADD;
+
+   return 0;
+}
+
 static int add_pseries_idle_states(void)
 {
int nr_states = 2; /* By default we have snooze, CEDE */
@@ -365,15 +402,29 @@ static int add_pseries_idle_states(void)
char name[CPUIDLE_NAME_LEN];
unsigned int latency_hint = xcede_records[i].latency_hint;
u64 residency_us;
+   int rc;
+
+   if (latency_us < min_latency_us)
+   min_latency_us = latency_us;
+
+   rc = cpuidle_xcede_blocked(latency_hint, latency_us,
+  xcede_records[i].responsive_to_irqs);
 
-   if (!xcede_records[i].responsive_to_irqs) {
+   if (rc) {
+   switch (rc) {
+   case ERR_NO_VALUE_ADD:
+   pr_info("cpuidle : Skipping XCEDE%d. No 
additional value-add\n",
+   latency_hint);
+   break;
+   case ERR_NO_EE_WAKEUP:
pr_info("cpuidle : Skipping XCEDE%d. Not responsive to 
IRQs\n",
latency_hint);
+   break;
+   }
+
continue;
}
 
-   if (latency_us < min_latency_us)
-   min_latency_us = latency_us;
snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint);
 
/*
-- 
1.9.4



[PATCH 4/5] cpuidle-pseries : Include extended CEDE states in cpuidle framework

2020-07-07 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This patch exposes to the cpuidle framework those extended CEDE
states which are responsive to external interrupts and do not need an
H_PROD.

Since, as per the PAPR, all the extended CEDE states are non-responsive
to timers, we indicate this to the cpuidle subsystem via the
CPUIDLE_FLAG_TIMER_STOP flag for all those extended CEDE states which
can wake up on external interrupts.

With the patch, we are able to see the extended CEDE state with
latency hint = 1 exposed via the cpuidle framework.

$ cpupower idle-info
CPUidle driver: pseries_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: snooze CEDE XCEDE1
snooze:
Flags/Description: snooze
Latency: 0
Usage: 33429446
Duration: 27006062
CEDE:
Flags/Description: CEDE
Latency: 1
Usage: 10272
Duration: 110786770
XCEDE1:
Flags/Description: XCEDE1
Latency: 12
Usage: 26445
Duration: 1436433815

Benchmark results:
TLDR: Overall, we do not see any additional benefit from having XCEDE1 over
CEDE.

ebizzy :
2 threads bound to a big-core. With this patch, we see a 3.39%
regression compared to having only the CEDE0 latency fixup.
x With only CEDE0 latency fixup
* With CEDE0 latency fixup + CEDE1
N   Min   MaxMedian   AvgStddev
x  10   2893813   5834474   5832448 5327281.3 1055941.4
*  10   2907329   5834923   5831398 5146614.6 1193874.8

context_switch2:
With the context_switch2 there are no observable regressions in the
results.

context_switch2 CPU0 CPU1 (Same Big-core, different small-cores).
No difference with and without patch.
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x 500343644348778345444 345584.02 1035.1658
* 500344310347646345776 345877.22 802.19501

context_switch2 CPU0 CPU8 (different big-cores). Minor 0.05% improvement
with patch
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x 500287562288756288162 288134.76 262.24328
* 500287874288960288306 288274.66 187.57034

schbench:
No regressions observed with schbench

Without Patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

With Patch:
Latency percentiles (usec)
50.0th: 30
75.0th: 40
90.0th: 51
95.0th: 59
*99.0th: 13616
99.5th: 14512
99.9th: 15696
min=0, max=15996

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 50 +++
 1 file changed, 50 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 502f906..6f893cd 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -362,9 +362,59 @@ static int add_pseries_idle_states(void)
for (i = 0; i < nr_xcede_records; i++) {
u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+   char name[CPUIDLE_NAME_LEN];
+   unsigned int latency_hint = xcede_records[i].latency_hint;
+   u64 residency_us;
+
+   if (!xcede_records[i].responsive_to_irqs) {
+   pr_info("cpuidle : Skipping XCEDE%d. Not responsive to 
IRQs\n",
+   latency_hint);
+   continue;
+   }
 
if (latency_us < min_latency_us)
min_latency_us = latency_us;
+   snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint);
+
+   /*
+* Section 14.14.1 of PAPR version 2.8.1 says that
+* calling H_CEDE with the value of the cede
+* latency specifier set greater than zero allows the
+* processor timer facility to be disabled (so as not
+* to cause gratuitous wake-ups - the use of H_PROD,
+* or other external interrupt is required to wake the
+* processor in this case).
+*
+* So, inform the cpuidle-subsystem that the timer
+* will be stopped for these states.
+*
+* Also, bump up the latency by 10us, since cpuidle
+* would use the timer-offload framework which will need
+* to send an IPI to wake up a CPU whose timer has
+* expired.
+*/

[PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-07 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

On pseries Dedicated Linux LPARs, apart from the polling snooze idle
state, we currently have the CEDE idle state which cedes the CPU to
the hypervisor with latency-hint = 0.

However, the PowerVM hypervisor supports additional extended CEDE
states, which can be queried through the "ibm,get-systems-parameter"
rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these
extended CEDE states to appropriate platform idle-states in order to
provide energy-savings as well as shifting power to the active
units. On existing pseries LPARs today we have extended CEDE with
latency-hints {1,2} supported.

In Patches 1-3 of this patchset, we add the code to parse the CEDE
latency records provided by the hypervisor. We use this information to
determine the wakeup latency of the regular CEDE (which we have been
so far hardcoding to 10us while experimentally it is much lesser ~
1us), by looking at the wakeup latency provided by the hypervisor for
Extended CEDE states. Since the platform currently advertises Extended
CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup
latency of the regular CEDE is no more than this.
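As a worked example (numbers from the dmesg in patch 2): XCEDE1
advertises a wakeup latency of 0x400 tb-ticks, and at the 512MHz
timebase frequency that is 1024 / 512 = 2us, so the CEDE(0) exit
latency gets fixed up to 2 - 1 = 1us.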

Patch 4 (currently marked as RFC), expose the extended CEDE states
parsed above to the cpuidle framework, provided that they can wakeup
on an interrupt. On current platforms only Extended CEDE 1 fits the
bill, but this is going to change in future platforms where even
Extended CEDE 2 may be responsive to external interrupts.

Patch 5 (currently marked as RFC), filters out Extended CEDE 1 since
it offers no added advantage over the normal CEDE.

With Patches 1-3, we see an improvement in the single-threaded
performance on ebizzy.

2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s (higher the better) with patches 1-3.
x without_patches
* with_patches
N   Min   MaxMedian   AvgStddev
x  10   2491089   5834307   5398375   4244335 1596244.9
*  10   2893813   5834474   5832448 5327281.3 1055941.4

We do not observe any major regression in either the context_switch2
benchmark or the schbench benchmark.

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x 500348872362236354712 354745.69  2711.827
* 500349422361452353942  354215.4 2576.9258

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x 500287956294940288896 288977.23 646.59295
* 500288300294646289582 290064.76 1161.9992

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812



Gautham R. Shenoy (5):
  cpuidle-pseries: Set the latency-hint before entering CEDE
  cpuidle-pseries: Add function to parse extended CEDE records
  cpuidle-pseries : Fixup exit latency for CEDE(0)
  cpuidle-pseries : Include extended CEDE states in cpuidle framework
  cpuidle-pseries: Block Extended CEDE(1) which adds no additional
value.

 drivers/cpuidle/cpuidle-pseries.c | 268 +-
 1 file changed, 266 insertions(+), 2 deletions(-)

-- 
1.9.4



[PATCH 1/5] cpuidle-pseries: Set the latency-hint before entering CEDE

2020-07-07 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

As per the PAPR, each H_CEDE call is associated with a latency-hint to
be passed in the VPA field "cede_latency_hint". The CEDE states that
we were implicitly entering so far is CEDE with latency-hint = 0.

This patch explicitly sets the latency hint corresponding to the CEDE
state that we are currently entering. While at it, we save the
previous hint, to be restored once we wakeup from CEDE. This will be
required in the future when we expose extended-cede states through the
cpuidle framework, where each of them will have a different
cede-latency hint.

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 4a37252..39d4bb6 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -105,19 +105,27 @@ static void check_and_cede_processor(void)
}
 }
 
+#define NR_CEDE_STATES 1  /* CEDE with latency-hint 0 */
+#define NR_DEDICATED_STATES(NR_CEDE_STATES + 1) /* Includes snooze */
+
+u8 cede_latency_hint[NR_DEDICATED_STATES];
 static int dedicated_cede_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
+   u8 old_latency_hint;
 
pseries_idle_prolog();
get_lppaca()->donate_dedicated_cpu = 1;
+   old_latency_hint = get_lppaca()->cede_latency_hint;
+   get_lppaca()->cede_latency_hint = cede_latency_hint[index];
 
HMT_medium();
check_and_cede_processor();
 
local_irq_disable();
get_lppaca()->donate_dedicated_cpu = 0;
+   get_lppaca()->cede_latency_hint = old_latency_hint;
 
pseries_idle_epilog();
 
@@ -149,7 +157,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state dedicated_states[] = {
+static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
-- 
1.9.4



[PATCH 2/5] cpuidle-pseries: Add function to parse extended CEDE records

2020-07-07 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.

The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.

This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.

dmesg on a POWER9 LPAR, demonstrating the output of parsing the
extended CEDE latency parameters.

[5.913180] xcede : xcede_record_size = 10
[5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, 
Wake-on-irq = 1
[5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, 
Wake-on-irq = 0
[5.913193] cpuidle : Skipping the 2 Extended CEDE idle states

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 129 +-
 1 file changed, 127 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index 39d4bb6..c13549b 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
@@ -105,9 +106,120 @@ static void check_and_cede_processor(void)
}
 }
 
-#define NR_CEDE_STATES 1  /* CEDE with latency-hint 0 */
+struct xcede_latency_records {
+   u8  latency_hint;
+   u64 wakeup_latency_tb_ticks;
+   u8  responsive_to_irqs;
+};
+
+/*
+ * XCEDE : Extended CEDE states discovered through the
+ * "ibm,get-systems-parameter" rtas-call with the token
+ * CEDE_LATENCY_TOKEN
+ */
+#define MAX_XCEDE_STATES   4
+#define XCEDE_LATENCY_RECORD_SIZE	10
+#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \
+   (MAX_XCEDE_STATES * 
XCEDE_LATENCY_RECORD_SIZE))
+
+#define CEDE_LATENCY_TOKEN 45
+
+#define NR_CEDE_STATES (MAX_XCEDE_STATES + 1) /* CEDE with 
latency-hint 0 */
 #define NR_DEDICATED_STATES(NR_CEDE_STATES + 1) /* Includes snooze */
 
+struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
+unsigned int nr_xcede_records;
+char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
+
+static int parse_cede_parameters(void)
+{
+   int ret = -1, i;
+   u16 payload_length;
+   u8 xcede_record_size;
+   u32 total_xcede_records_size;
+   char *payload;
+
+   memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+   ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+   NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
+   XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+   if (ret) {
+   pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
+   return ret;
+   }
+
+   payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
+   payload = &xcede_parameters[2];
+
+   /*
+* If the platform supports the cede latency settings
+* information system parameter it must provide the following
+* information in the NULL terminated parameter string:
+*
+* a. The first byte is the length “N” of each cede
+*latency setting record minus one (zero indicates a length
+*of 1 byte).
+*
+* b. For each supported cede latency setting a cede latency
+*setting record consisting of the first “N” bytes as per
+*the following table.
+*
+*  -
+*  | Field   | Field  |
+*  | Name| Length |
+*  -
+*  | Cede Latency| 1 Byte |
+*  | Specifier Value ||
+*  -
+*  | Maximum wakeup  ||
+*  | latency in  | 8 Bytes|
+*  | tb-ticks||
+*  -
+*  | Responsive to   ||
+*  | external| 1 Byte |
+*  | interrupts  ||
+*  -
+*
+* This version has cede latency record size = 10.
+*/
+   xcede_record_size = (u8)payload[0] + 1;
+
+   if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) {
+   pr_err("xcede : Expected record-size %d. Observed size %d.\n",
+  XCEDE_LATENCY_RECORD_SIZE, xcede_record_size);
+   return -EINVAL;
+   }
+
+   pr_info("xcede : xcede_record_size =
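
[ The rest of this hunk is truncated in the archive. Going by the
  record layout documented in the comment above, the parsing loop
  presumably proceeds along these lines (a sketch, not the original
  hunk): ]

	total_xcede_records_size = payload_length - 1;
	nr_xcede_records = total_xcede_records_size / xcede_record_size;
	if (nr_xcede_records > MAX_XCEDE_STATES)
		nr_xcede_records = MAX_XCEDE_STATES;

	for (i = 0; i < nr_xcede_records; i++) {
		char *record = &payload[1 + i * xcede_record_size];

		xcede_records[i].latency_hint = record[0];
		xcede_records[i].wakeup_latency_tb_ticks =
			be64_to_cpu(*(__be64 *)&record[1]);
		xcede_records[i].responsive_to_irqs = record[9];
	}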

[PATCH 3/5] cpuidle-pseries : Fixup exit latency for CEDE(0)

2020-07-07 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

We are currently assuming that CEDE(0) has exit latency 10us, since
there is no way for us to query it from the platform. However, if the
wakeup latency of an Extended CEDE state is smaller than 10us, then we
can be sure that the exit latency of CEDE(0) cannot be more than that.

In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us. The new
value is 1us less than the smallest wakeup latency among the
Extended CEDE states.

Benchmark results:

ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x  10   2491089   5834307   5398375   4244335 1596244.9
*  10   2893813   5834474   5832448 5327281.3 1055941.4

context_switch2 :
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x 500348872362236354712 354745.69  2711.827
* 500349422361452353942  354215.4 2576.9258

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
* with_patch
N   Min   MaxMedian   AvgStddev
x 500287956294940288896 288977.23 646.59295
* 500288300294646289582 290064.76 1161.9992

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index c13549b..502f906 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -353,12 +353,42 @@ static int pseries_cpuidle_driver_init(void)
 static int add_pseries_idle_states(void)
 {
int nr_states = 2; /* By default we have snooze, CEDE */
+   int i;
+   u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency 
*/
 
if (parse_cede_parameters())
return nr_states;
 
-   pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
-   nr_xcede_records);
+   for (i = 0; i < nr_xcede_records; i++) {
+   u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
+   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+
+   if (latency_us < min_latency_us)
+   min_latency_us = latency_us;
+   }
+
+   /*
+* We are currently assuming that CEDE(0) has exit latency
+* 10us, since there is no way for us to query it from the
+* platform.
+*
+* However, if the wakeup latency of an Extended CEDE state is
+* smaller than 10us, then we can be sure that CEDE(0)
+* requires no more than that.
+*
+* Perform the fix-up.
+*/
+   if (min_latency_us < dedicated_states[1].exit_latency) {
+   u64 cede0_latency = min_latency_us - 1;
+
+   if (cede0_latency <= 0)
+   cede0_latency = min_latency_us;
+
+   dedicated_states[1].exit_latency = cede0_latency;
+   dedicated_states[1].target_residency = 10 * (cede0_latency);
+   pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n",
+   cede0_latency);
+   }
 
return nr_states;
 }
-- 
1.9.4



Re: [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn

2020-06-24 Thread Gautham R Shenoy
Hello Srikar,


On Wed, Jun 24, 2020 at 02:58:45PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: David Hildenbrand 
> Signed-off-by: Srikar Dronamraju 


This patch looks good to me.

Reviewed-by: Gautham R. Shenoy 


--
Thanks and Regards
gautham.


Re: [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-06-24 Thread Gautham R Shenoy
On Wed, Jun 24, 2020 at 02:58:44PM +0530, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not any cpus
> or memory attached to it. As per PAPR, node affinity of a cpu is only
> available once its present / online. For all cpus that are possible but
> not present, cpu_to_node() would point to node 0.
> 
> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> to make sure all possible but not present cpu_to_node are set to a
> proper node.
> 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: David Hildenbrand 
> Signed-off-by: Srikar Dronamraju 

This looks good to me.

Reviewed-by: Gautham R. Shenoy 

--
Thanks and Regards
gautham.


Re: [PATCH v4 2/3] powerpc/numa: Prefer node id queried from vphn

2020-05-12 Thread Gautham R Shenoy
On Tue, May 12, 2020 at 06:59:36PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: David Hildenbrand 
> Signed-off-by: Srikar Dronamraju 

Looks good to me.

Reviewed-by: Gautham R. Shenoy 

> ---
> Changelog v2:->v3:
> - Resolved comments from Gautham.
> Link v2: 
> https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> 
>  arch/powerpc/mm/numa.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b3615b7..2815313 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
>*/
>   for_each_present_cpu(i) {
>   struct device_node *cpu;
> - int nid;
> -
> - cpu = of_get_cpu_node(i, NULL);
> - BUG_ON(!cpu);
> - nid = of_node_to_nid_single(cpu);
> - of_node_put(cpu);
> + int nid = vphn_get_nid(i);
> 
>   /*
>* Don't fall back to default_nid yet -- we will plug
>* cpus into nodes once the memory scan has discovered
>* the topology.
>*/
> - if (nid < 0)
> - continue;
> - node_set_online(nid);
> + if (nid == NUMA_NO_NODE) {
> + cpu = of_get_cpu_node(i, NULL);
> + BUG_ON(!cpu);
> + nid = of_node_to_nid_single(cpu);
> + of_node_put(cpu);
> + }
> +
> + if (likely(nid > 0))
> + node_set_online(nid);
>   }
> 
>   get_n_mem_cells(_mem_addr_cells, _mem_size_cells);
> -- 
> 1.8.3.1
> 


Re: [PATCH] powerpc/powernv: Fix a warning message

2020-05-03 Thread Gautham R Shenoy
Hello Christophe,

On Sat, May 02, 2020 at 01:59:49PM +0200, Christophe JAILLET wrote:
> Fix a cut'n'paste error in a warning message. This should be
> 'cpu-idle-state-residency-ns' to match the property searched in the
> previous 'of_property_read_u32_array()'
> 
> Fixes: 9c7b185ab2fe ("powernv/cpuidle: Parse dt idle properties into global 
> structure")
> Signed-off-by: Christophe JAILLET 

Thanks for catching this.

Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/platforms/powernv/idle.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 78599bca66c2..2dd467383a88 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -1270,7 +1270,7 @@ static int pnv_parse_cpuidle_dt(void)
>   /* Read residencies */
>   if (of_property_read_u32_array(np, "ibm,cpu-idle-state-residency-ns",
>  temp_u32, nr_idle_states)) {
> - pr_warn("cpuidle-powernv: missing 
> ibm,cpu-idle-state-latencies-ns in DT\n");
> + pr_warn("cpuidle-powernv: missing 
> ibm,cpu-idle-state-residency-ns in DT\n");
>   rc = -EINVAL;
>   goto out;
>   }
> -- 
> 2.25.1
> 


Re: [PATCH v5 0/5] Track and expose idle PURR and SPURR ticks

2020-04-30 Thread Gautham R Shenoy
On Thu, Apr 30, 2020 at 09:46:13AM +0530, Gautham R Shenoy wrote:
> Hello Michael,
> > >
> > > Michael, could you please consider this for 5.8 ?
> > 
> > Yes. Has it been tested on KVM at all?
> 
> No. I haven't tested this on KVM. Will do that today.


The results on Shared LPAR and KVM are as follows:
---

The lparstat results on a Shared LPAR are consistent with those
observed on a dedicated LPAR when at least one of the threads of the
core is active. When all the threads are idle, lparstat shows an
incorrect idle percentage. This is perhaps because the Hypervisor puts
a completely idle core into some power-saving state with the runlatch
turned off, due to which the PURR counts on the threads of a core do
not add up to the elapsed timebase ticks. The results are in
section A) below.

lparstat is not supported on KVM. However, I performed some basic
sanity checks on purr, spurr, idle_purr, and idle_spurr sysfs files
that show up after this patch series. When CPUs are offlined, the
idle_purr and idle_spurr sysfs files no longer show up, just like purr
and spurr sysfs files. The values of the counters monotonically
increase, except when the CPU is busy, in which case the idle_purr and
idle_spurr counts are stagnant as expected.

However, I don't think even the values of PURR or SPURR make much
sense on a KVM guest, since the Linux hypervisor doesn't set additional
registers such as RWMR, except on POWER8, where KVM sets the RWMR
corresponding to the number of online threads in a vcore before
dispatching the vcore. I haven't been able to test it on a POWER8
guest yet. The results on POWER9 are in section B) below.


A ) Shared LPAR
==

1. When all the threads of the core are running a CPU-Hog

# ./lparstat -E 1 5
System Configuration
type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
100.00   0.00  2.90GHz[126%] 126.00   0.00
100.00   0.00  2.90GHz[126%] 126.00   0.00
100.00   0.00  2.90GHz[126%] 126.00   0.00
100.00   0.00  2.90GHz[126%] 126.00   0.00
100.01   0.00  2.90GHz[126%] 126.01   0.00

2. When 4 threads of a core are running CPU Hogs, with the remaining 4
threads idle.

# ./lparstat -E 1 5
System Configuration
type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
 81.06  18.94  2.97GHz[129%] 104.56  24.44
 81.05  18.95  2.97GHz[129%] 104.56  24.44
 81.06  18.95  2.97GHz[129%] 104.56  24.44
 81.06  18.95  2.97GHz[129%] 104.56  24.44
 81.05  18.95  2.97GHz[129%] 104.56  24.45

3. When 2 threads of a core are running CPU Hogs, with the other 6
threads idle.

# ./lparstat -E 1 5
System Configuration
type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
 65.21  34.79  3.13GHz[136%]  88.69  47.31
 65.20  34.81  3.13GHz[136%]  88.67  47.33
 64.25  35.76  3.13GHz[136%]  87.38  48.63
 63.68  36.31  3.13GHz[136%]  86.60  49.39
 63.55  36.45  3.13GHz[136%]  86.42  49.58
 

4. When a single thread of the core is running CPU Hog, remaining 7
threads are idle.
# ./lparstat -E 1 5
System Configuration
type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
 31.80  68.20  3.20GHz[139%]  44.20  94.80
 31.80  68.20  3.20GHz[139%]  44.20  94.81
 31.80  68.20  3.20GHz[139%]  44.20  94.80
 31.80  68.21  3.20GHz[139%]  44.20  94.81
 31.79  68.21  3.20GHz[139%]  44.19  94.81

5. When the LPAR is idle:

# ./lparstat -E 1 5
System Configuration
type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 
---Actual--- -Normalized-
%busy  %idle   Frequency %busy  %idle
-- --  - -- --
  0.04   0.14  2.41GHz[105%]   0.04   0.15
  0.04   0.15  2.36GHz[102%]   0.04   0.15
  0.03   0.13  2.35GHz[102%]   0.03   0.14
  0.03   0.13  2.31GHz[100%]   0.03   0.13
  0.03   0.13  2.32GHz[101%]   0.03   0.14

In this case, the sum of the PURR values do not add up to the elapsed
TB. This is probably due to the Hypervisor putting the core into some
power-saving state with the runlatch turned off.

# ./purr_tb -t 8
Got threads_per_core = 8
CORE 0: 
CPU 0 : Delta PURR : 85744 
CPU 1 : Delta PURR : 113632 
CPU 2 : Delta PURR : 78224 
CPU 3 : Delta PURR : 68856 
CPU 4 : Delta PURR : 78064 
CPU 5 : Delta PURR : 60488 
CPU 6 : Delta PURR : 6 
CPU 7 : Delta PURR : 59464 
Total D

Re: [PATCH v5 0/5] Track and expose idle PURR and SPURR ticks

2020-04-29 Thread Gautham R Shenoy
Hello Michael,

On Thu, Apr 30, 2020 at 12:34:52PM +1000, Michael Ellerman wrote:
> Gautham R Shenoy  writes:
> > On Mon, Apr 20, 2020 at 03:46:35PM -0700, Tyrel Datwyler wrote:
> >> On 4/7/20 1:47 AM, Gautham R. Shenoy wrote:
> >> > From: "Gautham R. Shenoy" 
> >> > 
> >> > Hi,
> >> > 
> >> > This is the fifth version of the patches to track and expose idle PURR
> >> > and SPURR ticks. These patches are required by tools such as lparstat
> >> > to compute system utilization for capacity planning purposes.
> ...
> >> > 
> >> > Gautham R. Shenoy (5):
> >> >   powerpc: Move idle_loop_prolog()/epilog() functions to header file
> >> >   powerpc/idle: Store PURR snapshot in a per-cpu global variable
> >> >   powerpc/pseries: Account for SPURR ticks on idle CPUs
> >> >   powerpc/sysfs: Show idle_purr and idle_spurr for every CPU
> >> >   Documentation: Document sysfs interfaces purr, spurr, idle_purr,
> >> > idle_spurr
> >> > 
> >> >  Documentation/ABI/testing/sysfs-devices-system-cpu | 39 +
> >> >  arch/powerpc/include/asm/idle.h| 93 
> >> > ++
> >> >  arch/powerpc/kernel/sysfs.c| 82 
> >> > ++-
> >> >  arch/powerpc/platforms/pseries/setup.c |  8 +-
> >> >  drivers/cpuidle/cpuidle-pseries.c  | 39 ++---
> >> >  5 files changed, 224 insertions(+), 37 deletions(-)
> >> >  create mode 100644 arch/powerpc/include/asm/idle.h
> >> > 
> >> 
> >> Reviewed-by: Tyrel Datwyler 
> >
> > Thanks for reviewing the patches.
> >
> >> 
> >> Any chance this is going to be merged in the near future? There is a 
> >> patchset to
> >> update lparstat in the powerpc-utils package to calculate PURR/SPURR cpu
> >> utilization that I would like to merge, but have been holding off to make 
> >> sure
> >> we are synced with this proposed patchset.
> >
> > Michael, could you please consider this for 5.8 ?
> 
> Yes. Has it been tested on KVM at all?

No. I haven't tested this on KVM. Will do that today.


> 
> cheers

--
Thanks and Regards
gautham.


Re: [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn

2020-04-29 Thread Gautham R Shenoy
Hello Srikar,

On Tue, Apr 28, 2020 at 03:08:35PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> 
>  arch/powerpc/mm/numa.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b3615b7fdbdf..281531340230 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
>*/
>   for_each_present_cpu(i) {
>   struct device_node *cpu;
> - int nid;
> -
> - cpu = of_get_cpu_node(i, NULL);
> - BUG_ON(!cpu);
> - nid = of_node_to_nid_single(cpu);
> - of_node_put(cpu);
> + int nid = vphn_get_nid(i);
> 
>   /*
>* Don't fall back to default_nid yet -- we will plug
>* cpus into nodes once the memory scan has discovered
>* the topology.
>*/
> - if (nid < 0)
> - continue;


> + if (nid == NUMA_NO_NODE) {
> + cpu = of_get_cpu_node(i, NULL);
> + if (cpu) {

Why are we not retaining the BUG_ON(!cpu) assert here?

> + nid = of_node_to_nid_single(cpu);
> + of_node_put(cpu);
> + }
> + }

Is it possible at this point that both vphn_get_nid(i) and
of_node_to_nid_single(cpu) return NUMA_NO_NODE? If so,
should we still call node_set_online() below?


>   node_set_online(nid);
>   }
> 
> -- 
> 2.20.1
> 
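If both can return NUMA_NO_NODE, perhaps guard the call along these
lines (just a sketch of the idea):

	if (nid != NUMA_NO_NODE)
		node_set_online(nid);
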
--
Thanks and Regards
gautham.


[PATCH v2 0/1] pseries/hotplug: Change the default behaviour of cede_offline

2019-10-22 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This is the v2 of the fix to change the default behaviour of cede_offline.
The previous version can be found here: https://lkml.org/lkml/2019/9/12/222

The main change from v1 is that the patch2 to create a sysfs file to
report and control the value of cede_offline_enabled has been dropped.

Problem Description:

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since the CPUs are
still under the control fo the Linux Guests. Hence, when we change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.

In the following data (Generated using
https://github.com/gautshen/misc/blob/master/purr_tb.py):

SD-PURR = Sum of PURR increments on online CPUs of that core in 1 second
  
SMT=off
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0]51200   69883784
core01 [  8]51200   88782536
core02 [ 16]51200   94296824
core03 [ 24]51200   80951968

SMT=2
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1]  51200   136147792   
core01 [  8,9]  51200   128636784   
core02 [ 16,17] 51200   135426488   
core03 [ 24,25] 51200   153027520   

SMT=4
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3]  51200   258331616   
core01 [  8,9,10,11]51200   274220072   
core02 [ 16,17,18,19]   51200   260013736   
core03 [ 24,25,26,27]   51200   260079672   

SMT=on
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3,4,5,6,7]  51200   512941248   
core01 [  8,9,10,11,12,13,14,15]51200   512936544   
core02 [ 16,17,18,19,20,21,22,23]   51200   512931544   
core03 [ 24,25,26,27,28,29,30,31]   51200   512923800
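
For reference, the measurement the script performs boils down to
something like the following minimal userspace sketch (it assumes a
ppc64 host with the per-cpu "purr" sysfs file, readable as root; the
list of threads for the core is hypothetical):

/* purr_vs_tb.c: compare the per-core PURR delta against the timebase
 * delta over one second. Build on powerpc with: gcc -O2 purr_vs_tb.c
 */
#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t read_purr(int cpu)
{
	char path[64];
	uint64_t val = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/purr", cpu);
	f = fopen(path, "r");
	if (!f)
		return 0;	/* offline CPUs have no purr file */
	fscanf(f, "%" SCNx64, &val);
	fclose(f);
	return val;
}

int main(void)
{
	int threads[] = { 0, 1, 2, 3, 4, 5, 6, 7 };	/* threads of core 0 */
	int i, n = sizeof(threads) / sizeof(threads[0]);
	uint64_t purr_before = 0, purr_after = 0, tb_before, tb_after;

	for (i = 0; i < n; i++)
		purr_before += read_purr(threads[i]);
	tb_before = __builtin_ppc_get_timebase();

	sleep(1);

	for (i = 0; i < n; i++)
		purr_after += read_purr(threads[i]);
	tb_after = __builtin_ppc_get_timebase();

	printf("delta tb = %" PRIu64 ", delta purr = %" PRIu64 "\n",
	       tb_after - tb_before, purr_after - purr_before);
	return 0;
}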

This patchset addresses this issue by ensuring that by default, the
offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call
by changing the default value of "cede_offline_enabled" to false.

With the patches, we see that the observed value of the sum of the
PURR increments across the online threads of a core in 1 second
matches the number of tb-ticks in 1 second.

SMT=off
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0]51200512527568  
core01 [  8]51200512556128  
core02 [ 16]51200512590016  
core03 [ 24]51200512589440  

SMT=2
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1]  51200   512635328
core01 [  8,9]  51200   512610416   
core02 [ 16,17] 51200   512639360   
core03 [ 24,25] 51200   512638720   

SMT=4
===
CoreSD-PURR SD-PURR
(expected)  (observed)
===
core00 [  0,1,2,3]  51200   512757328   
core01 [  8,9,10,11]51200   512727920   
core02 [ 16,17,18,19]   51200   512754712   
core03 [ 24,25,26,27]   51200   512739040   

SMT=on
==
Core   SD-PURR SD-PURR

[PATCH v2 1/1] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"

2019-10-22 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently on PSeries Linux guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since they are still
> under the control of the Linux guests. Hence, when we change the SMT
modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will
continue to count the values for offlined CPUs in extended cede as if
they are online. This breaks the accounting in tools such as lparstat.

To fix this, ensure that by default the offlined CPUs are returned to
the hypervisor via RTAS "stop-self" call by changing the default value
of "cede_offline_enabled" to false.

Fixes: 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state")

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/core-api/cpu_hotplug.rst   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/cpu_hotplug.rst 
b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7..5319593 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -53,7 +53,7 @@ Command Line Switches
 ``cede_offline={"off","on"}``
   Use this option to disable/enable putting offlined processors to an extended
   ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
-  ``cede_offline`` is set to "on".
+  ``cede_offline`` is set to "off".
 
   This option is limited to the PowerPC architecture.
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bbda646..f9d0366 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, 
preferred_offline_state) =
 
 static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
 
-static bool cede_offline_enabled __read_mostly = true;
+/*
+ * Determines whether the offlined CPUs should be put to a long term
+ * processor cede (called extended cede) for power-saving
+ * purposes. The CPUs in extended cede are still with the Linux Guest
+ * and are not returned to the Hypervisor.
+ *
+ * By default, the offlined CPUs are returned to the hypervisor via
+ * RTAS "stop-self". This behaviour can be changed by passing the
+ * kernel commandline parameter "cede_offline=on".
+ */
+static bool cede_offline_enabled __read_mostly;
 
 /*
  * Enable/disable cede_offline when available.
-- 
1.9.4



Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Gautham R Shenoy
On Wed, Sep 18, 2019 at 03:14:15PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy"  writes:
> > From: "Gautham R. Shenoy" 
> >
> > Currently on Pseries Linux Guests, the offlined CPU can be put to one
> > of the following two states:
> >- Long term processor cede (also called extended cede)
> >- Returned to the Hypervisor via RTAS "stop-self" call.
> >
> > This is controlled by the kernel boot parameter "cede_offline=on/off".
> >
> > By default the offlined CPUs enter extended cede.
> 
> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an 
> appropriate offline state") (Nov 2009)
> 
> Which you wrote :)

Mea Culpa! I forgot to include the "Fixes commit 3aa565f53c39" into
Patch 1 of the series.

> 
> Why was that wrong?

It was wrong by the definition of what PHYP considers a
"not-active" CPU. From the point of view of that hypervisor, a CPU is
not-active iff it is in RTAS "stop-self". Thus, if a CPU is offlined via
extended cede and is not using any cycles, it is still considered to be
active by PHYP. This is what breaks the PURR accounting.

> 
> > The PHYP hypervisor considers CPUs in extended cede to be "active"
> > since the CPUs are still under the control fo the Linux Guests. Hence, when 
> > we change the
> > SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
> > will continue to count the values for offlined CPUs in extended cede
> > as if they are online.
> >
> > One of the expectations with PURR is that the for an interval of time,
> > the sum of the PURR increments across the online CPUs of a core should
> > equal the number of timebase ticks for that interval.
> >
> > This is currently not the case.
> 
> But why does that matter? It's just some accounting stuff, does it
> actually break something meaningful?

As Naveen mentioned, it breaks lparstat which the customers are using
for capacity planning. Unfortunately we discovered this 10 years after
the feature was written.

> 
> Also what does this do to the latency of CPU online/offline.

It will have a slightly higher latency compared to extended cede,
since it involves an additional rtas-call for both starting and
stopping the CPU. Will measure the exact difference and post it in the
next version.

> And what does this do on KVM?

KVM doesn't seem to depend on the state of the offline VCPU as it has
an explicit way of signalling whether a CPU is online or not, via
KVM_REG_PPC_ONLINE. In commit 7aa15842c15f ("KVM: PPC: Book3S HV: Set
RWMR on POWER8 so PURR/SPURR count correctly") we use this KVM reg to
update the count of online vCPUs in a core, and use this count to set
the RWMR correctly before dispatching the core.

So, this patchset doesn't affect KVM.

> 
> 
> > In the following data (Generated using
> > https://github.com/gautshen/misc/blob/master/purr_tb.py):
> >
> >
> > delta tb = tb ticks elapsed in 1 second.
> > delta purr = sum of PURR increments on online CPUs of that core in 1
> >  second
> >   
> > SMT=off
> > ===
> > Coredelta tb(apprx)  delta purr 
> > ===
> > core00 [  0]51200   69883784
> > core01 [  8]51200   88782536
> > core02 [ 16]51200   94296824
> > core03 [ 24]51200   80951968
> 
> Showing the expected value in another column would make this much
> clearer.

Thanks. Will update the testcase to call out the expected value.
> 
> cheers
> 


--
Thanks and Regards
gautham.


Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-18 Thread Gautham R Shenoy
Hello Nathan, Michael,

On Tue, Sep 17, 2019 at 12:36:35PM -0500, Nathan Lynch wrote:
> Gautham R Shenoy  writes:
> > On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> >> "Gautham R. Shenoy"  writes:
> >> > The patchset also defines a new sysfs attribute
> >> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> >> > to allow userspace programs to change the state into which the
> >> > offlined CPU need to be put to at runtime.
> >> 
> >> A boolean sysfs interface will become awkward if we need to add another
> >> mode in the future.
> >> 
> >> What do you think about naming the attribute something like
> >> 'offline_mode', with the possible values 'extended-cede' and
> >> 'rtas-stopped'?
> >
> > We can do that. However, IMHO in the longer term, on PSeries guests,
> > we should have only one offline state - rtas-stopped.  The reason for
> > this being, that on Linux, SMT switch is brought into effect through
> > the CPU Hotplug interface. The only state in which the SMT switch will
> > recognized by the hypervisors such as PHYP is rtas-stopped.
> 
> OK. Why "longer term" though, instead of doing it now?

Because adding extended-cede as a cpuidle state is non-trivial, since
a CPU in that state is not responsive to external interrupts. We will
need additional changes in the IPI, timer and interrupt code to ensure
that these get translated to an H_PROD in order to wake up the target
CPU in extended CEDE.

Timer: is relatively easy since the cpuidle infrastructure has the
   timer-offload framework (used for fastsleep in POWER8) where we
   can offload the timers of an idling CPU to another CPU which
   can wake up the CPU via an IPI when the timer expires.

IPIs: We need to ensure that icp_hv_set_qirr() correctly sends H_IPI
  or H_PROD depending on whether or not the target CPU is in
  extended CEDE (see the sketch after this list).

Interrupts: Either we migrate away the interrupts from the CPU that is
entering extended CEDE or we prevent a CPU that is the
sole target for an interrupt from entering extended CEDE.
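
For the IPI case, the idea is roughly the following (an illustrative
sketch only; cpu_in_extended_cede() is a hypothetical helper that
does not exist today):

static void icp_hv_cause_ipi(int cpu)
{
	int hwcpu = get_hard_smp_processor_id(cpu);

	/* Hypothetical: track which CPUs are in extended CEDE */
	if (cpu_in_extended_cede(cpu))
		plpar_hcall_norets(H_PROD, hwcpu);
	else
		icp_hv_set_qirr(cpu, IPI_PRIORITY);
}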

The accounting problem in tools such as lparstat with
"cede_offline=on" is affecting customers who are using these tools for
capacity-planning. That problem needs a fix in the short-term, for
which Patch 1 changes the default behaviour of cede_offline from "on"
to "off". Since this patch would break the existing userspace tools
that use the CPU-Offline infrastructure to fold CPUs for saving power,
the sysfs interface allowing a runtime change of cede_offline_enabled
was provided to enable these userspace tools to cope with minimal
change.

> 
> 
> > All other states (such as extended-cede) should in the long-term be
> > exposed via the cpuidle interface.
> >
> > With this in mind, I made the sysfs interface boolean to mirror the
> > current "cede_offline" commandline parameter. Eventually when we have
> > only one offline-state, we can deprecate the commandline parameter as
> > well as the sysfs interface.
> 
> I don't care for adding a sysfs interface that is intended from the
> beginning to become vestigial...

Fair point. Come to think of it, in case the cpuidle menu governor
behaviour doesn't match the expectations provided by the current
userspace solutions for folding idle CPUs for power-savings, it would
be useful to have this option around so that existing users who prefer
the userspace solution can still have that option.

> 
> This strikes me as unnecessarily incremental if you're changing the
> default offline state. Any user space programs depending on the current
> behavior will have to change anyway (and why is it OK to break them?)
>

Yes, the current userspace program will need to be modified to check
for the sysfs interface and change the value to
cede_offline_enabled=1.

> Why isn't the plan:
> 
>   1. Add extended cede support to the pseries cpuidle driver
>   2. Make stop-self the only cpu offline state for pseries (no sysfs
>  interface necessary)

This is the plan, except that 1. requires some additional work and
this patchset is proposed as a short-term mitigation until we get
1. right.

> 
> ?

--
Thanks and Regards
gautham.


Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-15 Thread Gautham R Shenoy
Hello Nathan,

On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> "Gautham R. Shenoy"  writes:
> > The patchset also defines a new sysfs attribute
> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> > to allow userspace programs to change the state into which the
> > offlined CPU need to be put to at runtime.
> 
> A boolean sysfs interface will become awkward if we need to add another
> mode in the future.
> 
> What do you think about naming the attribute something like
> 'offline_mode', with the possible values 'extended-cede' and
> 'rtas-stopped'?

We can do that. However, IMHO in the longer term, on PSeries guests,
we should have only one offline state - rtas-stopped. The reason is
that on Linux, an SMT switch is brought into effect through the CPU
hotplug interface, and the only state in which the SMT switch will be
recognized by hypervisors such as PHYP is rtas-stopped.

All other states (such as extended-cede) should in the long-term be
exposed via the cpuidle interface.

With this in mind, I made the sysfs interface boolean to mirror the
current "cede_offline" commandline parameter. Eventually when we have
only one offline-state, we can deprecate the commandline parameter as
well as the sysfs interface.

Thoughts?

--
Thanks and Regards
gautham.


[PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline

2019-09-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since the CPUs are
still under the control of the Linux Guests. Hence, when we change the
SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs
will continue to count the values for offlined CPUs in extended cede
as if they are online.

One of the expectations with PURR is that for an interval of time,
the sum of the PURR increments across the online CPUs of a core should
equal the number of timebase ticks for that interval.

This is currently not the case.

In the following data (Generated using
https://github.com/gautshen/misc/blob/master/purr_tb.py):


delta tb = tb ticks elapsed in 1 second.
delta purr = sum of PURR increments on online CPUs of that core in 1
 second
  
SMT=off
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0]51200   69883784
core01 [  8]51200   88782536
core02 [ 16]51200   94296824
core03 [ 24]51200   80951968

SMT=2
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0,1]  51200   136147792   
core01 [  8,9]  51200   128636784   
core02 [ 16,17] 51200   135426488   
core03 [ 24,25] 51200   153027520   

SMT=4
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0,1,2,3]  51200   258331616   
core01 [  8,9,10,11]51200   274220072   
core02 [ 16,17,18,19]   51200   260013736   
core03 [ 24,25,26,27]   51200   260079672   

SMT=on
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0,1,2,3,4,5,6,7]  51200   512941248   
core01 [  8,9,10,11,12,13,14,15]51200   512936544   
core02 [ 16,17,18,19,20,21,22,23]   51200   512931544   
core03 [ 24,25,26,27,28,29,30,31]   51200   512923800

This patchset addresses this issue by ensuring that by default, the
offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call
by changing the default value of "cede_offline_enabled" to false.

The patchset also defines a new sysfs attribute
"/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
to allow userspace programs to change the state into which the
offlined CPU need to be put to at runtime. This is intended for
userspace programs that fold CPUs for the purpose of saving energy
when the utilization is low. Setting the value of this attribute
ensures that subsequent CPU offline operations will put the offlined
CPUs to extended cede. However, it will cause inconsistencies in the
PURR accounting. Clearing the attribute will make the offlined CPUs
call the RTAS "stop-self" call thereby returning the CPU to the
hypervisor.

With the patches,

SMT=off
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0]51200512527568  
core01 [  8]51200512556128  
core02 [ 16]51200512590016  
core03 [ 24]51200512589440  

SMT=2
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0,1]  51200   512635328
core01 [  8,9]  51200   512610416   
core02 [ 16,17] 51200   512639360   
core03 [ 24,25] 51200   512638720   

SMT=4
===
Coredelta tb(apprx)  delta purr 
===
core00 [  0,1,2,3]  51200   512757328   
core01 [  8,9,10,11]51200   512727920   
core02 [ 16,17,18,19]   51200   512754712   
core03 [ 24,25,26,27]   51200   512739040   

SMT=on
==
Core   delta tb(apprx)  delta purr  
==
core00 [  0,1,2,3,4,5,6,7]   

[PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"

2019-09-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently on Pseries Linux Guests, the offlined CPU can be put to one
of the following two states:
   - Long term processor cede (also called extended cede)
   - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off".

By default the offlined CPUs enter extended cede. The PHYP hypervisor
considers CPUs in extended cede to be "active" since they are still
under the control of the Linux Guests. Hence, when we change the SMT
modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will
continue to count the values for offlined CPUs in extended cede as if
they are online. This breaks the accounting in tools such as lparstat.

To fix this, ensure that by default the offlined CPUs are returned to
the Hypervisor via RTAS "stop-self" call by changing the default value
of "cede_offline_enabled" to false.

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/core-api/cpu_hotplug.rst   |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/cpu_hotplug.rst 
b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7..5319593 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -53,7 +53,7 @@ Command Line Switches
 ``cede_offline={"off","on"}``
   Use this option to disable/enable putting offlined processors to an extended
   ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
-  ``cede_offline`` is set to "on".
+  ``cede_offline`` is set to "off".
 
   This option is limited to the PowerPC architecture.
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bbda646..f9d0366 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, 
preferred_offline_state) =
 
 static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;
 
-static bool cede_offline_enabled __read_mostly = true;
+/*
+ * Determines whether the offlined CPUs should be put to a long term
+ * processor cede (called extended cede) for power-saving
+ * purposes. The CPUs in extended cede are still with the Linux Guest
+ * and are not returned to the Hypervisor.
+ *
+ * By default, the offlined CPUs are returned to the hypervisor via
+ * RTAS "stop-self". This behaviour can be changed by passing the
+ * kernel commandline parameter "cede_offline=on".
+ */
+static bool cede_offline_enabled __read_mostly;
 
 /*
  * Enable/disable cede_offline when available.
-- 
1.9.4



[PATCH 2/2] pseries/hotplug-cpu: Add sysfs attribute for cede_offline

2019-09-12 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Define a new sysfs attribute
"/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
to allow userspace programs to change the state into which the
offlined CPU need to be put to at runtime. This is intended for
userspace programs that fold CPUs for the purpose of saving energy
when the utilization is low.

Setting the value of this attribute ensures that subsequent CPU
offline operations will put the offlined CPUs to extended
cede. However, it will cause inconsistencies in the PURR accounting.

Clearing the attribute will make the offlined CPUs call the RTAS
"stop-self" call thereby returning the CPU to the hypervisor.

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 14 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c   | 68 --
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 06d0931..b3c52cd 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -572,3 +572,17 @@ Description:   Secure Virtual Machine
If 1, it means the system is using the Protected Execution
Facility in POWER9 and newer processors. i.e., it is a Secure
Virtual Machine.
+
+What:  /sys/devices/system/cpu/cede_offline_enabled
+Date:  August 2019
+Contact:   Linux kernel mailing list 
+   Linux for PowerPC mailing list 
+Description:   Offline CPU state control
+
+   If 1, it means that offline CPUs on PSeries guests
+   will be made to call an extended CEDE which provides
+   energy savings but at the expense of accuracy of PURR
+   accounting. If 0, the offline CPUs on PSeries guests
+   will be made to call RTAS "stop-self" call which will
+   return the CPUs to the Hypervisor and provide accurate
+   values of PURR. The value is 0 by default.
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index f9d0366..4a04cf7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -943,9 +943,64 @@ static int parse_cede_parameters(void)
 CEDE_LATENCY_PARAM_MAX_LENGTH);
 }
 
-static int __init pseries_cpu_hotplug_init(void)
+/*
+ * Must be guarded by
+ * cpu_maps_update_begin()...cpu_maps_update_done()
+ */
+static void update_default_offline_state(void)
 {
int cpu;
+
+   if (cede_offline_enabled)
+   default_offline_state = CPU_STATE_INACTIVE;
+   else
+   default_offline_state = CPU_STATE_OFFLINE;
+
+   for_each_possible_cpu(cpu)
+   set_default_offline_state(cpu);
+}
+
+static ssize_t show_cede_offline_enabled(struct device *dev,
+struct device_attribute *attr,
+char *buf)
+{
+   unsigned long ret = 0;
+
+   if (cede_offline_enabled)
+   ret = 1;
+
+   return sprintf(buf, "%lx\n", ret);
+}
+
+static ssize_t store_cede_offline_enabled(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+   bool val;
+   int ret = 0;
+
+   ret = kstrtobool(buf, &val);
+   if (ret)
+   return -EINVAL;
+
+   cpu_maps_update_begin();
+   /* Check if anything needs to be done */
+   if (val == cede_offline_enabled)
+   goto done;
+   cede_offline_enabled = val;
+   update_default_offline_state();
+done:
+   cpu_maps_update_done();
+
+   return count;
+}
+
+static DEVICE_ATTR(cede_offline_enabled, 0600,
+  show_cede_offline_enabled,
+  store_cede_offline_enabled);
+
+static int __init pseries_cpu_hotplug_init(void)
+{
int qcss_tok;
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
@@ -971,11 +1026,12 @@ static int __init pseries_cpu_hotplug_init(void)
if (firmware_has_feature(FW_FEATURE_LPAR)) {
of_reconfig_notifier_register(_smp_nb);
cpu_maps_update_begin();
-   if (cede_offline_enabled && parse_cede_parameters() == 0) {
-   default_offline_state = CPU_STATE_INACTIVE;
-   for_each_online_cpu(cpu)
-   set_default_offline_state(cpu);
-   }
+   if (parse_cede_parameters() == 0)
+   device_create_file(cpu_subsys.dev_root,
+  _attr_cede_offline_enabled);
+   else /* Extended cede is not supported */
+   cede_offline_ena

[PATCH] powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()

2019-07-17 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

xive_find_target_in_mask() has the following for(;;) loop which has a
bug when @first == cpumask_first(@mask) and condition 1 fails to hold
for every CPU in @mask. In this case we loop forever in the for-loop.

  first = cpu;
  for (;;) {
  if (cpu_online(cpu) && xive_try_pick_target(cpu)) // condition 1
  return cpu;
  cpu = cpumask_next(cpu, mask);
  if (cpu == first) // condition 2
  break;

  if (cpu >= nr_cpu_ids) // condition 3
  cpu = cpumask_first(mask);
  }

This is because, when @first == cpumask_first(@mask), we never hit the
condition 2 (cpu == first) since prior to this check, we would have
executed "cpu = cpumask_next(cpu, mask)" which will set the value of
@cpu to a value greater than @first or to nr_cpu_ids. When this is
coupled with the fact that condition 1 is not met, we will never exit
this loop.

This was discovered by the hard-lockup detector while running LTP test
concurrently with SMT switch tests.

 watchdog: CPU 12 detected hard LOCKUP on other CPUs 68
 watchdog: CPU 12 TB:85587019220796, last SMP heartbeat TB:85578827223399 
(15999ms ago)
 watchdog: CPU 68 Hard LOCKUP
 watchdog: CPU 68 TB:85587019361273, last heartbeat TB:85576815065016 (19930ms 
ago)
 CPU: 68 PID: 45050 Comm: hxediag Kdump: loaded Not tainted 
4.18.0-100.el8.ppc64le #1
 NIP:  c06f5578 LR: c0cba9ec CTR: 
 REGS: c000201fff3c7d80 TRAP: 0100   Not tainted  (4.18.0-100.el8.ppc64le)
 MSR:  92883033   CR: 24028424  XER: 

 CFAR: c06f558c IRQMASK: 1
 GPR00: c00afc58 c000201c01c43400 c15ce500 c000201cae26ec18
 GPR04: 0800 0540 0800 00f8
 GPR08: 0020 00a8 8000 c0081a1beed8
 GPR12: c00b1410 c000201fff7f4c00  
 GPR16:   0540 0001
 GPR20: 0048 1011 c0081a1e3780 c000201cae26ed18
 GPR24:  c000201cae26ed8c 0001 c1116bc0
 GPR28: c1601ee8 c1602494 c000201cae26ec18 001f
 NIP [c06f5578] find_next_bit+0x38/0x90
 LR [c0cba9ec] cpumask_next+0x2c/0x50
 Call Trace:
 [c000201c01c43400] [c000201cae26ec18] 0xc000201cae26ec18 (unreliable)
 [c000201c01c43420] [c00afc58] xive_find_target_in_mask+0x1b8/0x240
 [c000201c01c43470] [c00b0228] xive_pick_irq_target.isra.3+0x168/0x1f0
 [c000201c01c435c0] [c00b1470] xive_irq_startup+0x60/0x260
 [c000201c01c43640] [c01d8328] __irq_startup+0x58/0xf0
 [c000201c01c43670] [c01d844c] irq_startup+0x8c/0x1a0
 [c000201c01c436b0] [c01d57b0] __setup_irq+0x9f0/0xa90
 [c000201c01c43760] [c01d5aa0] request_threaded_irq+0x140/0x220
 [c000201c01c437d0] [c0081a17b3d4] bnx2x_nic_load+0x188c/0x3040 [bnx2x]
 [c000201c01c43950] [c0081a187c44] bnx2x_self_test+0x1fc/0x1f70 [bnx2x]
 [c000201c01c43a90] [c0adc748] dev_ethtool+0x11d8/0x2cb0
 [c000201c01c43b60] [c0b0b61c] dev_ioctl+0x5ac/0xa50
 [c000201c01c43bf0] [c0a8d4ec] sock_do_ioctl+0xbc/0x1b0
 [c000201c01c43c60] [c0a8dfb8] sock_ioctl+0x258/0x4f0
 [c000201c01c43d20] [c04c9704] do_vfs_ioctl+0xd4/0xa70
 [c000201c01c43de0] [c04ca274] sys_ioctl+0xc4/0x160
 [c000201c01c43e30] [c000b388] system_call+0x5c/0x70
 Instruction dump:
 78aad182 54a806be 3920 78a50664 794a1f24 7d294036 7d43502a 7d295039
 4182001c 4834 78a9d182 79291f24 <7d23482a> 2fa9 409e0020 38a50040

To fix this, move the check for condition 2 after the check for
condition 3, so that we are able to break out of the loop soon after
iterating through all the CPUs in the @mask in the problem case. Use
do..while() to achieve this.
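
In other words, the fixed loop looks like this (a sketch; the
wrap-around to cpumask_first() now happens before the termination
check, and do..while() guarantees that @first itself is tested exactly
once):

  first = cpu;
  do {
          if (cpu_online(cpu) && xive_try_pick_target(cpu))
                  return cpu;
          cpu = cpumask_next(cpu, mask);
          if (cpu >= nr_cpu_ids)
                  cpu = cpumask_first(mask);
  } while (cpu != first);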

Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE
interrupt controller")
Cc:  # 4.12+
Reported-by: Indira P. Joga 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/sysdev/xive/common.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 082c7e1..1cdb395 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -479,7 +479,7 @@ static int xive_find_target_in_mask(const struct cpumask 
*mask,
 * Now go through the entire mask until we find a valid
 * target.
 */
-   for (;;) {
+   do {
/*
 * We re-check online as the fallback case passes us
 * an untested affinity mask
@@ -487,12 +487,11 @@ static int xive_find_target_in_mask(const struct cpumask 
*mask,
if (cpu_online(cpu) && xive_try_pick_target(cpu))
return cpu;
cpu = cpumask_next(cpu, mask);
-   

Re: [PATCH v3] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()

2019-06-03 Thread Gautham R Shenoy
Hi,

On Wed, May 15, 2019 at 01:15:52PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> The calls to arch_add_memory()/arch_remove_memory() are always made
> with the read-side cpu_hotplug_lock acquired via
> memory_hotplug_begin().  On pSeries,
> arch_add_memory()/arch_remove_memory() eventually call resize_hpt()
> which in turn calls stop_machine() which acquires the read-side
> cpu_hotplug_lock again, thereby resulting in the recursive acquisition
> of this lock.

A clarification regarding why we hadn't observed this problem earlier.

In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system
lockup during a memory hotplug operation because cpus_read_lock() is a
per-cpu rwsem read, which, in the fast-path (in the absence of the
writer, which in our case is a CPU-hotplug operation) simply
increments the read_count on the semaphore. Thus a recursive read in
the fast-path doesn't cause any problems.

However, we can hit this problem in practice if there is a concurrent
CPU-Hotplug operation in progress which is waiting to acquire the
write-side of the lock. This will cause the second recursive read to
block until the writer finishes. The writer, in turn, remains blocked
since the first read still holds the lock. Thus both the reader and
the writer fail to make any progress, thereby blocking both
CPU-Hotplug as well as Memory-Hotplug operations.

Memory-Hotplug                          CPU-Hotplug
CPU 0                                   CPU 1
--------------                          -----------

1. down_read(cpu_hotplug_lock.rw_sem)
   [memory_hotplug_begin]
                                        2. down_write(cpu_hotplug_lock.rw_sem)
                                           [cpu_up/cpu_down]
3. down_read(cpu_hotplug_lock.rw_sem)
   [stop_machine()]
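
In short, the fix (see the patch in this thread) is to take the
read-side lock once at the outermost level and to use the cpuslocked
variant of stop_machine() underneath. A sketch of the resulting
pattern in hpt_order_set():

	cpus_read_lock();
	ret = mmu_hash_ops.resize_hpt(val);	/* ends up in stop_machine_cpuslocked() */
	cpus_read_unlock();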


> 
> Lockdep complains as follows in these code-paths.
> 
>  swapper/0/1 is trying to acquire lock:
>  (ptrval) (cpu_hotplug_lock.rw_sem){}, at: stop_machine+0x2c/0x60
> 
> but task is already holding lock:
> (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> mem_hotplug_begin+0x20/0x50
> 
>  other info that might help us debug this:
>   Possible unsafe locking scenario:
> 
> CPU0
> 
>lock(cpu_hotplug_lock.rw_sem);
>lock(cpu_hotplug_lock.rw_sem);
> 
>   *** DEADLOCK ***
> 
>   May be due to missing lock nesting notation
> 
>  3 locks held by swapper/0/1:
>   #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x12c/0x1b0
>   #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> mem_hotplug_begin+0x20/0x50
>   #2: (ptrval) (mem_hotplug_lock.rw_sem){}, at: 
> percpu_down_write+0x54/0x1a0
> 
> stack backtrace:
>  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.0.0-rc5-58373-gbc99402235f3-dirty #166
>  Call Trace:
>  [c000feb03150] [c0e32bd4] dump_stack+0xe8/0x164 (unreliable)
>  [c000feb031a0] [c020d6c0] __lock_acquire+0x1110/0x1c70
>  [c000feb03320] [c020f080] lock_acquire+0x240/0x290
>  [c000feb033e0] [c017f554] cpus_read_lock+0x64/0xf0
>  [c000feb03420] [c029ebac] stop_machine+0x2c/0x60
>  [c000feb03460] [c00d7f7c] pseries_lpar_resize_hpt+0x19c/0x2c0
>  [c000feb03500] [c00788d0] resize_hpt_for_hotplug+0x70/0xd0
>  [c000feb03570] [c0e5d278] arch_add_memory+0x58/0xfc
>  [c000feb03610] [c03553a8] devm_memremap_pages+0x5e8/0x8f0
>  [c000feb036c0] [c09c2394] pmem_attach_disk+0x764/0x830
>  [c000feb037d0] [c09a7c38] nvdimm_bus_probe+0x118/0x240
>  [c000feb03860] [c0968500] really_probe+0x230/0x4b0
>  [c000feb038f0] [c0968aec] driver_probe_device+0x16c/0x1e0
>  [c000feb03970] [c0968ca8] __driver_attach+0x148/0x1b0
>  [c000feb039f0] [c09650b0] bus_for_each_dev+0x90/0x130
>  [c000feb03a50] [c0967dd4] driver_attach+0x34/0x50
>  [c000feb03a70] [c0967068] bus_add_driver+0x1a8/0x360
>  [c000feb03b00] [c096a498] driver_register+0x108/0x170
>  [c000feb03b70] [c09a7400] __nd_driver_register+0xd0/0xf0
>  [c000feb03bd0] [c128aa90] nd_pmem_driver_init+0x34/0x48
>  [c000feb03bf0] [c0010a10] do_one_initcall+0x1e0/0x45c
>  [c000feb03cd0] [c122462c] kernel_init_freeable+0x540/0x64c
>  [c000feb03db0] [c001110c] kernel_init+0x2c/0x160
>  [c000feb03e20] [c000bed4] ret_from_kernel_thread+0x5c/0x68
> 
> Fix this issue by
>   1) Requiring all the calls to pseries_lpar_resize_hpt() be made
>  with cpu_hotplug_lock held.
> 
>   2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
>  as a consequence of 1)
> 
> 3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
>    with cpu_hotplug_lock held.

Re: [PATCH] cpupower : frequency-set -r option misses the last cpu in related cpu list

2019-05-29 Thread Gautham R Shenoy
Hi Abhishek,

On Wed, May 29, 2019 at 3:02 PM Abhishek Goel
 wrote:
>
> To set frequency on specific cpus using cpupower, following syntax can
> be used :
> cpupower -c #i frequency-set -f #f -r
>
> While setting frequency using cpupower frequency-set command, if we use
> '-r' option, it is expected to set frequency for all cpus related to
> cpu #i. But it is observed to be missing the last cpu in related cpu
> list. This patch fixes the problem.
>
> Signed-off-by: Abhishek Goel 
> ---
>  tools/power/cpupower/utils/cpufreq-set.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/power/cpupower/utils/cpufreq-set.c 
> b/tools/power/cpupower/utils/cpufreq-set.c
> index 1eef0aed6..08a405593 100644
> --- a/tools/power/cpupower/utils/cpufreq-set.c
> +++ b/tools/power/cpupower/utils/cpufreq-set.c
> @@ -306,6 +306,8 @@ int cmd_freq_set(int argc, char **argv)
> bitmask_setbit(cpus_chosen, cpus->cpu);
> cpus = cpus->next;
> }
> +   /* Set the last cpu in related cpus list */
> +   bitmask_setbit(cpus_chosen, cpus->cpu);

Perhaps you could convert the while() loop to a do .. while(). That
will ensure that we terminate the loop after setting the last valid
CPU.
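
Something along these lines (an untested sketch, assuming cpus is
non-NULL at this point):

	do {
		bitmask_setbit(cpus_chosen, cpus->cpu);
		cpus = cpus->next;
	} while (cpus);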


> cpufreq_put_related_cpus(cpus);
> }
> }
> --
> 2.17.1
>


-- 
Thanks and Regards
gautham.


Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv

2019-05-16 Thread Gautham R Shenoy
Hi Nicholas,

On Thu, May 16, 2019 at 04:13:17PM +1000, Nicholas Piggin wrote:

> 
> > The motivation behind this patch was a HPC customer issue where they
> > were observing some CPUs in the core getting stuck at stop0_lite
> > state, thereby lowering the performance on the other CPUs of the core
> > which were running the application.
> > 
> > Disabling stop0_lite via sysfs didn't help since we would fallback to
> > snooze and it would make matters worse.
> 
> snooze has the timeout though, so it should kick into stop0 properly
> (and if it doesn't that's another issue that should be fixed in this
> series).
>
> I'm not questioning the patch for stop0_lite, to be clear. I think
> the logic is sound. I just raise one unrelated issue that happens to
> be for stop0_lite as well (should we even enable it on P9?), and one
> peripheral issue (should we make a similar fix for deeper stop states?)
>

I think it makes sense to generalize this from the point of view of
CPUs remaining in shallower idle states for long durations on tickless
kernels.

> > 
> >> 
> >> We should always have fewer states unless proven otherwise.
> > 
> > I agree.
> > 
> >> 
> >> That said, we enable it today so I don't want to argue this point
> >> here, because it is a different issue from your patch.
> >> 
> >> > When it is in stop0 or deeper, it frees up both space and the time
> >> > slice of the core.
> >> > In stop0_lite, the cpu doesn't free up the core resources and thus
> >> > inhibits thread folding. When a cpu goes to stop0, it will free up the
> >> > core resources, thus increasing the single thread performance of the
> >> > other sibling threads.
> >> > Hence, we do not want to get stuck in stop0_lite for a long duration,
> >> > and want to quickly move onto the next state.
> >> > If we get stuck in any other state we would possibly lose out on power
> >> > savings, but will still be able to gain the performance benefits for
> >> > the other sibling threads.
> >> 
> >> That's true, but stop0 -> deeper stop is also a benefit (for
> >> performance if we have some power/thermal constraints, and/or for power
> >> usage).
> >> 
> >> Sure it may not be so noticable as the SMT switch, but I just wonder
> >> if the infrastructure should be there for the same reason.
> >> 
> >> I was testing interrupt frequency on some tickless workloads configs,
> >> and without too much trouble you can get CPUs to sleep with no
> >> interrupts for many minutes. Hours even. We wouldn't want the CPU to
> >> stay in stop0 for that long.
> > 
> > If it stays in stop0 or even stop2 for that long, we would want to
> > "promote" it to a deeper state, such as say STOP5 which allows the
> > other cores to run at higher frequencies.
> 
> So we would want this same logic for all but the deepest runtime
> stop state?

Yes. We can, in steps, promote individual threads of the core to
eventually request a deeper state such as stop4/5. On a completely
idle tickless system, eventually we should see the core go to the
deeper idle state.

> 
> >> Just thinking about the patch itself, I wonder do you need a full
> >> kernel timer, or could we just set the decrementer? Is there much 
> >> performance cost here?
> >>
> > 
> > Good point. A decrementer would do actually.
> 
> That would be good if it does, might save a few cycles.
> 
> Thanks,
> Nick
>

--
Thanks and Regards
gautham.



Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv

2019-05-15 Thread Gautham R Shenoy
Hello Nicholas,


On Thu, May 16, 2019 at 02:55:42PM +1000, Nicholas Piggin wrote:
> Abhishek's on May 13, 2019 7:49 pm:
> > On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
> >> Abhishek Goel's on April 22, 2019 4:32 pm:
> >>> Currently, the cpuidle governors determine what idle state an idling CPU
> >>> should enter into based on heuristics that depend on the idle history on
> >>> that CPU. Given that no predictive heuristic is perfect, there are cases
> >>> where the governor predicts a shallow idle state, hoping that the CPU will
> >>> be busy soon. However, if no new workload is scheduled on that CPU in the
> >>> near future, the CPU will end up in the shallow state.
> >>>
> >>> Motivation
> >>> --
> >>> In case of POWER, this is problematic, when the predicted state in the
> >>> aforementioned scenario is a lite stop state, as such lite states will
> >>> inhibit SMT folding, thereby depriving the other threads in the core from
> >>> using the core resources.
> >>>
> >>> So we do not want to get stuck in such states for a long duration. To
> >>> address this, the cpuidle-core can queue a timer to correspond with the
> >>> residency value of the next available state. This timer will forcefully
> >>> wake up the cpu. A few such iterations will essentially train the
> >>> governor to
> >>> select a deeper state for that cpu, as the timer here corresponds to the
> >>> next available cpuidle state residency. Cpu will be kicked out of the lite
> >>> state and end up in a non-lite state.
> >>>
> >>> Experiment
> >>> --
> >>> I performed experiments for three scenarios to collect some data.
> >>>
> >>> case 1 :
> >>> Without this patch and without the tick retained, i.e. in an upstream
> >>> kernel, it would take even more than a second to get out of stop0_lite.
> >>>
> >>> case 2 : With the tick retained in an upstream kernel -
> >>>
> >>> Generally, we have a sched tick at 4ms (CONFIG_HZ = 250). Ideally I
> >>> expected it to take 8 sched ticks to get out of stop0_lite.
> >>> Experimentally, the observation was
> >>>
> >>> =
> >>> sample  minmax   99percentile
> >>> 20  4ms12ms  4ms
> >>> =
> >>>
> >>> It would take at least one sched tick to get out of stop0_lite.
> >>>
> >>> case 3 :  With this patch (not stopping tick, but explicitly queuing a
> >>>timer)
> >>>
> >>> 
> >>> sample  min max 99percentile
> >>> 
> >>> 20  144us   192us   144us
> >>> 
> >>>
> >>> In this patch, we queue a timer just before entering into a stop0_lite
> >>> state. The timer fires at (residency of next available state + exit 
> >>> latency
> >>> of next available state * 2). Let's say if next state(stop0) is available
> >>> which has residency of 20us, it should get out in as low as (20+2*2)*8
> >>> [Based on the formula (residency + 2xlatency)*history length] microseconds
> >>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
> >>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
> >>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
> >>> into stop2.
> >>>
> >>> So, We are able to get out of stop0_lite generally in 150us(with this
> >>> patch) as compared to 4ms(with tick retained). As stated earlier, we do 
> >>> not
> >>> want to get stuck into stop0_lite as it inhibits SMT folding for other
> >>> sibling threads, depriving them of core resources. Current patch is using
> >>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
> >>> reason) along with lowering down power consumption. We may extend this
> >>> model for other states in future.
> >> I still have to wonder, between our snooze loop and stop0, what does
> >> stop0_lite buy us.
> >>
> >> That said, the problem you're solving here is a generic one that all
> >> stop states have, I think. Doesn't the same thing apply going from
> >> stop0 to stop5? You might under estimate the sleep time and lose power
> >> savings and therefore performance there too. Shouldn't we make it
> >> generic for all stop states?
> >>
> >> Thanks,
> >> Nick
> >>
> >>
> > When a cpu is in snooze, it takes both space and time of the core.
> > When in stop0_lite, it frees up time but still takes space.
> 
> True, but snooze should only be taking less than 1% of front end
> cycles. I appreciate there is some non-zero difference here, I just
> wonder in practice what exactly we gain by it.

The idea behind implementing a lite-state was that on future
platforms it can be made to wait on a flag and hence act as a
replacement for snooze. On POWER9 we don't have this feature.

The motivation 

[PATCH v3] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()

2019-05-15 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The calls to arch_add_memory()/arch_remove_memory() are always made
with the read-side cpu_hotplug_lock acquired via
memory_hotplug_begin().  On pSeries,
arch_add_memory()/arch_remove_memory() eventually call resize_hpt()
which in turn calls stop_machine() which acquires the read-side
cpu_hotplug_lock again, thereby resulting in the recursive acquisition
of this lock.

Lockdep complains as follows in these code-paths.

 swapper/0/1 is trying to acquire lock:
 (ptrval) (cpu_hotplug_lock.rw_sem){}, at: stop_machine+0x2c/0x60

but task is already holding lock:
(ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
mem_hotplug_begin+0x20/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock(cpu_hotplug_lock.rw_sem);
   lock(cpu_hotplug_lock.rw_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 3 locks held by swapper/0/1:
  #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x12c/0x1b0
  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
mem_hotplug_begin+0x20/0x50
  #2: (ptrval) (mem_hotplug_lock.rw_sem){}, at: 
percpu_down_write+0x54/0x1a0

stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty 
#166
 Call Trace:
 [c000feb03150] [c0e32bd4] dump_stack+0xe8/0x164 (unreliable)
 [c000feb031a0] [c020d6c0] __lock_acquire+0x1110/0x1c70
 [c000feb03320] [c020f080] lock_acquire+0x240/0x290
 [c000feb033e0] [c017f554] cpus_read_lock+0x64/0xf0
 [c000feb03420] [c029ebac] stop_machine+0x2c/0x60
 [c000feb03460] [c00d7f7c] pseries_lpar_resize_hpt+0x19c/0x2c0
 [c000feb03500] [c00788d0] resize_hpt_for_hotplug+0x70/0xd0
 [c000feb03570] [c0e5d278] arch_add_memory+0x58/0xfc
 [c000feb03610] [c03553a8] devm_memremap_pages+0x5e8/0x8f0
 [c000feb036c0] [c09c2394] pmem_attach_disk+0x764/0x830
 [c000feb037d0] [c09a7c38] nvdimm_bus_probe+0x118/0x240
 [c000feb03860] [c0968500] really_probe+0x230/0x4b0
 [c000feb038f0] [c0968aec] driver_probe_device+0x16c/0x1e0
 [c000feb03970] [c0968ca8] __driver_attach+0x148/0x1b0
 [c000feb039f0] [c09650b0] bus_for_each_dev+0x90/0x130
 [c000feb03a50] [c0967dd4] driver_attach+0x34/0x50
 [c000feb03a70] [c0967068] bus_add_driver+0x1a8/0x360
 [c000feb03b00] [c096a498] driver_register+0x108/0x170
 [c000feb03b70] [c09a7400] __nd_driver_register+0xd0/0xf0
 [c000feb03bd0] [c128aa90] nd_pmem_driver_init+0x34/0x48
 [c000feb03bf0] [c0010a10] do_one_initcall+0x1e0/0x45c
 [c000feb03cd0] [c122462c] kernel_init_freeable+0x540/0x64c
 [c000feb03db0] [c001110c] kernel_init+0x2c/0x160
 [c000feb03e20] [c000bed4] ret_from_kernel_thread+0x5c/0x68

Fix this issue by
  1) Requiring all the calls to pseries_lpar_resize_hpt() be made
 with cpu_hotplug_lock held.

  2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
 as a consequence of 1)

  3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
 with cpu_hotplug_lock held.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Gautham R. Shenoy 
---
v2 -> v3 : Updated the comment for pseries_lpar_resize_hpt()
   Updated the commit-log with the full backtrace.
v1 -> v2 : Rebased against powerpc/next instead of linux/master

 arch/powerpc/mm/book3s64/hash_utils.c | 9 -
 arch/powerpc/platforms/pseries/lpar.c | 8 ++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 919a861..d07fcafd 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1928,10 +1929,16 @@ static int hpt_order_get(void *data, u64 *val)
 
 static int hpt_order_set(void *data, u64 val)
 {
+   int ret;
+
if (!mmu_hash_ops.resize_hpt)
return -ENODEV;
 
-   return mmu_hash_ops.resize_hpt(val);
+   cpus_read_lock();
+   ret = mmu_hash_ops.resize_hpt(val);
+   cpus_read_unlock();
+
+   return ret;
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(fops_hpt_order, hpt_order_get, hpt_order_set, 
"%llu\n");
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 1034ef1..557d592 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -859,7 +859,10 @@ static int pseries_lpar_resize_hpt_commit(void *data)
return 0;
 }
 
-/* Must be called in user context */
+/*
+ * Must be called in process context. The caller must hold the
+ * cpus_lock.
+ */
 static int pseries_lpar_resize_hpt(unsigned long shift)
 {
struct hpt_res

Re: [RESEND PATCH] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt

2019-05-14 Thread Gautham R Shenoy
On Tue, May 14, 2019 at 05:02:16PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy"  writes:
> > From: "Gautham R. Shenoy" 
> >
> > Subject: Re: [RESEND PATCH] powerpc/pseries: Fix cpu_hotplug_lock 
> > acquisition in resize_hpt
> 
> ps. A "RESEND" implies the patch is unchanged and you're just resending
> it because it was ignored.
> 
> In this case it should have just been "PATCH v2", with a note below the "---"
> saying "v2: Rebased onto powerpc/next ..."

Ok. I will send a v3 :-)

> 
> cheers
> 
> > During a memory hotplug operations involving resizing of the HPT, we
> > invoke a stop_machine() to perform the resizing. In this code path, we
> > end up recursively taking the cpu_hotplug_lock, first in
> > memory_hotplug_begin() and then subsequently in stop_machine(). This
> > causes the system to hang. With lockdep enabled we get the following
> > error message before the hang.
> >
> >   swapper/0/1 is trying to acquire lock:
> >   (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> > stop_machine+0x2c/0x60
> >
> >   but task is already holding lock:
> >   (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> > mem_hotplug_begin+0x20/0x50
> >
> >   other info that might help us debug this:
> >Possible unsafe locking scenario:
> >
> >  CPU0
> >  
> > lock(cpu_hotplug_lock.rw_sem);
> > lock(cpu_hotplug_lock.rw_sem);
> >
> >*** DEADLOCK ***
> >
> > Fix this issue by
> >   1) Requiring all the calls to pseries_lpar_resize_hpt() be made
> >  with cpu_hotplug_lock held.
> >
> >   2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
> >  as a consequence of 1)
> >
> >   3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
> >  with cpu_hotplug_lock held.
> >
> > Reported-by: Aneesh Kumar K.V 
> > Signed-off-by: Gautham R. Shenoy 
> > ---
> >
> > Rebased this one against powerpc/next instead of linux/master.
> >
> >  arch/powerpc/mm/book3s64/hash_utils.c | 9 -
> >  arch/powerpc/platforms/pseries/lpar.c | 8 ++--
> >  2 files changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 919a861..d07fcafd 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -38,6 +38,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -1928,10 +1929,16 @@ static int hpt_order_get(void *data, u64 *val)
> >  
> >  static int hpt_order_set(void *data, u64 val)
> >  {
> > +   int ret;
> > +
> > if (!mmu_hash_ops.resize_hpt)
> > return -ENODEV;
> >  
> > -   return mmu_hash_ops.resize_hpt(val);
> > +   cpus_read_lock();
> > +   ret = mmu_hash_ops.resize_hpt(val);
> > +   cpus_read_unlock();
> > +
> > +   return ret;
> >  }
> >  
> >  DEFINE_DEBUGFS_ATTRIBUTE(fops_hpt_order, hpt_order_get, hpt_order_set, 
> > "%llu\n");
> > diff --git a/arch/powerpc/platforms/pseries/lpar.c 
> > b/arch/powerpc/platforms/pseries/lpar.c
> > index 1034ef1..2fc9756 100644
> > --- a/arch/powerpc/platforms/pseries/lpar.c
> > +++ b/arch/powerpc/platforms/pseries/lpar.c
> > @@ -859,7 +859,10 @@ static int pseries_lpar_resize_hpt_commit(void *data)
> > return 0;
> >  }
> >  
> > -/* Must be called in user context */
> > +/*
> > + * Must be called in user context. The caller should hold the
> > + * cpus_lock.
> > + */
> >  static int pseries_lpar_resize_hpt(unsigned long shift)
> >  {
> > struct hpt_resize_state state = {
> > @@ -913,7 +916,8 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
> >  
> > t1 = ktime_get();
> >  
> > -   rc = stop_machine(pseries_lpar_resize_hpt_commit, , NULL);
> > +   rc = stop_machine_cpuslocked(pseries_lpar_resize_hpt_commit,
> > +, NULL);
> >  
> > t2 = ktime_get();
> >  
> > -- 
> > 1.9.4
> 



Re: [RESEND PATCH] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt

2019-05-14 Thread Gautham R Shenoy
Hi Michael,

On Tue, May 14, 2019 at 05:00:19PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy"  writes:
> > From: "Gautham R. Shenoy" 
> >
> > During a memory hotplug operations involving resizing of the HPT, we
> > invoke a stop_machine() to perform the resizing. In this code path, we
> > end up recursively taking the cpu_hotplug_lock, first in
> > memory_hotplug_begin() and then subsequently in stop_machine(). This
> > causes the system to hang.
>
> This implies we have never tested a memory hotplug that resized the HPT.
> Is that really true? Or did something change?
>

This was reported by Aneesh during a testcase involving reconfiguring
the namespace for nvdimm where we do a memory remove followed by
add. The memory add invokes resize_hpt().

It seems we can hit this issue when we perform a memory hotplug/unplug
in the guest.

> > With lockdep enabled we get the following
> > error message before the hang.
> >
> >   swapper/0/1 is trying to acquire lock:
> >   (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> > stop_machine+0x2c/0x60
> >
> >   but task is already holding lock:
> >   (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
> > mem_hotplug_begin+0x20/0x50
> 
> Do we have the full stack trace?

Yes, here is the complete log:

[0.537123] swapper/0/1 is trying to acquire lock:
[0.537197] (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
stop_machine+0x2c/0x60
[0.537336]
[0.537336] but task is already holding lock:
[0.537429] (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
mem_hotplug_begin+0x20/0x50

[0.537570]   
[0.537570] other info that might help us debug this:
[0.537663]  Possible unsafe locking scenario:
[0.537663]
[0.537756]    CPU0
[0.537832]    lock(cpu_hotplug_lock.rw_sem);
[0.537906]    lock(cpu_hotplug_lock.rw_sem);
[0.537980]
[0.537980]  *** DEADLOCK ***
[0.538074]  May be due to missing lock nesting notation
[0.538168] 3 locks held by swapper/0/1:
[0.538224]  #0: (ptrval) (&dev->mutex){}, at: 
__driver_attach+0x12c/0x1b0
[0.538348]  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: 
mem_hotplug_begin+0x20/0x50
[0.538477]  #2: (ptrval) (mem_hotplug_lock.rw_sem){}, at: 
percpu_down_write+0x54/0x1a0
[0.538608]
[0.538608] stack backtrace:  
[0.538685] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.0.0-rc5-58373-gbc99402235f3-dirty #166
[0.538812] Call Trace:  
[0.538863] [c000feb03150] [c0e32bd4] dump_stack+0xe8/0x164 
(unreliable)
[0.538975] [c000feb031a0] [c020d6c0] 
__lock_acquire+0x1110/0x1c70
[0.539086] [c000feb03320] [c020f080] lock_acquire+0x240/0x290
[0.539180] [c000feb033e0] [c017f554] cpus_read_lock+0x64/0xf0
[0.539273] [c000feb03420] [c029ebac] stop_machine+0x2c/0x60 
[0.539367] [c000feb03460] [c00d7f7c] 
pseries_lpar_resize_hpt+0x19c/0x2c0
[0.539479] [c000feb03500] [c00788d0] 
resize_hpt_for_hotplug+0x70/0xd0
[0.539590] [c000feb03570] [c0e5d278] arch_add_memory+0x58/0xfc
[0.539683] [c000feb03610] [c03553a8] 
devm_memremap_pages+0x5e8/0x8f0
[0.539804] [c000feb036c0] [c09c2394] 
pmem_attach_disk+0x764/0x830
[0.539916] [c000feb037d0] [c09a7c38] 
nvdimm_bus_probe+0x118/0x240
[0.540026] [c000feb03860] [c0968500] really_probe+0x230/0x4b0
[0.540119] [c000feb038f0] [c0968aec] 
driver_probe_device+0x16c/0x1e0
[0.540230] [c000feb03970] [c0968ca8] __driver_attach+0x148/0x1b0
[0.540340] [c000feb039f0] [c09650b0] bus_for_each_dev+0x90/0x130
[0.540451] [c000feb03a50] [c0967dd4] driver_attach+0x34/0x50
[0.540544] [c000feb03a70] [c0967068] bus_add_driver+0x1a8/0x360
[0.540654] [c000feb03b00] [c096a498] driver_register+0x108/0x170
[0.540766] [c000feb03b70] [c09a7400] 
__nd_driver_register+0xd0/0xf0
[0.540898] [c000feb03bd0] [c128aa90] 
nd_pmem_driver_init+0x34/0x48
[0.541010] [c000feb03bf0] [c0010a10] do_one_initcall+0x1e0/0x45c
[0.541122] [c000fe

[PATCH] pseries/energy: Use OF accessor functions to read ibm,drc-indexes

2019-03-08 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent,
we currently use of_read_property() to obtain the pointer to the array
corresponding to the property "ibm,drc-indexes". The elements of this
array are of type __be32, but are accessed without any conversion to
the OS-endianness, which is buggy on a Little Endian OS.

Fix this by using of_property_read_u32_index() accessor function to
safely read the elements of the array.
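
For comparison, the minimal open-coded fix would have been an explicit
endianness conversion at the access site, along the lines of (a
sketch, using the same property pointer as the current code):

	/* Elements of "ibm,drc-indexes" are big-endian in the device tree. */
	ret = be32_to_cpu(indexes[thread_index + 1]);

Using the of_property_read_u32_index() accessor instead hides the
conversion and additionally validates the index against the property
length.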

Fixes: commit e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU 
indexes")
Cc:  #v4.16+
Reported-by: Pavithra R. Prakash 
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/pseries/pseries_energy.c | 27 -
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
index 6ed2212..1c4d1ba 100644
--- a/arch/powerpc/platforms/pseries/pseries_energy.c
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -77,18 +77,27 @@ static u32 cpu_to_drc_index(int cpu)
 
ret = drc.drc_index_start + (thread_index * drc.sequential_inc);
} else {
-   const __be32 *indexes;
-
-   indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
-   if (indexes == NULL)
-   goto err_of_node_put;
+   u32 nr_drc_indexes, thread_drc_index;
 
/*
-* The first element indexes[0] is the number of drc_indexes
-* returned in the list.  Hence thread_index+1 will get the
-* drc_index corresponding to core number thread_index.
+* The first element of ibm,drc-indexes array is the
+* number of drc_indexes returned in the list.  Hence
+* thread_index+1 will get the drc_index corresponding
+* to core number thread_index.
 */
-   ret = indexes[thread_index + 1];
+   rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
+   0, _drc_indexes);
+   if (rc)
+   goto err_of_node_put;
+
+   WARN_ON(thread_index > nr_drc_indexes);
+   rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
+   thread_index + 1,
+   _drc_index);
+   if (rc)
+   goto err_of_node_put;
+
+   ret = thread_drc_index;
}
 
rc = 0;
-- 
1.9.4



Re: [PATCH] powernv: powercap: Add hard minimum powercap

2019-02-28 Thread Gautham R Shenoy
Hi Shilpa,

On Thu, Feb 28, 2019 at 11:25:25AM +0530, Shilpasri G Bhat wrote:
> Hi,
> 
> On 02/28/2019 10:14 AM, Daniel Axtens wrote:
> > Shilpasri G Bhat  writes:
> > 
> >> In POWER9, OCC(On-Chip-Controller) provides for hard and soft system
> >> powercapping range. The hard powercap range is guaranteed while soft
> >> powercap may or may not be asserted due to various power-thermal
> >> reasons based on system configuration and workloads. This patch adds
> >> a sysfs file to export the hard minimum powercap limit to allow the
> >> user to set the appropriate powercap value that can be managed by the
> >> system.
> > 
> > Maybe it's common terminology and I'm just not aware of it, but what do
> > you mean by "asserted"? It doesn't appear elsewhere in the documentation
> > you're patching, and it's not a use of assert that I'm familiar with...
> > 
> > Regards,
> > Daniel
> > 
> 
> I meant to say the powercap will not be assured in the soft powercap
> range, i.e., the CPU frequency may or may not be throttled to keep the
> system within the powercap.
> 
> I can reword the document and commit message.

I agree with Daniel. How about replacing "asserted" with "enforced by
the OCC"?


> 
> Thanks and Regards,
> Shilpa

--
Thanks and Regards
gautham.

> 
> >>
> >> Signed-off-by: Shilpasri G Bhat 
> >> ---
> >>  .../ABI/testing/sysfs-firmware-opal-powercap   | 10 
> >>  arch/powerpc/platforms/powernv/opal-powercap.c | 66 
> >> +-
> >>  2 files changed, 37 insertions(+), 39 deletions(-)
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap 
> >> b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> index c9b66ec..65db4c1 100644
> >> --- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> +++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> @@ -29,3 +29,13 @@ Description:System powercap directory and 
> >> attributes applicable for
> >>  creates a request for setting a new-powercap. The
> >>  powercap requested must be between powercap-min
> >>  and powercap-max.
> >> +
> >> +What: 
> >> /sys/firmware/opal/powercap/system-powercap/powercap-hard-min
> >> +Date: Feb 2019
> >> +Contact:  Linux for PowerPC mailing list 
> >> +Description:  Hard minimum powercap
> >> +
> >> +  This file provides the hard minimum powercap limit in
> >> +  Watts. The powercap value above hard minimum is always
> >> +  guaranteed to be asserted and the powercap value below
> >> +  the hard minimum limit may or may not be guaranteed.
> >> diff --git a/arch/powerpc/platforms/powernv/opal-powercap.c 
> >> b/arch/powerpc/platforms/powernv/opal-powercap.c
> >> index d90ee4f..38408e7 100644
> >> --- a/arch/powerpc/platforms/powernv/opal-powercap.c
> >> +++ b/arch/powerpc/platforms/powernv/opal-powercap.c
> >> @@ -139,10 +139,24 @@ static void powercap_add_attr(int handle, const char 
> >> *name,
> >>attr->handle = handle;
> >>sysfs_attr_init(>attr.attr);
> >>attr->attr.attr.name = name;
> >> -  attr->attr.attr.mode = 0444;
> >> +
> >> +  if (!strncmp(name, "powercap-current", strlen(name))) {
> >> +  attr->attr.attr.mode = 0664;
> >> +  attr->attr.store = powercap_store;
> >> +  } else {
> >> +  attr->attr.attr.mode = 0444;
> >> +  }
> >> +
> >>attr->attr.show = powercap_show;
> >>  }
> >>  
> >> +static const char * const powercap_strs[] = {
> >> +  "powercap-max",
> >> +  "powercap-min",
> >> +  "powercap-current",
> >> +  "powercap-hard-min",
> >> +};
> >> +
> >>  void __init opal_powercap_init(void)
> >>  {
> >>struct device_node *powercap, *node;
> >> @@ -167,60 +181,34 @@ void __init opal_powercap_init(void)
> >>  
> >>i = 0;
> >>for_each_child_of_node(powercap, node) {
> >> -  u32 cur, min, max;
> >> -  int j = 0;
> >> -  bool has_cur = false, has_min = false, has_max = false;
> >> +  u32 id;
> >> +  int j, count = 0;
> >>  
> >> -  if (!of_property_read_u32(node, "powercap-min", )) {
> >> -  j++;
> >> -  has_min = true;
> >> -  }
> >> -
> >> -  if (!of_property_read_u32(node, "powercap-max", )) {
> >> -  j++;
> >> -  has_max = true;
> >> -  }
> >> +  for (j = 0; j < ARRAY_SIZE(powercap_strs); j++)
> >> +  if (!of_property_read_u32(node, powercap_strs[j], ))
> >> +  count++;
> >>  
> >> -  if (!of_property_read_u32(node, "powercap-current", )) {
> >> -  j++;
> >> -  has_cur = true;
> >> -  }
> >> -
> >> -  pcaps[i].pattrs = kcalloc(j, sizeof(struct powercap_attr),
> >> +  pcaps[i].pattrs = kcalloc(count, sizeof(struct powercap_attr),
> >>  GFP_KERNEL);
> >>if (!pcaps[i].pattrs)
> >>goto 
