Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Hello Mel,

On Mon, Apr 12, 2021 at 11:48:19AM +0100, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 11:06:19AM +0100, Valentin Schneider wrote:
> > On 12/04/21 10:37, Mel Gorman wrote:
> > > On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
> > > > * Gautham R. Shenoy [2021-04-02 11:07:54]:
> > > > >
> > > > > To remedy this, this patch proposes that the LLC be moved to the MC
> > > > > level which is a group of cores in one half of the chip.
> > > > >
> > > > > SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
> > > > >
> > > > I think marking Hemisphere as a LLC in a P10 scenario is a good idea.
> > > >
> > > > > While there is no cache being shared at this level, this is still the
> > > > > level where some amount of cache-snooping takes place and it is
> > > > > relatively faster to access the data from the caches of the cores
> > > > > within this domain. With this change, we no longer see regressions on
> > > > > P10 for applications which require single threaded performance.
> > > >
> > > > Peter, Valentin, Vincent, Mel, et al.
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > > > suggestions on how we could handle the same in the core scheduler?
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> > > allows within the node with the LLC CPUs masked out.
> >
> > I think that's actually a decent approach. Tying SD_SHARE_PKG_RESOURCES to
> > something other than pure cache topology in a generic manner is tough (as
> > it relies on murky, ill-defined hardware fabric properties).
>
> Agreed. The LLC->node scan idea has been on my TODO list to try for
> a while.

If you have any patches for these, I will be happy to test them on
POWER10. Though, on POWER10, there will be an additional sd between the
LLC and the DIE domain.

> > Last I tried thinking about that, I stopped at having a core-to-core
> > latency matrix, building domains off of that, and having some knob
> > specifying the highest distance value below which we'd set
> > SD_SHARE_PKG_RESOURCES. There's a few things I 'hate' about that; for one
> > it makes cpus_share_cache() somewhat questionable.
>
> And I thought about something like this too but worried it might get
> complex, particularly on chiplets where we do not necessarily have
> hardware info on latency depending on how it's wired up. It also might
> lead to excessive cpumask manipulation in a fast path if we have to
> traverse multiple distances with search cost exceeding gains from latency
> reduction. Hence -- keeping it simple with two level only, LLC then node
> within the allowed search depth and see what that gets us. It might be
> "good enough" in most cases and would be a basis for comparison against
> complex approaches.
>
> At minimum, I expect IBM can evaluate the POWER10 aspect and I can run
> an evaluation on Zen generations.
>
> --
> Mel Gorman
> SUSE Labs
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
On Mon, Apr 12, 2021 at 06:33:55PM +0200, Michal Suchánek wrote:
> On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > > Peter, Valentin, Vincent, Mel, et al.
> > > > >
> > > > > On architectures where we have multiple levels of cache access latencies
> > > > > within a DIE, (For example: one within the current LLC or SMT core and the
> > > > > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > > > > suggestions on how we could handle the same in the core scheduler?
> > >
> > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > > only rely on cache
> >
> > From topology.c
> >
> > SD_SHARE_PKG_RESOURCES - describes shared caches
> >
> > I'm guessing here because I am not familiar with power10 but the central
> > problem appears to be when to prefer selecting a CPU sharing L2 or L3
> > cache and the core assumes the last-level-cache is the only relevant one.
>
> It does not seem to be the case according to original description:
>
> When the scheduler tries to wakeup a task, it chooses between the
> waker-CPU and the wakee's previous-CPU. Suppose this choice is called
> the "target", then in the target's LLC domain, the scheduler
>
> a) tries to find an idle core in the LLC. This helps exploit the
>
> This is the same as (b)
>
> Should this be SMT^^^ ?

On POWER10, without this patch, the LLC is at the SMT sched-domain. The
difference between a) and b) is that a) searches for an idle core, while
b) searches for an idle CPU.

> SMT folding that the wakee task can benefit from. If an idle
> core is found, the wakee is woken up on it.
>
> b) Failing to find an idle core, the scheduler tries to find an idle
> CPU in the LLC. This helps minimise the wakeup latency for the
> wakee since it gets to run on the CPU immediately.
>
> c) Failing this, it will wake it up on target CPU.
>
> Thus, with P9-sched topology, since the CACHE domain comprises two
> SMT4 cores, there is a decent chance that we get an idle core, failing
> which there is a relatively higher probability of finding an idle CPU
> among the 8 threads in the domain.
>
> However, in P10-sched topology, since the SMT domain is the LLC and it
> contains only a single SMT4 core, the probability that we find that
> core to be idle is less. Furthermore, since there are only 4 CPUs to
> search for an idle CPU, there is a lower probability that we can get an
> idle CPU to wake up the task on.
>
> > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> > unintended consequences for load balancing because load within a die may
> > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> > the MC level.
>
> Not spreading load between SMT4 domains within MC is exactly what setting LLC
> at MC level would address, wouldn't it?
>
> As in on P10 we have two relevant levels but the topology as is describes only
> one, and moving the LLC level lower gives two levels the scheduler looks at
> again. Or am I missing something?

This is my current understanding as well, since with this patch we would
then be able to move tasks quickly between the SMT4 cores, perhaps at
the expense of losing out on cache-affinity. Which is why it would be
good to verify this using a test/benchmark.

> Thanks
>
> Michal

--
Thanks and Regards
gautham.
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Hello Mel,

On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, et al.
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > > > suggestions on how we could handle the same in the core scheduler?
> >
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
>
> From topology.c
>
> SD_SHARE_PKG_RESOURCES - describes shared caches

Yes, I was aware of these shared caches, but this current patch was the
simplest way to achieve the effect, though the cores in the MC domain on
POWER10 do not share a cache. However, it is relatively faster to
transfer data across the cores within the MC domain compared to the
cores outside the MC domain in the Die.

> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.

On POWER we have traditionally preferred to keep the LLC at the
sched-domain comprising groups of CPUs that share the L2 (since L3 is a
victim cache on POWER). On POWER9, the L2 was shared by the threads of a
pair of SMT4 cores, while on POWER10, L2 is shared by the threads of a
single SMT4 core.

Thus, the current task wake-up logic would have a lower probability of
finding an idle core inside an LLC since it has only one core to search
in the LLC. This is why moving the LLC to the parent domain (MC),
consisting of a group of SMT4 cores among which snooping the cache-data
is faster, is helpful for workloads that require single-threaded
performance.
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.

Since we are adding SD_SHARE_PKG_RESOURCES to the parent of the only
sched-domain (which is a SMT4 domain) that currently has this flag set,
would it cause issues in spreading the load between the SMT4 domains?
Are there any tests/benchmarks that can help bring this out? It would be
good to understand this.

> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> >
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re-adding another layer in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_branches.
>
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.
>
> But still, your comment reminded me that different architectures have
> different requirements.
>
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
> spillover to L3 when selecting an idle CPU.

Indeed, so on POWER10, the preference would be:

1) idle core in the L2 domain.
2) idle core in the MC domain.
3) idle CPU in the L2 domain.
4) idle CPU in the MC domain.
This patch is able to achieve this *implicitly* because of the way
select_idle_cpu() and select_idle_core() are currently coded, where in
the presence of idle cores at the MC level, select_idle_core() searches
for the idle core starting with the core of the target-CPU.

If I understand your proposal correctly, it would be to make this
explicit as a two-level search, where we first search in the LLC domain,
failing which, we carry on the search in the rest of the die (assuming
that the LLC is not in the die).

> X86 varies, it might want the Power10 approach for some families and prefer
> L3 spilling over to a CPU on the same node in others.
>
> S390 cares about something called books and drawers although I've no
> idea what it means as such and whether it has any preferences on
> search order.
>
> ARM has similar requirements again according to "scheduler: expose the
> topology of clusters and add cluster scheduler" and that one *does*
> add another domain.
>
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
>
> The only
Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
(Missed cc'ing Peter in the original posting)

On Fri, Apr 02, 2021 at 11:07:54AM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy"
>
> On POWER10 systems, the L2 cache is at the SMT4 small core level. The
> following commits ensure that L2 cache gets correctly discovered and
> the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.
>
> 790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
> 1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
> fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
> 538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
> 0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
>
> However, with the LLC now on the SMT sched-domain, we are seeing some
> regressions in the performance of applications that require
> single-threaded performance. The reason for this is as follows:
>
> Prior to the change (we call this P9-sched below), the sched-domain
> hierarchy was:
>
> SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE
>
> where the CACHE sched-domain is defined to be the Last Level Cache (LLC).
>
> On the upstream kernel, with the aforementioned commits (P10-sched),
> the sched-domain hierarchy is:
>
> SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE
>
> with the SMT sched-domain as the LLC.
>
> When the scheduler tries to wakeup a task, it chooses between the
> waker-CPU and the wakee's previous-CPU. Suppose this choice is called
> the "target", then in the target's LLC domain, the scheduler
>
> a) tries to find an idle core in the LLC. This helps exploit the
>    SMT folding that the wakee task can benefit from. If an idle
>    core is found, the wakee is woken up on it.
>
> b) Failing to find an idle core, the scheduler tries to find an idle
>    CPU in the LLC. This helps minimise the wakeup latency for the
>    wakee since it gets to run on the CPU immediately.
>
> c) Failing this, it will wake it up on target CPU.
>
> Thus, with P9-sched topology, since the CACHE domain comprises two
> SMT4 cores, there is a decent chance that we get an idle core, failing
> which there is a relatively higher probability of finding an idle CPU
> among the 8 threads in the domain.
>
> However, in P10-sched topology, since the SMT domain is the LLC and it
> contains only a single SMT4 core, the probability that we find that
> core to be idle is less. Furthermore, since there are only 4 CPUs to
> search for an idle CPU, there is a lower probability that we can get an
> idle CPU to wake up the task on.
>
> Thus applications which require single-threaded performance will end
> up getting woken up on a potentially busy core, even though there are
> idle cores in the system.
>
> To remedy this, this patch proposes that the LLC be moved to the MC
> level which is a group of cores in one half of the chip.
>
> SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
>
> While there is no cache being shared at this level, this is still the
> level where some amount of cache-snooping takes place and it is
> relatively faster to access the data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single-threaded performance.
>
> The patch also improves the tail latencies on schbench and the
> usecs/op on "perf bench sched pipe".
>
> On a 10 core P10 system with 80 CPUs,
>
> schbench
> (https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)
>
> Values: lower the better.
> 99th percentile is the tail latency.
>
>                  99th percentile
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. messenger    5.12-rc4      5.12-rc4
> threads          P10-sched     MC-LLC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                  70 us         85 us
> 2                  81 us        101 us
> 3                  92 us        107 us
> 4                  96 us        110 us
> 5                 103 us        123 us
> 6                3412 us        122 us
> 7                1490 us        136 us
> 8                6200 us       3572 us
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Hackbench (perf bench sched pipe)
> Values: lower the better
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> No. of           5.12-rc4       5.12-rc4
> parallel         P10-sched      MC-LLC
> instances
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 1                24.04 us/op    18.72 us/op
> 2                24.04 us/op    18.65 us/op
> 4                24.01 us/op    18.76 us/op
> 8                24.10 us/op
[RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
From: "Gautham R. Shenoy"

On POWER10 systems, the L2 cache is at the SMT4 small core level. The
following commits ensure that L2 cache gets correctly discovered and
the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.

790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:

Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:

SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE

where the CACHE sched-domain is defined to be the Last Level Cache (LLC).

On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:

SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE

with the SMT sched-domain as the LLC.

When the scheduler tries to wakeup a task, it chooses between the
waker-CPU and the wakee's previous-CPU. Suppose this choice is called
the "target", then in the target's LLC domain, the scheduler

a) tries to find an idle core in the LLC. This helps exploit the
   SMT folding that the wakee task can benefit from. If an idle
   core is found, the wakee is woken up on it.

b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee since it gets to run on the CPU immediately.

c) Failing this, it will wake it up on target CPU.

Thus, with P9-sched topology, since the CACHE domain comprises two
SMT4 cores, there is a decent chance that we get an idle core, failing
which there is a relatively higher probability of finding an idle CPU
among the 8 threads in the domain.

However, in P10-sched topology, since the SMT domain is the LLC and it
contains only a single SMT4 core, the probability that we find that
core to be idle is less. Furthermore, since there are only 4 CPUs to
search for an idle CPU, there is a lower probability that we can get an
idle CPU to wake up the task on.

Thus applications which require single-threaded performance will end
up getting woken up on a potentially busy core, even though there are
idle cores in the system.

To remedy this, this patch proposes that the LLC be moved to the MC
level which is a group of cores in one half of the chip.

SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE

While there is no cache being shared at this level, this is still the
level where some amount of cache-snooping takes place and it is
relatively faster to access the data from the caches of the cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single-threaded performance.

The patch also improves the tail latencies on schbench and the
usecs/op on "perf bench sched pipe".

On a 10 core P10 system with 80 CPUs,

schbench
(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)

Values: lower the better.
99th percentile is the tail latency.

                 99th percentile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. messenger    5.12-rc4      5.12-rc4
threads          P10-sched     MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                  70 us         85 us
2                  81 us        101 us
3                  92 us        107 us
4                  96 us        110 us
5                 103 us        123 us
6                3412 us        122 us
7                1490 us        136 us
8                6200 us       3572 us
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hackbench (perf bench sched pipe)
Values: lower the better

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. of           5.12-rc4       5.12-rc4
parallel         P10-sched      MC-LLC
instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1                24.04 us/op    18.72 us/op
2                24.04 us/op    18.65 us/op
4                24.01 us/op    18.76 us/op
8                24.10 us/op    19.11 us/op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }

+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On Wed, Jan 20, 2021 at 09:54:20AM +0000, Mel Gorman wrote:
> On Wed, Jan 20, 2021 at 10:21:47AM +0100, Vincent Guittot wrote:
> > On Wed, 20 Jan 2021 at 10:12, Mel Gorman wrote:
> > >
> > > On Wed, Jan 20, 2021 at 02:00:18PM +0530, Gautham R Shenoy wrote:
> > > > > @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
> > > > >  	}
> > > > >
> > > > >  	for_each_cpu_wrap(cpu, cpus, target) {
> > > > > -		if (!--nr)
> > > > > -			return -1;
> > > > > -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> > > > > -			break;
> > > > > +		if (smt) {
> > > > > +			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> > > > > +			if ((unsigned int)i < nr_cpumask_bits)
> > > > > +				return i;
> > > > > +
> > > > > +		} else {
> > > > > +			if (!--nr)
> > > > > +				return -1;
> > > > > +			i = __select_idle_cpu(cpu);
> > > > > +			if ((unsigned int)i < nr_cpumask_bits) {
> > > > > +				idle_cpu = i;
> > > > > +				break;
> > > > > +			}
> > > > > +		}
> > > > >  	}
> > > > >
> > > > > -	if (sched_feat(SIS_PROP)) {
> > > > > +	if (smt)
> > > > > +		set_idle_cores(this, false);
> > > >
> > > > Shouldn't we set_idle_cores(false) only if this was the last idle
> > > > core in the LLC ?
> > >
> > > That would involve rechecking the cpumask bits that have not been
> > > scanned to see if any of them are an idle core. As the existence of idle
> > > cores can change very rapidly, it's not worth the cost.
> >
> > But don't we reach this point only if we scanned all CPUs and didn't
> > find an idle core ?

Indeed, I missed the part that we return as soon as we find an idle
core in the for_each_cpu_wrap() loop above. So here we clear
"has_idle_cores" when there are no longer any idle cores. Sorry for the
noise!

> Yes, but my understanding of Gautham's suggestion was to check if an
> idle core found was the last idle core available and set has_idle_cores
> to false in that case.

That would have been nice, but since we do not keep a count of idle
cores, it is probably not worth the effort, as you note below.

> I think this would be relatively expensive and
> possibly futile as returning the last idle core for this wakeup does not
> mean there will be no idle core on the next wakeup as other cores may go
> idle between wakeups.
>
> --
> Mel Gorman
> SUSE Labs

--
Thanks and Regards
gautham.
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
Hello Mel, Peter,

On Tue, Jan 19, 2021 at 11:22:11AM +0000, Mel Gorman wrote:
> From: Peter Zijlstra (Intel)
>
> Both select_idle_core() and select_idle_cpu() do a loop over the same
> cpumask. Observe that by clearing the already visited CPUs, we can
> fold the iteration and iterate a core at a time.
>
> All we need to do is remember any non-idle CPU we encountered while
> scanning for an idle core. This way we'll only iterate every CPU once.
>
> Signed-off-by: Peter Zijlstra (Intel)
> Signed-off-by: Mel Gorman
> ---
>  kernel/sched/fair.c | 101 ++--
>  1 file changed, 61 insertions(+), 40 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 12e08da90024..822851b39b65 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c

[..snip..]

> @@ -6157,18 +6169,31 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
>  	}
>
>  	for_each_cpu_wrap(cpu, cpus, target) {
> -		if (!--nr)
> -			return -1;
> -		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
> -			break;
> +		if (smt) {
> +			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> +			if ((unsigned int)i < nr_cpumask_bits)
> +				return i;
> +
> +		} else {
> +			if (!--nr)
> +				return -1;
> +			i = __select_idle_cpu(cpu);
> +			if ((unsigned int)i < nr_cpumask_bits) {
> +				idle_cpu = i;
> +				break;
> +			}
> +		}
>  	}
>
> -	if (sched_feat(SIS_PROP)) {
> +	if (smt)
> +		set_idle_cores(this, false);

Shouldn't we set_idle_cores(false) only if this was the last idle
core in the LLC ?

--
Thanks and Regards
gautham.
[PATCH 1/2] powerpc/cacheinfo: Lookup cache by dt node and thread-group id
From: "Gautham R. Shenoy"

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is the device-tree node
corresponding to that cache.

On some of the POWER server platforms, thread-groups within the core
share different sets of caches (Eg: on SMT8 POWER9 systems, threads
0,2,4,6 of a core share L1 cache and threads 1,3,5,7 of the same core
share another L1 cache). On such platforms, there is a single
device-tree node corresponding to that cache, and the
cache-configuration within the threads of the core is indicated via the
"ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups" property,
on the aforementioned systems, the cacheinfo code still treats all the
threads in the core to be sharing the cache because of the single
device-tree node (in the earlier example, the cacheinfo code would say
that CPUs 0-7 share the L1 cache).

In this patch, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We index the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPU X, for a given
level of cache, the thread-group id is defined to be the first CPU in
the "ibm,thread-groups" cache-group containing CPU X. For levels of
cache which are not represented in the "ibm,thread-groups" property, the
thread-group id is -1.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/include/asm/smp.h  |  3 ++
 arch/powerpc/kernel/cacheinfo.c | 80 +++++++++++++++++++++++-----------
 2 files changed, 61 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index c4e2d53..39de24b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -32,6 +32,9 @@ extern int cpu_to_chip_id(int cpu);

+DECLARE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 #ifdef CONFIG_SMP

 struct smp_ops_t {
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 6f903e9a..5a6925d 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -120,6 +120,7 @@ struct cache {
 	struct cpumask shared_cpu_map; /* online CPUs using this cache */
 	int type;                      /* split cache disambiguation */
 	int level;                     /* level not explicit in device tree */
+	int group_id;                  /* id of the group of threads that share this cache */
 	struct list_head list;         /* global list of cache objects */
 	struct cache *next_local;      /* next cache of >= level */
 };
@@ -142,22 +143,24 @@ static const char *cache_type_string(const struct cache *cache)
 }

 static void cache_init(struct cache *cache, int type, int level,
-		       struct device_node *ofnode)
+		       struct device_node *ofnode, int group_id)
 {
 	cache->type = type;
 	cache->level = level;
 	cache->ofnode = of_node_get(ofnode);
+	cache->group_id = group_id;
 	INIT_LIST_HEAD(&cache->list);
 	list_add(&cache->list, &cache_list);
 }

-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
+static struct cache *new_cache(int type, int level,
+			       struct device_node *ofnode, int group_id)
 {
 	struct cache *cache;

 	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
 	if (cache)
-		cache_init(cache, type, level, ofnode);
+		cache_init(cache, type, level, ofnode, group_id);

 	return cache;
 }
@@ -309,20 +312,24 @@ static struct cache *cache_find_first_sibling(struct cache *cache)
 		return cache;

 	list_for_each_entry(iter, &cache_list, list)
-		if (iter->ofnode == cache->ofnode && iter->next_local == cache)
+		if (iter->ofnode == cache->ofnode &&
+		    iter->group_id == cache->group_id &&
+		    iter->next_local == cache)
 			return iter;

 	return cache;
 }

-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
+/* return the first cache on a local list matching node and thread-group id */
+static struct cache *cache_lookup_by_node_group(const struct device_node *node,
+						int group_id)
 {
 	struct cache *cache = NULL;
 	struct cache *iter;

 	list_for_each_entry(iter, &cache_list, list) {
-		if (iter->ofnode != node)
+		if (iter->ofnode != node ||
+		    iter->group_id != group_id)
 			continue;
 		cache = cache_find_first_sibling(iter);
 		break;
 	}
@@ -352,1
[PATCH 2/2] powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()
From: "Gautham R. Shenoy"

The helper function get_shared_cpu_map() was added in commit
500fe5f550ec ("powerpc/cacheinfo: Report the correct shared_cpu_map on
big-cores") and subsequently expanded upon in commit 0be47634db0b
("powerpc/cacheinfo: Print correct cache-sibling map/list for L2
cache") in order to help report the correct groups of threads sharing
these caches on big-core systems where groups of threads within a core
can share different sets of caches.

Now that the powerpc/cacheinfo code is aware of the "ibm,thread-groups"
property, cache->shared_cpu_map contains the correct set of
thread-siblings sharing the cache. Hence we no longer need the function
get_shared_cpu_map(). This patch removes it. We also remove the helper
function index_dir_to_cpu(), which was only called by
get_shared_cpu_map().

With these functions removed, we can still see the correct cache-sibling
map/list for L1 and L2 caches on systems with L1 and L2 caches
distributed among groups of threads in a core.

With this patch, on an SMT8 POWER10 system where the L1 and L2 caches
are split between the two groups of threads in a core, for CPUs 8 and 9,
the L1-Data, L1-Instruction, L2 and L3 cache CPU sibling lists are as
follows:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/cacheinfo.c | 41 +--------------------------------
 1 file changed, 1 insertion(+), 40 deletions(-)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 5a6925d..20d9169 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -675,45 +675,6 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
 static struct kobj_attribute cache_level_attr =
 	__ATTR(level, 0444, level_show, NULL);

-static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
-{
-	struct kobject *index_dir_kobj = &index->kobj;
-	struct kobject *cache_dir_kobj = index_dir_kobj->parent;
-	struct kobject *cpu_dev_kobj = cache_dir_kobj->parent;
-	struct device *dev = kobj_to_dev(cpu_dev_kobj);
-
-	return dev->id;
-}
-
-/*
- * On big-core systems, each core has two groups of CPUs each of which
- * has its own L1-cache. The thread-siblings which share l1-cache with
- * @cpu can be obtained via cpu_smallcore_mask().
- *
- * On some big-core systems, the L2 cache is shared only between some
- * groups of siblings. This is already parsed and encoded in
- * cpu_l2_cache_mask().
- *
- * TODO: cache_lookup_or_instantiate() needs to be made aware of the
- *       "ibm,thread-groups" property so that cache->shared_cpu_map
- *       reflects the correct siblings on platforms that have this
- *       device-tree property. This helper function is only a stop-gap
- *       solution so that we report the
[PATCH 0/2] powerpc/cacheinfo: Add "ibm,thread-groups" awareness
From: "Gautham R. Shenoy"

Hi,

Currently the cacheinfo code on powerpc indexes the "cache" objects (modelling the L1/L2/L3 caches) where the key is the device-tree node corresponding to that cache. On some of the POWER server platforms, thread-groups within the core share different sets of caches (e.g. on SMT8 POWER9 systems, threads 0,2,4,6 of a core share one L1 cache and threads 1,3,5,7 of the same core share another L1 cache). On such platforms, there is a single device-tree node corresponding to that cache, and the cache-configuration within the threads of the core is indicated via the "ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups" property, on the aforementioned systems the cacheinfo code still treats all the threads in the core as sharing the cache because of the single device-tree node (in the earlier example, the cacheinfo code would say that CPUs 0-7 share the L1 cache).

In this patch series, we make the powerpc cacheinfo code aware of the "ibm,thread-groups" property. We index the "cache" objects by the key-pair (device-tree node, thread-group id). For any CPU X, for a given level of cache, the thread-group id is defined to be the first CPU in the "ibm,thread-groups" cache-group containing CPU X. For levels of cache which are not represented in the "ibm,thread-groups" property, the thread-group id is -1.

We can now remove the helper functions get_shared_cpu_map() and index_dir_to_cpu(), since cache->shared_cpu_map contains the correct state of the thread-siblings sharing the cache.

This has been tested on an SMT8 POWER9 system where the L1 cache is split between groups of threads in the core, and on an SMT8 POWER10 system where the L1 and L2 caches are split between groups of threads in a core.

With this patch series, on a POWER10 SMT8 system, we see the following reported via sysfs:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Gautham R. Shenoy (2):
  powerpc/cacheinfo: Lookup cache by dt node and thread-group id
  powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

 arch/powerpc/include/asm/smp.h  |   3 +
 arch/powerpc/kernel/cacheinfo.c | 121
 2 files changed, 62 insertions(+), 62 deletions(-)

-- 
1.9.4
[PATCH v3 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
From: "Gautham R. Shenoy"

On platforms which have the "ibm,thread-groups" property, the per-cpu variable cpu_l1_cache_map keeps track of which group of threads within the same core share the L1 cache, Instruction and Data flow.

This patch renames the variable to "thread_group_l1_cache_map" to make it consistent with a subsequent patch which will introduce thread_group_l2_cache_map.

This patch introduces no functional change.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 88d88ad..f3290d5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -116,10 +116,10 @@ struct thread_groups_list {
 static struct thread_groups_list tgl[NR_CPUS] __initdata;
 
 /*
- * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
+ * On big-cores system, thread_group_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
  */
-DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
@@ -866,7 +866,7 @@ static struct thread_groups *__init get_thread_groups(int cpu,
 	return tg;
 }
 
-static int init_cpu_l1_cache_map(int cpu)
+static int init_thread_group_l1_cache_map(int cpu)
 {
 	int first_thread = cpu_first_thread_sibling(cpu);
 
@@ -885,7 +885,7 @@ static int init_cpu_l1_cache_map(int cpu)
 		return -ENODATA;
 	}
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+	zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
 				GFP_KERNEL, cpu_to_node(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
@@ -897,7 +897,7 @@ static int init_cpu_l1_cache_map(int cpu)
 		}
 
 		if (i_group_start == cpu_group_start)
-			cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
+			cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
 	}
 
 	return 0;
@@ -976,7 +976,7 @@ static int init_big_cores(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		int err = init_cpu_l1_cache_map(cpu);
+		int err = init_thread_group_l1_cache_map(cpu);
 
 		if (err)
 			return err;
@@ -1372,7 +1372,7 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 	cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu));
 
-	for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) {
+	for_each_cpu(i, per_cpu(thread_group_l1_cache_map, cpu)) {
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_smallcore_mask);
 	}
-- 
1.9.4
[PATCH v3 0/5] Extend Parsing "ibm,thread-groups" for Shared-L2 information
From: "Gautham R. Shenoy"

Hi,

This is the v3 of the patchset to extend parsing of the "ibm,thread-groups" property to discover the Shared-L2 cache information.

The previous versions can be found here:

v2 : https://lore.kernel.org/linuxppc-dev/1607533700-5546-1-git-send-email-...@linux.vnet.ibm.com/T/#m043ea15d3832658527fca94765202b9cbefd330d
v1 : https://lore.kernel.org/linuxppc-dev/1607057327-29822-1-git-send-email-...@linux.vnet.ibm.com/T/#m0fabffa1ea1a2807b362f25c849bb19415216520

Changes from v2 --> v3:
* Fixed the build errors reported by the Kernel Test Robot for Patches 4 and 5.

Changes from v1 --> v2:
Incorporated the review comments from Srikar and fixed a build error on !PPC64 configs reported by the kernel bot.
* Split Patch 1 into three patches:
  - The first patch ensures that parse_thread_groups() is made generic to support more than one property.
  - The second patch renames cpu_l1_cache_map as thread_group_l1_cache_map for consistency. No functional impact.
  - The third patch makes init_thread_group_l1_cache_map() generic. No functional impact.
* Patch 2 (now Patch 4): Incorporates the review comments from Srikar simplifying the changes to update_mask_by_l2().
* Patch 3 (now Patch 5): Fixes build errors for 32-bit configs reported by the kernel build bot.

Description of the Patchset
===========================

The "ibm,thread-groups" device-tree property is an array that is used to indicate if groups of threads within a core share certain properties. It provides details of which property is being shared by which groups of threads. This array can encode information about multiple properties being shared by different thread-groups within the core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:
a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

wherein,

a) provides information of Property "1" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of the second group is {9,11,13,15}. Property "1" is indicative of the threads in the group sharing L1 cache, translation cache and Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups, each group with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of the second group is {9,11,13,15}. Property "2" indicates that the threads in each group share the L2-cache.

The existing code assumes that "ibm,thread-groups" encodes information about only one property. Hence, even on platforms which encode information about multiple properties being shared by the corresponding groups of threads, the current code will only pick the first one. (In the above example, it will only consider [1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, currently on platforms where groups of threads share the L2 cache, we incorrectly create an extra CACHE level sched-domain that maps to all the threads of the core. For example, if "ibm,thread-groups" is

0001 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007

then the sub-array [0002 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007] indicates that L2 (Property "2") is shared only between the threads of a single group. There are "2" groups of threads where each group contains "4" threads each, the groups being {0,2,4,6} and {1,3,5,7}.

However, the sched-domain hierarchy for CPUs 0,1 is

CPU0 attaching sched-domain(s):
 domain-0: span=0,2,4,6 level=SMT
 domain-1: span=0-7 level=CACHE
 domain-2: span=0-15,24-39,48-55 level=MC
 domain-3: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
 domain-0: span=1,3,5,7 level=SMT
 domain-1: span=0-7 level=CACHE
 domain-2: span=0-15,24-39,48-55 level=MC
 domain-3: span=0-55 level=DIE

where the CACHE domain reports that L2 is shared across the entire core, which is incorrect on such platforms.

This patchset remedies these issues by extending the parsing support for "ibm,thread-groups" to discover information about multiple properties being shared by the corresponding groups of threads. In particular, we can now detect if the groups of threads within a core share the L2-cache. On such platf
[PATCH v3 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache
From: "Gautham R. Shenoy"

On POWER systems, groups of threads within a core sharing the L2-cache can be indicated by the "ibm,thread-groups" property array with the identifier "2".

This patch adds support for detecting this and, when present, populates the cpu_l2_cache_mask of every CPU with the core-siblings that share the L2 cache with that CPU, as specified by the "ibm,thread-groups" property array.

On a platform with the following "ibm,thread-groups" configuration

0001 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007

Without this patch, the sched-domain hierarchy for CPUs 0,1 would be

CPU0 attaching sched-domain(s):
 domain-0: span=0,2,4,6 level=SMT
 domain-1: span=0-7 level=CACHE
 domain-2: span=0-15,24-39,48-55 level=MC
 domain-3: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
 domain-0: span=1,3,5,7 level=SMT
 domain-1: span=0-7 level=CACHE
 domain-2: span=0-15,24-39,48-55 level=MC
 domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect since the ibm,thread-groups sub-array [0002 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007] indicates that L2 (Property "2") is shared only between the threads of a single group. There are "2" groups of threads where each group contains "4" threads each, the groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be

CPU0 attaching sched-domain(s):
 domain-0: span=0,2,4,6 level=SMT
 domain-1: span=0-15,24-39,48-55 level=MC
 domain-2: span=0-55 level=DIE
CPU1 attaching sched-domain(s):
 domain-0: span=1,3,5,7 level=SMT
 domain-1: span=0-15,24-39,48-55 level=MC
 domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1 resp.) gets degenerated into the SMT domain. Furthermore, the last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/include/asm/smp.h |  2 ++
 arch/powerpc/kernel/smp.c      | 58 ++
 2 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index b2035b2..035459c 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -134,6 +134,7 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu)
 extern int cpu_to_core_id(int cpu);
 extern bool has_big_cores;
+extern bool thread_group_shares_l2;
 
 #define cpu_smt_mask cpu_smt_mask
 #ifdef CONFIG_SCHED_SMT
@@ -187,6 +188,7 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 /* for UP */
 #define hard_smp_processor_id()		get_hard_smp_processor_id(0)
 #define smp_setup_cpu_maps()
+#define thread_group_shares_l2 0
 static inline void inhibit_secondary_onlining(void) {}
 static inline void uninhibit_secondary_onlining(void) {}
 static inline const struct cpumask *cpu_sibling_mask(int cpu)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 9078b5b5..2b9b1bb 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,6 +76,7 @@ struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
+bool thread_group_shares_l2;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -99,6 +100,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
+#define THREAD_GROUP_SHARE_L2   2
 struct thread_groups {
 	unsigned int property;
 	unsigned int nr_groups;
@@ -107,7 +109,7 @@ struct thread_groups {
 };
 
 /* Maximum number of properties that groups of threads within a core can share */
-#define MAX_THREAD_GROUP_PROPERTIES 1
+#define MAX_THREAD_GROUP_PROPERTIES 2
 
 struct thread_groups_list {
 	unsigned int nr_properties;
@@ -121,6 +123,13 @@ struct thread_groups_list {
  */
 DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
+/*
+ * On some big-cores system, thread_group_l2_cache_map for each CPU
+ * corresponds to the set its siblings within the core that share the
+ * L2-cache.
+ */
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
 
@@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
  *
  * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
- * that the threads in the same group share L1, trans
[PATCH v3 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
From: "Gautham R. Shenoy"

On POWER platforms where only some groups of threads within a core share the L2-cache (indicated by the "ibm,thread-groups" device-tree property), we currently print the incorrect shared_cpu_map/list for the L2-cache in sysfs.

This patch reports the correct shared_cpu_map/list on such platforms.

Example:
On a platform with "ibm,thread-groups" set to

0001 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0000 0002 0004 0006 0001 0003 0005 0007

This indicates that threads {0,2,4,6} in the core share the L2-cache and threads {1,3,5,7} in the core share the L2-cache.

However, without the patch, the shared_cpu_map/list for L2 for CPUs 0, 1 is reported in sysfs as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff

With the patch, the shared_cpu_map/list for the L2 cache for CPUs 0, 1 is correctly reported as follows:

/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7
/sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa

This patch also defines cpu_l2_cache_mask() for the !CONFIG_SMP case.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/include/asm/smp.h  |  4
 arch/powerpc/kernel/cacheinfo.c | 30 --
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 035459c..c4e2d53 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -201,6 +201,10 @@ static inline const struct cpumask *cpu_smallcore_mask(int cpu)
 	return cpumask_of(cpu);
 }
 
+static inline const struct cpumask *cpu_l2_cache_mask(int cpu)
+{
+	return cpumask_of(cpu);
+}
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 65ab9fc..6f903e9a 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -655,11 +655,27 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
  * On big-core systems, each core has two groups of CPUs each of which
  * has its own L1-cache. The thread-siblings which share l1-cache with
  * @cpu can be obtained via cpu_smallcore_mask().
+ *
+ * On some big-core systems, the L2 cache is shared only between some
+ * groups of siblings. This is already parsed and encoded in
+ * cpu_l2_cache_mask().
+ *
+ * TODO: cache_lookup_or_instantiate() needs to be made aware of the
+ *       "ibm,thread-groups" property so that cache->shared_cpu_map
+ *       reflects the correct siblings on platforms that have this
+ *       device-tree property. This helper function is only a stop-gap
+ *       solution so that we report the correct siblings to the
+ *       userspace via sysfs.
  */
-static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
+static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, struct cache *cache)
 {
-	if (cache->level == 1)
-		return cpu_smallcore_mask(cpu);
+	if (has_big_cores) {
+		int cpu = index_dir_to_cpu(index);
+		if (cache->level == 1)
+			return cpu_smallcore_mask(cpu);
+		if (cache->level == 2 && thread_group_shares_l2)
+			return cpu_l2_cache_mask(cpu);
+	}
 
 	return &cache->shared_cpu_map;
 }
 
@@ -670,17 +686,11 @@ static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *
 	struct cache_index_dir *index;
 	struct cache *cache;
 	const struct cpumask *mask;
-	int cpu;
 
 	index = kobj_to_cache_index_dir(k);
 	cache = index->cache;
 
-	if (has_big_cores) {
-		cpu = index_dir_to_cpu(index);
-		mask = get_big_core_shared_cpu_map(cpu, cache);
-	} else {
-		mask = &cache->shared_cpu_map;
-	}
+	mask = get_shared_cpu_map(index, cache);
 
 	return cpumap_print_to_pagebuf(list, buf, mask);
 }
-- 
1.9.4
[PATCH v3 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
From: "Gautham R. Shenoy"

init_thread_group_l1_cache_map() initializes the per-cpu cpumask thread_group_l1_cache_map with the core-siblings which share the L1 cache with the CPU. Make this function generic with respect to the cache property (L1 or L2) so that it updates the suitable mask.

This is a preparatory patch for the next patch where we will introduce discovery of thread-groups that share the L2-cache.

No functional change.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f3290d5..9078b5b5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -866,15 +866,18 @@ static struct thread_groups *__init get_thread_groups(int cpu,
 	return tg;
 }
 
-static int init_thread_group_l1_cache_map(int cpu)
+static int __init init_thread_group_cache_map(int cpu, int cache_property)
 {
 	int first_thread = cpu_first_thread_sibling(cpu);
 	int i, cpu_group_start = -1, err = 0;
 	struct thread_groups *tg = NULL;
+	cpumask_var_t *mask;
 
-	tg = get_thread_groups(cpu, THREAD_GROUP_SHARE_L1,
-			       &err);
+	if (cache_property != THREAD_GROUP_SHARE_L1)
+		return -EINVAL;
+
+	tg = get_thread_groups(cpu, cache_property, &err);
 	if (!tg)
 		return err;
 
@@ -885,8 +888,8 @@ static int init_thread_group_l1_cache_map(int cpu)
 		return -ENODATA;
 	}
 
-	zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
-				GFP_KERNEL, cpu_to_node(cpu));
+	mask = &per_cpu(thread_group_l1_cache_map, cpu);
+	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
 		int i_group_start = get_cpu_thread_group_start(i, tg);
@@ -897,7 +900,7 @@ static int init_thread_group_l1_cache_map(int cpu)
 		}
 
 		if (i_group_start == cpu_group_start)
-			cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
+			cpumask_set_cpu(i, *mask);
 	}
 
 	return 0;
@@ -976,7 +979,7 @@ static int init_big_cores(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		int err = init_thread_group_l1_cache_map(cpu);
+		int err = init_thread_group_cache_map(cpu, THREAD_GROUP_SHARE_L1);
 
 		if (err)
 			return err;
-- 
1.9.4
[PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties
From: "Gautham R. Shenoy"

The "ibm,thread-groups" device-tree property is an array that is used to indicate if groups of threads within a core share certain properties. It provides details of which property is being shared by which groups of threads. This array can encode information about multiple properties being shared by different thread-groups within the core.

Example: Suppose,
"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:
a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

wherein,

a) provides information of Property "1" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of the second group is {9,11,13,15}. Property "1" is indicative of the threads in the group sharing L1 cache, translation cache and Instruction Data flow.

b) provides information of Property "2" being shared by "2" groups, each group with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of the second group is {9,11,13,15}. Property "2" indicates that the threads in each group share the L2-cache.

The existing code assumes that "ibm,thread-groups" encodes information about only one property. Hence, even on platforms which encode information about multiple properties being shared by the corresponding groups of threads, the current code will only pick the first one. (In the above example, it will only consider [1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing support to platforms which encode information about multiple properties being shared by the corresponding groups of threads.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 174 ++
 1 file changed, 113 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8c2857c..88d88ad 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -106,6 +106,15 @@ struct thread_groups {
 	unsigned int thread_list[MAX_THREAD_LIST_SIZE];
 };
 
+/* Maximum number of properties that groups of threads within a core can share */
+#define MAX_THREAD_GROUP_PROPERTIES 1
+
+struct thread_groups_list {
+	unsigned int nr_properties;
+	struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES];
+};
+
+static struct thread_groups_list tgl[NR_CPUS] __initdata;
 /*
  * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
@@ -695,81 +704,98 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int),
 /*
  * parse_thread_groups: Parses the "ibm,thread-groups" device tree
  *                      property for the CPU device node @dn and stores
- *                      the parsed output in the thread_groups
- *                      structure @tg if the ibm,thread-groups[0]
- *                      matches @property.
+ *                      the parsed output in the thread_groups_list
+ *                      structure @tglp.
  *
  * @dn: The device node of the CPU device.
- * @tg: Pointer to a thread group structure into which the parsed
+ * @tglp: Pointer to a thread group list structure into which the parsed
  *      output of "ibm,thread-groups" is stored.
- * @property: The property of the thread-group that the caller is
- *            interested in.
  *
  * ibm,thread-groups[0..N-1] array defines which group of threads in
  * the CPU-device node can be grouped together based on the property.
  *
- * ibm,thread-groups[0] tells us the property based on which the
+ * This array can represent thread groupings for multiple properties.
+ *
+ * ibm,thread-groups[i + 0] tells us the property based on which the
  * threads are being grouped together. If this value is 1, it implies
  * that the threads in the same group share L1, translation cache.
 *
- * ibm,thread-groups[1] tells us how many such thread groups exist.
+ * ibm,thread-groups[i+1] tells us how many such thread groups exist for the
+ * property ibm,thread-groups[i]
 *
- * ibm,thread-groups[2] tells us the number of threads in each such
+ * ibm,thread-groups[i+2] tells us the number of threads in each such
 * group.
+ * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then,
 *
- * ibm,thread-groups[3..N-1] is the list of threads identified by
+ * ibm,thread-groups[i+3..i+k+2] is the list of threads identified by
 * "ibm,ppc-interrupt-server#s" arranged as per their membership in
 * the grouping.
 *
- * Example: If ibm,thread-gro
[PATCH v2 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
From: "Gautham R. Shenoy"

On platforms which have the "ibm,thread-groups" property, the per-cpu variable cpu_l1_cache_map keeps track of which group of threads within the same core share the L1 cache, Instruction and Data flow.

This patch renames the variable to "thread_group_l1_cache_map" to make it consistent with a subsequent patch which will introduce thread_group_l2_cache_map.

This patch introduces no functional change.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 88d88ad..f3290d5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -116,10 +116,10 @@ struct thread_groups_list {
 static struct thread_groups_list tgl[NR_CPUS] __initdata;
 
 /*
- * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
+ * On big-cores system, thread_group_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
  */
-DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
@@ -866,7 +866,7 @@ static struct thread_groups *__init get_thread_groups(int cpu,
 	return tg;
 }
 
-static int init_cpu_l1_cache_map(int cpu)
+static int init_thread_group_l1_cache_map(int cpu)
 {
 	int first_thread = cpu_first_thread_sibling(cpu);
 
@@ -885,7 +885,7 @@ static int init_cpu_l1_cache_map(int cpu)
 		return -ENODATA;
 	}
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+	zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
 				GFP_KERNEL, cpu_to_node(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
@@ -897,7 +897,7 @@ static int init_cpu_l1_cache_map(int cpu)
 		}
 
 		if (i_group_start == cpu_group_start)
-			cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
+			cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
 	}
 
 	return 0;
@@ -976,7 +976,7 @@ static int init_big_cores(void)
 	int cpu;
 
 	for_each_possible_cpu(cpu) {
-		int err = init_cpu_l1_cache_map(cpu);
+		int err = init_thread_group_l1_cache_map(cpu);
 
 		if (err)
 			return err;
@@ -1372,7 +1372,7 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 	cpumask_set_cpu(cpu, cpu_smallcore_mask(cpu));
 
-	for_each_cpu(i, per_cpu(cpu_l1_cache_map, cpu)) {
+	for_each_cpu(i, per_cpu(thread_group_l1_cache_map, cpu)) {
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_smallcore_mask);
 	}
-- 
1.9.4
[PATCH v2 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
From: "Gautham R. Shenoy" On POWER platforms where only some groups of threads within a core share the L2-cache (indicated by the ibm,thread-groups device-tree property), we currently print the incorrect shared_cpu_map/list for L2-cache in the sysfs. This patch reports the correct shared_cpu_map/list on such platforms. Example: On a platform with "ibm,thread-groups" set to 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0002 0004 0006 0001 0003 0005 0007 This indicates that threads {0,2,4,6} in the core share the L2-cache and threads {1,3,5,7} in the core share the L2 cache. However, without the patch, the shared_cpu_map/list for L2 for CPUs 0, 1 is reported in the sysfs as follows: /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7 /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is correctly reported as follows: /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6 /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa This patch adds #CONFIG_PPC64 checks for these cases to ensure that 32-bit configs build correctly. Signed-off-by: Gautham R. 
Shenoy --- arch/powerpc/kernel/cacheinfo.c | 34 -- 1 file changed, 24 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c index 65ab9fc..cb87b68 100644 --- a/arch/powerpc/kernel/cacheinfo.c +++ b/arch/powerpc/kernel/cacheinfo.c @@ -641,6 +641,7 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char * static struct kobj_attribute cache_level_attr = __ATTR(level, 0444, level_show, NULL); +#ifdef CONFIG_PPC64 static unsigned int index_dir_to_cpu(struct cache_index_dir *index) { struct kobject *index_dir_kobj = >kobj; @@ -650,16 +651,35 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index) return dev->id; } +#endif /* * On big-core systems, each core has two groups of CPUs each of which * has its own L1-cache. The thread-siblings which share l1-cache with * @cpu can be obtained via cpu_smallcore_mask(). + * + * On some big-core systems, the L2 cache is shared only between some + * groups of siblings. This is already parsed and encoded in + * cpu_l2_cache_mask(). + * + * TODO: cache_lookup_or_instantiate() needs to be made aware of the + * "ibm,thread-groups" property so that cache->shared_cpu_map + * reflects the correct siblings on platforms that have this + * device-tree property. This helper function is only a stop-gap + * solution so that we report the correct siblings to the + * userspace via sysfs. 
 */
-static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
+static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, struct cache *cache)
 {
-	if (cache->level == 1)
-		return cpu_smallcore_mask(cpu);
+#ifdef CONFIG_PPC64
+	if (has_big_cores) {
+		int cpu = index_dir_to_cpu(index);
+
+		if (cache->level == 1)
+			return cpu_smallcore_mask(cpu);
+		if (cache->level == 2 && thread_group_shares_l2)
+			return cpu_l2_cache_mask(cpu);
+	}
+#endif
 	return &cache->shared_cpu_map;
 }

@@ -670,17 +690,11 @@ static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *
 	struct cache_index_dir *index;
 	struct cache *cache;
 	const struct cpumask *mask;
-	int cpu;

 	index = kobj_to_cache_index_dir(k);
 	cache = index->cache;

-	if (has_big_cores) {
-		cpu = index_dir_to_cpu(index);
-		mask = get_big_core_shared_cpu_map(cpu, cache);
-	} else {
-		mask = &cache->shared_cpu_map;
-	}
+	mask = get_shared_cpu_map(index, cache);

 	return cpumap_print_to_pagebuf(list, buf, mask);
 }
-- 
1.9.4
[PATCH v2 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache
From: "Gautham R. Shenoy"

On POWER systems, groups of threads within a core sharing the L2-cache can be indicated by the "ibm,thread-groups" property array with the identifier "2".

This patch adds support for detecting this and, when present, populates the cpu_l2_cache_mask of every CPU with the core-siblings which share the L2 with that CPU, as specified by the "ibm,thread-groups" property array.

On a platform with the following "ibm,thread-groups" configuration

0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0002 0004 0006 0001 0003 0005 0007

without this patch, the sched-domain hierarchy for CPUs 0,1 would be

CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

The CACHE domain at 0-7 is incorrect, since the ibm,thread-groups sub-array

[0002 0002 0004 0002 0004 0006 0001 0003 0005 0007]

indicates that the L2 (Property "2") is shared only between the threads of a single group: there are "2" groups of threads, each containing "4" threads, the groups being {0,2,4,6} and {1,3,5,7}.

With this patch, the sched-domain hierarchy for CPUs 0,1 would be

CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-15,24-39,48-55 level=MC
domain-2: span=0-55 level=DIE

The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1, respectively) gets degenerated into the SMT domain. Furthermore, the last-level-cache domain gets correctly set to the SMT sched-domain.

Signed-off-by: Gautham R.
Shenoy --- arch/powerpc/include/asm/smp.h | 1 + arch/powerpc/kernel/smp.c | 56 +++--- 2 files changed, 53 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h index b2035b2..8d3d081 100644 --- a/arch/powerpc/include/asm/smp.h +++ b/arch/powerpc/include/asm/smp.h @@ -134,6 +134,7 @@ static inline struct cpumask *cpu_smallcore_mask(int cpu) extern int cpu_to_core_id(int cpu); extern bool has_big_cores; +extern bool thread_group_shares_l2; #define cpu_smt_mask cpu_smt_mask #ifdef CONFIG_SCHED_SMT diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 9078b5b5..a46cf3f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -76,6 +76,7 @@ struct task_struct *secondary_current; bool has_big_cores; bool coregroup_enabled; +bool thread_group_shares_l2; DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); @@ -99,6 +100,7 @@ enum { #define MAX_THREAD_LIST_SIZE 8 #define THREAD_GROUP_SHARE_L1 1 +#define THREAD_GROUP_SHARE_L2 2 struct thread_groups { unsigned int property; unsigned int nr_groups; @@ -107,7 +109,7 @@ struct thread_groups { }; /* Maximum number of properties that groups of threads within a core can share */ -#define MAX_THREAD_GROUP_PROPERTIES 1 +#define MAX_THREAD_GROUP_PROPERTIES 2 struct thread_groups_list { unsigned int nr_properties; @@ -121,6 +123,13 @@ struct thread_groups_list { */ DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map); +/* + * On some big-cores system, thread_group_l2_cache_map for each CPU + * corresponds to the set its siblings within the core that share the + * L2-cache. 
+ */ +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map); + /* SMP operations for this machine */ struct smp_ops_t *smp_ops; @@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), * * ibm,thread-groups[i + 0] tells us the property based on which the * threads are being grouped together. If this value is 1, it implies - * that the threads in the same group share L1, translation cache. + * that the threads in the same group share L1, translation cache. If + * the value is 2, it implies that the threads in the same group share + * the same L2 cache. * * ibm,thread-groups[i+1] tells us how many such thread groups exist for the * property ibm,thread-groups[i] @@ -874,7 +885,8 @@ static int __init init_thread_group_cache_map(int cpu, int cache_property) struct thr
[PATCH v2 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties
From: "Gautham R. Shenoy"

The "ibm,thread-groups" device-tree property is an array that is used to indicate if groups of threads within a core share certain properties. It provides details of which property is being shared by which groups of threads. This array can encode information about multiple properties being shared by different thread-groups within the core.

Example: Suppose

"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where

a) provides information about Property "1" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and that of the second group is {9,11,13,15}. Property "1" indicates that the threads in each group share the L1 cache, the translation cache and the Instruction Data flow.

b) provides information about Property "2" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and that of the second group is {9,11,13,15}. Property "2" indicates that the threads in each group share the L2-cache.

The existing code assumes that "ibm,thread-groups" encodes information about only one property. Hence, even on platforms which encode information about multiple properties being shared by the corresponding groups of threads, the current code will only pick the first one. (In the above example, it will only consider [1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing to support platforms which encode information about multiple properties being shared by the corresponding groups of threads.

Signed-off-by: Gautham R.
Shenoy --- arch/powerpc/kernel/smp.c | 174 ++ 1 file changed, 113 insertions(+), 61 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8c2857c..88d88ad 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -106,6 +106,15 @@ struct thread_groups { unsigned int thread_list[MAX_THREAD_LIST_SIZE]; }; +/* Maximum number of properties that groups of threads within a core can share */ +#define MAX_THREAD_GROUP_PROPERTIES 1 + +struct thread_groups_list { + unsigned int nr_properties; + struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES]; +}; + +static struct thread_groups_list tgl[NR_CPUS] __initdata; /* * On big-cores system, cpu_l1_cache_map for each CPU corresponds to * the set its siblings that share the L1-cache. @@ -695,81 +704,98 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), /* * parse_thread_groups: Parses the "ibm,thread-groups" device tree * property for the CPU device node @dn and stores - * the parsed output in the thread_groups - * structure @tg if the ibm,thread-groups[0] - * matches @property. + * the parsed output in the thread_groups_list + * structure @tglp. * * @dn: The device node of the CPU device. - * @tg: Pointer to a thread group structure into which the parsed + * @tglp: Pointer to a thread group list structure into which the parsed * output of "ibm,thread-groups" is stored. - * @property: The property of the thread-group that the caller is - *interested in. * * ibm,thread-groups[0..N-1] array defines which group of threads in * the CPU-device node can be grouped together based on the property. * - * ibm,thread-groups[0] tells us the property based on which the + * This array can represent thread groupings for multiple properties. + * + * ibm,thread-groups[i + 0] tells us the property based on which the * threads are being grouped together. If this value is 1, it implies * that the threads in the same group share L1, translation cache. 
* - * ibm,thread-groups[1] tells us how many such thread groups exist. + * ibm,thread-groups[i+1] tells us how many such thread groups exist for the + * property ibm,thread-groups[i] * - * ibm,thread-groups[2] tells us the number of threads in each such + * ibm,thread-groups[i+2] tells us the number of threads in each such * group. + * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then, * - * ibm,thread-groups[3..N-1] is the list of threads identified by + * ibm,thread-groups[i+3..i+k+2] (is the list of threads identified by * "ibm,ppc-interrupt-server#s" arranged as per their membership in * the grouping. * - * Example: If ibm,thread-gro
[PATCH v2 0/5] Extend Parsing "ibm,thread-groups" for Shared-L2 information
From: "Gautham R. Shenoy"

Hi,

This is the v2 of the patchset to extend parsing of the "ibm,thread-groups" property to discover the Shared-L2 cache information.

The v1 can be found here:
https://lore.kernel.org/linuxppc-dev/1607057327-29822-1-git-send-email-...@linux.vnet.ibm.com/T/#m0fabffa1ea1a2807b362f25c849bb19415216520

The key changes from v1, incorporating the review comments from Srikar and fixing a build error on !PPC64 configs reported by the kernel bot, are as follows:

* Split Patch 1 into three patches:
  - The first patch ensures that parse_thread_groups() is made generic to support more than one property.
  - The second patch renames cpu_l1_cache_map as thread_group_l1_cache_map for consistency. No functional impact.
  - The third patch makes init_thread_group_l1_cache_map() generic. No functional impact.
* Patch 2 (now patch 4): Incorporates the review comments from Srikar simplifying the changes to update_mask_by_l2().
* Patch 3 (now patch 5): Fixes build errors for 32-bit configs reported by the kernel build bot.

Description of the Patchset
===========================

The "ibm,thread-groups" device-tree property is an array that is used to indicate if groups of threads within a core share certain properties. It provides details of which property is being shared by which groups of threads. This array can encode information about multiple properties being shared by different thread-groups within the core.

Example: Suppose

"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where

a) provides information about Property "1" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and that of the second group is {9,11,13,15}. Property "1" indicates that the threads in each group share the L1 cache, the translation cache and the Instruction Data flow.
b) provides information about Property "2" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and that of the second group is {9,11,13,15}. Property "2" indicates that the threads in each group share the L2-cache.

The existing code assumes that "ibm,thread-groups" encodes information about only one property. Hence, even on platforms which encode information about multiple properties being shared by the corresponding groups of threads, the current code will only pick the first one. (In the above example, it will only consider [1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

Furthermore, on platforms where groups of threads share the L2 cache, we currently create an extra CACHE-level sched-domain that incorrectly maps to all the threads of the core. For example, if "ibm,thread-groups" is

0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0002 0004 0006 0001 0003 0005 0007

then the sub-array

[0002 0002 0004 0002 0004 0006 0001 0003 0005 0007]

indicates that the L2 (Property "2") is shared only between the threads of a single group: there are "2" groups of threads, each containing "4" threads, the groups being {0,2,4,6} and {1,3,5,7}. However, the sched-domain hierarchy for CPUs 0,1 is

CPU0 attaching sched-domain(s):
domain-0: span=0,2,4,6 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

CPU1 attaching sched-domain(s):
domain-0: span=1,3,5,7 level=SMT
domain-1: span=0-7 level=CACHE
domain-2: span=0-15,24-39,48-55 level=MC
domain-3: span=0-55 level=DIE

where the CACHE domain reports that the L2 is shared across the entire core, which is incorrect on such platforms.

This patchset remedies these issues by extending the parsing support for "ibm,thread-groups" to discover information about multiple properties being shared by the corresponding groups of threads. In particular, we can now detect if the groups of threads within a core share the L2-cache. On such platforms, we populate the cpu_l2_cache_mask of every CPU with the core-siblings which share the L2 with that CPU, as specified by the "ibm,thread-groups" property array. With the patchset, the sched-domain hierarchy is correctly reported.
[PATCH v2 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic
From: "Gautham R. Shenoy"

init_thread_group_l1_cache_map() initializes the per-cpu cpumask thread_group_l1_cache_map with the core-siblings which share L1 cache with the CPU. Make this function generic to the cache-property (L1 or L2) and update a suitable mask. This is a preparatory patch for the next patch where we will introduce discovery of thread-groups that share L2-cache.

No functional change.

Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/kernel/smp.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f3290d5..9078b5b5 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -866,15 +866,18 @@ static struct thread_groups *__init get_thread_groups(int cpu,
 	return tg;
 }

-static int init_thread_group_l1_cache_map(int cpu)
+static int __init init_thread_group_cache_map(int cpu, int cache_property)
 {
 	int first_thread = cpu_first_thread_sibling(cpu);
 	int i, cpu_group_start = -1, err = 0;
 	struct thread_groups *tg = NULL;
+	cpumask_var_t *mask;

-	tg = get_thread_groups(cpu, THREAD_GROUP_SHARE_L1,
-			       &err);
+	if (cache_property != THREAD_GROUP_SHARE_L1)
+		return -EINVAL;
+
+	tg = get_thread_groups(cpu, cache_property, &err);
 	if (!tg)
 		return err;

@@ -885,8 +888,8 @@ static int init_thread_group_l1_cache_map(int cpu)
 		return -ENODATA;
 	}

-	zalloc_cpumask_var_node(&per_cpu(thread_group_l1_cache_map, cpu),
-				GFP_KERNEL, cpu_to_node(cpu));
+	mask = &per_cpu(thread_group_l1_cache_map, cpu);
+	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));

 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
 		int i_group_start = get_cpu_thread_group_start(i, tg);
@@ -897,7 +900,7 @@ static int init_thread_group_l1_cache_map(int cpu)
 		}

 		if (i_group_start == cpu_group_start)
-			cpumask_set_cpu(i, per_cpu(thread_group_l1_cache_map, cpu));
+			cpumask_set_cpu(i, *mask);
 	}

 	return 0;
@@ -976,7 +979,7 @@ static int init_big_cores(void)
 	int cpu;

 	for_each_possible_cpu(cpu) {
-		int err = init_thread_group_l1_cache_map(cpu);
+		int err = init_thread_group_cache_map(cpu, THREAD_GROUP_SHARE_L1);

 		if (err)
 			return err;
-- 
1.9.4
Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
On Wed, Dec 09, 2020 at 02:09:21PM +0530, Srikar Dronamraju wrote: > * Gautham R Shenoy [2020-12-08 23:26:47]: > > > > The drawback of this is even if cpus 0,2,4,6 are released L1 cache will > > > not > > > be released. Is this as expected? > > > > cacheinfo populates the cache->shared_cpu_map on the basis of which > > CPUs share the common device-tree node for a particular cache. There > > is one l1-cache object in the device-tree for a CPU node corresponding > > to a big-core. That the L1 is further split between the threads of the > > core is shown using ibm,thread-groups. > > > > Yes. > > > The ideal thing would be to add a "group_leader" field to "struct > > cache" so that we can create separate cache objects , one per thread > > group. I will take a stab at this in the v2. > > > > I am not saying this needs to be done immediately. We could add a TODO and > get it done later. Your patch is not making it worse. Its just that there is > still something more left to be done. Yeah, it needs to be fixed but it may not be a 5.11 target. For now I will fix this patch to take care of the build errors on !PPC64 !SMT configs. I will post a separate series for making cacheinfo.c aware of thread-groups at the time of construction of the cache-chain. > > -- > Thanks and Regards > Srikar Dronamraju
Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
On Wed, Dec 09, 2020 at 02:05:41PM +0530, Srikar Dronamraju wrote: > * Gautham R Shenoy [2020-12-08 22:55:40]: > > > > > > > NIT: > > > tglx mentions in one of his recent comments to try keep a reverse fir tree > > > ordering of variables where possible. > > > > I suppose you mean moving the longer local variable declarations to to > > the top and shorter ones to the bottom. Thanks. Will fix this. > > > > Yes. > > > > > + } > > > > + > > > > + if (!tg) > > > > + return -EINVAL; > > > > + > > > > + cpu_group_start = get_cpu_thread_group_start(cpu, tg); > > > > > > This whole hunk should be moved to a new function and called before > > > init_cpu_cache_map. It will simplify the logic to great extent. > > > > I suppose you are referring to the part where we select the correct > > tg. Yeah, that can move to a different helper. > > > > Yes, I would prefer if we could call this new helper outside > init_cpu_cache_map. > > > > > > > > > - zalloc_cpumask_var_node(_cpu(cpu_l1_cache_map, cpu), > > > > - GFP_KERNEL, cpu_to_node(cpu)); > > > > + mask = _cpu(cpu_l1_cache_map, cpu); > > > > + > > > > + zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu)); > > > > > > > > > > This hunk (and the next hunk) should be moved to next patch. > > > > > > > The next patch is only about introducing THREAD_GROUP_SHARE_L2. Hence > > I put in any other code in this patch, since it seems to be a logical > > place to collate whatever we have in a generic form. > > > > While I am fine with it, having a pointer that always points to the same > mask looks wierd. Sure. Moving some of this to a separate preparatory patch. > > -- > Thanks and Regards > Srikar Dronamraju
Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
On Mon, Dec 07, 2020 at 06:41:38PM +0530, Srikar Dronamraju wrote: > * Gautham R. Shenoy [2020-12-04 10:18:47]: > > > From: "Gautham R. Shenoy" > > > > > > Signed-off-by: Gautham R. Shenoy > > --- > > > > +extern bool thread_group_shares_l2; > > /* > > * On big-core systems, each core has two groups of CPUs each of which > > * has its own L1-cache. The thread-siblings which share l1-cache with > > * @cpu can be obtained via cpu_smallcore_mask(). > > + * > > + * On some big-core systems, the L2 cache is shared only between some > > + * groups of siblings. This is already parsed and encoded in > > + * cpu_l2_cache_mask(). > > */ > > static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct > > cache *cache) > > { > > if (cache->level == 1) > > return cpu_smallcore_mask(cpu); > > + if (cache->level == 2 && thread_group_shares_l2) > > + return cpu_l2_cache_mask(cpu); > > > > return >shared_cpu_map; > > As pointed with l...@intel.org, we need to do this only with #CONFIG_SMP, > even for cache->level = 1 too. Yes, I have fixed that in the next version. > > I agree that we are displaying shared_cpu_map correctly. Should we have also > update /clear shared_cpu_map in the first place. For example:- If for a P9 > core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7 but > would display 0,2,4,6. > > The drawback of this is even if cpus 0,2,4,6 are released L1 cache will not > be released. Is this as expected? cacheinfo populates the cache->shared_cpu_map on the basis of which CPUs share the common device-tree node for a particular cache. There is one l1-cache object in the device-tree for a CPU node corresponding to a big-core. That the L1 is further split between the threads of the core is shown using ibm,thread-groups. The ideal thing would be to add a "group_leader" field to "struct cache" so that we can create separate cache objects , one per thread group. I will take a stab at this in the v2. Thanks for the review comments. 
> > > -- > Thanks and Regards > Srikar Dronamraju
Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
Hello Srikar, On Mon, Dec 07, 2020 at 06:10:39PM +0530, Srikar Dronamraju wrote: > * Gautham R. Shenoy [2020-12-04 10:18:46]: > > > From: "Gautham R. Shenoy" > > > > On POWER systems, groups of threads within a core sharing the L2-cache > > can be indicated by the "ibm,thread-groups" property array with the > > identifier "2". > > > > This patch adds support for detecting this, and when present, populate > > the populating the cpu_l2_cache_mask of every CPU to the core-siblings > > which share L2 with the CPU as specified in the by the > > "ibm,thread-groups" property array. > > > > On a platform with the following "ibm,thread-group" configuration > > 0001 0002 0004 > > 0002 0004 0006 0001 > > 0003 0005 0007 0002 > > 0002 0004 0002 > > 0004 0006 0001 0003 > > 0005 0007 > > > > Without this patch, the sched-domain hierarchy for CPUs 0,1 would be > > CPU0 attaching sched-domain(s): > > domain-0: span=0,2,4,6 level=SMT > > domain-1: span=0-7 level=CACHE > > domain-2: span=0-15,24-39,48-55 level=MC > > domain-3: span=0-55 level=DIE > > > > CPU1 attaching sched-domain(s): > > domain-0: span=1,3,5,7 level=SMT > > domain-1: span=0-7 level=CACHE > > domain-2: span=0-15,24-39,48-55 level=MC > > domain-3: span=0-55 level=DIE > > > > The CACHE domain at 0-7 is incorrect since the ibm,thread-groups > > sub-array > > [0002 0002 0004 > > 0002 0004 0006 > > 0001 0003 0005 0007] > > indicates that L2 (Property "2") is shared only between the threads of a > > single > > group. There are "2" groups of threads where each group contains "4" > > threads each. The groups being {0,2,4,6} and {1,3,5,7}. 
> > > > With this patch, the sched-domain hierarchy for CPUs 0,1 would be > > CPU0 attaching sched-domain(s): > > domain-0: span=0,2,4,6 level=SMT > > domain-1: span=0-15,24-39,48-55 level=MC > > domain-2: span=0-55 level=DIE > > > > CPU1 attaching sched-domain(s): > > domain-0: span=1,3,5,7 level=SMT > > domain-1: span=0-15,24-39,48-55 level=MC > > domain-2: span=0-55 level=DIE > > > > The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1 > > resp.) gets degenerated into the SMT domain. Furthermore, the > > last-level-cache domain gets correctly set to the SMT sched-domain. > > > > Signed-off-by: Gautham R. Shenoy > > --- > > arch/powerpc/kernel/smp.c | 66 > > +-- > > 1 file changed, 58 insertions(+), 8 deletions(-) > > > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > > index 6a242a3..a116d2d 100644 > > --- a/arch/powerpc/kernel/smp.c > > +++ b/arch/powerpc/kernel/smp.c > > @@ -76,6 +76,7 @@ > > struct task_struct *secondary_current; > > bool has_big_cores; > > bool coregroup_enabled; > > +bool thread_group_shares_l2; > > Either keep this as static in this patch or add its declaration > This will be used in Patch 3 in kernel/cacheinfo.c, but not any other place. Hence I am not making it static here. 
> > > > DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); > > DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); > > @@ -99,6 +100,7 @@ enum { > > > > #define MAX_THREAD_LIST_SIZE 8 > > #define THREAD_GROUP_SHARE_L1 1 > > +#define THREAD_GROUP_SHARE_L2 2 > > struct thread_groups { > > unsigned int property; > > unsigned int nr_groups; > > @@ -107,7 +109,7 @@ struct thread_groups { > > }; > > > > /* Maximum number of properties that groups of threads within a core can > > share */ > > -#define MAX_THREAD_GROUP_PROPERTIES 1 > > +#define MAX_THREAD_GROUP_PROPERTIES 2 > > > > struct thread_groups_list { > > unsigned int nr_properties; > > @@ -121,6 +123,13 @@ struct thread_groups_list { > > */ > > DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map); > > > > +/* > > + * On some big-cores system, thread_group_l2_cache_map for each CPU > > + * corresponds to the set its siblings within the core that share the > > + * L2-cache. > > + */ > > +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
Hello Srikar, Thanks for taking a look at the patch. On Mon, Dec 07, 2020 at 05:40:42PM +0530, Srikar Dronamraju wrote: > * Gautham R. Shenoy [2020-12-04 10:18:45]: > > > From: "Gautham R. Shenoy" > > > > > > > static int parse_thread_groups(struct device_node *dn, > > - struct thread_groups *tg, > > - unsigned int property) > > + struct thread_groups_list *tglp) > > { > > - int i; > > - u32 thread_group_array[3 + MAX_THREAD_LIST_SIZE]; > > + int i = 0; > > + u32 *thread_group_array; > > u32 *thread_list; > > size_t total_threads; > > - int ret; > > + int ret = 0, count; > > + unsigned int property_idx = 0; > > NIT: > tglx mentions in one of his recent comments to try keep a reverse fir tree > ordering of variables where possible. I suppose you mean moving the longer local variable declarations to to the top and shorter ones to the bottom. Thanks. Will fix this. > > > > > + count = of_property_count_u32_elems(dn, "ibm,thread-groups"); > > + thread_group_array = kcalloc(count, sizeof(u32), GFP_KERNEL); > > ret = of_property_read_u32_array(dn, "ibm,thread-groups", > > -thread_group_array, 3); > > +thread_group_array, count); > > if (ret) > > - return ret; > > - > > - tg->property = thread_group_array[0]; > > - tg->nr_groups = thread_group_array[1]; > > - tg->threads_per_group = thread_group_array[2]; > > - if (tg->property != property || > > - tg->nr_groups < 1 || > > - tg->threads_per_group < 1) > > - return -ENODATA; > > + goto out_free; > > > > - total_threads = tg->nr_groups * tg->threads_per_group; > > + while (i < count && property_idx < MAX_THREAD_GROUP_PROPERTIES) { > > + int j; > > + struct thread_groups *tg = >property_tgs[property_idx++]; > > NIT: same as above. Ok. 
> > > > > - ret = of_property_read_u32_array(dn, "ibm,thread-groups", > > -thread_group_array, > > -3 + total_threads); > > - if (ret) > > - return ret; > > + tg->property = thread_group_array[i]; > > + tg->nr_groups = thread_group_array[i + 1]; > > + tg->threads_per_group = thread_group_array[i + 2]; > > + total_threads = tg->nr_groups * tg->threads_per_group; > > + > > + thread_list = _group_array[i + 3]; > > > > - thread_list = _group_array[3]; > > + for (j = 0; j < total_threads; j++) > > + tg->thread_list[j] = thread_list[j]; > > + i = i + 3 + total_threads; > > Can't we simply use memcpy instead? We could. But this one makes it more explicit. > > > + } > > > > - for (i = 0 ; i < total_threads; i++) > > - tg->thread_list[i] = thread_list[i]; > > + tglp->nr_properties = property_idx; > > > > - return 0; > > +out_free: > > + kfree(thread_group_array); > > + return ret; > > } > > > > /* > > @@ -805,24 +827,39 @@ static int get_cpu_thread_group_start(int cpu, struct > > thread_groups *tg) > > return -1; > > } > > > > -static int init_cpu_l1_cache_map(int cpu) > > +static int init_cpu_cache_map(int cpu, unsigned int cache_property) > > > > { > > struct device_node *dn = of_get_cpu_node(cpu, NULL); > > - struct thread_groups tg = {.property = 0, > > - .nr_groups = 0, > > - .threads_per_group = 0}; > > + struct thread_groups *tg = NULL; > > int first_thread = cpu_first_thread_sibling(cpu); > > int i, cpu_group_start = -1, err = 0; > > + cpumask_var_t *mask; > > + struct thread_groups_list *cpu_tgl = [cpu]; > > NIT: same as 1st comment. Sure, will fix this. > > > > > if (!dn) > > return -ENODATA; > > > > - err = parse_thread_groups(dn, , THREAD_GROUP_SHARE_L1); > > - if (err) > > - goto out; > > + if (!(cache_property == THREAD_GROUP_SHARE_L1)) > > + return -EINVAL; > > > > - cpu_group_start = get_cpu_thread_group_start(cpu, ); > > + if (!cpu_tgl->n
[PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties
From: "Gautham R. Shenoy"

The "ibm,thread-groups" device-tree property is an array that is used to indicate if groups of threads within a core share certain properties. It provides details of which property is being shared by which groups of threads. This array can encode information about multiple properties being shared by different thread-groups within the core.

Example: Suppose

"ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15]

This can be decomposed into two consecutive arrays:

a) [1,2,4,8,10,12,14,9,11,13,15]
b) [2,2,4,8,10,12,14,9,11,13,15]

where

a) provides information about Property "1" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and that of the second group is {9,11,13,15}. Property "1" indicates that the threads in each group share the L1 cache, the translation cache and the Instruction Data flow.

b) provides information about Property "2" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and that of the second group is {9,11,13,15}. Property "2" indicates that the threads in each group share the L2-cache.

The existing code assumes that "ibm,thread-groups" encodes information about only one property. Hence, even on platforms which encode information about multiple properties being shared by the corresponding groups of threads, the current code will only pick the first one. (In the above example, it will only consider [1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]).

This patch extends the parsing to support platforms which encode information about multiple properties being shared by the corresponding groups of threads.

Signed-off-by: Gautham R.
Shenoy --- arch/powerpc/kernel/smp.c | 146 +- 1 file changed, 92 insertions(+), 54 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 8c2857c..6a242a3 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -106,6 +106,15 @@ struct thread_groups { unsigned int thread_list[MAX_THREAD_LIST_SIZE]; }; +/* Maximum number of properties that groups of threads within a core can share */ +#define MAX_THREAD_GROUP_PROPERTIES 1 + +struct thread_groups_list { + unsigned int nr_properties; + struct thread_groups property_tgs[MAX_THREAD_GROUP_PROPERTIES]; +}; + +static struct thread_groups_list tgl[NR_CPUS] __initdata; /* * On big-cores system, cpu_l1_cache_map for each CPU corresponds to * the set its siblings that share the L1-cache. @@ -695,81 +704,94 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), /* * parse_thread_groups: Parses the "ibm,thread-groups" device tree * property for the CPU device node @dn and stores - * the parsed output in the thread_groups - * structure @tg if the ibm,thread-groups[0] - * matches @property. + * the parsed output in the thread_groups_list + * structure @tglp. * * @dn: The device node of the CPU device. - * @tg: Pointer to a thread group structure into which the parsed + * @tglp: Pointer to a thread group list structure into which the parsed * output of "ibm,thread-groups" is stored. - * @property: The property of the thread-group that the caller is - *interested in. * * ibm,thread-groups[0..N-1] array defines which group of threads in * the CPU-device node can be grouped together based on the property. * - * ibm,thread-groups[0] tells us the property based on which the + * This array can represent thread groupings for multiple properties. + * + * ibm,thread-groups[i + 0] tells us the property based on which the * threads are being grouped together. If this value is 1, it implies * that the threads in the same group share L1, translation cache. 
* - * ibm,thread-groups[1] tells us how many such thread groups exist. + * ibm,thread-groups[i+1] tells us how many such thread groups exist for the + * property ibm,thread-groups[i] - * ibm,thread-groups[2] tells us the number of threads in each such + * ibm,thread-groups[i+2] tells us the number of threads in each such * group. + * Suppose k = (ibm,thread-groups[i+1] * ibm,thread-groups[i+2]), then, - * ibm,thread-groups[3..N-1] is the list of threads identified by + * ibm,thread-groups[i+3..i+k+2] is the list of threads identified by * "ibm,ppc-interrupt-server#s" arranged as per their membership in * the grouping. - * Example: If ibm,thread-gr
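The multi-record walk that patch 1/3 performs over "ibm,thread-groups" can be illustrated with a small self-contained userspace C sketch. The record layout mirrors the commit message ([property, nr_groups, threads_per_group, thread ids...]); the struct, macro, and function names here are illustrative, not the kernel's:

```c
#include <assert.h>
#include <string.h>

#define MAX_PROPERTIES 2
#define MAX_LIST_SIZE  8

struct thread_group {
	unsigned int property;          /* 1 = shares L1, 2 = shares L2 */
	unsigned int nr_groups;
	unsigned int threads_per_group;
	unsigned int thread_list[MAX_LIST_SIZE];
};

/*
 * Walk the flattened property array one record at a time:
 * [property, nr_groups, threads_per_group, <nr_groups*threads_per_group ids>].
 * Returns the number of records parsed, or -1 on a malformed array.
 */
static int parse_thread_groups(const unsigned int *arr, int len,
			       struct thread_group *out, int max_recs)
{
	int i = 0, nr = 0;

	while (i + 3 <= len && nr < max_recs) {
		struct thread_group *tg = &out[nr];
		int total;

		tg->property = arr[i];
		tg->nr_groups = arr[i + 1];
		tg->threads_per_group = arr[i + 2];
		total = tg->nr_groups * tg->threads_per_group;
		if (total > MAX_LIST_SIZE || i + 3 + total > len)
			return -1;
		memcpy(tg->thread_list, &arr[i + 3], total * sizeof(arr[0]));
		i += 3 + total;
		nr++;
	}
	return nr;
}
```

Feeding it the example array from the commit message yields two records: property "1" with groups {8,10,12,14} and {9,11,13,15}, and property "2" with the same two groups.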
[PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache
From: "Gautham R. Shenoy" On POWER systems, groups of threads within a core sharing the L2-cache can be indicated by the "ibm,thread-groups" property array with the identifier "2". This patch adds support for detecting this and, when present, populates the cpu_l2_cache_mask of every CPU with the core-siblings which share L2 with that CPU, as specified by the "ibm,thread-groups" property array. On a platform with the following "ibm,thread-groups" configuration 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0002 0004 0006 0001 0003 0005 0007 Without this patch, the sched-domain hierarchy for CPUs 0,1 would be CPU0 attaching sched-domain(s): domain-0: span=0,2,4,6 level=SMT domain-1: span=0-7 level=CACHE domain-2: span=0-15,24-39,48-55 level=MC domain-3: span=0-55 level=DIE CPU1 attaching sched-domain(s): domain-0: span=1,3,5,7 level=SMT domain-1: span=0-7 level=CACHE domain-2: span=0-15,24-39,48-55 level=MC domain-3: span=0-55 level=DIE The CACHE domain at 0-7 is incorrect since the ibm,thread-groups sub-array [0002 0002 0004 0002 0004 0006 0001 0003 0005 0007] indicates that L2 (Property "2") is shared only between the threads of a single group. There are "2" groups of threads, where each group contains "4" threads. The groups being {0,2,4,6} and {1,3,5,7}. With this patch, the sched-domain hierarchy for CPUs 0,1 would be CPU0 attaching sched-domain(s): domain-0: span=0,2,4,6 level=SMT domain-1: span=0-15,24-39,48-55 level=MC domain-2: span=0-55 level=DIE CPU1 attaching sched-domain(s): domain-0: span=1,3,5,7 level=SMT domain-1: span=0-15,24-39,48-55 level=MC domain-2: span=0-55 level=DIE The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1 resp.) gets degenerated into the SMT domain. Furthermore, the last-level-cache domain gets correctly set to the SMT sched-domain. Signed-off-by: Gautham R.
Shenoy --- arch/powerpc/kernel/smp.c | 66 +-- 1 file changed, 58 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 6a242a3..a116d2d 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -76,6 +76,7 @@ struct task_struct *secondary_current; bool has_big_cores; bool coregroup_enabled; +bool thread_group_shares_l2; DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); @@ -99,6 +100,7 @@ enum { #define MAX_THREAD_LIST_SIZE 8 #define THREAD_GROUP_SHARE_L1 1 +#define THREAD_GROUP_SHARE_L2 2 struct thread_groups { unsigned int property; unsigned int nr_groups; @@ -107,7 +109,7 @@ struct thread_groups { }; /* Maximum number of properties that groups of threads within a core can share */ -#define MAX_THREAD_GROUP_PROPERTIES 1 +#define MAX_THREAD_GROUP_PROPERTIES 2 struct thread_groups_list { unsigned int nr_properties; @@ -121,6 +123,13 @@ struct thread_groups_list { */ DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map); +/* + * On some big-cores system, thread_group_l2_cache_map for each CPU + * corresponds to the set its siblings within the core that share the + * L2-cache. + */ +DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map); + /* SMP operations for this machine */ struct smp_ops_t *smp_ops; @@ -718,7 +727,9 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), * * ibm,thread-groups[i + 0] tells us the property based on which the * threads are being grouped together. If this value is 1, it implies - * that the threads in the same group share L1, translation cache. + * that the threads in the same group share L1, translation cache. If + * the value is 2, it implies that the threads in the same group share + * the same L2 cache. 
* * ibm,thread-groups[i+1] tells us how many such thread groups exist for the * property ibm,thread-groups[i] @@ -745,10 +756,10 @@ static void or_cpumasks_related(int i, int j, struct cpumask *(*srcmask)(int), * 12}. * * b) there are 2 groups of 4 threads each, where each group of - *threads share some property indicated by the first value 2. The - *"ibm,ppc-interrupt-server#s" of the first group is {5,7,9,11} - *and the "ibm,ppc-interrupt-server#s" of the second group is - *{6,8,10,12} structure + *threads share some property indicated by the first value 2 (L2 + *cache). The "ibm,ppc-interrupt-server#s" of the first group is + *{5,7,9,
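The per-CPU mask construction that this patch performs with cpumasks can be modelled in plain C over a bitmask. This is a toy model only — the kernel uses per-cpu `cpumask_var_t`, not an `unsigned long`, and the function name is an assumption:

```c
#include <assert.h>

/*
 * Given one "property 2" record — nr_groups groups of threads_per_group
 * interrupt-server ids each — return the set of threads that share an
 * L2 with @cpu, as a bitmask (bit i == CPU i; assumes ids < 64).
 */
static unsigned long l2_sibling_mask(const unsigned int *thread_list,
				     unsigned int nr_groups,
				     unsigned int threads_per_group,
				     unsigned int cpu)
{
	unsigned int g, t;

	for (g = 0; g < nr_groups; g++) {
		const unsigned int *grp = &thread_list[g * threads_per_group];
		unsigned long mask = 0;
		int found = 0;

		for (t = 0; t < threads_per_group; t++) {
			mask |= 1UL << grp[t];
			if (grp[t] == cpu)
				found = 1;
		}
		if (found)
			return mask;
	}
	return 1UL << cpu;	/* CPU not listed: shares only with itself */
}
```

With the groups {0,2,4,6} and {1,3,5,7} from the commit message above, CPU 0 maps to 0x55 and CPU 1 to 0xaa — matching the shared_cpu_map values this series reports in sysfs.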
[PATCH 0/3] Extend Parsing "ibm,thread-groups" for Shared-L2 information
From: "Gautham R. Shenoy" The "ibm,thread-groups" device-tree property is an array that is used to indicate if groups of threads within a core share certain properties. It provides details of which property is being shared by which groups of threads. This array can encode information about multiple properties being shared by different thread-groups within the core. Example: Suppose, "ibm,thread-groups" = [1,2,4,8,10,12,14,9,11,13,15,2,2,4,8,10,12,14,9,11,13,15] This can be decomposed into two consecutive arrays: a) [1,2,4,8,10,12,14,9,11,13,15] b) [2,2,4,8,10,12,14,9,11,13,15] wherein, a) provides information of Property "1" being shared by "2" groups, each with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of the second group is {9,11,13,15}. Property "1" is indicative of the threads in the group sharing L1 cache, translation cache and Instruction Data flow. b) provides information of Property "2" being shared by "2" groups, each group with "4" threads. The "ibm,ppc-interrupt-server#s" of the first group is {8,10,12,14} and the "ibm,ppc-interrupt-server#s" of the second group is {9,11,13,15}. Property "2" indicates that the threads in each group share the L2-cache. The existing code assumes that "ibm,thread-groups" encodes information about only one property. Hence even on platforms which encode information about multiple properties being shared by the corresponding groups of threads, the current code will only pick the first one. (In the above example, it will only consider [1,2,4,8,10,12,14,9,11,13,15] but not [2,2,4,8,10,12,14,9,11,13,15]). Furthermore, currently on platforms where groups of threads share L2 cache, we incorrectly create an extra CACHE level sched-domain that maps to all the threads of the core.
For example, if "ibm,thread-groups" is 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0002 0004 0006 0001 0003 0005 0007 then, the sub-array [0002 0002 0004 0002 0004 0006 0001 0003 0005 0007] indicates that L2 (Property "2") is shared only between the threads of a single group. There are "2" groups of threads, where each group contains "4" threads. The groups being {0,2,4,6} and {1,3,5,7}. However, the sched-domain hierarchy for CPUs 0,1 is CPU0 attaching sched-domain(s): domain-0: span=0,2,4,6 level=SMT domain-1: span=0-7 level=CACHE domain-2: span=0-15,24-39,48-55 level=MC domain-3: span=0-55 level=DIE CPU1 attaching sched-domain(s): domain-0: span=1,3,5,7 level=SMT domain-1: span=0-7 level=CACHE domain-2: span=0-15,24-39,48-55 level=MC domain-3: span=0-55 level=DIE where the CACHE domain reports that L2 is shared across the entire core which is incorrect on such platforms. This patchset remedies these issues by extending the parsing support for "ibm,thread-groups" to discover information about multiple properties being shared by the corresponding groups of threads. In particular, we can now detect if the groups of threads within a core share the L2-cache. On such platforms, we populate the cpu_l2_cache_mask of every CPU with the core-siblings which share L2 with that CPU, as specified by the "ibm,thread-groups" property array. With the patchset, the sched-domain hierarchy is correctly reported. For example, for CPUs 0,1 with the patchset: CPU0 attaching sched-domain(s): domain-0: span=0,2,4,6 level=SMT domain-1: span=0-15,24-39,48-55 level=MC domain-2: span=0-55 level=DIE CPU1 attaching sched-domain(s): domain-0: span=1,3,5,7 level=SMT domain-1: span=0-15,24-39,48-55 level=MC domain-2: span=0-55 level=DIE The CACHE domain with span=0,2,4,6 for CPU 0 (span=1,3,5,7 for CPU 1 resp.) gets degenerated into the SMT domain. Furthermore, the last-level-cache domain gets correctly set to the SMT sched-domain.
Finally, this patchset reports the correct shared_cpu_map/list in the sysfs for L2 cache on such platforms. With the patchset, for CPUs 0, 1, for L2 cache we see the correct shared_cpu_map/list /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6 /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa The patchset has been tested on older platform
[PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache
From: "Gautham R. Shenoy" On POWER platforms where only some groups of threads within a core share the L2-cache (indicated by the ibm,thread-groups device-tree property), we currently print the incorrect shared_cpu_map/list for L2-cache in the sysfs. This patch reports the correct shared_cpu_map/list on such platforms. Example: On a platform with "ibm,thread-groups" set to 0001 0002 0004 0002 0004 0006 0001 0003 0005 0007 0002 0002 0004 0002 0004 0006 0001 0003 0005 0007 This indicates that threads {0,2,4,6} in the core share the L2-cache and threads {1,3,5,7} in the core share the L2 cache. However, without the patch, the shared_cpu_map/list for L2 for CPUs 0, 1 is reported in the sysfs as follows: /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0-7 /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,00ff /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:0-7 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00ff With the patch, the shared_cpu_map/list for L2 cache for CPUs 0, 1 is correctly reported as follows: /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6 /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00,0055 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_list:1,3,5,7 /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map:00,00aa Signed-off-by: Gautham R. Shenoy --- arch/powerpc/kernel/cacheinfo.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c index 65ab9fc..1cc8f37 100644 --- a/arch/powerpc/kernel/cacheinfo.c +++ b/arch/powerpc/kernel/cacheinfo.c @@ -651,15 +651,22 @@ static unsigned int index_dir_to_cpu(struct cache_index_dir *index) return dev->id; } +extern bool thread_group_shares_l2; /* * On big-core systems, each core has two groups of CPUs each of which * has its own L1-cache. The thread-siblings which share l1-cache with * @cpu can be obtained via cpu_smallcore_mask(). 
+ * + * On some big-core systems, the L2 cache is shared only between some + * groups of siblings. This is already parsed and encoded in + * cpu_l2_cache_mask(). */ static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache) { if (cache->level == 1) return cpu_smallcore_mask(cpu); + if (cache->level == 2 && thread_group_shares_l2) + return cpu_l2_cache_mask(cpu); return &cache->shared_cpu_map; } -- 1.9.4
Re: [PATCH 20/33] docs: ABI: testing: make the files compatible with ReST output
y Failure'. > > - overcurrent : This file gives the total number of times the > - max frequency is throttled due to 'Overcurrent'. > + max frequency is throttled due to 'Overcurrent'. > > - occ_reset : This file gives the total number of times the max > - frequency is throttled due to 'OCC Reset'. > + frequency is throttled due to 'OCC Reset'. > > The sysfs attributes representing different throttle reasons > like > powercap, overtemp, supply_fault, overcurrent and occ_reset map > to This hunk for the powernv cpufreq driver looks good to me. For these two hunks, Reviewed-by: Gautham R. Shenoy
Re: [RFC v4 1/1] selftests/cpuidle: Add support for cpuidle latency measurement
On Wed, Sep 02, 2020 at 05:15:06PM +0530, Pratik Rajesh Sampat wrote: > Measure cpuidle latencies on wakeup to determine and compare with the > advertised wakeup latencies for each idle state. > > Cpuidle wakeup latencies are determined for IPIs and Timer events and > can help determine any deviations from what is advertised by the > hardware. > > A baseline measurement for each case of IPI and timers is taken at > 100 percent CPU usage to quantify for the kernel-userspace overhead > during execution. > > Signed-off-by: Pratik Rajesh Sampat > --- > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/cpuidle/Makefile | 7 + > tools/testing/selftests/cpuidle/cpuidle.c | 616 ++ > tools/testing/selftests/cpuidle/settings | 1 + > 4 files changed, 625 insertions(+) > create mode 100644 tools/testing/selftests/cpuidle/Makefile > create mode 100644 tools/testing/selftests/cpuidle/cpuidle.c > create mode 100644 tools/testing/selftests/cpuidle/settings > > diff --git a/tools/testing/selftests/Makefile > b/tools/testing/selftests/Makefile > index 9018f45d631d..2bb0e87f76fd 100644 > --- a/tools/testing/selftests/Makefile > +++ b/tools/testing/selftests/Makefile > @@ -8,6 +8,7 @@ TARGETS += cgroup > TARGETS += clone3 > TARGETS += core > TARGETS += cpufreq > +TARGETS += cpuidle > TARGETS += cpu-hotplug > TARGETS += drivers/dma-buf > TARGETS += efivarfs > diff --git a/tools/testing/selftests/cpuidle/Makefile > b/tools/testing/selftests/cpuidle/Makefile > new file mode 100644 > index ..d332485e1bc5 > --- /dev/null > +++ b/tools/testing/selftests/cpuidle/Makefile > @@ -0,0 +1,7 @@ > +# SPDX-License-Identifier: GPL-2.0 > +TEST_GEN_PROGS := cpuidle > + > +CFLAGS += -O2 > +LDLIBS += -lpthread > + > +include ../lib.mk > diff --git a/tools/testing/selftests/cpuidle/cpuidle.c > b/tools/testing/selftests/cpuidle/cpuidle.c > new file mode 100644 > index ..4b1e7a91f75c > --- /dev/null > +++ b/tools/testing/selftests/cpuidle/cpuidle.c > @@ -0,0 +1,616 @@ > +//
SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Cpuidle latency measurement microbenchmark > + * > + * A mechanism to measure wakeup latency for IPI and Timer based interrupts > + * Results of this microbenchmark can be used to check and validate against > the > + * advertised latencies for each cpuidle state > + * > + * IPIs (using pipes) and Timers are used to wake the CPU up and measure the > + * time difference > + * > + * Usage: > + * ./cpuidle --mode --output > + * > + * Copyright (C) 2020 Pratik Rajesh Sampat , IBM > + */ > + > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define READ 0 > +#define WRITE 1 > +#define TIMEOUT_US 50 > + > +static int pipe_fd[2]; > +static int *cpu_list; > +static int cpus; > +static int idle_states; > +static uint64_t *latency_list; > +static uint64_t *residency_list; > + > +static char *log_file = "cpuidle.log"; > + > +static int get_online_cpus(int *online_cpu_list, int total_cpus) > +{ > + char filename[80]; > + int i, index = 0; > + FILE *fptr; > + > + for (i = 0; i < total_cpus; i++) { > + char status; > + > + sprintf(filename, "/sys/devices/system/cpu/cpu"); > + sprintf(filename + strlen(filename), "%d%s", i, "/online"); > + fptr = fopen(filename, "r"); > + if (!fptr) > + continue; > + assert(fscanf(fptr, "%c", &status) != EOF); > + if (status == '1') > + online_cpu_list[index++] = i; > + fclose(fptr); > + } > + return index; > +} > + > +static uint64_t us_to_ns(uint64_t val) > +{ > + return val * 1000; > +} > + > +static void get_latency(int cpu) > +{ > + char filename[80]; > + uint64_t latency; > + FILE *fptr; > + int state; > + > + for (state = 0; state < idle_states; state++) { > + sprintf(filename, "%s%d%s%d%s", "/sys/devices/system/cpu/cpu", > + cpu, "/cpuidle/state", > + state, "/latency"); > + fptr = fopen(filename, "r"); > + assert(fptr); > + > + assert(fscanf(fptr, 
"%ld", &latency) != EOF); > + latency_list[state] = latency; > + fclose(fptr); > + } > +} > + > +static void get_residency(int cpu) > +{ > + uint64_t residency; > + char filename[80]; > + FILE *fptr; > + int state; > + > + for (state = 0; state < idle_states; state++) { > + sprintf(filename, "%s%d%s%d%s", "/sys/devices/system/cpu/cpu", > + cpu, "/cpuidle/state", > + state, "/residency"); > + fptr = fopen(filename, "r"); > + assert(fptr); > + > + assert(fscanf(fptr, "%ld", &residency) != EOF);
Re: [RFC v4 0/1] Selftest for cpuidle latency measurement
On Wed, Sep 02, 2020 at 05:15:05PM +0530, Pratik Rajesh Sampat wrote: > Changelog v3-->v4: > 1. Overhaul in implementation from kernel module to a userspace selftest > --- > > The patch series introduces a mechanism to measure wakeup latency for > IPI and timer based interrupts > The motivation behind this series is to find significant deviations > behind advertised latency and residency values > > To achieve this in the userspace, IPI latencies are calculated by > sending information through pipes and inducing a wakeup; similarly, > alarm events are set up to calculate timer based wakeup latencies. > > To account for delays from kernel-userspace interactions, baseline > observations are taken on a 100% busy CPU and subsequent observations > must be considered relative to that. > > In theory, wakeups induced by IPI and Timers should have similar > wakeup latencies, however in practice there may be deviations which may > need to be captured. > > One downside of the userspace approach in contrast to the kernel > implementation is that the run to run variance can turn out to be high > in the order of ms; which is the scope of the experiments at times. > > Another downside of the userspace approach is that it takes much longer > to run and hence the command-line options quick and full are added to make > sure quick 1 CPU tests can be carried out when needed and otherwise it > can carry out a full system comprehensive test. > > Usage > --- > ./cpuidle --mode --output > full: runs on all CPUs > quick: run on a random CPU > num_cpus: Limit the number of CPUs to run on > > Sample output snippet > - > --IPI Latency Test--- > SRC_CPU DEST_CPU IPI_Latency(ns) > ... > 0 5 256178 > 0 6 478161 > 0 7 285445 > 0 8 273553 > Expected IPI latency(ns): 10 > Observed Average IPI latency(ns): 248334 I suppose by run-to-run variance you are referring to the outliers in the above sequence (like 478161) ? 
Or is it that each time you run your test program you observe completely different series of values ? If it is the former, then perhaps we could discard the outliers for the purpose of average latency computation and print the max, min and the corrected-average values above. > > --Timeout Latency Test-- > --Baseline Timeout Latency measurement: CPU Busy-- > Wakeup_src Baseline_delay(ns) > ... > 32 972405 > 33 1004287 > 34 986663 > 35 994022 > Expected timeout(ns): 1000 > Observed Average timeout diff(ns): 991844 > It would be good to see a complete sample output, perhaps for the --mode=10 so that it is easy to discern if there are cases when the observed timeouts/IPI latencies for the busy case are larger than the idle-case. > Pratik Rajesh Sampat (1): > selftests/cpuidle: Add support for cpuidle latency measurement > > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/cpuidle/Makefile | 7 + > tools/testing/selftests/cpuidle/cpuidle.c | 616 ++ > tools/testing/selftests/cpuidle/settings | 1 + > 4 files changed, 625 insertions(+) > create mode 100644 tools/testing/selftests/cpuidle/Makefile > create mode 100644 tools/testing/selftests/cpuidle/cpuidle.c > create mode 100644 tools/testing/selftests/cpuidle/settings > > -- > 2.26.2 > -- Thanks and Regards gautham.
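The outlier-discarding suggestion above could look roughly as follows in the selftest's C. This is only a sketch — the function names and the symmetric-trim policy are assumptions, not part of the posted selftest:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the samples, drop the `trim` smallest and `trim` largest, and
 * average the remainder; also report min and max of the full set so
 * the outliers are still visible in the output.
 */
static uint64_t trimmed_avg(uint64_t *s, int n, int trim,
			    uint64_t *min, uint64_t *max)
{
	uint64_t sum = 0;
	int i;

	qsort(s, n, sizeof(*s), cmp_u64);
	*min = s[0];
	*max = s[n - 1];
	for (i = trim; i < n - trim; i++)
		sum += s[i];
	return sum / (n - 2 * trim);
}
```

Applied to the four IPI samples in the snippet above, the 478161 outlier no longer dominates the average but is still reported as the max.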
[PATCH v2] cpuidle-pseries: Fix CEDE latency conversion from tb to us
From: "Gautham R. Shenoy" commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)") sets the exit latency of CEDE(0) based on the latency values of the Extended CEDE states advertised by the platform. The values advertised by the platform are in timebase ticks. However the cpuidle framework requires the latency values in microseconds. If the tb-ticks value advertised by the platform corresponds to a value smaller than 1us, during the conversion from tb-ticks to microseconds, in the current code, the result becomes zero. This is incorrect as it puts a CEDE state on par with the snooze state. This patch fixes this by rounding up the result obtained while converting the latency value from tb-ticks to microseconds. It also prints a warning in case we discover an extended-cede state with a wakeup latency of 0. In such a case, it ensures that CEDE(0) has a non-zero wakeup latency. Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)") Signed-off-by: Gautham R. Shenoy --- v1-->v2: Added a warning if a CEDE state has 0 wakeup latency (Suggested by Joel Stanley) Also added code to ensure that CEDE(0) has a non-zero wakeup latency. 
drivers/cpuidle/cpuidle-pseries.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index ff6d99e..a2b5c6f 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -361,7 +361,10 @@ static void __init fixup_cede0_latency(void) for (i = 0; i < nr_xcede_records; i++) { struct xcede_latency_record *record = &payload->records[i]; u64 latency_tb = be64_to_cpu(record->latency_ticks); - u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; + u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC); + + if (latency_us == 0) + pr_warn("cpuidle: xcede record %d has an unrealistic latency of 0us.\n", i); if (latency_us < min_latency_us) min_latency_us = latency_us; @@ -378,10 +381,14 @@ static void __init fixup_cede0_latency(void) * Perform the fix-up. */ if (min_latency_us < dedicated_states[1].exit_latency) { - u64 cede0_latency = min_latency_us - 1; + /* +* We set a minimum of 1us wakeup latency for cede0 to +* distinguish it from snooze +*/ + u64 cede0_latency = 1; - if (cede0_latency <= 0) - cede0_latency = min_latency_us; + if (min_latency_us > cede0_latency) + cede0_latency = min_latency_us - 1; dedicated_states[1].exit_latency = cede0_latency; dedicated_states[1].target_residency = 10 * (cede0_latency); -- 1.9.4
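The truncation-vs-round-up distinction at the heart of this fix is easy to demonstrate in isolation. The helper below mirrors the kernel's DIV_ROUND_UP_ULL for these operand sizes; the 500ns figure is an illustrative sub-microsecond latency, not a value from a real platform:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000ULL

/* Round-up integer division, as DIV_ROUND_UP_ULL does in the kernel. */
static uint64_t div_round_up_ull(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}
```

A 500ns wakeup latency truncates to 0us with plain division — putting CEDE(0) on par with snooze — but rounds up to 1us with the fix.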
Re: [PATCH] cpuidle-pseries: Fix CEDE latency conversion from tb to us
Hello Joel, On Wed, Sep 02, 2020 at 01:08:35AM +, Joel Stanley wrote: > On Tue, 1 Sep 2020 at 14:09, Gautham R. Shenoy > wrote: > > > > From: "Gautham R. Shenoy" > > > > commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values > > of the Extended CEDE states advertised by the platform. The values > > advertised by the platform are in timebase ticks. However the cpuidle > > framework requires the latency values in microseconds. > > > > If the tb-ticks value advertised by the platform correspond to a value > > smaller than 1us, during the conversion from tb-ticks to microseconds, > > in the current code, the result becomes zero. This is incorrect as it > > puts a CEDE state on par with the snooze state. > > > > This patch fixes this by rounding up the result obtained while > > converting the latency value from tb-ticks to microseconds. > > > > Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for > > CEDE(0)") > > > > Signed-off-by: Gautham R. Shenoy > > Reviewed-by: Joel Stanley > Thanks for reviewing the fix. > Should you check for the zero case and print a warning? Yes, that would be better. I will post a v2 with that. 
> > --- > > drivers/cpuidle/cpuidle-pseries.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > > b/drivers/cpuidle/cpuidle-pseries.c > > index ff6d99e..9043358 100644 > > --- a/drivers/cpuidle/cpuidle-pseries.c > > +++ b/drivers/cpuidle/cpuidle-pseries.c > > @@ -361,7 +361,7 @@ static void __init fixup_cede0_latency(void) > > for (i = 0; i < nr_xcede_records; i++) { > > struct xcede_latency_record *record = &payload->records[i]; > > u64 latency_tb = be64_to_cpu(record->latency_ticks); > > - u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; > > + u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), > > NSEC_PER_USEC); > > > > if (latency_us < min_latency_us) > > min_latency_us = latency_us; > > -- > > 1.9.4 > >
[PATCH] cpuidle-pseries: Fix CEDE latency conversion from tb to us
From: "Gautham R. Shenoy" commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)") sets the exit latency of CEDE(0) based on the latency values of the Extended CEDE states advertised by the platform. The values advertised by the platform are in timebase ticks. However the cpuidle framework requires the latency values in microseconds. If the tb-ticks value advertised by the platform corresponds to a value smaller than 1us, during the conversion from tb-ticks to microseconds, in the current code, the result becomes zero. This is incorrect as it puts a CEDE state on par with the snooze state. This patch fixes this by rounding up the result obtained while converting the latency value from tb-ticks to microseconds. Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)") Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index ff6d99e..9043358 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -361,7 +361,7 @@ static void __init fixup_cede0_latency(void) for (i = 0; i < nr_xcede_records; i++) { struct xcede_latency_record *record = &payload->records[i]; u64 latency_tb = be64_to_cpu(record->latency_ticks); - u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; + u64 latency_us = DIV_ROUND_UP_ULL(tb_to_ns(latency_tb), NSEC_PER_USEC); if (latency_us < min_latency_us) min_latency_us = latency_us; -- 1.9.4
Re: [PATCH v5 06/10] powerpc/smp: Optimize start_secondary
Hi Srikar, On Mon, Aug 10, 2020 at 12:48:30PM +0530, Srikar Dronamraju wrote: > In start_secondary, even if shared_cache was already set, system does a > redundant match for cpumask. This redundant check can be removed by > checking if shared_cache is already set. > > While here, localize the sibling_mask variable to within the if > condition. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Cc: Vaidyanathan Srinivasan > Signed-off-by: Srikar Dronamraju The change looks good to me. Reviewed-by: Gautham R. Shenoy > --- > Changelog v4 ->v5: > Retain cache domain, no need for generalization > (Michael Ellerman, Peter Zijlstra, >Valentin Schneider, Gautham R. Shenoy) > > Changelog v1 -> v2: > Moved shared_cache topology fixup to fixup_topology (Gautham) > > arch/powerpc/kernel/smp.c | 17 +++-- > 1 file changed, 11 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 0c960ce3be42..91cf5d05e7ec 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -851,7 +851,7 @@ static int powerpc_shared_cache_flags(void) > */ > static const struct cpumask *shared_cache_mask(int cpu) > { > - return cpu_l2_cache_mask(cpu); > + return per_cpu(cpu_l2_cache_map, cpu); > } > > #ifdef CONFIG_SCHED_SMT > @@ -1305,7 +1305,6 @@ static void add_cpu_to_masks(int cpu) > void start_secondary(void *unused) > { > unsigned int cpu = smp_processor_id(); > - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; > > mmgrab(&init_mm); > current->active_mm = &init_mm; > @@ -1331,14 +1330,20 @@ void start_secondary(void *unused) > /* Update topology CPU masks */ > add_cpu_to_masks(cpu); > > - if (has_big_cores) > - sibling_mask = cpu_smallcore_mask; > /* >* Check for any shared caches. 
Note that this must be done on a >* per-core basis because one core in the pair might be disabled. >*/ > - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu))) > - shared_caches = true; > + if (!shared_caches) { > + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; > + struct cpumask *mask = cpu_l2_cache_mask(cpu); > + > + if (has_big_cores) > + sibling_mask = cpu_smallcore_mask; > + > + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))) > + shared_caches = true; > + } > > set_numa_node(numa_cpu_lookup_table[cpu]); > set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); > -- > 2.18.2 >
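The rewritten check works because cpu_l2_cache_mask() is always a superset of the SMT sibling mask, so comparing weights is equivalent to (and cheaper than) the old cpumask_equal() test. A userspace bitmask sketch of the same logic — valid only under that superset invariant, which holds on these platforms:

```c
#include <assert.h>

/*
 * shared_caches iff the L2 mask covers more CPUs than the SMT sibling
 * mask. Because the L2 mask is a superset of the sibling mask, "more
 * bits set" implies "not equal" and vice versa.
 */
static int has_shared_caches(unsigned long l2_mask, unsigned long sibling_mask)
{
	return __builtin_popcountl(l2_mask) > __builtin_popcountl(sibling_mask);
}
```

For a big core whose L2 spans both small cores, the L2 mask is strictly larger than either small core's sibling mask, so shared_caches gets set; when the two masks coincide, it does not.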
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi Srikar, Valentin, On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote: > * Valentin Schneider [2020-07-28 16:03:11]: > [..snip..] > At this time the current topology would be good enough i.e BIGCORE would > always be equal to a MC. However in future we could have chips that can have > lesser/larger number of CPUs in llc than in a BIGCORE or we could have > granular or split L3 caches within a DIE. In such a case BIGCORE != MC. > > Also in the current P9 itself, two neighbouring core-pairs form a quad. > Cache latency within a quad is better than a latency to a distant core-pair. > Cache latency within a core pair is way better than latency within a quad. > So if we have only 4 threads running on a DIE all of them accessing the same > cache-lines, then we could probably benefit if all the tasks were to run > within the quad aka MC/Coregroup. > > I have found some benchmarks which are latency sensitive to benefit by > having a grouping a quad level (using kernel hacks and not backed by > firmware changes). Gautham also found similar results in his experiments > but he only used binding within the stock kernel. > > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC > domain need not be LLC domain for Power. I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best results for POWER9 in terms of cache-benefits during wakeup. On a POWER9 Boston machine, running a producer-consumer test case (https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c). The test case creates two threads, one Producer and another Consumer. Both work on a fairly large shared array of size 64M. In an iteration, the Producer performs stores to 1024 random locations and wakes up the Consumer. In its iteration, the Consumer loads from those exact 1024 locations. We measure the number of Consumer iterations per second and the average time for each Consumer iteration. The smaller the time, the better it is. 
The following results are when I pinned the Producer and Consumer to different combinations of CPUs to cover Small core, Big-core, Neighbouring Big-core, Far off core within the same chip, and across chips. There is also a case where they are not affined anywhere, and we let the scheduler wake them up correctly. We find the best results when the Producer and Consumer are within the same L2 domain. These numbers are also close to the numbers that we get when we let the Scheduler wake them up (where LLC is L2). ## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain) Consumer affined to CPU 3 Producer affined to CPU 1 4698 iterations, avg time: 20034 ns 4951 iterations, avg time: 20012 ns 4957 iterations, avg time: 19971 ns 4968 iterations, avg time: 19985 ns 4970 iterations, avg time: 19977 ns ## Same Big Core (8 threads: Shares L2, L3, Frequency Domain) Consumer affined to CPU 7 Producer affined to CPU 1 4580 iterations, avg time: 19403 ns 4851 iterations, avg time: 19373 ns 4849 iterations, avg time: 19394 ns 4856 iterations, avg time: 19394 ns 4867 iterations, avg time: 19353 ns ## Neighbouring Big-core (Faster data-snooping from L2. 
Shares L3, Frequency Domain) Producer affined to CPU 1 Consumer affined to CPU 11 4270 iterations, avg time: 24158 ns 4491 iterations, avg time: 24157 ns 4500 iterations, avg time: 24148 ns 4516 iterations, avg time: 24164 ns 4518 iterations, avg time: 24165 ns ## Any other Big-core from Same Chip (Shares L3) Producer affined to CPU 1 Consumer affined to CPU 87 4176 iterations, avg time: 27953 ns 4417 iterations, avg time: 27925 ns 4415 iterations, avg time: 27934 ns 4417 iterations, avg time: 27983 ns 4430 iterations, avg time: 27958 ns ## Different Chips (No cache-sharing) Consumer affined to CPU 175 Producer affined to CPU 1 3277 iterations, avg time: 50786 ns 3063 iterations, avg time: 50732 ns 2831 iterations, avg time: 50737 ns 2859 iterations, avg time: 50688 ns 2849 iterations, avg time: 50722 ns ## Without affining them (Let Scheduler wake-them up appropriately) Consumer affined to CPU 0-175 Producer affined to CPU 0-175 4821 iterations, avg time: 19412 ns 4863 iterations, avg time: 19435 ns 4855 iterations, avg time: 19381 ns 4811 iterations, avg time: 19458 ns 4892 iterations, avg time: 19429 ns -- Thanks and Regards gautham.
Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
On Mon, Jul 27, 2020 at 11:02:26AM +0530, Srikar Dronamraju wrote: > Currently "CACHE" domain happens to be the 2nd sched domain as per > powerpc_topology. This domain will collapse if cpumask of l2-cache is > same as SMT domain. However we could generalize this domain such that it > could mean either be a "CACHE" domain or a "BIGCORE" domain. > > While setting up the "CACHE" domain, check if shared_cache is already > set. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju Reviewed-by: Gautham R. Shenoy > --- > Changelog v1 -> v2: > Moved shared_cache topology fixup to fixup_topology (Gautham) > > arch/powerpc/kernel/smp.c | 48 +++ > 1 file changed, 34 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index d997c7411664..3c5ccf6d2b1c 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map); > EXPORT_PER_CPU_SYMBOL(cpu_core_map); > EXPORT_SYMBOL_GPL(has_big_cores); > > +enum { > +#ifdef CONFIG_SCHED_SMT > + smt_idx, > +#endif > + bigcore_idx, > + die_idx, > +}; > + > #define MAX_THREAD_LIST_SIZE 8 > #define THREAD_GROUP_SHARE_L1 1 > struct thread_groups { > @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void) > */ > static const struct cpumask *shared_cache_mask(int cpu) > { > - if (shared_caches) > - return cpu_l2_cache_mask(cpu); > - > - if (has_big_cores) > - return cpu_smallcore_mask(cpu); > - > - return per_cpu(cpu_sibling_map, cpu); > + return per_cpu(cpu_l2_cache_map, cpu); > } > > #ifdef CONFIG_SCHED_SMT > @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu) > } > #endif > > +static const struct cpumask *cpu_bigcore_mask(int cpu) > +{ > + 
return per_cpu(cpu_sibling_map, cpu); > +} > + > static struct sched_domain_topology_level powerpc_topology[] = { > #ifdef CONFIG_SCHED_SMT > { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > #endif > - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) }, > + { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) }, > { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > { NULL, }, > }; > @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu) > void start_secondary(void *unused) > { > unsigned int cpu = smp_processor_id(); > - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; > > mmgrab(&init_mm); > current->active_mm = &init_mm; > @@ -1337,14 +1343,20 @@ void start_secondary(void *unused) > /* Update topology CPU masks */ > add_cpu_to_masks(cpu); > > - if (has_big_cores) > - sibling_mask = cpu_smallcore_mask; > /* >* Check for any shared caches. Note that this must be done on a >* per-core basis because one core in the pair might be disabled. >*/ > - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu))) > - shared_caches = true; > + if (!shared_caches) { > + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; > + struct cpumask *mask = cpu_l2_cache_mask(cpu); > + > + if (has_big_cores) > + sibling_mask = cpu_smallcore_mask; > + > + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))) > + shared_caches = true; > + } > > set_numa_node(numa_cpu_lookup_table[cpu]); > set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); > @@ -1375,9 +1387,17 @@ static void fixup_topology(void) > #ifdef CONFIG_SCHED_SMT > if (has_big_cores) { > pr_info("Big cores detected but using small core scheduling\n"); > - powerpc_topology[0].mask = smallcore_smt_mask; > + powerpc_topology[smt_idx].mask = smallcore_smt_mask; > } > #endif > + if (shared_caches) { > + pr_info("Using shared cache scheduler topology\n"); > + powerpc_topology[bigcore_idx].mask = shared_cache_mask; > + powerpc_topology[bigcore_idx].sd_flags = > powerpc_shared_cache_flags; > +#ifdef 
CONFIG_SCHED_DEBUG > + powerpc_topology[bigcore_idx].name = "CACHE"; > +#endif > + } > } > > void __init smp_cpus_done(unsigned int max_cpus) > -- > 2.17.1 >
[PATCH v3 3/3] cpuidle-pseries : Fixup exit latency for CEDE(0)
From: "Gautham R. Shenoy" We are currently assuming that CEDE(0) has exit latency 10us, since there is no way for us to query from the platform. However, if the wakeup latency of an Extended CEDE state is smaller than 10us, then we can be sure that the exit latency of CEDE(0) cannot be more than that. In this patch, we fix the exit latency of CEDE(0) if we discover an Extended CEDE state with wakeup latency smaller than 10us. Benchmark results: On POWER8, this patch does not have any impact since the advertised latency of Extended CEDE (1) is 30us which is higher than the default latency of CEDE (0) which is 10us. On POWER9 we see an improvement in the single-threaded performance of ebizzy, and no regression in the wakeup latency or the number of context-switches. ebizzy: 2 ebizzy threads bound to the same big-core. 25% improvement in the avg records/s with patch. x without_patch * with_patch N Min Max Median Avg Stddev x 10 2491089 5834307 5398375 4244335 1596244.9 * 10 2893813 5834474 5832448 5327281.3 1055941.4 context_switch2 : There is no major regression observed with this patch as seen from the context_switch2 benchmark. context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different small cores). We observe a minor 0.14% regression in the number of context-switches (higher is better). x without_patch * with_patch N Min Max Median Avg Stddev x 500 348872 362236 354712 354745.69 2711.827 * 500 349422 361452 353942 354215.4 2576.9258 Difference at 99.0% confidence -530.288 +/- 430.963 -0.149484% +/- 0.121485% (Student's t, pooled s = 2645.24) context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37% improvement in the number of context-switches (higher is better). 
x without_patch * with_patch N Min Max Median Avg Stddev x 500 287956 294940 288896 288977.23 646.59295 * 500 288300 294646 289582 290064.76 1161.9992 Difference at 99.0% confidence 1087.53 +/- 153.194 0.376337% +/- 0.0530125% (Student's t, pooled s = 940.299) schbench: No major difference could be seen until the 99.9th percentile. Without-patch Latency percentiles (usec) 50.0th: 29 75.0th: 39 90.0th: 49 95.0th: 59 *99.0th: 13104 99.5th: 14672 99.9th: 15824 min=0, max=17993 With-patch: Latency percentiles (usec) 50.0th: 29 75.0th: 40 90.0th: 50 95.0th: 61 *99.0th: 13648 99.5th: 14768 99.9th: 15664 min=0, max=29812 Signed-off-by: Gautham R. Shenoy --- v2-->v3 : Made notation consistent with first two patches. drivers/cpuidle/cpuidle-pseries.c | 41 +-- 1 file changed, 39 insertions(+), 2 deletions(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index f528da7..8d19820 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -350,13 +350,50 @@ static int pseries_cpuidle_driver_init(void) return 0; } -static void __init parse_xcede_idle_states(void) +static void __init fixup_cede0_latency(void) { + int i; + u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */ + struct xcede_latency_payload *payload; + if (parse_cede_parameters()) return; pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n", nr_xcede_records); + + payload = &xcede_latency_parameter.payload; + for (i = 0; i < nr_xcede_records; i++) { + struct xcede_latency_record *record = &payload->records[i]; + u64 latency_tb = be64_to_cpu(record->latency_ticks); + u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; + + if (latency_us < min_latency_us) + min_latency_us = latency_us; + } + + /* +* By default, we assume that CEDE(0) has exit latency 10us, +* since there is no way for us to query from the platform. 
+* +* However, if the wakeup latency of an Extended CEDE state is +* smaller than 10us, then we can be sure that CEDE(0) +* requires no more than that. +* +* Perform the fix-up. +*/ + if (min_latency_us < dedicated_states[1].exit_latency) { + u64 cede0_latency = min_latency_us - 1; + + if (cede0_latency <= 0) + cede0_latency = min_latency_us; + + dedicated_states[1].exit_latency = cede0_latency; + dedicated_states[1].target_residency = 10 * (cede0_latency); + pr_i
[PATCH v3 0/3] cpuidle-pseries: Parse extended CEDE information for idle.
From: "Gautham R. Shenoy" This is a v3 of the patch series to parse the extended CEDE information in the pseries-cpuidle driver. The previous two versions of the patches can be found here: v2: https://lore.kernel.org/lkml/1596005254-25753-1-git-send-email-...@linux.vnet.ibm.com/ v1: https://lore.kernel.org/linuxppc-dev/1594120299-31389-1-git-send-email-...@linux.vnet.ibm.com/ The changes from v2 --> v3 : * Patch 1: Got rid of some #define-s which were needed mainly for Patches 4 and 5 of v1, but were retained in v2. * Patch 2: * Based on feedback from Michael Ellerman, rewrote the function to parse the extended idle states by explicitly defining the structure of the object that is returned by ibm,get-system-parameters(CEDE_LATENCY_TOKEN) rtas-call. In the previous versions we were passing a character array and subsequently parsing the individual elements which can be bug-prone. This also gets rid of the excessive (cast *)ing that was in the previous versions. * Marked some of the functions static and annotated some of the functions with __init and data with __initdata. This makes Sparse happy. * Added comments for CEDE_LATENCY_TOKEN. * Renamed add_pseries_idle_states() to parse_xcede_idle_states(). Again, this is because Patch 4 and 5 from v1 are no longer there. * Patch 3: No functional changes, but minor changes to be consistent with Patch 1 and 2 of this series. I have additionally tested the code on a POWER8 dedicated LPAR and found that it has no impact, since the wakeup latency of CEDE(1) is 30us which is greater than the default latency that we are assuming for CEDE(0). So we do not need to fix up the CEDE(0) latency on POWER8. Vaidy, I have removed your Reviewed-by for v1, since the code has changed a little bit. Gautham R. 
Shenoy (3): cpuidle-pseries: Set the latency-hint before entering CEDE cpuidle-pseries: Add function to parse extended CEDE records cpuidle-pseries : Fixup exit latency for CEDE(0) drivers/cpuidle/cpuidle-pseries.c | 190 +- 1 file changed, 188 insertions(+), 2 deletions(-) -- 1.9.4
[PATCH v3 2/3] cpuidle-pseries: Add function to parse extended CEDE records
From: "Gautham R. Shenoy" Currently we use CEDE with latency-hint 0 as the only other idle state on a dedicated LPAR apart from the polling "snooze" state. The platform might support additional extended CEDE idle states, which can be discovered through the "ibm,get-system-parameter" rtas-call made with CEDE_LATENCY_TOKEN. This patch adds a function to obtain information about the extended CEDE idle states from the platform and parse the contents to populate an array of extended CEDE states. These idle states thus discovered will be added to the cpuidle framework in the next patch. dmesg on a POWER8 and POWER9 LPAR, demonstrating the output of parsing the extended CEDE latency parameters, is as follows: POWER8 [ 10.093279] xcede : xcede_record_size = 10 [ 10.093285] xcede : Record 0 : hint = 1, latency = 0x3c00 tb ticks, Wake-on-irq = 1 [ 10.093291] xcede : Record 1 : hint = 2, latency = 0x4e2000 tb ticks, Wake-on-irq = 0 [ 10.093297] cpuidle : Skipping the 2 Extended CEDE idle states POWER9 [5.913180] xcede : xcede_record_size = 10 [5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb ticks, Wake-on-irq = 1 [5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb ticks, Wake-on-irq = 0 [5.913193] cpuidle : Skipping the 2 Extended CEDE idle states Signed-off-by: Gautham R. Shenoy --- v2-->v3 : Cleaned up parse_cede_parameters(). Silenced some sparse warnings. 
drivers/cpuidle/cpuidle-pseries.c | 142 ++ 1 file changed, 142 insertions(+) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index f5865a2..f528da7 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -21,6 +21,7 @@ #include #include #include +#include static struct cpuidle_driver pseries_idle_driver = { .name = "pseries_idle", @@ -87,6 +88,137 @@ static void check_and_cede_processor(void) } #define NR_DEDICATED_STATES 2 /* snooze, CEDE */ +/* + * XCEDE : Extended CEDE states discovered through the + * "ibm,get-systems-parameter" rtas-call with the token + * CEDE_LATENCY_TOKEN + */ +#define MAX_XCEDE_STATES 4 +#define XCEDE_LATENCY_RECORD_SIZE 10 +#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \ + (MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE)) + +/* + * Section 7.3.16 System Parameters Option of PAPR version 2.8.1 has a + * table with all the parameters to ibm,get-system-parameters. + * CEDE_LATENCY_TOKEN corresponds to the token value for Cede Latency + * Settings Information. + */ +#define CEDE_LATENCY_TOKEN 45 + +/* + * If the platform supports the cede latency settings + * information system parameter it must provide the following + * information in the NULL terminated parameter string: + * + * a. The first byte is the length “N” of each cede + *latency setting record minus one (zero indicates a length + *of 1 byte). + * + * b. For each supported cede latency setting a cede latency + *setting record consisting of the first “N” bytes as per + *the following table. + * + * - + * | Field | Field | + * | Name| Length | + * - + * | Cede Latency| 1 Byte | + * | Specifier Value || + * - + * | Maximum wakeup || + * | latency in | 8 Bytes| + * | tb-ticks|| + * - + * | Responsive to || + * | external| 1 Byte | + * | interrupts || + * - + * + * This version has cede latency record size = 10. 
+ * + * The structure xcede_latency_payload represents a) and b) with + * xcede_latency_record representing the table in b). + * + * xcede_latency_parameter is what gets returned by + * ibm,get-systems-parameter rtas-call when made with + * CEDE_LATENCY_TOKEN. + * + * These structures are only used to represent the data obtained + * by the rtas-call. The data is in Big-Endian. + */ +struct xcede_latency_record { + u8 hint; + __be64 latency_ticks; + u8 wake_on_irqs; +} __packed; + +struct xcede_latency_payload { + u8 record_size; + struct xcede_latency_record records[MAX_XCEDE_STATES]; +} __packed; + +struct xcede_latency_parameter { + __be16 payload_size; + struct xcede_latency_payload payload; + u8 null_char; +} __packed; + +static unsigned int nr_xcede_records; +static struct xcede_latency_parameter xcede_latency_parameter __initdata; + +static int __init parse_cede_parameters(void) +{ + int ret, i; + u16 payload_size; + u8 xcede_record_size; + u32 total_xcede_records_size; + struct xcede_latency_payload *pa
[PATCH v3 1/3] cpuidle-pseries: Set the latency-hint before entering CEDE
From: "Gautham R. Shenoy" As per the PAPR, each H_CEDE call is associated with a latency-hint to be passed in the VPA field "cede_latency_hint". The CEDE state that we were implicitly entering so far is CEDE with latency-hint = 0. This patch explicitly sets the latency hint corresponding to the CEDE state that we are currently entering. While at it, we save the previous hint, to be restored once we wake up from CEDE. This will be required in the future when we expose extended-cede states through the cpuidle framework, where each of them will have a different cede-latency hint. Signed-off-by: Gautham R. Shenoy --- v2-->v3 : Got rid of the unused NR_CEDE_STATES definition drivers/cpuidle/cpuidle-pseries.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 3e058ad2..f5865a2 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -86,19 +86,26 @@ static void check_and_cede_processor(void) } } +#define NR_DEDICATED_STATES 2 /* snooze, CEDE */ + +u8 cede_latency_hint[NR_DEDICATED_STATES]; static int dedicated_cede_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { + u8 old_latency_hint; pseries_idle_prolog(); get_lppaca()->donate_dedicated_cpu = 1; + old_latency_hint = get_lppaca()->cede_latency_hint; + get_lppaca()->cede_latency_hint = cede_latency_hint[index]; HMT_medium(); check_and_cede_processor(); local_irq_disable(); get_lppaca()->donate_dedicated_cpu = 0; + get_lppaca()->cede_latency_hint = old_latency_hint; pseries_idle_epilog(); @@ -130,7 +137,7 @@ static int shared_cede_loop(struct cpuidle_device *dev, /* * States for dedicated partition case. 
*/ -static struct cpuidle_state dedicated_states[] = { +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = { { /* Snooze */ .name = "snooze", .desc = "snooze", @@ -233,7 +240,7 @@ static int pseries_idle_probe(void) max_idle_state = ARRAY_SIZE(shared_states); } else { cpuidle_state_table = dedicated_states; - max_idle_state = ARRAY_SIZE(dedicated_states); + max_idle_state = NR_DEDICATED_STATES; } } else return -ENODEV; -- 1.9.4
[PATCH v2 1/3] cpuidle-pseries: Set the latency-hint before entering CEDE
From: "Gautham R. Shenoy" As per the PAPR, each H_CEDE call is associated with a latency-hint to be passed in the VPA field "cede_latency_hint". The CEDE state that we were implicitly entering so far is CEDE with latency-hint = 0. This patch explicitly sets the latency hint corresponding to the CEDE state that we are currently entering. While at it, we save the previous hint, to be restored once we wake up from CEDE. This will be required in the future when we expose extended-cede states through the cpuidle framework, where each of them will have a different cede-latency hint. Reviewed-by: Vaidyanathan Srinivasan Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 3e058ad2..88e71c3 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -86,19 +86,27 @@ static void check_and_cede_processor(void) } } +#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */ +#define NR_DEDICATED_STATES (NR_CEDE_STATES + 1) /* Includes snooze */ + +u8 cede_latency_hint[NR_DEDICATED_STATES]; static int dedicated_cede_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { + u8 old_latency_hint; pseries_idle_prolog(); get_lppaca()->donate_dedicated_cpu = 1; + old_latency_hint = get_lppaca()->cede_latency_hint; + get_lppaca()->cede_latency_hint = cede_latency_hint[index]; HMT_medium(); check_and_cede_processor(); local_irq_disable(); get_lppaca()->donate_dedicated_cpu = 0; + get_lppaca()->cede_latency_hint = old_latency_hint; pseries_idle_epilog(); @@ -130,7 +138,7 @@ static int shared_cede_loop(struct cpuidle_device *dev, /* * States for dedicated partition case. */ -static struct cpuidle_state dedicated_states[] = { +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = { { /* Snooze */ .name = "snooze", .desc = "snooze", -- 1.9.4
[PATCH v2 3/3] cpuidle-pseries : Fixup exit latency for CEDE(0)
From: "Gautham R. Shenoy" We are currently assuming that CEDE(0) has exit latency 10us, since there is no way for us to query from the platform. However, if the wakeup latency of an Extended CEDE state is smaller than 10us, then we can be sure that the exit latency of CEDE(0) cannot be more than that. In this patch, we fix the exit latency of CEDE(0) if we discover an Extended CEDE state with wakeup latency smaller than 10us. Benchmark results: ebizzy: 2 ebizzy threads bound to the same big-core. 25% improvement in the avg records/s with patch. x without_patch + with_patch N Min Max Median Avg Stddev x 10 2491089 5834307 5398375 4244335 1596244.9 + 10 2893813 5834474 5832448 5327281.3 1055941.4 context_switch2 : There is no major regression observed with this patch as seen from the context_switch2 benchmark. context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different small cores). We observe a minor 0.14% regression in the number of context-switches (higher is better). x without_patch + with_patch N Min Max Median Avg Stddev x 500 348872 362236 354712 354745.69 2711.827 + 500 349422 361452 353942 354215.4 2576.9258 Difference at 99.0% confidence -530.288 +/- 430.963 -0.149484% +/- 0.121485% (Student's t, pooled s = 2645.24) context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37% improvement in the number of context-switches (higher is better). x without_patch + with_patch N Min Max Median Avg Stddev x 500 287956 294940 288896 288977.23 646.59295 + 500 288300 294646 289582 290064.76 1161.9992 Difference at 99.0% confidence 1087.53 +/- 153.194 0.376337% +/- 0.0530125% (Student's t, pooled s = 940.299) schbench: No major difference could be seen until the 99.9th percentile. 
Without-patch Latency percentiles (usec) 50.0th: 29 75.0th: 39 90.0th: 49 95.0th: 59 *99.0th: 13104 99.5th: 14672 99.9th: 15824 min=0, max=17993 With-patch: Latency percentiles (usec) 50.0th: 29 75.0th: 40 90.0th: 50 95.0th: 61 *99.0th: 13648 99.5th: 14768 99.9th: 15664 min=0, max=29812 Reviewed-by: Vaidyanathan Srinivasan Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 34 -- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index b1dc24d..0b2f115 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -334,12 +334,42 @@ static int pseries_cpuidle_driver_init(void) static int add_pseries_idle_states(void) { int nr_states = 2; /* By default we have snooze, CEDE */ + int i; + u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */ if (parse_cede_parameters()) return nr_states; - pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n", - nr_xcede_records); + for (i = 0; i < nr_xcede_records; i++) { + u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks; + u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; + + if (latency_us < min_latency_us) + min_latency_us = latency_us; + } + + /* +* We are currently assuming that CEDE(0) has exit latency +* 10us, since there is no way for us to query from the +* platform. +* +* However, if the wakeup latency of an Extended CEDE state is +* smaller than 10us, then we can be sure that CEDE(0) +* requires no more than that. +* +* Perform the fix-up. +*/ + if (min_latency_us < dedicated_states[1].exit_latency) { + u64 cede0_latency = min_latency_us - 1; + + if (cede0_latency <= 0) + cede0_latency = min_latency_us; + + dedicated_states[1].exit_latency = cede0_latency; + dedicated_states[1].target_residency = 10 * (cede0_latency); + pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n", + cede0_latency); + } return nr_states; } -- 1.9.4
[PATCH v2 2/3] cpuidle-pseries: Add function to parse extended CEDE records
From: "Gautham R. Shenoy" Currently we use CEDE with latency-hint 0 as the only other idle state on a dedicated LPAR apart from the polling "snooze" state. The platform might support additional extended CEDE idle states, which can be discovered through the "ibm,get-system-parameter" rtas-call made with CEDE_LATENCY_TOKEN. This patch adds a function to obtain information about the extended CEDE idle states from the platform and parse the contents to populate an array of extended CEDE states. These idle states thus discovered will be added to the cpuidle framework in the next patch. dmesg on a POWER9 LPAR, demonstrating the output of parsing the extended CEDE latency parameters. [5.913180] xcede : xcede_record_size = 10 [5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, Wake-on-irq = 1 [5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, Wake-on-irq = 0 [5.913193] cpuidle : Skipping the 2 Extended CEDE idle states Reviewed-by: Vaidyanathan Srinivasan Signed-off-by: Gautham R. 
Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 129 +- 1 file changed, 127 insertions(+), 2 deletions(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 88e71c3..b1dc24d 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -21,6 +21,7 @@ #include #include #include +#include static struct cpuidle_driver pseries_idle_driver = { .name = "pseries_idle", @@ -86,9 +87,120 @@ static void check_and_cede_processor(void) } } -#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */ +struct xcede_latency_records { + u8 latency_hint; + u64 wakeup_latency_tb_ticks; + u8 responsive_to_irqs; +}; + +/* + * XCEDE : Extended CEDE states discovered through the + * "ibm,get-systems-parameter" rtas-call with the token + * CEDE_LATENCY_TOKEN + */ +#define MAX_XCEDE_STATES 4 +#define XCEDE_LATENCY_RECORD_SIZE 10 +#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \ + (MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE)) + +#define CEDE_LATENCY_TOKEN 45 + +#define NR_CEDE_STATES (MAX_XCEDE_STATES + 1) /* CEDE with latency-hint 0 */ #define NR_DEDICATED_STATES (NR_CEDE_STATES + 1) /* Includes snooze */ +struct xcede_latency_records xcede_records[MAX_XCEDE_STATES]; +unsigned int nr_xcede_records; +char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH]; + +static int parse_cede_parameters(void) +{ + int ret = -1, i; + u16 payload_length; + u8 xcede_record_size; + u32 total_xcede_records_size; + char *payload; + + memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH); + + ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1, + NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters), + XCEDE_LATENCY_PARAM_MAX_LENGTH); + + if (ret) { + pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n"); + return ret; + } + + payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0])); + payload = &xcede_parameters[2]; + + /* +* If the platform supports the cede latency settings +* information system parameter it must provide the following +* 
information in the NULL terminated parameter string: +* +* a. The first byte is the length “N” of each cede +*latency setting record minus one (zero indicates a length +*of 1 byte). +* +* b. For each supported cede latency setting a cede latency +*setting record consisting of the first “N” bytes as per +*the following table. +* +* - +* | Field | Field | +* | Name| Length | +* - +* | Cede Latency| 1 Byte | +* | Specifier Value || +* - +* | Maximum wakeup || +* | latency in | 8 Bytes| +* | tb-ticks|| +* - +* | Responsive to || +* | external| 1 Byte | +* | interrupts || +* - +* +* This version has cede latency record size = 10. +*/ + xcede_record_size = (u8)payload[0] + 1; + + if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) { + pr_err("xcede : Expected record-size %d. Observed size %d.\n", + XCEDE_LATENCY_RECORD_SIZE, xcede_record_size); + return -EINVAL; +
[PATCH v2 0/3] cpuidle-pseries: Parse extended CEDE information for idle.
From: "Gautham R. Shenoy" Hi, This is a v2 of the patch series to parse the extended CEDE information in the pseries-cpuidle driver. The v1 of this patchset can be found here : https://lore.kernel.org/linuxppc-dev/1594120299-31389-1-git-send-email-...@linux.vnet.ibm.com/ The change from v1 --> v2 : * Dropped Patches 4 and 5 which would expose extended idle-states, that wake up on external interrupts, to cpuidle framework. These were RFC patches in v1. Dropped them because currently the only extended CEDE state that wakes up on external interrupts is CEDE(1) which adds no significant value over CEDE(0). * Rebased the patches onto powerpc/merge. * No changes in code for Patches 1-3. Motivation: === On pseries Dedicated Linux LPARs, apart from the polling snooze idle state, we currently have the CEDE idle state which cedes the CPU to the hypervisor with latency-hint = 0. However, the PowerVM hypervisor supports additional extended CEDE states, which can be queried through the "ibm,get-systems-parameter" rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these extended CEDE states to appropriate platform idle-states in order to provide energy-savings as well as shifting power to the active units. On existing pseries LPARs today we have extended CEDE with latency-hints {1,2} supported. The patches in this patchset add code to parse the CEDE latency records provided by the hypervisor. We use this information to determine the wakeup latency of the regular CEDE (which we have been so far hardcoding to 10us while experimentally it is much lower, ~1us), by looking at the wakeup latency provided by the hypervisor for Extended CEDE states. Since the platform currently advertises Extended CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup latency of the regular CEDE is no more than this. With Patches 1-3, we see an improvement in the single-threaded performance on ebizzy. 2 ebizzy threads bound to the same big-core. 
25% improvement in the avg records/s (higher is better) with patches 1-3. x without_patches * with_patches N Min Max Median Avg Stddev x 10 2491089 5834307 5398375 4244335 1596244.9 * 10 2893813 5834474 5832448 5327281.3 1055941.4 We do not observe any major regression in either the context_switch2 benchmark or the schbench benchmark context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different small cores). We observe a minor 0.14% regression in the number of context-switches (higher is better). x without_patch * with_patch N Min Max Median Avg Stddev x 500 348872 362236 354712 354745.69 2711.827 * 500 349422 361452 353942 354215.4 2576.9258 context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37% improvement in the number of context-switches (higher is better). x without_patch * with_patch N Min Max Median Avg Stddev x 500 287956 294940 288896 288977.23 646.59295 * 500 288300 294646 289582 290064.76 1161.9992 schbench: No major difference could be seen until the 99.9th percentile. Without-patch Latency percentiles (usec) 50.0th: 29 75.0th: 39 90.0th: 49 95.0th: 59 *99.0th: 13104 99.5th: 14672 99.9th: 15824 min=0, max=17993 With-patch: Latency percentiles (usec) 50.0th: 29 75.0th: 40 90.0th: 50 95.0th: 61 *99.0th: 13648 99.5th: 14768 99.9th: 15664 min=0, max=29812 Gautham R. Shenoy (3): cpuidle-pseries: Set the latency-hint before entering CEDE cpuidle-pseries: Add function to parse extended CEDE records cpuidle-pseries : Fixup exit latency for CEDE(0) drivers/cpuidle/cpuidle-pseries.c | 167 +- 1 file changed, 165 insertions(+), 2 deletions(-) -- 1.9.4
Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.
Hello Rafael, On Mon, Jul 27, 2020 at 04:14:12PM +0200, Rafael J. Wysocki wrote: > On Tue, Jul 7, 2020 at 1:32 PM Gautham R Shenoy > wrote: > > > > Hi, > > > > On Tue, Jul 07, 2020 at 04:41:34PM +0530, Gautham R. Shenoy wrote: > > > From: "Gautham R. Shenoy" > > > > > > Hi, > > > > > > > > > > > > > > > Gautham R. Shenoy (5): > > > cpuidle-pseries: Set the latency-hint before entering CEDE > > > cpuidle-pseries: Add function to parse extended CEDE records > > > cpuidle-pseries : Fixup exit latency for CEDE(0) > > > cpuidle-pseries : Include extended CEDE states in cpuidle framework > > > cpuidle-pseries: Block Extended CEDE(1) which adds no additional > > > value. > > > > Forgot to mention that these patches are on top of Nathan's series to > > remove extended CEDE offline and bogus topology update code : > > https://lore.kernel.org/linuxppc-dev/20200612051238.1007764-1-nath...@linux.ibm.com/ > > OK, so this is targeted at the powerpc maintainers, isn't it? Yes, the code is powerpc specific. Also, I noticed that Nathan's patches have been merged by Michael Ellerman in the powerpc/merge tree. I will rebase and post a v2 of this patch series. -- Thanks and Regards gautham.
Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Hi Srikar, On Mon, Jul 27, 2020 at 11:02:29AM +0530, Srikar Dronamraju wrote: > Add percpu coregroup maps and masks to create coregroup domain. > If a coregroup doesn't exist, the coregroup domain will be degenerated > in favour of SMT/CACHE domain. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju This version looks good to me. Reviewed-by: Gautham R. Shenoy > --- > Changelog v3 ->v4: > if coregroup_support doesn't exist, update MC mask to the next > smaller domain mask. > > Changelog v2 -> v3: > Add optimization for mask updation under coregroup_support > > Changelog v1 -> v2: > Moved coregroup topology fixup to fixup_topology (Gautham) > > arch/powerpc/include/asm/topology.h | 10 +++ > arch/powerpc/kernel/smp.c | 44 + > arch/powerpc/mm/numa.c | 5 > 3 files changed, 59 insertions(+) > > diff --git a/arch/powerpc/include/asm/topology.h > b/arch/powerpc/include/asm/topology.h > index f0b6300e7dd3..6609174918ab 100644 > --- a/arch/powerpc/include/asm/topology.h > +++ b/arch/powerpc/include/asm/topology.h > @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 > *cpu2_assoc) > > #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) > extern int find_and_online_cpu_nid(int cpu); > +extern int cpu_to_coregroup_id(int cpu); > #else > static inline int find_and_online_cpu_nid(int cpu) > { > return 0; > } > > +static inline int cpu_to_coregroup_id(int cpu) > +{ > +#ifdef CONFIG_SMP > + return cpu_to_core_id(cpu); > +#else > + return 0; > +#endif > +} > + > #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */ > > #include > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index dab96a1203ec..95f0bf72e283 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ 
-80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); > DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); > DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map); > DEFINE_PER_CPU(cpumask_var_t, cpu_core_map); > +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map); > > EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); > EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map); > @@ -91,6 +92,7 @@ enum { > smt_idx, > #endif > bigcore_idx, > + mc_idx, > die_idx, > }; > > @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu) > } > #endif > > +static struct cpumask *cpu_coregroup_mask(int cpu) > +{ > + return per_cpu(cpu_coregroup_map, cpu); > +} > + > +static bool has_coregroup_support(void) > +{ > + return coregroup_enabled; > +} > + > +static const struct cpumask *cpu_mc_mask(int cpu) > +{ > + return cpu_coregroup_mask(cpu); > +} > + > static const struct cpumask *cpu_bigcore_mask(int cpu) > { > return per_cpu(cpu_sibling_map, cpu); > @@ -879,6 +896,7 @@ static struct sched_domain_topology_level > powerpc_topology[] = { > #ifdef CONFIG_SCHED_SMT > { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > #endif > { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) }, > + { cpu_mc_mask, SD_INIT_NAME(MC) }, > { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > { NULL, }, > }; > @@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus) > GFP_KERNEL, cpu_to_node(cpu)); > zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu), > GFP_KERNEL, cpu_to_node(cpu)); > + if (has_coregroup_support()) > + zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, > cpu), > + GFP_KERNEL, cpu_to_node(cpu)); > + > #ifdef CONFIG_NEED_MULTIPLE_NODES > /* >* numa_node_id() works after this.
> @@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) > cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid)); > cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid)); > > + if (has_coregroup_support()) > + cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid)); > + > init_big_cores(); > if (has_big_cores) { > cpumask_set_cpu(boot_cpuid, > @@ -1233,6 +1258,8 @@ static void remove_
Re: [PATCH v3 09/10] powerpc/smp: Create coregroup domain
On Thu, Jul 23, 2020 at 02:21:15PM +0530, Srikar Dronamraju wrote: > Add percpu coregroup maps and masks to create coregroup domain. > If a coregroup doesn't exist, the coregroup domain will be degenerated > in favour of SMT/CACHE domain. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju > --- > Changelog v2 -> v3: > Add optimization for mask updation under coregroup_support > > Changelog v1 -> v2: > Moved coregroup topology fixup to fixup_topology (Gautham) > > arch/powerpc/include/asm/topology.h | 10 +++ > arch/powerpc/kernel/smp.c | 44 + > arch/powerpc/mm/numa.c | 5 > 3 files changed, 59 insertions(+) > > diff --git a/arch/powerpc/include/asm/topology.h > b/arch/powerpc/include/asm/topology.h > index f0b6300e7dd3..6609174918ab 100644 > --- a/arch/powerpc/include/asm/topology.h > +++ b/arch/powerpc/include/asm/topology.h > @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 > *cpu2_assoc) > > #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) > extern int find_and_online_cpu_nid(int cpu); > +extern int cpu_to_coregroup_id(int cpu); > #else > static inline int find_and_online_cpu_nid(int cpu) > { > return 0; > } > > +static inline int cpu_to_coregroup_id(int cpu) > +{ > +#ifdef CONFIG_SMP > + return cpu_to_core_id(cpu); > +#else > + return 0; > +#endif > +} > + > #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */ > > #include > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 7d8d44cbab11..1faedde3e406 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map); > DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map); > DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map); > 
DEFINE_PER_CPU(cpumask_var_t, cpu_core_map); > +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map); > > EXPORT_PER_CPU_SYMBOL(cpu_sibling_map); > EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map); > @@ -91,6 +92,7 @@ enum { > smt_idx, > #endif > bigcore_idx, > + mc_idx, > die_idx, > }; > > @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu) > } > #endif > > +static struct cpumask *cpu_coregroup_mask(int cpu) > +{ > + return per_cpu(cpu_coregroup_map, cpu); > +} > + > +static bool has_coregroup_support(void) > +{ > + return coregroup_enabled; > +} > + > +static const struct cpumask *cpu_mc_mask(int cpu) > +{ > + return cpu_coregroup_mask(cpu); > +} > + > static const struct cpumask *cpu_bigcore_mask(int cpu) > { > return per_cpu(cpu_sibling_map, cpu); > @@ -879,6 +896,7 @@ static struct sched_domain_topology_level > powerpc_topology[] = { > { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > #endif > { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) }, > + { cpu_mc_mask, SD_INIT_NAME(MC) }, > { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > { NULL, }, > }; [..snip..] > @@ -1384,6 +1425,9 @@ int setup_profiling_timer(unsigned int multiplier) > > static void fixup_topology(void) > { > + if (!has_coregroup_support()) > + powerpc_topology[mc_idx].mask = cpu_bigcore_mask; > + > if (shared_caches) { > pr_info("Using shared cache scheduler topology\n"); > powerpc_topology[bigcore_idx].mask = shared_cache_mask; Suppose we consider a topology which does not have coregroup_support, but has shared_caches. In that case, we would want our coregroup domain to degenerate. 
From the above code, after the fixup, our topology will look as follows:

static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
	{ cpu_bigcore_mask, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

So, in this case, the core-group domain (identified by MC) will degenerate only if cpu_bigcore_mask() and shared_cache_mask() return the same value. This may work for existing platforms, because either shared_caches don't exist, or when they do, cpu_bigcore_mask and shared_cache_mask return the same set of CPUs. But this may or may not continue to hold good in the future.
Re: [PATCH v3 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
On Thu, Jul 23, 2020 at 02:21:11PM +0530, Srikar Dronamraju wrote: > Current code assumes that cpumask of cpus sharing a l2-cache mask will > always be a superset of cpu_sibling_mask. > > Lets stop that assumption. cpu_l2_cache_mask is a superset of > cpu_sibling_mask if and only if shared_caches is set. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju Reviewed-by: Gautham R. Shenoy > --- > Changelog v1 -> v2: > Set cpumask after verifying l2-cache. (Gautham) > > arch/powerpc/kernel/smp.c | 28 +++- > 1 file changed, 15 insertions(+), 13 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index da27f6909be1..d997c7411664 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1194,6 +1194,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask > *(*mask_fn)(int)) > if (!l2_cache) > return false; > > + cpumask_set_cpu(cpu, mask_fn(cpu)); > for_each_cpu(i, cpu_online_mask) { > /* >* when updating the marks the current CPU has not been marked > @@ -1276,29 +1277,30 @@ static void add_cpu_to_masks(int cpu) >* add it to it's own thread sibling mask. >*/ > cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); > + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); > > for (i = first_thread; i < first_thread + threads_per_core; i++) > if (cpu_online(i)) > set_cpus_related(i, cpu, cpu_sibling_mask); > > add_cpu_to_smallcore_masks(cpu); > - /* > - * Copy the thread sibling mask into the cache sibling mask > - * and mark any CPUs that share an L2 with this CPU. 
> - */ > - for_each_cpu(i, cpu_sibling_mask(cpu)) > - set_cpus_related(cpu, i, cpu_l2_cache_mask); > update_mask_by_l2(cpu, cpu_l2_cache_mask); > > - /* > - * Copy the cache sibling mask into core sibling mask and mark > - * any CPUs on the same chip as this CPU. > - */ > - for_each_cpu(i, cpu_l2_cache_mask(cpu)) > - set_cpus_related(cpu, i, cpu_core_mask); > + if (pkg_id == -1) { > + struct cpumask *(*mask)(int) = cpu_sibling_mask; > + > + /* > + * Copy the sibling mask into core sibling mask and > + * mark any CPUs on the same chip as this CPU. > + */ > + if (shared_caches) > + mask = cpu_l2_cache_mask; > + > + for_each_cpu(i, mask(cpu)) > + set_cpus_related(cpu, i, cpu_core_mask); > > - if (pkg_id == -1) > return; > + } > > for_each_cpu(i, cpu_online_mask) > if (get_physical_package_id(i) == pkg_id) > -- > 2.18.2 >
Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
On Wed, Jul 22, 2020 at 12:27:47PM +0530, Srikar Dronamraju wrote: > * Gautham R Shenoy [2020-07-22 11:51:14]: > > > Hi Srikar, > > > > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > > > index 72f16dc0cb26..57468877499a 100644 > > > --- a/arch/powerpc/kernel/smp.c > > > +++ b/arch/powerpc/kernel/smp.c > > > @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct > > > cpumask *(*mask_fn)(int)) > > > if (!l2_cache) > > > return false; > > > > > > + cpumask_set_cpu(cpu, mask_fn(cpu)); > > > > > > Ok, we need to do this because "cpu" is not yet set in the > > cpu_online_mask. Prior to your patch the "cpu" was getting set in > > cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in > > the patch. > > > > Right. > > > > > > for_each_cpu(i, cpu_online_mask) { > > > /* > > >* when updating the marks the current CPU has not been marked > > > @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu) > > >* add it to it's own thread sibling mask. > > >*/ > > > cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); > > > + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); > > Note: Above, we are explicitly setting the cpu_core_mask. You are right. I missed this. > > > > > > > for (i = first_thread; i < first_thread + threads_per_core; i++) > > > if (cpu_online(i)) > > > set_cpus_related(i, cpu, cpu_sibling_mask); > > > > > > add_cpu_to_smallcore_masks(cpu); > > > - /* > > > - * Copy the thread sibling mask into the cache sibling mask > > > - * and mark any CPUs that share an L2 with this CPU. > > > - */ > > > - for_each_cpu(i, cpu_sibling_mask(cpu)) > > > - set_cpus_related(cpu, i, cpu_l2_cache_mask); > > > update_mask_by_l2(cpu, cpu_l2_cache_mask); > > > > > > - /* > > > - * Copy the cache sibling mask into core sibling mask and mark > > > - * any CPUs on the same chip as this CPU. 
> > > - */ > > > - for_each_cpu(i, cpu_l2_cache_mask(cpu)) > > > - set_cpus_related(cpu, i, cpu_core_mask); > > > + if (pkg_id == -1) { > > > > I suppose this "if" condition is an optimization, since if pkg_id != -1, > > we anyway set these CPUs in the cpu_core_mask below. > > > > However... > > This is not just an optimization. > The hunk removed would only work if cpu_l2_cache_mask is bigger than > cpu_sibling_mask. (this was the previous assumption that we want to break) > If the cpu_sibling_mask is bigger than cpu_l2_cache_mask and pkg_id is -1, > then setting only cpu_l2_cache_mask in cpu_core_mask will result in a broken > topology. > > > > > > + struct cpumask *(*mask)(int) = cpu_sibling_mask; > > > + > > > + /* > > > + * Copy the sibling mask into core sibling mask and > > > + * mark any CPUs on the same chip as this CPU. > > > + */ > > > + if (shared_caches) > > > + mask = cpu_l2_cache_mask; > > > + > > > + for_each_cpu(i, mask(cpu)) > > > + set_cpus_related(cpu, i, cpu_core_mask); > > > > > > - if (pkg_id == -1) > > > return; > > > + } > > > > ... since "cpu" is not yet set in the cpu_online_mask, do we not miss > > setting > > "cpu" in the cpu_core_mask(cpu) in the for-loop below ? > > > > > > As noted above, we are setting it before. So we don't miss the cpu, and > hence it is no different from before. Fair enough. > > > -- > > Thanks and Regards > > gautham. > > -- > Thanks and Regards > Srikar Dronamraju
Re: [PATCH v3 04/10] powerpc/smp: Move topology fixups into a new function
On Thu, Jul 23, 2020 at 02:21:10PM +0530, Srikar Dronamraju wrote: > Move topology fixup based on the platform attributes into its own > function which is called just before set_sched_topology. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju Reviewed-by: Gautham R. Shenoy > --- > Changelog v2 -> v3: > Rewrote changelog (Gautham) > Renamed to powerpc/smp: Move topology fixups into a new function > > arch/powerpc/kernel/smp.c | 17 +++-- > 1 file changed, 11 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index a685915e5941..da27f6909be1 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1368,6 +1368,16 @@ int setup_profiling_timer(unsigned int multiplier) > return 0; > } > > +static void fixup_topology(void) > +{ > +#ifdef CONFIG_SCHED_SMT > + if (has_big_cores) { > + pr_info("Big cores detected but using small core scheduling\n"); > + powerpc_topology[0].mask = smallcore_smt_mask; > + } > +#endif > +} > + > void __init smp_cpus_done(unsigned int max_cpus) > { > /* > @@ -1381,12 +1391,7 @@ void __init smp_cpus_done(unsigned int max_cpus) > > dump_numa_cpu_topology(); > > -#ifdef CONFIG_SCHED_SMT > - if (has_big_cores) { > - pr_info("Big cores detected but using small core scheduling\n"); > - powerpc_topology[0].mask = smallcore_smt_mask; > - } > -#endif > + fixup_topology(); > set_sched_topology(powerpc_topology); > } > > -- > 2.18.2 >
Re: [PATCH v3 02/10] powerpc/smp: Merge Power9 topology with Power topology
On Thu, Jul 23, 2020 at 02:21:08PM +0530, Srikar Dronamraju wrote: > A new sched_domain_topology_level was added just for Power9. However the > same can be achieved by merging powerpc_topology with power9_topology > and makes the code more simpler especially when adding a new sched > domain. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Anton Blanchard > Cc: Oliver O'Halloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Gautham R Shenoy > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju LGTM. Reviewed-by: Gautham R. Shenoy > --- > Changelog v1 -> v2: > Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu) > since cpu_smt_mask is only defined under CONFIG_SCHED_SMT > > arch/powerpc/kernel/smp.c | 33 ++--- > 1 file changed, 10 insertions(+), 23 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index edf94ca64eea..283a04e54f52 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier) > } > > #ifdef CONFIG_SCHED_SMT > -/* cpumask of CPUs with asymetric SMT dependancy */ > +/* cpumask of CPUs with asymmetric SMT dependency */ > static int powerpc_smt_flags(void) > { > int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; > @@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void) > } > #endif > > -static struct sched_domain_topology_level powerpc_topology[] = { > -#ifdef CONFIG_SCHED_SMT > - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > -#endif > - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > - { NULL, }, > -}; > - > /* > * P9 has a slightly odd architecture where pairs of cores share an L2 cache. 
> * This topology makes it *much* cheaper to migrate tasks between adjacent > cores > @@ -1351,7 +1343,13 @@ static int powerpc_shared_cache_flags(void) > */ > static const struct cpumask *shared_cache_mask(int cpu) > { > - return cpu_l2_cache_mask(cpu); > + if (shared_caches) > + return cpu_l2_cache_mask(cpu); > + > + if (has_big_cores) > + return cpu_smallcore_mask(cpu); > + > + return per_cpu(cpu_sibling_map, cpu); > } > > #ifdef CONFIG_SCHED_SMT > @@ -1361,7 +1359,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu) > } > #endif > > -static struct sched_domain_topology_level power9_topology[] = { > +static struct sched_domain_topology_level powerpc_topology[] = { > #ifdef CONFIG_SCHED_SMT > { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > #endif > @@ -1386,21 +1384,10 @@ void __init smp_cpus_done(unsigned int max_cpus) > #ifdef CONFIG_SCHED_SMT > if (has_big_cores) { > pr_info("Big cores detected but using small core scheduling\n"); > - power9_topology[0].mask = smallcore_smt_mask; > powerpc_topology[0].mask = smallcore_smt_mask; > } > #endif > - /* > - * If any CPU detects that it's sharing a cache with another CPU then > - * use the deeper topology that is aware of this sharing. > - */ > - if (shared_caches) { > - pr_info("Using shared cache scheduler topology\n"); > - set_sched_topology(power9_topology); > - } else { > - pr_info("Using standard scheduler topology\n"); > - set_sched_topology(powerpc_topology); > - } > + set_sched_topology(powerpc_topology); > } > > #ifdef CONFIG_HOTPLUG_CPU > -- > 2.18.2 >
Re: [PATCH v2 10/10] powerpc/smp: Implement cpu_to_coregroup_id
On Tue, Jul 21, 2020 at 05:08:14PM +0530, Srikar Dronamraju wrote: > Lookup the coregroup id from the associativity array. > > If unable to detect the coregroup id, fallback on the core id. > This way, ensure sched_domain degenerates and an extra sched domain is > not created. > > Ideally this function should have been implemented in > arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we > don't need to find the primary domain again. > > If the device-tree mentions more than one coregroup, then kernel > implements only the last or the smallest coregroup, which currently > corresponds to the penultimate domain in the device-tree. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Nick Piggin > Cc: Oliver OHalloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Anton Blanchard > Cc: Gautham R Shenoy > Cc: Vaidyanathan Srinivasan > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju Looks good to me. Reviewed-by : Gautham R. 
Shenoy > --- > Changelog v1 -> v2: > powerpc/smp: Implement cpu_to_coregroup_id > Move coregroup_enabled before getting associativity (Gautham) > > arch/powerpc/mm/numa.c | 20 > 1 file changed, 20 insertions(+) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index ef8aa580da21..ae57b68beaee 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu) > > int cpu_to_coregroup_id(int cpu) > { > + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0}; > + int index; > + > + if (cpu < 0 || cpu > nr_cpu_ids) > + return -1; > + > + if (!coregroup_enabled) > + goto out; > + > + if (!firmware_has_feature(FW_FEATURE_VPHN)) > + goto out; > + > + if (vphn_get_associativity(cpu, associativity)) > + goto out; > + > + index = of_read_number(associativity, 1); > + if (index > min_common_depth + 1) > + return of_read_number(&associativity[index - 1], 1); > + > +out: > return cpu_to_core_id(cpu); > } > > -- > 2.17.1 >
Re: [PATCH v2 09/10] Powerpc/smp: Create coregroup domain
Hi Srikar, On Tue, Jul 21, 2020 at 05:08:13PM +0530, Srikar Dronamraju wrote: > Add percpu coregroup maps and masks to create coregroup domain. > If a coregroup doesn't exist, the coregroup domain will be degenerated > in favour of SMT/CACHE domain. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Nick Piggin > Cc: Oliver OHalloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Anton Blanchard > Cc: Gautham R Shenoy > Cc: Vaidyanathan Srinivasan > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju A query below. > --- > Changelog v1 -> v2: > Powerpc/smp: Create coregroup domain > Moved coregroup topology fixup to fixup_topology (Gautham) > > arch/powerpc/include/asm/topology.h | 10 > arch/powerpc/kernel/smp.c | 38 + > arch/powerpc/mm/numa.c | 5 > 3 files changed, 53 insertions(+) > > diff --git a/arch/powerpc/include/asm/topology.h > b/arch/powerpc/include/asm/topology.h > index f0b6300e7dd3..6609174918ab 100644 > --- a/arch/powerpc/include/asm/topology.h > +++ b/arch/powerpc/include/asm/topology.h [..snip..] > @@ -91,6 +92,7 @@ enum { > smt_idx, > #endif > bigcore_idx, > + mc_idx, > die_idx, > }; > [..snip..] > @@ -879,6 +896,7 @@ static struct sched_domain_topology_level > powerpc_topology[] = { > { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > #endif > { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) }, > + { cpu_mc_mask, SD_INIT_NAME(MC) }, > { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > { NULL, }, > }; [..snip..] > @@ -1386,6 +1421,9 @@ int setup_profiling_timer(unsigned int multiplier) > > static void fixup_topology(void) > { > + if (!has_coregroup_support()) > + powerpc_topology[mc_idx].mask = cpu_bigcore_mask; > + Shouldn't we move this condition after doing the fixup for shared caches ? Because if we have shared_caches, but not core_group, then we want the coregroup domain to degenerate correctly. 
> if (shared_caches) { > pr_info("Using shared cache scheduler topology\n"); > powerpc_topology[bigcore_idx].mask = shared_cache_mask; -- Thanks and regards gautham.
Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
Hello Srikar, On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote: > Currently "CACHE" domain happens to be the 2nd sched domain as per > powerpc_topology. This domain will collapse if cpumask of l2-cache is > same as SMT domain. However we could generalize this domain such that it > could mean either be a "CACHE" domain or a "BIGCORE" domain. > > While setting up the "CACHE" domain, check if shared_cache is already > set. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Nick Piggin > Cc: Oliver OHalloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Anton Blanchard > Cc: Gautham R Shenoy > Cc: Vaidyanathan Srinivasan > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju > --- > Changelog v1 -> v2: > powerpc/smp: Generalize 2nd sched domain > Moved shared_cache topology fixup to fixup_topology (Gautham) > Just one comment below. > arch/powerpc/kernel/smp.c | 49 --- > 1 file changed, 35 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 57468877499a..933ebdf97432 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map); > EXPORT_PER_CPU_SYMBOL(cpu_core_map); > EXPORT_SYMBOL_GPL(has_big_cores); > > +enum { > +#ifdef CONFIG_SCHED_SMT > + smt_idx, > +#endif > + bigcore_idx, > + die_idx, > +}; > + [..snip..] > @@ -1339,14 +1345,20 @@ void start_secondary(void *unused) > /* Update topology CPU masks */ > add_cpu_to_masks(cpu); > > - if (has_big_cores) > - sibling_mask = cpu_smallcore_mask; > /* >* Check for any shared caches. Note that this must be done on a >* per-core basis because one core in the pair might be disabled. 
>*/ > - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu))) > - shared_caches = true; > + if (!shared_caches) { > + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; > + struct cpumask *mask = cpu_l2_cache_mask(cpu); > + > + if (has_big_cores) > + sibling_mask = cpu_smallcore_mask; > + > + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))) > + shared_caches = true;

At the risk of repeating my comment on the v1 version of the patch, we have shared caches only if l2_cache_mask(cpu) is a strict superset of sibling_mask(cpu). "cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not capture this. Could we please use

	if (!cpumask_equal(sibling_mask(cpu), mask) &&
	    cpumask_subset(sibling_mask(cpu), mask)) {
	}

instead?

> + } > > set_numa_node(numa_cpu_lookup_table[cpu]); > set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); > @@ -1374,10 +1386,19 @@ int setup_profiling_timer(unsigned int multiplier) > > static void fixup_topology(void) > { > + if (shared_caches) { > + pr_info("Using shared cache scheduler topology\n"); > + powerpc_topology[bigcore_idx].mask = shared_cache_mask; > +#ifdef CONFIG_SCHED_DEBUG > + powerpc_topology[bigcore_idx].name = "CACHE"; > +#endif > + powerpc_topology[bigcore_idx].sd_flags = > powerpc_shared_cache_flags; > + } > + > #ifdef CONFIG_SCHED_SMT > if (has_big_cores) { > pr_info("Big cores detected but using small core scheduling\n"); > - powerpc_topology[0].mask = smallcore_smt_mask; > + powerpc_topology[smt_idx].mask = smallcore_smt_mask; > } > #endif

Otherwise the patch looks good to me.

--
Thanks and Regards
gautham.
Re: [PATCH v2 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
Hi Srikar, On Tue, Jul 21, 2020 at 05:08:09PM +0530, Srikar Dronamraju wrote: > Current code assumes that cpumask of cpus sharing a l2-cache mask will > always be a superset of cpu_sibling_mask. > > Lets stop that assumption. cpu_l2_cache_mask is a superset of > cpu_sibling_mask if and only if shared_caches is set. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Nick Piggin > Cc: Oliver OHalloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Anton Blanchard > Cc: Gautham R Shenoy > Cc: Vaidyanathan Srinivasan > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju > --- > Changelog v1 -> v2: > powerpc/smp: Dont assume l2-cache to be superset of sibling > Set cpumask after verifying l2-cache. (Gautham) > > arch/powerpc/kernel/smp.c | 28 +++- > 1 file changed, 15 insertions(+), 13 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 72f16dc0cb26..57468877499a 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1196,6 +1196,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask > *(*mask_fn)(int)) > if (!l2_cache) > return false; > > + cpumask_set_cpu(cpu, mask_fn(cpu)); Ok, we need to do this because "cpu" is not yet set in the cpu_online_mask. Prior to your patch the "cpu" was getting set in cpu_l2_cache_map(cpu) as a side-effect of the code that is removed in the patch. > for_each_cpu(i, cpu_online_mask) { > /* >* when updating the marks the current CPU has not been marked > @@ -1278,29 +1279,30 @@ static void add_cpu_to_masks(int cpu) >* add it to it's own thread sibling mask. 
>*/ > cpumask_set_cpu(cpu, cpu_sibling_mask(cpu)); > + cpumask_set_cpu(cpu, cpu_core_mask(cpu)); > > for (i = first_thread; i < first_thread + threads_per_core; i++) > if (cpu_online(i)) > set_cpus_related(i, cpu, cpu_sibling_mask); > > add_cpu_to_smallcore_masks(cpu); > - /* > - * Copy the thread sibling mask into the cache sibling mask > - * and mark any CPUs that share an L2 with this CPU. > - */ > - for_each_cpu(i, cpu_sibling_mask(cpu)) > - set_cpus_related(cpu, i, cpu_l2_cache_mask); > update_mask_by_l2(cpu, cpu_l2_cache_mask); > > - /* > - * Copy the cache sibling mask into core sibling mask and mark > - * any CPUs on the same chip as this CPU. > - */ > - for_each_cpu(i, cpu_l2_cache_mask(cpu)) > - set_cpus_related(cpu, i, cpu_core_mask); > + if (pkg_id == -1) { I suppose this "if" condition is an optimization, since if pkg_id != -1, we anyway set these CPUs in the cpu_core_mask below. However... > + struct cpumask *(*mask)(int) = cpu_sibling_mask; > + > + /* > + * Copy the sibling mask into core sibling mask and > + * mark any CPUs on the same chip as this CPU. > + */ > + if (shared_caches) > + mask = cpu_l2_cache_mask; > + > + for_each_cpu(i, mask(cpu)) > + set_cpus_related(cpu, i, cpu_core_mask); > > - if (pkg_id == -1) > return; > + } ... since "cpu" is not yet set in the cpu_online_mask, do we not miss setting "cpu" in the cpu_core_mask(cpu) in the for-loop below ? > > for_each_cpu(i, cpu_online_mask) > if (get_physical_package_id(i) == pkg_id) Before this patch it was unconditionally getting set in cpu_core_mask(cpu) because of the fact that it was set in cpu_l2_cache_mask(cpu) and we were unconditionally setting all the CPUs in cpu_l2_cache_mask(cpu) in cpu_core_mask(cpu). What am I missing ? > -- > 2.17.1 > -- Thanks and Regards gautham.
Re: [PATCH v2 04/10] powerpc/smp: Enable small core scheduling sooner
Hello Srikar, On Tue, Jul 21, 2020 at 05:08:08PM +0530, Srikar Dronamraju wrote: > Enable small core scheduling as soon as we detect that we are in a > system that supports thread group. Doing so would avoid a redundant > check. The patch looks good to me. However, I think the commit message still reflect the v1 code where we were moving the functionality from smp_cpus_done() to init_big_cores(). In this we are moving it to a helper function to collate all fixups to topology. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Nick Piggin > Cc: Oliver OHalloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Anton Blanchard > Cc: Gautham R Shenoy > Cc: Vaidyanathan Srinivasan > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju > --- > Changelog v1 -> v2: > powerpc/smp: Enable small core scheduling sooner > Restored the previous info msg (Jordan) > Moved big core topology fixup to fixup_topology (Gautham) > > arch/powerpc/kernel/smp.c | 17 +++-- > 1 file changed, 11 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 1ce95da00cb6..72f16dc0cb26 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1370,6 +1370,16 @@ int setup_profiling_timer(unsigned int multiplier) > return 0; > } > > +static void fixup_topology(void) > +{ > +#ifdef CONFIG_SCHED_SMT > + if (has_big_cores) { > + pr_info("Big cores detected but using small core scheduling\n"); > + powerpc_topology[0].mask = smallcore_smt_mask; > + } > +#endif > +} > + > void __init smp_cpus_done(unsigned int max_cpus) > { > /* > @@ -1383,12 +1393,7 @@ void __init smp_cpus_done(unsigned int max_cpus) > > dump_numa_cpu_topology(); > > -#ifdef CONFIG_SCHED_SMT > - if (has_big_cores) { > - pr_info("Big cores detected but using small core scheduling\n"); > - powerpc_topology[0].mask = smallcore_smt_mask; > - } > -#endif > + fixup_topology(); > 
set_sched_topology(powerpc_topology); > } > > -- > 2.17.1 >
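As a rough sketch of the shape of this change (illustrative types and mask functions, not the kernel's): all boot-time adjustments to the topology table are collected in one helper that runs before the table is handed to the scheduler.

```c
/* Illustrative model of fixup_topology(): a topology table whose SMT
 * level mask can be swapped out before registration.  smt_group() and
 * smallcore_group() stand in for cpu_smt_mask()/smallcore_smt_mask(). */
typedef int (*mask_fn)(int cpu);

int smt_group(int cpu)       { return cpu / 8; }	/* SMT8 big core */
int smallcore_group(int cpu) { return cpu / 4; }	/* SMT4 half     */

struct topo_level {
	mask_fn mask;
	const char *name;
};

void fixup_topology(struct topo_level *topo, int has_big_cores)
{
	if (has_big_cores)
		topo[0].mask = smallcore_group;	/* replace the SMT level mask */
}
```

Collating fixups this way keeps smp_cpus_done() free of per-platform conditionals, which is the point of the v2 rework.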
Re: [PATCH v2 02/10] powerpc/smp: Merge Power9 topology with Power topology
On Tue, Jul 21, 2020 at 05:08:06PM +0530, Srikar Dronamraju wrote: > A new sched_domain_topology_level was added just for Power9. However the > same can be achieved by merging powerpc_topology with power9_topology > and makes the code more simpler especially when adding a new sched > domain. > > Cc: linuxppc-dev > Cc: LKML > Cc: Michael Ellerman > Cc: Ingo Molnar > Cc: Peter Zijlstra > Cc: Valentin Schneider > Cc: Nick Piggin > Cc: Oliver OHalloran > Cc: Nathan Lynch > Cc: Michael Neuling > Cc: Anton Blanchard > Cc: Gautham R Shenoy > Cc: Vaidyanathan Srinivasan > Cc: Jordan Niethe > Signed-off-by: Srikar Dronamraju > --- > Changelog v1 -> v2: > powerpc/smp: Merge Power9 topology with Power topology > Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu) > since cpu_smt_mask is only defined under CONFIG_SCHED_SMT > > arch/powerpc/kernel/smp.c | 33 ++--- > 1 file changed, 10 insertions(+), 23 deletions(-) > > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 680c0edcc59d..0e0b118d9b6e 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1315,7 +1315,7 @@ int setup_profiling_timer(unsigned int multiplier) > } > > #ifdef CONFIG_SCHED_SMT > -/* cpumask of CPUs with asymetric SMT dependancy */ > +/* cpumask of CPUs with asymmetric SMT dependency */ > static int powerpc_smt_flags(void) > { > int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES; > @@ -1328,14 +1328,6 @@ static int powerpc_smt_flags(void) > } > #endif > > -static struct sched_domain_topology_level powerpc_topology[] = { > -#ifdef CONFIG_SCHED_SMT > - { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) }, > -#endif > - { cpu_cpu_mask, SD_INIT_NAME(DIE) }, > - { NULL, }, > -}; > - > /* > * P9 has a slightly odd architecture where pairs of cores share an L2 cache. 
> * This topology makes it *much* cheaper to migrate tasks between adjacent > cores > @@ -1353,7 +1345,13 @@ static int powerpc_shared_cache_flags(void) > */ > static const struct cpumask *shared_cache_mask(int cpu) > { > - return cpu_l2_cache_mask(cpu); > + if (shared_caches) > + return cpu_l2_cache_mask(cpu); > + > + if (has_big_cores) > + return cpu_smallcore_mask(cpu); > + > + return per_cpu(cpu_sibling_map, cpu); > } It might be helpful to enumerate the consequences of this change: With this patch, on POWER7 and POWER8 SMT and CACHE domains' cpumasks will both be per_cpu(cpu_sibling_map, cpu). On POWER7 SMT level flags has the following (SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES | SD_ASYM_PACKING) On POWER8 SMT level flags has the following (SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES). On both POWER7 and POWER8, CACHE level flags only has SD_SHARE_PKG_RESOURCES Thus, on both POWER7 and POWER8, since the SMT and CACHE cpumasks are the same and since CACHE has no additional flags which SMT does not, the parent domain CACHE will be degenerated. Hence we will have SMT --> DIE --> NUMA as before without the patch. So the patch introduces no behavioural change. Only change is an additional degeneration of the CACHE domain. On POWER9 : Baremetal. SMT level cpumask = per_cpu(cpu_sibling_map, cpu) Since the caches are shared for a pair of two cores, CACHE level cpumask = cpu_l2_cache_mask(cpu) Thus, we will have SMT --> CACHE --> DIE --> NUMA as before. No behavioural change. On POWER9 : LPAR SMT level cpumask = cpu_smallcore_mask(cpu). Since the caches are shared, CACHE level cpumask = cpu_l2_cache_mask(cpu). Thus, we will have SMT --> CACHE --> DIE --> NUMA as before. Again no change in behaviour. Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham.
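The selection logic quoted above can be summarised in a small sketch (plain flags and an enum in place of kernel cpumasks; illustrative only):

```c
/* Which cpumask backs the CACHE sched-domain level, per the fallback
 * chain in shared_cache_mask() quoted above. */
enum mask_id { SIBLING_MASK, SMALLCORE_MASK, L2_MASK };

enum mask_id cache_level_mask(int shared_caches, int has_big_cores)
{
	if (shared_caches)
		return L2_MASK;		/* P9: core pairs share an L2 */
	if (has_big_cores)
		return SMALLCORE_MASK;	/* big cores without shared_caches */
	return SIBLING_MASK;		/* P7/P8: same mask as the SMT level */
}
```

On P7/P8 the CACHE mask then equals the SMT mask and carries no extra flags, which is why the CACHE domain degenerates as described in the review.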
Re: [PATCH v4 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
On Tue, Jul 21, 2020 at 09:07:07PM +0530, Pratik Rajesh Sampat wrote: > Replace the variable name from using "pnv_first_spr_loss_level" to > "deep_spr_loss_state". > > pnv_first_spr_loss_level is supposed to be the earliest state that > has OPAL_PM_LOSE_FULL_CONTEXT set, in other places the kernel uses the > "deep" states as terminology. Hence renaming the variable to be coherent > to its semantics. > > Signed-off-by: Pratik Rajesh Sampat Acked-by: Gautham R. Shenoy > --- > arch/powerpc/platforms/powernv/idle.c | 18 +- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 642abf0b8329..28462d59a8e1 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -48,7 +48,7 @@ static bool default_stop_found; > * First stop state levels when SPR and TB loss can occur. > */ > static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1; > -static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1; > +static u64 deep_spr_loss_state = MAX_STOP_STATE + 1; > > /* > * psscr value and mask of the deepest stop idle state. > @@ -657,7 +657,7 @@ static unsigned long power9_idle_stop(unsigned long > psscr, bool mmu_on) > */ > mmcr0 = mfspr(SPRN_MMCR0); > } > - if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) { > + if ((psscr & PSSCR_RL_MASK) >= deep_spr_loss_state) { > sprs.lpcr = mfspr(SPRN_LPCR); > sprs.hfscr = mfspr(SPRN_HFSCR); > sprs.fscr = mfspr(SPRN_FSCR); > @@ -741,7 +741,7 @@ static unsigned long power9_idle_stop(unsigned long > psscr, bool mmu_on) >* just always test PSSCR for SPR/TB state loss. >*/ > pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT; > - if (likely(pls < pnv_first_spr_loss_level)) { > + if (likely(pls < deep_spr_loss_state)) { > if (sprs_saved) > atomic_stop_thread_idle(); > goto out; > @@ -1088,7 +1088,7 @@ static void __init pnv_power9_idle_init(void) >* the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state. 
>*/ > pnv_first_tb_loss_level = MAX_STOP_STATE + 1; > - pnv_first_spr_loss_level = MAX_STOP_STATE + 1; > + deep_spr_loss_state = MAX_STOP_STATE + 1; > for (i = 0; i < nr_pnv_idle_states; i++) { > int err; > struct pnv_idle_states_t *state = _idle_states[i]; > @@ -1099,8 +1099,8 @@ static void __init pnv_power9_idle_init(void) > pnv_first_tb_loss_level = psscr_rl; > > if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) && > - (pnv_first_spr_loss_level > psscr_rl)) > - pnv_first_spr_loss_level = psscr_rl; > + (deep_spr_loss_state > psscr_rl)) > + deep_spr_loss_state = psscr_rl; > > /* >* The idle code does not deal with TB loss occurring > @@ -,8 +,8 @@ static void __init pnv_power9_idle_init(void) >* compatibility. >*/ > if ((state->flags & OPAL_PM_TIMEBASE_STOP) && > - (pnv_first_spr_loss_level > psscr_rl)) > - pnv_first_spr_loss_level = psscr_rl; > + (deep_spr_loss_state > psscr_rl)) > + deep_spr_loss_state = psscr_rl; > > err = validate_psscr_val_mask(>psscr_val, > >psscr_mask, > @@ -1158,7 +1158,7 @@ static void __init pnv_power9_idle_init(void) > } > > pr_info("cpuidle-powernv: First stop level that may lose SPRs = > 0x%llx\n", > - pnv_first_spr_loss_level); > + deep_spr_loss_state); > > pr_info("cpuidle-powernv: First stop level that may lose timebase = > 0x%llx\n", > pnv_first_tb_loss_level); > -- > 2.25.4 >
Re: [PATCH v3 2/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
Hi, On Wed, Jul 22, 2020 at 12:37:41AM +1000, Nicholas Piggin wrote: > Excerpts from Pratik Sampat's message of July 21, 2020 8:29 pm: > > > > > > On 20/07/20 5:27 am, Nicholas Piggin wrote: > >> Excerpts from Pratik Rajesh Sampat's message of July 18, 2020 4:53 am: > >>> Replace the variable name from using "pnv_first_spr_loss_level" to > >>> "pnv_first_fullstate_loss_level". > >>> > >>> pnv_first_spr_loss_level is supposed to be the earliest state that > >>> has OPAL_PM_LOSE_FULL_CONTEXT set; however, shallow states too lose > >>> SPR values, rendering the terminology incorrect. > >> It also doesn't lose "full" state at this loss level though. From the > >> architecture it could be called "hv state loss level", but in POWER10 > >> even that is not strictly true. > >> > > Right. Just discovered that deep stop states won't lose full state > > from P10 onwards. > > Would it be better if we renamed it as "pnv_all_spr_loss_state" instead > > so that it stays generic enough while being semantically coherent? > > It doesn't lose all SPRs. It does physically, but for Linux it appears > at least the timebase SPRs are retained, and that's mostly how it's > documented. > > Maybe there's no really good name for it, but we do call it "deep" stop > in other places, so you could call it deep_spr_loss maybe. I don't mind too > much though, whatever Gautham is happy with. Nick's suggestion is fine by me. We can call it deep_spr_loss_state. > > Thanks, > Nick -- Thanks and Regards gautham.
Re: [PATCH v2 2/2] selftest/cpuidle: Add support for cpuidle latency measurement
Hi Pratik, On Fri, Jul 17, 2020 at 02:48:01PM +0530, Pratik Rajesh Sampat wrote: > This patch adds support to trace IPI based and timer based wakeup > latency from idle states > > Latches onto the test-cpuidle_latency kernel module using the debugfs > interface to send IPIs or schedule a timer based event, which in-turn > populates the debugfs with the latency measurements. > > Currently for the IPI and timer tests; first disable all idle states > and then test for latency measurements incrementally enabling each state > > Signed-off-by: Pratik Rajesh Sampat A few comments below. > --- > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/cpuidle/Makefile | 6 + > tools/testing/selftests/cpuidle/cpuidle.sh | 257 + > tools/testing/selftests/cpuidle/settings | 1 + > 4 files changed, 265 insertions(+) > create mode 100644 tools/testing/selftests/cpuidle/Makefile > create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh > create mode 100644 tools/testing/selftests/cpuidle/settings > [..skip..] > + > +ins_mod() > +{ > + if [ ! -f "$MODULE" ]; then > + printf "$MODULE module does not exist. Exitting\n" If the module has been compiled into the kernel (due to a localyesconfig, for instance), then it is unlikely that we will find it in /lib/modules. Perhaps you want to check if the debugfs directories created by the module exist, and if so, print a message saying that the modules is already loaded or some such? > + exit $ksft_skip > + fi > + printf "Inserting $MODULE module\n\n" > + insmod $MODULE > + if [ $? != 0 ]; then > + printf "Insmod $MODULE failed\n" > + exit $ksft_skip > + fi > +} > + > +compute_average() > +{ > + arr=("$@") > + sum=0 > + size=${#arr[@]} > + for i in "${arr[@]}" > + do > + sum=$((sum + i)) > + done > + avg=$((sum/size)) It would be good to assert that "size" isn't 0 here. 
> +} > + > +# Disable all stop states > +disable_idle() > +{ > + for ((cpu=0; cpu + do > + for ((state=0; state + do > + echo 1 > > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state/disable So, on offlined CPUs, we won't see /sys/devices/system/cpu/cpu$cpu/cpuidle/state$state directory. You should probably perform this operation only on online CPUs. > + done > + done > +} > + > +# Perform operation on each CPU for the given state > +# $1 - Operation: enable (0) / disable (1) > +# $2 - State to enable > +op_state() > +{ > + for ((cpu=0; cpu + do > + echo $1 > > /sys/devices/system/cpu/cpu$cpu/cpuidle/state$2/disable Ditto > + done > +} This is a helper function. For better readability of the main code you can define the following wrappers and use them. cpuidle_enable_state() { state=$1 op_state 1 $state } cpuidle_disable_state() { state=$1 op_state 0 $state } > + [..snip..] > +run_ipi_tests() > +{ > +extract_latency > +disable_idle > +declare -a avg_arr > +echo -e "--IPI Latency Test---" >> $LOG > + > + echo -e "--Baseline IPI Latency measurement: CPU Busy--" >> $LOG > + printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" "IPI_Latency(ns)" > >> $LOG > + for ((cpu=0; cpu + do > + ipi_test_once "baseline" $cpu > + printf "%-3s %10s %12s\n" $src_cpu $cpu $ipi_latency >> > $LOG > + avg_arr+=($ipi_latency) > + done > + compute_average "${avg_arr[@]}" > + echo -e "Baseline Average IPI latency(ns): $avg" >> $LOG > + > +for ((state=0; state +do > + unset avg_arr > + echo -e "---Enabling state: $state---" >> $LOG > + op_state 0 $state > + printf "%s %10s %12s\n" "SRC_CPU" "DEST_CPU" > "IPI_Latency(ns)" >> $LOG > + for ((cpu=0; cpu + do If a CPU is offline, then we should skip it here. 
> + # Running IPI test and logging results > + sleep 1 > + ipi_test_once "test" $cpu > + printf "%-3s %10s %12s\n" $src_cpu $cpu > $ipi_latency >> $LOG > + avg_arr+=($ipi_latency) > + done > + compute_average "${avg_arr[@]}" > + echo -e "Expected IPI latency(ns): > ${latency_arr[$state]}" >> $LOG > + echo -e "Observed Average IPI latency(ns): $avg" >> $LOG > + op_state 1 $state > +done > +} > + > +# Extract the residency in microseconds and convert to nanoseconds. > +# Add 100 ns so that the timer stays for a little longer than the residency > +extract_residency() > +{ >
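The suggested guard for compute_average() amounts to refusing to average an empty sample set; in C it might look like this (a sketch of the review comment, not the selftest's shell code):

```c
/* Average @n latency samples into *avg; fail instead of dividing by
 * zero when the sample set is empty, per the review comment above. */
int compute_average(const long *samples, int n, long *avg)
{
	long sum = 0;

	if (n <= 0)
		return -1;	/* assert that "size" isn't 0 */
	for (int i = 0; i < n; i++)
		sum += samples[i];
	*avg = sum / n;
	return 0;
}
```

An empty set can happen in the script exactly when all CPUs in the loop are offline, which is the same situation the offline-CPU comments above are about.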
Re: [PATCH v2 1/2] cpuidle: Trace IPI based and timer based wakeup latency from idle states
On Fri, Jul 17, 2020 at 02:48:00PM +0530, Pratik Rajesh Sampat wrote: > Fire directed smp_call_function_single IPIs from a specified source > CPU to the specified target CPU to reduce the noise we have to wade > through in the trace log. > The module is based on the idea written by Srivatsa Bhat and maintained > by Vaidyanathan Srinivasan internally. > > Queue HR timer and measure jitter. Wakeup latency measurement for idle > states using hrtimer. Echo a value in ns to timer_test_function and > watch trace. A HRtimer will be queued and when it fires the expected > wakeup vs actual wakeup is computes and delay printed in ns. > > Implemented as a module which utilizes debugfs so that it can be > integrated with selftests. > > To include the module, check option and include as module > kernel hacking -> Cpuidle latency selftests > > [srivatsa.b...@linux.vnet.ibm.com: Initial implementation in > cpidle/sysfs] > > [sva...@linux.vnet.ibm.com: wakeup latency measurements using hrtimer > and fix some of the time calculation] > > [e...@linux.vnet.ibm.com: Fix some whitespace and tab errors and > increase the resolution of IPI wakeup] > > Signed-off-by: Pratik Rajesh Sampat The debugfs module looks good to me. Reviewed-by: Gautham R. 
Shenoy > --- > drivers/cpuidle/Makefile | 1 + > drivers/cpuidle/test-cpuidle_latency.c | 150 + > lib/Kconfig.debug | 10 ++ > 3 files changed, 161 insertions(+) > create mode 100644 drivers/cpuidle/test-cpuidle_latency.c > > diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile > index f07800cbb43f..2ae05968078c 100644 > --- a/drivers/cpuidle/Makefile > +++ b/drivers/cpuidle/Makefile > @@ -8,6 +8,7 @@ obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o > obj-$(CONFIG_DT_IDLE_STATES) += dt_idle_states.o > obj-$(CONFIG_ARCH_HAS_CPU_RELAX) += poll_state.o > obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o > +obj-$(CONFIG_IDLE_LATENCY_SELFTEST)+= test-cpuidle_latency.o > > > ## > # ARM SoC drivers > diff --git a/drivers/cpuidle/test-cpuidle_latency.c > b/drivers/cpuidle/test-cpuidle_latency.c > new file mode 100644 > index ..61574665e972 > --- /dev/null > +++ b/drivers/cpuidle/test-cpuidle_latency.c > @@ -0,0 +1,150 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Module-based API test facility for cpuidle latency using IPIs and timers > + */ > + > +#include > +#include > +#include > + > +/* IPI based wakeup latencies */ > +struct latency { > + unsigned int src_cpu; > + unsigned int dest_cpu; > + ktime_t time_start; > + ktime_t time_end; > + u64 latency_ns; > +} ipi_wakeup; > + > +static void measure_latency(void *info) > +{ > + struct latency *v; > + ktime_t time_diff; > + > + v = (struct latency *)info; > + v->time_end = ktime_get(); > + time_diff = ktime_sub(v->time_end, v->time_start); > + v->latency_ns = ktime_to_ns(time_diff); > +} > + > +void run_smp_call_function_test(unsigned int cpu) > +{ > + ipi_wakeup.src_cpu = smp_processor_id(); > + ipi_wakeup.dest_cpu = cpu; > + ipi_wakeup.time_start = ktime_get(); > + smp_call_function_single(cpu, measure_latency, _wakeup, 1); > +} > + > +/* Timer based wakeup latencies */ > +struct timer_data { > + unsigned int src_cpu; > + u64 timeout; > + ktime_t time_start; > + ktime_t time_end; > + 
struct hrtimer timer; > + u64 timeout_diff_ns; > +} timer_wakeup; > + > +static enum hrtimer_restart timer_called(struct hrtimer *hrtimer) > +{ > + struct timer_data *w; > + ktime_t time_diff; > + > + w = container_of(hrtimer, struct timer_data, timer); > + w->time_end = ktime_get(); > + > + time_diff = ktime_sub(w->time_end, w->time_start); > + time_diff = ktime_sub(time_diff, ns_to_ktime(w->timeout)); > + w->timeout_diff_ns = ktime_to_ns(time_diff); > + return HRTIMER_NORESTART; > +} > + > +static void run_timer_test(unsigned int ns) > +{ > + hrtimer_init(_wakeup.timer, CLOCK_MONOTONIC, > + HRTIMER_MODE_REL); > + timer_wakeup.timer.function = timer_called; > + timer_wakeup.time_start = ktime_get(); > + timer_wakeup.src_cpu = smp_processor_id(); > + timer_wakeup.timeout = ns; > + > + hrtimer_start(_wakeup.timer, ns_to_ktime(ns), > + HRTIMER_MODE_REL_PINNED); > +} > + > +static struct dentry *dir; > + > +static
Re: [PATCH v2 0/3] Power10 basic energy management
On Mon, Jul 13, 2020 at 03:23:21PM +1000, Nicholas Piggin wrote: > Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm: > > Changelog v1 --> v2: > > 1. Save-restore DAWR and DAWRX unconditionally as they are lost in > > shallow idle states too > > 2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to > > correct naming terminology > > > > Pratik Rajesh Sampat (3): > > powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above > > powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10 > > powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable > > > > arch/powerpc/platforms/powernv/idle.c | 34 +-- > > 1 file changed, 22 insertions(+), 12 deletions(-) > > These look okay to me, but the CPU_FTR_ARCH_300 test for > pnv_power9_idle_init() is actually wrong, it should be a PVR test > because idle is not completely architected (not even shallow stop > states, unfortunately). > > It doesn't look like we support POWER10 idle correctly yet, and on older > kernels it wouldn't work even if we fixed newer, so ideally the PVR > check would be backported as a fix in the front of the series. > > Sadly, we have no OPAL idle driver yet. Hopefully we will before the > next processor shows up :P Abhishek posted a version recently : https://patchwork.ozlabs.org/project/skiboot/patch/20200706043533.76539-1-hunt...@linux.vnet.ibm.com/ > > Thanks, > Nick -- Thanks and Regards gautham.
Re: [PATCH 1/2] powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
On Fri, Jul 03, 2020 at 06:16:39PM +0530, Pratik Rajesh Sampat wrote: > POWER9 onwards the support for the registers HID1, HID4, HID5 has been > receded. > Although mfspr on the above registers worked in Power9, In Power10 > simulator is unrecognized. Moving their assignment under the > check for machines lower than Power9 > > Signed-off-by: Pratik Rajesh Sampat Nice catch. Reviewed-by: Gautham R. Shenoy > --- > arch/powerpc/platforms/powernv/idle.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 2dd467383a88..19d94d021357 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void) >*/ > uint64_t lpcr_val = mfspr(SPRN_LPCR); > uint64_t hid0_val = mfspr(SPRN_HID0); > - uint64_t hid1_val = mfspr(SPRN_HID1); > - uint64_t hid4_val = mfspr(SPRN_HID4); > - uint64_t hid5_val = mfspr(SPRN_HID5); > uint64_t hmeer_val = mfspr(SPRN_HMEER); > uint64_t msr_val = MSR_IDLE; > uint64_t psscr_val = pnv_deepest_stop_psscr_val; > @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void) > > /* Only p8 needs to set extra HID regiters */ > if (!cpu_has_feature(CPU_FTR_ARCH_300)) { > + uint64_t hid1_val = mfspr(SPRN_HID1); > + uint64_t hid4_val = mfspr(SPRN_HID4); > + uint64_t hid5_val = mfspr(SPRN_HID5); > > rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val); > if (rc != 0) > -- > 2.25.4 > -- Thanks and Regards gautham.
Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote: > Additional registers DAWR0, DAWRX0 may be lost on Power 10 for > stop levels < 4. Adding Ravi Bangoria to the cc. > Therefore save the values of these SPRs before entering a "stop" > state and restore their values on wakeup. > > Signed-off-by: Pratik Rajesh Sampat The saving and restoration looks good to me. > --- > arch/powerpc/platforms/powernv/idle.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 19d94d021357..471d4a65b1fa 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -600,6 +600,8 @@ struct p9_sprs { > u64 iamr; > u64 amor; > u64 uamor; > + u64 dawr0; > + u64 dawrx0; > }; > > static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) > @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long > psscr, bool mmu_on) > sprs.tscr = mfspr(SPRN_TSCR); > if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR)) > sprs.ldbar = mfspr(SPRN_LDBAR); > + if (cpu_has_feature(CPU_FTR_ARCH_31)) { > + sprs.dawr0 = mfspr(SPRN_DAWR0); > + sprs.dawrx0 = mfspr(SPRN_DAWRX0); > + } > But this is within the if condition which says if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) This if condition is meant for stop4 and stop5 since these are stop levels that have OPAL_PM_LOSE_HYP_CONTEXT set. Since we can lose DAWR*, on states that lose limited hypervisor context, such as stop0-2, we need to unconditionally save them like AMR, IAMR etc. 
> sprs_saved = true; > > @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long > psscr, bool mmu_on) > mtspr(SPRN_MMCR2, sprs.mmcr2); > if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR)) > mtspr(SPRN_LDBAR, sprs.ldbar); > + if (cpu_has_feature(CPU_FTR_ARCH_31)) { > + mtspr(SPRN_DAWR0, sprs.dawr0); > + mtspr(SPRN_DAWRX0, sprs.dawrx0); > + } Likewise, we need to unconditionally restore these SPRs. > > mtspr(SPRN_SPRG3, local_paca->sprg_vdso); > > -- > 2.25.4 >
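The point of the review can be captured in a small model (illustrative values; not the kernel's power9_idle_stop()): SPRs that are lost even in shallow stop states must be saved before the deep-loss branch, alongside AMR and friends, rather than inside it.

```c
/* Model of SPR save placement around the deep-loss branch.  dawr0 is
 * lost even in shallow stop states, so it is saved unconditionally;
 * lpcr stands in for the SPRs only lost at deep loss levels. */
struct saved_sprs { int amr, dawr0, lpcr; };

void save_sprs(int psscr_rl, int deep_loss_level, struct saved_sprs *s,
	       int amr, int dawr0, int lpcr)
{
	s->amr   = amr;		/* always saved */
	s->dawr0 = dawr0;	/* the fix: save unconditionally too */

	if (psscr_rl >= deep_loss_level)
		s->lpcr = lpcr;	/* deep-only loss */
}
```

With the save inside the deep branch, a shallow request (psscr_rl below the threshold) would skip it and the register content would be lost on wakeup.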
Re: [PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.
Hi, On Tue, Jul 07, 2020 at 04:41:34PM +0530, Gautham R. Shenoy wrote: > From: "Gautham R. Shenoy" > > Hi, > > > > > Gautham R. Shenoy (5): > cpuidle-pseries: Set the latency-hint before entering CEDE > cpuidle-pseries: Add function to parse extended CEDE records > cpuidle-pseries : Fixup exit latency for CEDE(0) > cpuidle-pseries : Include extended CEDE states in cpuidle framework > cpuidle-pseries: Block Extended CEDE(1) which adds no additional > value. Forgot to mention that these patches are on top of Nathan's series to remove extended CEDE offline and bogus topology update code : https://lore.kernel.org/linuxppc-dev/20200612051238.1007764-1-nath...@linux.ibm.com/ > > drivers/cpuidle/cpuidle-pseries.c | 268 > +- > 1 file changed, 266 insertions(+), 2 deletions(-) > > -- > 1.9.4 >
[PATCH 5/5] cpuidle-pseries: Block Extended CEDE(1) which adds no additional value.
From: "Gautham R. Shenoy" The Extended CEDE state with latency-hint = 1 differs from normal CEDE (latency-hint = 0) only in that a CPU in Extended CEDE(1) does not wake up on timer events. Both CEDE and Extended CEDE(1) map to the same hardware idle state. Since we already get SMT folding from the normal CEDE, Extended CEDE(1) doesn't provide any additional value. This patch blocks Extended CEDE(1). Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 57 --- 1 file changed, 54 insertions(+), 3 deletions(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 6f893cd..be0b8b2 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -350,6 +350,43 @@ static int pseries_cpuidle_driver_init(void) return 0; } +#define XCEDE1_HINT 1 +#define ERR_NO_VALUE_ADD (-1) +#define ERR_NO_EE_WAKEUP (-2) + +/* + * Returns 0 if the Extended CEDE state with @hint is not blocked in + * the cpuidle framework. + * + * Returns ERR_NO_EE_WAKEUP if the Extended CEDE state is blocked due + * to not being responsive to external interrupts. + * + * Returns ERR_NO_VALUE_ADD if the Extended CEDE state provides no + * added value over the normal CEDE. + */ +static int cpuidle_xcede_blocked(u8 hint, u64 latency_us, u8 responsive_to_irqs) +{ + + /* +* We will only allow extended CEDE states that are responsive +* to irqs and do not require an H_PROD to be woken up. +*/ + if (!responsive_to_irqs) + return ERR_NO_EE_WAKEUP; + + /* +* We already obtain SMT folding benefits from CEDE (which is +* CEDE with hint 0). Furthermore, CEDE is also responsive to +* timer-events, while XCEDE1 requires an external +* interrupt/H_PROD to be woken up. Hence, block XCEDE1 since +* it adds no further value.
+*/ + if (hint == XCEDE1_HINT) + return ERR_NO_VALUE_ADD; + + return 0; +} + static int add_pseries_idle_states(void) { int nr_states = 2; /* By default we have snooze, CEDE */ @@ -365,15 +402,29 @@ static int add_pseries_idle_states(void) char name[CPUIDLE_NAME_LEN]; unsigned int latency_hint = xcede_records[i].latency_hint; u64 residency_us; + int rc; + + if (latency_us < min_latency_us) + min_latency_us = latency_us; + + rc = cpuidle_xcede_blocked(latency_hint, latency_us, + xcede_records[i].responsive_to_irqs); - if (!xcede_records[i].responsive_to_irqs) { + if (rc) { + switch (rc) { + case ERR_NO_VALUE_ADD: + pr_info("cpuidle : Skipping XCEDE%d. No additional value-add\n", + latency_hint); + break; + case ERR_NO_EE_WAKEUP: pr_info("cpuidle : Skipping XCEDE%d. Not responsive to IRQs\n", latency_hint); + break; + } + continue; } - if (latency_us < min_latency_us) - min_latency_us = latency_us; snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint); /* -- 1.9.4
[PATCH 4/5] cpuidle-pseries : Include extended CEDE states in cpuidle framework
From: "Gautham R. Shenoy"

This patch exposes those extended CEDE states to the cpuidle framework which are responsive to external interrupts and do not need an H_PROD. Since, as per the PAPR, all the extended CEDE states are non-responsive to timers, we indicate this to the cpuidle subsystem via the CPUIDLE_FLAG_TIMER_STOP flag for all those extended CEDE states which can wake up on external interrupts.

With the patch, we are able to see the extended CEDE state with latency hint = 1 exposed via the cpuidle framework.

$ cpupower idle-info
CPUidle driver: pseries_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: snooze CEDE XCEDE1
snooze:
Flags/Description: snooze
Latency: 0
Usage: 33429446
Duration: 27006062
CEDE:
Flags/Description: CEDE
Latency: 1
Usage: 10272
Duration: 110786770
XCEDE1:
Flags/Description: XCEDE1
Latency: 12
Usage: 26445
Duration: 1436433815

Benchmark results:

TLDR: Overall we do not see any additional benefit from having XCEDE1 over CEDE.

ebizzy: 2 threads bound to a big-core. With this patch, we see a 3.39% regression compared to with only the CEDE0 latency fixup.

x With only CEDE0 latency fixup
* With CEDE0 latency fixup + CEDE1
    N        Min        Max     Median        Avg     Stddev
x  10    2893813    5834474    5832448  5327281.3  1055941.4
*  10    2907329    5834923    5831398  5146614.6  1193874.8

context_switch2: With context_switch2 there are no observable regressions in the results.

context_switch2 CPU0 CPU1 (same big-core, different small-cores). No difference with and without the patch.

x without_patch
* with_patch
    N       Min       Max    Median        Avg     Stddev
x 500    343644    348778    345444  345584.02  1035.1658
* 500    344310    347646    345776  345877.22  802.19501

context_switch2 CPU0 CPU8 (different big-cores).
Minor 0.05% improvement with patch x without_patch * with_patch N Min MaxMedian AvgStddev x 500287562288756288162 288134.76 262.24328 * 500287874288960288306 288274.66 187.57034 schbench: No regressions observed with schbench Without Patch: Latency percentiles (usec) 50.0th: 29 75.0th: 40 90.0th: 50 95.0th: 61 *99.0th: 13648 99.5th: 14768 99.9th: 15664 min=0, max=29812 With Patch: Latency percentiles (usec) 50.0th: 30 75.0th: 40 90.0th: 51 95.0th: 59 *99.0th: 13616 99.5th: 14512 99.9th: 15696 min=0, max=15996 Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 50 +++ 1 file changed, 50 insertions(+) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 502f906..6f893cd 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -362,9 +362,59 @@ static int add_pseries_idle_states(void) for (i = 0; i < nr_xcede_records; i++) { u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks; u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC; + char name[CPUIDLE_NAME_LEN]; + unsigned int latency_hint = xcede_records[i].latency_hint; + u64 residency_us; + + if (!xcede_records[i].responsive_to_irqs) { + pr_info("cpuidle : Skipping XCEDE%d. Not responsive to IRQs\n", + latency_hint); + continue; + } if (latency_us < min_latency_us) min_latency_us = latency_us; + snprintf(name, CPUIDLE_NAME_LEN, "XCEDE%d", latency_hint); + + /* +* As per the section 14.14.1 of PAPR version 2.8.1 +* says that alling H_CEDE with the value of the cede +* latency specifier set greater than zero allows the +* processor timer facility to be disabled (so as not +* to cause gratuitous wake-ups - the use of H_PROD, +* or other external interrupt is required to wake the +* processor in this case). +* +* So, inform the cpuidle-subsystem that the timer +* will be stopped for these states. 
+* +* Also, bump up the latency by 10us, since cpuidle +* would use timer-offload framework which will need +* to send an IPI to wakeup a CPU whose timer has +* expired. +*/
[PATCH 0/5] cpuidle-pseries: Parse extended CEDE information for idle.
From: "Gautham R. Shenoy"

Hi,

On pseries dedicated Linux LPARs, apart from the polling snooze idle state, we currently have the CEDE idle state, which cedes the CPU to the hypervisor with latency-hint = 0. However, the PowerVM hypervisor supports additional extended CEDE states, which can be queried through the "ibm,get-system-parameter" rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these extended CEDE states to appropriate platform idle states in order to provide energy savings as well as to shift power to the active units. On existing pseries LPARs today, extended CEDE with latency-hints {1,2} is supported.

Patches 1-3 of this patchset add the code to parse the CEDE latency records provided by the hypervisor. We use this information to determine the wakeup latency of the regular CEDE (which we have so far been hardcoding to 10us, while experimentally it is much less, ~1us) by looking at the wakeup latency the hypervisor provides for the extended CEDE states. Since the platform currently advertises extended CEDE 1 to have a wakeup latency of 2us, we can be sure that the wakeup latency of the regular CEDE is no more than this.

Patch 4 (currently marked as RFC) exposes the extended CEDE states parsed above to the cpuidle framework, provided that they can wake up on an interrupt. On current platforms only extended CEDE 1 fits the bill, but this is going to change in future platforms where even extended CEDE 2 may be responsive to external interrupts.

Patch 5 (currently marked as RFC) filters out extended CEDE 1, since it offers no added advantage over the normal CEDE.

With patches 1-3, we see an improvement in the single-threaded performance on ebizzy: with 2 ebizzy threads bound to the same big-core, a 25% improvement in the avg records/s (higher is better).
x without_patches * with_patches
    N   Min      Max      Median   Avg        Stddev
x   10  2491089  5834307  5398375  4244335    1596244.9
*   10  2893813  5834474  5832448  5327281.3  1055941.4

We do not observe any major regression in either the context_switch2 benchmark or the schbench benchmark.

context_switch2 across CPU0 CPU1 (both belong to the same big-core, but different small cores). We observe a minor 0.14% regression in the number of context-switches (higher is better).

x without_patch * with_patch
    N    Min     Max     Median  Avg        Stddev
x   500  348872  362236  354712  354745.69  2711.827
*   500  349422  361452  353942  354215.4   2576.9258

context_switch2 across CPU0 CPU8 (different big-cores). We observe a 0.37% improvement in the number of context-switches (higher is better).

x without_patch * with_patch
    N    Min     Max     Median  Avg        Stddev
x   500  287956  294940  288896  288977.23  646.59295
*   500  288300  294646  289582  290064.76  1161.9992

schbench: No major difference could be seen until the 99.9th percentile.

Without patch:
Latency percentiles (usec)
        50.0th: 29
        75.0th: 39
        90.0th: 49
        95.0th: 59
        *99.0th: 13104
        99.5th: 14672
        99.9th: 15824
        min=0, max=17993

With patch:
Latency percentiles (usec)
        50.0th: 29
        75.0th: 40
        90.0th: 50
        95.0th: 61
        *99.0th: 13648
        99.5th: 14768
        99.9th: 15664
        min=0, max=29812

Gautham R. Shenoy (5):
  cpuidle-pseries: Set the latency-hint before entering CEDE
  cpuidle-pseries: Add function to parse extended CEDE records
  cpuidle-pseries: Fixup exit latency for CEDE(0)
  cpuidle-pseries: Include extended CEDE states in cpuidle framework
  cpuidle-pseries: Block Extended CEDE(1) which adds no additional value

 drivers/cpuidle/cpuidle-pseries.c | 268 +-
 1 file changed, 266 insertions(+), 2 deletions(-)

--
1.9.4
[PATCH 1/5] cpuidle-pseries: Set the latency-hint before entering CEDE
From: "Gautham R. Shenoy" As per the PAPR, each H_CEDE call is associated with a latency-hint to be passed in the VPA field "cede_latency_hint". The CEDE states that we were implicitly entering so far is CEDE with latency-hint = 0. This patch explicitly sets the latency hint corresponding to the CEDE state that we are currently entering. While at it, we save the previous hint, to be restored once we wakeup from CEDE. This will be required in the future when we expose extended-cede states through the cpuidle framework, where each of them will have a different cede-latency hint. Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 4a37252..39d4bb6 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -105,19 +105,27 @@ static void check_and_cede_processor(void) } } +#define NR_CEDE_STATES 1 /* CEDE with latency-hint 0 */ +#define NR_DEDICATED_STATES(NR_CEDE_STATES + 1) /* Includes snooze */ + +u8 cede_latency_hint[NR_DEDICATED_STATES]; static int dedicated_cede_loop(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) { + u8 old_latency_hint; pseries_idle_prolog(); get_lppaca()->donate_dedicated_cpu = 1; + old_latency_hint = get_lppaca()->cede_latency_hint; + get_lppaca()->cede_latency_hint = cede_latency_hint[index]; HMT_medium(); check_and_cede_processor(); local_irq_disable(); get_lppaca()->donate_dedicated_cpu = 0; + get_lppaca()->cede_latency_hint = old_latency_hint; pseries_idle_epilog(); @@ -149,7 +157,7 @@ static int shared_cede_loop(struct cpuidle_device *dev, /* * States for dedicated partition case. */ -static struct cpuidle_state dedicated_states[] = { +static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = { { /* Snooze */ .name = "snooze", .desc = "snooze", -- 1.9.4
[PATCH 2/5] cpuidle-pseries: Add function to parse extended CEDE records
From: "Gautham R. Shenoy"

Currently we use CEDE with latency-hint 0 as the only other idle state on a dedicated LPAR, apart from the polling "snooze" state.

The platform might support additional extended CEDE idle states, which can be discovered through the "ibm,get-system-parameter" rtas-call made with CEDE_LATENCY_TOKEN.

This patch adds a function to obtain information about the extended CEDE idle states from the platform and parse the contents to populate an array of extended CEDE states. The idle states thus discovered will be added to the cpuidle framework in the next patch.

dmesg on a POWER9 LPAR, demonstrating the output of parsing the extended CEDE latency parameters:

[    5.913180] xcede : xcede_record_size = 10
[    5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb-ticks, Wake-on-irq = 1
[    5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb-ticks, Wake-on-irq = 0
[    5.913193] cpuidle : Skipping the 2 Extended CEDE idle states

Signed-off-by: Gautham R. Shenoy
---
 drivers/cpuidle/cpuidle-pseries.c | 129 +-
 1 file changed, 127 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 39d4bb6..c13549b 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include

 struct cpuidle_driver pseries_idle_driver = {
 	.name = "pseries_idle",
@@ -105,9 +106,120 @@ static void check_and_cede_processor(void)
 	}
 }

-#define NR_CEDE_STATES 1	/* CEDE with latency-hint 0 */
+struct xcede_latency_records {
+	u8 latency_hint;
+	u64 wakeup_latency_tb_ticks;
+	u8 responsive_to_irqs;
+};
+
+/*
+ * XCEDE : Extended CEDE states discovered through the
+ * "ibm,get-system-parameter" rtas-call with the token
+ * CEDE_LATENCY_TOKEN
+ */
+#define MAX_XCEDE_STATES 4
+#define XCEDE_LATENCY_RECORD_SIZE 10
+#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \
+	(MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
+
+#define CEDE_LATENCY_TOKEN 45
+
+#define NR_CEDE_STATES (MAX_XCEDE_STATES + 1)	/* CEDE with latency-hint 0 */
 #define NR_DEDICATED_STATES (NR_CEDE_STATES + 1)	/* Includes snooze */

+struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
+unsigned int nr_xcede_records;
+char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
+
+static int parse_cede_parameters(void)
+{
+	int ret = -1, i;
+	u16 payload_length;
+	u8 xcede_record_size;
+	u32 total_xcede_records_size;
+	char *payload;
+
+	memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+	ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+			NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
+			XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+	if (ret) {
+		pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
+		return ret;
+	}
+
+	payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
+	payload = &xcede_parameters[2];
+
+	/*
+	 * If the platform supports the cede latency settings
+	 * information system parameter it must provide the following
+	 * information in the NULL terminated parameter string:
+	 *
+	 * a. The first byte is the length "N" of each cede
+	 *    latency setting record minus one (zero indicates a length
+	 *    of 1 byte).
+	 *
+	 * b. For each supported cede latency setting a cede latency
+	 *    setting record consisting of the first "N" bytes as per
+	 *    the following table.
+	 *
+	 *    -----------------------------
+	 *    | Field           | Field   |
+	 *    | Name            | Length  |
+	 *    -----------------------------
+	 *    | Cede Latency    | 1 Byte  |
+	 *    | Specifier Value |         |
+	 *    -----------------------------
+	 *    | Maximum wakeup  |         |
+	 *    | latency in      | 8 Bytes |
+	 *    | tb-ticks        |         |
+	 *    -----------------------------
+	 *    | Responsive to   |         |
+	 *    | external        | 1 Byte  |
+	 *    | interrupts      |         |
+	 *    -----------------------------
+	 *
+	 * This version has cede latency record size = 10.
+	 */
+	xcede_record_size = (u8)payload[0] + 1;
+
+	if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) {
+		pr_err("xcede : Expected record-size %d. Observed size %d.\n",
+		       XCEDE_LATENCY_RECORD_SIZE, xcede_record_size);
+		return -EINVAL;
+	}
+
+	pr_info("xcede : xcede_record_size =
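The record layout documented in the comment above (a 1-byte cede latency specifier, an 8-byte big-endian maximum wakeup latency in tb-ticks, and a 1-byte responsive-to-external-interrupts flag) can be exercised with a small user-space parser. This is a standalone sketch mirroring the patch's struct, not the kernel code itself:

```c
#include <stdint.h>

#define XCEDE_LATENCY_RECORD_SIZE 10

struct xcede_latency_record {
	uint8_t  latency_hint;
	uint64_t wakeup_latency_tb_ticks;
	uint8_t  responsive_to_irqs;
};

/* Parse one cede latency setting record laid out as described in PAPR */
static void parse_xcede_record(const uint8_t buf[XCEDE_LATENCY_RECORD_SIZE],
			       struct xcede_latency_record *rec)
{
	uint64_t latency = 0;
	int i;

	rec->latency_hint = buf[0];
	for (i = 0; i < 8; i++)		/* bytes 1..8: big-endian u64 */
		latency = (latency << 8) | buf[1 + i];
	rec->wakeup_latency_tb_ticks = latency;
	rec->responsive_to_irqs = buf[9];
}
```

Feeding it the bytes of Record 0 from the dmesg above (hint 1, latency 0x400, wake-on-irq 1) reproduces the logged values.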
[PATCH 3/5] cpuidle-pseries : Fixup exit latency for CEDE(0)
From: "Gautham R. Shenoy"

We are currently assuming that CEDE(0) has an exit latency of 10us, since there is no way for us to query it from the platform. However, if the wakeup latency of an Extended CEDE state is smaller than 10us, then we can be sure that the exit latency of CEDE(0) cannot be more than that.

In this patch, we fix up the exit latency of CEDE(0) if we discover an Extended CEDE state with wakeup latency smaller than 10us. The new value is 1us less than the smallest wakeup latency among the Extended CEDE states.

Benchmark results:

ebizzy: 2 ebizzy threads bound to the same big-core. 25% improvement in the avg records/s with the patch.

x without_patch * with_patch
    N   Min      Max      Median   Avg        Stddev
x   10  2491089  5834307  5398375  4244335    1596244.9
*   10  2893813  5834474  5832448  5327281.3  1055941.4

context_switch2: There is no major regression observed with this patch, as seen from the context_switch2 benchmark.

context_switch2 across CPU0 CPU1 (both belong to the same big-core, but different small cores). We observe a minor 0.14% regression in the number of context-switches (higher is better).

x without_patch * with_patch
    N    Min     Max     Median  Avg        Stddev
x   500  348872  362236  354712  354745.69  2711.827
*   500  349422  361452  353942  354215.4   2576.9258

context_switch2 across CPU0 CPU8 (different big-cores). We observe a 0.37% improvement in the number of context-switches (higher is better).

x without_patch * with_patch
    N    Min     Max     Median  Avg        Stddev
x   500  287956  294940  288896  288977.23  646.59295
*   500  288300  294646  289582  290064.76  1161.9992

schbench: No major difference could be seen until the 99.9th percentile.

Without patch:
Latency percentiles (usec)
        50.0th: 29
        75.0th: 39
        90.0th: 49
        95.0th: 59
        *99.0th: 13104
        99.5th: 14672
        99.9th: 15824
        min=0, max=17993

With patch:
Latency percentiles (usec)
        50.0th: 29
        75.0th: 40
        90.0th: 50
        95.0th: 61
        *99.0th: 13648
        99.5th: 14768
        99.9th: 15664
        min=0, max=29812

Signed-off-by: Gautham R. Shenoy
---
 drivers/cpuidle/cpuidle-pseries.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index c13549b..502f906 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -353,12 +353,42 @@ static int pseries_cpuidle_driver_init(void)
 static int add_pseries_idle_states(void)
 {
 	int nr_states = 2; /* By default we have snooze, CEDE */
+	int i;
+	u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */

 	if (parse_cede_parameters())
 		return nr_states;

-	pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
-		nr_xcede_records);
+	for (i = 0; i < nr_xcede_records; i++) {
+		u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
+		u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+
+		if (latency_us < min_latency_us)
+			min_latency_us = latency_us;
+	}
+
+	/*
+	 * We are currently assuming that CEDE(0) has exit latency
+	 * 10us, since there is no way for us to query from the
+	 * platform.
+	 *
+	 * However, if the wakeup latency of an Extended CEDE state is
+	 * smaller than 10us, then we can be sure that CEDE(0)
+	 * requires no more than that.
+	 *
+	 * Perform the fix-up.
+	 */
+	if (min_latency_us < dedicated_states[1].exit_latency) {
+		u64 cede0_latency = min_latency_us - 1;
+
+		if (cede0_latency <= 0)
+			cede0_latency = min_latency_us;
+
+		dedicated_states[1].exit_latency = cede0_latency;
+		dedicated_states[1].target_residency = 10 * (cede0_latency);
+		pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n",
+			cede0_latency);
+	}

 	return nr_states;
 }
--
1.9.4
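The fix-up rule in this patch — take the smallest Extended CEDE wakeup latency, report 1us less for CEDE(0), and never report zero — can be restated as a standalone helper. A sketch assuming the latencies have already been converted to microseconds (function and macro names are illustrative):

```c
#include <stdint.h>

#define CEDE0_DEFAULT_EXIT_LATENCY_US 10

/*
 * If any Extended CEDE state wakes up faster than the assumed 10us,
 * CEDE(0) can be no slower: report 1us below the smallest XCEDE
 * wakeup latency, but never report zero.
 */
static uint64_t cede0_exit_latency_us(const uint64_t *xcede_latency_us,
				      int nr_records)
{
	uint64_t min_latency_us = CEDE0_DEFAULT_EXIT_LATENCY_US;
	int i;

	for (i = 0; i < nr_records; i++)
		if (xcede_latency_us[i] < min_latency_us)
			min_latency_us = xcede_latency_us[i];

	if (min_latency_us < CEDE0_DEFAULT_EXIT_LATENCY_US)
		return min_latency_us > 1 ? min_latency_us - 1 : min_latency_us;

	return CEDE0_DEFAULT_EXIT_LATENCY_US;
}
```

With the latencies advertised on current platforms ({2us, 8000us}), this yields 1us for CEDE(0), matching the ~1us experimental observation quoted in the cover letter.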
Re: [PATCH v5 2/3] powerpc/numa: Prefer node id queried from vphn
Hello Srikar, On Wed, Jun 24, 2020 at 02:58:45PM +0530, Srikar Dronamraju wrote: > Node id queried from the static device tree may not > be correct. For example: it may always show 0 on a shared processor. > Hence prefer the node id queried from vphn and fallback on the device tree > based node id if vphn query fails. > > Cc: linuxppc-...@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: linux-kernel@vger.kernel.org > Cc: Michal Hocko > Cc: Mel Gorman > Cc: Vlastimil Babka > Cc: "Kirill A. Shutemov" > Cc: Christopher Lameter > Cc: Michael Ellerman > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: Gautham R Shenoy > Cc: Satheesh Rajendran > Cc: David Hildenbrand > Signed-off-by: Srikar Dronamraju This patch looks good to me. Reviewed-by: Gautham R. Shenoy -- Thanks and Regards gautham.
Re: [PATCH v5 1/3] powerpc/numa: Set numa_node for all possible cpus
On Wed, Jun 24, 2020 at 02:58:44PM +0530, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not have any
> cpus or memory attached to it. As per PAPR, node affinity of a cpu is
> only available once it is present / online. For all cpus that are
> possible but not present, cpu_to_node() would point to node 0.
>
> To ensure a cpuless, memoryless dummy node is not online, powerpc needs
> to make sure all possible but not present cpu_to_node are set to a
> proper node.
>
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko
> Cc: Mel Gorman
> Cc: Vlastimil Babka
> Cc: "Kirill A. Shutemov"
> Cc: Christopher Lameter
> Cc: Michael Ellerman
> Cc: Andrew Morton
> Cc: Linus Torvalds
> Cc: Gautham R Shenoy
> Cc: Satheesh Rajendran
> Cc: David Hildenbrand
> Signed-off-by: Srikar Dronamraju

This looks good to me.

Reviewed-by: Gautham R. Shenoy

--
Thanks and Regards
gautham.
Re: [PATCH v4 2/3] powerpc/numa: Prefer node id queried from vphn
On Tue, May 12, 2020 at 06:59:36PM +0530, Srikar Dronamraju wrote: > Node id queried from the static device tree may not > be correct. For example: it may always show 0 on a shared processor. > Hence prefer the node id queried from vphn and fallback on the device tree > based node id if vphn query fails. > > Cc: linuxppc-...@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: linux-kernel@vger.kernel.org > Cc: Michal Hocko > Cc: Mel Gorman > Cc: Vlastimil Babka > Cc: "Kirill A. Shutemov" > Cc: Christopher Lameter > Cc: Michael Ellerman > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: Gautham R Shenoy > Cc: Satheesh Rajendran > Cc: David Hildenbrand > Signed-off-by: Srikar Dronamraju Looks good to me. Reviewed-by: Gautham R. Shenoy > --- > Changelog v2:->v3: > - Resolved comments from Gautham. > Link v2: > https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u > > Changelog v1:->v2: > - Rebased to v5.7-rc3 > > arch/powerpc/mm/numa.c | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b3615b7..2815313 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -719,20 +719,20 @@ static int __init parse_numa_properties(void) >*/ > for_each_present_cpu(i) { > struct device_node *cpu; > - int nid; > - > - cpu = of_get_cpu_node(i, NULL); > - BUG_ON(!cpu); > - nid = of_node_to_nid_single(cpu); > - of_node_put(cpu); > + int nid = vphn_get_nid(i); > > /* >* Don't fall back to default_nid yet -- we will plug >* cpus into nodes once the memory scan has discovered >* the topology. >*/ > - if (nid < 0) > - continue; > - node_set_online(nid); > + if (nid == NUMA_NO_NODE) { > + cpu = of_get_cpu_node(i, NULL); > + BUG_ON(!cpu); > + nid = of_node_to_nid_single(cpu); > + of_node_put(cpu); > + } > + > + if (likely(nid > 0)) > + node_set_online(nid); > } > > get_n_mem_cells(_mem_addr_cells, _mem_size_cells); > -- > 1.8.3.1 >
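The preference order implemented by the hunk above reduces to a one-line rule, sketched below in standalone C. Here vphn_nid and dt_nid stand in for the results of vphn_get_nid() and of_node_to_nid_single(); this is an illustration, not kernel code:

```c
#define NUMA_NO_NODE (-1)

/*
 * Prefer the node id reported by VPHN (correct even on shared
 * processors, where the static device tree may always say node 0);
 * fall back to the device-tree node id only when the VPHN query
 * fails.
 */
static int resolve_nid(int vphn_nid, int dt_nid)
{
	return vphn_nid == NUMA_NO_NODE ? dt_nid : vphn_nid;
}
```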
Re: [PATCH] powerpc/powernv: Fix a warning message
Hello Christophe, On Sat, May 02, 2020 at 01:59:49PM +0200, Christophe JAILLET wrote: > Fix a cut'n'paste error in a warning message. This should be > 'cpu-idle-state-residency-ns' to match the property searched in the > previous 'of_property_read_u32_array()' > > Fixes: 9c7b185ab2fe ("powernv/cpuidle: Parse dt idle properties into global > structure") > Signed-off-by: Christophe JAILLET Thanks for catching this. Reviewed-by: Gautham R. Shenoy > --- > arch/powerpc/platforms/powernv/idle.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 78599bca66c2..2dd467383a88 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -1270,7 +1270,7 @@ static int pnv_parse_cpuidle_dt(void) > /* Read residencies */ > if (of_property_read_u32_array(np, "ibm,cpu-idle-state-residency-ns", > temp_u32, nr_idle_states)) { > - pr_warn("cpuidle-powernv: missing > ibm,cpu-idle-state-latencies-ns in DT\n"); > + pr_warn("cpuidle-powernv: missing > ibm,cpu-idle-state-residency-ns in DT\n"); > rc = -EINVAL; > goto out; > } > -- > 2.25.1 >
Re: [PATCH v5 0/5] Track and expose idle PURR and SPURR ticks
On Thu, Apr 30, 2020 at 09:46:13AM +0530, Gautham R Shenoy wrote:
> Hello Michael,
>
> > > Michael, could you please consider this for 5.8 ?
> >
> > Yes. Has it been tested on KVM at all?
>
> No. I haven't tested this on KVM. Will do that today.

The results on Shared LPAR and KVM are as follows:
---

The lparstat results on a Shared LPAR are consistent with those observed on a dedicated LPAR when at least one of the threads of the core is active. When all the threads are idle, lparstat shows an incorrect idle percentage. But this is perhaps due to the fact that the hypervisor puts a completely idle core into some power-saving state with the runlatch turned off, due to which the PURR counts on the threads of a core do not add up to the elapsed timebase ticks. The results are in section A) below.

lparstat is not supported on KVM. However, I performed some basic sanity checks on the purr, spurr, idle_purr, and idle_spurr sysfs files that show up after this patch series. When CPUs are offlined, the idle_purr and idle_spurr sysfs files no longer show up, just like the purr and spurr sysfs files. The values of the counters monotonically increase, except when the CPU is busy, in which case the idle_purr and idle_spurr counts are stagnant, as expected.

However, I don't think even the values of PURR or SPURR make much sense on a KVM guest, since the Linux hypervisor doesn't set additional registers such as RWMR, except on POWER8, where KVM sets RWMR corresponding to the number of online threads in a vCORE before dispatching the vcore. I haven't been able to test it on a POWER8 guest yet. The results on POWER9 are in section B) below.

A) Shared LPAR
==============

1.
When all the threads of the core are running a CPU-Hog # ./lparstat -E 1 5 System Configuration type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 100.00 0.00 2.90GHz[126%] 126.00 0.00 100.00 0.00 2.90GHz[126%] 126.00 0.00 100.00 0.00 2.90GHz[126%] 126.00 0.00 100.00 0.00 2.90GHz[126%] 126.00 0.00 100.01 0.00 2.90GHz[126%] 126.01 0.00 2. When 4 threads of a core are running CPU Hogs, with the remaining 4 threads idle. # ./lparstat -E 1 5 System Configuration type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 81.06 18.94 2.97GHz[129%] 104.56 24.44 81.05 18.95 2.97GHz[129%] 104.56 24.44 81.06 18.95 2.97GHz[129%] 104.56 24.44 81.06 18.95 2.97GHz[129%] 104.56 24.44 81.05 18.95 2.97GHz[129%] 104.56 24.45 3. When 2 threads of a core are running CPU Hogs, with the other 6 threads idle. # ./lparstat -E 1 5 System Configuration type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 65.21 34.79 3.13GHz[136%] 88.69 47.31 65.20 34.81 3.13GHz[136%] 88.67 47.33 64.25 35.76 3.13GHz[136%] 87.38 48.63 63.68 36.31 3.13GHz[136%] 86.60 49.39 63.55 36.45 3.13GHz[136%] 86.42 49.58 4. When a single thread of the core is running CPU Hog, remaining 7 threads are idle. # ./lparstat -E 1 5 System Configuration type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 31.80 68.20 3.20GHz[139%] 44.20 94.80 31.80 68.20 3.20GHz[139%] 44.20 94.81 31.80 68.20 3.20GHz[139%] 44.20 94.80 31.80 68.21 3.20GHz[139%] 44.20 94.81 31.79 68.21 3.20GHz[139%] 44.19 94.81 5. 
When the LPAR is idle: # ./lparstat -E 1 5 System Configuration type=Shared mode=Capped smt=8 lcpu=6 mem=10362752 kB cpus=10 ent=6.00 ---Actual--- -Normalized- %busy %idle Frequency %busy %idle -- -- - -- -- 0.04 0.14 2.41GHz[105%] 0.04 0.15 0.04 0.15 2.36GHz[102%] 0.04 0.15 0.03 0.13 2.35GHz[102%] 0.03 0.14 0.03 0.13 2.31GHz[100%] 0.03 0.13 0.03 0.13 2.32GHz[101%] 0.03 0.14 In this case, the sum of the PURR values do not add up to the elapsed TB. This is probably due to the Hypervisor putting the core into some power-saving state with the runlatch turned off. # ./purr_tb -t 8 Got threads_per_core = 8 CORE 0: CPU 0 : Delta PURR : 85744 CPU 1 : Delta PURR : 113632 CPU 2 : Delta PURR : 78224 CPU 3 : Delta PURR : 68856 CPU 4 : Delta PURR : 78064 CPU 5 : Delta PURR : 60488 CPU 6 : Delta PURR : 6 CPU 7 : Delta PURR : 59464 Total D
Re: [PATCH v5 0/5] Track and expose idle PURR and SPURR ticks
Hello Michael, On Thu, Apr 30, 2020 at 12:34:52PM +1000, Michael Ellerman wrote: > Gautham R Shenoy writes: > > On Mon, Apr 20, 2020 at 03:46:35PM -0700, Tyrel Datwyler wrote: > >> On 4/7/20 1:47 AM, Gautham R. Shenoy wrote: > >> > From: "Gautham R. Shenoy" > >> > > >> > Hi, > >> > > >> > This is the fifth version of the patches to track and expose idle PURR > >> > and SPURR ticks. These patches are required by tools such as lparstat > >> > to compute system utilization for capacity planning purposes. > ... > >> > > >> > Gautham R. Shenoy (5): > >> > powerpc: Move idle_loop_prolog()/epilog() functions to header file > >> > powerpc/idle: Store PURR snapshot in a per-cpu global variable > >> > powerpc/pseries: Account for SPURR ticks on idle CPUs > >> > powerpc/sysfs: Show idle_purr and idle_spurr for every CPU > >> > Documentation: Document sysfs interfaces purr, spurr, idle_purr, > >> > idle_spurr > >> > > >> > Documentation/ABI/testing/sysfs-devices-system-cpu | 39 + > >> > arch/powerpc/include/asm/idle.h| 93 > >> > ++ > >> > arch/powerpc/kernel/sysfs.c| 82 > >> > ++- > >> > arch/powerpc/platforms/pseries/setup.c | 8 +- > >> > drivers/cpuidle/cpuidle-pseries.c | 39 ++--- > >> > 5 files changed, 224 insertions(+), 37 deletions(-) > >> > create mode 100644 arch/powerpc/include/asm/idle.h > >> > > >> > >> Reviewed-by: Tyrel Datwyler > > > > Thanks for reviewing the patches. > > > >> > >> Any chance this is going to be merged in the near future? There is a > >> patchset to > >> update lparstat in the powerpc-utils package to calculate PURR/SPURR cpu > >> utilization that I would like to merge, but have been holding off to make > >> sure > >> we are synced with this proposed patchset. > > > > Michael, could you please consider this for 5.8 ? > > Yes. Has it been tested on KVM at all? No. I haven't tested this on KVM. Will do that today. > > cheers -- Thanks and Regards gautham.
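For context, the utilization that the idle_purr/idle_spurr counters enable is computed from deltas between two snapshots: an lparstat-style busy percentage is (ΔPURR − Δidle_purr) / Δtimebase. A standalone sketch (the struct and field names are illustrative, not the kernel's):

```c
#include <stdint.h>

struct purr_snap {
	uint64_t tb;		/* timebase ticks */
	uint64_t purr;		/* Processor Utilization Resource Register */
	uint64_t idle_purr;	/* PURR ticks accumulated while idle */
};

/* Percentage of PURR cycles spent busy between two snapshots */
static double busy_pct(const struct purr_snap *a, const struct purr_snap *b)
{
	uint64_t dtb = b->tb - a->tb;
	uint64_t dbusy = (b->purr - a->purr) - (b->idle_purr - a->idle_purr);

	return dtb ? 100.0 * (double)dbusy / (double)dtb : 0.0;
}
```

In practice, tools like lparstat read the per-CPU purr/idle_purr sysfs files exposed by this series and sum the deltas across the threads of interest before applying this ratio.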
Re: [PATCH v2 2/3] powerpc/numa: Prefer node id queried from vphn
Hello Srikar, On Tue, Apr 28, 2020 at 03:08:35PM +0530, Srikar Dronamraju wrote: > Node id queried from the static device tree may not > be correct. For example: it may always show 0 on a shared processor. > Hence prefer the node id queried from vphn and fallback on the device tree > based node id if vphn query fails. > > Cc: linuxppc-...@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: linux-kernel@vger.kernel.org > Cc: Michal Hocko > Cc: Mel Gorman > Cc: Vlastimil Babka > Cc: "Kirill A. Shutemov" > Cc: Christopher Lameter > Cc: Michael Ellerman > Cc: Andrew Morton > Cc: Linus Torvalds > Signed-off-by: Srikar Dronamraju > --- > Changelog v1:->v2: > - Rebased to v5.7-rc3 > > arch/powerpc/mm/numa.c | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b3615b7fdbdf..281531340230 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -719,20 +719,20 @@ static int __init parse_numa_properties(void) >*/ > for_each_present_cpu(i) { > struct device_node *cpu; > - int nid; > - > - cpu = of_get_cpu_node(i, NULL); > - BUG_ON(!cpu); > - nid = of_node_to_nid_single(cpu); > - of_node_put(cpu); > + int nid = vphn_get_nid(i); > > /* >* Don't fall back to default_nid yet -- we will plug >* cpus into nodes once the memory scan has discovered >* the topology. >*/ > - if (nid < 0) > - continue; > + if (nid == NUMA_NO_NODE) { > + cpu = of_get_cpu_node(i, NULL); > + if (cpu) { Why are we not retaining the BUG_ON(!cpu) assert here ? > + nid = of_node_to_nid_single(cpu); > + of_node_put(cpu); > + } > + } Is it possible at this point that both vphn_get_nid(i) and of_node_to_nid_single(cpu) returns NUMA_NO_NODE ? If so, should we still call node_set_online() below ? > node_set_online(nid); > } > > -- > 2.20.1 > -- Thanks and Regards gautham.
[PATCH v2 0/1] pseries/hotplug: Change the default behaviour of cede_offline
From: "Gautham R. Shenoy" This is the v2 of the fix to change the default behaviour of cede_offline. The previous version can be found here: https://lkml.org/lkml/2019/9/12/222 The main change from v1 is that the patch2 to create a sysfs file to report and control the value of cede_offline_enabled has been dropped. Problem Description: Currently on Pseries Linux Guests, the offlined CPU can be put to one of the following two states: - Long term processor cede (also called extended cede) - Returned to the Hypervisor via RTAS "stop-self" call. This is controlled by the kernel boot parameter "cede_offline=on/off". By default the offlined CPUs enter extended cede. The PHYP hypervisor considers CPUs in extended cede to be "active" since the CPUs are still under the control fo the Linux Guests. Hence, when we change the SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will continue to count the values for offlined CPUs in extended cede as if they are online. One of the expectations with PURR is that the for an interval of time, the sum of the PURR increments across the online CPUs of a core should equal the number of timebase ticks for that interval. This is currently not the case. 
In the following data (Generated using https://github.com/gautshen/misc/blob/master/purr_tb.py): SD-PURR = Sum of PURR increments on online CPUs of that core in 1 second SMT=off === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0]51200 69883784 core01 [ 8]51200 88782536 core02 [ 16]51200 94296824 core03 [ 24]51200 80951968 SMT=2 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1] 51200 136147792 core01 [ 8,9] 51200 128636784 core02 [ 16,17] 51200 135426488 core03 [ 24,25] 51200 153027520 SMT=4 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1,2,3] 51200 258331616 core01 [ 8,9,10,11]51200 274220072 core02 [ 16,17,18,19] 51200 260013736 core03 [ 24,25,26,27] 51200 260079672 SMT=on === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1,2,3,4,5,6,7] 51200 512941248 core01 [ 8,9,10,11,12,13,14,15]51200 512936544 core02 [ 16,17,18,19,20,21,22,23] 51200 512931544 core03 [ 24,25,26,27,28,29,30,31] 51200 512923800 This patchset addresses this issue by ensuring that by default, the offlined CPUs are returned to the Hypervisor via RTAS "stop-self" call by changing the default value of "cede_offline_enabled" to false. With the patches, we see that the observed value of the sum of the PURR increments across the the online threads of a core in 1-second matches the number of tb-ticks in 1-second. SMT=off === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0]51200512527568 core01 [ 8]51200512556128 core02 [ 16]51200512590016 core03 [ 24]51200512589440 SMT=2 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1] 51200 512635328 core01 [ 8,9] 51200 512610416 core02 [ 16,17] 51200 512639360 core03 [ 24,25] 51200 512638720 SMT=4 === CoreSD-PURR SD-PURR (expected) (observed) === core00 [ 0,1,2,3] 51200 512757328 core01 [ 8,9,10,11]51200 512727920 core02 [ 16,17,18,19] 51200 512754712 core03 [ 24,25,26,27] 51200 512739040 SMT=on == Core SD-PURR SD-PURR
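The SD-PURR invariant tabulated above — over an interval, the per-thread PURR increments of a core's online threads should sum to the elapsed timebase ticks — can be checked with a few lines of C. This mirrors what the referenced purr_tb.py script does; the tolerance parameter is an assumption, needed because the samples are not taken atomically:

```c
#include <stdint.h>

/*
 * Check that the summed per-thread PURR deltas for one core match
 * the elapsed timebase ticks to within a tolerance.
 */
static int purr_matches_tb(const uint64_t *thread_purr_delta, int nr_threads,
			   uint64_t tb_delta, uint64_t tolerance)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < nr_threads; i++)
		sum += thread_purr_delta[i];

	return (sum > tb_delta ? sum - tb_delta : tb_delta - sum) <= tolerance;
}
```

The cede_offline behaviour described above is exactly what makes this check fail: threads parked in extended cede keep accumulating PURR, so the online threads' deltas sum to far more than the timebase ticks.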
[PATCH v2 1/1] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"
From: "Gautham R. Shenoy" Currently on PSeries Linux guests, the offlined CPU can be put to one of the following two states: - Long term processor cede (also called extended cede) - Returned to the hypervisor via RTAS "stop-self" call. This is controlled by the kernel boot parameter "cede_offline=on/off". By default the offlined CPUs enter extended cede. The PHYP hypervisor considers CPUs in extended cede to be "active" since they are still under the control fo the Linux guests. Hence, when we change the SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will continue to count the values for offlined CPUs in extended cede as if they are online. This breaks the accounting in tools such as lparstat. To fix this, ensure that by default the offlined CPUs are returned to the hypervisor via RTAS "stop-self" call by changing the default value of "cede_offline_enabled" to false. Fixes: commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an appropriate offline state") Signed-off-by: Gautham R. Shenoy --- Documentation/core-api/cpu_hotplug.rst | 2 +- arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst index 4a50ab7..5319593 100644 --- a/Documentation/core-api/cpu_hotplug.rst +++ b/Documentation/core-api/cpu_hotplug.rst @@ -53,7 +53,7 @@ Command Line Switches ``cede_offline={"off","on"}`` Use this option to disable/enable putting offlined processors to an extended ``H_CEDE`` state on supported pseries platforms. If nothing is specified, - ``cede_offline`` is set to "on". + ``cede_offline`` is set to "off". This option is limited to the PowerPC architecture. 
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index bbda646..f9d0366 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) = static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE; -static bool cede_offline_enabled __read_mostly = true; +/* + * Determines whether the offlined CPUs should be put to a long term + * processor cede (called extended cede) for power-saving + * purposes. The CPUs in extended cede are still with the Linux Guest + * and are not returned to the Hypervisor. + * + * By default, the offlined CPUs are returned to the hypervisor via + * RTAS "stop-self". This behaviour can be changed by passing the + * kernel commandline parameter "cede_offline=on". + */ +static bool cede_offline_enabled __read_mostly; /* * Enable/disable cede_offline when available. -- 1.9.4
Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
On Wed, Sep 18, 2019 at 03:14:15PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy" writes:
> > From: "Gautham R. Shenoy"
> >
> > Currently on PSeries Linux guests, the offlined CPU can be put to one
> > of the following two states:
> > - Long term processor cede (also called extended cede)
> > - Returned to the Hypervisor via RTAS "stop-self" call.
> >
> > This is controlled by the kernel boot parameter "cede_offline=on/off".
> >
> > By default the offlined CPUs enter extended cede.
>
> Since commit 3aa565f53c39 ("powerpc/pseries: Add hooks to put the CPU into an
> appropriate offline state") (Nov 2009)
>
> Which you wrote :)

Mea culpa! I forgot to include the "Fixes: 3aa565f53c39" tag in Patch 1 of the series.

>
> Why was that wrong?

It was wrong by the definition of what PHYP considers a "not-active" CPU. From that hypervisor's point of view, a CPU is not-active iff it is in RTAS "stop-self". Thus if a CPU is offlined via extended cede, even though it is not using any cycles, it is still considered to be active by PHYP. This causes the PURR accounting to be broken.

> > The PHYP hypervisor considers CPUs in extended cede to be "active"
> > since the CPUs are still under the control of the Linux guests. Hence, when
> > we change the SMT modes by offlining the secondary CPUs, the PURR
> > and the RWMR SPRs will continue to count the values for offlined CPUs
> > in extended cede as if they are online.
> >
> > One of the expectations with PURR is that for an interval of time,
> > the sum of the PURR increments across the online CPUs of a core should
> > equal the number of timebase ticks for that interval.
> >
> > This is currently not the case.
>
> But why does that matter? It's just some accounting stuff, does it
> actually break something meaningful?

As Naveen mentioned, it breaks lparstat, which customers are using for capacity planning. Unfortunately we discovered this 10 years after the feature was written.
>
> Also what does this do to the latency of CPU online/offline.

It will have a slightly higher latency compared to extended cede, since it involves an additional RTAS call for both starting and stopping the CPU. Will measure the exact difference and post it in the next version.

> And what does this do on KVM?

KVM doesn't seem to depend on the state of the offline VCPU, as it has an explicit way of signalling whether a CPU is online or not, via KVM_REG_PPC_ONLINE. In commit 7aa15842c15f ("KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly") we use this KVM reg to update the count of online vCPUs in a core, and use this count to set the RWMR correctly before dispatching the core. So, this patchset doesn't affect KVM.

> > In the following data (Generated using
> > https://github.com/gautshen/misc/blob/master/purr_tb.py):
> >
> > delta tb   = tb ticks elapsed in 1 second.
> > delta purr = sum of PURR increments on online CPUs of that core in 1
> > second
> >
> > SMT=off
> > ===
> > Core           delta tb(apprx)   delta purr
> > ===
> > core00 [ 0]    51200             69883784
> > core01 [ 8]    51200             88782536
> > core02 [ 16]   51200             94296824
> > core03 [ 24]   51200             80951968
>
> Showing the expected value in another column would make this much
> clearer.

Thanks. Will update the testcase to call out the expected value.

>
> cheers

--
Thanks and Regards
gautham.
Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
Hello Nathan, Michael,

On Tue, Sep 17, 2019 at 12:36:35PM -0500, Nathan Lynch wrote:
> Gautham R Shenoy writes:
> > On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> >> "Gautham R. Shenoy" writes:
> >> > The patchset also defines a new sysfs attribute
> >> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> >> > to allow userspace programs to change the state into which the
> >> > offlined CPUs need to be put at runtime.
> >>
> >> A boolean sysfs interface will become awkward if we need to add another
> >> mode in the future.
> >>
> >> What do you think about naming the attribute something like
> >> 'offline_mode', with the possible values 'extended-cede' and
> >> 'rtas-stopped'?
> >
> > We can do that. However, IMHO in the longer term, on PSeries guests,
> > we should have only one offline state - rtas-stopped. The reason for
> > this is that on Linux, an SMT switch is brought into effect through
> > the CPU Hotplug interface, and the only state in which the SMT switch
> > will be recognized by hypervisors such as PHYP is rtas-stopped.
>
> OK. Why "longer term" though, instead of doing it now?

Because adding extended cede as a cpuidle state is non-trivial, since a CPU in that state is not responsive to external interrupts. We will need additional changes in the IPI, timer and interrupt code to ensure that these get translated to a H_PROD in order to wake up a target CPU that is in extended cede.

Timer: This is relatively easy, since the cpuidle infrastructure has the timer-offload framework (used for fastsleep in POWER8) where we can offload the timers of an idling CPU to another CPU, which can wake up the idling CPU via an IPI when the timer expires.

IPIs: We need to ensure that icp_hv_set_qirr() correctly sends H_IPI or H_PROD depending on whether or not the target CPU is in extended cede.
Interrupts: Either we migrate the interrupts away from a CPU that is entering extended cede, or we prevent a CPU that is the sole target of an interrupt from entering extended cede.

The accounting problem in tools such as lparstat with "cede_offline=on" is affecting customers who are using these tools for capacity planning. That problem needs a fix in the short term, for which Patch 1 changes the default behaviour of cede_offline from "on" to "off". Since this patch would break the existing userspace tools that use the CPU-offline infrastructure to fold CPUs for saving power, the sysfs interface allowing a runtime change of cede_offline_enabled was provided to enable these userspace tools to cope with minimal change.

> > All other states (such as extended-cede) should in the long-term be
> > exposed via the cpuidle interface.
> >
> > With this in mind, I made the sysfs interface boolean to mirror the
> > current "cede_offline" commandline parameter. Eventually when we have
> > only one offline-state, we can deprecate the commandline parameter as
> > well as the sysfs interface.
>
> I don't care for adding a sysfs interface that is intended from the
> beginning to become vestigial...

Fair point. Come to think of it, in case the cpuidle menu governor behaviour doesn't match the expectations provided by the current userspace solutions for folding idle CPUs for power-savings, it would be useful to have this option around so that existing users who prefer the userspace solution can still have that option.

> This strikes me as unnecessarily incremental if you're changing the
> default offline state. Any user space programs depending on the current
> behavior will have to change anyway (and why is it OK to break them?)

Yes, the current userspace programs will need to be modified to check for the sysfs interface and change the value to cede_offline_enabled=1.

> Why isn't the plan:
>
> 1. Add extended cede support to the pseries cpuidle driver
> 2.
> Make stop-self the only cpu offline state for pseries (no sysfs
> interface necessary)

This is the plan, except that 1. requires some additional work, and this patchset is proposed as a short-term mitigation until we get 1. right.

> ?

--
Thanks and Regards
gautham.
Re: [PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
Hello Nathan,

On Thu, Sep 12, 2019 at 10:39:45AM -0500, Nathan Lynch wrote:
> "Gautham R. Shenoy" writes:
> > The patchset also defines a new sysfs attribute
> > "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests
> > to allow userspace programs to change the state into which the
> > offlined CPUs need to be put at runtime.
>
> A boolean sysfs interface will become awkward if we need to add another
> mode in the future.
>
> What do you think about naming the attribute something like
> 'offline_mode', with the possible values 'extended-cede' and
> 'rtas-stopped'?

We can do that. However, IMHO in the longer term, on PSeries guests, we should have only one offline state - rtas-stopped. The reason for this is that on Linux, an SMT switch is brought into effect through the CPU Hotplug interface, and the only state in which the SMT switch will be recognized by hypervisors such as PHYP is rtas-stopped.

All other states (such as extended-cede) should in the long term be exposed via the cpuidle interface.

With this in mind, I made the sysfs interface boolean to mirror the current "cede_offline" commandline parameter. Eventually, when we have only one offline state, we can deprecate the commandline parameter as well as the sysfs interface.

Thoughts?

--
Thanks and Regards
gautham.
[PATCH 0/2] pseries/hotplug: Change the default behaviour of cede_offline
From: "Gautham R. Shenoy"

Currently on PSeries Linux guests, the offlined CPU can be put to one of the following two states:

 - Long term processor cede (also called extended cede)
 - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off". By default the offlined CPUs enter extended cede.

The PHYP hypervisor considers CPUs in extended cede to be "active" since the CPUs are still under the control of the Linux guests. Hence, when we change the SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will continue to count the values for offlined CPUs in extended cede as if they are online.

One of the expectations with PURR is that for an interval of time, the sum of the PURR increments across the online CPUs of a core should equal the number of timebase ticks for that interval.

This is currently not the case.

In the following data (Generated using https://github.com/gautshen/misc/blob/master/purr_tb.py):

delta tb   = tb ticks elapsed in 1 second.
delta purr = sum of PURR increments on online CPUs of that core in 1 second

SMT=off
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0]                     51200             69883784
core01 [ 8]                     51200             88782536
core02 [ 16]                    51200             94296824
core03 [ 24]                    51200             80951968

SMT=2
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0,1]                   51200             136147792
core01 [ 8,9]                   51200             128636784
core02 [ 16,17]                 51200             135426488
core03 [ 24,25]                 51200             153027520

SMT=4
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0,1,2,3]               51200             258331616
core01 [ 8,9,10,11]             51200             274220072
core02 [ 16,17,18,19]           51200             260013736
core03 [ 24,25,26,27]           51200             260079672

SMT=on
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0,1,2,3,4,5,6,7]       51200             512941248
core01 [ 8,9,10,11,12,13,14,15] 51200             512936544
core02 [ 16,17,18,19,20,21,22,23] 51200           512931544
core03 [ 24,25,26,27,28,29,30,31] 51200           512923800

This patchset addresses this issue by ensuring that by default, the offlined CPUs are returned to the Hypervisor via the RTAS "stop-self" call, by changing the default value of "cede_offline_enabled" to false.

The patchset also defines a new sysfs attribute "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests to allow userspace programs to change the state into which the offlined CPUs need to be put at runtime. This is intended for userspace programs that fold CPUs for the purpose of saving energy when the utilization is low. Setting the value of this attribute ensures that subsequent CPU offline operations will put the offlined CPUs to extended cede. However, it will cause inconsistencies in the PURR accounting. Clearing the attribute will make the offlined CPUs invoke the RTAS "stop-self" call, thereby returning the CPUs to the hypervisor.
With the patches,

SMT=off
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0]                     51200             512527568
core01 [ 8]                     51200             512556128
core02 [ 16]                    51200             512590016
core03 [ 24]                    51200             512589440

SMT=2
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0,1]                   51200             512635328
core01 [ 8,9]                   51200             512610416
core02 [ 16,17]                 51200             512639360
core03 [ 24,25]                 51200             512638720

SMT=4
===
Core                            delta tb(apprx)   delta purr
===
core00 [ 0,1,2,3]               51200             512757328
core01 [ 8,9,10,11]             51200             512727920
core02 [ 16,17,18,19]           51200             512754712
core03 [ 24,25,26,27]           51200             512739040

SMT=on
==
Core                            delta tb(apprx)   delta purr
==
core00 [ 0,1,2,3,4,5,6,7]
[PATCH 1/2] pseries/hotplug-cpu: Change default behaviour of cede_offline to "off"
From: "Gautham R. Shenoy"

Currently on PSeries Linux guests, the offlined CPU can be put to one of the following two states:

 - Long term processor cede (also called extended cede)
 - Returned to the Hypervisor via RTAS "stop-self" call.

This is controlled by the kernel boot parameter "cede_offline=on/off". By default the offlined CPUs enter extended cede.

The PHYP hypervisor considers CPUs in extended cede to be "active" since they are still under the control of the Linux guests. Hence, when we change the SMT modes by offlining the secondary CPUs, the PURR and the RWMR SPRs will continue to count the values for offlined CPUs in extended cede as if they are online. This breaks the accounting in tools such as lparstat.

To fix this, ensure that by default the offlined CPUs are returned to the Hypervisor via the RTAS "stop-self" call by changing the default value of "cede_offline_enabled" to false.

Signed-off-by: Gautham R. Shenoy
---
 Documentation/core-api/cpu_hotplug.rst       |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index 4a50ab7..5319593 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -53,7 +53,7 @@ Command Line Switches
 ``cede_offline={"off","on"}``
   Use this option to disable/enable putting offlined processors to an extended
   ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
-  ``cede_offline`` is set to "on".
+  ``cede_offline`` is set to "off".

   This option is limited to the PowerPC architecture.
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index bbda646..f9d0366 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -46,7 +46,17 @@ static DEFINE_PER_CPU(enum cpu_state_vals, preferred_offline_state) =
 static enum cpu_state_vals default_offline_state = CPU_STATE_OFFLINE;

-static bool cede_offline_enabled __read_mostly = true;
+/*
+ * Determines whether the offlined CPUs should be put to a long term
+ * processor cede (called extended cede) for power-saving
+ * purposes. The CPUs in extended cede are still with the Linux Guest
+ * and are not returned to the Hypervisor.
+ *
+ * By default, the offlined CPUs are returned to the hypervisor via
+ * RTAS "stop-self". This behaviour can be changed by passing the
+ * kernel commandline parameter "cede_offline=on".
+ */
+static bool cede_offline_enabled __read_mostly;

 /*
  * Enable/disable cede_offline when available.
--
1.9.4
[PATCH 2/2] pseries/hotplug-cpu: Add sysfs attribute for cede_offline
From: "Gautham R. Shenoy"

Define a new sysfs attribute "/sys/device/system/cpu/cede_offline_enabled" on PSeries Linux guests to allow userspace programs to change the state into which the offlined CPUs need to be put at runtime. This is intended for userspace programs that fold CPUs for the purpose of saving energy when the utilization is low. Setting the value of this attribute ensures that subsequent CPU offline operations will put the offlined CPUs to extended cede. However, it will cause inconsistencies in the PURR accounting. Clearing the attribute will make the offlined CPUs invoke the RTAS "stop-self" call, thereby returning the CPUs to the hypervisor.

Signed-off-by: Gautham R. Shenoy
---
 Documentation/ABI/testing/sysfs-devices-system-cpu | 14 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c       | 68 --
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 06d0931..b3c52cd 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -572,3 +572,17 @@ Description: Secure Virtual Machine
 		If 1, it means the system is using the Protected Execution
 		Facility in POWER9 and newer processors. i.e., it is a Secure
 		Virtual Machine.
+
+What:		/sys/devices/system/cpu/cede_offline_enabled
+Date:		August 2019
+Contact:	Linux kernel mailing list
+		Linux for PowerPC mailing list
+Description:	Offline CPU state control
+
+		If 1, it means that offline CPUs on PSeries guests
+		will be made to call an extended CEDE which provides
+		energy savings but at the expense of accuracy of PURR
+		accounting. If 0, the offline CPUs on PSeries guests
+		will be made to call RTAS "stop-self" call which will
+		return the CPUs to the Hypervisor and provide accurate
+		values of PURR. The value is 0 by default.
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index f9d0366..4a04cf7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -943,9 +943,64 @@ static int parse_cede_parameters(void)
 			CEDE_LATENCY_PARAM_MAX_LENGTH);
 }

-static int __init pseries_cpu_hotplug_init(void)
+/*
+ * Must be guarded by
+ * cpu_maps_update_begin()...cpu_maps_update_done()
+ */
+static void update_default_offline_state(void)
 {
 	int cpu;
+
+	if (cede_offline_enabled)
+		default_offline_state = CPU_STATE_INACTIVE;
+	else
+		default_offline_state = CPU_STATE_OFFLINE;
+
+	for_each_possible_cpu(cpu)
+		set_default_offline_state(cpu);
+}
+
+static ssize_t show_cede_offline_enabled(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	unsigned long ret = 0;
+
+	if (cede_offline_enabled)
+		ret = 1;
+
+	return sprintf(buf, "%lx\n", ret);
+}
+
+static ssize_t store_cede_offline_enabled(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	bool val;
+	int ret = 0;
+
+	ret = kstrtobool(buf, &val);
+	if (ret)
+		return -EINVAL;
+
+	cpu_maps_update_begin();
+	/* Check if anything needs to be done */
+	if (val == cede_offline_enabled)
+		goto done;
+	cede_offline_enabled = val;
+	update_default_offline_state();
+done:
+	cpu_maps_update_done();
+
+	return count;
+}
+
+static DEVICE_ATTR(cede_offline_enabled, 0600,
+		   show_cede_offline_enabled,
+		   store_cede_offline_enabled);
+
+static int __init pseries_cpu_hotplug_init(void)
+{
 	int qcss_tok;

 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
@@ -971,11 +1026,12 @@ static int __init pseries_cpu_hotplug_init(void)
 	if (firmware_has_feature(FW_FEATURE_LPAR)) {
 		of_reconfig_notifier_register(&pseries_smp_nb);
 		cpu_maps_update_begin();
-		if (cede_offline_enabled && parse_cede_parameters() == 0) {
-			default_offline_state = CPU_STATE_INACTIVE;
-			for_each_online_cpu(cpu)
-				set_default_offline_state(cpu);
-		}
+		if (parse_cede_parameters()
== 0)
+			device_create_file(cpu_subsys.dev_root,
+					   &dev_attr_cede_offline_enabled);
+		else /* Extended cede is not supported */
+			cede_offline_ena
[PATCH] powerpc/xive: Fix loop exit-condition in xive_find_target_in_mask()
From: "Gautham R. Shenoy"

xive_find_target_in_mask() has the following for(;;) loop, which has a bug when @first == cpumask_first(@mask) and condition 1 fails to hold for every CPU in @mask. In this case we loop forever in the for-loop.

    first = cpu;
    for (;;) {
            if (cpu_online(cpu) && xive_try_pick_target(cpu)) // condition 1
                    return cpu;
            cpu = cpumask_next(cpu, mask);
            if (cpu == first)                                 // condition 2
                    break;
            if (cpu >= nr_cpu_ids)                            // condition 3
                    cpu = cpumask_first(mask);
    }

This is because, when @first == cpumask_first(@mask), we never hit condition 2 (cpu == first): prior to this check, we would have executed "cpu = cpumask_next(cpu, mask)", which sets @cpu either to a value greater than @first or to nr_cpu_ids. When this is coupled with the fact that condition 1 is never met, we never exit this loop.

This was discovered by the hard-lockup detector while running LTP tests concurrently with SMT switch tests.

 watchdog: CPU 12 detected hard LOCKUP on other CPUs 68
 watchdog: CPU 12 TB:85587019220796, last SMP heartbeat TB:85578827223399 (15999ms ago)
 watchdog: CPU 68 Hard LOCKUP
 watchdog: CPU 68 TB:85587019361273, last heartbeat TB:85576815065016 (19930ms ago)
 CPU: 68 PID: 45050 Comm: hxediag Kdump: loaded Not tainted 4.18.0-100.el8.ppc64le #1
 NIP: c06f5578 LR: c0cba9ec CTR:
 REGS: c000201fff3c7d80 TRAP: 0100 Not tainted (4.18.0-100.el8.ppc64le)
 MSR: 92883033 CR: 24028424 XER:
 CFAR: c06f558c IRQMASK: 1
 GPR00: c00afc58 c000201c01c43400 c15ce500 c000201cae26ec18
 GPR04: 0800 0540 0800 00f8
 GPR08: 0020 00a8 8000 c0081a1beed8
 GPR12: c00b1410 c000201fff7f4c00
 GPR16: 0540 0001
 GPR20: 0048 1011 c0081a1e3780 c000201cae26ed18
 GPR24: c000201cae26ed8c 0001 c1116bc0
 GPR28: c1601ee8 c1602494 c000201cae26ec18 001f
 NIP [c06f5578] find_next_bit+0x38/0x90
 LR [c0cba9ec] cpumask_next+0x2c/0x50
 Call Trace:
 [c000201c01c43400] [c000201cae26ec18] 0xc000201cae26ec18 (unreliable)
 [c000201c01c43420] [c00afc58] xive_find_target_in_mask+0x1b8/0x240
 [c000201c01c43470] [c00b0228]
xive_pick_irq_target.isra.3+0x168/0x1f0
 [c000201c01c435c0] [c00b1470] xive_irq_startup+0x60/0x260
 [c000201c01c43640] [c01d8328] __irq_startup+0x58/0xf0
 [c000201c01c43670] [c01d844c] irq_startup+0x8c/0x1a0
 [c000201c01c436b0] [c01d57b0] __setup_irq+0x9f0/0xa90
 [c000201c01c43760] [c01d5aa0] request_threaded_irq+0x140/0x220
 [c000201c01c437d0] [c0081a17b3d4] bnx2x_nic_load+0x188c/0x3040 [bnx2x]
 [c000201c01c43950] [c0081a187c44] bnx2x_self_test+0x1fc/0x1f70 [bnx2x]
 [c000201c01c43a90] [c0adc748] dev_ethtool+0x11d8/0x2cb0
 [c000201c01c43b60] [c0b0b61c] dev_ioctl+0x5ac/0xa50
 [c000201c01c43bf0] [c0a8d4ec] sock_do_ioctl+0xbc/0x1b0
 [c000201c01c43c60] [c0a8dfb8] sock_ioctl+0x258/0x4f0
 [c000201c01c43d20] [c04c9704] do_vfs_ioctl+0xd4/0xa70
 [c000201c01c43de0] [c04ca274] sys_ioctl+0xc4/0x160
 [c000201c01c43e30] [c000b388] system_call+0x5c/0x70
 Instruction dump:
 78aad182 54a806be 3920 78a50664 794a1f24 7d294036 7d43502a 7d295039
 4182001c 4834 78a9d182 79291f24 <7d23482a> 2fa9 409e0020 38a50040

To fix this, move the check for condition 2 after the check for condition 3, so that we are able to break out of the loop soon after iterating through all the CPUs in @mask in the problem case. Use do..while() to achieve this.

Fixes: 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: # 4.12+
Reported-by: Indira P. Joga
Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/sysdev/xive/common.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 082c7e1..1cdb395 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -479,7 +479,7 @@ static int xive_find_target_in_mask(const struct cpumask *mask,
 	 * Now go through the entire mask until we find a valid
 	 * target.
 	 */
-	for (;;) {
+	do {
 		/*
 		 * We re-check online as the fallback case passes us
 		 * an untested affinity mask
@@ -487,12 +487,11 @@ static int xive_find_target_in_mask(const struct cpumask *mask,
 		if (cpu_online(cpu) && xive_try_pick_target(cpu))
 			return cpu;
 		cpu = cpumask_next(cpu, mask);
-
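The termination argument above can be checked with a small userspace model. This is a sketch, not the kernel code: the 64-bit mask, NR_CPU_IDS and the pluggable try_pick callback are stand-ins for the kernel's cpumask API and xive_try_pick_target(); only the do..while loop shape mirrors the patch.

```c
#include <stdint.h>

#define NR_CPU_IDS 64

static int mask_first(uint64_t mask)
{
	for (int i = 0; i < NR_CPU_IDS; i++)
		if (mask & (1ULL << i))
			return i;
	return NR_CPU_IDS;
}

static int mask_next(int cpu, uint64_t mask)
{
	for (int i = cpu + 1; i < NR_CPU_IDS; i++)
		if (mask & (1ULL << i))
			return i;
	return NR_CPU_IDS;
}

/*
 * do..while variant: the wrap-around (condition 3) now happens before
 * the exit check (condition 2), so a full unsuccessful cycle ends with
 * cpu == first and the loop terminates. Returns -1 when no CPU in
 * @mask can be picked -- the very case that used to spin forever.
 */
static int find_target(uint64_t mask, int cpu, int (*try_pick)(int))
{
	int first = cpu;

	do {
		if (try_pick(cpu))
			return cpu;
		cpu = mask_next(cpu, mask);
		if (cpu >= NR_CPU_IDS)
			cpu = mask_first(mask);
	} while (cpu != first);

	return -1;
}

static int never_pick(int cpu) { (void)cpu; return 0; }
static int pick_cpu5(int cpu)  { return cpu == 5; }
```

With mask 0x2E (CPUs 1, 2, 3 and 5) and a pick function that always fails, the loop now visits each CPU once, wraps back to CPU 1 and exits, instead of hanging as the for(;;) version did when the search started at cpumask_first(mask).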
Re: [PATCH v3] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()
Hi,

On Wed, May 15, 2019 at 01:15:52PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy"
>
> The calls to arch_add_memory()/arch_remove_memory() are always made
> with the read-side cpu_hotplug_lock acquired via
> memory_hotplug_begin(). On pSeries,
> arch_add_memory()/arch_remove_memory() eventually call resize_hpt()
> which in turn calls stop_machine() which acquires the read-side
> cpu_hotplug_lock again, thereby resulting in the recursive acquisition
> of this lock.

A clarification regarding why we hadn't observed this problem earlier. In the absence of CONFIG_PROVE_LOCKING, we hadn't observed a system lockup during a memory hotplug operation because cpus_read_lock() is a per-cpu rwsem read, which, in the fast path (in the absence of the writer, which in our case is a CPU-hotplug operation) simply increments the read_count on the semaphore. Thus a recursive read in the fast path doesn't cause any problems.

However, we can hit this problem in practice if there is a concurrent CPU-hotplug operation in progress which is waiting to acquire the write-side of the lock. This will cause the second recursive read to block until the writer finishes. Meanwhile, the writer remains blocked since the first read still holds the lock. Thus both the reader and the writer fail to make any progress, thereby blocking both CPU-hotplug and memory-hotplug operations.

 Memory-Hotplug                           CPU-Hotplug
 CPU 0                                    CPU 1
 --                                       --
 1. down_read(cpu_hotplug_lock.rw_sem)
    [memory_hotplug_begin]
                                          2. down_write(cpu_hotplug_lock.rw_sem)
                                             [cpu_up/cpu_down]
 3. down_read(cpu_hotplug_lock.rw_sem)
    [stop_machine()]

> Lockdep complains as follows in these code-paths.
>
> swapper/0/1 is trying to acquire lock:
> (ptrval) (cpu_hotplug_lock.rw_sem){}, at: stop_machine+0x2c/0x60
>
> but task is already holding lock:
> (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
>
> lock(cpu_hotplug_lock.rw_sem);
> lock(cpu_hotplug_lock.rw_sem);
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 3 locks held by swapper/0/1:
> #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x12c/0x1b0
> #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50
> #2: (ptrval) (mem_hotplug_lock.rw_sem){}, at: percpu_down_write+0x54/0x1a0
>
> stack backtrace:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
> Call Trace:
> [c000feb03150] [c0e32bd4] dump_stack+0xe8/0x164 (unreliable)
> [c000feb031a0] [c020d6c0] __lock_acquire+0x1110/0x1c70
> [c000feb03320] [c020f080] lock_acquire+0x240/0x290
> [c000feb033e0] [c017f554] cpus_read_lock+0x64/0xf0
> [c000feb03420] [c029ebac] stop_machine+0x2c/0x60
> [c000feb03460] [c00d7f7c] pseries_lpar_resize_hpt+0x19c/0x2c0
> [c000feb03500] [c00788d0] resize_hpt_for_hotplug+0x70/0xd0
> [c000feb03570] [c0e5d278] arch_add_memory+0x58/0xfc
> [c000feb03610] [c03553a8] devm_memremap_pages+0x5e8/0x8f0
> [c000feb036c0] [c09c2394] pmem_attach_disk+0x764/0x830
> [c000feb037d0] [c09a7c38] nvdimm_bus_probe+0x118/0x240
> [c000feb03860] [c0968500] really_probe+0x230/0x4b0
> [c000feb038f0] [c0968aec] driver_probe_device+0x16c/0x1e0
> [c000feb03970] [c0968ca8] __driver_attach+0x148/0x1b0
> [c000feb039f0] [c09650b0] bus_for_each_dev+0x90/0x130
> [c000feb03a50] [c0967dd4] driver_attach+0x34/0x50
> [c000feb03a70] [c0967068] bus_add_driver+0x1a8/0x360
> [c000feb03b00] [c096a498] driver_register+0x108/0x170
> [c000feb03b70] [c09a7400] __nd_driver_register+0xd0/0xf0
> [c000feb03bd0] [c128aa90] nd_pmem_driver_init+0x34/0x48
> [c000feb03bf0] [c0010a10]
do_one_initcall+0x1e0/0x45c
> [c000feb03cd0] [c122462c] kernel_init_freeable+0x540/0x64c
> [c000feb03db0] [c001110c] kernel_init+0x2c/0x160
> [c000feb03e20] [c000bed4] ret_from_kernel_thread+0x5c/0x68
>
> Fix this issue by
> 1) Requiring all the calls to pseries_lpar_resize_hpt() be made
>    with cpu_hotplug_lock held.
>
> 2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
>    as a consequence of 1)
>
> 3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.
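The three-step scenario described above can be captured in a tiny state-machine model. This is an illustrative simplification, not the kernel's percpu rw_semaphore: all it models is the writer preference that matters here, namely that once a down_write() is queued, a new (or recursive) down_read() on the same lock blocks behind it.

```c
#include <stdbool.h>

struct rwsem_model {
	int  readers;         /* read-side holders */
	bool writer_waiting;  /* a down_write() is queued */
};

/* the read fast path succeeds only while no writer is queued */
static bool down_read_would_block(const struct rwsem_model *s)
{
	return s->writer_waiting;
}

static bool scenario_deadlocks(void)
{
	struct rwsem_model s = { 0, false };

	s.readers++;             /* 1. memory_hotplug_begin(): read acquired */
	s.writer_waiting = true; /* 2. cpu_up()/cpu_down(): write queued behind the reader */

	/* 3. stop_machine(): recursive read on the same lock */
	bool reader_blocked = down_read_would_block(&s);
	bool writer_blocked = s.readers > 0;

	return reader_blocked && writer_blocked; /* neither side can progress */
}

/* without a queued writer, the recursive read just bumps read_count */
static bool recursive_read_ok_without_writer(void)
{
	struct rwsem_model s = { 1, false };

	return !down_read_would_block(&s);
}
```

The model shows why the bug is only hit under concurrency: the recursive read is harmless on the fast path and deadlocks exactly when a writer has wedged itself between steps 1 and 3, which is what stop_machine_cpuslocked() avoids by not re-taking the lock.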
Re: [PATCH] cpupower : frequency-set -r option misses the last cpu in related cpu list
Hi Abhishek,

On Wed, May 29, 2019 at 3:02 PM Abhishek Goel wrote:
>
> To set frequency on specific cpus using cpupower, following syntax can
> be used :
> cpupower -c #i frequency-set -f #f -r
>
> While setting frequency using cpupower frequency-set command, if we use
> '-r' option, it is expected to set frequency for all cpus related to
> cpu #i. But it is observed to be missing the last cpu in related cpu
> list. This patch fixes the problem.
>
> Signed-off-by: Abhishek Goel
> ---
>  tools/power/cpupower/utils/cpufreq-set.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/power/cpupower/utils/cpufreq-set.c
> b/tools/power/cpupower/utils/cpufreq-set.c
> index 1eef0aed6..08a405593 100644
> --- a/tools/power/cpupower/utils/cpufreq-set.c
> +++ b/tools/power/cpupower/utils/cpufreq-set.c
> @@ -306,6 +306,8 @@ int cmd_freq_set(int argc, char **argv)
> 			bitmask_setbit(cpus_chosen, cpus->cpu);
> 			cpus = cpus->next;
> 		}
> +		/* Set the last cpu in related cpus list */
> +		bitmask_setbit(cpus_chosen, cpus->cpu);

Perhaps you could convert the while() loop to a do..while(). That will ensure that we terminate the loop after setting the last valid CPU.

> 		cpufreq_put_related_cpus(cpus);
> 	}
> }
> --
> 2.17.1
>

--
Thanks and Regards
gautham.
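The off-by-one and the suggested do..while() shape can be sketched outside the tool. Here struct cpu_node is a hypothetical stand-in for cpupower's related-cpus singly linked list, and a bit mask stands in for bitmask_setbit(cpus_chosen, ...); only the loop shapes are the point.

```c
#include <stddef.h>
#include <stdint.h>

struct cpu_node {
	unsigned int cpu;
	struct cpu_node *next;
};

/* buggy shape: the node whose ->next is NULL is never set */
static uint64_t set_bits_while(const struct cpu_node *cpus)
{
	uint64_t mask = 0;

	while (cpus->next) {
		mask |= 1ULL << cpus->cpu;
		cpus = cpus->next;
	}
	return mask;
}

/* do..while visits every node, including the last (cpus assumed non-NULL) */
static uint64_t set_bits_do_while(const struct cpu_node *cpus)
{
	uint64_t mask = 0;

	do {
		mask |= 1ULL << cpus->cpu;
		cpus = cpus->next;
	} while (cpus);
	return mask;
}

/* a three-CPU related list: 0 -> 1 -> 2 */
static struct cpu_node n2 = { 2, NULL };
static struct cpu_node n1 = { 1, &n2 };
static struct cpu_node n0 = { 0, &n1 };
```

For the list 0 -> 1 -> 2 the while() version sets only CPUs 0 and 1, while the do..while() version sets all three without needing the extra bitmask_setbit() after the loop.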
Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
Hi Nicholas,

On Thu, May 16, 2019 at 04:13:17PM +1000, Nicholas Piggin wrote:
> > The motivation behind this patch was a HPC customer issue where they
> > were observing some CPUs in the core getting stuck at stop0_lite
> > state, thereby lowering the performance on the other CPUs of the core
> > which were running the application.
> >
> > Disabling stop0_lite via sysfs didn't help since we would fall back to
> > snooze and it would make matters worse.
>
> snooze has the timeout though, so it should kick into stop0 properly
> (and if it doesn't that's another issue that should be fixed in this
> series).
>
> I'm not questioning the patch for stop0_lite, to be clear. I think
> the logic is sound. I just raise one unrelated issue that happens to
> be for stop0_lite as well (should we even enable it on P9?), and one
> peripheral issue (should we make a similar fix for deeper stop states?)

I think it makes sense to generalize this from the point of view of CPUs remaining in shallower idle states for long durations on tickless kernels.

> >> We should always have fewer states unless proven otherwise.
> >
> > I agree.
> >
> >> That said, we enable it today so I don't want to argue this point
> >> here, because it is a different issue from your patch.
> >>
> >> > When it is in stop0 or deeper, it frees up both
> >> > space and time slice of core.
> >> > In stop0_lite, the cpu doesn't free up the core resources and thus
> >> > inhibits thread folding. When a cpu goes to stop0, it will free up the
> >> > core resources, thus increasing the single thread performance of the
> >> > other sibling thread.
> >> > Hence, we do not want to get stuck in stop0_lite for long durations, and
> >> > want to quickly move onto the next state.
> >> > If we get stuck in any other state we would possibly be losing out on
> >> > power saving, but will still be able to gain the performance benefits
> >> > for other sibling threads.
> >> That's true, but stop0 -> deeper stop is also a benefit (for
> >> performance if we have some power/thermal constraints, and/or for power
> >> usage).
> >>
> >> Sure it may not be so noticeable as the SMT switch, but I just wonder
> >> if the infrastructure should be there for the same reason.
> >>
> >> I was testing interrupt frequency on some tickless workload configs,
> >> and without too much trouble you can get CPUs to sleep with no
> >> interrupts for many minutes. Hours even. We wouldn't want the CPU to
> >> stay in stop0 for that long.
> >
> > If it stays in stop0 or even stop2 for that long, we would want to
> > "promote" it to a deeper state, such as say STOP5, which allows the
> > other cores to run at higher frequencies.
>
> So we would want this same logic for all but the deepest runtime
> stop state?

Yes. We can, in steps, promote individual threads of the core to eventually request a deeper state such as stop4/5. On a completely idle tickless system, eventually we should see the core go to the deeper idle state.

> >> Just thinking about the patch itself, I wonder do you need a full
> >> kernel timer, or could we just set the decrementer? Is there much
> >> performance cost here?
> >
> > Good point. A decrementer would do actually.
>
> That would be good if it does, might save a few cycles.
>
> Thanks,
> Nick

--
Thanks and Regards
gautham.
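The forced-wakeup training idea discussed in this thread can be modeled with a toy simulation. The residency number and the trivial last-value "governor" below are made up for illustration; only the feedback structure (a timer armed at the next state's target residency caps how long the CPU can sit in the lite state, which in turn trains the prediction) mirrors the proposal.

```c
#include <stdbool.h>

#define STOP0_LITE 0
#define STOP0      1

#define STOP0_RESIDENCY 100 /* hypothetical target residency of stop0 */

/* governor: pick stop0 only if the predicted sleep justifies its residency */
static int pick_state(int predicted_sleep)
{
	return predicted_sleep >= STOP0_RESIDENCY ? STOP0 : STOP0_LITE;
}

/*
 * An idle CPU with no pending work never wakes on its own. With the
 * forced wakeup, a timer armed at the next state's residency kicks it
 * out of stop0_lite; the observed sleep length then trains the
 * governor to choose stop0 on the next idle episode.
 */
static int final_state(bool forced_wakeup, int episodes)
{
	int predicted = 0; /* trivial last-value predictor */
	int state = pick_state(predicted);

	for (int i = 0; i < episodes; i++) {
		if (state == STOP0_LITE && !forced_wakeup)
			break; /* stuck: nothing ever wakes the CPU */
		predicted = STOP0_RESIDENCY; /* timer fired at stop0's residency */
		state = pick_state(predicted);
	}
	return state;
}
```

With the forced wakeup, a couple of idle episodes are enough to promote the CPU out of the lite state; without it, the model stays in stop0_lite no matter how long it runs, which is the "stuck at stop0_lite" behaviour the customer observed.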
Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
Hello Nicholas,

On Thu, May 16, 2019 at 02:55:42PM +1000, Nicholas Piggin wrote:
> Abhishek's on May 13, 2019 7:49 pm:
> > On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
> >> Abhishek Goel's on April 22, 2019 4:32 pm:
> >>> Currently, the cpuidle governors determine what idle state an idling CPU
> >>> should enter into based on heuristics that depend on the idle history on
> >>> that CPU. Given that no predictive heuristic is perfect, there are cases
> >>> where the governor predicts a shallow idle state, hoping that the CPU will
> >>> be busy soon. However, if no new workload is scheduled on that CPU in the
> >>> near future, the CPU will end up in the shallow state.
> >>>
> >>> Motivation
> >>> --
> >>> In case of POWER, this is problematic, when the predicted state in the
> >>> aforementioned scenario is a lite stop state, as such lite states will
> >>> inhibit SMT folding, thereby depriving the other threads in the core from
> >>> using the core resources.
> >>>
> >>> So we do not want to get stuck in such states for longer durations. To
> >>> address this, the cpuidle core can queue a timer to correspond with the
> >>> residency value of the next available state. This timer will forcefully
> >>> wake up the cpu. A few such iterations will essentially train the
> >>> governor to select a deeper state for that cpu, as the timer here
> >>> corresponds to the next available cpuidle state residency. The cpu will
> >>> be kicked out of the lite state and end up in a non-lite state.
> >>>
> >>> Experiment
> >>> --
> >>> I performed experiments for three scenarios to collect some data.
> >>>
> >>> case 1 :
> >>> Without this patch and without tick retained, i.e. in an upstream
> >>> kernel, it would take more than even a second to get out of stop0_lite.
> >>>
> >>> case 2 : With tick retained in an upstream kernel -
> >>>
> >>> Generally, we have a sched tick at 4ms (CONFIG_HZ = 250). Ideally I
> >>> expected it to take 8 sched ticks to get out of stop0_lite.
> >>> Experimentally, the observation was:
> >>>
> >>> ====================================
> >>> sample   min   max    99th percentile
> >>> 20       4ms   12ms   4ms
> >>> ====================================
> >>>
> >>> It would take at least one sched tick to get out of stop0_lite.
> >>>
> >>> case 3 : With this patch (not stopping tick, but explicitly queuing a
> >>> timer)
> >>>
> >>> sample   min     max     99th percentile
> >>> 20       144us   192us   144us
> >>>
> >>> In this patch, we queue a timer just before entering into a stop0_lite
> >>> state. The timer fires at (residency of next available state + exit
> >>> latency of next available state * 2). Let's say if the next state
> >>> (stop0) is available, which has a residency of 20us, it should get out
> >>> in as low as (20+2*2)*8 [based on the formula
> >>> (residency + 2 x latency) * history length] microseconds = 192us.
> >>> Ideally we would expect 8 iterations; it was observed to get out in
> >>> 6-7 iterations. Even if, let's say, stop2 is the next available state
> >>> (stop0 and stop1 both are unavailable), it would take (100+2*10)*8 =
> >>> 960us to get into stop2.
> >>>
> >>> So, we are able to get out of stop0_lite generally in 150us (with this
> >>> patch) as compared to 4ms (with tick retained). As stated earlier, we
> >>> do not want to get stuck in stop0_lite as it inhibits SMT folding for
> >>> other sibling threads, depriving them of core resources. The current
> >>> patch uses forced-wakeup only for stop0_lite, as it gives a performance
> >>> benefit (the primary reason) along with lowering power consumption. We
> >>> may extend this model to other states in future.
> >>
> >> I still have to wonder, between our snooze loop and stop0, what does
> >> stop0_lite buy us.
> >>
> >> That said, the problem you're solving here is a generic one that all
> >> stop states have, I think. Doesn't the same thing apply going from
> >> stop0 to stop5? You might underestimate the sleep time and lose power
> >> savings and therefore performance there too. Shouldn't we make it
> >> generic for all stop states?
> >>
> >> Thanks,
> >> Nick
> >>
> >
> > When a cpu is in snooze, it takes both space and time of the core. When
> > in stop0_lite, it frees up time but it still takes space.
>
> True, but snooze should only be taking less than 1% of front-end
> cycles. I appreciate there is some non-zero difference here, I just
> wonder in practice what exactly we gain by it.

The idea behind implementing a lite-state was that on future platforms
it can be made to wait on a flag and hence act as a replacement for
snooze. On POWER9 we don't have this feature. The motivation
[PATCH v3] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt()
From: "Gautham R. Shenoy"

The calls to arch_add_memory()/arch_remove_memory() are always made
with the read-side cpu_hotplug_lock acquired via
memory_hotplug_begin(). On pSeries,
arch_add_memory()/arch_remove_memory() eventually call resize_hpt()
which in turn calls stop_machine() which acquires the read-side
cpu_hotplug_lock again, thereby resulting in the recursive acquisition
of this lock. Lockdep complains as follows in these code-paths.

 swapper/0/1 is trying to acquire lock:
 (ptrval) (cpu_hotplug_lock.rw_sem){}, at: stop_machine+0x2c/0x60

 but task is already holding lock:
 (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        lock(cpu_hotplug_lock.rw_sem);
        lock(cpu_hotplug_lock.rw_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 3 locks held by swapper/0/1:
 #0: (ptrval) (>mutex){}, at: __driver_attach+0x12c/0x1b0
 #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50
 #2: (ptrval) (mem_hotplug_lock.rw_sem){}, at: percpu_down_write+0x54/0x1a0

 stack backtrace:
 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
 Call Trace:
 [c000feb03150] [c0e32bd4] dump_stack+0xe8/0x164 (unreliable)
 [c000feb031a0] [c020d6c0] __lock_acquire+0x1110/0x1c70
 [c000feb03320] [c020f080] lock_acquire+0x240/0x290
 [c000feb033e0] [c017f554] cpus_read_lock+0x64/0xf0
 [c000feb03420] [c029ebac] stop_machine+0x2c/0x60
 [c000feb03460] [c00d7f7c] pseries_lpar_resize_hpt+0x19c/0x2c0
 [c000feb03500] [c00788d0] resize_hpt_for_hotplug+0x70/0xd0
 [c000feb03570] [c0e5d278] arch_add_memory+0x58/0xfc
 [c000feb03610] [c03553a8] devm_memremap_pages+0x5e8/0x8f0
 [c000feb036c0] [c09c2394] pmem_attach_disk+0x764/0x830
 [c000feb037d0] [c09a7c38] nvdimm_bus_probe+0x118/0x240
 [c000feb03860] [c0968500] really_probe+0x230/0x4b0
 [c000feb038f0] [c0968aec] driver_probe_device+0x16c/0x1e0
 [c000feb03970] [c0968ca8] __driver_attach+0x148/0x1b0
 [c000feb039f0] [c09650b0] bus_for_each_dev+0x90/0x130
 [c000feb03a50] [c0967dd4] driver_attach+0x34/0x50
 [c000feb03a70] [c0967068] bus_add_driver+0x1a8/0x360
 [c000feb03b00] [c096a498] driver_register+0x108/0x170
 [c000feb03b70] [c09a7400] __nd_driver_register+0xd0/0xf0
 [c000feb03bd0] [c128aa90] nd_pmem_driver_init+0x34/0x48
 [c000feb03bf0] [c0010a10] do_one_initcall+0x1e0/0x45c
 [c000feb03cd0] [c122462c] kernel_init_freeable+0x540/0x64c
 [c000feb03db0] [c001110c] kernel_init+0x2c/0x160
 [c000feb03e20] [c000bed4] ret_from_kernel_thread+0x5c/0x68

Fix this issue by
1) Requiring all the calls to pseries_lpar_resize_hpt() be made
   with cpu_hotplug_lock held.
2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
   as a consequence of 1)
3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
   with cpu_hotplug_lock held.

Reported-by: Aneesh Kumar K.V
Signed-off-by: Gautham R. Shenoy
---
v2 -> v3 : Updated the comment for pseries_lpar_resize_hpt()
           Updated the commit-log with the full backtrace.
v1 -> v2 : Rebased against powerpc/next instead of linux/master

 arch/powerpc/mm/book3s64/hash_utils.c | 9 -
 arch/powerpc/platforms/pseries/lpar.c | 8 ++--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 919a861..d07fcafd 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -1928,10 +1929,16 @@ static int hpt_order_get(void *data, u64 *val)

 static int hpt_order_set(void *data, u64 val)
 {
+	int ret;
+
 	if (!mmu_hash_ops.resize_hpt)
 		return -ENODEV;

-	return mmu_hash_ops.resize_hpt(val);
+	cpus_read_lock();
+	ret = mmu_hash_ops.resize_hpt(val);
+	cpus_read_unlock();
+
+	return ret;
 }

 DEFINE_DEBUGFS_ATTRIBUTE(fops_hpt_order, hpt_order_get, hpt_order_set, "%llu\n");
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 1034ef1..557d592 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -859,7 +859,10 @@ static int pseries_lpar_resize_hpt_commit(void *data)
 	return 0;
 }

-/* Must be called in user context */
+/*
+ * Must be called in process context. The caller must hold the
+ * cpus_lock.
+ */
 static int pseries_lpar_resize_hpt(unsigned long shift)
 {
 	struct hpt_res
Re: [RESEND PATCH] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt
On Tue, May 14, 2019 at 05:02:16PM +1000, Michael Ellerman wrote:
> "Gautham R. Shenoy" writes:
> > From: "Gautham R. Shenoy"
> >
> > Subject: Re: [RESEND PATCH] powerpc/pseries: Fix cpu_hotplug_lock
> > acquisition in resize_hpt
>
> ps. A "RESEND" implies the patch is unchanged and you're just resending
> it because it was ignored.
>
> In this case it should have just been "PATCH v2", with a note below the "---"
> saying "v2: Rebased onto powerpc/next ..."

Ok. I will send a v3 :-)

> cheers
>
> > During a memory hotplug operation involving resizing of the HPT, we
> > invoke a stop_machine() to perform the resizing. In this code path, we
> > end up recursively taking the cpu_hotplug_lock, first in
> > memory_hotplug_begin() and then subsequently in stop_machine(). This
> > causes the system to hang. With lockdep enabled we get the following
> > error message before the hang.
> >
> > swapper/0/1 is trying to acquire lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: stop_machine+0x2c/0x60
> >
> > but task is already holding lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50
> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >        CPU0
> >
> >        lock(cpu_hotplug_lock.rw_sem);
> >        lock(cpu_hotplug_lock.rw_sem);
> >
> >  *** DEADLOCK ***
> >
> > Fix this issue by
> > 1) Requiring all the calls to pseries_lpar_resize_hpt() be made
> >    with cpu_hotplug_lock held.
> >
> > 2) In pseries_lpar_resize_hpt() invoke stop_machine_cpuslocked()
> >    as a consequence of 1)
> >
> > 3) To satisfy 1), in hpt_order_set(), call mmu_hash_ops.resize_hpt()
> >    with cpu_hotplug_lock held.
> >
> > Reported-by: Aneesh Kumar K.V
> > Signed-off-by: Gautham R. Shenoy
> > ---
> >
> > Rebased this one against powerpc/next instead of linux/master.
> >
> > arch/powerpc/mm/book3s64/hash_utils.c | 9 -
> > arch/powerpc/platforms/pseries/lpar.c | 8 ++--
> > 2 files changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 919a861..d07fcafd 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -38,6 +38,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  #include
> >  #include
> > @@ -1928,10 +1929,16 @@ static int hpt_order_get(void *data, u64 *val)
> >
> >  static int hpt_order_set(void *data, u64 val)
> >  {
> > +	int ret;
> > +
> >  	if (!mmu_hash_ops.resize_hpt)
> >  		return -ENODEV;
> >
> > -	return mmu_hash_ops.resize_hpt(val);
> > +	cpus_read_lock();
> > +	ret = mmu_hash_ops.resize_hpt(val);
> > +	cpus_read_unlock();
> > +
> > +	return ret;
> >  }
> >
> >  DEFINE_DEBUGFS_ATTRIBUTE(fops_hpt_order, hpt_order_get, hpt_order_set,
> >  "%llu\n");
> > diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
> > index 1034ef1..2fc9756 100644
> > --- a/arch/powerpc/platforms/pseries/lpar.c
> > +++ b/arch/powerpc/platforms/pseries/lpar.c
> > @@ -859,7 +859,10 @@ static int pseries_lpar_resize_hpt_commit(void *data)
> >  	return 0;
> >  }
> >
> > -/* Must be called in user context */
> > +/*
> > + * Must be called in user context. The caller should hold the
> > + * cpus_lock.
> > + */
> >  static int pseries_lpar_resize_hpt(unsigned long shift)
> >  {
> >  	struct hpt_resize_state state = {
> > @@ -913,7 +916,8 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
> >
> >  	t1 = ktime_get();
> >
> > -	rc = stop_machine(pseries_lpar_resize_hpt_commit, &state, NULL);
> > +	rc = stop_machine_cpuslocked(pseries_lpar_resize_hpt_commit,
> > +				     &state, NULL);
> >
> >  	t2 = ktime_get();
> >
> > --
> > 1.9.4
>
Re: [RESEND PATCH] powerpc/pseries: Fix cpu_hotplug_lock acquisition in resize_hpt
Hi Michael, On Tue, May 14, 2019 at 05:00:19PM +1000, Michael Ellerman wrote: > "Gautham R. Shenoy" writes: > > From: "Gautham R. Shenoy" > > > > During a memory hotplug operations involving resizing of the HPT, we > > invoke a stop_machine() to perform the resizing. In this code path, we > > end up recursively taking the cpu_hotplug_lock, first in > > memory_hotplug_begin() and then subsequently in stop_machine(). This > > causes the system to hang. > > This implies we have never tested a memory hotplug that resized the HPT. > Is that really true? Or did something change? > This was reported by Aneesh during a testcase involving reconfiguring the namespace for nvdimm where we do a memory remove followed by add. The memory add invokes resize_hpt(). It seems we can hit this issue when we perform a memory hotplug/unplug in the guest. > > With lockdep enabled we get the following > > error message before the hang. > > > > swapper/0/1 is trying to acquire lock: > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: > > stop_machine+0x2c/0x60 > > > > but task is already holding lock: > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: > > mem_hotplug_begin+0x20/0x50 > > Do we have the full stack trace? 
Yes, here is the complete log:

[0.537123] swapper/0/1 is trying to acquire lock:
[0.537197] (ptrval) (cpu_hotplug_lock.rw_sem){}, at: stop_machine+0x2c/0x60
[0.537336]
[0.537336] but task is already holding lock:
[0.537429] (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50
[0.537570]
[0.537570] other info that might help us debug this:
[0.537663]  Possible unsafe locking scenario:
[0.537663]
[0.537756]        CPU0
[0.537794]
[0.537832]   lock(cpu_hotplug_lock.rw_sem);
[0.537906]   lock(cpu_hotplug_lock.rw_sem);
[0.537980]
[0.537980]  *** DEADLOCK ***
[0.537980]
[0.538074]  May be due to missing lock nesting notation
[0.538074]
[0.538168] 3 locks held by swapper/0/1:
[0.538224] #0: (ptrval) (>mutex){}, at: __driver_attach+0x12c/0x1b0
[0.538348] #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: mem_hotplug_begin+0x20/0x50
[0.538477] #2: (ptrval) (mem_hotplug_lock.rw_sem){}, at: percpu_down_write+0x54/0x1a0
[0.538608]
[0.538608] stack backtrace:
[0.538685] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc5-58373-gbc99402235f3-dirty #166
[0.538812] Call Trace:
[0.538863] [c000feb03150] [c0e32bd4] dump_stack+0xe8/0x164 (unreliable)
[0.538975] [c000feb031a0] [c020d6c0] __lock_acquire+0x1110/0x1c70
[0.539086] [c000feb03320] [c020f080] lock_acquire+0x240/0x290
[0.539180] [c000feb033e0] [c017f554] cpus_read_lock+0x64/0xf0
[0.539273] [c000feb03420] [c029ebac] stop_machine+0x2c/0x60
[0.539367] [c000feb03460] [c00d7f7c] pseries_lpar_resize_hpt+0x19c/0x2c0
[0.539479] [c000feb03500] [c00788d0] resize_hpt_for_hotplug+0x70/0xd0
[0.539590] [c000feb03570] [c0e5d278] arch_add_memory+0x58/0xfc
[0.539683] [c000feb03610] [c03553a8] devm_memremap_pages+0x5e8/0x8f0
[0.539804] [c000feb036c0] [c09c2394] pmem_attach_disk+0x764/0x830
[0.539916] [c000feb037d0] [c09a7c38] nvdimm_bus_probe+0x118/0x240
[0.540026] [c000feb03860] [c0968500] really_probe+0x230/0x4b0
[0.540119] [c000feb038f0] [c0968aec] driver_probe_device+0x16c/0x1e0
[0.540230] [c000feb03970] [c0968ca8] __driver_attach+0x148/0x1b0
[0.540340] [c000feb039f0] [c09650b0] bus_for_each_dev+0x90/0x130
[0.540451] [c000feb03a50] [c0967dd4] driver_attach+0x34/0x50
[0.540544] [c000feb03a70] [c0967068] bus_add_driver+0x1a8/0x360
[0.540654] [c000feb03b00] [c096a498] driver_register+0x108/0x170
[0.540766] [c000feb03b70] [c09a7400] __nd_driver_register+0xd0/0xf0
[0.540898] [c000feb03bd0] [c128aa90] nd_pmem_driver_init+0x34/0x48
[0.541010] [c000feb03bf0] [c0010a10] do_one_initcall+0x1e0/0x45c
[0.541122] [c000fe
[PATCH] pseries/energy: Use OF accessor functions to read ibm,drc-indexes
From: "Gautham R. Shenoy"

In cpu_to_drc_index(), in the case when FW_FEATURE_DRC_INFO is absent,
we currently use of_get_property() to obtain the pointer to the array
corresponding to the property "ibm,drc-indexes". The elements of this
array are of type __be32, but are accessed without any conversion to
the OS endianness, which is buggy on a Little Endian OS.

Fix this by using the of_property_read_u32_index() accessor function
to safely read the elements of the array.

Fixes: commit e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU indexes")
Cc: #v4.16+
Reported-by: Pavithra R. Prakash
Signed-off-by: Gautham R. Shenoy
---
 arch/powerpc/platforms/pseries/pseries_energy.c | 27 -
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c b/arch/powerpc/platforms/pseries/pseries_energy.c
index 6ed2212..1c4d1ba 100644
--- a/arch/powerpc/platforms/pseries/pseries_energy.c
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -77,18 +77,27 @@ static u32 cpu_to_drc_index(int cpu)
 		ret = drc.drc_index_start + (thread_index * drc.sequential_inc);
 	} else {
-		const __be32 *indexes;
-
-		indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
-		if (indexes == NULL)
-			goto err_of_node_put;
+		u32 nr_drc_indexes, thread_drc_index;

 		/*
-		 * The first element indexes[0] is the number of drc_indexes
-		 * returned in the list. Hence thread_index+1 will get the
-		 * drc_index corresponding to core number thread_index.
+		 * The first element of ibm,drc-indexes array is the
+		 * number of drc_indexes returned in the list. Hence
+		 * thread_index+1 will get the drc_index corresponding
+		 * to core number thread_index.
 		 */
-		ret = indexes[thread_index + 1];
+		rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
+						0, &nr_drc_indexes);
+		if (rc)
+			goto err_of_node_put;
+
+		WARN_ON(thread_index > nr_drc_indexes);
+		rc = of_property_read_u32_index(dn, "ibm,drc-indexes",
+						thread_index + 1,
+						&thread_drc_index);
+		if (rc)
+			goto err_of_node_put;
+
+		ret = thread_drc_index;
 	}

 	rc = 0;
--
1.9.4
Re: [PATCH] powernv: powercap: Add hard minimum powercap
Hi Shilpa,

On Thu, Feb 28, 2019 at 11:25:25AM +0530, Shilpasri G Bhat wrote:
> Hi,
>
> On 02/28/2019 10:14 AM, Daniel Axtens wrote:
> > Shilpasri G Bhat writes:
> >
> >> In POWER9, the OCC (On-Chip-Controller) provides for a hard and soft
> >> system powercapping range. The hard powercap range is guaranteed while
> >> the soft powercap may or may not be asserted due to various
> >> power-thermal reasons based on system configuration and workloads.
> >> This patch adds a sysfs file to export the hard minimum powercap limit
> >> to allow the user to set the appropriate powercap value that can be
> >> managed by the system.
> >
> > Maybe it's common terminology and I'm just not aware of it, but what do
> > you mean by "asserted"? It doesn't appear elsewhere in the documentation
> > you're patching, and it's not a use of assert that I'm familiar with...
> >
> > Regards,
> > Daniel
>
> I meant to say the powercap will not be assured in the soft powercap
> range, i.e., the system's CPU frequency may or may not be throttled to
> keep it within the powercap.
>
> I can reword the document and commit message.

I agree with Daniel. How about replacing "asserted" with "enforced by
the OCC"?

> Thanks and Regards,
> Shilpa

--
Thanks and Regards
gautham.

> >> Signed-off-by: Shilpasri G Bhat
> >> ---
> >>  .../ABI/testing/sysfs-firmware-opal-powercap   | 10
> >>  arch/powerpc/platforms/powernv/opal-powercap.c | 66 +-
> >>  2 files changed, 37 insertions(+), 39 deletions(-)
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-firmware-opal-powercap b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> index c9b66ec..65db4c1 100644
> >> --- a/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> +++ b/Documentation/ABI/testing/sysfs-firmware-opal-powercap
> >> @@ -29,3 +29,13 @@ Description:	System powercap directory and attributes applicable for
> >>  		creates a request for setting a new-powercap. The
> >>  		powercap requested must be between powercap-min
> >>  		and powercap-max.
> >> +
> >> +What:		/sys/firmware/opal/powercap/system-powercap/powercap-hard-min
> >> +Date:		Feb 2019
> >> +Contact:	Linux for PowerPC mailing list
> >> +Description:	Hard minimum powercap
> >> +
> >> +		This file provides the hard minimum powercap limit in
> >> +		Watts. The powercap value above hard minimum is always
> >> +		guaranteed to be asserted and the powercap value below
> >> +		the hard minimum limit may or may not be guaranteed.
> >> diff --git a/arch/powerpc/platforms/powernv/opal-powercap.c b/arch/powerpc/platforms/powernv/opal-powercap.c
> >> index d90ee4f..38408e7 100644
> >> --- a/arch/powerpc/platforms/powernv/opal-powercap.c
> >> +++ b/arch/powerpc/platforms/powernv/opal-powercap.c
> >> @@ -139,10 +139,24 @@ static void powercap_add_attr(int handle, const char *name,
> >>  	attr->handle = handle;
> >>  	sysfs_attr_init(&attr->attr.attr);
> >>  	attr->attr.attr.name = name;
> >> -	attr->attr.attr.mode = 0444;
> >> +
> >> +	if (!strncmp(name, "powercap-current", strlen(name))) {
> >> +		attr->attr.attr.mode = 0664;
> >> +		attr->attr.store = powercap_store;
> >> +	} else {
> >> +		attr->attr.attr.mode = 0444;
> >> +	}
> >> +
> >>  	attr->attr.show = powercap_show;
> >>  }
> >>
> >> +static const char * const powercap_strs[] = {
> >> +	"powercap-max",
> >> +	"powercap-min",
> >> +	"powercap-current",
> >> +	"powercap-hard-min",
> >> +};
> >> +
> >>  void __init opal_powercap_init(void)
> >>  {
> >>  	struct device_node *powercap, *node;
> >> @@ -167,60 +181,34 @@ void __init opal_powercap_init(void)
> >>
> >>  	i = 0;
> >>  	for_each_child_of_node(powercap, node) {
> >> -		u32 cur, min, max;
> >> -		int j = 0;
> >> -		bool has_cur = false, has_min = false, has_max = false;
> >> +		u32 id;
> >> +		int j, count = 0;
> >>
> >> -		if (!of_property_read_u32(node, "powercap-min", &min)) {
> >> -			j++;
> >> -			has_min = true;
> >> -		}
> >> -
> >> -		if (!of_property_read_u32(node, "powercap-max", &max)) {
> >> -			j++;
> >> -			has_max = true;
> >> -		}
> >> +		for (j = 0; j < ARRAY_SIZE(powercap_strs); j++)
> >> +			if (!of_property_read_u32(node, powercap_strs[j], &id))
> >> +				count++;
> >>
> >> -		if (!of_property_read_u32(node, "powercap-current", &cur)) {
> >> -			j++;
> >> -			has_cur = true;
> >> -		}
> >> -
> >> -		pcaps[i].pattrs = kcalloc(j, sizeof(struct powercap_attr),
> >> +		pcaps[i].pattrs = kcalloc(count, sizeof(struct powercap_attr),
> >>  					  GFP_KERNEL);
> >>  		if (!pcaps[i].pattrs)
> >>  			goto