Re: [PATCH v2] powerpc/smp: do not decrement idle task preempt count in CPU offline

2021-10-18 Thread Srikar Dronamraju
* Nathan Lynch  [2021-10-15 12:39:02]:

> With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
> get:
> 
> BUG: scheduling while atomic: swapper/1/0/0x
> no locks held by swapper/1/0.
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
> Call Trace:
>  dump_stack_lvl+0xac/0x108
>  __schedule_bug+0xac/0xe0
>  __schedule+0xcf8/0x10d0
>  schedule_idle+0x3c/0x70
>  do_idle+0x2d8/0x4a0
>  cpu_startup_entry+0x38/0x40
>  start_secondary+0x2ec/0x3a0
>  start_secondary_prolog+0x10/0x14
> 
> This is because powerpc's arch_cpu_idle_dead() decrements the idle task's
> preempt count, for reasons explained in commit a7c2bb8279d2 ("powerpc:
> Re-enable preemption before cpu_die()"), specifically "start_secondary()
> expects a preempt_count() of 0."
> 
> However, since commit 2c669ef6979c ("powerpc/preempt: Don't touch the idle
> task's preempt_count during hotplug") and commit f1a0a376ca0c ("sched/core:
> Initialize the idle task with preemption disabled"), that justification no
> longer holds.
> 
> The idle task isn't supposed to re-enable preemption, so remove the
> vestigial preempt_enable() from the CPU offline path.
> 
> Tested with pseries and powernv in qemu, and pseries on PowerVM.
> 
> Fixes: 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's 
> preempt_count during hotplug")
> Signed-off-by: Nathan Lynch 
> Reviewed-by: Valentin Schneider 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> ---
> 
> Notes:
> Changes since v1:
> 
> - remove incorrect Fixes: tag, add Valentin's r-b.
> 
>  arch/powerpc/kernel/smp.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 9cc7d3dbf439..605bab448f84 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1730,8 +1730,6 @@ void __cpu_die(unsigned int cpu)
> 
>  void arch_cpu_idle_dead(void)
>  {
> - sched_preempt_enable_no_resched();
> -
>   /*
>* Disable on the down path. This will be re-enabled by
>* start_secondary() via start_secondary_resume() below
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju
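
For reference, the offline/online cycle described in the commit message can
be driven from sysfs (assuming CONFIG_HOTPLUG_CPU; cpu1 is just an example
CPU):

  echo 0 > /sys/devices/system/cpu/cpu1/online
  echo 1 > /sys/devices/system/cpu/cpu1/online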


Re: [PATCH v2 2/2] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()

2021-09-29 Thread Srikar Dronamraju
* Nathan Lynch  [2021-09-28 16:41:47]:

> vcpu_is_preempted() can be used outside of preempt-disabled critical
> sections, yielding warnings such as:
> 
> BUG: using smp_processor_id() in preemptible [] code: 
> systemd-udevd/185
> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> Call Trace:
> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
> 
> The result of vcpu_is_preempted() is always used speculatively, and the
> function does not access per-cpu resources in a (Linux) preempt-unsafe way.
> Use raw_smp_processor_id() to avoid such warnings, adding explanatory
> comments.
> 
> Signed-off-by: Nathan Lynch 
> Fixes: ca3f969dcb11 ("powerpc/paravirt: Use is_kvm_guest() in 
> vcpu_is_preempted()")

Looks good to me.

Reviewed-by: Srikar Dronamraju 

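As context for "used speculatively" above, a minimal sketch (not from this
thread; owner_running() is an illustrative stand-in for the rwsem helpers)
of the typical caller-side pattern, modeled on rwsem_spin_on_owner():

  /* spin while the lock owner is running on its CPU */
  while (owner_running(sem, owner)) {
          /*
           * Stop spinning if the owner's vCPU looks preempted by the
           * hypervisor; the answer may already be stale by the time
           * we act on it, which callers tolerate.
           */
          if (vcpu_is_preempted(task_cpu(owner)))
                  break;
          cpu_relax();
  }
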
> ---
>  arch/powerpc/include/asm/paravirt.h | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
> index 39f173961f6a..eb7df559ae74 100644
> --- a/arch/powerpc/include/asm/paravirt.h
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -110,7 +110,23 @@ static inline bool vcpu_is_preempted(int cpu)
> 
>  #ifdef CONFIG_PPC_SPLPAR
>   if (!is_kvm_guest()) {
> - int first_cpu = cpu_first_thread_sibling(smp_processor_id());
> + int first_cpu;
> +
> + /*
> +  * The result of vcpu_is_preempted() is used in a
> +  * speculative way, and is always subject to invalidation
> +  * by events internal and external to Linux. While we can
> +  * be called in preemptable context (in the Linux sense),
> +  * we're not accessing per-cpu resources in a way that can
> +  * race destructively with Linux scheduler preemption and
> +  * migration, and callers can tolerate the potential for
> +  * error introduced by sampling the CPU index without
> +  * pinning the task to it. So it is permissible to use
> +  * raw_smp_processor_id() here to defeat the preempt debug
> +  * warnings that can arise from using smp_processor_id()
> +  * in arbitrary contexts.
> +  */
> + first_cpu = cpu_first_thread_sibling(raw_smp_processor_id());
> 
>   /*
>* The PowerVM hypervisor dispatches VMs on a whole core
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 1/2] powerpc/paravirt: vcpu_is_preempted() commentary

2021-09-29 Thread Srikar Dronamraju
* Nathan Lynch  [2021-09-28 16:41:46]:

> Add comments more clearly documenting that this function determines whether
> hypervisor-level preemption of the VM has occurred.
> 
> Signed-off-by: Nathan Lynch 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> ---
>  arch/powerpc/include/asm/paravirt.h | 22 ++
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
> index bcb7b5f917be..39f173961f6a 100644
> --- a/arch/powerpc/include/asm/paravirt.h
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -21,7 +21,7 @@ static inline bool is_shared_processor(void)
>   return static_branch_unlikely(&shared_processor);
>  }
> 
> -/* If bit 0 is set, the cpu has been preempted */
> +/* If bit 0 is set, the cpu has been ceded, conferred, or preempted */
>  static inline u32 yield_count_of(int cpu)
>  {
>   __be32 yield_count = READ_ONCE(lppaca_of(cpu).yield_count);
> @@ -92,6 +92,19 @@ static inline void prod_cpu(int cpu)
>  #define vcpu_is_preempted vcpu_is_preempted
>  static inline bool vcpu_is_preempted(int cpu)
>  {
> + /*
> +  * The dispatch/yield bit alone is an imperfect indicator of
> +  * whether the hypervisor has dispatched @cpu to run on a physical
> +  * processor. When it is clear, @cpu is definitely not preempted.
> +  * But when it is set, it means only that it *might* be, subject to
> +  * other conditions. So we check other properties of the VM and
> +  * @cpu first, resorting to the yield count last.
> +  */
> +
> + /*
> +  * Hypervisor preemption isn't possible in dedicated processor
> +  * mode by definition.
> +  */
>   if (!is_shared_processor())
>   return false;
> 
> @@ -100,9 +113,10 @@ static inline bool vcpu_is_preempted(int cpu)
>   int first_cpu = cpu_first_thread_sibling(smp_processor_id());
> 
>   /*
> -  * Preemption can only happen at core granularity. This CPU
> -  * is not preempted if one of the CPU of this core is not
> -  * preempted.
> +  * The PowerVM hypervisor dispatches VMs on a whole core
> +  * basis. So we know that a thread sibling of the local CPU
> +  * cannot have been preempted by the hypervisor, even if it
> +  * has called H_CONFER, which will set the yield bit.
>    */
>   if (cpu_first_thread_sibling(cpu) == first_cpu)
>   return false;
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] powerpc: fix unbalanced node refcount in check_kvm_guest()

2021-09-29 Thread Srikar Dronamraju
* Nathan Lynch  [2021-09-28 07:45:50]:

> When check_kvm_guest() succeeds in looking up a /hypervisor OF node, it
> returns without performing a matching put for the lookup, leaving the
> node's reference count elevated.
> 
> Add the necessary call to of_node_put(), rearranging the code slightly to
> avoid repetition or goto.
> 

Looks good to me.

I do see a few other cases where we call of_find_node*() without a matching
of_node_put(); the expected pattern is sketched after the list below.

Some of the ones I saw were in:

find_legacy_serial_ports() in arch/powerpc/kernel/legacy_serial.c
proc_ppc64_create() in arch/powerpc/kernel/proc_powerpc.c
update_events_in_group() in arch/powerpc/perf/imc-pmu.c
cell_iommu_init_disabled() in arch/powerpc/platforms/cell/iommu.c
cell_publish_devices() in arch/powerpc/platforms/cell/setup.c
spufs_init_isolated_loader() in arch/powerpc/platforms/cell/spufs/inode.c
holly_init_pci() / holly_restart() / holly_init_IRQ() in
arch/powerpc/platforms/embedded6xx/holly.c
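
A minimal sketch of the expected get/put pattern (an editorial illustration
modeled on the fix below; the /hypervisor lookup mirrors check_kvm_guest()):

  struct device_node *np;

  np = of_find_node_by_path("/hypervisor");       /* takes a reference */
  if (!np)
          return 0;

  /* ... inspect np, e.g. of_device_is_compatible(np, ...) ... */

  of_node_put(np);        /* drop the reference on every exit path */
  return 0;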

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Nathan Lynch 
> Fixes: 107c55005fbd ("powerpc/pseries: Add KVM guest doorbell restrictions")
> ---
>  arch/powerpc/kernel/firmware.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
> index c7022c41cc31..20328f72f9f2 100644
> --- a/arch/powerpc/kernel/firmware.c
> +++ b/arch/powerpc/kernel/firmware.c
> @@ -31,11 +31,10 @@ int __init check_kvm_guest(void)
>   if (!hyper_node)
>   return 0;
> 
> - if (!of_device_is_compatible(hyper_node, "linux,kvm"))
> - return 0;
> -
> - static_branch_enable(&kvm_guest);
> + if (of_device_is_compatible(hyper_node, "linux,kvm"))
> + static_branch_enable(&kvm_guest);
> 
> + of_node_put(hyper_node);
>   return 0;
>  }
>  core_initcall(check_kvm_guest); // before kvm_guest_init()
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()

2021-09-23 Thread Srikar Dronamraju
* Michael Ellerman  [2021-09-23 17:29:32]:

> Nathan Lynch  writes:
> > Srikar Dronamraju  writes:
> >
> >> * Nathan Lynch  [2021-09-22 11:01:12]:
> >>
> >>> Srikar Dronamraju  writes:
> >>> > * Nathan Lynch  [2021-09-20 22:12:13]:
> >>> >
> >>> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
> >>> >> sections, yielding warnings such as:
> >>> >> 
> >>> >> BUG: using smp_processor_id() in preemptible [] code: 
> >>> >> systemd-udevd/185
> >>> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> >>> >> Call Trace:
> >>> >> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 
> >>> >> (unreliable)
> >>> >> [c00012907b00] [c1371f70] 
> >>> >> check_preemption_disabled+0x150/0x160
> >>> >> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> >>> >> [c00012907be0] [c01e1408] 
> >>> >> rwsem_down_write_slowpath+0x478/0x9a0
> >>> >> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
> >>> >> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
> >>> >> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
> >>> >> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
> >>> >> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
> >>> >> 
> >>> >> The result of vcpu_is_preempted() is always subject to invalidation by
> >>> >> events inside and outside of Linux; it's just a best guess at a point 
> >>> >> in
> >>> >> time. Use raw_smp_processor_id() to avoid such warnings.
> >>> >
> >>> > Typically smp_processor_id() and raw_smp_processor_id() are the same,
> >>> > except for the CONFIG_DEBUG_PREEMPT case.
> >>> 
> >>> Sorry, I don't follow...
> >>
> >> I meant: unless CONFIG_DEBUG_PREEMPT is set, smp_processor_id() is defined
> >> as raw_smp_processor_id().
> >>
> >>> 
> >>> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id()
> >>> > is actually debug_smp_processor_id(), which does all the checks.
> >>> 
> >>> Yes, OK.
> >>> 
> >>> > I believe these checks in debug_smp_processor_id() are only valid for
> >>> > the x86 case (aka cases where they have __smp_processor_id() defined).
> >>> 
> >>> Hmm, I am under the impression that the checks in
> >>> debug_smp_processor_id() are valid regardless of whether the arch
> >>> overrides __smp_processor_id().
> >>
> >> From include/linux/smp.h
> >>
> >> /*
> >>  * Allow the architecture to differentiate between a stable and unstable 
> >> read.
> >>  * For example, x86 uses an IRQ-safe asm-volatile read for the unstable 
> >> but a
> >>  * regular asm read for the stable.
> >>  */
> >> #ifndef __smp_processor_id
> >> #define __smp_processor_id(x) raw_smp_processor_id(x)
> >> #endif
> >>
> >> As far as I see, only x86 has a definition of __smp_processor_id.
> >> So for archs like Powerpc, __smp_processor_id(), is always
> >> defined as raw_smp_processor_id(). Right?
> >
> > Sure, yes.
> >
> >> I would think debug_smp_processor_id() would be useful if
> >> __smp_processor_id() is different from raw_smp_processor_id(). Do note
> >> debug_smp_processor_id() calls raw_smp_processor_id().
> 
> Agree.
> 
> > I do not think the utility of debug_smp_processor_id() is related to
> > whether the arch defines __smp_processor_id().
> >
> >> Or can you help me understand how debug_smp_processor_id() is useful if
> >> __smp_processor_id() is defined as raw_smp_processor_id()?
> 
> debug_smp_processor_id() is useful on powerpc, as well as other arches,
> because it checks that we're in a context where the processor id won't
> change out from under us.
> 
> eg. something like this is unsafe:
> 
>   int counts[NR_CPUS];
>   int tmp, cpu;
>   
>   cpu = smp_processor_id();
>   tmp = counts[cpu];
>   <- preempted here and migrated to another CPU
>   counts[cpu] = tmp + 1;
> 

If, let's say, the above call were replaced by raw_smp_processor_id(), how
would that avoid the preemption / migration to another CPU?

Replacing it with raw_smp_processor_id() may avoid the debug splat, but the
underlying issue would still remain as-is. No?
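
For contrast, a minimal sketch (not from the thread) of what a preempt-safe
version of the example above would look like, using get_cpu()/put_cpu() to
pin the task across the read-modify-write:

  int counts[NR_CPUS];
  int tmp, cpu;

  cpu = get_cpu();        /* smp_processor_id() with preemption disabled */
  tmp = counts[cpu];
  /* no preemption or migration possible here */
  counts[cpu] = tmp + 1;
  put_cpu();              /* re-enable preemption */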

> 
> > So, for powerpc with DEBUG_PREEMPT unset, a call to smp_procesor_id()
> > expands to __smp_processor_id() which expands to raw_smp_processor_id(),
> > avoiding the preempt safety checks. This is working as intended.
> >
> > For powerpc with DEBUG_PREEMPT=y, a call to smp_processor_id() expands
> > to the out of line call to debug_smp_processor_id(), which calls
> > raw_smp_processor_id() and performs the checks, warning if called in an
> > inappropriate context, as seen here. Also working as intended.
> >
> > AFAICT __smp_processor_id() is a performance/codegen-oriented hook, and
> > not really related to the debug facility. Please see 9ed7d75b2f09d
> > ("x86/percpu: Relax smp_processor_id()").
> 
> Yeah good find.
> 
> cheers

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-09-23 Thread Srikar Dronamraju
* Michael Ellerman  [2021-09-23 21:17:25]:

> Srikar Dronamraju  writes:
> > * Michael Ellerman  [2021-08-26 23:36:53]:
> >
> >> Srikar Dronamraju  writes:
> >> > Scheduler expects unique number of node distances to be available at
> >> > boot.
> ...
> >
> >> > Fake the offline node's distance_lookup_table entries so that all
> >> > possible node distances are updated.
> >>
> >> Does this work if we have a single node offline at boot?
> >>
> >
> > It should.
> >
> >> Say we start with:
> >>
> >> node distances:
> >> node   0   1
> >>   0:  10  20
> >>   1:  20  10
> >>
> >> And node 2 is offline at boot. We can only initialise that nodes entries
> >> in the distance_lookup_table:
> >>
> >>while (i--)
> >>distance_lookup_table[node][i] = node;
> >>
> >> By filling them all with 2 that causes node_distance(2, X) to return the
> >> maximum distance for all other nodes X, because we won't break out of
> >> the loop in __node_distance():
> >>
> >>for (i = 0; i < distance_ref_points_depth; i++) {
> >>if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
> >>break;
> >>
> >>/* Double the distance for each NUMA level */
> >>distance *= 2;
> >>}
> >>
> >> If distance_ref_points_depth was 4 we'd return 160.
> >
> > As you already know, distances 10, 20, ... are defined by powerpc form1
> > affinity. PAPR doesn't define actual distances; it only provides us the
> > associativity. If distance_ref_points_depth is 4 (distance_ref_points_depth
> > doesn't take local distance into consideration), the distances are
> > 10, 20, 40, 80, 160.
> >
> >>
> >> That'd leave us with 3 unique distances at boot, 10, 20, 160.
> >>
> >
> > So if there are unique distances, then as per the current code they have to
> > be 10, 20, 40, 80, ... I don't see a way in which we could have a break in
> > the series, like having 160 without 80.
>
> I'm confused what you mean there.
>

At the outset, if there is a better possible solution, do let me know; I am
willing to try that too.

> If we have a node that's offline at boot then we get 160 for that node,
> that's just the result of having no info for it, so we never break out
> of the for loop.
>
> So if we have two nodes, one hop apart, and then an offline node we get
> 10, 20, 160.
>
> Or if you're using depth = 3 then it's 10, 20, 80.
>

My understanding is as below:

device-tree provides the max hops by way of
ibm,associativity-reference-points. This is mapped to
distance_ref_points_depth in Linux-powerpc.

Now Linux-powerpc encodes hops as (disregarding local distance) 20, 40, 80,
160, 320 ...
So if the distance_ref_points_depth is 3, then the hops are 20, 40, 80.

Do you disagree?
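
To restate the encoding in code, a small editorial sketch mirroring the
__node_distance() loop quoted earlier (hop_distance() is an illustrative
name, not a kernel function): a node pair whose associativity arrays first
match at level i ends up at distance LOCAL_DISTANCE * 2^i.

  static int hop_distance(int unmatched_levels)
  {
          int distance = LOCAL_DISTANCE;  /* 10 */

          /* double the distance for each unmatched NUMA level */
          while (unmatched_levels--)
                  distance *= 2;

          /* with depth 3 this yields 10, 20, 40 or 80 */
          return distance;
  }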


> >> But when node 2 comes online it might introduce more than 1 new distance
> >> value, eg. it could be that the actual distances are:
> >>
> >> node distances:
> >> node   0   1   2
> >>   0:  10  20  40
> >>   1:  20  10  80
> >>   2:  40  80  10
> >>
> >> ie. we now have 4 distances, 10, 20, 40, 80.
> >>
> >> What am I missing?
> >
> > As I said above, I am not sure how we can have a break in the series.
> > If distance_ref_points_depth is 3, the distances have to be 10, 20, 40, 80,
> > at least for form1 affinity.
>
> I agree for depth 3 we have to see 10, 20, 40, 80. But nothing
> guarantees we see each value (other than 10).

The hop distances are not from the device-tree; the device-tree only gives
us the max hops possible. Linux-powerpc is actually hard-coding the
distances, with each hop distance being 2x the previous.
So we may not see any nodes at a particular hop, but we know maximum hops.
And if distance_ref_points_depth is 3, then hops are 20, 40, 80 only.

>
> We can have two nodes one hop apart, so we have 10 and 20, then a third
> node is added 3 hops away, so we get 10, 20, 80.
>

> The real problem is that the third node could be 3 hops from node 0
> and 2 hops from node 1, and so the addition of the third node causes
> two new distance values (40 & 80) to be required.

So here the max hops as given by device-tree is 3. So we know that we are
looking for max-distance of 80 by way of distance_ref_points_depth.

Even if the 3rd node were at 4 hops, we would already know the max distance
of 160, by way of distance_ref_points_depth. However in the most u

Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()

2021-09-22 Thread Srikar Dronamraju
* Nathan Lynch  [2021-09-22 11:01:12]:

> Srikar Dronamraju  writes:
> > * Nathan Lynch  [2021-09-20 22:12:13]:
> >
> >> vcpu_is_preempted() can be used outside of preempt-disabled critical
> >> sections, yielding warnings such as:
> >> 
> >> BUG: using smp_processor_id() in preemptible [] code: 
> >> systemd-udevd/185
> >> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> >> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> >> Call Trace:
> >> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 
> >> (unreliable)
> >> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
> >> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> >> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
> >> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
> >> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
> >> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
> >> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
> >> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
> >> 
> >> The result of vcpu_is_preempted() is always subject to invalidation by
> >> events inside and outside of Linux; it's just a best guess at a point in
> >> time. Use raw_smp_processor_id() to avoid such warnings.
> >
> > Typically smp_processor_id() and raw_smp_processor_id() are the same,
> > except for the CONFIG_DEBUG_PREEMPT case.
> 
> Sorry, I don't follow...

I meant: unless CONFIG_DEBUG_PREEMPT is set, smp_processor_id() is defined as
raw_smp_processor_id().

> 
> > In the CONFIG_DEBUG_PREEMPT case, smp_processor_id()
> > is actually debug_smp_processor_id(), which does all the checks.
> 
> Yes, OK.
> 
> > I believe these checks in debug_smp_processor_id() are only valid for the
> > x86 case (aka cases where they have __smp_processor_id() defined).
> 
> Hmm, I am under the impression that the checks in
> debug_smp_processor_id() are valid regardless of whether the arch
> overrides __smp_processor_id().

From include/linux/smp.h:

/*
 * Allow the architecture to differentiate between a stable and unstable read.
 * For example, x86 uses an IRQ-safe asm-volatile read for the unstable but a
 * regular asm read for the stable.
 */
#ifndef __smp_processor_id
#define __smp_processor_id(x) raw_smp_processor_id(x)
#endif

As far as I see, only x86 has a definition of __smp_processor_id.
So for archs like Powerpc, __smp_processor_id(), is always
defined as raw_smp_processor_id(). Right?

I would think debug_smp_processor_id() would be useful if __smp_processor_id()
is different from raw_smp_processor_id(). Do note debug_smp_processor_id() 
calls raw_smp_processor_id().

Or can you help me understand how debug_smp_processor_id() is useful if
__smp_processor_id() is defined as raw_smp_processor_id()?

> I think the stack trace here correctly identifies an incorrect use of
> smp_processor_id(), and the call site needs to be changed. Do you
> disagree?

Yes, the stack trace shows debug_smp_processor_id(). However, what I want
to understand is why we should even call debug_smp_processor_id() when our
__smp_processor_id() is defined as raw_smp_processor_id().
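
For reference, the CONFIG_DEBUG_PREEMPT branch in include/linux/smp.h that
this discussion (and the proposal quoted later in this thread) revolves
around, paraphrased from the kernel sources of that era:

  #ifdef CONFIG_DEBUG_PREEMPT
    extern unsigned int debug_smp_processor_id(void);
  # define smp_processor_id() debug_smp_processor_id()
  #else
  # define smp_processor_id() __smp_processor_id()
  #endif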

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted()

2021-09-22 Thread Srikar Dronamraju
* Nathan Lynch  [2021-09-20 22:12:13]:

> vcpu_is_preempted() can be used outside of preempt-disabled critical
> sections, yielding warnings such as:
> 
> BUG: using smp_processor_id() in preemptible [] code: 
> systemd-udevd/185
> caller is rwsem_spin_on_owner+0x1cc/0x2d0
> CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
> Call Trace:
> [c00012907ac0] [c0aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
> [c00012907b00] [c1371f70] check_preemption_disabled+0x150/0x160
> [c00012907b90] [c01e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
> [c00012907be0] [c01e1408] rwsem_down_write_slowpath+0x478/0x9a0
> [c00012907ca0] [c0576cf4] filename_create+0x94/0x1e0
> [c00012907d10] [c057ac08] do_symlinkat+0x68/0x1a0
> [c00012907d70] [c057ae18] sys_symlink+0x58/0x70
> [c00012907da0] [c002e448] system_call_exception+0x198/0x3c0
> [c00012907e10] [c000c54c] system_call_common+0xec/0x250
> 
> The result of vcpu_is_preempted() is always subject to invalidation by
> events inside and outside of Linux; it's just a best guess at a point in
> time. Use raw_smp_processor_id() to avoid such warnings.

Typically smp_processor_id() and raw_smp_processor_id() are the same, except
for the CONFIG_DEBUG_PREEMPT case. In the CONFIG_DEBUG_PREEMPT case,
smp_processor_id() is actually debug_smp_processor_id(), which does all the
checks.

I believe these checks in debug_smp_processor_id() are only valid for the x86
case (aka cases where they have __smp_processor_id() defined), i.e. x86 has a
different implementation of __smp_processor_id() for stable and unstable
reads.

> 
> Signed-off-by: Nathan Lynch 
> Fixes: ca3f969dcb11 ("powerpc/paravirt: Use is_kvm_guest() in 
> vcpu_is_preempted()")
> ---
>  arch/powerpc/include/asm/paravirt.h | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
> index bcb7b5f917be..e429aca566de 100644
> --- a/arch/powerpc/include/asm/paravirt.h
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -97,7 +97,14 @@ static inline bool vcpu_is_preempted(int cpu)
> 
>  #ifdef CONFIG_PPC_SPLPAR
>   if (!is_kvm_guest()) {
> - int first_cpu = cpu_first_thread_sibling(smp_processor_id());
> + int first_cpu;
> +
> + /*
> +  * This is only a guess at best, and this function may be
> +  * called with preemption enabled. Using raw_smp_processor_id()
> +  * does not damage accuracy.
> +  */
> + first_cpu = cpu_first_thread_sibling(raw_smp_processor_id());
> 
>   /*
>* Preemption can only happen at core granularity. This CPU
> -- 
> 2.31.1
> 

How about something like the below?

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 510519e8a1eb..8c669e8ceb73 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -256,12 +256,14 @@ static inline int get_boot_cpu_id(void)
  */
 #ifndef __smp_processor_id
 #define __smp_processor_id(x) raw_smp_processor_id(x)
-#endif
-
+#else
 #ifdef CONFIG_DEBUG_PREEMPT
   extern unsigned int debug_smp_processor_id(void);
 # define smp_processor_id() debug_smp_processor_id()
-#else
+#endif
+#endif
+
+#ifndef smp_processor_id
 # define smp_processor_id() __smp_processor_id()
 #endif
 

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v3] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-09-13 Thread Srikar Dronamraju
Scheduler expects the number of unique node distances to be available
at boot. It iterates over all pairs of nodes and records
node_distance() for that pair, and then calculates the set of unique
distances. As per PAPR, node distances for offline nodes are not
available. However, PAPR already exposes unique possible node
distances. Fake the offline node's distance_lookup_table entries so
that all possible node distances are updated.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Cc: kernel test robot 
Signed-off-by: Srikar Dronamraju 
---
Changelog:
v2: 
https://lore.kernel.org/linuxppc-dev/20210821102535.169643-4-sri...@linux.vnet.ibm.com/t/#u
- Updated changelog
- Updated variable names
- Rebased to newer branch
- tweaked the WARN if the depth is greater than sizeof(long)
- All of the above based on comments from Michael Ellerman

v1: 
https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-sri...@linux.vnet.ibm.com/t/#u
[ Fixed a missing prototype warning Reported-by: kernel test robot 
]

 arch/powerpc/mm/numa.c | 68 ++
 1 file changed, 68 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6f14c8fb6359..5ac9f20ebbc8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1079,6 +1079,73 @@ void __init dump_numa_cpu_topology(void)
}
 }
 
+/*
+ * Scheduler expects the number of unique node distances to be available
+ * at boot. It uses node distances to calculate this set of unique
+ * distances. On POWER, node distances for offline nodes are not
+ * available. However, POWER already knows the unique possible node
+ * distances. Fake the offline node's distance_lookup_table entries so
+ * that all possible node distances are updated.
+ */
+static void __init fake_update_distance_lookup_table(void)
+{
+   unsigned long distance_map;
+   int i, cur_depth, max_depth, node;
+
+   if (!numa_enabled)
+   return;
+
+   if (affinity_form != FORM1_AFFINITY)
+   return;
+
+   /*
+* distance_ref_points_depth lists the unique numa domains
+* available. However it ignores LOCAL_DISTANCE. So add +1
+* to get the actual number of unique distances.
+*/
+   max_depth = distance_ref_points_depth + 1;
+
+   if (max_depth > sizeof(distance_map)) {
+   WARN(1, "Max depth %d > %ld\n", max_depth, sizeof(distance_map));
+   max_depth = sizeof(distance_map);
+   }
+
+   bitmap_zero(&distance_map, max_depth);
+   bitmap_set(&distance_map, 0, 1);
+
+   for_each_online_node(node) {
+   int nd, distance;
+
+   if (node == first_online_node)
+   continue;
+
+   nd = __node_distance(node, first_online_node);
+   for (i = 0, distance = LOCAL_DISTANCE; i < max_depth; i++, distance *= 2) {
+   if (distance == nd) {
+   bitmap_set(&distance_map, i, 1);
+   break;
+   }
+   }
+   cur_depth = bitmap_weight(&distance_map, max_depth);
+   if (cur_depth == max_depth)
+   return;
+   }
+
+   for_each_node(node) {
+   if (node_online(node))
+   continue;
+
+   i = find_first_zero_bit(&distance_map, max_depth);
+   bitmap_set(&distance_map, i, 1);
+   while (i--)
+   distance_lookup_table[node][i] = node;
+
+   cur_depth = bitmap_weight(&distance_map, max_depth);
+   if (cur_depth == max_depth)
+   return;
+   }
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -1201,6 +1268,7 @@ void __init mem_topology_setup(void)
 */
numa_setup_cpu(cpu);
}
+   fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.27.0



Re: [PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-09-01 Thread Srikar Dronamraju
* Michael Ellerman  [2021-08-26 23:36:53]:

> Srikar Dronamraju  writes:
> > Scheduler expects unique number of node distances to be available at
> > boot.
> 
> I think it needs "the number of unique node distances" ?
> 
> > It uses node distance to calculate this unique node distances.
> 
> It iterates over all pairs of nodes and records node_distance() for that
> pair, and then calculates the set of unique distances.
> 
> > On POWER, node distances for offline nodes is not available. However,
> > POWER already knows unique possible node distances.
> 
> I think it would be more accurate to say PAPR rather than POWER there.
> It's PAPR that defines the way we determine distances and imposes that
> limitation.
> 

Okay, will do all the necessary modifications as suggested above.

> > Fake the offline node's distance_lookup_table entries so that all
> > possible node distances are updated.
> 
> Does this work if we have a single node offline at boot?
> 

It should.

> Say we start with:
> 
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
> 
> And node 2 is offline at boot. We can only initialise that nodes entries
> in the distance_lookup_table:
> 
>   while (i--)
>   distance_lookup_table[node][i] = node;
> 
> By filling them all with 2 that causes node_distance(2, X) to return the
> maximum distance for all other nodes X, because we won't break out of
> the loop in __node_distance():
> 
>   for (i = 0; i < distance_ref_points_depth; i++) {
>   if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>   break;
> 
>   /* Double the distance for each NUMA level */
>   distance *= 2;
>   }
> 
> If distance_ref_points_depth was 4 we'd return 160.

As you already know, distances 10, 20, ... are defined by powerpc form1
affinity. PAPR doesn't define actual distances; it only provides us the
associativity. If distance_ref_points_depth is 4 (distance_ref_points_depth
doesn't take local distance into consideration), the distances are
10, 20, 40, 80, 160.

> 
> That'd leave us with 3 unique distances at boot, 10, 20, 160.
> 

So if there are unique distances, then as per the current code they have to
be 10, 20, 40, 80, ... I don't see a way in which we could have a break in
the series, like having 160 without 80.

> But when node 2 comes online it might introduce more than 1 new distance
> value, eg. it could be that the actual distances are:
> 
> node distances:
> node   0   1   2
>   0:  10  20  40
>   1:  20  10  80
>   2:  40  80  10
> 
> ie. we now have 4 distances, 10, 20, 40, 80.
> 
> What am I missing?
> 

As I said above, I am not sure how we can have a break in the series.
If distance_ref_points_depth is 3, the distances have to be 10, 20, 40, 80,
at least for form1 affinity.

> > However this only needs to be done if the number of unique node
> > distances that can be computed for online nodes is less than the
> > number of possible unique node distances as represented by
> > distance_ref_points_depth.
> 
> Looking at a few machines they all have distance_ref_points_depth = 2.
> 
> So maybe that explains it, in practice we only see 10, 20, 40.
> 
> > When the node is actually onlined, distance_lookup_table will be
> > updated with actual entries.
> 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Nathan Lynch 
> > Cc: Michael Ellerman 
> > Cc: Ingo Molnar 
> > Cc: Peter Zijlstra 
> > Cc: Valentin Schneider 
> > Cc: Gautham R Shenoy 
> > Cc: Vincent Guittot 
> > Cc: Geetika Moolchandani 
> > Cc: Laurent Dufour 
> > Cc: kernel test robot 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >  arch/powerpc/mm/numa.c | 70 ++
> >  1 file changed, 70 insertions(+)
> >
> > Changelog:
> > v1: 
> > https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-sri...@linux.vnet.ibm.com/t/#u
> > [ Fixed a missing prototype warning Reported-by: kernel test robot 
> > ]
> >
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 3c124928a16d..0ee79a08c9e1 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
> > }
> >  }
> >  
> > +/*
> > + * Scheduler expects unique number of node distances to be available at
> > + * boot. It uses node distance to calculate this unique node distances. On
> > + * POWER, node distances for offline nodes is not available. However, POWER
> > + * al

[PATCH v5 1/5] powerpc/numa: Drop dbg in favour of pr_debug

2021-08-26 Thread Srikar Dronamraju
powerpc supported numa=debug, which is not documented. This option was
used to print early debug output. However, something more flexible can be
achieved by using CONFIG_DYNAMIC_DEBUG.

Hence drop dbg() (and numa=debug) in favour of pr_debug().

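As a usage note (editorial, not part of the patch): with
CONFIG_DYNAMIC_DEBUG, the pr_debug() output introduced here can be enabled
at runtime, e.g.

  echo 'file numa.c +p' > /sys/kernel/debug/dynamic_debug/control

or at boot via dyndbg="file numa.c +p" on the kernel command line.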
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Suggested-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..5e9b777a1151 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -40,9 +40,6 @@ static int numa_enabled = 1;
 
 static char *cmdline __initdata;
 
-static int numa_debug;
-#define dbg(args...) if (numa_debug) { printk(KERN_INFO args); }
-
 int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 struct pglist_data *node_data[MAX_NUMNODES];
@@ -79,7 +76,7 @@ static void __init setup_node_to_cpumask_map(void)
alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
 
/* cpumask_of_node() will now work */
-   dbg("Node to cpumask map for %u nodes\n", nr_node_ids);
+   pr_debug("Node to cpumask map for %u nodes\n", nr_node_ids);
 }
 
 static int __init fake_numa_create_new_node(unsigned long end_pfn,
@@ -123,7 +120,7 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
cmdline = p;
fake_nid++;
*nid = fake_nid;
-   dbg("created new fake_node with id %d\n", fake_nid);
+   pr_debug("created new fake_node with id %d\n", fake_nid);
return 1;
}
return 0;
@@ -141,7 +138,7 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   dbg("adding cpu %d to node %d\n", cpu, node);
+   pr_debug("adding cpu %d to node %d\n", cpu, node);
 
if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
@@ -152,7 +149,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   dbg("removing cpu %lu from node %d\n", cpu, node);
+   pr_debug("removing cpu %lu from node %d\n", cpu, node);
 
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
@@ -313,7 +310,7 @@ static int __init find_min_common_depth(void)
&distance_ref_points_depth);
 
if (!distance_ref_points) {
-   dbg("NUMA: ibm,associativity-reference-points not found.\n");
+   pr_debug("NUMA: ibm,associativity-reference-points not 
found.\n");
goto err;
}
 
@@ -321,7 +318,7 @@ static int __init find_min_common_depth(void)
 
if (firmware_has_feature(FW_FEATURE_OPAL) ||
firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
-   dbg("Using form 1 affinity\n");
+   pr_debug("Using form 1 affinity\n");
form1_affinity = 1;
}
 
@@ -719,7 +716,7 @@ static int __init parse_numa_properties(void)
return min_common_depth;
}
 
-   dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+   pr_debug("NUMA associativity depth for CPU/Memory: %d\n", 
min_common_depth);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -1014,9 +1011,6 @@ static int __init early_numa(char *p)
if (strstr(p, "off"))
numa_enabled = 0;
 
-   if (strstr(p, "debug"))
-   numa_debug = 1;
-
p = strstr(p, "fake=");
if (p)
cmdline = p + strlen("fake=");
@@ -1179,7 +1173,7 @@ static long vphn_get_associativity(unsigned long cpu,
 
switch (rc) {
case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
+   pr_debug("VPHN hcall succeeded. Reset polling...\n");
goto out;
 
case H_FUNCTION:
-- 
2.18.2



[PATCH v3 5/5] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-08-26 Thread Srikar Dronamraju
Scheduler expects the number of unique node distances to be available at
boot. It uses node distances to calculate this set of unique distances.
On POWER, node distances for offline nodes are not available. However,
POWER already knows the unique possible node distances. Fake the offline
node's distance_lookup_table entries so that all possible node
distances are updated.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Cc: kernel test robot 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 70 ++
 1 file changed, 70 insertions(+)

Changelog:
v1: 
https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-sri...@linux.vnet.ibm.com/t/#u
[ Fixed a missing prototype warning Reported-by: kernel test robot 
]

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87ade2f56f45..afa2ede4ac53 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -849,6 +849,75 @@ void __init dump_numa_cpu_topology(void)
}
 }
 
+/*
+ * Scheduler expects the number of unique node distances to be available
+ * at boot. It uses node distances to calculate this set of unique
+ * distances. On POWER, node distances for offline nodes are not
+ * available. However, POWER already knows the unique possible node
+ * distances. Fake the offline node's distance_lookup_table entries so
+ * that all possible node distances are updated.
+ */
+static void __init fake_update_distance_lookup_table(void)
+{
+   unsigned long distance_map;
+   int i, nr_levels, nr_depth, node;
+
+   if (!numa_enabled)
+   return;
+
+   if (!form1_affinity)
+   return;
+
+   /*
+* distance_ref_points_depth lists the unique numa domains
+* available. However it ignores LOCAL_DISTANCE. So add +1
+* to get the actual number of unique distances.
+*/
+   nr_depth = distance_ref_points_depth + 1;
+
+   WARN_ON(nr_depth > sizeof(distance_map));
+
+   bitmap_zero(&distance_map, nr_depth);
+   bitmap_set(&distance_map, 0, 1);
+
+   for_each_online_node(node) {
+   int nd, distance = LOCAL_DISTANCE;
+
+   if (node == first_online_node)
+   continue;
+
+   nd = __node_distance(node, first_online_node);
+   for (i = 0; i < nr_depth; i++, distance *= 2) {
+   if (distance == nd) {
+   bitmap_set(&distance_map, i, 1);
+   break;
+   }
+   }
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+
+   for_each_node(node) {
+   if (node_online(node))
+   continue;
+
+   i = find_first_zero_bit(&distance_map, nr_depth);
+   if (i >= nr_depth || i == 0) {
+   pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
+   return;
+   }
+
+   bitmap_set(&distance_map, i, 1);
+   while (i--)
+   distance_lookup_table[node][i] = node;
+
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -964,6 +1033,7 @@ void __init mem_topology_setup(void)
 */
numa_setup_cpu(cpu);
}
+   fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.18.2



[PATCH v3 4/5] powerpc/numa: Update cpu_cpu_map on CPU online/offline

2021-08-26 Thread Srikar Dronamraju
cpu_cpu_map holds all the CPUs in the DIE. However on PowerPC, when
onlining/offlining CPUs, this mask doesn't get updated; it is only
updated when CPUs are added/removed. So when both operations, CPU
online/offline and CPU add/remove, are done simultaneously, the
cpumasks end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  |  7 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b31b8ca3ae2e..d947a4fd753c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
+   unmap_cpu_from_node(cpu);
+
if (shared_caches)
mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 * This CPU will not be in the online mask yet so we need to manually
 * add it to it's own threa

[PATCH v3 3/5] powerpc/numa: Print debug statements only when required

2021-08-26 Thread Srikar Dronamraju
Currently, a debug message gets printed on every attempt to add (remove)
a CPU to (from) a node. However, this is redundant if the CPU has already
been added to (removed from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 9af38b1c618b..6655ecdeddef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -138,10 +138,10 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   pr_debug("adding cpu %d to node %d\n", cpu, node);
-
-   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   pr_debug("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -149,10 +149,9 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   pr_debug("removing cpu %lu from node %d\n", cpu, node);
-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   pr_debug("removing cpu %lu from node %d\n", cpu, node);
} else {
pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
-- 
2.18.2



[PATCH v5 2/5] powerpc/numa: convert printk to pr_xxx

2021-08-26 Thread Srikar Dronamraju
Convert the remaining printk calls to pr_xxx.
One advantage is that all prints will now have the prefix "numa:" from
pr_fmt().

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
[ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ]
Suggested-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e9b777a1151..9af38b1c618b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -154,8 +154,7 @@ static void unmap_cpu_from_node(unsigned long cpu)
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_warn("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
@@ -326,8 +325,7 @@ static int __init find_min_common_depth(void)
depth = of_read_number(distance_ref_points, 1);
} else {
if (distance_ref_points_depth < 2) {
-   printk(KERN_WARNING "NUMA: "
-   "short ibm,associativity-reference-points\n");
+   pr_warn("short ibm,associativity-reference-points\n");
goto err;
}
 
@@ -339,7 +337,7 @@ static int __init find_min_common_depth(void)
 * MAX_DISTANCE_REF_POINTS domains.
 */
if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
-   printk(KERN_WARNING "NUMA: distance array capped at "
+   pr_warn("distance array capped at "
"%d entries\n", MAX_DISTANCE_REF_POINTS);
distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
}
@@ -701,7 +699,7 @@ static int __init parse_numa_properties(void)
unsigned long i;
 
if (numa_enabled == 0) {
-   printk(KERN_WARNING "NUMA disabled by user\n");
+   pr_warn("disabled by user\n");
return -1;
}
 
@@ -716,7 +714,7 @@ static int __init parse_numa_properties(void)
return min_common_depth;
}
 
-   pr_debug("NUMA associativity depth for CPU/Memory: %d\n", 
min_common_depth);
+   pr_debug("associativity depth for CPU/Memory: %d\n", min_common_depth);
 
/*
 * Even though we connect cpus to numa domains later in SMP
@@ -808,10 +806,8 @@ static void __init setup_nonnuma(void)
unsigned int nid = 0;
int i;
 
-   printk(KERN_DEBUG "Top of RAM: 0x%lx, Total RAM: 0x%lx\n",
-  top_of_ram, total_ram);
-   printk(KERN_DEBUG "Memory hole size: %ldMB\n",
-  (top_of_ram - total_ram) >> 20);
+   pr_debug("Top of RAM: 0x%lx, Total RAM: 0x%lx\n", top_of_ram, 
total_ram);
+   pr_debug("Memory hole size: %ldMB\n", (top_of_ram - total_ram) >> 20);
 
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
fake_numa_create_new_node(end_pfn, &nid);
-- 
2.18.2



[PATCH v3 0/5] Updates to powerpc for robust CPU online/offline

2021-08-26 Thread Srikar Dronamraju
Changelog v2 -> v3:
v2: 
https://lore.kernel.org/linuxppc-dev/20210821102535.169643-1-sri...@linux.vnet.ibm.com/t/#u
Add patch 1: to drop dbg and numa=debug (Suggested by Michael Ellerman)
Add patch 2: to convert printk to pr_xxx (Suggested by Michael Ellerman)
Use pr_warn instead of pr_debug(WARNING) (Suggested by Laurent)

Changelog v1 -> v2:
Moved patch to this series: powerpc/numa: Fill distance_lookup_table for 
offline nodes
fixed a missing prototype warning

Scheduler expects the number of unique node distances to be available
at boot. It uses node distances to calculate this set of unique
distances. On Power servers, node distances for offline nodes are not
available. However, Power servers already know the unique possible node
distances. Fake the offline node's distance_lookup_table entries so
that all possible node distances are updated.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system has only two nodes online at boot, the
distance info from numactl will look like:
node distances:
node   0   1
  0:  10  20
  1:  20  10

With the faked numa distance at boot, the node distance table will look
like
node   0   1   2
  0:  10  20  40
  1:  20  10  40
  2:  40  40  10

The actual distance will be populated once the nodes are onlined.

Also, when simultaneously running CPU online/offline with CPU
add/remove in a loop, we see WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

This was because cpu_cpu_mask() was not getting updated on CPU
online/offline, but only on add/remove of CPUs. Other cpumasks get
updated both on CPU online/offline and on add/remove. Update
cpu_cpu_mask() on CPU online/offline too.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (5):
  powerpc/numa: Drop dbg in favour of pr_debug
  powerpc/numa: convert printk to pr_xxx
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cp

[PATCH v2 3/3] powerpc/smp: Enable CACHE domain for shared processor

2021-08-26 Thread Srikar Dronamraju
Currently the CACHE domain is not enabled on shared processor mode PowerVM
LPARs. On PowerVM systems, 'ibm,thread-group' device-tree property 2
under the cpu device node indicates which CPUs share L2-cache. However,
'ibm,thread-group' device-tree property 2 is a relatively new property.
In the absence of 'ibm,thread-group' property 2, the 'l2-cache' device
property under the cpu device node could help the system identify CPUs
sharing L2-cache. However, this property is not exposed by pHyp in shared
processor mode configurations.

In the absence of properties that inform the OS about which CPUs share
L2-cache, fall back on the core boundary.

Here are some stats from a Power9 shared LPAR with the changes.

$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  32
On-line CPU(s) list: 0-31
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   3
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   16-23
NUMA node1 CPU(s):   0-15,24-31
Physical sockets:2
Physical chips:  1
Physical cores/chip: 10

$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA

After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA

$ awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/,//g'
Before
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain1 00ff
domain1 ff00
domain2 

After
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain1 00ff
domain1 ff00
domain1 00ff
domain1 ff00
domain2 ff00
domain2 
domain3 

(Lower is better)
perf stat -a -r 5 -n perf bench sched pipe  | tail -n 2
Before
   153.798 +- 0.142 seconds time elapsed  ( +-  0.09% )

After
   111.545 +- 0.652 seconds time elapsed  ( +-  0.58% )

which is an improvement of 27.47%

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index b5bd5a4708d0..b31b8ca3ae2e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1365,7 +1365,7 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
l2_cache = cpu_to_l2cache(cpu);
if (!l2_cache || !*mask) {
/* Assume only core siblings share cache with this CPU */
-   for_each_cpu(i, submask_fn(cpu))
+   for_each_cpu(i, cpu_sibling_mask(cpu))
set_cpus_related(cpu, i, cpu_l2_cache_mask);
 
return false;
-- 
2.18.2



[PATCH v2 2/3] powerpc/smp: Update cpu_core_map on all PowerPc systems

2021-08-26 Thread Srikar Dronamraju
lscpu() uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.

While optimizing the powerpc bootup path in commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask"), it was found that
updating cpu_core_mask() ended up taking a lot of time. It was thought
that on PowerPC, cpu_core_mask() would always be the same as
cpu_cpu_mask(), i.e. the number of sockets will always be equal to the
number of nodes. As an optimization, cpu_core_mask() was made a
snapshot of cpu_cpu_mask().

However that was found to be false with PowerPC KVM guests, where each
node could have more than one socket. So with commit c47f892d7aa6
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id, but in an optimized way using some mask manipulations
and chip_id caching.

However non-PowerNV and non-pseries KVM guests (i.e. those not
implementing cpu_to_chip_id()) continued to use a copy of
cpu_cpu_mask().

There are two issues that were noticed on such systems
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):   3<--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

2. Currently cpu_cpu_mask is updated when a core is added/removed.
However it is not updated on SMT mode switching or when CPUs are
explicitly offlined, whereas all other percpu masks are updated to
ensure only active/online CPUs are in the masks.
This results in build_sched_domains() traces since there will be CPUs
in cpu_cpu_mask() that are not present in the SMT / CACHE / MC /
NUMA domains. A loop of threads running SMT mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be updated at SMT mode switch.

This will have an impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask. Different CPUs within the same socket will
end up having different cpu_core_masks since they are snapshots taken
at different points of time. This means lscpu will start reporting
many more sockets than the actual number of sockets / nodes / chips.

Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
   cpu_cpu_mask is updated. This would be a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
   pointing to cpu_cpu_mask. However a percpu cpumask pointer is frowned
   upon, and we need a clean way to handle the PowerPC KVM guest case,
   which is not a snapshot.
C. Update cpu_core_masks on all PowerPC systems, like in PowerPC KVM
guests, using mask manipulations. This approach is relatively simple
and unifies with the existing code.
D. On top of C, we could also resurrect get_physical_package_id, which
   could return a nid for the said CPU. However this is not needed at
   this time.

Option C is the preferred approach for now.

While this is somewhat a revert of commit 4ca234a9cbd7 ("powerpc/smp:
Stop updating cpu_core_mask"), it is not a plain revert:

1. Plain revert has some conflicts
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if chip_id doesn't exist.

This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.
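
A hedged userspace sketch of option C's idea (the chip_id[] values and
mask encoding are illustrative assumptions, not the kernel code): every
CPU of a chip derives one identical core mask from a shared chip id, so
the masks can no longer diverge into per-CPU snapshots.

#include <stdio.h>

#define NR_CPUS 8

int main(void)
{
	int chip_id[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
	unsigned long core_mask[NR_CPUS] = { 0 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		for (int i = 0; i < NR_CPUS; i++)
			if (chip_id[i] == chip_id[cpu])
				core_mask[cpu] |= 1UL << i;

	/* All CPUs of chip 0 report mask 0x0f and all CPUs of chip 1
	 * report 0xf0, so tools like lscpu count exactly two sockets. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d core mask: 0x%02lx\n", cpu, core_mask[cpu]);
	return 0;
}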

With the fix, lscpu o/p would be
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):   2 <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Signed-off-by: Srikar Dronamraju 
---
Changelog : v1 -> v2:
v1:https://lore.kernel.org/linuxppc-dev/20210821092419.167454-3-sri...@linux.vnet.ibm.com/t/#u
Handled comments from Michael Ellerman
[ updated changelog to make it more generic powerpc issue ]

 arch/powerpc/kernel/smp.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

[PATCH v2 1/3] powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2

2021-08-26 Thread Srikar Dronamraju
Aneesh reported a crash with a fairly recent upstream kernel when
booting a kernel whose command line was appended with nr_cpus=2

1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c8a67bd0]
pc: c002557c: cpu_to_chip_id+0x3c/0x100
lr: c0058380: start_secondary+0x460/0xb00
sp: c8a67e70
   msr: 80001033
   dar: 10
 dsisr: 8
  current = 0xc891bb00
  paca= 0xc018ff981f80   irqmask: 0x03   irq_happened: 0x01
pid   = 0, comm = swapper/1
Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc 
(Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) 
#433 SMP Tue May 25 02:38:49 CDT 2021
1:mon> t
[link register   ] c0058380 start_secondary+0x460/0xb00
[c8a67e70] c8a67eb0 (unreliable)
[c8a67eb0] c00589d4 start_secondary+0xab4/0xb00
[c8a67f90] c000c654 start_secondary_prolog+0x10/0x14

Current code assumes that num_possible_cpus() is always greater than
threads_per_core. However this may not be true when using nr_cpus=2 or
similar options. Handle the case where num_possible_cpus() is not an
exact multiple of threads_per_core.
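
A hedged sketch of the arithmetic behind the fix (DIV_ROUND_UP mirrors
the kernel macro; the CPU counts are illustrative): with nr_cpus=2 and
8 threads per core, plain integer division gives idx = 2 / 8 = 0, so no
chip_id_lookup_table entries are allocated and the secondary CPU
dereferences an empty table. DIV_ROUND_UP yields at least one entry per
(partial) core.

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	int num_possible_cpus = 2, threads_per_core = 8;

	/* Old computation: truncates to 0 entries. */
	printf("plain division: idx = %d\n",
	       num_possible_cpus / threads_per_core);
	/* Fixed computation: rounds up to 1 entry. */
	printf("DIV_ROUND_UP:   idx = %d\n",
	       DIV_ROUND_UP(num_possible_cpus, threads_per_core));
	return 0;
}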

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
Reported-by: Aneesh Kumar K.V 
Debugged-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
Changelog: v1 -> v2:
v1: - 
https://lore.kernel.org/linuxppc-dev/20210821092419.167454-2-sri...@linux.vnet.ibm.com/t/#u
Handled comment from Gautham Shenoy
[ Updated to use DIV_ROUND_UP instead of max to handle more situations ]

 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..bf11b3c4eb28 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1074,7 +1074,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
}
 
if (cpu_to_chip_id(boot_cpuid) != -1) {
-   int idx = num_possible_cpus() / threads_per_core;
+   int idx = DIV_ROUND_UP(num_possible_cpus(), threads_per_core);
 
/*
 * All threads of a core will all belong to the same core,
-- 
2.18.2



[PATCH v2 0/3] powerpc/smp: Misc fixes

2021-08-26 Thread Srikar Dronamraju
Changelog : v1 -> v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20210821092419.167454-1-sri...@linux.vnet.ibm.com/t/#u
[ patch 1: Updated to use DIV_ROUND_UP instead of max to handle more situations ]
[ patch 2: updated changelog to make it more generic powerpc issue ]

The 1st patch fixes a regression which causes a crash when booted with
nr_cpus=2.

The 2nd patch fixes a regression where lscpu on PowerVM reports more
sockets than are available.

The 3rd patch makes cpu_sibling_mask the fallback when L2-cache
properties are not explicitly exposed.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 

Srikar Dronamraju (3):
  powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2
  powerpc/smp: Update cpu_core_map on all PowerPc systems
  powerpc/smp: Enable CACHE domain for shared processor

 arch/powerpc/kernel/smp.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

-- 
2.18.2



Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required

2021-08-25 Thread Srikar Dronamraju
* Michael Ellerman  [2021-08-25 23:01:42]:

> Srikar Dronamraju  writes:
> > * Laurent Dufour  [2021-08-23 11:21:33]:
> >> On 21/08/2021 at 12:25, Srikar Dronamraju wrote:
> >> > Currently, a debug message gets printed every time an attempt is made
> >> > to add (remove) a CPU. However this is redundant if the CPU is already
> >> > added to (removed from) the node.
> >> > 
> >
> > Its a fair point.
> >
> > Michael,
> >
> > Do you want me to resend this patch with s/pr_err/pr_warn for the above
> > line?
> 
> I think what I'd prefer is if we stopped using this custom dbg() stuff
> in numa.c, and cleaned up all the messages to use pr_xxx().
> 
> Those debug statements only appear if you boot with numa=debug, which is
> not documented anywhere and I had completely forgotten existed TBH.
> 
> These days there's CONFIG_DYNAMIC_DEBUG for turning on/off messages,
> which is much more flexible.
> 
> So can we drop the numa=debug bits, and convert all the dbg()s to
> pr_debug().
> 
> And then do a pass converting all the printk("NUMA: ") to pr_xxx() which
> will get "numa:" from pr_fmt().
> 
> cheers

Okay, will do the needful.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [FSL P50x0] lscpu reports wrong values since the RC1 of kernel 5.13

2021-08-25 Thread Srikar Dronamraju
* Christian Zigotzky  [2021-08-16 14:29:21]:


Hi Christian,

> I tested the RC6 of kernel 5.14 today and unfortunately the issue still
> exists. We have figured out that only P5040 SoCs are affected. [1]
> P5020 SoCs display the correct values.
> Please check the CPU changes in the PowerPC updates 5.13-1 and 5.13-2.
>

Thanks for reporting the issue.
Would it be possible to try
https://lore.kernel.org/linuxppc-dev/20210821092419.167454-3-sri...@linux.vnet.ibm.com/t/#u

If the above patch is not helping, then can you please collect the output of

cat /sys/devices/system/cpu/cpu*/topology/core_siblings

Were all the CPUs online at the time of boot?
Did we do any CPU online/offline operations post boot?

If we did CPU online/offline, can you capture the output just after
boot along with the lscpu output?

Since this is being seen only on a few SoCs, can you summarize the
difference between P5040 and P5020?
> 
> [1] https://forum.hyperion-entertainment.com/viewtopic.php?p=53775#p53775
> 
> 
> On 09 August 2021 um 02:37 pm, Christian Zigotzky wrote:
> > Hi All,
> > 
> > Lscpu reports wrong values [1] since the RC1 of kernel 5.13 on my FSL
> > P5040 Cyrus+ board (A-EON AmigaOne X5000). [2]
> > The differences are:
> > 
> > Since the RC1 of kernel 5.13 (wrong values):
> > 
> > Core(s) per socket: 1
> > Socket(s): 3
> > 

I know that the socket count was off by 1, but I can't explain how it's
off by 2 here.

> > Before (correct values):
> > 
> > Core(s) per socket: 4
> > Socket(s): 1
> > 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2

2021-08-23 Thread Srikar Dronamraju
* Gautham R Shenoy  [2021-08-23 11:41:22]:

> On Sat, Aug 21, 2021 at 02:54:17PM +0530, Srikar Dronamraju wrote:
> > Aneesh reported a crash with a fairly recent upstream kernel when
> > booting kernel whose commandline was appended with nr_cpus=2
> > 
> > 1:mon> e
> > cpu 0x1: Vector: 300 (Data Access) at [c8a67bd0]
> > pc: c002557c: cpu_to_chip_id+0x3c/0x100
> > lr: c0058380: start_secondary+0x460/0xb00
> > sp: c8a67e70
> >msr: 80001033
> >dar: 10
> >  dsisr: 8
> >   current = 0xc891bb00
> >   paca= 0xc018ff981f80   irqmask: 0x03   irq_happened: 0x01
> > pid   = 0, comm = swapper/1
> > Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc 
> > (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 
> > 2.34) #433 SMP Tue May 25 02:38:49 CDT 2021
> > 1:mon> t
> > [link register   ] c0058380 start_secondary+0x460/0xb00
> > [c8a67e70] c8a67eb0 (unreliable)
> > [c8a67eb0] c00589d4 start_secondary+0xab4/0xb00
> > [c8a67f90] c000c654 start_secondary_prolog+0x10/0x14
> > 
> > Current code assumes that num_possible_cpus() is always greater than
> > threads_per_core. However this may not be true when using nr_cpus=2 or
> > similar options. Handle the case where num_possible_cpus is smaller than
> > threads_per_core.
> >
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Aneesh Kumar K.V 
> > Cc: Nathan Lynch 
> > Cc: Michael Ellerman 
> > Cc: Ingo Molnar 
> > Cc: Peter Zijlstra 
> > Cc: Valentin Schneider 
> > Cc: Gautham R Shenoy 
> > Cc: Vincent Guittot 
> > Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
> > Reported-by: Aneesh Kumar K.V 
> > Debugged-by: Michael Ellerman 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >  arch/powerpc/kernel/smp.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 6c6e4d934d86..3d6874fe1937 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1074,7 +1074,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> > }
> > 
> > if (cpu_to_chip_id(boot_cpuid) != -1) {
> > -   int idx = num_possible_cpus() / threads_per_core;
> > +   int idx = max((int)num_possible_cpus() / threads_per_core, 1);
> 
> I think this code was assuming that num_possible_cpus() is a multiple
> of threads_per_core.
> 
> So, on a system with threads_per_core=8, if we pass nr_cpus=10, we
> will still get idx=1. Thus, we will allocate only one entry in
> chip_id_lookup_table[] even though there are two cores and
> chip_id_lookup_table[] is expected to have one entry per core.
> 
> Is this a valid scenario ? If yes, should we use
> 
>idx = DIV_ROUND_UP(num_possible_cpus, threads_per_core);
> 

Yes, this can be done.
Will resend this patch with this change.

> 
> > 
> > /*
> >  * All threads of a core will all belong to the same core,
> > -- 
> > 2.18.2
> > 
> 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 1/3] powerpc/numa: Print debug statements only when required

2021-08-23 Thread Srikar Dronamraju
* Laurent Dufour  [2021-08-23 11:21:33]:

> On 21/08/2021 at 12:25, Srikar Dronamraju wrote:
> > Currently, a debug message gets printed every time an attempt is made
> > to add (remove) a CPU. However this is redundant if the CPU is already
> > added to (removed from) the node.
> > 
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Nathan Lynch 
> > Cc: Michael Ellerman 
> > Cc: Ingo Molnar 
> > Cc: Peter Zijlstra 
> > Cc: Valentin Schneider 
> > Cc: Gautham R Shenoy 
> > Cc: Vincent Guittot 
> > Cc: Geetika Moolchandani 
> > Cc: Laurent Dufour 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >   arch/powerpc/mm/numa.c | 11 +--
> >   1 file changed, 5 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index f2bf98bdcea2..fbe03f6840e0 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
> >   {
> > update_numa_cpu_lookup_table(cpu, node);
> > -   dbg("adding cpu %d to node %d\n", cpu, node);
> > -   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
> > +   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
> > +   dbg("adding cpu %d to node %d\n", cpu, node);
> > cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
> > +   }
> >   }
> >   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
> > @@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
> >   {
> > int node = numa_cpu_lookup_table[cpu];
> > -   dbg("removing cpu %lu from node %d\n", cpu, node);
> > -
> > if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
> > cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
> > +   dbg("removing cpu %lu from node %d\n", cpu, node);
> > } else {
> > -   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
> > -  cpu, node);
> > +   pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
> 
> Would pr_warn() be more appropriate here (or removing the "WARNING" 
> statement)?

Its a fair point.

Michael,

Do you want me to resend this patch with s/pr_err/pr_warn for the above
line?

> 
> > }
> >   }
> >   #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 0/3] Updates to powerpc for robust CPU online/offline

2021-08-23 Thread Srikar Dronamraju
* Peter Zijlstra  [2021-08-23 10:33:30]:

> On Sat, Aug 21, 2021 at 03:55:32PM +0530, Srikar Dronamraju wrote:
> > The scheduler expects all unique node distances to be available
> > at boot; it uses the node distance table to compute this set of
> > unique distances. On Power Servers, node distances for offline
> > nodes are not available. However, Power Servers already know all
> > possible unique node distances. Fake the offline nodes'
> > distance_lookup_table entries so that all possible node distances
> > are represented.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However the same system when only two nodes are online at boot, then
> > distance info from numactl will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > With the faked numa distance at boot, the node distance table will look
> > like
> > node   0   1   2
> >   0:  10  20  40
> >   1:  20  10  40
> >   2:  40  40  10
> > 
> > The actual distance will be populated once the nodes are onlined.
> 
> How did you want all this merged? I picked up Valentin's patch, do you
> want me to pick up these PowerPC patches in the same tree, or do you
> want to route them seperately?

While both (the patch you accepted and this series) together help solve the
problem, I think there is no hard dependency between the two. Hence I would
think it should be okay to go through the powerpc tree.


-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v2 3/3] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-08-21 Thread Srikar Dronamraju
The scheduler expects all unique node distances to be available at
boot; it uses the node distance table to compute this set of unique
distances. On POWER, node distances for offline nodes are not
available. However, POWER already knows all possible unique node
distances. Fake the offline nodes' distance_lookup_table entries so
that all possible node distances are represented.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.
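
For illustration, a hedged userspace sketch of the counting described
above (the distance values and nr_depth are illustrative assumptions,
not taken from a real device tree): record which of the nr_depth
possible distance levels (LOCAL_DISTANCE, 2*LOCAL_DISTANCE,
4*LOCAL_DISTANCE, ...) are covered by the online nodes; any level still
missing must be faked via an offline node.

#include <stdio.h>

#define LOCAL_DISTANCE 10

int main(void)
{
	int nr_depth = 3;             /* distance_ref_points_depth + 1 */
	int online_nd[] = { 10, 20 }; /* distances seen at boot        */
	unsigned long distance_map = 0;
	int nr_levels = 0;

	for (unsigned int n = 0; n < sizeof(online_nd) / sizeof(online_nd[0]); n++) {
		int distance = LOCAL_DISTANCE;

		for (int lvl = 0; lvl < nr_depth; lvl++, distance *= 2)
			if (online_nd[n] == distance)
				distance_map |= 1UL << lvl;
	}

	for (int lvl = 0; lvl < nr_depth; lvl++)
		nr_levels += (distance_map >> lvl) & 1;

	/* 2 of 3 levels covered: distance 40 is missing, so one offline
	 * node gets faked distance_lookup_table entries for that level. */
	printf("levels covered: %d of %d\n", nr_levels, nr_depth);
	return 0;
}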

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Cc: kernel test robot 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 70 ++
 1 file changed, 70 insertions(+)

Changelog:
v1: 
https://lore.kernel.org/linuxppc-dev/20210701041552.112072-3-sri...@linux.vnet.ibm.com/t/#u
[ Fixed a missing prototype warning Reported-by: kernel test robot ]

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3c124928a16d..0ee79a08c9e1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -856,6 +856,75 @@ void __init dump_numa_cpu_topology(void)
}
 }
 
+/*
+ * The scheduler expects all unique node distances to be available at
+ * boot; it uses the node distance table to compute this set of unique
+ * distances. On POWER, node distances for offline nodes are not
+ * available. However, POWER already knows all possible unique node
+ * distances. Fake the offline nodes' distance_lookup_table entries so
+ * that all possible node distances are represented.
+ */
+static void __init fake_update_distance_lookup_table(void)
+{
+   unsigned long distance_map;
+   int i, nr_levels, nr_depth, node;
+
+   if (!numa_enabled)
+   return;
+
+   if (!form1_affinity)
+   return;
+
+   /*
+* distance_ref_points_depth lists the unique numa domains
+* available. However it ignores LOCAL_DISTANCE, so add 1
+* to get the actual number of unique distances.
+*/
+   nr_depth = distance_ref_points_depth + 1;
+
+   WARN_ON(nr_depth > sizeof(distance_map));
+
+   bitmap_zero(&distance_map, nr_depth);
+   bitmap_set(&distance_map, 0, 1);
+
+   for_each_online_node(node) {
+   int nd, distance = LOCAL_DISTANCE;
+
+   if (node == first_online_node)
+   continue;
+
+   nd = __node_distance(node, first_online_node);
+   for (i = 0; i < nr_depth; i++, distance *= 2) {
+   if (distance == nd) {
+   bitmap_set(&distance_map, i, 1);
+   break;
+   }
+   }
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+
+   for_each_node(node) {
+   if (node_online(node))
+   continue;
+
+   i = find_first_zero_bit(&distance_map, nr_depth);
+   if (i >= nr_depth || i == 0) {
+   pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
+   return;
+   }
+
+   bitmap_set(&distance_map, i, 1);
+   while (i--)
+   distance_lookup_table[node][i] = node;
+
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+}
+
 /* Initialize NODE_DATA for a node on the local memory */
 static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 {
@@ -971,6 +1040,7 @@ void __init mem_topology_setup(void)
 */
numa_setup_cpu(cpu);
}
+   fake_update_distance_lookup_table();
 }
 
 void __init initmem_init(void)
-- 
2.18.2



[PATCH v2 2/3] powerpc/numa: Update cpu_cpu_map on CPU online/offline

2021-08-21 Thread Srikar Dronamraju
cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC this
mask doesn't get updated when CPUs are onlined/offlined; it is only
updated when CPUs are added/removed. So when both operations, CPU
online/offline and CPU add/remove, are done simultaneously, the
cpumasks end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER:
0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.
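
A hedged userspace sketch (simplified stand-ins, not the kernel's
cpumask API) of the invariant this restores: the node mask, like the
SMT/CACHE/MC masks, must contain only online CPUs, so that
build_sched_domains() never finds a CPU in cpumask_of_node() that is
absent from every lower-level domain.

#include <stdio.h>

#define NR_CPUS 8

static unsigned long node_mask; /* stand-in for node_to_cpumask_map[] */

static void map_cpu_to_node(int cpu)     { node_mask |=  1UL << cpu; }
static void unmap_cpu_from_node(int cpu) { node_mask &= ~(1UL << cpu); }

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		map_cpu_to_node(cpu);  /* CPUs onlined at boot */

	unmap_cpu_from_node(3);        /* CPU 3 taken offline  */

	/* Prints 0xf7: CPU 3 is gone from the node mask, matching the
	 * other per-CPU topology masks. */
	printf("node mask: 0x%02lx\n", node_mask);
	return 0;
}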

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  |  7 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 47b15f31cc29..5ede4b1c7473 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
+   unmap_cpu_from_node(cpu);
+
if (shared_caches)
mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 * This CPU will not be in the online mask yet so we need to manually
 * add it to its own thread sibling mask.

[PATCH v2 1/3] powerpc/numa: Print debug statements only when required

2021-08-21 Thread Srikar Dronamraju
Currently, a debug message gets printed every time an attempt is made
to add (remove) a CPU. However this is redundant if the CPU is already
added to (removed from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..fbe03f6840e0 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   dbg("adding cpu %d to node %d\n", cpu, node);
 
-   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   dbg("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   dbg("removing cpu %lu from node %d\n", cpu, node);
-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   dbg("removing cpu %lu from node %d\n", cpu, node);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
-- 
2.18.2



[PATCH v2 0/3] Updates to powerpc for robust CPU online/offline

2021-08-21 Thread Srikar Dronamraju
The scheduler expects all unique node distances to be available
at boot; it uses the node distance table to compute this set of
unique distances. On Power Servers, node distances for offline
nodes are not available. However, Power Servers already know all
possible unique node distances. Fake the offline nodes'
distance_lookup_table entries so that all possible node distances
are represented.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However the same system when only two nodes are online at boot, then
distance info from numactl will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

With the faked numa distance at boot, the node distance table will look
like
node   0   1   2
  0:  10  20  40
  1:  20  10  40
  2:  40  40  10

The actual distance will be populated once the nodes are onlined.

Also, when simultaneously running CPU online/offline with CPU
add/remove in a loop, we see WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

This was because cpu_cpu_mask() was not getting updated on CPU
online/offline; it was only updated on add/remove of CPUs. Other
cpumasks get updated both on CPU online/offline and on add/remove.
Update cpu_cpu_mask() on CPU online/offline too.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (3):
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline
  powerpc/numa: Fill distance_lookup_table for offline nodes

 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +
 arch/powerpc/mm/numa.c  | 88 +
 3 files changed, 92 insertions(+), 11 deletions(-)

-- 
2.18.2



[PATCH 3/3] powerpc/smp: Enable CACHE domain for shared processor

2021-08-21 Thread Srikar Dronamraju
Currently the CACHE domain is not enabled on shared processor mode PowerVM
LPARs. On PowerVM systems, 'ibm,thread-group' device-tree property 2
under the cpu-device-node indicates which CPUs share the L2-cache.
However 'ibm,thread-group' device-tree property 2 is a relatively new
property. In its absence, the 'l2-cache' device property under the
cpu-device-node could help the system identify CPUs sharing the L2-cache.
However this property is not exposed by PHYP in shared processor mode
configurations.

In the absence of properties that inform the OS about which CPUs share
the L2-cache, fall back on the core boundary.

Here are some stats from Power9 shared LPAR with the changes.

$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  32
On-line CPU(s) list: 0-31
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   3
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
NUMA node0 CPU(s):   16-23
NUMA node1 CPU(s):   0-15,24-31
Physical sockets:2
Physical chips:  1
Physical cores/chip: 10

Before patch
$ grep -r . /sys/kernel/debug/sched/domains/cpu0/domain*/name
Before
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:NUMA

After
/sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
/sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
/sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
/sys/kernel/debug/sched/domains/cpu0/domain3/name:NUMA

$ awk '/domain/{print $1, $2}' /proc/schedstat | sort -u | sed -e 's/,//g'
Before
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain1 00ff
domain1 ff00
domain2 

After
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain0 0055
domain0 00aa
domain0 5500
domain0 aa00
domain1 00ff
domain1 ff00
domain1 00ff
domain1 ff00
domain2 ff00
domain2 
domain3 

(Lower is better)
perf stat -a -r 5 -n perf bench sched pipe  | tail -n 2
Before
   153.798 +- 0.142 seconds time elapsed  ( +-  0.09% )

After
   111.545 +- 0.652 seconds time elapsed  ( +-  0.58% )

which is an improvement of 27.47%

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3d26d3c61e94..47b15f31cc29 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1365,7 +1365,7 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
l2_cache = cpu_to_l2cache(cpu);
if (!l2_cache || !*mask) {
/* Assume only core siblings share cache with this CPU */
-   for_each_cpu(i, submask_fn(cpu))
+   for_each_cpu(i, cpu_sibling_mask(cpu))
set_cpus_related(cpu, i, cpu_l2_cache_mask);
 
return false;
-- 
2.18.2



[PATCH 2/3] powerpc/smp: Update cpu_core_map on PowerVM lpars.

2021-08-21 Thread Srikar Dronamraju
lscpu() uses core_siblings to list the number of sockets in the
system. core_siblings is set using topology_core_cpumask.

While optimizing the powerpc bootup path in commit 4ca234a9cbd7
("powerpc/smp: Stop updating cpu_core_mask"), it was found that
updating cpu_core_mask() ended up taking a lot of time. It was thought
that on PowerPC, cpu_core_mask() would always be the same as
cpu_cpu_mask(), i.e. the number of sockets will always be equal to the
number of nodes. As an optimization, cpu_core_mask() was made a
snapshot of cpu_cpu_mask().

However that was found to be false with PowerPC KVM guests, where each
node could have more than one socket. So with commit c47f892d7aa6
("powerpc/smp: Reintroduce cpu_core_mask"), cpu_core_mask was updated
based on chip_id, but in an optimized way using some mask manipulations
and chip_id caching.

However PowerVM based LPARs (and others not implementing
cpu_to_chip_id()) continued to use a copy of cpu_cpu_mask().

There are two issues that were noticed on PowerVM LPARs.
1. lscpu would report one extra socket.
On a IBM,9009-42A (aka zz system) which has only 2 chips/ sockets/
nodes, lscpu would report
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):   3<--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

2. Currently cpu_cpu_mask is updated when a core is added/removed.
However it is not updated on SMT mode switching or when CPUs are
explicitly offlined, whereas all other percpu masks are updated to
ensure only active/online CPUs are in the masks.
This results in build_sched_domains() traces since there will be CPUs
in cpu_cpu_mask() that are not present in the SMT / CACHE / MC /
NUMA domains. A loop of threads running SMT mode switching and core
add/remove will soon show this trace.
Hence cpu_cpu_mask has to be updated at SMT mode switch.

This will have an impact on cpu_core_mask(). cpu_core_mask() is a
snapshot of cpu_cpu_mask. Different CPUs within the same socket will
end up having different cpu_core_masks since they are snapshots taken
at different points of time. This means lscpu will start reporting
many more sockets than the actual number of sockets / nodes / chips.

Different ways to handle this problem:
A. Update the snapshot aka cpu_core_mask for all CPUs whenever
   cpu_cpu_mask is updated. This would be a non-optimal solution.
B. Instead of a cpumask_var_t, make cpu_core_map a cpumask pointer
   pointing to cpu_cpu_mask. However a percpu cpumask pointer is frowned
   upon, and we need a clean way to handle the PowerPC KVM guest case,
   which is not a snapshot.
C. Update cpu_core_masks in PowerVM like in PowerPC KVM guests using
   mask manipulations. This approach is relatively simple and unifies
   with the existing code.
D. On top of C, we could also resurrect get_physical_package_id, which
   could return a nid for the said CPU. However this is not needed at
   this time.

Option C is the preferred approach for now.

While this is somewhat a revert of commit 4ca234a9cbd7 ("powerpc/smp:
Stop updating cpu_core_mask"), it is not a plain revert:

1. Plain revert has some conflicts
2. For chip_id == -1, the cpu_core_mask is made identical to
cpu_cpu_mask, unlike previously where cpu_core_mask was set to a core
if chip_id doesn't exist.

This goes by the principle that if chip_id is not exposed, then
sockets / chip / node share the same set of CPUs.

With the fix, lscpu o/p would be
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  160
On-line CPU(s) list: 0-159
Thread(s) per core:  8
Core(s) per socket:  6
Socket(s):   2 <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-79
NUMA node1 CPU(s):   80-159

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3d6874fe1937..3d26d3c61e94 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1492,6 +1492,7 @@ static void add_cpu_to_masks(int cpu)
   

[PATCH 1/3] powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2

2021-08-21 Thread Srikar Dronamraju
Aneesh reported a crash with a fairly recent upstream kernel when
booting a kernel whose command line was appended with nr_cpus=2

1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c8a67bd0]
pc: c002557c: cpu_to_chip_id+0x3c/0x100
lr: c0058380: start_secondary+0x460/0xb00
sp: c8a67e70
   msr: 80001033
   dar: 10
 dsisr: 8
  current = 0xc891bb00
  paca= 0xc018ff981f80   irqmask: 0x03   irq_happened: 0x01
pid   = 0, comm = swapper/1
Linux version 5.13.0-rc3-15704-ga050a6d2b7e8 (kvaneesh@ltc-boston8) (gcc 
(Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) 
#433 SMP Tue May 25 02:38:49 CDT 2021
1:mon> t
[link register   ] c0058380 start_secondary+0x460/0xb00
[c8a67e70] c8a67eb0 (unreliable)
[c8a67eb0] c00589d4 start_secondary+0xab4/0xb00
[c8a67f90] c000c654 start_secondary_prolog+0x10/0x14

Current code assumes that num_possible_cpus() is always greater than
threads_per_core. However this may not be true when using nr_cpus=2 or
similar options. Handle the case where num_possible_cpus is smaller than
threads_per_core.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Aneesh Kumar K.V 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Fixes: c1e53367dab1 ("powerpc/smp: Cache CPU to chip lookup")
Reported-by: Aneesh Kumar K.V 
Debugged-by: Michael Ellerman 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..3d6874fe1937 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1074,7 +1074,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
}
 
if (cpu_to_chip_id(boot_cpuid) != -1) {
-   int idx = num_possible_cpus() / threads_per_core;
+   int idx = max((int)num_possible_cpus() / threads_per_core, 1);
 
/*
 * All threads of a core will all belong to the same core,
-- 
2.18.2



[PATCH 0/3] powerpc/smp: Misc fixes

2021-08-21 Thread Srikar Dronamraju
The 1st patch fixes a regression which causes a crash when booted with
nr_cpus=2.

The 2nd patch fixes a regression where lscpu on PowerVM reports more
sockets than are available.

The 3rd patch makes cpu_sibling_mask the fallback when L2-cache
properties are not explicitly exposed.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Srikar Dronamraju (3):
  powerpc/smp: Fix a crash while booting kvm guest with nr_cpus=2
  powerpc/smp: Update cpu_core_map on PowerVM lpars.
  powerpc/smp: Enable CACHE domain for shared processor

 arch/powerpc/kernel/smp.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

--
2.18.2



[PATCH] sched/topology: Skip updating masks for non-online nodes

2021-08-18 Thread Srikar Dronamraju
From: Valentin Schneider 

The scheduler currently expects NUMA node distances to be stable from
init onwards, and as a consequence builds the related data structures
once-and-for-all at init (see sched_init_numa()).

Unfortunately, on some architectures node distance is unreliable for
offline nodes and may very well change upon onlining.

Skip over offline nodes during sched_init_numa(). Track nodes that have
been onlined at least once, and trigger a build of a node's NUMA masks
when it is first onlined post-init.

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
Signed-off-by: Valentin Schneider 
---
Changelog:
[Fixed warning: no previous prototype for '__sched_domains_numa_masks_set']
 kernel/sched/topology.c | 65 +
 1 file changed, 65 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..4e8698e62f07 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1482,6 +1482,8 @@ int   sched_max_numa_distance;
 static int *sched_domains_numa_distance;
 static struct cpumask  ***sched_domains_numa_masks;
 int __read_mostly  node_reclaim_distance = RECLAIM_DISTANCE;
+
+static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
 
 /*
@@ -1833,6 +1835,16 @@ void sched_init_numa(void)
sched_domains_numa_masks[i][j] = mask;
 
for_each_node(k) {
+   /*
+* Distance information can be unreliable for
+* offline nodes, defer building the node
+* masks to its bringup.
+* This relies on all unique distance values
+* still being visible at init time.
+*/
+   if (!node_online(j))
+   continue;
+
if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
sched_numa_warn("Node-distance not symmetric");
 
@@ -1886,6 +1898,53 @@ void sched_init_numa(void)
sched_max_numa_distance = sched_domains_numa_distance[nr_levels - 1];
 
init_numa_topology_type();
+
+   sched_numa_onlined_nodes = bitmap_alloc(nr_node_ids, GFP_KERNEL);
+   if (!sched_numa_onlined_nodes)
+   return;
+
+   bitmap_zero(sched_numa_onlined_nodes, nr_node_ids);
+   for_each_online_node(i)
+   bitmap_set(sched_numa_onlined_nodes, i, 1);
+}
+
+static void __sched_domains_numa_masks_set(unsigned int node)
+{
+   int i, j;
+
+   /*
+* NUMA masks are not built for offline nodes in sched_init_numa().
+* Thus, when a CPU of a never-onlined-before node gets plugged in,
+* adding that new CPU to the right NUMA masks is not sufficient: the
+* masks of that CPU's node must also be updated.
+*/
+   if (test_bit(node, sched_numa_onlined_nodes))
+   return;
+
+   bitmap_set(sched_numa_onlined_nodes, node, 1);
+
+   for (i = 0; i < sched_domains_numa_levels; i++) {
+   for (j = 0; j < nr_node_ids; j++) {
+   if (!node_online(j) || node == j)
+   continue;
+
+   if (node_distance(j, node) > sched_domains_numa_distance[i])
+   continue;
+
+   /* Add remote nodes in our masks */
+   cpumask_or(sched_domains_numa_masks[i][node],
+  sched_domains_numa_masks[i][node],
+  sched_domains_numa_masks[0][j]);
+   }
+   }
+
+   /*
+* A new node has been brought up, potentially changing the topology
+* classification.
+*
+* Note that this is racy vs any use of sched_numa_topology_type :/
+*/
+   init_numa_topology_type();
 }
 
 void sched_domains_numa_masks_set(unsigned int cpu)
@@ -1893,8 +1952,14 @@ void sched_domains_numa_masks_set(unsigned int cpu)
int node = cpu_to_node(cpu);
int i, j;
 
+   __sched_domains_numa_masks_set(node);
+
for (i = 0; i < sched_domains_numa_levels; i++) {
for (j = 0; j < nr_node_ids; j++) {
+   if (!node_online(j))
+   continue;
+
+   /* Set ourselves in the remote node's masks */
if (node_distance(j, node) 

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-16 Thread Srikar Dronamraju
> 
> Your version is much much better than mine.
> And I have verified that it works as expected.
> 
> 

Hey Peter/Valentin

Are we waiting for any more feedback/testing for this?


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-10 Thread Srikar Dronamraju
* Valentin Schneider  [2021-08-09 13:52:38]:

> On 09/08/21 12:22, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-08-08 16:56:47]:
> >> Wait, doesn't the distance matrix (without any offline node) say
> >>
> >>   distance(0, 3) == 40
> >>
> >> ? We should have at the very least:
> >>
> >>   node   0   1   2   3
> >> 0:  10  20  ??  40
> >> 1:  20  20  ??  40
> >> 2:  ??  ??  ??  ??
> >> 3:  40  40  ??  10
> >>
> >
> > Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
> > Note: Node 2-7 and CPU 2-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  10
> >   1:  20  20  40  10
> >   2:  40  40  10  10
> >   3:  10  10  10  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0
> >
> > Note: This is with updating Node 2's distance as 40 for figuring out
> > the number of numa levels. Since we have all possible distances, we
> > dont update Node 3 distance, so it will be as if its local to node 0.
> >
> > Now when Node 3 and CPU 3 are onlined
> > Note: Node 2, 3-7 and CPU 2, 3-7 are still offline.
> >
> > node   0   1   2   3
> >   0:  10  20  40  40
> >   1:  20  20  40  40
> >   2:  40  40  10  40
> >   3:  40  40  40  10
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(2) == 0
> > NODE->mask(3) == 0,3
> >
> > CPU 0 continues to be part of Node->mask(3) because when we online and
> > we find the right distance, there is no API to reset the numa mask of
> > 3 to remove CPU 0 from the numa masks.
> >
> > If we had an API to clear/set sched_domains_numa_masks[node][] when
> > the node state changes, we could probably plug-in to clear/set the
> > node masks whenever node state changes.
> >
> 
> Gotcha, this is now coming back to me...
> 
> [...]
> 
> >> Ok, so it looks like we really can't do without that part - even if we get
> >> "sensible" distance values for the online nodes, we can't divine values for
> >> the offline ones.
> >>
> >
> > Yes
> >
> 
> Argh, while your approach does take care of the masks, it leaves
> sched_numa_topology_type unchanged. You *can* force an update of it, but
> yuck :(
> 
> I got to the below...
> 

Yes, I completely missed that we should update sched_numa_topology_type.


> ---
> From: Srikar Dronamraju 
> Date: Thu, 1 Jul 2021 09:45:51 +0530
> Subject: [PATCH 1/1] sched/topology: Skip updating masks for non-online nodes
> 
> The scheduler currently expects NUMA node distances to be stable from init
> onwards, and as a consequence builds the related data structures
> once-and-for-all at init (see sched_init_numa()).
> 
> Unfortunately, on some architectures node distance is unreliable for
> offline nodes and may very well change upon onlining.
> 
> Skip over offline nodes during sched_init_numa(). Track nodes that have
> been onlined at least once, and trigger a build of a node's NUMA masks when
> it is first onlined post-init.
> 

Your version is much much better than mine.
And I have verified that it works as expected.


> Reported-by: Geetika Moolchandani 
> Signed-off-by: Srikar Dronamraju 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/topology.c | 65 +
>  1 file changed, 65 insertions(+)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cba95793a9b7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1482,6 +1482,8 @@ int sched_max_numa_distance;
>  static int   *sched_domains_numa_distance;
>  static struct cpumask***sched_domains_numa_masks;
>  int __read_mostlynode_reclaim_distance = RECLAIM_DISTANCE;
> +
> +static unsigned long __read_mostly *sched_numa_onlined_nodes;
>  #endif
> 
>  /*
> @@ -1833,6 +1835,16 @@ void sched_init_numa(void)
>   sched_domains_numa_masks[i][j] = mask;
> 
>   for_each_node(k) {
> + /*
> +  * Distance information can be unreliable for
> +  * offline nodes, defer building the node
> +  * masks to its bringup.
> +  * This relies on all unique distance values
> +  * still being visible at init 

Re: [PATCH v2] powerpc/xive: Do not skip CPU-less nodes when creating the IPIs

2021-08-10 Thread Srikar Dronamraju
* Cédric Le Goater  [2021-08-07 09:20:57]:

> On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
> runtime. Today, the IPI is not created for such nodes, and hot-plugged
> CPUs use a bogus IPI, which leads to soft lockups.
> 
> We can not directly allocate and request the IPI on demand because
> bringup_up() is called under the IRQ sparse lock. The alternative is
> to allocate the IPIs for all possible nodes at startup and to request
> the mapping on demand when the first CPU of a node is brought up.
> 

Thank you, this version too works for me.

Tested-by: Srikar Dronamraju 


> Fixes: 7dcc37b3eff9 ("powerpc/xive: Map one IPI interrupt per node")
> Cc: sta...@vger.kernel.org # v5.13
> Reported-by: Geetika Moolchandani 
> Cc: Srikar Dronamraju 
> Cc: Laurent Vivier 
> Signed-off-by: Cédric Le Goater 
> Message-Id: <20210629131542.743888-1-...@kaod.org>
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/sysdev/xive/common.c | 35 +--
>  1 file changed, 24 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index dbdbbc2f1dc5..943fd30095af 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -67,6 +67,7 @@ static struct irq_domain *xive_irq_domain;
>  static struct xive_ipi_desc {
>   unsigned int irq;
>   char name[16];
> + atomic_t started;
>  } *xive_ipis;
>  
>  /*
> @@ -1120,7 +1121,7 @@ static const struct irq_domain_ops xive_ipi_irq_domain_ops = {
>   .alloc  = xive_ipi_irq_domain_alloc,
>  };
>  
> -static int __init xive_request_ipi(void)
> +static int __init xive_init_ipis(void)
>  {
>   struct fwnode_handle *fwnode;
>   struct irq_domain *ipi_domain;
> @@ -1144,10 +1145,6 @@ static int __init xive_request_ipi(void)
>   struct xive_ipi_desc *xid = &xive_ipis[node];
>   struct xive_ipi_alloc_info info = { node };
>  
> - /* Skip nodes without CPUs */
> - if (cpumask_empty(cpumask_of_node(node)))
> - continue;
> -
>   /*
>* Map one IPI interrupt per node for all cpus of that node.
>* Since the HW interrupt number doesn't have any meaning,
> @@ -1159,11 +1156,6 @@ static int __init xive_request_ipi(void)
>   xid->irq = ret;
>  
>   snprintf(xid->name, sizeof(xid->name), "IPI-%d", node);
> -
> - ret = request_irq(xid->irq, xive_muxed_ipi_action,
> -   IRQF_PERCPU | IRQF_NO_THREAD, xid->name, NULL);
> -
> - WARN(ret < 0, "Failed to request IPI %d: %d\n", xid->irq, ret);
>   }
>  
>   return ret;
> @@ -1178,6 +1170,22 @@ static int __init xive_request_ipi(void)
>   return ret;
>  }
>  
> +static int __init xive_request_ipi(unsigned int cpu)
> +{
> + struct xive_ipi_desc *xid = &xive_ipis[early_cpu_to_node(cpu)];
> + int ret;
> +
> + if (atomic_inc_return(&xid->started) > 1)
> + return 0;
> +
> + ret = request_irq(xid->irq, xive_muxed_ipi_action,
> +   IRQF_PERCPU | IRQF_NO_THREAD,
> +   xid->name, NULL);
> +
> + WARN(ret < 0, "Failed to request IPI %d: %d\n", xid->irq, ret);
> + return ret;
> +}
> +
>  static int xive_setup_cpu_ipi(unsigned int cpu)
>  {
>   unsigned int xive_ipi_irq = xive_ipi_cpu_to_irq(cpu);
> @@ -1192,6 +1200,9 @@ static int xive_setup_cpu_ipi(unsigned int cpu)
>   if (xc->hw_ipi != XIVE_BAD_IRQ)
>   return 0;
>  
> + /* Register the IPI */
> + xive_request_ipi(cpu);
> +
>   /* Grab an IPI from the backend, this will populate xc->hw_ipi */
>   if (xive_ops->get_ipi(cpu, xc))
>   return -EIO;
> @@ -1231,6 +1242,8 @@ static void xive_cleanup_cpu_ipi(unsigned int cpu, struct xive_cpu *xc)
>   if (xc->hw_ipi == XIVE_BAD_IRQ)
>   return;
>  
> + /* TODO: clear IPI mapping */
> +
>   /* Mask the IPI */
>   xive_do_source_set_mask(&xc->ipi_data, true);
>  
> @@ -1253,7 +1266,7 @@ void __init xive_smp_probe(void)
>   smp_ops->cause_ipi = xive_cause_ipi;
>  
>   /* Register the IPI */
> - xive_request_ipi();
> + xive_init_ipis();
>  
>   /* Allocate and setup IPI for the boot CPU */
>   xive_setup_cpu_ipi(smp_processor_id());
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju
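
As a side note, the allocate-early/request-late shape of this patch can be
sketched in plain C. The model below is simplified user-space code, not the
kernel implementation; init_ipis(), bring_up_cpu() and the fake irq numbers
are illustrative only.

#include <stdio.h>
#include <stdatomic.h>

#define NR_NODES 8

struct ipi_desc {
	int irq;		/* allocated for every possible node at init */
	atomic_int started;	/* guards the one-time request_irq() */
};

static struct ipi_desc ipis[NR_NODES];

static void init_ipis(void)
{
	int node;

	/* Allocation is safe to do early, so do it for ALL possible nodes. */
	for (node = 0; node < NR_NODES; node++)
		ipis[node].irq = 100 + node;	/* stand-in for the real irq allocation */
}

static void bring_up_cpu(int cpu, int node)
{
	/* Only the first CPU of each node performs the request. */
	if (atomic_fetch_add(&ipis[node].started, 1) == 0)
		printf("cpu%d: request_irq(%d) for node %d\n",
		       cpu, ipis[node].irq, node);
}

int main(void)
{
	init_ipis();
	bring_up_cpu(0, 0);
	bring_up_cpu(1, 0);	/* node 0 already requested: no-op */
	bring_up_cpu(8, 1);	/* first CPU of a node that was CPU-less at boot */
	return 0;
}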


Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-09 Thread Srikar Dronamraju
* Valentin Schneider  [2021-08-08 16:56:47]:

>
> A bit late, but technically the week isn't over yet! :D
>
> On 23/07/21 20:09, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-07-13 17:32:14]:
> >> Now, let's take examples from your cover letter:
> >>
> >>   node distances:
> >>   node   0   1   2   3   4   5   6   7
> >> 0:  10  20  40  40  40  40  40  40
> >> 1:  20  10  40  40  40  40  40  40
> >> 2:  40  40  10  20  40  40  40  40
> >> 3:  40  40  20  10  40  40  40  40
> >> 4:  40  40  40  40  10  20  40  40
> >> 5:  40  40  40  40  20  10  40  40
> >> 6:  40  40  40  40  40  40  10  20
> >> 7:  40  40  40  40  40  40  20  10
> >>
> >> But the system boots with just nodes 0 and 1, thus only this distance
> >> matrix is valid:
> >>
> >>   node   0   1
> >> 0:  10  20
> >> 1:  20  10
> >>
> >> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> >> NODE topology level should cause issues. Let's assume all offline nodes say
> >> they're 10 distance away from everyone else, and that we have one CPU per
> >> node. This would give us:
> >>
> >
> > No,
> > All offline nodes would be at a distance of 10 from node 0 only.
> > So here node distance of all offline nodes from node 1 would be 20.
> >
> >>   NODE->mask(0) == 0,2-7
> >>   NODE->mask(1) == 1-7
> >
> > so
> >
> > NODE->mask(0) == 0,2-7
> > NODE->mask(1) should be 1
> > and NODE->mask(2-7) == 0,2-7
> >
>
> Ok, so that shouldn't trigger the warning.

Yes not at this point, but later on when we online a node.

>
> >>
> >> The intersection is 2-7, we'll trigger the WARN_ON().
> >> Now, with the above snippet, we'll check if that intersection covers any
> >> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> >> end up with an empty intersection and we shouldn't warn - that's the theory
> >> at least.
> >
> > Now lets say we onlined CPU 3 and node 3 which was at a actual distance
> > of 20 from node 0.
> >
> > (If we only consider online CPUs, and since scheduler masks like
> > sched_domains_numa_masks arent updated with offline CPUs,)
> > then
> >
> > NODE->mask(0) == 0
> > NODE->mask(1) == 1
> > NODE->mask(3) == 0,3
> >
>
> Wait, doesn't the distance matrix (without any offline node) say
>
>   distance(0, 3) == 40
>
> ? We should have at the very least:
>
>   node   0   1   2   3
> 0:  10  20  ??  40
> 1:  20  10  ??  40
> 2:  ??  ??  ??  ??
> 3:  40  40  ??  10
>

Before onlining node 3 and CPU 3 (node/CPU 0 and 1 are already online)
Note: Node 2-7 and CPU 2-7 are still offline.

node   0   1   2   3
  0:  10  20  40  10
  1:  20  10  40  10
  2:  40  40  10  10
  3:  10  10  10  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0

Note: This is with updating Node 2's distance as 40 for figuring out
the number of numa levels. Since we have all possible distances, we
don't update Node 3's distance, so it will be as if it's local to node 0.

Now when Node 3 and CPU 3 are onlined
Note: Node 2, 4-7 and CPU 2, 4-7 are still offline.

node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(2) == 0
NODE->mask(3) == 0,3

CPU 0 continues to be part of NODE->mask(3) because, when the node comes
online and we learn the right distance, there is no API to reset node 3's
numa mask and remove CPU 0 from it.

If we had an API to clear/set sched_domains_numa_masks[node][] when
the node state changes, we could probably plug-in to clear/set the
node masks whenever node state changes.
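
As a thought experiment, such a hook could simply recompute one node's
masks from the now-correct distances when its state changes. The sketch
below is a user-space model only: numa_masks_rebuild_node() is hypothetical
and the tables are stand-ins for the kernel's data.

#include <stdio.h>

#define NR_NODES  4
#define NR_LEVELS 3

static const int level_distance[NR_LEVELS] = { 10, 20, 40 };

/* The real (fully-onlined) distances from the example above. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 40 },
	{ 40, 40, 40, 10 },
};
static int online[NR_NODES]          = { 1, 1, 0, 0 };
static const unsigned cpus[NR_NODES] = { 0x1, 0x2, 0x4, 0x8 }; /* one CPU/node */

static unsigned numa_masks[NR_LEVELS][NR_NODES];

/* Hypothetical hook: rebuild one node's masks from *current* distances. */
static void numa_masks_rebuild_node(int node)
{
	for (int lvl = 0; lvl < NR_LEVELS; lvl++) {
		numa_masks[lvl][node] = 0;
		for (int j = 0; j < NR_NODES; j++)
			if (online[j] && dist[j][node] <= level_distance[lvl])
				numa_masks[lvl][node] |= cpus[j];
	}
}

int main(void)
{
	online[3] = 1;			/* node 3 comes online...          */
	numa_masks_rebuild_node(3);	/* ...and its masks are recomputed */
	printf("NODE->mask(3) = 0x%x\n", numa_masks[0][3]);	/* 0x8: CPU 3 only */
	printf("40-mask(3)    = 0x%x\n", numa_masks[2][3]);	/* 0xb: CPUs 0,1,3 */
	return 0;
}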


> Regardless, NODE->mask(x) is sched_domains_numa_masks[0][x], if
>
>   distance(0,3) > LOCAL_DISTANCE
>
> then
>
>   node0 ∉ NODE->mask(3)
>
> > cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
> > if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))
> >
> > cpu_map is 0,1,3
> > intersect is 0
> >
> > From above NODE->mask(0) is !equal to NODE->mask(1) and
> > cpumask_intersects(intersect, cpu_map) is also true.
> >
> > I picked Node 3 since if Node 1 is online, we would have faked distance
> > for Node 2 to be at distance of 40.
> >
> > Any node from 3 to 7, we would have faced the same problem.
> >

Re: [PATCHv2 2/3] powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

2021-08-05 Thread Srikar Dronamraju
* Parth Shah  [2021-07-28 23:26:06]:

> From: "Gautham R. Shenoy" 
> 
> The helper function get_shared_cpu_map() was added in
> 
> 'commit 500fe5f550ec ("powerpc/cacheinfo: Report the correct
> shared_cpu_map on big-cores")'
> 
> and subsequently expanded upon in
> 
> 'commit 0be47634db0b ("powerpc/cacheinfo: Print correct cache-sibling
> map/list for L2 cache")'
> 
> in order to help report the correct groups of threads sharing these caches
> on big-core systems where groups of threads within a core can share
> different sets of caches.
> 
> Now that powerpc/cacheinfo is aware of "ibm,thread-groups" property,
> cache->shared_cpu_map contains the correct set of thread-siblings
> sharing the cache. Hence we no longer need the functions
> get_shared_cpu_map(). This patch removes this function. We also remove
> the helper function index_dir_to_cpu() which was only called by
> get_shared_cpu_map().
> 
> With these functions removed, we can still see the correct
> cache-sibling map/list for L1 and L2 caches on systems with L1 and L2
> caches distributed among groups of threads in a core.
> 
> With this patch, on a SMT8 POWER10 system where the L1 and L2 caches
> are split between the two groups of threads in a core, for CPUs 8,9,
> the L1-Data, L1-Instruction, L2, L3 cache CPU sibling list is as
> follows:
> 
> $ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
> /sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
> /sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
> /sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
> /sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15
> 
> $ ppc64_cpu --smt=4
> $ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
> /sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
> /sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
> /sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
> /sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11
> 
> $ ppc64_cpu --smt=2
> $ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
> /sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
> /sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
> /sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
> /sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9
> 
> $ ppc64_cpu --smt=1
> $ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
> /sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
> /sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
> /sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
> /sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8
> 
> Signed-off-by: Gautham R. Shenoy 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> ---
>  arch/powerpc/kernel/cacheinfo.c | 41 +
>  1 file changed, 1 insertion(+), 40 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
> index 5a6925d87424..20d91693eac1 100644
> --- a/arch/powerpc/kernel/cacheinfo.c
> +++ b/arch/powerpc/kernel/cacheinfo.c
> @@ -675,45 +675,6 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
>  static struct kobj_attribute cache_level_attr =
>   __ATTR(level, 0444, level_show, NULL);
> 
> -static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
> -{
> - struct kobject *index_dir_kobj = &index->kobj;
> - struct kobject *cache_dir_kobj = index_dir_kobj->parent;
> - struct kobject *cpu_dev_kobj = cache_dir_kobj->parent;
> - struct device *dev = kobj_to_dev(cpu_dev_kobj);
> -
> - return dev->id;
> -}
> -
> -/*
> - * On big-core systems, each core has two groups of CPUs each of which
> - * has its own L1-cache. The thread-siblings w

Re: [PATCHv2 1/3] powerpc/cacheinfo: Lookup cache by dt node and thread-group id

2021-08-05 Thread Srikar Dronamraju
* Parth Shah  [2021-07-28 23:26:05]:

> From: "Gautham R. Shenoy" 
> 
> Currently the cacheinfo code on powerpc indexes the "cache" objects
> (modelling the L1/L2/L3 caches) where the key is device-tree node
> corresponding to that cache. On some of the POWER server platforms
> thread-groups within the core share different sets of caches (Eg: On
> SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
> threads 1,3,5,7 of the same core share another L1 cache). On such
> platforms, there is a single device-tree node corresponding to that
> cache and the cache-configuration within the threads of the core is
> indicated via "ibm,thread-groups" device-tree property.
> 
> Since the current code is not aware of the "ibm,thread-groups"
> property, on the aforementioned systems, cacheinfo code still treats
> all the threads in the core to be sharing the cache because of the
> single device-tree node (In the earlier example, the cacheinfo code
> would say CPUs 0-7 share L1 cache).
> 
> In this patch, we make the powerpc cacheinfo code aware of the
> "ibm,thread-groups" property. We indexe the "cache" objects by the
> key-pair (device-tree node, thread-group id). For any CPUX, for a
> given level of cache, the thread-group id is defined to be the first
> CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
> of cache which are not represented in "ibm,thread-groups" property,
> the thread-group id is -1.
> 
> Signed-off-by: Gautham R. Shenoy 
> [parth: Remove "static" keyword for the definition of 
> "thread_group_l1_cache_map"
> and "thread_group_l2_cache_map" to get rid of the compile error.]
> Signed-off-by: Parth Shah 


Looks good to me.

Reviewed-by: Srikar Dronamraju 

> ---
>  arch/powerpc/include/asm/smp.h  |  3 ++
>  arch/powerpc/kernel/cacheinfo.c | 80 -
>  arch/powerpc/kernel/smp.c   |  4 +-
>  3 files changed, 63 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 03b3d010cbab..1259040cc3a4 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -33,6 +33,9 @@ extern bool coregroup_enabled;
>  extern int cpu_to_chip_id(int cpu);
>  extern int *chip_id_lookup_table;
> 
> +DECLARE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
> +DECLARE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
> +
>  #ifdef CONFIG_SMP
> 
>  struct smp_ops_t {
> diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
> index 6f903e9aa20b..5a6925d87424 100644
> --- a/arch/powerpc/kernel/cacheinfo.c
> +++ b/arch/powerpc/kernel/cacheinfo.c
> @@ -120,6 +120,7 @@ struct cache {
>   struct cpumask shared_cpu_map; /* online CPUs using this cache */
>   int type;  /* split cache disambiguation */
>   int level; /* level not explicit in device tree */
> + int group_id;  /* id of the group of threads that share this cache */
>   struct list_head list; /* global list of cache objects */
>   struct cache *next_local;  /* next cache of >= level */
>  };
> @@ -142,22 +143,24 @@ static const char *cache_type_string(const struct cache *cache)
>  }
> 
>  static void cache_init(struct cache *cache, int type, int level,
> -struct device_node *ofnode)
> +struct device_node *ofnode, int group_id)
>  {
>   cache->type = type;
>   cache->level = level;
>   cache->ofnode = of_node_get(ofnode);
> + cache->group_id = group_id;
>   INIT_LIST_HEAD(&cache->list);
>   list_add(&cache->list, &cache_list);
>  }
> 
> -static struct cache *new_cache(int type, int level, struct device_node *ofnode)
> +static struct cache *new_cache(int type, int level,
> +struct device_node *ofnode, int group_id)
>  {
>   struct cache *cache;
> 
>   cache = kzalloc(sizeof(*cache), GFP_KERNEL);
>   if (cache)
> - cache_init(cache, type, level, ofnode);
> + cache_init(cache, type, level, ofnode, group_id);
> 
>   return cache;
>  }
> @@ -309,20 +312,24 @@ static struct cache *cache_find_first_sibling(struct cache *cache)
>   return cache;
> 
>   list_for_each_entry(iter, &cache_list, list)
> - if (iter->ofnode == cache->ofnode && iter->next_local == cache)
> + if (iter->ofnode == cache->ofnode &&
> + iter->group_id == cache->group_id &&
> 

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-08-04 Thread Srikar Dronamraju
* Srikar Dronamraju  [2021-07-23 20:09:14]:

> * Valentin Schneider  [2021-07-13 17:32:14]:
> 
> > On 12/07/21 18:18, Srikar Dronamraju wrote:
> > > Hi Valentin,
> > >
> > >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> > >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> > >> >  {
> > >

Hey Valentin / Peter

Did you get a chance to look at this?

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH] powerpc/pseries: Fix regression while building external modules

2021-07-29 Thread Srikar Dronamraju
Since commit c9f3401313a5 ("powerpc: Always enable queued spinlocks for
64s, disable for others"), CONFIG_PPC_QUEUED_SPINLOCKS is always
enabled on ppc64le, and external modules that use spinlock APIs fail
to build:

ERROR: modpost: GPL-incompatible module XXX.ko uses GPL-only symbol
'shared_processor'

Before the above commit, modules were able to build without any
issues. This problem is also not seen on other architectures. The
problem can be worked around if CONFIG_UNINLINE_SPIN_UNLOCK is enabled
in the config. However, CONFIG_UNINLINE_SPIN_UNLOCK is not enabled by
default and is only enabled under certain conditions, such as when
CONFIG_DEBUG_SPINLOCKS is set in the kernel config.

#include <linux/module.h>
#include <linux/spinlock.h>

spinlock_t spLock;

static int __init spinlock_test_init(void)
{
	spin_lock_init(&spLock);
	spin_lock(&spLock);
	spin_unlock(&spLock);
	return 0;
}

static void __exit spinlock_test_exit(void)
{
	printk("spinlock_test unloaded\n");
}
module_init(spinlock_test_init);
module_exit(spinlock_test_exit);

MODULE_DESCRIPTION ("spinlock_test");
MODULE_LICENSE ("non-GPL");
MODULE_AUTHOR ("Srikar Dronamraju");
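
(For completeness: the module above builds with the standard
external-module Makefile boilerplate shown below; this Makefile is
assumed, it was not part of the original report. The modpost failure
appears at the 'modules' step.)

obj-m += spinlock_test.o

all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(CURDIR) modules

clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(CURDIR) clean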

Given that spin locks are one of the basic facilities for module code,
this effectively makes it impossible to build/load almost any non-GPL
modules on ppc64le.

This was first reported at https://github.com/openzfs/zfs/issues/11172

Currently shared_processor is exported as a GPL-only symbol.
Fix this for parity with other architectures by exposing
shared_processor to non-GPL modules too.

Fixes: 14c73bd344da ("powerpc/vcpu: Assume dedicated processors as non-preempt")
Fixes: c9f3401313a5 ("powerpc: Always enable queued spinlocks for 64s, disable 
for others")
Reported-by: marc.c.dio...@gmail.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: marc.c.dio...@gmail.com
Cc: jfor...@redhat.com
Cc: yaday...@in.ibm.com
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/platforms/pseries/setup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 754e493b7c05..0338f481c12b 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -77,7 +77,7 @@
 #include "../../../../drivers/pci/pci.h"
 
 DEFINE_STATIC_KEY_FALSE(shared_processor);
-EXPORT_SYMBOL_GPL(shared_processor);
+EXPORT_SYMBOL(shared_processor);
 
 int CMO_PrPSP = -1;
 int CMO_SecPSP = -1;

base-commit: adf3c31e18b765ea24eba7b0c1efc076b8ee3d55
-- 
2.18.2



Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-23 Thread Srikar Dronamraju
* Valentin Schneider  [2021-07-13 17:32:14]:

> On 12/07/21 18:18, Srikar Dronamraju wrote:
> > Hi Valentin,
> >
> >> On 01/07/21 09:45, Srikar Dronamraju wrote:
> >> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >> >  {
> >
> > Unfortunately this is not helping.
> > I tried this patch alone and also with 2/2 patch of this series where
> > we update/fill fake topology numbers. However both cases are still failing.
> >
> 
> Thanks for testing it.
> 
> 
> Now, let's take examples from your cover letter:
> 
>   node distances:
>   node   0   1   2   3   4   5   6   7
> 0:  10  20  40  40  40  40  40  40
> 1:  20  10  40  40  40  40  40  40
> 2:  40  40  10  20  40  40  40  40
> 3:  40  40  20  10  40  40  40  40
> 4:  40  40  40  40  10  20  40  40
> 5:  40  40  40  40  20  10  40  40
> 6:  40  40  40  40  40  40  10  20
> 7:  40  40  40  40  40  40  20  10
> 
> But the system boots with just nodes 0 and 1, thus only this distance
> matrix is valid:
> 
>   node   0   1
> 0:  10  20
> 1:  20  10
> 
> topology_span_sane() is going to use tl->mask(cpu), and as you reported the
> NODE topology level should cause issues. Let's assume all offline nodes say
> they're 10 distance away from everyone else, and that we have one CPU per
> node. This would give us:
> 

No,
All offline nodes would be at a distance of 10 from node 0 only.
So here node distance of all offline nodes from node 1 would be 20.

>   NODE->mask(0) == 0,2-7
>   NODE->mask(1) == 1-7

so 

NODE->mask(0) == 0,2-7
NODE->mask(1) should be 1
and NODE->mask(2-7) == 0,2-7

> 
> The intersection is 2-7, we'll trigger the WARN_ON().
> Now, with the above snippet, we'll check if that intersection covers any
> online CPU. For sched_init_domains(), cpu_map is cpu_active_mask, so we'd
> end up with an empty intersection and we shouldn't warn - that's the theory
> at least.

Now let's say we onlined CPU 3 and node 3, which was at an actual distance
of 20 from node 0.

(If we only consider online CPUs, and since scheduler masks like
sched_domains_numa_masks aren't updated with offline CPUs,)
then

NODE->mask(0) == 0
NODE->mask(1) == 1
NODE->mask(3) == 0,3

cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) && cpumask_intersects(intersect, cpu_map))

cpu_map is 0,1,3
intersect is 0

From above, NODE->mask(0) is !equal to NODE->mask(1) and
cpumask_intersects(intersect, cpu_map) is also true.

I picked Node 3 since if Node 1 is online, we would have faked distance
for Node 2 to be at distance of 40.

Any node from 3 to 7, we would have faced the same problem.
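
To see the warning fire with exactly these masks, the check can be
replayed with plain bitmasks standing in for cpumasks (a self-contained
model, not the kernel code):

#include <stdio.h>

int main(void)
{
	unsigned int mask0 = 0x1;	/* NODE->mask(0) == {0}     */
	unsigned int mask3 = 0x9;	/* NODE->mask(3) == {0,3}   */
	unsigned int cpu_map = 0xb;	/* online CPUs   == {0,1,3} */
	unsigned int intersect = mask0 & mask3;	/* == {0} */

	/* Non-NUMA spans must be either equal or disjoint on online CPUs. */
	if (mask0 != mask3 && (intersect & cpu_map))
		printf("WARN_ON: NODE spans partially overlap\n");
	return 0;
}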

> 
> Looking at sd_numa_mask(), I think there's a bug with topology_span_sane():
> it doesn't run in the right place wrt where sched_domains_curr_level is
> updated. Could you try the below on top of the previous snippet?
> 
> If that doesn't help, could you share the node distances / topology masks
> that lead to the WARN_ON()? Thanks.
> 
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..cda69dfa4065 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1516,13 +1516,6 @@ sd_init(struct sched_domain_topology_level *tl,
>   int sd_id, sd_weight, sd_flags = 0;
>   struct cpumask *sd_span;
> 
> -#ifdef CONFIG_NUMA
> - /*
> -  * Ugly hack to pass state to sd_numa_mask()...
> -  */
> - sched_domains_curr_level = tl->numa_level;
> -#endif
> -
>   sd_weight = cpumask_weight(tl->mask(cpu));
> 
>   if (tl->sd_flags)
> @@ -2131,7 +2124,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
> 
>   sd = NULL;
>   for_each_sd_topology(tl) {
> -
> +#ifdef CONFIG_NUMA
> + /*
> +  * Ugly hack to pass state to sd_numa_mask()...
> +  */
> + sched_domains_curr_level = tl->numa_level;
> +#endif
>   if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
>   goto error;
> 
> 

I tested with the above patch too. However it still not helping.

Here is the log from my testing.

At Boot.

(Do remember that, to arrive at sched_max_numa_levels, we faked the
numa_distance of node 1 to be at 20 from node 0. All other offline
nodes are at a distance of 10 from node 0.)

numactl -H
available: 2 nodes (0,5)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 0 MB
node 0 free: 0 MB
node 5 cpus:
node 5 size: 32038 MB
node 5 free: 293

Re: [PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-07-12 Thread Srikar Dronamraju
Hi Valentin,

> On 01/07/21 09:45, Srikar Dronamraju wrote:
> > @@ -1891,12 +1894,30 @@ void sched_init_numa(void)
> >  void sched_domains_numa_masks_set(unsigned int cpu)
> >  {
> 
> Hmph, so we're playing games with masks of offline nodes - is that really
> necessary? Your modification of sched_init_numa() still scans all of the
> nodes (regardless of their online status) to build the distance map, and
> that is never updated (sched_init_numa() is pretty much an __init
> function).
> 
> So AFAICT this is all to cope with topology_span_sane() not applying
> 'cpu_map' to its masks. That seemed fine to me back when I wrote it, but in
> light of having bogus distance values for offline nodes, not so much...
> 
> What about the below instead?
> 
> ---
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b77ad49dc14f..c2d9caad4aa6 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2075,6 +2075,7 @@ static struct sched_domain *build_sched_domain(struct sched_domain_topology_level
>  static bool topology_span_sane(struct sched_domain_topology_level *tl,
> const struct cpumask *cpu_map, int cpu)
>  {
> + struct cpumask *intersect = sched_domains_tmpmask;
>   int i;
> 
>   /* NUMA levels are allowed to overlap */
> @@ -2090,14 +2091,17 @@ static bool topology_span_sane(struct sched_domain_topology_level *tl,
>   for_each_cpu(i, cpu_map) {
>   if (i == cpu)
>   continue;
> +
>   /*
> -  * We should 'and' all those masks with 'cpu_map' to exactly
> -  * match the topology we're about to build, but that can only
> -  * remove CPUs, which only lessens our ability to detect
> -  * overlaps
> +  * We shouldn't have to bother with cpu_map here, unfortunately
> +  * some architectures (powerpc says hello) have to deal with
> +  * offline NUMA nodes reporting bogus distance values. This can
> +  * lead to funky NODE domain spans, but since those are offline
> +  * we can mask them out.
>*/
> + cpumask_and(intersect, tl->mask(cpu), tl->mask(i));
>   if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
> - cpumask_intersects(tl->mask(cpu), tl->mask(i)))
> + cpumask_intersects(intersect, cpu_map))
>   return false;
>   }
> 

Unfortunately this is not helping.
I tried this patch alone and also with 2/2 patch of this series where
we update/fill fake topology numbers. However both cases are still failing.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH 2/2] powerpc/numa: Update cpu_cpu_map on CPU online/offline

2021-07-01 Thread Srikar Dronamraju
cpu_cpu_map holds all the CPUs in the DIE. However on PowerPC, this
mask doesn't get updated when CPUs are onlined/offlined; it is only
updated when CPUs are added/removed. So when both kinds of operations,
online/offline of CPUs and adding/removing of CPUs, are done
simultaneously, the cpumasks end up broken.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
online/offline.
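
In other words, node_to_cpumask_map[] must be kept in sync on the hotplug
path too. A toy user-space model of that invariant follows (bitmasks stand
in for cpumasks; it only mirrors, and is not, the kernel patch below):

#include <stdio.h>

#define NR_NODES 2

static unsigned int node_to_cpumask_map[NR_NODES];	/* one bit per CPU */
static const int cpu_node[] = { 0, 0, 1, 1 };		/* cpu -> node     */

static void map_cpu_to_node(int cpu, int node)
{
	node_to_cpumask_map[node] |= 1u << cpu;
}

static void unmap_cpu_from_node(int cpu)
{
	node_to_cpumask_map[cpu_node[cpu]] &= ~(1u << cpu);
}

int main(void)
{
	for (int cpu = 0; cpu < 4; cpu++)
		map_cpu_to_node(cpu, cpu_node[cpu]);	/* CPU online  */
	unmap_cpu_from_node(3);				/* CPU 3 offline */
	/* cpumask_of_node(1) now reports {2} instead of a stale {2,3} */
	printf("node1 mask = 0x%x\n", node_to_cpumask_map[1]);
	return 0;
}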

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  |  7 ++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..2f0a4d7b95f6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -65,6 +65,11 @@ static inline int early_cpu_to_node(int cpu)
 
 int of_drconf_to_nid_single(struct drmem_lmb *lmb);
 
+extern void map_cpu_to_node(int cpu, int node);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void unmap_cpu_from_node(unsigned long cpu);
+#endif /* CONFIG_HOTPLUG_CPU */
+
 #else
 
 static inline int early_cpu_to_node(int cpu) { return 0; }
@@ -93,6 +98,13 @@ static inline int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return first_online_node;
 }
 
+#ifdef CONFIG_SMP
+static inline void map_cpu_to_node(int cpu, int node) {}
+#ifdef CONFIG_HOTPLUG_CPU
+static inline void unmap_cpu_from_node(unsigned long cpu) {}
+#endif /* CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_SMP */
+
 #endif /* CONFIG_NUMA */
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6c6e4d934d86..e562cca13d66 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1407,6 +1407,8 @@ static void remove_cpu_from_masks(int cpu)
struct cpumask *(*mask_fn)(int) = cpu_sibling_mask;
int i;
 
+   unmap_cpu_from_node(cpu);
+
if (shared_caches)
mask_fn = cpu_l2_cache_mask;
 
@@ -1491,6 +1493,7 @@ static void add_cpu_to_masks(int cpu)
 * This CPU will not be in the online mask yet so we need to manually
 * add it to it's own threa

[PATCH 1/2] powerpc/numa: Print debug statements only when required

2021-07-01 Thread Srikar Dronamraju
Currently, a debug message gets printed on every attempt to add
(remove) a CPU to (from) a node. However this is redundant if the CPU
has already been added to (removed from) the node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 6d0d89127190..f68dbe4e982c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -141,10 +141,11 @@ static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
 
-   dbg("adding cpu %d to node %d\n", cpu, node);
 
-   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node])))
+   if (!(cpumask_test_cpu(cpu, node_to_cpumask_map[node]))) {
+   dbg("adding cpu %d to node %d\n", cpu, node);
cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
+   }
 }
 
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
@@ -152,13 +153,11 @@ static void unmap_cpu_from_node(unsigned long cpu)
 {
int node = numa_cpu_lookup_table[cpu];
 
-   dbg("removing cpu %lu from node %d\n", cpu, node);
-
if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
+   dbg("removing cpu %lu from node %d\n", cpu, node);
} else {
-   printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
-  cpu, node);
+   pr_err("WARNING: cpu %lu not found in node %d\n", cpu, node);
}
 }
 #endif /* CONFIG_HOTPLUG_CPU || CONFIG_PPC_SPLPAR */
-- 
2.27.0



[PATCH 0/2] Update cpu_cpu_mask on CPU online/offline

2021-07-01 Thread Srikar Dronamraju
When simultaneously running CPU online/offline with CPU add/remove in a
loop, we see a WARNING messages.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 
build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c01caac8 LR: c01caac4 CTR: 007088ec
REGS: c0005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  80029033   CR: 48828222  XER: 0009
CFAR: c01ea698 IRQMASK: 0
GPR00: c01caac4 c0005596f4c0 c1c4a400 0036
GPR04: fffd c0005596f1d0 0027 c018cfd07f90
GPR08: 0023 0001 0027 c018fe68ffe8
GPR12: 8000 c0001e9d1880 c0013a047200 0800
GPR16: c1d3c7d0 0240 0048 c00010aacd18
GPR20: 0001 c00010aacc18 c0013a047c00 c00139ec2400
GPR24: 0280 c00139ec2520 c00136c1b400 c1c93060
GPR28: c0013a047c20 c1d3c6c0 c1c978a0 000d
NIP [c01caac8] build_sched_domains+0xd48/0x1720
LR [c01caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c0005596f4c0] [c01caac4] build_sched_domains+0xd44/0x1720 
(unreliable)
[c0005596f670] [c01cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c0005596f710] [c02804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c0005596f810] [c0283e60] rebuild_sched_domains+0x40/0x70
[c0005596f840] [c0284124] cpuset_hotplug_workfn+0x294/0xf10
[c0005596fc60] [c0175040] process_one_work+0x290/0x590
[c0005596fd00] [c01753c8] worker_thread+0x88/0x620
[c0005596fda0] [c0181704] kthread+0x194/0x1a0
[c0005596fe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 6000 2fa30800 409e0028 80fe e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 6000 <0fe0> 3900 38e0 38c0

This was because cpu_cpu_mask() was not getting updated on CPU
online/offline but would be only updated when add/remove of CPUs.
Other cpumasks get updated both on CPU online/offline and add/remove.
Update cpu_cpu_mask() on CPU online/offline too.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Vincent Guittot 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (2):
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline

 arch/powerpc/include/asm/topology.h | 12 
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/mm/numa.c  | 18 +++---
 3 files changed, 22 insertions(+), 11 deletions(-)

-- 
2.27.0



[PATCH v2 1/2] sched/topology: Skip updating masks for non-online nodes

2021-06-30 Thread Srikar Dronamraju
Currently the scheduler doesn't check if a node is online before adding
CPUs to the node mask. However, on some architectures node distance is
only available for nodes that are online, so it is unclear how much the
node distance can be relied upon when one of the nodes is offline.

If said node distance is fake (since one of the nodes is offline) and
the actual node distance is different, then the cpumask of such nodes
will be wrong once the nodes become online.

This can cause topology_span_sane to throw up a warning message, with
the rest of the topology not being updated properly.

Resolve this by skipping update of cpumask for nodes that are not
online.

However, by skipping, relevant CPUs may not be set when nodes are
onlined: when building NUMA masks at a certain NUMA distance, CPUs that
are part of other, already-online nodes will not be part of the NUMA
mask. Hence, the first time a CPU is added to the newly onlined node,
add the other CPUs to the numa_mask.

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2:
v1 link: 
http://lore.kernel.org/lkml/20210520154427.1041031-4-sri...@linux.vnet.ibm.com/t/#u
Update the NUMA masks, whenever 1st CPU is added to cpuless node

 kernel/sched/topology.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..f25dbcab4fd2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1833,6 +1833,9 @@ void sched_init_numa(void)
sched_domains_numa_masks[i][j] = mask;
 
for_each_node(k) {
+   if (!node_online(j))
+   continue;
+
if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
sched_numa_warn("Node-distance not symmetric");
 
@@ -1891,12 +1894,30 @@ void sched_init_numa(void)
 void sched_domains_numa_masks_set(unsigned int cpu)
 {
int node = cpu_to_node(cpu);
-   int i, j;
+   int i, j, empty;
 
+   empty = cpumask_empty(sched_domains_numa_masks[0][node]);
for (i = 0; i < sched_domains_numa_levels; i++) {
for (j = 0; j < nr_node_ids; j++) {
-   if (node_distance(j, node) <= sched_domains_numa_distance[i])
+   if (!node_online(j))
+   continue;
+
+   if (node_distance(j, node) <= sched_domains_numa_distance[i]) {
cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
+
+   /*
+* We skip updating numa_masks for offline
+* nodes. However now that the node is
+* finally online, CPUs that were added
+* earlier, should now be accommodated into
+* newly onlined node's numa mask.
+*/
+   if (node != j && empty) {
+   cpumask_or(sched_domains_numa_masks[i][node],
+   sched_domains_numa_masks[i][node],
+   sched_domains_numa_masks[0][j]);
+   }
+   }
}
}
 }
-- 
2.27.0



[PATCH v2 2/2] powerpc/numa: Fill distance_lookup_table for offline nodes

2021-06-30 Thread Srikar Dronamraju
Currently scheduler populates the distance map by looking at distance
of each node from all other nodes. This should work for most
architectures and platforms.

The scheduler expects the number of unique node distances to be
available at boot, and uses node_distance() to compute this set of
unique distances. On Power Servers, node distances for offline nodes
are not available.
However, Power Servers already knows unique possible node distances.
Fake the offline node's distance_lookup_table entries so that all
possible node distances are updated.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system boots with only two nodes online, the
distance info from numactl will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

It may be implementation dependent what node_distance(0,3) returns
when node 0 is online and node 3 is offline. In the Power Servers case,
it returns LOCAL_DISTANCE (10). Here at boot the scheduler would assume that the max
distance between nodes is 20. However that would not be true.

When Nodes are onlined and CPUs from those nodes are hotplugged,
the max node distance would be 40.

However this only needs to be done if the number of unique node
distances that can be computed for online nodes is less than the
number of possible unique node distances as represented by
distance_ref_points_depth. When the node is actually onlined,
distance_lookup_table will be updated with actual entries.
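
For reference, the set of "unique possible node distances" falls straight
out of distance_ref_points_depth under the form-1 doubling convention
assumed by this series; a one-screen illustration (depth value picked to
match the example machine above):

#include <stdio.h>

#define LOCAL_DISTANCE 10

int main(void)
{
	int depth = 2;	/* distance_ref_points_depth for the 8-node box above */
	int i, distance = LOCAL_DISTANCE;

	/* nr_depth = depth + 1 unique distances: here {10, 20, 40} */
	for (i = 0; i <= depth; i++, distance *= 2)
		printf("level %d -> distance %d\n", i, distance);
	return 0;
}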

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2:
Move to a Powerpc specific solution as suggested by Peter and Valentin

 arch/powerpc/mm/numa.c | 70 ++
 1 file changed, 70 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..6d0d89127190 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -860,6 +860,75 @@ void __init dump_numa_cpu_topology(void)
}
 }
 
+/*
+ * The scheduler expects the number of unique node distances to be available
+ * at boot, and uses node_distance() to compute them. On POWER, node
+ * distances for offline nodes are not available. However, POWER
+ * already knows unique possible node distances. Fake the offline node's
+ * distance_lookup_table entries so that all possible node distances are
+ * updated.
+ */
+void __init fake_update_distance_lookup_table(void)
+{
+   unsigned long distance_map;
+   int i, nr_levels, nr_depth, node;
+
+   if (!numa_enabled)
+   return;
+
+   if (!form1_affinity)
+   return;
+
+   /*
+* distance_ref_points_depth lists the unique numa domains
+* available. However it ignore LOCAL_DISTANCE. So add +1
+* to get the actual number of unique distances.
+*/
+   nr_depth = distance_ref_points_depth + 1;
+
+   WARN_ON(nr_depth > sizeof(distance_map));
+
+   bitmap_zero(&distance_map, nr_depth);
+   bitmap_set(&distance_map, 0, 1);
+
+   for_each_online_node(node) {
+   int nd, distance = LOCAL_DISTANCE;
+
+   if (node == first_online_node)
+   continue;
+
+   nd = __node_distance(node, first_online_node);
+   for (i = 0; i < nr_depth; i++, distance *= 2) {
+   if (distance == nd) {
+   bitmap_set(&distance_map, i, 1);
+   break;
+   }
+   }
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+
+   for_each_node(node) {
+   if (node_online(node))
+   continue;
+
+   i = find_first_zero_bit(&distance_map, nr_depth);
+   if (i >= nr_depth || i == 0) {
+   pr_warn("Levels(%d) not matching levels(%d)", nr_levels, nr_depth);
+   return;
+   }
+
+   bitmap_set(&distance_map, i, 1);
+   while (i--)
+   distance_lookup_table[node][i] = node;
+
+   nr_levels = bitmap_weight(&distance_map, nr_depth);
+   if (nr_levels == nr_depth)
+   return;
+   }
+}
+
 /* Initialize NODE_DATA for a no

[PATCH v2 0/2] Skip numa distance for offline nodes

2021-06-30 Thread Srikar Dronamraju
Changelog v1->v2:
v1: 
http://lore.kernel.org/lkml/20210520154427.1041031-1-sri...@linux.vnet.ibm.com/t/#u
- Update the numa masks, whenever 1st CPU is added to cpuless node
- Populate all possible nodes distances in boot in a
powerpc specific function

Geetika reported yet another trace while doing a dlpar CPU add
operation. This was true even on top of a recent commit
6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask") which fixed
a similar trace.

WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 
build_sched_domains+0x6e8/0x1540
Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag
xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag
netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables
nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs
libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror
dm_region_hash dm_log dm_mod fuse
CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19
Workqueue: events cpuset_hotplug_workfn
NIP:  c01de588 LR: c01de584 CTR: 006cd36c
REGS: c0002772b250 TRAP: 0700   Not tainted  (5.12.0-rc5-master+)
MSR:  80029033   CR: 28828422  XER: 000d
CFAR: c020c2f8 IRQMASK: 0
GPR00: c01de584 c0002772b4f0 c1f55400 0036
GPR04: c063c6368010 c063c63f0a00 0027 c063c6368018
GPR08: 0023 c063c636ef48 0063c4de c063bfe9ffe8
GPR12: 28828424 c063fe68fe80  0417
GPR16: 0028 c740dcd8 c205db68 c1a3a4a0
GPR20: c00091ed7d20 c00091ed8520 0001 
GPR24: c000113a9600 0190 0028 c10e3ac0
GPR28:  c740dd00 c000317b5900 0190
NIP [c01de588] build_sched_domains+0x6e8/0x1540
LR [c01de584] build_sched_domains+0x6e4/0x1540
Call Trace:
[c0002772b4f0] [c01de584] build_sched_domains+0x6e4/0x1540 
(unreliable)
[c0002772b640] [c01e08dc] partition_sched_domains_locked+0x3ec/0x530
[c0002772b6e0] [c02a2144] rebuild_sched_domains_locked+0x524/0xbf0
[c0002772b7e0] [c02a5620] rebuild_sched_domains+0x40/0x70
[c0002772b810] [c02a58e4] cpuset_hotplug_workfn+0x294/0xe20
[c0002772bc30] [c0187510] process_one_work+0x300/0x670
[c0002772bd10] [c01878f8] worker_thread+0x78/0x520
[c0002772bda0] [c01937f0] kthread+0x1a0/0x1b0
[c0002772be10] [c000d6ec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19
fbe10060 3863e558 4802dd31 6000 <0fe0> 3920fff4 f9210080 e86100b0

Detailed analysis of the failing scenario showed that the span in
question belongs to NODE domain and further the cpumasks for some
cpus in NODE overlapped. There are two possible reasons how we ended
up here:

(1) The numa node was offline or blank with no CPUs or memory. Hence
the sched_max_numa_distance could not be set correctly, or the
sched_domains_numa_distance happened to be partially populated.

(2) Depending on a bogus node_distance of an offline node to populate
cpumasks is the issue.  On POWER platform the node_distance is
correctly available only for an online node which has some CPU or
memory resource associated with it.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system boots with only two nodes online, the
numa topology will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

This series tries to fix both these problems.
Note: These problems are now visible, thanks to
Commit ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't
(partially) overlap")

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Cc: Laurent Dufour 

Srikar Dronamraju (2):
  sched/topology: Skip updating masks for non-online nodes
  powerpc/numa: Fill distance_lookup_table for offline nodes

 arch/powerpc/mm/numa.c  | 70 +

Re: [PATCH] powerpc/xive: Do not skip CPU-less nodes when creating the IPIs

2021-06-29 Thread Srikar Dronamraju
* Cédric Le Goater  [2021-06-29 15:15:42]:

> On PowerVM, CPU-less nodes can be populated with hot-plugged CPUs at
> runtime. Today, the IPI is not created for such nodes, and hot-plugged
> CPUs use a bogus IPI, which leads to soft lockups.
> 
> We could create the node IPI on demand but it is a bit complex because
> this code would be called under bringup_up() and some IRQ locking is
> being done. The simplest solution is to create the IPIs for all nodes
> at startup.

Thanks for quickly coming up with the fix.

> 
> Fixes: 7dcc37b3eff9 ("powerpc/xive: Map one IPI interrupt per node")
> Cc: sta...@vger.kernel.org # v5.13
> Reported-by: Geetika Moolchandani 
> Cc: Srikar Dronamraju 
> Signed-off-by: Cédric Le Goater 

Tested-by: Srikar Dronamraju 
> ---
> 
> This patch breaks old versions of irqbalance (<= v1.4). Possible nodes
> are collected from /sys/devices/system/node/ but CPU-less nodes are
> not listed there. When interrupts are scanned, the link representing
> the node structure is NULL and segfault occurs.
> 
> Version 1.7 seems immune. 
> 
> ---
>  arch/powerpc/sysdev/xive/common.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
> index f3b16ed48b05..5d2c58dba57e 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -1143,10 +1143,6 @@ static int __init xive_request_ipi(void)
>   struct xive_ipi_desc *xid = &xive_ipis[node];
>   struct xive_ipi_alloc_info info = { node };
>  
> - /* Skip nodes without CPUs */
> - if (cpumask_empty(cpumask_of_node(node)))
> - continue;
> -
>   /*
>* Map one IPI interrupt per node for all cpus of that node.
>* Since the HW interrupt number doesn't have any meaning,
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup

2021-06-24 Thread Srikar Dronamraju
* Bharata B Rao  [2021-06-24 21:25:09]:

> A PowerPC KVM guest gets the following BUG message when booting
> linux-next-20210623:
> 
> smp: Bringing up secondary CPUs ...
> BUG: scheduling while atomic: swapper/1/0/0x
> no locks held by swapper/1/0.
> Modules linked in:
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> Call Trace:
> [cae5bc20] [c0badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> [cae5bc60] [c0210200] __schedule_bug+0xb0/0xe0
> [cae5bcd0] [c1609e28] __schedule+0x1788/0x1c70
> [cae5be20] [c160a8cc] schedule_idle+0x3c/0x70
> [cae5be50] [c022984c] do_idle+0x2bc/0x420
> [cae5bf00] [c0229d88] cpu_startup_entry+0x38/0x40
> [cae5bf30] [c00666c0] start_secondary+0x290/0x2a0
> [cae5bf90] [c000be54] start_secondary_prolog+0x10/0x14
> 
> 
> 
> smp: Brought up 2 nodes, 16 CPUs
> numa: Node 0 CPUs: 0-7
> numa: Node 1 CPUs: 8-15
> 
> This seems to have started from next-20210521 and isn't seen on
> next-20210511.
> 

Bharata,

I think the regression is due to Commit f1a0a376ca0c ("sched/core:
Initialize the idle task with preemption disabled")

Can you please try with the above commit reverted?

-- 
Thanks and Regards
Srikar Dronamraju
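
(The requested experiment is just a revert-and-retest cycle, roughly
along these lines; the exact build/install steps are assumed:)

$ git revert f1a0a376ca0c
$ make -j"$(nproc)"
  ... reinstall the kernel, reboot the guest, re-run the CPU bringup test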


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-28 Thread Srikar Dronamraju
* Peter Zijlstra  [2021-05-28 10:43:23]:

> On Mon, May 24, 2021 at 09:48:29PM +0530, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-05-24 15:16:09]:
> 
> > > I suppose one way to avoid the hook would be to write some "fake" distance
> > > values into your distance_lookup_table[] for offline nodes using your
> > > distance_ref_point_depth thing, i.e. ensure an iteration of
> > > node_distance(a, b) covers all distance values [1]. You can then keep 
> > > patch
> > > 3 around, and that should roughly be it.
> > > 
> > 
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> > 
> > for_each_node(node){
> > int i, nd, distance = LOCAL_DISTANCE;
> > goto out;
> > 
> > nd = node_distance(node, first_online_node)
> > for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> > if (node_online) {
> > if (distance != nd)
> > continue;
> > found[i] ++;
> > break;
> > }
> > if (found[i])
> > continue;
> > distance_lookup_table[node][i] = 
> > distance_lookup_table[first_online_node][i];
> > found[i] ++;
> > break;
> > }
> > }
> > 
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> 
> Not really; or rather not more than already is the case AFAICT. Because
> currently your distance table will have *something* in it
> (LOCAL_DISTANCE afaict) for nodes that have never been online, which is
> what triggered the whole problem to begin with.
> 
> Only after the node has come online for the first time, will it contain
> the right value.
> 
> So both before and after this proposal the actual distance value changes
> after the first time a node goes online.
> 
> Yes that's unfortunate, but I don't see a problem with pre-filling it
> with something useful in order to avoid aditional arch hooks.
> 
> 

Okay,

Will post a v2 with prefilling.
Thanks for the update.

-- 
Thanks and Regards
Srikar Dronamraju
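
For readers following along, here is one compilable reading of the
pre-fill pseudo-code quoted above. It is only a sketch: the stub
topology data and the exact fill policy are assumptions, not the final
patch.

#include <stdio.h>

#define LOCAL_DISTANCE	10
#define NR_NODES	4
#define DEPTH		3	/* distance_ref_points_depth */

static int online[NR_NODES] = { 1, 1, 0, 0 };
static int table[NR_NODES][DEPTH];	/* distance_lookup_table stand-in */

static int node_distance(int a, int b)
{
	return a == b ? LOCAL_DISTANCE : 20;	/* stub; firmware supplies real values */
}

int main(void)
{
	int found[DEPTH] = { 0 };
	int first_online = 0;
	int node, i;

	for (node = 0; node < NR_NODES; node++) {
		int distance = LOCAL_DISTANCE;
		int nd = node_distance(node, first_online);

		for (i = 0; i < DEPTH; i++, distance *= 2) {
			if (online[node]) {
				/* online node: record which distance slot it occupies */
				if (distance != nd)
					continue;
				found[i]++;
				break;
			}
			/* offline node: fake-fill the first slot nobody occupies yet */
			if (found[i])
				continue;
			table[node][i] = table[first_online][i];
			found[i]++;
			break;
		}
	}

	for (i = 0; i < DEPTH; i++)
		printf("slot %d: found %d\n", i, found[i]);
	return 0;
}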


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-27 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >   int i, nd, distance = LOCAL_DISTANCE;
> >   goto out;
> >
> >   nd = node_distance(node, first_online_node)
> >   for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >   if (node_online) {
> >   if (distance != nd)
> >   continue;
> >   found[i] ++;
> >   break;
> >   }
> >   if (found[i])
> >   continue;
> >   distance_lookup_table[node][i] = 
> > distance_lookup_table[first_online_node][i];
> >   found[i] ++;
> >   break;
> >   }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter, Valentin, Michael,

Can you please let me know which approach you would want me to follow.

Or do let me know any other alternative solutions that you would want me to
try.


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-25 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-25 11:21:02]:

> On 24/05/21 21:48, Srikar Dronamraju wrote:
> > * Valentin Schneider  [2021-05-24 15:16:09]:
> >> Ok so from your arch you can figure out the *size* of the set of unique
> >> distances, but not the individual node_distance(a, b)... That's quite
> >> unfortunate.
> >
> > Yes, thats true.
> >
> >>
> >> I suppose one way to avoid the hook would be to write some "fake" distance
> >> values into your distance_lookup_table[] for offline nodes using your
> >> distance_ref_point_depth thing, i.e. ensure an iteration of
> >> node_distance(a, b) covers all distance values [1]. You can then keep patch
> >> 3 around, and that should roughly be it.
> >>
> >
> > Yes, this would suffice but to me its not very clean.
> > static int found[distance_ref_point_depth];
> >
> > for_each_node(node){
> >   int i, nd, distance = LOCAL_DISTANCE;
> >   goto out;
> >
> >   nd = node_distance(node, first_online_node)
> >   for (i=0; i < distance_ref_point_depth; i++, distance *= 2) {
> >   if (node_online) {
> >   if (distance != nd)
> >   continue;
> >   found[i] ++;
> >   break;
> >   }
> >   if (found[i])
> >   continue;
> >   distance_lookup_table[node][i] = 
> > distance_lookup_table[first_online_node][i];
> >   found[i] ++;
> >   break;
> >   }
> > }
> >
> > But do note: We are setting a precedent for node distance between two nodes
> > to change.
> >
> 
> Indeed. AFAICT it's that or the unique-distance-values hook :/

Peter,

Please let me know which approach would you prefer.
I am open to try any other approach too.

In my humble opinion, the unique-distance-values hook is cleaner.
Do you still have any concerns with the unique-distance-values hook?

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-24 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-24 15:16:09]:

> On 21/05/21 14:58, Srikar Dronamraju wrote:
> > * Peter Zijlstra  [2021-05-21 10:14:10]:
> >
> >> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> >> > * Peter Zijlstra  [2021-05-20 20:56:31]:
> >> >
> >> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> >> > > > Currently scheduler populates the distance map by looking at distance
> >> > > > of each node from all other nodes. This should work for most
> >> > > > architectures and platforms.
> >> > > >
> >> > > > However there are some architectures like POWER that may not expose
> >> > > > the distance of nodes that are not yet onlined because those 
> >> > > > resources
> >> > > > are not yet allocated to the OS instance. Such architectures have
> >> > > > other means to provide valid distance data for the current platform.
> >> > > >
> >> > > > For example distance info from numactl from a fully populated 8 node
> >> > > > system at boot may look like this.
> >> > > >
> >> > > > node distances:
> >> > > > node   0   1   2   3   4   5   6   7
> >> > > >   0:  10  20  40  40  40  40  40  40
> >> > > >   1:  20  10  40  40  40  40  40  40
> >> > > >   2:  40  40  10  20  40  40  40  40
> >> > > >   3:  40  40  20  10  40  40  40  40
> >> > > >   4:  40  40  40  40  10  20  40  40
> >> > > >   5:  40  40  40  40  20  10  40  40
> >> > > >   6:  40  40  40  40  40  40  10  20
> >> > > >   7:  40  40  40  40  40  40  20  10
> >> > > >
> >> > > > However the same system when only two nodes are online at boot, then 
> >> > > > the
> >> > > > numa topology will look like
> >> > > > node distances:
> >> > > > node   0   1
> >> > > >   0:  10  20
> >> > > >   1:  20  10
> >> > > >
> >> > > > It may be implementation dependent what node_distance(0,3) returns when
> >> > > > node 0 is online and node 3 is offline. In the POWER case, it returns
> >> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> >> > > > distance between nodes is 20. However, that would not be true.
> >> > > >
> >> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> >> > > > the max node distance would be 40.
> >> > > >
> >> > > > To handle such scenarios, let scheduler allow architectures to populate
> >> > > > the distance map. Architectures that like to populate the distance map
> >> > > > can overload arch_populate_distance_map().
> >> > >
> >> > > Why? Why can't your node_distance() DTRT? The arch interface is
> >> > > nr_node_ids and node_distance(), I don't see why we need something new
> >> > > and then replace one special use of it.
> >> > >
> >> > > By virtue of you being able to actually implement this new hook, you
> >> > > supposedly can actually do node_distance() right too.
> >> >
> >> > Since for an offline node, arch interface code doesn't have the info.
> >> > As far as I know/understand, in POWER, unless there is an active memory or
> >> > CPU that's getting onlined, arch can't fetch the correct node distance.
> >> >
> >> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> >> > where X is anything other than 3, is not reliable. The moment node 3 is
> >> > onlined, the node distance is reliable.
> >> >
> >> > This problem will not happen even on POWER if all the nodes have either
> >> > memory or CPUs active at the time of boot.
> >>
> >> But then how can you implement this new hook? Going by the fact that
> >> both nr_node_ids and distance_ref_points_depth are fixed, how many
> >> possible __node_distance() configurations are there left?
> >>
> >
> > distance_ref_point_depth is provided as a different property and is readily
> > available at boot. The new API will just use that. So based on the
> > distance_ref_point_depth, we know all possible node distances for that
> > platform.

Re: [PATCH 2/3] powerpc/numa: Populate distance map correctly

2021-05-24 Thread Srikar Dronamraju
* Valentin Schneider  [2021-05-24 15:16:22]:

> On 20/05/21 21:14, Srikar Dronamraju wrote:
> > +int arch_populate_distance_map(unsigned long *distance_map)
> > +{
> > +   int i;
> > +   int distance = LOCAL_DISTANCE;
> > +
> > +   bitmap_set(distance_map, distance, 1);
> > +
> > +   if (!form1_affinity) {
> > +   bitmap_set(distance_map, REMOTE_DISTANCE, 1);
> > +   return 0;
> > +   }
> > +
> > +   for (i = 0; i < distance_ref_points_depth; i++) {
> > +   distance *= 2;
> > +   bitmap_set(distance_map, distance, 1);
> 
> Do you have guarantees your distance values will always be in the form of
> 
>   LOCAL_DISTANCE * 2^i
> 
> because that certainly isn't true for x86/arm64.
> 

This is true till now. I don't think that's going to change anytime soon, but
we never know what lies ahead.

For all practical purposes, (unless a newer, shinier property is proposed,)
distance_ref_points_depth is going to give us the unique distances.

> > +   }
> > +   return 0;
> > +}
> > +
> >  /*
> >   * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
> >   * info is found.
> > --
> > 2.27.0

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-21 Thread Srikar Dronamraju
* Peter Zijlstra  [2021-05-21 10:14:10]:

> On Fri, May 21, 2021 at 08:08:02AM +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra  [2021-05-20 20:56:31]:
> > 
> > > On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > > > Currently scheduler populates the distance map by looking at distance
> > > > of each node from all other nodes. This should work for most
> > > > architectures and platforms.
> > > > 
> > > > However there are some architectures like POWER that may not expose
> > > > the distance of nodes that are not yet onlined because those resources
> > > > are not yet allocated to the OS instance. Such architectures have
> > > > other means to provide valid distance data for the current platform.
> > > > 
> > > > For example distance info from numactl from a fully populated 8 node
> > > > system at boot may look like this.
> > > > 
> > > > node distances:
> > > > node   0   1   2   3   4   5   6   7
> > > >   0:  10  20  40  40  40  40  40  40
> > > >   1:  20  10  40  40  40  40  40  40
> > > >   2:  40  40  10  20  40  40  40  40
> > > >   3:  40  40  20  10  40  40  40  40
> > > >   4:  40  40  40  40  10  20  40  40
> > > >   5:  40  40  40  40  20  10  40  40
> > > >   6:  40  40  40  40  40  40  10  20
> > > >   7:  40  40  40  40  40  40  20  10
> > > > 
> > > > However, when only two nodes are online at boot, the numa topology of the
> > > > same system will look like
> > > > node distances:
> > > > node   0   1
> > > >   0:  10  20
> > > >   1:  20  10
> > > > 
> > > > It may be implementation dependent what node_distance(0,3) returns when
> > > > node 0 is online and node 3 is offline. In the POWER case, it returns
> > > > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > > > distance between nodes is 20. However, that would not be true.
> > > > 
> > > > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > > > the max node distance would be 40.
> > > > 
> > > > To handle such scenarios, let scheduler allow architectures to populate
> > > > the distance map. Architectures that like to populate the distance map
> > > > can overload arch_populate_distance_map().
> > > 
> > > Why? Why can't your node_distance() DTRT? The arch interface is
> > > nr_node_ids and node_distance(), I don't see why we need something new
> > > and then replace one special use of it.
> > > 
> > > By virtue of you being able to actually implement this new hook, you
> > > supposedly can actually do node_distance() right too.
> > 
> > Since for an offline node, arch interface code doesn't have the info.
> > As far as I know/understand, in POWER, unless there is an active memory or
> > CPU that's getting onlined, arch can't fetch the correct node distance.
> > 
> > Taking the above example: node 3 is offline, then node_distance of (3,X)
> > where X is anything other than 3, is not reliable. The moment node 3 is
> > onlined, the node distance is reliable.
> > 
> > This problem will not happen even on POWER if all the nodes have either
> > memory or CPUs active at the time of boot.
> 
> But then how can you implement this new hook? Going by the fact that
> both nr_node_ids and distance_ref_points_depth are fixed, how many
> possible __node_distance() configurations are there left?
> 

distance_ref_point_depth is provided as a different property and is readily
available at boot. The new API will just use that. So based on the
distance_ref_point_depth, we know all possible node distances for that
platform.
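
For context, distance_ref_point_depth comes from the
"ibm,associativity-reference-points" device-tree property. A rough sketch of
the parse, modelled on arch/powerpc/mm/numa.c (the helper name is made up
here, and error handling is trimmed):

	static const __be32 *distance_ref_points;
	static int distance_ref_points_depth;

	static int __init parse_distance_ref_points(struct device_node *root)
	{
		int len;

		/* One entry per associativity level that contributes to distance */
		distance_ref_points = of_get_property(root,
				"ibm,associativity-reference-points", &len);
		if (!distance_ref_points)
			return -EINVAL;

		distance_ref_points_depth = len / sizeof(int);
		return 0;
	}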

For an offline node, we don't have that specific node's distance_lookup_table
entries. Each node's array has distance_ref_point_depth entries. Without the
distance_lookup_table entries for a node populated, we will not be able to
tell how far the node is with respect to other nodes.

We can look up the correct distance_lookup_table for a node based on the
memory or the CPUs attached to that node. Since in an offline node neither of
them is around, the distance_lookup_table will have stale values.
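
For reference, POWER's __node_distance() in arch/powerpc/mm/numa.c walks the
per-node reference-point arrays roughly like the sketch below (paraphrased
from kernels of this era; exact code may differ). With a stale
distance_lookup_table[] row for an offline node, the loop breaks at the
wrong level and returns a bogus distance:

	static int node_distance_sketch(int a, int b)
	{
		int i, distance = LOCAL_DISTANCE;

		if (!form1_affinity)
			return (a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE;

		for (i = 0; i < distance_ref_points_depth; i++) {
			/* Stop at the first associativity level both nodes share */
			if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
				break;

			/* Double the distance for each level they do not share */
			distance *= 2;
		}

		return distance;
	}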

> The example provided above does not suggest there's much room for
> alternatives, and hence for actual need of this new interface.
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-20 Thread Srikar Dronamraju
* Peter Zijlstra  [2021-05-20 20:56:31]:

> On Thu, May 20, 2021 at 09:14:25PM +0530, Srikar Dronamraju wrote:
> > Currently scheduler populates the distance map by looking at distance
> > of each node from all other nodes. This should work for most
> > architectures and platforms.
> > 
> > However there are some architectures like POWER that may not expose
> > the distance of nodes that are not yet onlined because those resources
> > are not yet allocated to the OS instance. Such architectures have
> > other means to provide valid distance data for the current platform.
> > 
> > For example distance info from numactl from a fully populated 8 node
> > system at boot may look like this.
> > 
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >   0:  10  20  40  40  40  40  40  40
> >   1:  20  10  40  40  40  40  40  40
> >   2:  40  40  10  20  40  40  40  40
> >   3:  40  40  20  10  40  40  40  40
> >   4:  40  40  40  40  10  20  40  40
> >   5:  40  40  40  40  20  10  40  40
> >   6:  40  40  40  40  40  40  10  20
> >   7:  40  40  40  40  40  40  20  10
> > 
> > However, when only two nodes are online at boot, the numa topology of the
> > same system will look like
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> > 
> > It may be implementation dependent what node_distance(0,3) returns when
> > node 0 is online and node 3 is offline. In the POWER case, it returns
> > LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
> > distance between nodes is 20. However, that would not be true.
> > 
> > When Nodes are onlined and CPUs from those nodes are hotplugged,
> > the max node distance would be 40.
> > 
> > To handle such scenarios, let scheduler allow architectures to populate
> > the distance map. Architectures that like to populate the distance map
> > can overload arch_populate_distance_map().
> 
> Why? Why can't your node_distance() DTRT? The arch interface is
> nr_node_ids and node_distance(), I don't see why we need something new
> and then replace one special use of it.
> 
> By virtue of you being able to actually implement this new hook, you
> supposedly can actually do node_distance() right too.

Since for an offline node, arch interface code doesn't have the info.
As far as I know/understand, in POWER, unless there is an active memory or
CPU that's getting onlined, arch can't fetch the correct node distance.

Taking the above example: node 3 is offline, then node_distance of (3,X)
where X is anything other than 3, is not reliable. The moment node 3 is
onlined, the node distance is reliable.

This problem will not happen even on POWER if all the nodes have either
memory or CPUs active at the time of boot.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH 3/3] sched/topology: Skip updating masks for non-online nodes

2021-05-20 Thread Srikar Dronamraju
Currently the scheduler doesn't check if a node is online before adding CPUs
to the node mask. However, on some architectures, node distance is only
available for nodes that are online. It is not clear how much the node
distance can be relied upon when one of the nodes is offline.

If said node distance is fake (since one of the nodes is offline) and the
actual node distance is different, then the cpumask of such nodes will be
wrong when the nodes become online.

This can cause topology_span_sane to throw up a warning message, with the
rest of the topology not being updated properly.

Resolve this by skipping update of cpumask for nodes that are not
online.

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Scott Cheloha 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
---
 kernel/sched/topology.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ccb9aff59add..ba0555e83548 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1731,6 +1731,9 @@ void sched_init_numa(void)
sched_domains_numa_masks[i][j] = mask;
 
for_each_node(k) {
+   if (!node_online(j))
+   continue;
+
if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
sched_numa_warn("Node-distance not symmetric");
 
@@ -1793,6 +1796,9 @@ void sched_domains_numa_masks_set(unsigned int cpu)
 
for (i = 0; i < sched_domains_numa_levels; i++) {
for (j = 0; j < nr_node_ids; j++) {
+   if (!node_online(j))
+   continue;
+
if (node_distance(j, node) <= sched_domains_numa_distance[i])
cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
}
-- 
2.27.0



[PATCH 2/3] powerpc/numa: Populate distance map correctly

2021-05-20 Thread Srikar Dronamraju
As per PAPR, which defines the OS-to-hypervisor interface on POWER,
there is no way to calculate the node_distance between 2 nodes when
either of the nodes is offline. However the scheduler needs the distance
map to be populated at boot time. On POWER, this information is
provided within the distance_ref_points_depth array, which needs to be
parsed to extract the potential node distances.

To handle this scenario, let's overload arch_populate_distance_map()
to provide all the distances that are possible on the current
platform.

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Scott Cheloha 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h |  3 +++
 arch/powerpc/mm/numa.c  | 19 +++
 2 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index e4db64c0e184..d7605d833b8d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -22,6 +22,9 @@ struct drmem_lmb;
   cpu_all_mask :   \
   node_to_cpumask_map[node])
 
+#define arch_populate_distance_map arch_populate_distance_map
+extern int arch_populate_distance_map(unsigned long *distance_map);
+
 struct pci_bus;
 #ifdef CONFIG_PCI
 extern int pcibus_to_node(struct pci_bus *bus);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..9a225b29814a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -221,6 +221,25 @@ static void initialize_distance_lookup_table(int nid,
}
 }
 
+int arch_populate_distance_map(unsigned long *distance_map)
+{
+   int i;
+   int distance = LOCAL_DISTANCE;
+
+   bitmap_set(distance_map, distance, 1);
+
+   if (!form1_affinity) {
+   bitmap_set(distance_map, REMOTE_DISTANCE, 1);
+   return 0;
+   }
+
+   for (i = 0; i < distance_ref_points_depth; i++) {
+   distance *= 2;
+   bitmap_set(distance_map, distance, 1);
+   }
+   return 0;
+}
+
 /*
  * Returns nid in the range [0..nr_node_ids], or -1 if no useful NUMA
  * info is found.
-- 
2.27.0



[PATCH 0/3] Skip numa distance for offline nodes

2021-05-20 Thread Srikar Dronamraju
Geetika reported yet another trace while doing a dlpar CPU add
operation. This was true even on top of a recent commit
6980d13f0dd1 ("powerpc/smp: Set numa node before updating mask") which fixed
a similar trace.

WARNING: CPU: 40 PID: 2954 at kernel/sched/topology.c:2088 build_sched_domains+0x6e8/0x1540
Modules linked in: nft_counter nft_compat rpadlpar_io rpaphp mptcp_diag
xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag
netlink_diag bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables
nfnetlink dm_multipath pseries_rng xts vmx_crypto binfmt_misc ip_tables xfs
libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror
dm_region_hash dm_log dm_mod fuse
CPU: 40 PID: 2954 Comm: kworker/40:0 Not tainted 5.13.0-rc1+ #19
Workqueue: events cpuset_hotplug_workfn
NIP:  c01de588 LR: c01de584 CTR: 006cd36c
REGS: c0002772b250 TRAP: 0700   Not tainted  (5.12.0-rc5-master+)
MSR:  80029033   CR: 28828422  XER: 000d
CFAR: c020c2f8 IRQMASK: 0
GPR00: c01de584 c0002772b4f0 c1f55400 0036
GPR04: c063c6368010 c063c63f0a00 0027 c063c6368018
GPR08: 0023 c063c636ef48 0063c4de c063bfe9ffe8
GPR12: 28828424 c063fe68fe80  0417
GPR16: 0028 c740dcd8 c205db68 c1a3a4a0
GPR20: c00091ed7d20 c00091ed8520 0001 
GPR24: c000113a9600 0190 0028 c10e3ac0
GPR28:  c740dd00 c000317b5900 0190
NIP [c01de588] build_sched_domains+0x6e8/0x1540
LR [c01de584] build_sched_domains+0x6e4/0x1540
Call Trace:
[c0002772b4f0] [c01de584] build_sched_domains+0x6e4/0x1540 (unreliable)
[c0002772b640] [c01e08dc] partition_sched_domains_locked+0x3ec/0x530
[c0002772b6e0] [c02a2144] rebuild_sched_domains_locked+0x524/0xbf0
[c0002772b7e0] [c02a5620] rebuild_sched_domains+0x40/0x70
[c0002772b810] [c02a58e4] cpuset_hotplug_workfn+0x294/0xe20
[c0002772bc30] [c0187510] process_one_work+0x300/0x670
[c0002772bd10] [c01878f8] worker_thread+0x78/0x520
[c0002772bda0] [c01937f0] kthread+0x1a0/0x1b0
[c0002772be10] [c000d6ec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7ee5bb78 7f0ac378 7f29cb78 7f68db78 7f46d378 7f84e378 f8610068 3c62ff19
fbe10060 3863e558 4802dd31 6000 <0fe0> 3920fff4 f9210080 e86100b0

Detailed analysis of the failing scenario showed that the span in
question belongs to NODE domain and further the cpumasks for some
cpus in NODE overlapped. There are two possible reasons how we ended
up here:

(1) The numa node was offline or blank with no CPUs or memory. Hence
the sched_max_numa_distance could not be set correctly, or the
sched_domains_numa_distance happened to be partially populated.

(2) Depending on a bogus node_distance of an offline node to populate
cpumasks is the issue. On the POWER platform, the node_distance is
correctly available only for an online node which has some CPU or
memory resource associated with it.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when only two nodes are online at boot, the numa topology of the
same system will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

This series tries to fix both these problems.
Note: These problems are now visible, thanks to 
Commit ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't
(partially) overlap")


Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Scott Cheloha 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 

Srikar Dronamraju (3):
  sched/topology: Allow archs to populate distance map
  powerpc/numa: Populate distance map correctly
  sched/topology: Skip updating masks for non-online nodes

 arch/powerpc/include/asm/topology.h |  3 +++
 arch/powerpc/mm/numa.c  | 19 +++
 kernel/sched/topology.c | 38 +
 3 files changed, 50 insertions(+), 10 deletions(-)


base-commit: 1699949d3314e5d1956fb082e4cd4798bf6149fc
-- 
2.27.0



[PATCH 1/3] sched/topology: Allow archs to populate distance map

2021-05-20 Thread Srikar Dronamraju
Currently scheduler populates the distance map by looking at distance
of each node from all other nodes. This should work for most
architectures and platforms.

However there are some architectures like POWER that may not expose
the distance of nodes that are not yet onlined because those resources
are not yet allocated to the OS instance. Such architectures have
other means to provide valid distance data for the current platform.

For example distance info from numactl from a fully populated 8 node
system at boot may look like this.

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when only two nodes are online at boot, the numa topology of the
same system will look like
node distances:
node   0   1
  0:  10  20
  1:  20  10

It may be implementation dependent what node_distance(0,3) returns when
node 0 is online and node 3 is offline. In the POWER case, it returns
LOCAL_DISTANCE(10). Here at boot the scheduler would assume that the max
distance between nodes is 20. However, that would not be true.

When Nodes are onlined and CPUs from those nodes are hotplugged,
the max node distance would be 40.

To handle such scenarios, let scheduler allow architectures to populate
the distance map. Architectures that like to populate the distance map
can overload arch_populate_distance_map().

Cc: LKML 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Scott Cheloha 
Cc: Gautham R Shenoy 
Cc: Dietmar Eggemann 
Cc: Mel Gorman 
Cc: Vincent Guittot 
Cc: Rik van Riel 
Cc: Geetika Moolchandani 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
---
 kernel/sched/topology.c | 32 ++--
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 053115b55f89..ccb9aff59add 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1630,6 +1630,26 @@ static void init_numa_topology_type(void)
 
 #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
 
+#ifndef arch_populate_distance_map
+static int arch_populate_distance_map(unsigned long *distance_map)
+{
+   int i, j;
+
+   for (i = 0; i < nr_node_ids; i++) {
+   for (j = 0; j < nr_node_ids; j++) {
+   int distance = node_distance(i, j);
+
+   if (distance < LOCAL_DISTANCE || distance >= 
NR_DISTANCE_VALUES) {
+   sched_numa_warn("Invalid distance value range");
+   return -1;
+   }
+   bitmap_set(distance_map, distance, 1);
+   }
+   }
+   return 0;
+}
+#endif
+
 void sched_init_numa(void)
 {
struct sched_domain_topology_level *tl;
@@ -1646,18 +1666,10 @@ void sched_init_numa(void)
return;
 
bitmap_zero(distance_map, NR_DISTANCE_VALUES);
-   for (i = 0; i < nr_node_ids; i++) {
-   for (j = 0; j < nr_node_ids; j++) {
-   int distance = node_distance(i, j);
 
-   if (distance < LOCAL_DISTANCE || distance >= 
NR_DISTANCE_VALUES) {
-   sched_numa_warn("Invalid distance value range");
-   return;
-   }
+   if (arch_populate_distance_map(distance_map))
+   return;
 
-   bitmap_set(distance_map, distance, 1);
-   }
-   }
/*
 * We can now figure out how many unique distance values there are and
 * allocate memory accordingly.
-- 
2.27.0



Re: [PATCH v3 5/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-05-19 Thread Srikar Dronamraju
* Peter Zijlstra  [2021-05-19 11:59:48]:

> On Tue, May 18, 2021 at 12:07:40PM -0700, Ricardo Neri wrote:
> > On Fri, May 14, 2021 at 07:14:15PM -0700, Ricardo Neri wrote:
> > > On Fri, May 14, 2021 at 11:47:45AM +0200, Peter Zijlstra wrote:
> 
> > > > So I'm thinking that this is a property of having ASYM_PACKING at a core
> > > > level, rather than some arch special. Wouldn't something like this be
> > > > more appropriate?
> 
> > > Thanks Peter for the quick review! This makes sense to me. The only
> > > reason we proposed arch_asym_check_smt_siblings() is because we were
> > > about breaking powerpc (I need to study how they set priorities for SMT,
> > > if applicable). If you think this is not an issue I can post a
> > > v4 with this update.
> > 
> > As far as I can see, priorities in powerpc are set by the CPU number.
> > However, I am not sure how CPUs are enumerated? If CPUs in brackets are
> > SMT sibling, Does an enumeration looks like A) [0, 1], [2, 3] or B) [0, 2],
> > [1, 3]? I guess B is the right answer. Otherwise, both SMT siblings of a
> > core would need to be busy before a new core is used.
> > 
> > Still, I think the issue described in the cover letter may be
> > reproducible in powerpc as well. If CPU3 is offlined, and [0, 2] pulled
> > tasks from [1, -] so that both CPU0 and CPU2 become busy, CPU1 would not be
> > able to help since CPU0 has the highest priority.
> > 
> > I am cc'ing the linuxppc list to get some feedback.
> 
> IIRC the concern with Power is that their Cores can go faster if the
> higher SMT siblings are unused.
> 
> That is, suppose you have an SMT4 Core with only a single active task,
> then if only SMT0 is used it can reach max performance, but if the
> active sibling is SMT1 it can not reach max performance, and if the only
> active sibling is SMT2 it goes slower still.
> 
> So they need to pack the tasks to the lowest SMT siblings, and have the
> highest SMT siblings idle (where possible) in order to increase
> performance.
> 
> 

If you are referring to SD_ASYM_PACKING, then packing tasks to the lowest SMT
sibling was needed in the POWER7 timeframe. If there was only one thread
running, then running it on the lowest sibling provided the best performance,
as if running in single threaded mode.

However recent chips like POWER8/POWER9/POWER10 don't need SD_ASYM_PACKING
since the hardware itself does the switch. So even if a task is placed on a
higher sibling within the core, we don't need to switch the task to a lower
sibling for it to perform better. Now, with only one thread running on any
sibling, it is expected to run in single threaded mode.
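
For background, the generic scheduler's default ASYM_PACKING priority
already prefers lower-numbered CPUs. kernel/sched/fair.c provides a weak
hook that architectures may override (as x86 does for ITMT):

	/*
	 * Default CPU priority for ASYM_PACKING: a lower CPU id means a
	 * higher priority, i.e. "pack onto the lowest-numbered sibling".
	 */
	int __weak arch_asym_cpu_priority(int cpu)
	{
		return -cpu;
	}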

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] ppc64/numa: consider the max numa node for migratable LPAR

2021-05-10 Thread Srikar Dronamraju
* Laurent Dufour  [2021-04-29 20:19:01]:

> When a LPAR is migratable, we should consider the maximum possible NUMA
> node instead the number of NUMA node from the actual system.
> 
> The DT property 'ibm,current-associativity-domains' is defining the maximum
> number of nodes the LPAR can see when running on that box. But if the LPAR
> is being migrated to another box, it may see up to the nodes defined by
> 'ibm,max-associativity-domains'. So if a LPAR is migratable, that value
> should be used.
> 
> Unfortunately, there is no easy way to know if a LPAR is migratable or
> not. The hypervisor is exporting the property 'ibm,migratable-partition' in
> the case it set to migrate partition, but that would not mean that the
> current partition is migratable.
> 
> Without that patch, when a LPAR is started on a 2 nodes box and then
> migrated to a 3 nodes box, the hypervisor may spread the LPAR's CPUs on the
> 3rd node. In that case if a CPU from that 3rd node is added to the LPAR, it
> will be wrongly assigned to the node because the kernel has been set to use


> up to 2 nodes (the configuration of the departure node). With that patch
> applies, the CPU is correctly added to the 3rd node.

You probably meant, "With this patch applied"

Also you may want to add a fixes tag:

> Cc: Srikar Dronamraju 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/mm/numa.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f2bf98bdcea2..673fa6e47850 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -893,7 +893,7 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  static void __init find_possible_nodes(void)
>  {
>   struct device_node *rtas;
> - const __be32 *domains;
> + const __be32 *domains = NULL;
>   int prop_length, max_nodes;
>   u32 i;
> 
> @@ -909,9 +909,14 @@ static void __init find_possible_nodes(void)
>* it doesn't exist, then fallback on ibm,max-associativity-domains.
>* Current denotes what the platform can support compared to max
>* which denotes what the Hypervisor can support.
> +  *
> +  * If the LPAR is migratable, new nodes might be activated after a LPM,
> +  * so we should consider the max number in that case.
>*/
> - domains = of_get_property(rtas, "ibm,current-associativity-domains",
> - &prop_length);
> + if (!of_get_property(of_root, "ibm,migratable-partition", NULL))
> + domains = of_get_property(rtas,
> +   "ibm,current-associativity-domains",
> +   &prop_length);
>   if (!domains) {
>   domains = of_get_property(rtas, "ibm,max-associativity-domains",
>   &prop_length);
> @@ -920,6 +925,9 @@ static void __init find_possible_nodes(void)
>   }
> 
>   max_nodes = of_read_number(&domains[min_common_depth], 1);
> + printk(KERN_INFO "Partition configured for %d NUMA nodes.\n",
> +max_nodes);
> +

Another nit:
you may want to make this pr_info instead of printk

>   for (i = 0; i < max_nodes; i++) {
>   if (!node_possible(i))
>   node_set(i, node_possible_map);
> -- 
> 2.31.1
> 

Otherwise looks good to me.

Reviewed-by: Srikar Dronamraju 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 3/3] powerpc/smp: Cache CPU to chip lookup

2021-04-16 Thread Srikar Dronamraju
* Gautham R Shenoy  [2021-04-16 21:27:48]:

> On Thu, Apr 15, 2021 at 11:21:10PM +0530, Srikar Dronamraju wrote:
> > * Gautham R Shenoy  [2021-04-15 22:49:21]:
> > 
> > > > 
> > > > +int *chip_id_lookup_table;
> > > > +
> > > >  #ifdef CONFIG_PPC64
> > > >  int __initdata iommu_is_off;
> > > >  int __initdata iommu_force_on;
> > > > @@ -914,13 +916,22 @@ EXPORT_SYMBOL(of_get_ibm_chip_id);
> > > >  int cpu_to_chip_id(int cpu)
> > > >  {
> > > > struct device_node *np;
> > > > +   int ret = -1, idx;
> > > > +
> > > > +   idx = cpu / threads_per_core;
> > > > +   if (chip_id_lookup_table && chip_id_lookup_table[idx] != -1)
> > > 
> > 
> > > The value -1 is ambiguous since we won't be able to determine if
> > > it is because we haven't yet made a of_get_ibm_chip_id() call
> > > or if of_get_ibm_chip_id() call was made and it returned a -1.
> > > 
> > 
> > We don't allocate chip_id_lookup_table unless cpu_to_chip_id() returns a
> > non -1 value for the boot-cpuid. So this ensures that we don't
> > unnecessarily allocate chip_id_lookup_table. Also I check for
> > chip_id_lookup_table before calling cpu_to_chip_id() for other CPUs.
> > So this avoids the overhead of calling cpu_to_chip_id() for platforms that
> > don't support it.  Also it's most likely that if the
> > chip_id_lookup_table is initialized then the of_get_ibm_chip_id() call
> > would return a valid value.
> > 
> > + Below we are only populating the lookup table, only when the
> > of_get_cpu_node is valid.
> > 
> > So I don't see any drawbacks of initializing it to -1. Do you see any?
> 
> 
> Only if other callers of cpu_to_chip_id() don't check for whether the
> chip_id_lookup_table() has been allocated or not. From a code
> readability point of view, it is easier to have that check  this inside
> cpu_to_chip_id() instead of requiring all its callers to make that
> check.
> 

I didn't understand your comment. However, let me reiterate what I said
earlier. We don't have control over who calls cpu_to_chip_id() and when.
cpu_to_chip_id() might be called for a non-present CPU, in which case it
will return -1. Should we cache that result or not?

If we cache it, we will return a wrong value when the CPU later turns out
to be present. If we cache and retry, then having one value for
"uninitialized" and another for "invalid" is just the same as having one
value for both; we only end up adding more confusing code. At least to me,
the code isn't readable if we retry for -1 and -2 too. After a few years,
we ourselves will wonder why we have two values when we check for them and
act on them in the same way.
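
To make the trade-off concrete, the two-sentinel variant suggested above
might look like the sketch below. This is hypothetical: CHIP_ID_UNINITIALIZED
and chip_id_lookup_slow() are invented names, not code from the posted patch:

	#define CHIP_ID_UNINITIALIZED	S32_MIN	/* lookup never attempted */

	int cpu_to_chip_id(int cpu)
	{
		int idx = cpu / threads_per_core;

		/* -1 would then strictly mean: looked up, no chip-id exists */
		if (chip_id_lookup_table &&
		    chip_id_lookup_table[idx] != CHIP_ID_UNINITIALIZED)
			return chip_id_lookup_table[idx];

		/* invented helper: device-tree walk, then cache the result */
		return chip_id_lookup_slow(cpu, idx);
	}

Either way, a -1 cached for a CPU that was not yet present has to be retried
once the CPU shows up, which is the argument above for a single sentinel.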

> > 
> > > Thus, perhaps we can initialize chip_id_lookup_table[idx] with a
> > > different unique negative value. How about S32_MIN ? and check
> > > chip_id_lookup_table[idx] is different here ?
> > > 
> > 
> > I had initially initialized it to -2, but then I thought we were adding
> > more confusion than necessary and it was not solving any issues.
> > 
> > 
> > -- 
> > Thanks and Regards
> > Srikar Dronamraju

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] powerpc/smp: Reintroduce cpu_core_mask

2021-04-15 Thread Srikar Dronamraju
* David Gibson  [2021-04-16 13:21:34]:

Thanks for having a look at the patches.

> On Thu, Apr 15, 2021 at 05:39:32PM +0530, Srikar Dronamraju wrote:
> > Daniel reported that with Commit 4ca234a9cbd7 ("powerpc/smp: Stop
> > updating cpu_core_mask") QEMU was unable to set single NUMA node SMP
> > topologies such as:
> >  -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
> >  i.e. he expected 2 sockets in one NUMA node.
> 
> Well, strictly speaking, you can still set that toplogy in qemu but a
> PAPR guest with that commit will show as having 1 socket in lscpu and
> similar things.
> 

Right, I did mention the o/p of lscpu in QEMU with the said commit and
with the new patches in the cover letter. Somehow I goofed up the cc
list for the cover letter.

Reference for the cover letter:
https://lore.kernel.org/linuxppc-dev/20210415120934.232271-1-sri...@linux.vnet.ibm.com/t/#u

> Basically, this is because PAPR has no meaningful distinction between
> cores and sockets.  So it's kind of a cosmetic problem, but it is a
> user-unexpected behaviour that it would be nice to avoid if it's not
> excessively difficult.
> 
> > The above commit helped to reduce boot time on Large Systems for
> > example 4096 vCPU single socket QEMU instance. PAPR is silent on
> > having more than one socket within a NUMA node.
> > 
> > cpu_core_mask and cpu_cpu_mask for any CPU would be same unless the
> > number of sockets is different from the number of NUMA nodes.
> 
> Number of sockets being different from number of NUMA nodes is routine
> in qemu, and I don't think it's something we should enforce.
> 
> > One option is to reintroduce cpu_core_mask but use a slightly
> > different method to arrive at the cpu_core_mask. Previously each CPU's
> > chip-id would be compared with all other CPU's chip-id to verify if
> > both the CPUs were related at the chip level. Now if a CPU 'A' is
> > found related / (unrelated) to another CPU 'B', all the thread
> > siblings of 'A' and thread siblings of 'B' are automatically marked as
> > related / (unrelated).
> > 
> > Also if a platform doesn't support ibm,chip-id property, i.e its
> > cpu_to_chip_id returns -1, cpu_core_map holds a copy of
> > cpu_cpu_mask().
> 
> Yeah, the other weirdness here is that ibm,chip-id isn't a PAPR
> property at all - it was added for powernv.  We then added it to qemu
> for PAPR guests because that was the way at the time to get the guest
> to advertise the expected number of sockets.  It therefore basically
> *only* exists on PAPR/qemu for that purpose, so if it's not serving it
> we need to come up with something else.
> 

Do you have ideas on what that something could be like? If that's more
beneficial, then we could move over to that scheme. Also, apart from
ibm,chip-id not being a PAPR property, do you have any other concerns
with it?


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 3/3] powerpc/smp: Cache CPU to chip lookup

2021-04-15 Thread Srikar Dronamraju
* Gautham R Shenoy  [2021-04-15 22:49:21]:

> > 
> > +int *chip_id_lookup_table;
> > +
> >  #ifdef CONFIG_PPC64
> >  int __initdata iommu_is_off;
> >  int __initdata iommu_force_on;
> > @@ -914,13 +916,22 @@ EXPORT_SYMBOL(of_get_ibm_chip_id);
> >  int cpu_to_chip_id(int cpu)
> >  {
> > struct device_node *np;
> > +   int ret = -1, idx;
> > +
> > +   idx = cpu / threads_per_core;
> > +   if (chip_id_lookup_table && chip_id_lookup_table[idx] != -1)
> 

> The value -1 is ambiguous since we won't be able to determine if
> it is because we haven't yet made a of_get_ibm_chip_id() call
> or if of_get_ibm_chip_id() call was made and it returned a -1.
> 

We don't allocate chip_id_lookup_table unless cpu_to_chip_id() returns a
non -1 value for the boot-cpuid. So this ensures that we don't
unnecessarily allocate chip_id_lookup_table. Also I check for
chip_id_lookup_table before calling cpu_to_chip_id() for other CPUs.
So this avoids the overhead of calling cpu_to_chip_id() for platforms that
don't support it.  Also it's most likely that if the
chip_id_lookup_table is initialized then the of_get_ibm_chip_id() call
would return a valid value.

+ Below we are only populating the lookup table, only when the
of_get_cpu_node is valid.

So I don't see any drawbacks of initializing it to -1. Do you see any?

> Thus, perhaps we can initialize chip_id_lookup_table[idx] with a
> different unique negative value. How about S32_MIN ? and check
> chip_id_lookup_table[idx] is different here ?
> 

I had initially initialized it to -2, but then I thought we were adding
more confusion than necessary and it was not solving any issues.


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] powerpc/smp: Reintroduce cpu_core_mask

2021-04-15 Thread Srikar Dronamraju
* Gautham R Shenoy  [2021-04-15 22:41:34]:

> Hi Srikar,
> 
> 

Thanks for taking a look.

> > @@ -1485,12 +1486,36 @@ static void add_cpu_to_masks(int cpu)
> > add_cpu_to_smallcore_masks(cpu);
> > 
> > /* In CPU-hotplug path, hence use GFP_ATOMIC */
> > -   alloc_cpumask_var_node(, GFP_ATOMIC, cpu_to_node(cpu));
> > +   ret = alloc_cpumask_var_node(, GFP_ATOMIC, cpu_to_node(cpu));
> > update_mask_by_l2(cpu, );
> > 
> > if (has_coregroup_support())
> > update_coregroup_mask(cpu, );
> > 
> > +   if (chip_id == -1 || !ret) {
> > +   cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
> > +   goto out;
> > +   }
> > +
> > +   if (shared_caches)
> > +   submask_fn = cpu_l2_cache_mask;
> > +
> > +   /* Update core_mask with all the CPUs that are part of submask */
> > +   or_cpumasks_related(cpu, cpu, submask_fn, cpu_core_mask);
> >
> 
> If coregroups exist, we can add the cpus of the coregroup to the
> cpu_core_mask thereby reducing the scope of the for_each_cpu() search
> below. This will still cut down the time on Baremetal systems
> supporting coregroups.
> 

Yes, once we upstream coregroup support for baremetal, we should look
at adding it. Also do note, the number of CPUs we support on baremetal is
comparatively lower than on PowerVM + QEMU. And more importantly, the
number of cores per coregroup is also very low. So the optimization
may not yield much of a benefit.

It's only in the QEMU case, where we end up having too many cores in
the same chip, that we see a drastic increase in the boot-up time.
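
If it were added, the optimization might look roughly like this inside
add_cpu_to_masks(), ahead of the for_each_cpu() scan (a hypothetical
sketch, not a posted patch):

	/*
	 * Hypothetical: seed cpu_core_mask with the already-computed
	 * coregroup CPUs, so the subsequent scan of cpu_online_mask has
	 * fewer remaining candidates whose chip-id needs comparing.
	 */
	if (has_coregroup_support())
		or_cpumasks_related(cpu, cpu, cpu_coregroup_mask, cpu_core_mask);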

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH 1/3] powerpc/smp: Reintroduce cpu_core_mask

2021-04-15 Thread Srikar Dronamraju
Daniel reported that with Commit 4ca234a9cbd7 ("powerpc/smp: Stop
updating cpu_core_mask") QEMU was unable to set single NUMA node SMP
topologies such as:
 -smp 8,maxcpus=8,cores=2,threads=2,sockets=2
 i.e. he expected 2 sockets in one NUMA node.

The above commit helped to reduce boot time on Large Systems for
example 4096 vCPU single socket QEMU instance. PAPR is silent on
having more than one socket within a NUMA node.

cpu_core_mask and cpu_cpu_mask for any CPU would be same unless the
number of sockets is different from the number of NUMA nodes.

One option is to reintroduce cpu_core_mask but use a slightly
different method to arrive at the cpu_core_mask. Previously each CPU's
chip-id would be compared with all other CPU's chip-id to verify if
both the CPUs were related at the chip level. Now if a CPU 'A' is
found related / (unrelated) to another CPU 'B', all the thread
siblings of 'A' and thread siblings of 'B' are automatically marked as
related / (unrelated).

Also if a platform doesn't support ibm,chip-id property, i.e its
cpu_to_chip_id returns -1, cpu_core_map holds a copy of
cpu_cpu_mask().

Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask")
Cc: linuxppc-dev@lists.ozlabs.org
Cc: qemu-...@nongnu.org
Cc: Cedric Le Goater 
Cc: David Gibson 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Reported-by: Daniel Henrique Barboza 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/smp.h |  5 +
 arch/powerpc/kernel/smp.c  | 39 --
 2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 7a13bc20f0a0..47081a9e13ca 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -121,6 +121,11 @@ static inline struct cpumask *cpu_sibling_mask(int cpu)
return per_cpu(cpu_sibling_map, cpu);
 }
 
+static inline struct cpumask *cpu_core_mask(int cpu)
+{
+   return per_cpu(cpu_core_map, cpu);
+}
+
 static inline struct cpumask *cpu_l2_cache_mask(int cpu)
 {
return per_cpu(cpu_l2_cache_map, cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a1070d..5c7ce1d50631 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1057,17 +1057,12 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
local_memory_node(numa_cpu_lookup_table[cpu]));
}
 #endif
-   /*
-* cpu_core_map is now more updated and exists only since
-* its been exported for long. It only will have a snapshot
-* of cpu_cpu_mask.
-*/
-   cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
}
 
/* Init the cpumasks so the boot CPU is related to itself */
cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
+   cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
 
if (has_coregroup_support())
cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
@@ -1408,6 +1403,9 @@ static void remove_cpu_from_masks(int cpu)
set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
}
 
+   for_each_cpu(i, cpu_core_mask(cpu))
+   set_cpus_unrelated(cpu, i, cpu_core_mask);
+
if (has_coregroup_support()) {
for_each_cpu(i, cpu_coregroup_mask(cpu))
set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
@@ -1468,8 +1466,11 @@ static void update_coregroup_mask(int cpu, cpumask_var_t *mask)
 
 static void add_cpu_to_masks(int cpu)
 {
+   struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
int first_thread = cpu_first_thread_sibling(cpu);
+   int chip_id = cpu_to_chip_id(cpu);
cpumask_var_t mask;
+   bool ret;
int i;
 
/*
@@ -1485,12 +1486,36 @@ static void add_cpu_to_masks(int cpu)
add_cpu_to_smallcore_masks(cpu);
 
/* In CPU-hotplug path, hence use GFP_ATOMIC */
-   alloc_cpumask_var_node(, GFP_ATOMIC, cpu_to_node(cpu));
+   ret = alloc_cpumask_var_node(, GFP_ATOMIC, cpu_to_node(cpu));
update_mask_by_l2(cpu, );
 
if (has_coregroup_support())
update_coregroup_mask(cpu, );
 
+   if (chip_id == -1 || !ret) {
+   cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
+   goto out;
+   }
+
+   if (shared_caches)
+   submask_fn = cpu_l2_cache_mask;
+
+   /* Update core_mask with all the CPUs that are part of submask */
+   or_cpumasks_related(cpu, cpu, submask_fn, cpu_core_mask);
+
+   /* Skip all CPUs already part of current CPU core mask */
+   cpumask_andnot(mask, cpu_online_mask, cpu_core_mask(cpu));
+
+   

[PATCH 3/3] powerpc/smp: Cache CPU to chip lookup

2021-04-15 Thread Srikar Dronamraju
On systems with a large number of CPUs per node, even with the filtered
matching of related CPUs, there can be a large number of calls to
cpu_to_chip_id() for the same CPU. For example, with a 4096 vCPU, 1 node
QEMU configuration with 4 threads per core, the system could see up to 1024
calls to cpu_to_chip_id() for the same CPU. On a given system,
cpu_to_chip_id() for a given CPU would always return the same value. Hence
cache the result in a lookup table for use in subsequent calls.

Since all CPUs sharing the same core will belong to the same chip, the
lookup_table has an entry for one CPU per core.  chip_id_lookup_table is
not being freed and would be used on subsequent CPU online post CPU
offline.

Suggested-by: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: qemu-...@nongnu.org
Cc: Cedric Le Goater 
Cc: David Gibson 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Reported-by: Daniel Henrique Barboza 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/prom.c | 19 +++
 arch/powerpc/kernel/smp.c  | 21 +++--
 3 files changed, 35 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 47081a9e13ca..03b3d010cbab 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -31,6 +31,7 @@ extern u32 *cpu_to_phys_id;
 extern bool coregroup_enabled;
 
 extern int cpu_to_chip_id(int cpu);
+extern int *chip_id_lookup_table;
 
 #ifdef CONFIG_SMP
 
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9a4797d1d40d..6d2e4a5bc471 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -65,6 +65,8 @@
 #define DBG(fmt...)
 #endif
 
+int *chip_id_lookup_table;
+
 #ifdef CONFIG_PPC64
 int __initdata iommu_is_off;
 int __initdata iommu_force_on;
@@ -914,13 +916,22 @@ EXPORT_SYMBOL(of_get_ibm_chip_id);
 int cpu_to_chip_id(int cpu)
 {
struct device_node *np;
+   int ret = -1, idx;
+
+   idx = cpu / threads_per_core;
+   if (chip_id_lookup_table && chip_id_lookup_table[idx] != -1)
+   return chip_id_lookup_table[idx];
 
np = of_get_cpu_node(cpu, NULL);
-   if (!np)
-   return -1;
+   if (np) {
+   ret = of_get_ibm_chip_id(np);
+   of_node_put(np);
+
+   if (chip_id_lookup_table)
+   chip_id_lookup_table[idx] = ret;
+   }
 
-   of_node_put(np);
-   return of_get_ibm_chip_id(np);
+   return ret;
 }
 EXPORT_SYMBOL(cpu_to_chip_id);
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5c7ce1d50631..50520fbea424 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1073,6 +1073,20 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
cpu_smallcore_mask(boot_cpuid));
}
 
+   if (cpu_to_chip_id(boot_cpuid) != -1) {
+   int idx = num_possible_cpus() / threads_per_core;
+
+   /*
+* All threads of a core will all belong to the same core,
+* chip_id_lookup_table will have one entry per core.
+* Assumption: if boot_cpuid doesn't have a chip-id, then no
+* other CPUs, will also not have chip-id.
+*/
+   chip_id_lookup_table = kcalloc(idx, sizeof(int), GFP_KERNEL);
+   if (chip_id_lookup_table)
+   memset(chip_id_lookup_table, -1, sizeof(int) * idx);
+   }
+
if (smp_ops && smp_ops->probe)
smp_ops->probe();
 }
@@ -1468,8 +1482,8 @@ static void add_cpu_to_masks(int cpu)
 {
struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
int first_thread = cpu_first_thread_sibling(cpu);
-   int chip_id = cpu_to_chip_id(cpu);
cpumask_var_t mask;
+   int chip_id = -1;
bool ret;
int i;
 
@@ -1492,7 +1506,10 @@ static void add_cpu_to_masks(int cpu)
if (has_coregroup_support())
update_coregroup_mask(cpu, );
 
-   if (chip_id == -1 || !ret) {
+   if (chip_id_lookup_table && ret)
+   chip_id = cpu_to_chip_id(cpu);
+
+   if (chip_id == -1) {
cpumask_copy(per_cpu(cpu_core_map, cpu), cpu_cpu_mask(cpu));
goto out;
}
-- 
2.25.1



[PATCH 2/3] Revert "powerpc/topology: Update topology_core_cpumask"

2021-04-15 Thread Srikar Dronamraju
Now that cpu_core_mask has been reintroduced, lets revert
commit 4bce545903fa ("powerpc/topology: Update topology_core_cpumask")

Post this commit, lscpu should reflect topologies as requested by a user
when a QEMU instance is launched with NUMA spanning multiple sockets.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: qemu-...@nongnu.org
Cc: Cedric Le Goater 
Cc: David Gibson 
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Gautham R Shenoy 
Reported-by: Daniel Henrique Barboza 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/topology.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 3beeb030cd78..e4db64c0e184 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -126,7 +126,7 @@ static inline int cpu_to_coregroup_id(int cpu)
 #define topology_physical_package_id(cpu)  (cpu_to_chip_id(cpu))
 
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
-#define topology_core_cpumask(cpu) (cpu_cpu_mask(cpu))
+#define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)  (cpu_to_core_id(cpu))
 
 #endif
-- 
2.25.1



[PATCH 0/3] Reintroduce cpu_core_mask

2021-04-15 Thread Srikar Dronamraju
Daniel had reported that
 QEMU is now unable to see requested topologies in multi-socket single
 NUMA node configurations.
 -smp 8,maxcpus=8,cores=2,threads=2,sockets=2

This patchset reintroduces cpu_core_mask so that users can see requested
topologies while still maintaining the boot time of very large system
configurations.

It includes caching the chip_id as suggested by Michael Ellerman

4 Threads/Core; 4 cores/Socket; 4 Sockets/Node, 2 Nodes in System
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1 \
  -smp 128,sockets=8,threads=4,maxcpus=128  \

5.12.0-rc5 (or any kernel with commit 4ca234a9cbd7)
---
srikar@cloudy:~$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):   2 <<<<<-
NUMA node(s):2
Model:   2.3 (pvr 004e 1203)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   1 MiB
L1i cache:   1 MiB
NUMA node0 CPU(s):   0-15,32-47,64-79,96-111
NUMA node1 CPU(s):   16-31,48-63,80-95,112-127
--
srikar@cloudy:~$ dmesg |grep smp
[0.010658] smp: Bringing up secondary CPUs ...
[0.424681] smp: Brought up 2 nodes, 128 CPUs
--

5.12.0-rc5 + 3 patches
--
srikar@cloudy:~$ lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  4
Socket(s):   8<<<<-
NUMA node(s):2
Model:   2.3 (pvr 004e 1203)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   1 MiB
L1i cache:   1 MiB
NUMA node0 CPU(s):   0-15,32-47,64-79,96-111
NUMA node1 CPU(s):   16-31,48-63,80-95,112-127
--
srikar@cloudy:~$ dmesg |grep smp
[0.010372] smp: Bringing up secondary CPUs ...
[0.417892] smp: Brought up 2 nodes, 128 CPUs

5.12.0-rc5
--
srikar@cloudy:~$  lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  1024
On-line CPU(s) list: 0-1023
Thread(s) per core:  8
Core(s) per socket:  128
Socket(s):   1
NUMA node(s):1
Model:   2.3 (pvr 004e 1203)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   KVM
Virtualization type: para
L1d cache:   4 MiB
L1i cache:   4 MiB
NUMA node0 CPU(s):   0-1023
srikar@cloudy:~$ dmesg | grep smp
[0.027753 ] smp: Bringing up secondary CPUs ...
[2.315193 ] smp: Brought up 1 node, 1024 CPUs

5.12.0-rc5 + 3 patches
--
srikar@cloudy:~$ dmesg | grep smp
[0.027659 ] smp: Bringing up secondary CPUs ...
[2.532739 ] smp: Brought up 1 node, 1024 CPUs

I have also booted and tested the kernels on PowerVM and PowerNV, and
even there I see a very negligible increase in the bring-up time of
secondary CPUs.

Srikar Dronamraju (3):
  powerpc/smp: Reintroduce cpu_core_mask
  Revert "powerpc/topology: Update topology_core_cpumask"
  powerpc/smp: Cache CPU to chip lookup

 arch/powerpc/include/asm/smp.h  |  6 
 arch/powerpc/include/asm/topology.h |  2 +-
 arch/powerpc/kernel/prom.c  | 19 +++---
 arch/powerpc/kernel/smp.c   | 56 +
 4 files changed, 71 insertions(+), 12 deletions(-)

-- 
2.25.1



Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-12 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2021-04-02 11:07:54]:

> 
> To remedy this, this patch proposes that the LLC be moved to the MC
> level which is a group of cores in one half of the chip.
> 
>   SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
> 

I think marking the hemisphere as an LLC in a P10 scenario is a good idea.

> While there is no cache being shared at this level, this is still the
> level where some amount of cache-snooping takes place and it is
> relatively faster to access the data from the caches of the cores
> within this domain. With this change, we no longer see regressions on
> P10 for applications which require single threaded performance.

Peter, Valentin, Vincent, Mel, etal

On architectures where we have multiple levels of cache access latencies
within a die (for example, one within the current LLC or SMT core, another
at the MC or hemisphere level, and finally across hemispheres), do you have
any suggestions on how we could handle this in the core scheduler?

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-09 Thread Srikar Dronamraju
nterrupt: c00 at 0x2025bd74

That leaves us with just 2 options for now.
1. Update numa_mem later and only update numa_node here.
- Over a longer period of time, this would be more confusing since we
may lose track of why we are splitting the set of numa_node and numa_mem.

or
2. Use my earlier patch.

My choice would be to go with my earlier patch.
Please do let me know your thoughts on the same.

-- 
Thanks and Regards
Srikar Dronamraju

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 3beeb030cd78..1cdd83703f93 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -44,6 +44,7 @@ extern void __init dump_numa_cpu_topology(void);
 
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
 extern void sysfs_remove_device_from_node(struct device *dev, int nid);
+extern int numa_setup_cpu(unsigned long cpu);
 
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node)
 {
@@ -81,6 +82,11 @@ static inline void sysfs_remove_device_from_node(struct device *dev,
 {
 }
 
+static inline int numa_setup_cpu(unsigned long cpu)
+{
+   return first_online_node;
+}
+
 static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) {}
 
 static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a1070d..0d0c71ba4672 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1539,9 +1539,6 @@ void start_secondary(void *unused)
shared_caches = true;
}
 
-   set_numa_node(numa_cpu_lookup_table[cpu]);
-   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
smp_wmb();
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f2bf98bdcea2..11914a28db67 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -501,7 +501,7 @@ static int vphn_get_nid(long unused)
  * Figure out to which domain a cpu belongs and stick it there.
  * Return the id of the domain used.
  */
-static int numa_setup_cpu(unsigned long lcpu)
+int numa_setup_cpu(unsigned long lcpu)
 {
struct device_node *cpu;
int fcpu = cpu_first_thread_sibling(lcpu);
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..311fbe916d5b 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -198,9 +198,14 @@ static int pseries_add_processor(struct device_node *np)
}
 
for_each_cpu(cpu, tmp) {
+   int nid;
+
BUG_ON(cpu_present(cpu));
set_cpu_present(cpu, true);
set_hard_smp_processor_id(cpu, be32_to_cpu(*intserv++));
+   nid = numa_setup_cpu(cpu);
+   set_cpu_numa_node(cpu, nid);
+   set_cpu_numa_mem(cpu, local_memory_node(nid));
}
err = 0;
 out_unlock:



Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-08 Thread Srikar Dronamraju
* Nathan Lynch  [2021-04-07 14:46:24]:

> Srikar Dronamraju  writes:
> 
> > * Nathan Lynch  [2021-04-07 07:19:10]:
> >
> >> Sorry for the delay in following up here.
> >> 
> >
> > No issues.
> >
> >> >> So I'd suggest that pseries_add_processor() be made to update
> >> >> these things when the CPUs are marked present, before onlining them.
> >> >
> >> > In pseries_add_processor, we are only marking the cpu as present, i.e.
> >> > I believe numa_setup_cpu() would not have been called. So we may not have a
> >> > way to associate the CPU to the node. Otherwise we will have to call
> >> > numa_setup_cpu() or the hcall_vphn.
> >> >
> >> > We could try calling numa_setup_cpu() immediately after we set the
> >> > CPU to be present, but that would be one more extra hcall + I dont know 
> >> > if
> >> > there are any more steps needed before CPU being made present and
> >> > associating the CPU to the node.
> >> 
> >> An additional hcall in this path doesn't seem too expensive.
> >> 
> >> > Are we sure the node is already online?
> >> 
> >> I see that dlpar_online_cpu() calls find_and_online_cpu_nid(), so yes I
> >> think that's covered.
> >
> > Okay, 
> >
> > Can we just call set_cpu_numa_node() at the end of map_cpu_to_node().
> > The advantage would be the update to numa_cpu_lookup_table and cpu_to_node
> > would happen at the same time and would be in sync.
> 
> I don't know. I guess this question just makes me wonder whether powerpc
> needs to have the additional lookup table. How is it different from the
> generic per_cpu numa_node?

The lookup table is for early cpu-to-node mapping, i.e. for use before the
per_cpu variables are available. This means that calling
set_numa_node()/set_cpu_numa_node() from map_cpu_to_node() may not always be
an option, since map_cpu_to_node() ends up getting called very early in boot.
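
To make the distinction concrete, a minimal sketch (illustrative only; the
array declaration is paraphrased from the powerpc headers, and
sync_node_from_lookup_table() is a hypothetical helper, not kernel code):

	/* powerpc-specific early map: a plain array, usable before the
	 * per-cpu areas are set up */
	extern int numa_cpu_lookup_table[];

	/* the generic per-cpu node id can only be set once per-cpu
	 * areas exist */
	static inline void sync_node_from_lookup_table(int cpu)
	{
		set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
	}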

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-07 Thread Srikar Dronamraju
* Nathan Lynch  [2021-04-07 07:19:10]:

> Sorry for the delay in following up here.
> 

No issues.

> >> So I'd suggest that pseries_add_processor() be made to update
> >> these things when the CPUs are marked present, before onlining them.
> >
> > In pseries_add_processor, we are only marking the cpu as present. i.e
> > I believe numa_setup_cpu() would not have been called. So we may not have a
> > way to associate the CPU to the node. Otherwise we will have to call
> > numa_setup_cpu() or the hcall_vphn.
> >
> > We could try calling numa_setup_cpu() immediately after we set the
> > CPU to be present, but that would be one more extra hcall + I dont know if
> > there are any more steps needed before CPU being made present and
> > associating the CPU to the node.
> 
> An additional hcall in this path doesn't seem too expensive.
> 
> > Are we sure the node is already online?
> 
> I see that dlpar_online_cpu() calls find_and_online_cpu_nid(), so yes I
> think that's covered.

Okay, 

Can we just call set_cpu_numa_node() at the end of map_cpu_to_node()?
The advantage would be that the updates to numa_cpu_lookup_table and
cpu_to_node would happen at the same time and stay in sync.
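
For illustration, a rough sketch of that suggestion (paraphrasing
map_cpu_to_node() in arch/powerpc/mm/numa.c; not an actual patch):

	static void map_cpu_to_node(int cpu, int node)
	{
		update_numa_cpu_lookup_table(cpu, node);

		/* keep the generic per-cpu numa_node in sync with the
		 * powerpc lookup table */
		set_cpu_numa_node(cpu, node);

		if (!cpumask_test_cpu(cpu, node_to_cpumask_map[node]))
			cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
	}

As the reply above notes, set_cpu_numa_node() touches per-cpu data, so this
would only be safe once the per-cpu areas are set up.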

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-01 Thread Srikar Dronamraju
* Nathan Lynch  [2021-04-01 17:51:05]:

Thanks Nathan for reviewing.

> > -   set_numa_node(numa_cpu_lookup_table[cpu]);
> > -   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> > -
> 
> Regardless of your change: at boot time, this set of calls to
> set_numa_node() and set_numa_mem() is redundant, right? Because
> smp_prepare_cpus() has:
> 
>   for_each_possible_cpu(cpu) {
>   ...
>   if (cpu_present(cpu)) {
>   set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
>   set_cpu_numa_mem(cpu,
>   local_memory_node(numa_cpu_lookup_table[cpu]));
>   }
> 
> I would rather that, when onlining a CPU that happens to have been
> dynamically added after boot, we enter start_secondary() with conditions
> equivalent to those at boot time. Or as close to that as is practical.
> 
> So I'd suggest that pseries_add_processor() be made to update
> these things when the CPUs are marked present, before onlining them.

In pseries_add_processor, we are only marking the CPU as present, i.e.
I believe numa_setup_cpu() would not have been called. So we may not have a
way to associate the CPU with its node. Otherwise we will have to call
numa_setup_cpu() or the hcall_vphn.

We could try calling numa_setup_cpu() immediately after we set the
CPU to be present, but that would be one more extra hcall, plus I don't know
if there are any more steps needed between the CPU being made present and
associating the CPU with the node. Are we sure the node is already online?
For the numa_mem, we are better off if the zonelists for the node are built.

Or the other solution would be to call this in map_cpu_to_node().
Here also we have to be sure the zonelists for the node are already built.

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-01 Thread Srikar Dronamraju
Geethika reported a trace when doing a dlpar CPU add.

[ cut here ]
WARNING: CPU: 152 PID: 1134 at kernel/sched/topology.c:2057
CPU: 152 PID: 1134 Comm: kworker/152:1 Not tainted 5.12.0-rc5-master #5
Workqueue: events cpuset_hotplug_workfn
NIP:  c01cfc14 LR: c01cfc10 CTR: c07e3420
REGS: c034a08eb260 TRAP: 0700   Not tainted  (5.12.0-rc5-master+)
MSR:  80029033   CR: 28828422  XER: 0020
CFAR: c01fd888 IRQMASK: 0
GPR00: c01cfc10 c034a08eb500 c1f35400 0027
GPR04: c035abaa8010 c035abb30a00 0027 c035abaa8018
GPR08: 0023 c035abaaef48 0035aa54 c035a49dffe8
GPR12: 28828424 c035bf1a1c80 0497 0004
GPR16: c347a258 0140 c203d468 c1a1a490
GPR20: c1f9c160 c034adf70920 c034aec9fd20 000100087bd3
GPR24: 000100087bd3 c035b3de09f8 0030 c035b3de09f8
GPR28: 0028 c347a280 c034aefe0b00 c10a2a68
NIP [c01cfc14] build_sched_domains+0x6a4/0x1500
LR [c01cfc10] build_sched_domains+0x6a0/0x1500
Call Trace:
[c034a08eb500] [c01cfc10] build_sched_domains+0x6a0/0x1500 
(unreliable)
[c034a08eb640] [c01d1e6c] partition_sched_domains_locked+0x3ec/0x530
[c034a08eb6e0] [c02936d4] rebuild_sched_domains_locked+0x524/0xbf0
[c034a08eb7e0] [c0296bb0] rebuild_sched_domains+0x40/0x70
[c034a08eb810] [c0296e74] cpuset_hotplug_workfn+0x294/0xe20
[c034a08ebc30] [c0178dd0] process_one_work+0x300/0x670
[c034a08ebd10] [c01791b8] worker_thread+0x78/0x520
[c034a08ebda0] [c0185090] kthread+0x1a0/0x1b0
[c034a08ebe10] [c000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
7d2903a6 4e800421 e8410018 7f67db78 7fe6fb78 7f45d378 7f84e378 7c681b78
3c62ff1a 3863c6f8 4802dc35 6000 <0fe0> 3920fff4 f9210070 e86100a0
---[ end trace 532d9066d3d4d7ec ]---

Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search
for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before
updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated
correctly before it's used in setting the masks. Setting the numa_node will
ensure that when cpu_cpu_mask() gets called, the correct node number is
used. This code movement helped fix the above call trace.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch 
Cc: Michael Ellerman 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Scott Cheloha 
Cc: Gautham R Shenoy 
Cc: Geetika Moolchandani 
Reported-by: Geetika Moolchandani 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/smp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a1070d..1a99d75679a8 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1521,6 +1521,9 @@ void start_secondary(void *unused)
 
vdso_getcpu_init();
 #endif
+   set_numa_node(numa_cpu_lookup_table[cpu]);
+   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+
/* Update topology CPU masks */
add_cpu_to_masks(cpu);
 
@@ -1539,9 +1542,6 @@ void start_secondary(void *unused)
shared_caches = true;
}
 
-   set_numa_node(numa_cpu_lookup_table[cpu]);
-   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
-
smp_wmb();
notify_cpu_starting(cpu);
set_cpu_online(cpu, true);
-- 
2.27.0
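
For reference: the reason setting numa_node before add_cpu_to_masks()
matters is that on powerpc cpu_cpu_mask() resolves through the CPU's node,
roughly as follows (a sketch of the helper in
arch/powerpc/include/asm/smp.h):

	static inline const struct cpumask *cpu_cpu_mask(int cpu)
	{
		return cpumask_of_node(cpu_to_node(cpu));
	}

So, per the commit message above, cpu_cpu_mask() only returns the correct
mask once set_numa_node() has run, which is why those calls are moved before
add_cpu_to_masks().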



Re: Advice needed on SMP regression after cpu_core_mask change

2021-03-18 Thread Srikar Dronamraju
chip-id calculation is the only place 
> in the code
> that cares about cores per socket information. The kernel is now ignoring 
> that, starting
> on 4bce545903fa, and now QEMU is unable to provide this info to the guest.
> 
> If we're not going to use ibm,chip-id any longer, which seems sensible given 
> that PAPR does
> not declare it, we need another way of letting the guest know how many cores 
> per socket
> we want.
> 
> 
> 
> [1] https://bugzilla.redhat.com/1934421
> 
> 
> 
> Thanks,
> 
> 
> DHB

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries

2021-03-16 Thread Srikar Dronamraju
* Cédric Le Goater  [2021-03-12 15:31:54]:

> Initial commit 15863ff3b8da ("powerpc: Make chip-id information
> available to userspace") introduced a cpu_to_chip_id() routine for the
> PowerNV platform using the "ibm,chip-id" property to query the chip id
> of a CPU. But PAPR does not specify such a property and the node id
> query is broken.
>
> Use cpu_to_node() instead, which is guaranteed to have a correct value on
> all platforms, PowerNV and pSeries.
>
> Cc: Nathan Lynch 
> Cc: Srikar Dronamraju 
> Cc: Vasant Hegde 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/asm/topology.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

(Sorry I somehow managed to mangle to-address. Hence resending this mail
again)

While this looks good to me, @mpe had reservations on using nid as chip-id.
https://lore.kernel.org/linuxppc-dev/87lfwhypv0@concordia.ellerman.id.au/t/#u
He may be okay with using nid as a "virtual" package id in a pseries
environment.

Reviewed-by: Srikar Dronamraju 

[---=| TOFU protection by t-prot: 24 lines snipped |=---]

--
Thanks and Regards
Srikar Dronamraju


> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 3beeb030cd78..887c42a4e43d 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
>  #ifdef CONFIG_PPC64
>  #include 
>
> -#define topology_physical_package_id(cpu)(cpu_to_chip_id(cpu))
> +#define topology_physical_package_id(cpu)(cpu_to_node(cpu))
>
>  #define topology_sibling_cpumask(cpu)(per_cpu(cpu_sibling_map, cpu))
>  #define topology_core_cpumask(cpu)   (cpu_cpu_mask(cpu))
> --
> 2.26.2
>

--
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] powerpc/numa: Fix topology_physical_package_id() on pSeries

2021-03-15 Thread Srikar Dronamraju
* Cédric Le Goater  [2021-03-12 15:31:54]:

> Initial commit 15863ff3b8da ("powerpc: Make chip-id information
> available to userspace") introduced a cpu_to_chip_id() routine for the
> PowerNV platform using the "ibm,chip-id" property to query the chip id
> of a CPU. But PAPR does not specify such a property and the node id
> query is broken.
> 
> Use cpu_to_node() instead, which is guaranteed to have a correct value on
> all platforms, PowerNV and pSeries.
> 

While this looks good to me, @mpe had reservations on using nid as chip-id.
https://lore.kernel.org/linuxppc-dev/87lfwhypv0@concordia.ellerman.id.au/t/#u
He may be okay with using nid as a "virtual" package id in a pseries
environment.

Reviewed-by: Srikar Dronamraju 

> Cc: Nathan Lynch 
> Cc: Srikar Dronamraju 
> Cc: Vasant Hegde 
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/include/asm/topology.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h 
> b/arch/powerpc/include/asm/topology.h
> index 3beeb030cd78..887c42a4e43d 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -123,7 +123,7 @@ static inline int cpu_to_coregroup_id(int cpu)
>  #ifdef CONFIG_PPC64
>  #include 
>  
> -#define topology_physical_package_id(cpu)(cpu_to_chip_id(cpu))
> +#define topology_physical_package_id(cpu)(cpu_to_node(cpu))
>  
>  #define topology_sibling_cpumask(cpu)(per_cpu(cpu_sibling_map, cpu))
>  #define topology_core_cpumask(cpu)   (cpu_cpu_mask(cpu))
> -- 
> 2.26.2
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] perf bench numa: Fix the condition checks for max number of numa nodes

2021-02-26 Thread Srikar Dronamraju
* Athira Rajeev  [2021-02-25 11:50:02]:

> In systems having higher node numbers available, like node
> 255, perf numa bench will fail with SIGABRT.
> 
> <<>>
> perf: bench/numa.c:1416: init: Assertion `!(g->p.nr_nodes > 64 || 
> g->p.nr_nodes < 0)' failed.
> Aborted (core dumped)
> <<>>
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 1/5] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-15 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2020-12-10 16:08:55]:

> From: "Gautham R. Shenoy" 
> 
> The "ibm,thread-groups" device-tree property is an array that is used
> to indicate if groups of threads within a core share certain
> properties. It provides details of which property is being shared by
> which groups of threads. This array can encode information about
> multiple properties being shared by different thread-groups within the
> core.
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 
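
For readers unfamiliar with the layout described above, a hedged sketch of
how one record of such an array could be decoded (struct and field names
here are illustrative, not the kernel's; types assume <linux/types.h>):

	struct tg_record {
		u32 property;		/* e.g. 1: share L1, 2: share L2 */
		u32 nr_groups;
		u32 threads_per_group;
		const u32 *thread_ids;	/* nr_groups * threads_per_group ids */
	};

	/* Decode one record and return a pointer just past it, so that
	 * multiple properties encoded back-to-back can be walked in a
	 * loop. */
	static const u32 *tg_decode_one(const u32 *p, struct tg_record *r)
	{
		r->property = *p++;
		r->nr_groups = *p++;
		r->threads_per_group = *p++;
		r->thread_ids = p;
		return p + r->nr_groups * r->threads_per_group;
	}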

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 5/5] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-15 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2020-12-10 16:08:59]:

> From: "Gautham R. Shenoy" 
> 
> On POWER platforms where only some groups of threads within a core
> share the L2-cache (indicated by the ibm,thread-groups device-tree
> property), we currently print the incorrect shared_cpu_map/list for
> L2-cache in the sysfs.
> 
> This patch reports the correct shared_cpu_map/list on such platforms.
> 
> Example:
> On a platform with "ibm,thread-groups" set to
>  0001 0002 0004 
>  0002 0004 0006 0001
>  0003 0005 0007 0002
>  0002 0004  0002
>  0004 0006 0001 0003
>      0005 0007
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 4/5] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-15 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2020-12-10 16:08:58]:

> From: "Gautham R. Shenoy" 
> 
> On POWER systems, groups of threads within a core sharing the L2-cache
> can be indicated by the "ibm,thread-groups" property array with the
> identifier "2".
> 
> This patch adds support for detecting this and, when present, populates
> the cpu_l2_cache_mask of every CPU with the core-siblings
> which share L2 with the CPU, as specified by the
> "ibm,thread-groups" property array.
> 
> On a platform with the following "ibm,thread-group" configuration
>0001 0002 0004 
>0002 0004 0006 0001
>0003 0005 0007 0002
>0002 0004  0002
>0004 0006 0001 0003
>    0005 0007
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 3/5] powerpc/smp: Rename init_thread_group_l1_cache_map() to make it generic

2020-12-15 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2020-12-10 16:08:57]:

> From: "Gautham R. Shenoy" 
> 
> init_thread_group_l1_cache_map() initializes the per-cpu cpumask
> thread_group_l1_cache_map with the core-siblings which share L1 cache
> with the CPU. Make this function generic to the cache-property (L1 or
> L2) and update a suitable mask. This is a preparatory patch for the
> next patch where we will introduce discovery of thread-groups that
> share L2-cache.
> 
> No functional change.
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Gautham R. Shenoy 
-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v3 2/5] powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map

2020-12-15 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2020-12-10 16:08:56]:

> From: "Gautham R. Shenoy" 
> 
> On platforms which have the "ibm,thread-groups" property, the per-cpu
> variable cpu_l1_cache_map keeps track of which group of threads
> within the same core share the L1 cache (Instruction and Data).
> 
> This patch renames the variable to "thread_group_l1_cache_map" to make
> it consistent with a subsequent patch which will introduce
> thread_group_l2_cache_map.
> 
> This patch introduces no functional change.
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Gautham R. Shenoy 
-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-09 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-12-08 23:12:37]:

> 
> > For L2 we have thread_group_l2_cache_map to store the tasks from the thread
> > group.  but cpu_l2_cache_map for keeping track of tasks.
> 
> > 
> > I think we should do some renaming to keep the names consistent.
> > I would say probably say move the current cpu_l2_cache_map to
> > cpu_llc_cache_map and move the new aka  thread_group_l2_cache_map as
> > cpu_l2_cache_map to be somewhat consistent.
> 
> Hmm.. cpu_llc_cache_map is still very generic. We want to have
> something that defines l2 map.
> 
> I agree that we need to keep it consistent. How about renaming
> cpu_l1_cache_map to thread_groups_l1_cache_map ?
> 
> That way thread_groups_l1_cache_map and thread_groups_l2_cache_map
> refer to the corresponding L1 and L2 siblings as discovered from
> ibm,thread-groups property.

I am fine with this.

> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> > > +
> > > + if (err)
> > > + return err;
> > > + }
> > > +
> > > + thread_group_shares_l2 = true;
> > 
> > Why do we need a separate loop. Why cant we merge this in the above loop
> > itself?
> 
> No, there are platforms where one THREAD_GROUP_SHARE_L1 exists while
> THREAD_GROUP_SHARE_L2 doesn't exist. It becomes easier if these are
> separately tracked. Also, what do we gain if we put this in the same
> loop? It will be (nr_possible_cpus * 2 * invocations of
> init_cpu_cache_map()) as opposed to 2 * (nr_possible_cpus *
> invocations of init_cpu_cache_map()). Isn't it ?
> 
It's not about the number of invocations but the per-cpu thread group list,
which would then not need to be loaded again. Currently it would probably be
in the cache line, but gets dropped and has to be loaded again in the next
loop. And we can still support platforms with only THREAD_GROUP_SHARE_L1,
since parse_thread_groups would have told us how many levels of thread groups
are supported on a platform.
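
For illustration, the merged loop being suggested might look like this (a
sketch, not code from the series; has_l2_thread_groups is a hypothetical
flag derived from parse_thread_groups()):

	for_each_possible_cpu(cpu) {
		int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1);

		/* reuse the per-cpu data while it is still hot */
		if (!err && has_l2_thread_groups)
			err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
		if (err)
			return err;
	}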

> > 
> > > + pr_info("Thread-groups in a core share L2-cache\n");
> > 
> > Can this be moved to a pr_debug? Does it help any regular user/admins to
> > know if thread-groups shared l2 cache. Infact it may confuse users on what
> > thread groups are and which thread groups dont share cache.
> > I would prefer some other name than thread_group_shares_l2 but dont know any
> > better alternatives and may be my choices are even worse.
> 
> Would you be ok with "L2 cache shared by threads of the small core" ?

Sounds better to me. I would still think pr_debug is better, since regular
admins/users may not glean much information from it.
> 
> > 
> > Ah this can be simplified to:
> > if (thread_group_shares_l2) {
> > cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> > 
> > for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
> > if (cpu_online(i))
> > set_cpus_related(i, cpu, cpu_l2_cache_mask);
> > }
> 
> Don't we want to enforce that the siblings sharing L1 be a subset of
> the siblings sharing L2 ? Or do you recommend putting in a check for
> that somewhere ?
> 
I didn't think about the case where the device-tree could show L2 to be a
subset of L1.

How about initializing thread_group_l2_cache_map itself with
cpu_l1_cache_map? It would be a simple one-time operation and reduce the
overhead here on every CPU online.
And it would help in your subsequent patch too. We don't want the cacheinfo
for L1 showing CPUs not present in L2.
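
A minimal sketch of that suggestion (using the per-cpu masks from this
series; illustrative, not the actual patch):

	/* Seed the L2 mask with the L1 siblings so that the L1 group is
	 * always a subset of the L2 group. */
	cpumask_copy(per_cpu(thread_group_l2_cache_map, cpu),
		     per_cpu(cpu_l1_cache_map, cpu));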

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-09 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-12-08 23:26:47]:

> > The drawback of this is even if cpus 0,2,4,6 are released L1 cache will not
> > be released. Is this as expected?
> 
> cacheinfo populates the cache->shared_cpu_map on the basis of which
> CPUs share the common device-tree node for a particular cache.  There
> is one l1-cache object in the device-tree for a CPU node corresponding
> to a big-core. That the L1 is further split between the threads of the
> core is shown using ibm,thread-groups.
> 

Yes.

> The ideal thing would be to add a "group_leader" field to "struct
> cache" so that we can create separate cache objects , one per thread
> group. I will take a stab at this in the v2.
> 

I am not saying this needs to be done immediately. We could add a TODO and
get it done later. Your patch is not making it worse. It's just that there is
still something more left to be done.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-09 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-12-08 22:55:40]:

> > 
> > NIT:
> > tglx mentions in one of his recent comments to try keep a reverse fir tree
> > ordering of variables where possible.
> 
> I suppose you mean moving the longer local variable declarations to
> the top and shorter ones to the bottom. Thanks. Will fix this.
> 

Yes.

> > > + }
> > > +
> > > + if (!tg)
> > > + return -EINVAL;
> > > +
> > > + cpu_group_start = get_cpu_thread_group_start(cpu, tg);
> > 
> > This whole hunk should be moved to a new function and called before
> > init_cpu_cache_map. It will simplify the logic to great extent.
> 
> I suppose you are referring to the part where we select the correct
> tg. Yeah, that can move to a different helper.
> 

Yes, I would prefer if we could call this new helper outside
init_cpu_cache_map.

> > > 
> > > - zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > > - GFP_KERNEL, cpu_to_node(cpu));
> > > + mask = &per_cpu(cpu_l1_cache_map, cpu);
> > > +
> > > + zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> > > 
> > 
> > This hunk (and the next hunk) should be moved to next patch.
> >
> 
> The next patch is only about introducing  THREAD_GROUP_SHARE_L2. Hence
> I put in any other code in this patch, since it seems to be a logical
> place to collate whatever we have in a generic form.
> 

While I am fine with it, having a pointer that always points to the same
mask looks weird.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 3/3] powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

2020-12-07 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2020-12-04 10:18:47]:

> From: "Gautham R. Shenoy" 
> 
> 
> Signed-off-by: Gautham R. Shenoy 
> ---
> 
> +extern bool thread_group_shares_l2;
>  /*
>   * On big-core systems, each core has two groups of CPUs each of which
>   * has its own L1-cache. The thread-siblings which share l1-cache with
>   * @cpu can be obtained via cpu_smallcore_mask().
> + *
> + * On some big-core systems, the L2 cache is shared only between some
> + * groups of siblings. This is already parsed and encoded in
> + * cpu_l2_cache_mask().
>   */
> static const struct cpumask *get_big_core_shared_cpu_map(int cpu, struct cache *cache)
>  {
>   if (cache->level == 1)
>   return cpu_smallcore_mask(cpu);
> + if (cache->level == 2 && thread_group_shares_l2)
> + return cpu_l2_cache_mask(cpu);
> 
>   return >shared_cpu_map;

As pointed out by l...@intel.org, we need to do this only under CONFIG_SMP,
even for cache->level == 1.

I agree that we are displaying shared_cpu_map correctly. But should we also
have updated/cleared shared_cpu_map in the first place? For example: for a P9
core with CPUs 0-7, the cache->shared_cpu_map for L1 would have 0-7 but
would display 0,2,4,6.

The drawback of this is that even if CPUs 0,2,4,6 are released, the L1 cache
will not be released. Is this expected?


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 2/3] powerpc/smp: Add support detecting thread-groups sharing L2 cache

2020-12-07 Thread Srikar Dronamraju
he_map(int cpu, unsigned int 
> cache_property)
>   if (!dn)
>   return -ENODATA;
> 
> - if (!(cache_property == THREAD_GROUP_SHARE_L1))
> + if (!(cache_property == THREAD_GROUP_SHARE_L1 ||
> +   cache_property == THREAD_GROUP_SHARE_L2))
>   return -EINVAL;
> 
>   if (!cpu_tgl->nr_properties) {
> @@ -867,7 +879,10 @@ static int init_cpu_cache_map(int cpu, unsigned int cache_property)
>   goto out;
>   }
> 
> - mask = &per_cpu(cpu_l1_cache_map, cpu);
> + if (cache_property == THREAD_GROUP_SHARE_L1)
> + mask = &per_cpu(cpu_l1_cache_map, cpu);
> + else if (cache_property == THREAD_GROUP_SHARE_L2)
> + mask = &per_cpu(thread_group_l2_cache_map, cpu);
> 
>   zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> 
> @@ -973,6 +988,16 @@ static int init_big_cores(void)
>   }
> 
>   has_big_cores = true;
> +
> + for_each_possible_cpu(cpu) {
> + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L2);
> +
> + if (err)
> + return err;
> + }
> +
> + thread_group_shares_l2 = true;

Why do we need a separate loop? Why can't we merge this into the above loop
itself?

> + pr_info("Thread-groups in a core share L2-cache\n");

Can this be moved to a pr_debug? Does it help regular users/admins to
know whether thread-groups share the L2 cache? In fact it may confuse users
as to what thread groups are and which thread groups don't share a cache.
I would prefer some other name than thread_group_shares_l2 but don't know any
better alternatives, and maybe my choices are even worse.

>   return 0;
>  }
> 
> @@ -1287,6 +1312,31 @@ static bool update_mask_by_l2(int cpu, cpumask_var_t *mask)
>   if (has_big_cores)
>   submask_fn = cpu_smallcore_mask;
> 
> +

NIT: extra blank line?

> + /*
> +  * If the threads in a thread-group share L2 cache, then
> +  * the L2-mask can be obtained from thread_group_l2_cache_map.
> +  */
> + if (thread_group_shares_l2) {
> + /* Siblings that share L1 is a subset of siblings that share 
> L2.*/
> + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> + if (*mask) {
> + cpumask_andnot(*mask,
> +per_cpu(thread_group_l2_cache_map, cpu),
> +cpu_l2_cache_mask(cpu));
> + } else {
> + mask = &per_cpu(thread_group_l2_cache_map, cpu);
> + }
> +
> + for_each_cpu(i, *mask) {
> + if (!cpu_online(i))
> + continue;
> +     set_cpus_related(i, cpu, cpu_l2_cache_mask);
> + }
> +
> + return true;
> + }
> +

Ah this can be simplified to:
	if (thread_group_shares_l2) {
		cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));

		for_each_cpu(i, per_cpu(thread_group_l2_cache_map, cpu)) {
			if (cpu_online(i))
				set_cpus_related(i, cpu, cpu_l2_cache_mask);
		}
	}

No?

>   l2_cache = cpu_to_l2cache(cpu);
>   if (!l2_cache || !*mask) {
>   /* Assume only core siblings share cache with this CPU */

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH 1/3] powerpc/smp: Parse ibm,thread-groups with multiple properties

2020-12-07 Thread Srikar Dronamraju
;
> @@ -830,11 +867,12 @@ static int init_cpu_l1_cache_map(int cpu)
>   goto out;
>   }
> 
> - zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> - GFP_KERNEL, cpu_to_node(cpu));
> + mask = &per_cpu(cpu_l1_cache_map, cpu);
> +
> + zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
> 

This hunk (and the next hunk) should be moved to next patch.

>   for (i = first_thread; i < first_thread + threads_per_core; i++) {
> - int i_group_start = get_cpu_thread_group_start(i, &tg);
> + int i_group_start = get_cpu_thread_group_start(i, tg);
> 
>   if (unlikely(i_group_start == -1)) {
>   WARN_ON_ONCE(1);
> @@ -843,7 +881,7 @@ static int init_cpu_l1_cache_map(int cpu)
>   }
> 
>   if (i_group_start == cpu_group_start)
> - cpumask_set_cpu(i, per_cpu(cpu_l1_cache_map, cpu));
> + cpumask_set_cpu(i, *mask);
>   }
> 
>  out:
> @@ -924,7 +962,7 @@ static int init_big_cores(void)
>   int cpu;
> 
>   for_each_possible_cpu(cpu) {
> - int err = init_cpu_l1_cache_map(cpu);
> + int err = init_cpu_cache_map(cpu, THREAD_GROUP_SHARE_L1);
> 
>   if (err)
>   return err;
> -- 
> 1.9.4
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [powerpc:next-test 121/184] arch/powerpc/kernel/firmware.c:31:9-10: WARNING: return of 0/1 in function 'check_kvm_guest' with return type bool

2020-12-02 Thread Srikar Dronamraju
Hi, 

Thanks for the report.

* kernel test robot  [2020-12-03 07:25:06]:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> next-test
> head:   fb003959777a635dea8910cf71109b612c7f940c
> commit: 77354ecf8473208a5cc5f20a501760f7d6d631cd [121/184] powerpc: Rename 
> is_kvm_guest() to check_kvm_guest()

Sorry for the nitpick, but the commit should have been
commit 107c55005fbd ("powerpc/pseries: Add KVM guest doorbell restrictions"),

because that's the patch that introduced is_kvm_guest.
My patch was just renaming is_kvm_guest to check_kvm_guest.

> config: powerpc-randconfig-c003-20201202 (attached as .config)
> compiler: powerpc64le-linux-gcc (GCC) 9.3.0
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
> 
> 
> "coccinelle warnings: (new ones prefixed by >>)"
> >> arch/powerpc/kernel/firmware.c:31:9-10: WARNING: return of 0/1 in function 
> >> 'check_kvm_guest' with return type bool
> 
> Please review and possibly fold the followup patch.
> 

But I agree we can still fold the followup patch into the
"Rename is_kvm_guest() to check_kvm_guest()" patch.

> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org



-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v2 3/4] powerpc: Reintroduce is_kvm_guest in a new avatar

2020-12-01 Thread Srikar Dronamraju
Introduce a static branch that would be set during boot if the OS
happens to be a KVM guest. Subsequent checks to see if we are on KVM
will rely on this static branch. This static branch would be used in
vcpu_is_preempted in a subsequent patch.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/kvm_guest.h | 10 ++
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/kernel/firmware.c   |  2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_guest.h 
b/arch/powerpc/include/asm/kvm_guest.h
index ba8291e02ba9..627ba272e781 100644
--- a/arch/powerpc/include/asm/kvm_guest.h
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -7,8 +7,18 @@
 #define __POWERPC_KVM_GUEST_H__
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
+#include 
+
+DECLARE_STATIC_KEY_FALSE(kvm_guest);
+
+static inline bool is_kvm_guest(void)
+{
+   return static_branch_unlikely(&kvm_guest);
+}
+
 bool check_kvm_guest(void);
 #else
+static inline bool is_kvm_guest(void) { return false; }
 static inline bool check_kvm_guest(void) { return false; }
 #endif
 
diff --git a/arch/powerpc/include/asm/kvm_para.h 
b/arch/powerpc/include/asm/kvm_para.h
index 6fba06b6cfdb..abe1b5e82547 100644
--- a/arch/powerpc/include/asm/kvm_para.h
+++ b/arch/powerpc/include/asm/kvm_para.h
@@ -14,7 +14,7 @@
 
 static inline int kvm_para_available(void)
 {
-   return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest();
+   return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest();
 }
 
 static inline unsigned int kvm_arch_para_features(void)
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index 0aeb6a5b1a9e..28498fc573f2 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -22,6 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
 #endif
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
+DEFINE_STATIC_KEY_FALSE(kvm_guest);
 bool check_kvm_guest(void)
 {
struct device_node *hyper_node;
@@ -33,6 +34,7 @@ bool check_kvm_guest(void)
if (!of_device_is_compatible(hyper_node, "linux,kvm"))
return 0;
 
+   static_branch_enable(&kvm_guest);
return 1;
 }
 #endif
-- 
2.18.4



[PATCH v2 1/4] powerpc: Refactor is_kvm_guest declaration to new header

2020-12-01 Thread Srikar Dronamraju
Only code/declaration movement, in anticipation of doing a kvm-aware
vcpu_is_preempted. No additional changes.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
Changelog:
v1->v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u
 - Moved a hunk to fix a no previous prototype warning reported by: 
l...@intel.com
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/

 arch/powerpc/include/asm/firmware.h  |  6 --
 arch/powerpc/include/asm/kvm_guest.h | 15 +++
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/kernel/firmware.c   |  1 +
 arch/powerpc/platforms/pseries/smp.c |  1 +
 5 files changed, 18 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_guest.h

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 0b295bdb201e..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -134,12 +134,6 @@ extern int ibm_nmi_interlock_token;
 
 extern unsigned int __start___fw_ftr_fixup, __stop___fw_ftr_fixup;
 
-#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
-bool is_kvm_guest(void);
-#else
-static inline bool is_kvm_guest(void) { return false; }
-#endif
-
 #ifdef CONFIG_PPC_PSERIES
 void pseries_probe_fw_features(void);
 #else
diff --git a/arch/powerpc/include/asm/kvm_guest.h 
b/arch/powerpc/include/asm/kvm_guest.h
new file mode 100644
index ..c0ace884a0e8
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2020  IBM Corporation
+ */
+
+#ifndef __POWERPC_KVM_GUEST_H__
+#define __POWERPC_KVM_GUEST_H__
+
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
+bool is_kvm_guest(void);
+#else
+static inline bool is_kvm_guest(void) { return false; }
+#endif
+
+#endif /* __POWERPC_KVM_GUEST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_para.h 
b/arch/powerpc/include/asm/kvm_para.h
index 744612054c94..abe1b5e82547 100644
--- a/arch/powerpc/include/asm/kvm_para.h
+++ b/arch/powerpc/include/asm/kvm_para.h
@@ -8,7 +8,7 @@
 #ifndef __POWERPC_KVM_PARA_H__
 #define __POWERPC_KVM_PARA_H__
 
-#include 
+#include 
 
 #include 
 
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index fe48d319d490..5f48e5ad24cd 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -14,6 +14,7 @@
 #include 
 
 #include 
+#include 
 
 #ifdef CONFIG_PPC64
 unsigned long powerpc_firmware_features __read_mostly;
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index 92922491a81c..d578732c545d 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pseries.h"
 
-- 
2.18.4



[PATCH v2 2/4] powerpc: Rename is_kvm_guest to check_kvm_guest

2020-12-01 Thread Srikar Dronamraju
is_kvm_guest() will be reused in a subsequent patch in a new avatar.  Hence
rename is_kvm_guest to check_kvm_guest. No additional changes.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/kvm_guest.h | 4 ++--
 arch/powerpc/include/asm/kvm_para.h  | 2 +-
 arch/powerpc/kernel/firmware.c   | 2 +-
 arch/powerpc/platforms/pseries/smp.c | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_guest.h 
b/arch/powerpc/include/asm/kvm_guest.h
index c0ace884a0e8..ba8291e02ba9 100644
--- a/arch/powerpc/include/asm/kvm_guest.h
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -7,9 +7,9 @@
 #define __POWERPC_KVM_GUEST_H__
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
-bool is_kvm_guest(void);
+bool check_kvm_guest(void);
 #else
-static inline bool is_kvm_guest(void) { return false; }
+static inline bool check_kvm_guest(void) { return false; }
 #endif
 
 #endif /* __POWERPC_KVM_GUEST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_para.h 
b/arch/powerpc/include/asm/kvm_para.h
index abe1b5e82547..6fba06b6cfdb 100644
--- a/arch/powerpc/include/asm/kvm_para.h
+++ b/arch/powerpc/include/asm/kvm_para.h
@@ -14,7 +14,7 @@
 
 static inline int kvm_para_available(void)
 {
-   return IS_ENABLED(CONFIG_KVM_GUEST) && is_kvm_guest();
+   return IS_ENABLED(CONFIG_KVM_GUEST) && check_kvm_guest();
 }
 
 static inline unsigned int kvm_arch_para_features(void)
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index 5f48e5ad24cd..0aeb6a5b1a9e 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -22,7 +22,7 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
 #endif
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
-bool is_kvm_guest(void)
+bool check_kvm_guest(void)
 {
struct device_node *hyper_node;
 
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index d578732c545d..c70b4be9f0a5 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -211,7 +211,7 @@ static __init void pSeries_smp_probe(void)
if (!cpu_has_feature(CPU_FTR_SMT))
return;
 
-   if (is_kvm_guest()) {
+   if (check_kvm_guest()) {
/*
 * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp
 * faults to the hypervisor which then reads the instruction
-- 
2.18.4



[PATCH v2 4/4] powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted

2020-12-01 Thread Srikar Dronamraju
If it's a shared LPAR but not a KVM guest, then see if the vCPU is
related to the calling vCPU. On PowerVM, only cores can be preempted.
So if one vCPU is in a non-preempted state, we can infer that all other
vCPUs sharing the same core are in a non-preempted state.

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 
Acked-by: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/paravirt.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 9362c94fe3aa..edc08f04aef7 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -10,6 +10,9 @@
 #endif
 
 #ifdef CONFIG_PPC_SPLPAR
+#include 
+#include 
+
 DECLARE_STATIC_KEY_FALSE(shared_processor);
 
 static inline bool is_shared_processor(void)
@@ -74,6 +77,21 @@ static inline bool vcpu_is_preempted(int cpu)
 {
if (!is_shared_processor())
return false;
+
+#ifdef CONFIG_PPC_SPLPAR
+   if (!is_kvm_guest()) {
+   int first_cpu = cpu_first_thread_sibling(smp_processor_id());
+
+   /*
+* Preemption can only happen at core granularity. This CPU
+* is not preempted if one of the CPU of this core is not
+* preempted.
+*/
+   if (cpu_first_thread_sibling(cpu) == first_cpu)
+   return false;
+   }
+#endif
+
if (yield_count_of(cpu) & 1)
return true;
return false;
-- 
2.18.4



[PATCH v2 0/4] Powerpc: Better preemption for shared processor

2020-12-01 Thread Srikar Dronamraju
Currently, vcpu_is_preempted will return the yield_count for
shared_processor. On a PowerVM LPAR, PHYP schedules at the SMT8 core boundary,
i.e. all CPUs belonging to a core are either group scheduled in or group
scheduled out. This can be used to better predict non-preempted CPUs on
PowerVM shared LPARs.

perf stat -r 5 -a perf bench sched pipe -l 1000 (lesser time is better)

powerpc/next
 35,107,951.20 msec cpu-clock #  255.898 CPUs utilized  
  ( +-  0.31% )
23,655,348  context-switches  #0.674 K/sec  
  ( +-  3.72% )
14,465  cpu-migrations#0.000 K/sec  
  ( +-  5.37% )
82,463  page-faults   #0.002 K/sec  
  ( +-  8.40% )
 1,127,182,328,206  cycles#0.032 GHz
  ( +-  1.60% )  (66.67%)
78,587,300,622  stalled-cycles-frontend   #6.97% frontend cycles 
idle ( +-  0.08% )  (50.01%)
   654,124,218,432  stalled-cycles-backend#   58.03% backend cycles 
idle  ( +-  1.74% )  (50.01%)
   834,013,059,242  instructions  #0.74  insn per cycle
  #0.78  stalled cycles per 
insn  ( +-  0.73% )  (66.67%)
   132,911,454,387  branches  #3.786 M/sec  
  ( +-  0.59% )  (50.00%)
 2,890,882,143  branch-misses #2.18% of all branches
  ( +-  0.46% )  (50.00%)

   137.195 +- 0.419 seconds time elapsed  ( +-  0.31% )

powerpc/next + patchset
 29,981,702.64 msec cpu-clock #  255.881 CPUs utilized  
  ( +-  1.30% )
40,162,456  context-switches  #0.001 M/sec  
  ( +-  0.01% )
 1,110  cpu-migrations#0.000 K/sec  
  ( +-  5.20% )
62,616  page-faults   #0.002 K/sec  
  ( +-  3.93% )
 1,430,030,626,037  cycles#0.048 GHz
  ( +-  1.41% )  (66.67%)
83,202,707,288  stalled-cycles-frontend   #5.82% frontend cycles 
idle ( +-  0.75% )  (50.01%)
   744,556,088,520  stalled-cycles-backend#   52.07% backend cycles 
idle  ( +-  1.39% )  (50.01%)
   940,138,418,674  instructions  #0.66  insn per cycle
  #0.79  stalled cycles per 
insn  ( +-  0.51% )  (66.67%)
   146,452,852,283  branches  #4.885 M/sec  
  ( +-  0.80% )  (50.00%)
 3,237,743,996  branch-misses #2.21% of all branches
  ( +-  1.18% )  (50.01%)

117.17 +- 1.52 seconds time elapsed  ( +-  1.30% )

This is around a 14.6% improvement in performance.

Changelog:
v1->v2:
v1: 
https://lore.kernel.org/linuxppc-dev/20201028123512.871051-1-sri...@linux.vnet.ibm.com/t/#u
 - Rebased to 27th Nov linuxppc/merge tree.
 - Moved a hunk to fix a no previous prototype warning reported by: 
l...@intel.com
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org/thread/C6PTRPHWMC7VV4OTYN3ISYKDHTDQS6YI/

Cc: linuxppc-dev 
Cc: LKML 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: Gautham R Shenoy 
Cc: Peter Zijlstra 
Cc: Valentin Schneider 
Cc: Juri Lelli 
Cc: Waiman Long 
Cc: Phil Auld 

Srikar Dronamraju (4):
  powerpc: Refactor is_kvm_guest declaration to new header
  powerpc: Rename is_kvm_guest to check_kvm_guest
  powerpc: Reintroduce is_kvm_guest
  powerpc/paravirt: Use is_kvm_guest in vcpu_is_preempted

 arch/powerpc/include/asm/firmware.h  |  6 --
 arch/powerpc/include/asm/kvm_guest.h | 25 +
 arch/powerpc/include/asm/kvm_para.h  |  2 +-
 arch/powerpc/include/asm/paravirt.h  | 18 ++
 arch/powerpc/kernel/firmware.c   |  5 -
 arch/powerpc/platforms/pseries/smp.c |  3 ++-
 6 files changed, 50 insertions(+), 9 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_guest.h

-- 
2.18.4


