[PATCH v2 3/4] mm: Implement reset_numa_mem

2020-03-18 Thread Srikar Dronamraju
For memoryless or offline nodes, node_numa_mem refers to an N_MEMORY
fallback node. The kernel currently has an API, set_numa_mem(), that
sets node_numa_mem for a memoryless node, but this API cannot be used
for offline nodes. Hence all offline nodes have their node_numa_mem
left at 0, yet node 0 itself can be offline, i.e. memoryless and
cpuless at this time. In such cases, node_to_mem_node() fails to
provide an N_MEMORY fallback node.

Mitigate this by adding a new API that sets the default node_numa_mem
for offline nodes to first_memory_node.
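
As an illustration, patch 4/4 of this series calls the new API while
walking nodes at boot; a sketch of that usage (mirroring the powerpc
hunk later in this series):

        for_each_node(node) {
                /* Possible but not yet online nodes get a sane fallback. */
                if (!node_online(node)) {
                        reset_numa_mem(node);
                        continue;
                }
                /* ... existing handling of online nodes ... */
        }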

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/asm-generic/topology.h | 3 +++
 include/linux/topology.h   | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
index 238873739550..e803ee7850e6 100644
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -68,6 +68,9 @@
 #ifndef set_numa_mem
 #define set_numa_mem(node)
 #endif
+#ifndef reset_numa_mem
+#define reset_numa_mem(node)
+#endif
 #ifndef set_cpu_numa_mem
 #define set_cpu_numa_mem(cpu, node)
 #endif
diff --git a/include/linux/topology.h b/include/linux/topology.h
index eb2fe6edd73c..bebda80038bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -147,6 +147,13 @@ static inline int node_to_mem_node(int node)
 }
 #endif
 
+#ifndef reset_numa_mem
+static inline void reset_numa_mem(int node)
+{
+   _node_numa_mem_[node] = first_memory_node;
+}
+#endif
+
 #ifndef numa_mem_id
 /* Returns the number of the nearest Node with memory */
 static inline int numa_mem_id(void)
-- 
2.18.1



[PATCH v2 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-18 Thread Srikar Dronamraju
Currently, while allocating a slab for an offline node, we use its
associated node_numa_mem to search for a partial slab. If we don't find
a partial slab, we try allocating a slab from the offline node itself
using __alloc_pages_node(). However, since an offline node has no
present pages, this is bound to fail.

NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
Call Trace:
[c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
[c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
[c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
[c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
[c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
[c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
[c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
[c008b3683b90] [c0234e08] online_css+0x48/0xd0
[c008b3683bc0] [c023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
[c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
[c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
[c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
[c008b3683e20] [c000b278] system_call+0x5c/0x68

Mitigate this by allocating the new slab from the node_numa_mem.
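
For reference, the failing path is reachable from any node-affine
allocation against such a node; a sketch of the trigger seen in the
trace above (nid stands for a possible but offline node, as passed by
mem_cgroup_css_online() via kvmalloc_node()):

        /* nid is possible but offline; it has no present pages */
        void *map = kvmalloc_node(size, GFP_KERNEL, nid);
        /*
         * -> __kmalloc_node -> ___slab_alloc -> new_slab
         *    -> __alloc_pages_node(nid, ...)  fails: nid has no memory
         */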

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 -> v2:
- Handled comments from Vlastimil Babka
- Now node gets set to node_numa_mem in new_slab_objects.

 mm/slub.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1c55bf7892bf..2dc603a84290 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,6 +2475,9 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
if (freelist)
return freelist;
 
+   if (node != NUMA_NO_NODE && !node_present_pages(node))
+   node = node_to_mem_node(node);
+
page = new_slab(s, flags, node);
if (page) {
c = raw_cpu_ptr(s->cpu_slab);
@@ -2569,12 +2572,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:
 
if (unlikely(!node_match(page, node))) {
-   int searchnode = node;
-
if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
+   node = node_to_mem_node(node);
 
-   if (unlikely(!node_match(page, searchnode))) {
+   if (unlikely(!node_match(page, node))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
-- 
2.18.1



[PATCH v2 1/4] mm: Check for node_online in node_present_pages

2020-03-18 Thread Srikar Dronamraju
Calling kmalloc_node() on a possible node which is not yet online can
lead to a panic. Currently node_present_pages() doesn't verify that the
node is online before accessing its pgdat, but the pgdat struct may not
be available, resulting in a crash.

NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

Fix this by verifying the node is online before accessing the pgdat
structure. Fix the same for node_spanned_pages() too.
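
With the check in place, code that walks all possible nodes can call
node_present_pages() directly; a sketch of the now-safe pattern:

        int nid;

        for_each_node(nid) {                    /* includes offline nodes */
                if (!node_present_pages(nid))   /* safely returns 0 now */
                        continue;
                /* ... per-node work on nodes with present memory ... */
        }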

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/linux/mmzone.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f3f264826423..88078a3b95e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,8 +756,10 @@ typedef struct pglist_data {
atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
-#define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
-#define node_spanned_pages(nid)(NODE_DATA(nid)->node_spanned_pages)
+#define node_present_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
+#define node_spanned_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
-- 
2.18.1



[PATCH v2 0/4] Fix kmalloc_node on offline nodes

2020-03-18 Thread Srikar Dronamraju
Changelog v1 -> v2:
- Handled comments from Vlastimil Babka and Bharata B Rao
- Changes only in patch 2 and 4.

Sachin recently reported that linux-next was no longer bootable on a few
powerpc systems.
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/

# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10
#

Sachin bisected the problem to commit a75056fc1e7c ("mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node").

The root cause analysis showed that mm/slub and powerpc/numa had some
shortcomings with respect to offline nodes.

This patch series is on top of patches posted at
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 
Cc: Nathan Lynch 

Srikar Dronamraju (4):
  mm: Check for node_online in node_present_pages
  mm/slub: Use mem_node to allocate a new slab
  mm: Implement reset_numa_mem
  powerpc/numa: Set fallback nodes for offline nodes

 arch/powerpc/include/asm/topology.h | 16 
 arch/powerpc/kernel/smp.c   |  1 +
 include/asm-generic/topology.h  |  3 +++
 include/linux/mmzone.h  |  6 --
 include/linux/topology.h|  7 +++
 mm/slub.c   |  9 +
 6 files changed, 36 insertions(+), 6 deletions(-)

-- 
2.18.1



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 17:45:15]:

> On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-17 16:56:04]:
> > 
> >> 
> >> I wonder why do you get a memory leak while Sachin in the same situation
> >> [1] gets a crash? I don't understand anything anymore.
> > 
> > Sachin was testing on linux-next which has Kirill's patch which modifies
> > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > upstream, which doesn't have this. 
> 
> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
> that there has to be something else that calls kmalloc_node(node) where
> node is one that doesn't have present pages.
> 
> He mentions alloc_fair_sched_group() which has:
> 
> for_each_possible_cpu(i) {
> cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>   GFP_KERNEL, cpu_to_node(i));
> ...
> se = kzalloc_node(sizeof(struct sched_entity),
>   GFP_KERNEL, cpu_to_node(i));
> 


Sachin's experiment:
Upstream-next / memcg /
possible nodes were 0-31
online nodes were 0-1
kmalloc_node called via for_each_node / for_each_possible_node.
This would crash while allocating a slab from !N_ONLINE nodes.

Bharata's experiment:
Upstream
possible nodes were 0-1
online nodes were 0-1
kmalloc_node called via for_each_online_node / for_each_possible_cpu,
i.e. kmalloc_node is only called for N_ONLINE nodes.
So it wouldn't crash.

Even if his possible nodes were 0-256, I don't think we'd have
kmalloc_node being called for !N_ONLINE nodes. Hence it's not crashing.
If we look at the code you quote above, kzalloc_node is using cpu_to_node,
which in Bharata's case will always return 1.
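
A sketch of the two call patterns being contrasted (pn, pc and sz are
illustrative names, not the actual code):

        /* linux-next (with Kirill's patch): allocate for every possible node */
        for_each_node(nid)                      /* 0-31 possible, 0-1 online */
                pn[nid] = kzalloc_node(sz, GFP_KERNEL, nid); /* !N_ONLINE -> crash */

        /* upstream: the node always comes from an online cpu */
        for_each_possible_cpu(cpu)
                pc[cpu] = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu)); /* node 1 */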


> I assume one of these structs is 1k and other 512 bytes (rounded) and that for
> some possible cpu's cpu_to_node(i) will be 0, which has no present pages.
> And as Bharata pasted, node_to_mem_node(0) = 0
> So this looks like the same scenario, but it doesn't crash? Is the node 0
> actually online here, and/or does it have N_NORMAL_MEMORY state?

I still don't have any clue on the leak though.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 14:34:25]:

> 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > struct kmem_cache_cpu *c)
> >  {
> > void *object;
> > -   int searchnode = node;
> >  
> > -   if (node == NUMA_NO_NODE)
> > -   searchnode = numa_mem_id();
> > -   else if (!node_present_pages(node))
> > -   searchnode = node_to_mem_node(node);
> > -
> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > +   object = get_partial_node(s, get_node(s, node), c, flags);
> > if (object || node != NUMA_NO_NODE)
> >         return object;
> >
> >   return get_any_partial(s, flags, c);
> 
> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
> hunk below), thus the get_any_partial() call becomes dead code?

Very true. 

Would it be okay if we remove the node != NUMA_NO_NODE check?

if (object || node != NUMA_NO_NODE)
        return object;

will now become

if (object)
        return object;

-- 
Thanks and Regards
Srikar Dronamraju



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 16:56:04]:

> 
> I wonder why do you get a memory leak while Sachin in the same situation [1]
> gets a crash? I don't understand anything anymore.

Sachin was testing on linux-next which has Kirill's patch which modifies
slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
upstream, which doesn't have this. 

> 
> [1]
> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 14:53:26]:

> >> > 
> >> > Mitigate this by allocating the new slab from the node_numa_mem.
> >> 
> >> Are you sure this is really needed and the other 3 patches are not enough
> >> for the current SLUB code to work as needed? It seems you are changing
> >> the semantics here...
> >> 
> > 
> > The other 3 patches are not enough because we don't carry the searchnode
> > when the actual alloc_pages_node gets called. 
> > 
> > With only the 3 patches, we see the above Panic, its signature is slightly
> > different from what Sachin first reported and which I have carried in 1st
> > patch.
> 
> Ah, I see. So that's the missing pgdat after your series [1] right?

Yes, the pgdat would be missing after my cpuless, memoryless node patchset.
However...
> 
> That sounds like an argument for Michal's suggestion that pgdats exist and
> have correctly populated zonelists for all possible nodes.

Only the first patch in this series would be affected by the pgdat existing
or not.  Even if the pgdat existed, NODE_DATA(nid)->node_present_pages
would be 0, right? So it would look at node_to_mem_node(), and since node 0
is cpuless it would return 0, and we would then pass this node 0 (which is
memoryless/cpuless) to alloc_pages_node, which is bound to fail. Please
note I am setting node_numa_mem only for offline nodes. However we could
change this to set it for all offline and memoryless nodes.

> node_to_mem_node() could be just a shortcut for the first zone's node in the
> zonelist, so that fallback follows the topology.
> 
> [1]
> https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e
> 
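
A minimal sketch of the zonelist-based shortcut being suggested
(hypothetical, not part of the posted series; it assumes the node's
zonelist is populated, per Michal's suggestion):

static int node_to_mem_node_via_zonelist(int node)
{
        struct zonelist *zl = node_zonelist(node, GFP_KERNEL);
        struct zoneref *z = first_zones_zonelist(zl, ZONE_NORMAL, NULL);

        /* first zone in the fallback order, or a known-good memory node */
        return z->zone ? zone_to_nid(z->zone) : first_memory_node;
}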



-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 4/4] powerpc/numa: Set fallback nodes for offline nodes

2020-03-17 Thread Srikar Dronamraju
* Bharata B Rao  [2020-03-17 19:52:32]:

Thanks Bharata.

> This patchset can also fix a related problem that I reported earlier at
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-March/206076.html
> with an additional change, suggested by Srikar as shown below:
> 
> >  
> > -   for_each_online_node(node) {
> > +   for_each_node(node) {
> > +   /*
> > +* For all possible but not yet online nodes, ensure their
> > +* node_numa_mem is set correctly so that kmalloc_node works
> > +* for such nodes.
> > +*/
> > +   if (!node_online(node)) {
> 
> Change the above line to like below:
> 
> +   if (!node_state(node, N_MEMORY)) {
> 

Just to clarify, this is needed if we don't have
http://lore.kernel.org/lkml/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u


-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-17 14:34:25]:

> On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
> > Currently while allocating a slab for an offline node, we use its
> > associated node_numa_mem to search for a partial slab. If we don't find
> > a partial slab, we try allocating a slab from the offline node using
> > __alloc_pages_node. However this is bound to fail.
> > 
> > NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
> > LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
> > Call Trace:
> > [c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
> > [c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
> > [c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
> > [c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
> > [c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
> > [c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
> > [c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
> > [c008b3683b90] [c0234e08] online_css+0x48/0xd0
> > [c008b3683bc0] [c023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
> > [c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
> > [c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
> > [c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
> > [c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
> > [c008b3683e20] [c000b278] system_call+0x5c/0x68
> > 
> > Mitigate this by allocating the new slab from the node_numa_mem.
> 
> Are you sure this is really needed and the other 3 patches are not enough for
> the current SLUB code to work as needed? It seems you are changing the
> semantics here...
> 

The other 3 patches are not enough because we don't carry the searchnode
when the actual alloc_pages_node gets called. 

With only the 3 patches, we see the above Panic, its signature is slightly
different from what Sachin first reported and which I have carried in 1st
patch.

> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > struct kmem_cache_cpu *c)
> >  {
> > void *object;
> > -   int searchnode = node;
> >  
> > -   if (node == NUMA_NO_NODE)
> > -   searchnode = numa_mem_id();
> > -   else if (!node_present_pages(node))
> > -   searchnode = node_to_mem_node(node);
> > -
> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > +   object = get_partial_node(s, get_node(s, node), c, flags);
> > if (object || node != NUMA_NO_NODE)
> >         return object;
> >
> >   return get_any_partial(s, flags, c);
> 
> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
> hunk below), thus the get_any_partial() call becomes dead code?
> 
> > @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
> >  
> > WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
> >  
> > +   if (node == NUMA_NO_NODE)
> > +   node = numa_mem_id();
> > +   else if (!node_present_pages(node))
> > +   node = node_to_mem_node(node);
> > +
> > freelist = get_partial(s, flags, node, c);
> >  
> > if (freelist)
> > @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  redo:
> >  
> > if (unlikely(!node_match(page, node))) {
> > -   int searchnode = node;
> > -
> > if (node != NUMA_NO_NODE && !node_present_pages(node))
> > -   searchnode = node_to_mem_node(node);
> > +   node = node_to_mem_node(node);
> >  
> > -   if (unlikely(!node_match(page, searchnode))) {
> > +   if (unlikely(!node_match(page, node))) {
> > stat(s, ALLOC_NODE_MISMATCH);
> > deactivate_slab(s, page, c->freelist, c);
> > goto new_slab;
> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/4] mm: Check for node_online in node_present_pages

2020-03-17 Thread Srikar Dronamraju
* Srikar Dronamraju  [2020-03-17 18:47:50]:

> 
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 
> Signed-off-by: Srikar Dronamraju 
> ---
>  include/linux/mmzone.h | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f3f264826423..88078a3b95e5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>   atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
>  } pg_data_t;
> 
> -#define node_present_pages(nid)  (NODE_DATA(nid)->node_present_pages)
> -#define node_spanned_pages(nid)  (NODE_DATA(nid)->node_spanned_pages)
> +#define node_present_pages(nid)  \
> + (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
> +#define node_spanned_pages(nid)  \
> + (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP
>  #define pgdat_page_nr(pgdat, pagenr) ((pgdat)->node_mem_map + (pagenr))
>  #else

If we indeed start having pgdat/NODE_DATA even for offline nodes, as Michal
Hocko suggested, then we may not need this particular patch.

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 1/4] mm: Check for node_online in node_present_pages

2020-03-17 Thread Srikar Dronamraju
Calling kmalloc_node() on a possible node which is not yet online can
lead to a panic. Currently node_present_pages() doesn't verify that the
node is online before accessing its pgdat, but the pgdat struct may not
be available, resulting in a crash.

NIP [c03d55f4] ___slab_alloc+0x1f4/0x760
LR [c03d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c008b3783960] [c03d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c008b3783a40] [c03d5b94] __slab_alloc+0x34/0x60
[c008b3783a70] [c03d6fa0] __kmalloc_node+0x110/0x490
[c008b3783af0] [c03443d8] kvmalloc_node+0x58/0x110
[c008b3783b30] [c03fee38] mem_cgroup_css_online+0x108/0x270
[c008b3783b90] [c0235aa8] online_css+0x48/0xd0
[c008b3783bc0] [c023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3783ca0] [c0242318] cgroup_mkdir+0x228/0x5f0
[c008b3783d10] [c051e170] kernfs_iop_mkdir+0x90/0xf0
[c008b3783d50] [c043dc00] vfs_mkdir+0x110/0x230
[c008b3783da0] [c0441c90] do_mkdirat+0xb0/0x1a0
[c008b3783e20] [c000b278] system_call+0x5c/0x68

Fix this by verifying the node is online before accessing the pgdat
structure. Fix the same for node_spanned_pages() too.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/linux/mmzone.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f3f264826423..88078a3b95e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,8 +756,10 @@ typedef struct pglist_data {
atomic_long_t   vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
-#define node_present_pages(nid)(NODE_DATA(nid)->node_present_pages)
-#define node_spanned_pages(nid)(NODE_DATA(nid)->node_spanned_pages)
+#define node_present_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
+#define node_spanned_pages(nid)\
+   (node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 #define pgdat_page_nr(pgdat, pagenr)   ((pgdat)->node_mem_map + (pagenr))
 #else
-- 
2.18.1



[PATCH 4/4] powerpc/numa: Set fallback nodes for offline nodes

2020-03-17 Thread Srikar Dronamraju
Currently fallback nodes for offline nodes aren't set, so node 0 ends up
being the default fallback node. However, node 0 itself might be offline.

Fix this by explicitly setting fallback node. Ensure first_memory_node
is set before kernel does explicit setting of fallback node.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 281531340230..6e97ab6575cb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -827,7 +827,16 @@ void __init dump_numa_cpu_topology(void)
if (!numa_enabled)
return;
 
-   for_each_online_node(node) {
+   for_each_node(node) {
+   /*
+* For all possible but not yet online nodes, ensure their
+* node_numa_mem is set correctly so that kmalloc_node works
+* for such nodes.
+*/
+   if (!node_online(node)) {
+   reset_numa_mem(node);
+   continue;
+   }
pr_info("Node %d CPUs:", node);
 
count = 0;
-- 
2.18.1



[PATCH 3/4] mm: Implement reset_numa_mem

2020-03-17 Thread Srikar Dronamraju
For memoryless or offline nodes, node_numa_mem refers to an N_MEMORY
fallback node. The kernel currently has an API, set_numa_mem(), that
sets node_numa_mem for a memoryless node, but this API cannot be used
for offline nodes. Hence all offline nodes have their node_numa_mem
left at 0, yet node 0 itself can be offline, i.e. memoryless and
cpuless at this time. In such cases, node_to_mem_node() fails to
provide an N_MEMORY fallback node.

Mitigate this by adding a new API that sets the default node_numa_mem
for offline nodes to first_memory_node.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 include/asm-generic/topology.h | 3 +++
 include/linux/topology.h   | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/include/asm-generic/topology.h b/include/asm-generic/topology.h
index 238873739550..e803ee7850e6 100644
--- a/include/asm-generic/topology.h
+++ b/include/asm-generic/topology.h
@@ -68,6 +68,9 @@
 #ifndef set_numa_mem
 #define set_numa_mem(node)
 #endif
+#ifndef reset_numa_mem
+#define reset_numa_mem(node)
+#endif
 #ifndef set_cpu_numa_mem
 #define set_cpu_numa_mem(cpu, node)
 #endif
diff --git a/include/linux/topology.h b/include/linux/topology.h
index eb2fe6edd73c..bebda80038bf 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -147,6 +147,13 @@ static inline int node_to_mem_node(int node)
 }
 #endif
 
+#ifndef reset_numa_mem
+static inline void reset_numa_mem(int node)
+{
+   _node_numa_mem_[node] = first_memory_node;
+}
+#endif
+
 #ifndef numa_mem_id
 /* Returns the number of the nearest Node with memory */
 static inline int numa_mem_id(void)
-- 
2.18.1



[PATCH 0/4] Fix kmalloc_node on offline nodes

2020-03-17 Thread Srikar Dronamraju
Sachin recently reported that linux-next was no longer bootable on a few
systems.
https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/

# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 35247 MB
node 1 free: 30907 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10
#

Sachin bisected the problem to commit a75056fc1e7c ("mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node").

The root cause analysis showed that mm/slub and powerpc/numa had some
shortcomings with respect to offline nodes.

This patch series is on top of patches posted at
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Srikar Dronamraju (4):
  mm: Check for node_online in node_present_pages
  mm/slub: Use mem_node to allocate a new slab
  mm: Implement reset_numa_mem
  powerpc/numa: Set fallback nodes for offline nodes

 arch/powerpc/mm/numa.c | 11 ++-
 include/asm-generic/topology.h |  3 +++
 include/linux/mmzone.h |  6 --
 include/linux/topology.h   |  7 +++
 mm/slub.c  | 19 ---
 5 files changed, 32 insertions(+), 14 deletions(-)

--
2.18.1



[PATCH 2/4] mm/slub: Use mem_node to allocate a new slab

2020-03-17 Thread Srikar Dronamraju
Currently while allocating a slab for an offline node, we use its
associated node_numa_mem to search for a partial slab. If we don't find
a partial slab, we try allocating a slab from the offline node using
__alloc_pages_node. However this is bound to fail.

NIP [c039a300] __alloc_pages_nodemask+0x130/0x3b0
LR [c039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
Call Trace:
[c008b36837f0] [c039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
[c008b3683870] [c03d1ff8] new_slab+0x128/0xcf0
[c008b3683950] [c03d6060] ___slab_alloc+0x410/0x820
[c008b3683a40] [c03d64a4] __slab_alloc+0x34/0x60
[c008b3683a70] [c03d78b0] __kmalloc_node+0x110/0x490
[c008b3683af0] [c0343a08] kvmalloc_node+0x58/0x110
[c008b3683b30] [c03ffd44] mem_cgroup_css_online+0x104/0x270
[c008b3683b90] [c0234e08] online_css+0x48/0xd0
[c008b3683bc0] [c023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
[c008b3683ca0] [c02416f8] cgroup_mkdir+0x228/0x5f0
[c008b3683d10] [c0520360] kernfs_iop_mkdir+0x90/0xf0
[c008b3683d50] [c043e400] vfs_mkdir+0x110/0x230
[c008b3683da0] [c0441ee0] do_mkdirat+0xb0/0x1a0
[c008b3683e20] [c000b278] system_call+0x5c/0x68

Mitigate this by allocating the new slab from the node_numa_mem.

Cc: Andrew Morton 
Cc: linux...@kvack.org
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Sachin Sant 
Cc: Michal Hocko 
Cc: Christopher Lameter 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim 
Cc: Kirill Tkhai 
Cc: Vlastimil Babka 
Cc: Srikar Dronamraju 
Cc: Bharata B Rao 

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 
Signed-off-by: Srikar Dronamraju 
---
 mm/slub.c | 19 ---
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1c55bf7892bf..fdf7f38f96e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
struct kmem_cache_cpu *c)
 {
void *object;
-   int searchnode = node;
 
-   if (node == NUMA_NO_NODE)
-   searchnode = numa_mem_id();
-   else if (!node_present_pages(node))
-   searchnode = node_to_mem_node(node);
-
-   object = get_partial_node(s, get_node(s, searchnode), c, flags);
+   object = get_partial_node(s, get_node(s, node), c, flags);
if (object || node != NUMA_NO_NODE)
return object;
 
@@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 
WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
 
+   if (node == NUMA_NO_NODE)
+   node = numa_mem_id();
+   else if (!node_present_pages(node))
+   node = node_to_mem_node(node);
+
freelist = get_partial(s, flags, node, c);
 
if (freelist)
@@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:
 
if (unlikely(!node_match(page, node))) {
-   int searchnode = node;
-
if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
+   node = node_to_mem_node(node);
 
-   if (unlikely(!node_match(page, searchnode))) {
+   if (unlikely(!node_match(page, node))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);
goto new_slab;
-- 
2.18.1



Re: [PATCH 2/2] powerpc/smp: Use IS_ENABLED() to avoid #ifdef

2020-03-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-03-13 22:20:20]:

> We can avoid the #ifdef by using IS_ENABLED() in the existing
> condition check.
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/smp.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index aae61a3b3201..6d2a3a3666f0 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1189,18 +1189,17 @@ int get_physical_package_id(int cpu)
>  {
>   int pkg_id = cpu_to_chip_id(cpu);
> 
> -#ifdef CONFIG_PPC_SPLPAR
>   /*
>* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
>* defined. Hence we would return the chip-id as the result of
>* get_physical_package_id.
>*/
> - if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
> + if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR) &&
> + IS_ENABLED(CONFIG_PPC_SPLPAR)) {
>   struct device_node *np = of_get_cpu_node(cpu, NULL);
>   pkg_id = of_node_to_nid(np);
>   of_node_put(np);
>   }
> -#endif /* CONFIG_PPC_SPLPAR */
> 
>   return pkg_id;
>  }
> -- 
> 2.21.1
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/2] powerpc/smp: Drop superfluous NULL check

2020-03-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-03-13 22:20:19]:

> We don't need the NULL check of np, the result is the same because the
> OF helpers cope with NULL, of_node_to_nid(NULL) == NUMA_NO_NODE (-1).
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju 

> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/smp.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 37c12e3bab9e..aae61a3b3201 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1197,11 +1197,8 @@ int get_physical_package_id(int cpu)
>*/
>   if (pkg_id == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
>   struct device_node *np = of_get_cpu_node(cpu, NULL);
> -
> - if (np) {
> - pkg_id = of_node_to_nid(np);
> - of_node_put(np);
> - }
> + pkg_id = of_node_to_nid(np);
> + of_node_put(np);
>   }
>  #endif /* CONFIG_PPC_SPLPAR */
> 
> -- 
> 2.21.1
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-13 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-12 17:41:58]:

> On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka  [2020-03-12 14:51:38]:
> > 
> >> > * Vlastimil Babka  [2020-03-12 10:30:50]:
> >> > 
> >> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju 
> >> >> >>  wrote:
> >> >> >> * Michal Hocko  [2020-03-11 12:57:35]:
> >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> I think we do need well defined and documented rules around 
> >> node_to_mem_node(),
> >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> >> everyone handles it the same, safe way.
> 
> So let's try to brainstorm how this would look like? What I mean are
> some rules like below, even if some details in my current understanding
> are most likely incorrect:
> 

Agree.

> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> node with memory so that we don't require everyone to search for it in
> slightly different ways
> N_ONLINE - pgdat must exist, there doesn't have to be present memory,
> node_to_mem_node() still has to return something else (?)

Right, I think this has been taken care of at this time.

> N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
> N_HIGH_MEMORY - node has present high memory
> 

I don't see any problems with the above two either. That leaves us with N_POSSIBLE.
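
One way to encode such rules would be a single fallback helper, as a
strawman (safe_mem_node() is a made-up name, not an existing API):

static int safe_mem_node(int nid)
{
        if (nid == NUMA_NO_NODE)
                return numa_mem_id();
        if (!node_online(nid))          /* N_POSSIBLE only: pgdat may be absent */
                return first_memory_node;
        if (!node_state(nid, N_NORMAL_MEMORY))
                return node_to_mem_node(nid);
        return nid;
}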

> > 
> > Other option would be to tweak Kirill Tkhai's patch such that we call
> > kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> > if the node is offline.
> 
> I really would like a solution that hides these ugly details from callers so
> they don't have to workaround the APIs we provide. kvmalloc_node() really
> shouldn't crash, and it should fallback automatically if we don't give it
> __GFP_THISNODE
> 

Agree thats its better to make API's robust where possible.

> However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
> wasteful on systems with 256 possible nodes and only few present, by
> allocating effectively dead structures for each memcg.
> 

If we don't allocate now, we would have to allocate them when we online the
nodes. To me it looks better to allocate as soon as the nodes are onlined.

> SLUB tries to be smart, so it allocates the per-node per-cache structures only
> when the node goes online in slab_mem_going_online_callback(). This is why
> there's a crash when such non-existing structures are accessed for a node
> that's not online, and why they shouldn't be accessed.
> 
> Perhaps memcg should do the same on-demand allocation, if possible.
> 

Right.
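
A sketch of what such on-demand allocation could look like, modeled on
SLUB's memory hotplug callback (memcg_alloc_shrinker_maps_node() is
hypothetical):

static int memcg_memory_callback(struct notifier_block *self,
                                 unsigned long action, void *arg)
{
        struct memory_notify *mn = arg;
        int ret = 0;

        if (action == MEM_GOING_ONLINE)
                /* allocate shrinker maps only when the node comes online */
                ret = memcg_alloc_shrinker_maps_node(mn->status_change_nid);

        return notifier_from_errno(ret);
}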

-- 
Thanks and Regards
Srikar Dronamraju



Re: [5.6.0-rc2-next-20200218/powerpc] Boot failure on POWER9

2020-03-13 Thread Srikar Dronamraju
* Michael Ellerman  [2020-03-13 21:48:06]:

> Sachin Sant  writes:
> >> The patch below might work. Sachin can you test this? I tried faking up
> >> a system with a memoryless node zero but couldn't get it to even start
> >> booting.
> >> 
> > The patch did not help. The kernel crashed during
> > the boot with the same call trace.
> >
> > BUG_ON() introduced with the patch was not triggered.
> 
> OK, that's weird.
> 
> I eventually managed to get a memoryless node going in sim, and it
> appears to work there.
> 
> eg in dmesg:
> 
>   [0.00][T0] numa:   NODE_DATA [mem 0x2000fffa2f80-0x2000fffa7fff]
>   [0.00][T0] numa: NODE_DATA(0) on node 1
>   [0.00][T0] numa:   NODE_DATA [mem 0x2000fff9df00-0x2000fffa2f7f]
>   ...
>   [0.00][T0] Early memory node ranges
>   [0.00][T0]   node   1: [mem 0x-0x]
>   [0.00][T0]   node   1: [mem 0x2000-0x2000]
>   [0.00][T0] Could not find start_pfn for node 0
>   [0.00][T0] Initmem setup node 0 [mem 0x-0x]
>   [0.00][T0] On node 0 totalpages: 0
>   [0.00][T0] Initmem setup node 1 [mem 0x-0x2000]
>   [0.00][T0] On node 1 totalpages: 131072
>   
>   # dmesg | grep set_numa
>   [0.00][T0] set_numa_mem: mem node for 0 = 1
>   [0.005654][T0] set_numa_mem: mem node for 1 = 1
> 
> So is the problem more than just node zero having no memory?
> 

The problem would happen with possible nodes which are not yet present,
i.e. nodes with no cpus and no memory attached to them.

Please look at
http://lore.kernel.org/lkml/20200312131438.gb3...@linux.vnet.ibm.com/t/#u
for more details.

The summary being: pgdat/NODE_DATA for such nodes is not allocated. Hence
node_present_pages(nid), where nid is a possible but not yet present node,
fails. Currently node_present_pages(nid) and node_to_mem_node() don't seem
to be equipped to handle possible but not present nodes.
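
Concretely, on powerpc NODE_DATA(nid) is just node_data[nid], which is
only populated for nodes initialized at boot, so (sketch):

        pg_data_t *pgdat = NODE_DATA(nid);  /* NULL for a possible but
                                               not present node */

        /* node_present_pages(nid) expands to exactly this dereference */
        unsigned long present = pgdat->node_present_pages;  /* oops */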

> cheers

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-13 Thread Srikar Dronamraju
* Joonsoo Kim  [2020-03-13 18:47:49]:

> > >>
> > >> > Also for a memoryless/cpuless node or possible but not present nodes,
> > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> > >>
> > >> I think that's the place where this would be best to fix.
> > >>
> > >
> > > Maybe. I thought about it but the current set_numa_mem semantics are apt
> > > for memoryless cpu node and not for possible nodes.  We could have upto 
> > > 256
> > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> > > node_to_mem_node seems to return what is set in set_numa_mem().
> > > set_numa_mem() seems to say set my numa_mem node for the current 
> > > memoryless
> > > node to the param passed.
> > >
> > > But how do we set numa_mem for all the other 253 possible nodes, which
> > > probably will have 0 as default?
> > >
> > > Should we introduce another API such that we could update for all possible
> > > nodes?
> >
> > If we want to rely on node_to_mem_node() to give us something safe for each
> > possible node, then probably it would have to be like that, yeah.
> >
> > >> > I tried with this hunk below and it works.
> > >> >
> > >> > But I am not sure if we need to check at other places were
> > >> > node_present_pages is being called.
> > >>
> > >> I think this seems to defeat the purpose of node_to_mem_node()? 
> > >> Shouldn't it
> > >> return only nodes that are online with present memory?
> > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c 
> > >> ("topology: add
> > >> support for node_to_mem_node() to determine the fallback node")
> > >>
> > >
> > > Agree
> 
> I lost all the memory about it. :)
> Anyway, how about this?
> 
> 1. make node_present_pages() safer
> static inline int node_present_pages(int nid)
> {
>         if (!node_online(nid))
>                 return 0;
>         return NODE_DATA(nid)->node_present_pages;
> }
> 

Yes this would help.

> 2. make node_to_mem_node() safer for all cases
> In ppc arch's mem_topology_setup(void)
> for_each_present_cpu(cpu) {
>  numa_setup_cpu(cpu);
>  mem_node = node_to_mem_node(numa_mem_id());
>  if (!node_present_pages(mem_node)) {
>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>  }
> }
> 

But here, as discussed above, we miss the case of possible but not present
nodes. For such nodes, the above change may not update node_numa_mem, so
they would still have 0. And node 0 itself can be possible but not present.
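
To cover possible-but-not-present nodes too, the update would have to
walk all possible nodes and set an explicit fallback, along these lines
(sketch of what the reset_numa_mem() approach in the later series does):

        for_each_node(node)
                if (!node_online(node))
                        _node_numa_mem_[node] = first_memory_node;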

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-12 14:51:38]:

> > * Vlastimil Babka  [2020-03-12 10:30:50]:
> > 
> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju 
> >> >>  wrote:
> >> >> * Michal Hocko  [2020-03-11 12:57:35]:
> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >> >>>> to make sure all possible but not present cpu_to_node are set to a
> >> >>>> proper node.
> >> >>> 
> >> >>> Just curious, is this somehow related to
> >> >>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
> >> >>> 
> >> >> 
> >> >> The issue I am trying to fix is a known issue in Powerpc since many 
> >> >> years.
> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: 
> >> >> allocate
> >> >> shrinker_map on appropriate NUMA node"). 
> >> >> 
> > 
> > While I am not an expert in the slub area, I looked at the patch
> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
> > 
> > On the system where the crash happens, the possible number of nodes is much
> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> > available for onlined nodes.
> > 
> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> > for all possible nodes and in ___slab_alloc we end up looking at the
> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
> > i.e for a node whose pdgat struct is not allocated, we are trying to
> > dereference.
> 
> From what we saw, the pgdat does exist, the problem is that slab's
> per-node data doesn't exist for a node that doesn't have present pages,
> as it would be a waste of memory.

Just to be clear
Before my 3 patches to fix dummy node:
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
0-1

> 
> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
> Sachin's first report [1] we have
> 
> [0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> [0.00] numa: NODE_DATA(0) on node 1
> [0.00] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
> 

So even if pgdat were to exist for nodes 0 and 1, there is no pgdat for
the remaining 30 nodes.

> But in this thread, with your patches Sachin reports:

and with my patches
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
1

> 
> [0.00] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> 

so we only see one pgdat.

> So I assume it's just node 1. In that case, node_present_pages is really
> dangerous.
> 
> [1]
> https://lore.kernel.org/linux-next/3381cd91-ab3d-4773-ba04-e7a072a63...@linux.vnet.ibm.com/
> 
> > Also for a memoryless/cpuless node or possible but not present nodes,
> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> 
> I think that's the place where this would be best to fix.
> 

Maybe. I thought about it, but the current set_numa_mem semantics are apt
for a memoryless cpu node and not for possible nodes.  We could have up to
256 possible nodes and only 2 nodes (1,2) with cpus and 1 node (1) with
memory. node_to_mem_node seems to return what is set in set_numa_mem().
set_numa_mem() seems to say: set the numa_mem node for the current
memoryless node to the param passed.

But how do we set numa_mem for all the other 253 possible nodes, which
probably will have 0 as default?

Should we introduce another API such that we could update for all possible
nodes?

> > I tried with this hunk below and it works.
> > 
> > But I am not sure if we need to check at other places were
> > node_present_pages is being called.
> 
> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> return only nodes that are online with present memory?
> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology:
> add support for node_to_mem_node() to determine the fallback node")
> 

Agree 

> I think we do need well defined and documented rules around
> node_to_mem_node(), cpu_to_node(), existence of NODE_DATA, various
> node_states bitmaps etc so everyone handles it the same, safe way.
> 

Other option would be to tweak Kirill Tkhai's patch such that we call
kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
if the node is offline.

Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-12 Thread Srikar Dronamraju
* Vlastimil Babka  [2020-03-12 10:30:50]:

> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju  
> >> wrote:
> >> * Michal Hocko  [2020-03-11 12:57:35]:
> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >>>> to make sure all possible but not present cpu_to_node are set to a
> >>>> proper node.
> >>> 
> >>> Just curious, is this somehow related to
> >>> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
> >>> 
> >> 
> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> shrinker_map on appropriate NUMA node"). 
> >> 
> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
> >> kernel. Will work with Sachin/Abdul (reporters of the issue).

I had used v1 and not v2. So my mistake.

> > I applied this 3 patch series on top of March 11 next tree (commit 
> > d44a64766795 )
> > The kernel still fails to boot with same call trace.
> 

While I am not an expert in the slub area, I looked at the patch
a75056fc1e7c and had some thoughts on why this could be causing this issue.

On the system where the crash happens, the possible number of nodes is much
greater than the number of onlined nodes. The pgdat, i.e. the NODE_DATA, is
only available for onlined nodes.

With a75056fc1e7c's memcg_alloc_shrinker_maps, we end up calling kzalloc_node
for all possible nodes, and in ___slab_alloc we end up looking at
node_present_pages, which is NODE_DATA(nid)->node_present_pages, i.e. for a
node whose pgdat struct is not allocated, we end up dereferencing a NULL
pointer.

Also, for a memoryless/cpuless node, or for possible but not present nodes,
node_to_mem_node(node) will still end up as node (at least on powerpc).

I tried with this hunk below and it works.

But I am not sure if we need to check at other places where
node_present_pages is being called.

diff --git a/mm/slub.c b/mm/slub.c
index 626cbcbd977f..bddb93bed55e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!node_match(page, node))) {
int searchnode = node;
 
-   if (node != NUMA_NO_NODE && !node_present_pages(node))
-   searchnode = node_to_mem_node(node);
-
+   if (node != NUMA_NO_NODE) {
+   if (!node_online(node) || !node_present_pages(node)) {
+   searchnode = node_to_mem_node(node);
+   if (!node_online(searchnode))
+   searchnode = first_online_node;
+   }
+   }
if (unlikely(!node_match(page, searchnode))) {
stat(s, ALLOC_NODE_MISMATCH);
deactivate_slab(s, page, c->freelist, c);

> > 
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-11 Thread Srikar Dronamraju
* Michal Hocko  [2020-03-11 12:57:35]:

> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> > A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> > enabled always used to have a node 0, even if node 0 does not any cpus
> > or memory attached to it. As per PAPR, node affinity of a cpu is only
> > available once its present / online. For all cpus that are possible but
> > not present, cpu_to_node() would point to node 0.
> > 
> > To ensure a cpuless, memoryless dummy node is not online, powerpc need
> > to make sure all possible but not present cpu_to_node are set to a
> > proper node.
> 
> Just curious, is this somehow related to
> http://lkml.kernel.org/r/20200227182650.gg3...@dhcp22.suse.cz?
> 

The issue I am trying to fix has been a known issue in Powerpc for many
years, so this is surely not a problem introduced by a75056fc1e7c
("mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node").

I tried v5.6-rc4 + a75056fc1e7c but didn't face any issues booting the
kernel. Will work with Sachin/Abdul (reporters of the issue).


> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux...@kvack.org
> > Cc: linux-ker...@vger.kernel.org
> > Cc: Michal Hocko 
> > Cc: Mel Gorman 
> > Cc: Vlastimil Babka 
> > Cc: "Kirill A. Shutemov" 
> > Cc: Christopher Lameter 
> > Cc: Michael Ellerman 
> > Cc: Andrew Morton 
> > Cc: Linus Torvalds 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >  arch/powerpc/mm/numa.c | 16 ++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 8a399db..54dcd49 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
> >  
> > reset_numa_cpu_lookup_table();
> >  
> > -   for_each_present_cpu(cpu)
> > -   numa_setup_cpu(cpu);
> > +   for_each_possible_cpu(cpu) {
> > +   /*
> > +* Powerpc with CONFIG_NUMA always used to have a node 0,
> > +* even if it was memoryless or cpuless. For all cpus that
> > +* are possible but not present, cpu_to_node() would point
> > +* to node 0. To remove a cpuless, memoryless dummy node,
> > +* powerpc need to make sure all possible but not present
> > +* cpu_to_node are set to a proper node.
> > +*/
> > +   if (cpu_present(cpu))
> > +   numa_setup_cpu(cpu);
> > +   else
> > +   set_cpu_numa_node(cpu, first_online_node);
> > +   }
> >  }
> >  
> >  void __init initmem_init(void)
> > -- 
> > 1.8.3.1
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-03-11 Thread Srikar Dronamraju
Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot.  However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this cpuless, memoryless dummy
node can confuse users/scripts looking at the output of lscpu / numactl.

Let's stop assuming that node 0 is always online.

v5.6-rc4
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.6-rc4 + patch
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31
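
The numa_balancing flip seen above follows from the kernel applying the
default balancing state only on multi-node systems; paraphrased from
check_numabalancing_enable() in mm/mempolicy.c (sketch, details elided):

        /* the default is only applied when more than one node is online */
        if (num_online_nodes() > 1 && !numabalancing_override)
                set_numabalancing_state(numabalancing_default);

With the dummy node 0 online there are two online nodes, so balancing is
enabled; with node 0 kept offline only one node is online and it stays off.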

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb75..68e635f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ struct pcpu_drain {
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+   [N_ONLINE] = NODE_MASK_NONE,
+#else
[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
[N_HIGH_MEMORY] = { { [0] = 1UL } },
-- 
1.8.3.1



[PATCH 2/3] powerpc/numa: Prefer node id queried from vphn

2020-03-11 Thread Srikar Dronamraju
The node id queried from the static device tree may not
be correct. For example, it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn, and fall back on the device
tree based node id if the vphn query fails.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 54dcd49..8735fed 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
 */
for_each_present_cpu(i) {
struct device_node *cpu;
-   int nid;
-
-   cpu = of_get_cpu_node(i, NULL);
-   BUG_ON(!cpu);
-   nid = of_node_to_nid_single(cpu);
-   of_node_put(cpu);
+   int nid = vphn_get_nid(i);
 
/*
 * Don't fall back to default_nid yet -- we will plug
 * cpus into nodes once the memory scan has discovered
 * the topology.
 */
-   if (nid < 0)
-   continue;
+   if (nid == NUMA_NO_NODE) {
+   cpu = of_get_cpu_node(i, NULL);
+   if (cpu) {
+   nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
+   }
+   }
node_set_online(nid);
}
 
-- 
1.8.3.1



[PATCH 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-03-11 Thread Srikar Dronamraju
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not have any
cpus or memory attached to it. As per PAPR, the node affinity of a cpu is
only available once it is present / online. For all cpus that are
possible but not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8a399db..54dcd49 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
 
reset_numa_cpu_lookup_table();
 
-   for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   for_each_possible_cpu(cpu) {
+   /*
+* Powerpc with CONFIG_NUMA always used to have a node 0,
+* even if it was memoryless or cpuless. For all cpus that
+* are possible but not present, cpu_to_node() would point
+* to node 0. To remove a cpuless, memoryless dummy node,
+* powerpc need to make sure all possible but not present
+* cpu_to_node are set to a proper node.
+*/
+   if (cpu_present(cpu))
+   numa_setup_cpu(cpu);
+   else
+   set_cpu_numa_node(cpu, first_online_node);
+   }
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1



[PATCH 0/3] Offline memoryless cpuless node 0

2020-03-11 Thread Srikar Dronamraju
A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. However, in practice
there are systems on which node 0 is memoryless and cpuless.

This can cause
1. numa_balancing to be enabled on systems with effectively only one
online node.
2. The existence of a dummy (cpuless and memoryless) node, which can
confuse users/scripts looking at the output of lscpu / numactl.

This patchset corrects this anomaly.

This should only affect systems that have CONFIG_MEMORYLESS_NODES.
Currently only two architectures, ia64 and powerpc, have this config.
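
In node_states terms, the anomaly looks like this (an illustrative check
only, not part of the patchset):

#include <linux/nodemask.h>

/* a dummy node: online, yet contributing neither cpus nor memory */
static bool example_is_dummy_node(int nid)
{
	return node_online(nid) &&
	       !node_state(nid, N_CPU) &&
	       !node_state(nid, N_MEMORY);
}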

v5.6-rc4
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
--
 /sys/devices/system/node/online:0,2
 /proc/sys/kernel/numa_balancing:1
 /sys/devices/system/node/has_cpu:   2
 /sys/devices/system/node/has_memory:2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:  0-31

v5.6-rc4 + patches
--
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
--
/sys/devices/system/node/online:2
/proc/sys/kernel/numa_balancing:0
/sys/devices/system/node/has_cpu:   2
/sys/devices/system/node/has_memory:2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:  0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Cc: Michal Hocko 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: "Kirill A. Shutemov" 
Cc: Christopher Lameter 
Cc: Michael Ellerman 
Cc: Andrew Morton 
Cc: Linus Torvalds 

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 32 ++--
 mm/page_alloc.c|  4 +++-
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
1.8.3.1



Re: [PATCH v4] powerpc/smp: Use nid as fallback for package_id

2020-01-30 Thread Srikar Dronamraju
* Gautham R Shenoy  [2020-01-30 16:27:47]:

> On Wed, Jan 29, 2020 at 07:21:21PM +0530, Srikar Dronamraju wrote:
> 
> [..snip..]
> 
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1185,10 +1185,34 @@ static inline void add_cpu_to_smallcore_masks(int 
> > cpu)
> > }
> >  }
> > 
> > +int get_physical_package_id(int cpu)
> > +{
> > +   int ppid = cpu_to_chip_id(cpu);
> > +
> > +#ifdef CONFIG_PPC_SPLPAR
> > +   /*
> > +* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
> > +* defined. Hence we would return the chip-id as the
> > +* get_physical_package_id.
> > +*/
> > +   if (ppid == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
> > +   struct device_node *np = of_get_cpu_node(cpu, NULL);
> > +
> > +   if (np) {
> > +   ppid = of_node_to_nid(np);
> > +   of_node_put(np);
> > +   }
> > +   }
> > +#endif /* CONFIG_PPC_SPLPAR */
> > +
> > +   return ppid;
> > +}
> >  static void add_cpu_to_masks(int cpu)
> >  {
> > int first_thread = cpu_first_thread_sibling(cpu);
> > -   int chipid = cpu_to_chip_id(cpu);
> > +   int ppid = get_physical_package_id(cpu);
> > int i;
> > 
> > /*
> > @@ -1217,11 +1241,11 @@ static void add_cpu_to_masks(int cpu)
> > for_each_cpu(i, cpu_l2_cache_mask(cpu))
> > set_cpus_related(cpu, i, cpu_core_mask);
> > 
> > -   if (chipid == -1)
> > +   if (ppid == -1)
> > return;
> 
> Can get_physical_package_id() return -1 ?
> 

Yes, it can return -1 (sketched below):
1. A system without CONFIG_PPC_SPLPAR: cpu_to_chip_id() might return -1.
2. A system with CONFIG_PPC_SPLPAR can still return -1 in these cases:
a. The firmware feature FW_FEATURE_LPAR is not present.
b. For some reason, of_get_cpu_node() does not find the cpu's device node.
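
A sketch of those paths, annotated (an illustrative restatement of the
function quoted above, not new code):

	int ppid = cpu_to_chip_id(cpu);	/* case 1: -1 if ibm,chip-id is absent */

#ifdef CONFIG_PPC_SPLPAR
	if (ppid == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
		struct device_node *np = of_get_cpu_node(cpu, NULL);

		if (np) {			/* case 2b: np may be NULL */
			ppid = of_node_to_nid(np);
			of_node_put(np);
		}
	}
#endif	/* case 2a: without FW_FEATURE_LPAR the fallback is skipped */

	return ppid;			/* hence -1 remains a possible return */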

> > 
> > for_each_cpu(i, cpu_online_mask)
> > -   if (cpu_to_chip_id(i) == chipid)
> > +   if (get_physical_package_id(i) == ppid)
> > set_cpus_related(cpu, i, cpu_core_mask);
> >  }

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v6 4/5] powerpc/numa: Early request for home node associativity

2020-01-29 Thread Srikar Dronamraju
Currently the kernel detects if it is running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However, between the time NUMA setup is initialized and the
request for home node associativity, the workqueue code initializes its
per node cpumask. The per node workqueue possible cpumask may turn
invalid after the home node associativity update, resulting in weird
situations like the workqueue possible cpumask being a subset of the
workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However, at NUMA setup time the kernel may not be in
a position to detect if it is running on a shared lpar platform. So
request home node associativity unconditionally and, if the request
fails, fall back on the device tree property.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v5->v6):
- Handled comments from Michael Ellerman
  * hardware-id is now deciphered in vphn_get_nid
  * No need to pass extra bool

Changelog (v2->v3):
- Handled comments from Nathan Lynch
  * Use first thread of the core for cpu-to-node map.
  * get hardware-id in numa_setup_cpu

Changelog (v1->v2):
- Handled comments from Nathan Lynch
  * Dont depend on pacas to be setup for the hwid

 arch/powerpc/mm/numa.c | 41 -
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 63ec0c3..ee7400f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,6 +461,41 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+#ifdef CONFIG_PPC_SPLPAR
+static int vphn_get_nid(long lcpu)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc, hwid;
+
+   /*
+* On a shared lpar, device tree will not have node associativity.
+* At this time lppaca, or its __old_status field may not be
+* updated. Hence the kernel cannot detect if it is on a shared lpar. So
+* request an explicit associativity irrespective of whether the
+* lpar is shared or dedicated. Use the device tree property as a
+* fallback. cpu_to_phys_id is only valid between
+* smp_setup_cpu_maps() and smp_setup_pacas().
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN)) {
+   if (cpu_to_phys_id)
+   hwid = cpu_to_phys_id[lcpu];
+   else
+   hwid = get_hard_smp_processor_id(lcpu);
+
+   rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
+   if (rc == H_SUCCESS)
+   return associativity_to_nid(associativity);
+   }
+
+   return NUMA_NO_NODE;
+}
+#else
+static int vphn_get_nid(long unused)
+{
+   return NUMA_NO_NODE;
+}
+#endif  /* CONFIG_PPC_SPLPAR */
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
  * Return the id of the domain used.
@@ -485,6 +520,10 @@ static int numa_setup_cpu(unsigned long lcpu)
return nid;
}
 
+   nid = vphn_get_nid(lcpu);
+   if (nid != NUMA_NO_NODE)
+   goto out_present;
+
cpu = of_get_cpu_node(lcpu, NULL);
 
if (!cpu) {
@@ -496,6 +535,7 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
@@ -515,7 +555,6 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
map_cpu_to_node(lcpu, nid);
-   of_node_put(cpu);
 out:
return nid;
 }
-- 
1.8.3.1



[PATCH v6 5/5] powerpc/numa: Remove late request for home node associativity

2020-01-29 Thread Srikar Dronamraju
With the previous commit ("powerpc/numa: Early request for home node
associativity"), commit 2ea626306810 ("powerpc/topology: Get topology
for shared processors at boot"), which was requesting home node
associativity late in boot, becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 2f7e1ea..9bd396f 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 
*cpu2_assoc)
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include <asm-generic/topology.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf..cdd39a0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index ee7400f..01a03f2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1632,15 +1632,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
1.8.3.1



[PATCH v6 2/5] powerpc/numa: Handle extra hcall_vphn error cases

2020-01-29 Thread Srikar Dronamraju
Currently the code handles the H_FUNCTION, H_SUCCESS and H_HARDWARE
return codes. However hcall_vphn can return other codes as well. Now the
H_PARAMETER return code is also handled, and all remaining return codes
are handled under the default case.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v1->v2):
 Handled comments from Nathan:
  - Split patch from patch 1.
  - Corrected a problem where I missed calling stop_topology_update().
  - Using pr_err_ratelimited instead of printk.

 arch/powerpc/mm/numa.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d2..8fbe57c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,23 +1191,30 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   goto out;
+
case H_FUNCTION:
-   printk_once(KERN_INFO
-   "VPHN is not supported. Disabling polling...\n");
-   stop_topology_update();
+   pr_err_ratelimited("VPHN unsupported. Disabling polling...\n");
break;
case H_HARDWARE:
-   printk(KERN_ERR
-   "hcall_vphn() experienced a hardware fault "
+   pr_err_ratelimited("hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
-   stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   pr_err_ratelimited("hcall_vphn() was passed an invalid 
parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   pr_err_ratelimited("hcall_vphn() returned %ld. Disabling 
polling...\n"
+   , rc);
break;
}
 
+   stop_topology_update();
+out:
return rc;
 }
 
-- 
1.8.3.1



[PATCH v6 3/5] powerpc/numa: Use cpu node map of first sibling thread

2020-01-29 Thread Srikar Dronamraju
All the sibling threads of a core have to be part of the same node.
To ensure that all the sibling threads map to the same node, always
lookup/update the cpu-to-node map of the first thread in the core.
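
For example (illustrative, assuming 8 threads per core): cpus 8-15 all
share first sibling 8, so a single per-core entry covers every thread:

	int lcpu = 13;					/* any thread of the core */
	int fcpu = cpu_first_thread_sibling(lcpu);	/* -> 8 */
	int nid  = numa_cpu_lookup_table[fcpu];		/* one lookup per core */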

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog: v3->v4
As suggested by Nathan, add a warning when mapping an offline cpu.

 arch/powerpc/mm/numa.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8fbe57c..63ec0c3 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -467,15 +467,20 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
  */
 static int numa_setup_cpu(unsigned long lcpu)
 {
-   int nid = NUMA_NO_NODE;
struct device_node *cpu;
+   int fcpu = cpu_first_thread_sibling(lcpu);
+   int nid = NUMA_NO_NODE;
 
/*
 * If a valid cpu-to-node mapping is already available, use it
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
+* Since the cpu_to_node binding remains the same for all threads in
+* the core, if a valid cpu-to-node mapping is already available for
+* the first thread in the core, use it.
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup_table[fcpu];
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -496,6 +501,19 @@ static int numa_setup_cpu(unsigned long lcpu)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   /*
+* Update for the first thread of the core. All threads of a core
+* have to be part of the same node. This not only avoids querying
+* for every other thread in the core, but also avoids a case
+* where virtual node associativity change causes subsequent threads
+* of a core to be associated with different nid. However if first
+* thread is already online, expect it to have a valid mapping.
+*/
+   if (fcpu != lcpu) {
+   WARN_ON(cpu_online(fcpu));
+   map_cpu_to_node(fcpu, nid);
+   }
+
map_cpu_to_node(lcpu, nid);
of_node_put(cpu);
 out:
-- 
1.8.3.1



[PATCH v6 1/5] powerpc/vphn: Check for error from hcall_vphn

2020-01-29 Thread Srikar Dronamraju
There is no value in unpacking associativity if the
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v1->v2):
- Split the patch into 2 (suggested by Nathan).

 arch/powerpc/platforms/pseries/vphn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vphn.c 
b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6..cca474a 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 
*associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
1.8.3.1



[PATCH v6 0/5] Early node associativity

2020-01-29 Thread Srikar Dronamraju
Note: (Only patch 4 changes from v5)

Abdul reported a warning on a shared lpar:
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because the per node workqueue possible mask is set very early in
the boot process, even before the system has queried the home node
associativity. However, the per node workqueue online cpumask gets
updated dynamically. Hence there is a chance that the per node workqueue
online cpumask becomes a superset of the per node workqueue possible mask.
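
A paraphrase of the invariant behind that warning (not the exact
workqueue.c code; the function and mask names here are hypothetical):

#include <linux/cpumask.h>
#include <linux/printk.h>

static void example_check_node_masks(const struct cpumask *possible,
				     const struct cpumask *online)
{
	/* a node's online cpus must stay within its possible cpus */
	if (!cpumask_subset(online, possible))
		pr_warn("workqueue cpumask: online intersect > possible intersect\n");
}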

Link for v5: 
http://lkml.kernel.org/r/20191216144904.6776-1-sri...@linux.vnet.ibm.com
Changelog: v5->v6
 - Handled comments from Michael Ellerman
 - Rebased to v5.5-rc6

Link for v4: https://patchwork.ozlabs.org/patch/1161979
Changelog: v4->v5:
- Rebased to v5.5-rc2

Link for v3: 
http://lkml.kernel.org/r/20190906135020.19772-1-sri...@linux.vnet.ibm.com
Changelog: v3->v4
 - Added a warning as suggested by Nathan Lynch.

Link for v2: 
http://lkml.kernel.org/r/20190829055023.6171-1-sri...@linux.vnet.ibm.com
Changelog: v2->v3
 - Handled comments from Nathan Lynch.

Link for v1: https://patchwork.ozlabs.org/patch/1151658
Changelog: v1->v2
 - Handled comments from Nathan Lynch.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (5):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Handle extra hcall_vphn error cases
  powerpc/numa: Use cpu node map of first sibling thread
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 --
 arch/powerpc/kernel/smp.c |  5 --
 arch/powerpc/mm/numa.c| 97 +++
 arch/powerpc/platforms/pseries/vphn.c |  3 +-
 4 files changed, 78 insertions(+), 31 deletions(-)

-- 
1.8.3.1



[PATCH v4] powerpc/smp: Use nid as fallback for package_id

2020-01-29 Thread Srikar Dronamraju
Package_id is used to find out all cores that are part of the same chip.
On PowerNV machines, package_id defaults to chip_id. However the
ibm,chip_id property is not present in the device tree of PowerVM Lpars.
Hence lscpu output shows one core per socket and multiple sockets.

To overcome this, use nid as the package_id on PowerVM Lpars.

Before the patch.
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1   <--
Socket(s):   16  <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 -1
 #

After the patch
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8  <--
Core(s) per socket:  8  <--
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 0
 #
Now lscpu output is more in line with the system configuration.

Signed-off-by: Srikar Dronamraju 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Cc: Vasant Hegde 
Cc: Vaidyanathan Srinivasan 
---
Changelog from v3: 
https://lore.kernel.org/linuxppc-dev/20191216142119.5937-1-sri...@linux.vnet.ibm.com/t/#u
- Handled comments from Michael Ellerman.
  - Rename function cpu_to_nid to get_physical_package_id
  - Moved code back to arch/powerpc/kernel/smp.c from pseries/lpar.c
  - Use export_symbol_gpl instead of export_symbol.
  - Rebased to v5.5-rc6

Changelog from v2: https://patchwork.ozlabs.org/patch/1151649
- Rebased to v5.5-rc2
- Renamed the new function to cpu_to_nid
- Removed checks to fadump property. (Looked too excessive)
- Moved pseries specific code to pseries/lpar.c

Changelog from v1: https://patchwork.ozlabs.org/patch/1126145
- In v1 cpu_to_chip_id was overloaded to fallback on nid.  Michael
  Ellerman wasn't comfortable with nid being shown up as chip_id.

 arch/powerpc/include/asm/topology.h |  6 ++
 arch/powerpc/kernel/smp.c   | 30 ++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..e2e1ccd4a18d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -134,7 +134,13 @@ static inline void shared_proc_topology_init(void) {}
 #ifdef CONFIG_PPC64
 #include <asm/smp.h>
 
+#ifdef CONFIG_PPC_SPLPAR
+int get_physical_package_id(int cpu);
+#define topology_physical_package_id(cpu)  (get_physical_package_id(cpu))
+#else
 #define topology_physical_package_id(cpu)  (cpu_to_chip_id(cpu))
+#endif
+
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)  (cpu_to_core_id(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..d75fc8fd8168 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1185,10 +1185,34 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
}
 }
 
+int get_physical_package_id(int cpu)
+{
+   int ppid = cpu_to_chip_id(cpu);
+
+#ifdef CONFIG_PPC_SPLPAR
+   /*
+* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
+* defined. Hence we would return the chip-id as the
+* get_physical_package_id.
+*/
+   if (ppid == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
+   struct device_node *np = of_get_cpu_node(cpu, NULL);
+
+   if (np) {
+   ppid = of_node_to_nid(np);
+   of_node_put(np);
+   }
+   }
+#endif /* CONFIG_PPC_SPLPAR */
+
+   return ppid;
+}
+EXPORT_SYMBOL_GPL(get_physical_package_id);
+
 static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);
-   int chipid = cpu_to_chip_id(cpu);
+   int ppid = get_physical_package_id(cpu);
int i;
 
/*
@@ -1217,11 +1241,11 @@ static void add_cpu_to_masks(int cpu)
for_each_cpu(i, cpu_l2_cache_mask(cpu))
set_cpus_related(cpu, i, cpu_core_mask);
 
-   if (chipid == -1)
+   if (ppid == -1)
return;
 
for_each_cpu(i, cpu_online_mask)
-   if (cpu_to_chip_id(i) == chipid)
+   if (get_physical_package_id(i) == ppid)
set_cpus_related(cpu, i, cpu_core_mask);
 }

Re: [PATCH] powerpc/shared: Fix build problem

2019-12-25 Thread Srikar Dronamraju
* Guenter Roeck  [2019-12-25 08:06:26]:

> Since commit 656c21d6af5d ("powerpc/shared: Use static key to detect
> shared processor") and 14c73bd344da ("powerpc/vcpu: Assume dedicated
> processors as non-preempt"), powerpc test builds may fail with the
> following build errors.
> 
> ./arch/powerpc/include/asm/spinlock.h:39:1: error:
>   type defaults to 'int' in declaration of 'DECLARE_STATIC_KEY_FALSE'
> ./arch/powerpc/include/asm/spinlock.h: In function 'vcpu_is_preempted':
> ./arch/powerpc/include/asm/spinlock.h:44:7: error:
>   implicit declaration of function 'static_branch_unlikely'
> ./arch/powerpc/include/asm/spinlock.h:44:31: error:
>   'shared_processor' undeclared
> 
> The offending commits use static_branch_unlikely and shared_processor
> without adding the include file declaring it.

Thanks for reporting but same fix was already posted
http://lkml.kernel.org/r/20191223133147.129983-1-ja...@zx2c4.com


-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH] powerpc/shared: include correct header for static key

2019-12-24 Thread Srikar Dronamraju
* Jason A. Donenfeld  [2019-12-23 14:31:47]:

> Recently, the spinlock implementation grew a static key optimization,
> but the jump_label.h header include was left out, leading to build
> errors:
> 
> linux/arch/powerpc/include/asm/spinlock.h:44:7: error: implicit declaration 
> of function 'static_branch_unlikely' 
> [-Werror=implicit-function-declaration]
>    44 |  if (!static_branch_unlikely(&shared_processor))
> 
> This commit adds the missing header.

Right, this was in the v4 version of the patchset
(http://lkml.kernel.org/r/20191212085344.17357-1-sri...@linux.vnet.ibm.com)
but we missed it in the final v5 version.

Thanks for noticing and fixing it.

Reviewed-by: Srikar Dronamraju 

> 
> Fixes: 656c21d6af5d ("powerpc/shared: Use static key to detect shared 
> processor")
> Cc: Srikar Dronamraju 
> Signed-off-by: Jason A. Donenfeld 
> ---
>  arch/powerpc/include/asm/spinlock.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index 1b55fc08f853..860228e917dc 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -15,6 +15,7 @@
>   *
>   * (the type definitions are in asm/spinlock_types.h)
>   */
> +#include <linux/jump_label.h>
>  #include <linux/irqflags.h>
>  #ifdef CONFIG_PPC64
>  #include 
> -- 
> 2.24.1
> 

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v5 5/5] powerpc/numa: Remove late request for home node associativity

2019-12-16 Thread Srikar Dronamraju
With the previous commit ("powerpc/numa: Early request for home node
associativity"), commit 2ea626306810 ("powerpc/topology: Get topology
for shared processors at boot"), which was requesting home node
associativity late in boot, becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..9bd396fba7c2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include <asm-generic/topology.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..cdd39a0bc49d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f837a0e725bc..3ba8dc0b7bc5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1628,15 +1628,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
2.18.1



[PATCH v5 4/5] powerpc/numa: Early request for home node associativity

2019-12-16 Thread Srikar Dronamraju
Currently the kernel detects if it is running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However, between the time NUMA setup is initialized and the
request for home node associativity, the workqueue code initializes its
per node cpumask. The per node workqueue possible cpumask may turn
invalid after the home node associativity update, resulting in weird
situations like the workqueue possible cpumask being a subset of the
workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However, at NUMA setup time the kernel may not be in
a position to detect if it is running on a shared lpar platform. So
request home node associativity unconditionally and, if the request
fails, fall back on the device tree property.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v2->v3):
- Handled comments from Nathan Lynch
  * Use first thread of the core for cpu-to-node map.
  * get hardware-id in numa_setup_cpu

Changelog (v1->v2):
- Handled comments from Nathan Lynch
  * Dont depend on pacas to be setup for the hwid


 arch/powerpc/mm/numa.c | 45 +-
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 63ec0c3c817f..f837a0e725bc 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,13 +461,27 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+static int vphn_get_nid(long hwid)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc;
+
+   rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
+   if (rc == H_SUCCESS)
+   return associativity_to_nid(associativity);
+
+   return NUMA_NO_NODE;
+}
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
+ * cpu_to_phys_id is only valid between smp_setup_cpu_maps() and
+ * smp_setup_pacas(). If called outside this window, set get_hwid to true.
  * Return the id of the domain used.
  */
-static int numa_setup_cpu(unsigned long lcpu)
+static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
 {
-   struct device_node *cpu;
+   struct device_node *cpu = NULL;
int fcpu = cpu_first_thread_sibling(lcpu);
int nid = NUMA_NO_NODE;
 
@@ -485,6 +499,27 @@ static int numa_setup_cpu(unsigned long lcpu)
return nid;
}
 
+   /*
+* On a shared lpar, device tree will not have node associativity.
+* At this time lppaca, or its __old_status field may not be
+* updated. Hence the kernel cannot detect if it is on a shared lpar. So
+* request an explicit associativity irrespective of whether the
+* lpar is shared or dedicated. Use the device tree property as a
+* fallback.
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN)) {
+   long hwid;
+
+   if (get_hwid)
+   hwid = get_hard_smp_processor_id(lcpu);
+   else
+   hwid = cpu_to_phys_id[lcpu];
+   nid = vphn_get_nid(hwid);
+   }
+
+   if (nid != NUMA_NO_NODE)
+   goto out_present;
+
cpu = of_get_cpu_node(lcpu, NULL);
 
if (!cpu) {
@@ -496,6 +531,7 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
@@ -515,7 +551,6 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
map_cpu_to_node(lcpu, nid);
-   of_node_put(cpu);
 out:
return nid;
 }
@@ -546,7 +581,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 {
int nid;
 
-   nid = numa_setup_cpu(cpu);
+   nid = numa_setup_cpu(cpu, true);
verify_cpu_node_mapping(cpu, nid);
return 0;
 }
@@ -893,7 +928,7 @@ void __init mem_topology_setup(void)
reset_numa_cpu_lookup_table();
 
for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   numa_setup_cpu(cpu, false);
 }
 
 void __init initmem_init(void)
-- 
2.18.1



[PATCH v5 2/5] powerpc/numa: Handle extra hcall_vphn error cases

2019-12-16 Thread Srikar Dronamraju
Currently the code handles the H_FUNCTION, H_SUCCESS and H_HARDWARE
return codes. However hcall_vphn can return other codes as well. Now the
H_PARAMETER return code is also handled, and all remaining return codes
are handled under the default case.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v1->v2):
 Handled comments from Nathan:
  - Split patch from patch 1.
  - Corrected a problem where I missed calling stop_topology_update().
  - Using pr_err_ratelimited instead of printk.

 arch/powerpc/mm/numa.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..8fbe57c29153 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,23 +1191,30 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   goto out;
+
case H_FUNCTION:
-   printk_once(KERN_INFO
-   "VPHN is not supported. Disabling polling...\n");
-   stop_topology_update();
+   pr_err_ratelimited("VPHN unsupported. Disabling polling...\n");
break;
case H_HARDWARE:
-   printk(KERN_ERR
-   "hcall_vphn() experienced a hardware fault "
+   pr_err_ratelimited("hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
-   stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   pr_err_ratelimited("hcall_vphn() was passed an invalid 
parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   pr_err_ratelimited("hcall_vphn() returned %ld. Disabling 
polling...\n"
+   , rc);
break;
}
 
+   stop_topology_update();
+out:
return rc;
 }
 
-- 
2.18.1



[PATCH v5 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-12-16 Thread Srikar Dronamraju
All the sibling threads of a core have to be part of the same node.
To ensure that all the sibling threads map to the same node, always
lookup/update the cpu-to-node map of the first thread in the core.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog: v3->v4
As suggested by Nathan, add a warning when mapping an offline cpu.

 arch/powerpc/mm/numa.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8fbe57c29153..63ec0c3c817f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -467,15 +467,20 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
  */
 static int numa_setup_cpu(unsigned long lcpu)
 {
-   int nid = NUMA_NO_NODE;
struct device_node *cpu;
+   int fcpu = cpu_first_thread_sibling(lcpu);
+   int nid = NUMA_NO_NODE;
 
/*
 * If a valid cpu-to-node mapping is already available, use it
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
+* Since the cpu_to_node binding remains the same for all threads in
+* the core, if a valid cpu-to-node mapping is already available for
+* the first thread in the core, use it.
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup_table[fcpu];
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -496,6 +501,19 @@ static int numa_setup_cpu(unsigned long lcpu)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   /*
+* Update for the first thread of the core. All threads of a core
+* have to be part of the same node. This not only avoids querying
+* for every other thread in the core, but also avoids a case
+* where virtual node associativity change causes subsequent threads
+* of a core to be associated with different nid. However if first
+* thread is already online, expect it to have a valid mapping.
+*/
+   if (fcpu != lcpu) {
+   WARN_ON(cpu_online(fcpu));
+   map_cpu_to_node(fcpu, nid);
+   }
+
map_cpu_to_node(lcpu, nid);
of_node_put(cpu);
 out:
-- 
2.18.1



[PATCH v5 0/5] Early node associativity

2019-12-16 Thread Srikar Dronamraju
Note: (Only patch 3 changes from v3)

Abdul reported a warning on a shared lpar:
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because the per node workqueue possible mask is set very early in
the boot process, even before the system has queried the home node
associativity. However, the per node workqueue online cpumask gets
updated dynamically. Hence there is a chance that the per node workqueue
online cpumask becomes a superset of the per node workqueue possible mask.

Link for v4: https://patchwork.ozlabs.org/patch/1161979
Changelog: v4->v5:
- Rebased to v5.5-rc2

Link for v3: 
http://lkml.kernel.org/r/20190906135020.19772-1-sri...@linux.vnet.ibm.com
Changelog: v3->v4
 - Added a warning as suggested by Nathan Lynch.

Link for v2: 
http://lkml.kernel.org/r/20190829055023.6171-1-sri...@linux.vnet.ibm.com
Changelog: v2->v3
 - Handled comments from Nathan Lynch.

Link for v1: https://patchwork.ozlabs.org/patch/1151658
Changelog: v1->v2
 - Handled comments from Nathan Lynch.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (5):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Handle extra hcall_vphn error cases
  powerpc/numa: Use cpu node map of first sibling thread
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 --
 arch/powerpc/kernel/smp.c |  5 --
 arch/powerpc/mm/numa.c| 99 ---
 arch/powerpc/platforms/pseries/vphn.c |  3 +-
 4 files changed, 77 insertions(+), 34 deletions(-)

-- 
2.18.1



[PATCH v5 1/5] powerpc/vphn: Check for error from hcall_vphn

2019-12-16 Thread Srikar Dronamraju
There is no value in unpacking associativity if the
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Abdul Haleem 
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
Reviewed-by: Nathan Lynch 
---
Changelog (v1->v2):
- Split the patch into 2 (suggested by Nathan).

 arch/powerpc/platforms/pseries/vphn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vphn.c 
b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6c670e..cca474a2c396 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 
*associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
2.18.1



[PATCH v3] powerpc/smp: Use nid as fallback for package_id

2019-12-16 Thread Srikar Dronamraju
Package_id is used to find out all cores that are part of the same chip.
On PowerNV machines, package_id defaults to chip_id. However the
ibm,chip_id property is not present in the device tree of PowerVM Lpars.
Hence lscpu output shows one core per socket and multiple sockets.

To overcome this, use nid as the package_id on PowerVM Lpars.

Before the patch.
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1   <--
Socket(s):   16  <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 -1
 #

After the patch
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8  <--
Core(s) per socket:  8  <--
Socket(s):   2
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 0
 #
Now lscpu output is more in line with the system configuration.

Signed-off-by: Srikar Dronamraju 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Michael Ellerman 
Cc: Vasant Hegde 
Cc: Vaidyanathan Srinivasan 
---
Changelog from v2: https://patchwork.ozlabs.org/patch/1151649
- Rebased to v5.5-rc2
- Renamed the new function to cpu_to_nid
- Removed checks to fadump property. (Looked too excessive)
- Moved pseries specific code to pseries/lpar.c

Changelog from v1: https://patchwork.ozlabs.org/patch/1126145
- In v1 cpu_to_chip_id was overloaded to fallback on nid.  Michael
  Ellerman wasn't comfortable with nid being shown up as chip_id.

 arch/powerpc/include/asm/topology.h   |  7 ++-
 arch/powerpc/kernel/smp.c |  6 +++---
 arch/powerpc/platforms/pseries/lpar.c | 22 ++
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..7422ef913c75 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -130,11 +130,16 @@ static inline void shared_proc_topology_init(void) {}
 
 #ifdef CONFIG_SMP
 #include <asm/cputable.h>
+#ifdef CONFIG_PPC_SPLPAR
+int cpu_to_nid(int);
+#else
+#define cpu_to_nid(cpu)cpu_to_chip_id(cpu)
+#endif
 
 #ifdef CONFIG_PPC64
 #include <asm/smp.h>
 
-#define topology_physical_package_id(cpu)  (cpu_to_chip_id(cpu))
+#define topology_physical_package_id(cpu)  (cpu_to_nid(cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)  (cpu_to_core_id(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..b0c1438d8d9a 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1188,7 +1188,7 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
 static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);
-   int chipid = cpu_to_chip_id(cpu);
+   int nid = cpu_to_nid(cpu);
int i;
 
/*
@@ -1217,11 +1217,11 @@ static void add_cpu_to_masks(int cpu)
for_each_cpu(i, cpu_l2_cache_mask(cpu))
set_cpus_related(cpu, i, cpu_core_mask);
 
-   if (chipid == -1)
+   if (nid == -1)
return;
 
for_each_cpu(i, cpu_online_mask)
-   if (cpu_to_chip_id(i) == chipid)
+   if (cpu_to_nid(i) == nid)
set_cpus_related(cpu, i, cpu_core_mask);
 }
 
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 60cb29ae4739..99583b9f00e4 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -650,6 +650,28 @@ static int __init vcpudispatch_stats_procfs_init(void)
 }
 
 machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
+
+int cpu_to_nid(int cpu)
+{
+   int nid = cpu_to_chip_id(cpu);
+
+   /*
+* If the platform is PowerNV or Guest on KVM, ibm,chip-id is
+* defined. Hence we would return the chip-id as the
+* cpu_to_nid.
+*/
+   if (nid == -1 && firmware_has_feature(FW_FEATURE_LPAR)) {
+   struct device_node *np = of_get_cpu_node(cpu, NULL);
+
+   if (np) {
+   nid = of_node_to_nid(np);
+   of_node_put(np);
+   }
+   }
+
+   return nid;
+}

Re: [PATCH v5 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-12 Thread Srikar Dronamraju
* Michael Ellerman  [2019-12-13 14:50:35]:

> Waiman Long suggested using static_keys.
> 
> Fixes: 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted vCPUs")
> Cc: sta...@vger.kernel.org # v4.18+
> Reported-by: Parth Shah 
> Reported-by: Ihor Pasichnyk 
> Tested-by: Juri Lelli 
> Acked-by: Waiman Long 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> Acked-by: Phil Auld 
> Reviewed-by: Vaidyanathan Srinivasan 
> Tested-by: Parth Shah 
> [mpe: Move the key and setting of the key to pseries/setup.c]
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/spinlock.h| 4 +++-
>  arch/powerpc/platforms/pseries/setup.c | 7 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 

Tested with your version of the patch and it's working as expected.

Latency percentiles (usec)
50.0th: 45
75.0th: 63
90.0th: 74
95.0th: 78
*99.0th: 82
99.5th: 83
99.9th: 86
min=0, max=96
Latency percentiles (usec)
50.0th: 46
75.0th: 64
90.0th: 75
95.0th: 79
*99.0th: 83
99.5th: 85
99.9th: 91
min=0, max=117
Latency percentiles (usec)
50.0th: 46
75.0th: 64
90.0th: 75
95.0th: 79
*99.0th: 83
99.5th: 84
99.9th: 90
min=0, max=95
Latency percentiles (usec)
50.0th: 47
75.0th: 65
90.0th: 75
95.0th: 79
*99.0th: 84
99.5th: 85
99.9th: 90
min=0, max=117
Latency percentiles (usec)
50.0th: 45
75.0th: 64
90.0th: 75
95.0th: 79
*99.0th: 82
99.5th: 83
    99.9th: 93
min=0, max=111


-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v4 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-12 Thread Srikar Dronamraju
With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at
wakeup. This leads to a wrong choice of CPU, which in turn leads to larger
wakeup latencies. Eventually, it leads to performance regressions in
latency-sensitive benchmarks like soltp, schbench etc.

On Powerpc, vcpu_is_preempted only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However
yield_count is incremented whenever the LPAR enters CEDE state. So any
CPU that has entered CEDE state is assumed to be preempted.

Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
right of first use, since the LPAR is supposed to own the vCPU.
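
For reference, the yield_count test described above is the existing check
in vcpu_is_preempted():

	/* an odd yield_count means the hypervisor has the vCPU ceded/preempted */
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);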

On a Power9 System with 32 cores
 # lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   16
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
v5.4 v5.4 + patch
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 39
75.0000th: 62   75.0000th: 53
90.0000th: 71   90.0000th: 67
95.0000th: 77   95.0000th: 76
*99.0000th: 91  *99.0000th: 89
99.5000th: 707  99.5000th: 93
99.9000th: 6920 99.9000th: 118
min=0, max=10048    min=0, max=211
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 34
75.0000th: 61   75.0000th: 45
90.0000th: 72   90.0000th: 53
95.0000th: 79   95.0000th: 56
*99.0000th: 691 *99.0000th: 61
99.5000th: 3972 99.5000th: 63
99.9000th: 8368 99.9000th: 78
min=0, max=16606    min=0, max=228
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 34
75.0000th: 61   75.0000th: 45
90.0000th: 71   90.0000th: 53
95.0000th: 77   95.0000th: 57
*99.0000th: 106 *99.0000th: 63
99.5000th: 2364 99.5000th: 68
99.9000th: 7480 99.9000th: 100
min=0, max=10001    min=0, max=134
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 34
75.0000th: 62   75.0000th: 46
90.0000th: 72   90.0000th: 53
95.0000th: 78   95.0000th: 56
*99.0000th: 93  *99.0000th: 61
99.5000th: 108  99.5000th: 64
99.9000th: 6792 99.9000th: 85
min=0, max=17681    min=0, max=121
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 46   50.0000th: 33
75.0000th: 62   75.0000th: 44
90.0000th: 73   90.0000th: 51
95.0000th: 79   95.0000th: 54
*99.0000th: 113 *99.0000th: 61
99.5000th: 2724 99.5000th: 64
99.9000th: 6184 99.9000th: 82
min=0, max=9887     min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

context-switches        43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
cpu-migrations           1,211  ( +-  5.04% )      220 ( +-  6.23% )
page-faults             15,983  ( +-  5.21% )   15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.

Fixes: 41946c86876e ("locking/core, powerpc: Implement vcpu_is_preempted(cpu)")

Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Cc: Gautham R. Shenoy 
Cc: Vaidyanathan Srinivasan 
Reported-by: Parth Shah 
Reported-by: Ihor Pasichnyk 
Tested-by: Juri Lelli 
Tested-by: Parth Shah 
Acked-by: Waiman Long 
Acked-by: Phil Auld 
Reviewed-by: Gautham R. Shenoy 
Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Srikar Dronamraju 
---

[PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-05 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.
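
A condensed sketch of the static-key pattern the two patches use
(simplified; the init-function name is illustrative, the real series
flips the key in shared_proc_topology_init(), and the lppaca helpers
come from the pseries headers):

#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(shared_processor);	/* off by default */

static void __init example_mark_shared_processor(void)
{
	if (lppaca_shared_proc(get_lppaca()))	/* shared-processor LPAR? */
		static_branch_enable(&shared_processor);
}

static inline bool example_is_shared_processor(void)
{
	/* a runtime-patched branch: no lppaca load on the fast path */
	return static_branch_unlikely(&shared_processor);
}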

Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
Now that we no longer refer to the lppaca, remove the comment.

Changelog v2->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index de817c25deff..e83d57f27566 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#ifdef CONFIG_PPC_SPLPAR
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



[PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-05 Thread Srikar Dronamraju
With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at
wakeup. This leads to a wrong choice of CPU, which in turn leads to larger
wakeup latencies. Eventually, it leads to performance regressions in
latency-sensitive benchmarks like soltp, schbench etc.

On Powerpc, vcpu_is_preempted only looks at yield_count. If the
yield_count is odd, the vCPU is assumed to be preempted. However
yield_count is incremented whenever the LPAR enters CEDE state. So any
CPU that has entered CEDE state is assumed to be preempted.

Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
right of first use, since the LPAR is supposed to own the vCPU.

On a Power9 System with 32 cores
 # lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   16
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
v5.4 v5.4 + patch
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 39
75.0000th: 62   75.0000th: 53
90.0000th: 71   90.0000th: 67
95.0000th: 77   95.0000th: 76
*99.0000th: 91  *99.0000th: 89
99.5000th: 707  99.5000th: 93
99.9000th: 6920 99.9000th: 118
min=0, max=10048    min=0, max=211
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 34
75.0000th: 61   75.0000th: 45
90.0000th: 72   90.0000th: 53
95.0000th: 79   95.0000th: 56
*99.0000th: 691 *99.0000th: 61
99.5000th: 3972 99.5000th: 63
99.9000th: 8368 99.9000th: 78
min=0, max=16606    min=0, max=228
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 34
75.0000th: 61   75.0000th: 45
90.0000th: 71   90.0000th: 53
95.0000th: 77   95.0000th: 57
*99.0000th: 106 *99.0000th: 63
99.5000th: 2364 99.5000th: 68
99.9000th: 7480 99.9000th: 100
min=0, max=10001    min=0, max=134
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 45   50.0000th: 34
75.0000th: 62   75.0000th: 46
90.0000th: 72   90.0000th: 53
95.0000th: 78   95.0000th: 56
*99.0000th: 93  *99.0000th: 61
99.5000th: 108  99.5000th: 64
99.9000th: 6792 99.9000th: 85
min=0, max=17681    min=0, max=121
Latency percentiles (usec)   Latency percentiles (usec)
50.0000th: 46   50.0000th: 33
75.0000th: 62   75.0000th: 44
90.0000th: 73   90.0000th: 51
95.0000th: 79   95.0000th: 54
*99.0000th: 113 *99.0000th: 61
99.5000th: 2724 99.5000th: 64
99.9000th: 6184 99.9000th: 82
min=0, max=9887     min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

context-switches        43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
cpu-migrations           1,211  ( +-  5.04% )      220 ( +-  6.23% )
page-faults             15,983  ( +-  5.21% )   15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.

Reported-by: Parth Shah 
Reported-by: Ihor Pasichnyk 
Cc: Parth Shah 
Cc: Ihor Pasichnyk 
Cc: Juri Lelli 
Cc: Phil Auld 
Cc: Waiman Long 
Cc: Gautham R. Shenoy 
Tested-by: Juri Lelli 
Ack-by: Waiman Long 
Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Srikar Dronamraju 
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) ->v3:
Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

 arch/powerpc/include/asm/spinlock.h | 

[PATCH v2 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-04 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.

Signed-off-by: Srikar Dronamraju 
---
Changelog v1->v2:
Now that we no longer refer to the lppaca, remove the comment.

 arch/powerpc/include/asm/spinlock.h | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 866f6ca0427a..251fe6e47471 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) {};
 
 static inline bool is_shared_processor(void)
 {
-/*
- * LPPACA is only available on Pseries so guard anything LPPACA related to
- * allow other platforms (which include this common header) to compile.
- */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



Re: [PATCH 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-04 Thread Srikar Dronamraju
* Srikar Dronamraju  [2019-12-04 19:14:58]:

> 
> 
>   # perf stat -a -r 5 ./schbench
> v5.4  v5.4 + patch
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.0000th: 47   50.0000th: 33
>   74.0000th: 64   75.0000th: 44
>   89.0000th: 76   90.0000th: 50
>   94.0000th: 83   95.0000th: 53
>   *98.0000th: 103 *99.0000th: 57
>   98.5000th: 2124 99.5000th: 59
>   98.9000th: 7976 99.9000th: 83
>   min=-1, max=10519   min=0, max=117
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.0000th: 45   50.0000th: 34
>   74.0000th: 61   75.0000th: 45
>   89.0000th: 70   90.0000th: 52
>   94.0000th: 77   95.0000th: 56
>   *98.0000th: 504 *99.0000th: 62
>   98.5000th: 4012 99.5000th: 64
>   98.9000th: 8168 99.9000th: 79
>   min=-1, max=14500   min=0, max=123
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.0000th: 48   50.0000th: 35
>   74.0000th: 65   75.0000th: 47
>   89.0000th: 76   90.0000th: 55
>   94.0000th: 82   95.0000th: 59
>   *98.0000th: 1098    *99.0000th: 67
>   98.5000th: 3988 99.5000th: 71
>   98.9000th: 9360 99.9000th: 98
>   min=-1, max=19283   min=0, max=137
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.0000th: 46   50.0000th: 35
>   74.0000th: 63   75.0000th: 46
>   89.0000th: 73   90.0000th: 53
>   94.0000th: 78   95.0000th: 57
>   *98.0000th: 113 *99.0000th: 63
>   98.5000th: 2316 99.5000th: 65
>   98.9000th: 7704 99.9000th: 83
>   min=-1, max=17976   min=0, max=139
> Latency percentiles (usec)  Latency percentiles (usec)
>   49.0000th: 46   50.0000th: 34
>   74.0000th: 62   75.0000th: 46
>   89.0000th: 73   90.0000th: 53
>   94.0000th: 79   95.0000th: 57
>   *98.0000th: 97  *99.0000th: 64
>   98.5000th: 1398 99.5000th: 70
>   98.9000th: 8136 99.9000th: 100
>   min=-1, max=10008   min=0, max=142
> 

Just to be clear: since these are latency values, a lower number is better.

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-04 Thread Srikar Dronamraju
With the static key shared processor available, is_shared_processor()
can return without having to query the lppaca structure.

Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/spinlock.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 866f6ca0427a..699eb306a835 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -115,9 +115,8 @@ static inline bool is_shared_processor(void)
  * LPPACA is only available on Pseries so guard anything LPPACA related to
  * allow other platforms (which include this common header) to compile.
  */
-#ifdef CONFIG_PPC_PSERIES
-   return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
-   lppaca_shared_proc(local_paca->lppaca_ptr));
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+   return static_branch_unlikely(&shared_processor);
 #else
return false;
 #endif
-- 
2.18.1



[PATCH 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-04 Thread Srikar Dronamraju
aiman Long 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/include/asm/spinlock.h | 5 +++--
 arch/powerpc/mm/numa.c  | 4 
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index e9a960e28f3c..866f6ca0427a 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -35,11 +35,12 @@
 #define LOCK_TOKEN 1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+DECLARE_STATIC_KEY_FALSE(shared_processor);
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
-   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   if (!static_branch_unlikely(&shared_processor))
return false;
return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..ffb971f3a63c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1568,9 +1568,13 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
+DEFINE_STATIC_KEY_FALSE(shared_processor);
+EXPORT_SYMBOL_GPL(shared_processor);
+
 void __init shared_proc_topology_init(void)
 {
if (lppaca_shared_proc(get_lppaca())) {
+   static_branch_enable(&shared_processor);
bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
nr_cpumask_bits);
numa_update_cpu_topology(false);
-- 
2.18.1



Re: [PATCH v2] powerpc/smp: Use nid as fallback for package_id

2019-11-15 Thread Srikar Dronamraju
Hey Michael,

* Srikar Dronamraju  [2019-08-22 20:08:53]:

> Package_id is to find out all cores that are part of the same chip. On
> PowerNV machines, package_id defaults to chip_id. However ibm,chip_id
> property is not present in device-tree of PowerVM Lpars. Hence lscpu
> output shows one core per socket and multiple sockets.
> 

Can you let me know if you have any comments on this patch?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v4 0/5] Early node associativity

2019-11-15 Thread Srikar Dronamraju
Hi Michael,

> Nathan Lynch  writes:
> > Srikar Dronamraju  writes:
> >> Abdul reported  a warning on a shared lpar.
> >> "WARNING: workqueue cpumask: online intersect > possible intersect".
> >> This is because per node workqueue possible mask is set very early in the
> >> boot process even before the system was querying the home node
> >> associativity. However per node workqueue online cpumask gets updated
> >> dynamically. Hence there is a chance when per node workqueue online cpumask
> >> is a superset of per node workqueue possible mask.
> >
> > Sorry for the delay in following up on these. The series looks good to
> > me, and I've given it a little testing with LPM and DLPAR. I've also
> > verified that the cpu assignments occur early as intended on an LPAR
> > where that workqueue warning had been triggered.
> 
> If this is applied I think we can remove about 500 loc from numa.c. Once
> splpar cpu-node assignments are done in the same sequence as with
> dedicated processor mode, numa_update_cpu_topology() and related code
> becomes unneeded.

Since it's been 2 months since version 5 was posted and more than a month
since Nathan added his Reviewed-by, can you please let us know if there is
any reason why this patchset has not been applied yet? Or do let me know if
you have any further comments.

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v4 5/5] powerpc/numa: Remove late request for home node associativity

2019-09-13 Thread Srikar Dronamraju
With commit ("powerpc/numa: Early request for home node associativity"),
commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), which was requesting home node associativity,
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea..9bd396f 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf..cdd39a0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f837a0e..3ba8dc0 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1628,15 +1628,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
1.8.3.1



[PATCH v4 4/5] powerpc/numa: Early request for home node associativity

2019-09-13 Thread Srikar Dronamraju
Currently the kernel detects if it's running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However between the time NUMA setup is initialized and the
request for home node associativity, workqueue initializes its per node
cpumask. The per node workqueue possible cpumask may turn invalid
after home node associativity resulting in weird situations like
workqueue possible cpumask being a subset of workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However at NUMA setup time, the kernel may not be in
a position to detect if it's running on a shared lpar platform. So
request home node associativity and, if the request fails, fall back
on the device tree property.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
- Handled comments from Nathan Lynch
  * Dont depend on pacas to be setup for the hwid

Changelog (v2->v3):
- Handled comments from Nathan Lynch
  * Use first thread of the core for cpu-to-node map.
  * get hardware-id in numa_setup_cpu

 arch/powerpc/mm/numa.c | 45 -
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 63ec0c3..f837a0e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,13 +461,27 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+static int vphn_get_nid(long hwid)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc;
+
+   rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
+   if (rc == H_SUCCESS)
+   return associativity_to_nid(associativity);
+
+   return NUMA_NO_NODE;
+}
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
+ * cpu_to_phys_id is only valid between smp_setup_cpu_maps() and
+ * smp_setup_pacas(). If called outside this window, set get_hwid to true.
  * Return the id of the domain used.
  */
-static int numa_setup_cpu(unsigned long lcpu)
+static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
 {
-   struct device_node *cpu;
+   struct device_node *cpu = NULL;
int fcpu = cpu_first_thread_sibling(lcpu);
int nid = NUMA_NO_NODE;
 
@@ -485,6 +499,27 @@ static int numa_setup_cpu(unsigned long lcpu)
return nid;
}
 
+   /*
+* On a shared lpar, device tree will not have node associativity.
+* At this time lppaca, or its __old_status field may not be
+* updated. Hence kernel cannot detect if its on a shared lpar. So
+* request an explicit associativity irrespective of whether the
+* lpar is shared or dedicated. Use the device tree property as a
+* fallback.
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN)) {
+   long hwid;
+
+   if (get_hwid)
+   hwid = get_hard_smp_processor_id(lcpu);
+   else
+   hwid = cpu_to_phys_id[lcpu];
+   nid = vphn_get_nid(hwid);
+   }
+
+   if (nid != NUMA_NO_NODE)
+   goto out_present;
+
cpu = of_get_cpu_node(lcpu, NULL);
 
if (!cpu) {
@@ -496,6 +531,7 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
@@ -515,7 +551,6 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
map_cpu_to_node(lcpu, nid);
-   of_node_put(cpu);
 out:
return nid;
 }
@@ -546,7 +581,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 {
int nid;
 
-   nid = numa_setup_cpu(cpu);
+   nid = numa_setup_cpu(cpu, true);
verify_cpu_node_mapping(cpu, nid);
return 0;
 }
@@ -893,7 +928,7 @@ void __init mem_topology_setup(void)
reset_numa_cpu_lookup_table();
 
for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   numa_setup_cpu(cpu, false);
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1



[PATCH v4 1/5] powerpc/vphn: Check for error from hcall_vphn

2019-09-13 Thread Srikar Dronamraju
There is no value in unpacking associativity if the
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
- Split the patch into 2(Suggested by Nathan).

 arch/powerpc/platforms/pseries/vphn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vphn.c b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6..cca474a 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
1.8.3.1



[PATCH v4 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-09-13 Thread Srikar Dronamraju
All the sibling threads of a core have to be part of the same node.
To ensure that all the sibling threads map to the same node, always
lookup/update the cpu-to-node map of the first thread in the core.

Signed-off-by: Srikar Dronamraju 
---
Changelog: v3->v4
As suggested by Nathan, add a warning while mapping offline cpu.

 arch/powerpc/mm/numa.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8fbe57c..63ec0c3 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -467,15 +467,20 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
  */
 static int numa_setup_cpu(unsigned long lcpu)
 {
-   int nid = NUMA_NO_NODE;
struct device_node *cpu;
+   int fcpu = cpu_first_thread_sibling(lcpu);
+   int nid = NUMA_NO_NODE;
 
/*
 * If a valid cpu-to-node mapping is already available, use it
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
+* Since cpu_to_node binding remains the same for all threads in the
+* core. If a valid cpu-to-node mapping is already available, for
+* the first thread in the core, use it.
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup_table[fcpu];
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -496,6 +501,19 @@ static int numa_setup_cpu(unsigned long lcpu)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   /*
+* Update for the first thread of the core. All threads of a core
+* have to be part of the same node. This not only avoids querying
+* for every other thread in the core, but always avoids a case
+* where virtual node associativity change causes subsequent threads
+* of a core to be associated with different nid. However if first
+* thread is already online, expect it to have a valid mapping.
+*/
+   if (fcpu != lcpu) {
+   WARN_ON(cpu_online(fcpu));
+   map_cpu_to_node(fcpu, nid);
+   }
+
map_cpu_to_node(lcpu, nid);
of_node_put(cpu);
 out:
-- 
1.8.3.1



[PATCH v4 2/5] powerpc/numa: Handle extra hcall_vphn error cases

2019-09-13 Thread Srikar Dronamraju
Currently the code handles the H_FUNCTION, H_SUCCESS and H_HARDWARE return
codes. However hcall_vphn can return other return codes. Now it also
handles the H_PARAMETER return code, and the remaining return codes are
handled under the default case.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
 Handled comments from Nathan:
  - Split patch from patch 1.
  - Corrected a problem where I missed calling stop_topology_update().
  - Using pr_err_ratelimited instead of printk.

 arch/powerpc/mm/numa.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d2..8fbe57c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,23 +1191,30 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   goto out;
+
case H_FUNCTION:
-   printk_once(KERN_INFO
-   "VPHN is not supported. Disabling polling...\n");
-   stop_topology_update();
+   pr_err_ratelimited("VPHN unsupported. Disabling polling...\n");
break;
case H_HARDWARE:
-   printk(KERN_ERR
-   "hcall_vphn() experienced a hardware fault "
+   pr_err_ratelimited("hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
-   stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   pr_err_ratelimited("hcall_vphn() was passed an invalid 
parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   pr_err_ratelimited("hcall_vphn() returned %ld. Disabling 
polling...\n"
+   , rc);
break;
}
 
+   stop_topology_update();
+out:
return rc;
 }
 
-- 
1.8.3.1



[PATCH v4 0/5] Early node associativity

2019-09-13 Thread Srikar Dronamraju
Note: (Only patch 3 changes from v3)

Abdul reported a warning on a shared lpar.
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because per node workqueue possible mask is set very early in the
boot process even before the system was querying the home node
associativity. However per node workqueue online cpumask gets updated
dynamically. Hence there is a chance when per node workqueue online cpumask
is a superset of per node workqueue possible mask.
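
Illustrative only (not the kernel's actual warning site): the reported
warning amounts to a node's online cpumask no longer being a subset of its
possible cpumask, i.e. roughly:

	/* node_online_mask/node_possible_mask are hypothetical per-node masks */
	if (!cpumask_subset(node_online_mask, node_possible_mask))
		pr_warn("workqueue cpumask: online intersect > possible intersect\n");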

Link for v1: https://patchwork.ozlabs.org/patch/1151658
Changelog: v1->v2
 - Handled comments from Nathan Lynch.

Link for v2: 
http://lkml.kernel.org/r/20190829055023.6171-1-sri...@linux.vnet.ibm.com
Changelog: v2->v3
 - Handled comments from Nathan Lynch.

Link for v3: 
http://lkml.kernel.org/r/20190906135020.19772-1-sri...@linux.vnet.ibm.com
Changelog: v3->v4
 - Added a warning as suggested by Nathan Lynch.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (5):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Handle extra hcall_vphn error cases
  powerpc/numa: Use cpu node map of first sibling thread
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 --
 arch/powerpc/kernel/smp.c |  5 --
 arch/powerpc/mm/numa.c| 99 ++-
 arch/powerpc/platforms/pseries/vphn.c |  3 +-
 4 files changed, 77 insertions(+), 34 deletions(-)

-- 
1.8.3.1



Re: [PATCH v3 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-09-12 Thread Srikar Dronamraju
* Nathan Lynch  [2019-09-12 13:15:03]:

> Srikar Dronamraju  writes:
> 
> >> 
> >> I think just WARN_ON(cpu_online(fcpu)) would be satisfactory. In my
> >> experience, the downstream effects of violating this condition are
> >> varied and quite difficult to debug. Seems only appropriate to emit a
> >> warning and stack trace before the OS inevitably becomes unstable.
> >
> > I still have to try but wouldn't this be a problem for the boot-cpu?
> > I mean boot-cpu would be marked online while it tries to do numa_setup_cpu.
> > No?
> 
> This is what I mean:
> 
>  +  if (fcpu != lcpu) {
>  +  WARN_ON(cpu_online(fcpu));
>  +  map_cpu_to_node(fcpu, nid);
>  +  }
> 

Yes this should work. Will send an updated patch with this change.

> I.e. if we're modifying the mapping for a remote cpu, warn if it's
> online.
> 
> I don't think this would warn on the boot cpu -- I would expect fcpu and
> lcpu to be the same and this branch would not be taken.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v3 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-09-12 Thread Srikar Dronamraju
> 
> I think just WARN_ON(cpu_online(fcpu)) would be satisfactory. In my
> experience, the downstream effects of violating this condition are
> varied and quite difficult to debug. Seems only appropriate to emit a
> warning and stack trace before the OS inevitably becomes unstable.

I still have to try but wouldn't this be a problem for the boot-cpu?
I mean boot-cpu would be marked online while it tries to do numa_setup_cpu.
No?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v3 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-09-11 Thread Srikar Dronamraju
Hi Nathan, 

Thanks for your reviews.

> > -   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
> > +   nid = numa_cpu_lookup_table[fcpu];
> > +   if (nid >= 0) {
> > map_cpu_to_node(lcpu, nid);
> > return nid;
> > }
> 
> Yes, we need to something like this to prevent a VPHN change that occurs
> concurrently with onlining a core's threads from messing us up.
> 
> Is it a good assumption that the first thread of a sibling group will
> have its mapping initialized first? I think the answer is yes for boot,
> but hotplug... not so sure.
> 
> 
> > @@ -496,6 +501,16 @@ static int numa_setup_cpu(unsigned long lcpu)
> > if (nid < 0 || !node_possible(nid))
> > nid = first_online_node;
> >  
> > +   /*
> > +* Update for the first thread of the core. All threads of a core
> > +* have to be part of the same node. This not only avoids querying
> > +* for every other thread in the core, but always avoids a case
> > +* where virtual node associativity change causes subsequent threads
> > +* of a core to be associated with different nid.
> > +*/
> > +   if (fcpu != lcpu)
> > +   map_cpu_to_node(fcpu, nid);
> > +
> 
> OK, I see that this somewhat addresses my concern above. But changing
> this mapping for a remote cpu is unsafe except under specific
> circumstances. I think this should first assert:
> 
> * numa_cpu_lookup_table[fcpu] == NUMA_NO_NODE
> * cpu_online(fcpu) == false
> 
> to document and enforce the conditions that must hold for this to be OK.

I do understand that we shouldn't be modifying the nid for a different cpu.

We just checked above that the mapping for the first cpu doesn't exist.
If the first cpu (or remote cpu as you coin it) was online, then its
mapping should have existed and we return even before we come here.

nid = numa_cpu_lookup_table[fcpu];
if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}

Currently numa_setup_cpu is only called at very early boot and in cpu
hotplug. At hotplug time, the onlining of cpus is serialized. Right? Do we
see a chance of the remote cpu changing its state as we set its nid here?

Also, let's say we assert and for some unknown reason the assertion fails.
How do we handle the failure case? We can't get out without setting the
nid, and we can't continue setting the nid either. Should we panic the
system given that the check a few lines above is now turning out to be
false? Probably not, as I think we can live with it.

Any thoughts?
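
For reference, the guard that eventually lands in v4 of this series (see
patch v4 3/5 earlier in this archive) warns but still performs the mapping,
so the failure case is tolerated rather than treated as fatal:

	if (fcpu != lcpu) {
		/* The remote first thread must still be offline at this point. */
		WARN_ON(cpu_online(fcpu));
		map_cpu_to_node(fcpu, nid);
	}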

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v3 5/5] powerpc/numa: Remove late request for home node associativity

2019-09-06 Thread Srikar Dronamraju
With commit ("powerpc/numa: Early request for home node associativity"),
commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), which was requesting home node associativity,
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea..9bd396f 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf..cdd39a0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5bb11ef..22a7bf5 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1625,15 +1625,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
1.8.3.1



[PATCH v3 3/5] powerpc/numa: Use cpu node map of first sibling thread

2019-09-06 Thread Srikar Dronamraju
All the sibling threads of a core have to be part of the same node.
To ensure that all the sibling threads map to the same node, always
lookup/update the cpu-to-node map of the first thread in the core.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/mm/numa.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8fbe57c..d0af9a2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -467,15 +467,20 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
  */
 static int numa_setup_cpu(unsigned long lcpu)
 {
-   int nid = NUMA_NO_NODE;
struct device_node *cpu;
+   int fcpu = cpu_first_thread_sibling(lcpu);
+   int nid = NUMA_NO_NODE;
 
/*
 * If a valid cpu-to-node mapping is already available, use it
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
+* Since cpu_to_node binding remains the same for all threads in the
+* core. If a valid cpu-to-node mapping is already available, for
+* the first thread in the core, use it.
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup_table[fcpu];
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -496,6 +501,16 @@ static int numa_setup_cpu(unsigned long lcpu)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   /*
+* Update for the first thread of the core. All threads of a core
+* have to be part of the same node. This not only avoids querying
+* for every other thread in the core, but always avoids a case
+* where virtual node associativity change causes subsequent threads
+* of a core to be associated with different nid.
+*/
+   if (fcpu != lcpu)
+   map_cpu_to_node(fcpu, nid);
+
map_cpu_to_node(lcpu, nid);
of_node_put(cpu);
 out:
-- 
1.8.3.1



[PATCH v3 4/5] powerpc/numa: Early request for home node associativity

2019-09-06 Thread Srikar Dronamraju
Currently the kernel detects if it's running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However between the time NUMA setup is initialized and the
request for home node associativity, workqueue initializes its per node
cpumask. The per node workqueue possible cpumask may turn invalid
after home node associativity resulting in weird situations like
workqueue possible cpumask being a subset of workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However at NUMA setup time, the kernel may not be in
a position to detect if it's running on a shared lpar platform. So
request home node associativity and, if the request fails, fall back
on the device tree property.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
- Handled comments from Nathan Lynch
  * Dont depend on pacas to be setup for the hwid

Changelog (v2->v3):
- Handled comments from Nathan Lynch
  * Use first thread of the core for cpu-to-node map.
  * get hardware-id in numa_setup_cpu

 arch/powerpc/mm/numa.c | 45 -
 1 file changed, 40 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d0af9a2..5bb11ef 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,13 +461,27 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+static int vphn_get_nid(long hwid)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc;
+
+   rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
+   if (rc == H_SUCCESS)
+   return associativity_to_nid(associativity);
+
+   return NUMA_NO_NODE;
+}
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
+ * cpu_to_phys_id is only valid between smp_setup_cpu_maps() and
+ * smp_setup_pacas(). If called outside this window, set get_hwid to true.
  * Return the id of the domain used.
  */
-static int numa_setup_cpu(unsigned long lcpu)
+static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
 {
-   struct device_node *cpu;
+   struct device_node *cpu = NULL;
int fcpu = cpu_first_thread_sibling(lcpu);
int nid = NUMA_NO_NODE;
 
@@ -485,6 +499,27 @@ static int numa_setup_cpu(unsigned long lcpu)
return nid;
}
 
+   /*
+* On a shared lpar, device tree will not have node associativity.
+* At this time lppaca, or its __old_status field may not be
+* updated. Hence kernel cannot detect if its on a shared lpar. So
+* request an explicit associativity irrespective of whether the
+* lpar is shared or dedicated. Use the device tree property as a
+* fallback.
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN)) {
+   long hwid;
+
+   if (get_hwid)
+   hwid = get_hard_smp_processor_id(lcpu);
+   else
+   hwid = cpu_to_phys_id[lcpu];
+   nid = vphn_get_nid(hwid);
+   }
+
+   if (nid != NUMA_NO_NODE)
+   goto out_present;
+
cpu = of_get_cpu_node(lcpu, NULL);
 
if (!cpu) {
@@ -496,6 +531,7 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
@@ -512,7 +548,6 @@ static int numa_setup_cpu(unsigned long lcpu)
map_cpu_to_node(fcpu, nid);
 
map_cpu_to_node(lcpu, nid);
-   of_node_put(cpu);
 out:
return nid;
 }
@@ -543,7 +578,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 {
int nid;
 
-   nid = numa_setup_cpu(cpu);
+   nid = numa_setup_cpu(cpu, true);
verify_cpu_node_mapping(cpu, nid);
return 0;
 }
@@ -890,7 +925,7 @@ void __init mem_topology_setup(void)
reset_numa_cpu_lookup_table();
 
for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   numa_setup_cpu(cpu, false);
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1



[PATCH v3 2/5] powerpc/numa: Handle extra hcall_vphn error cases

2019-09-06 Thread Srikar Dronamraju
Currently the code handles the H_FUNCTION, H_SUCCESS and H_HARDWARE return
codes. However hcall_vphn can return other return codes. Now it also
handles the H_PARAMETER return code, and the remaining return codes are
handled under the default case.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
 Handled comments from Nathan:
  - Split patch from patch 1.
  - Corrected a problem where I missed calling stop_topology_update().
  - Using pr_err_ratelimited instead of printk.

 arch/powerpc/mm/numa.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d2..8fbe57c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,23 +1191,30 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   goto out;
+
case H_FUNCTION:
-   printk_once(KERN_INFO
-   "VPHN is not supported. Disabling polling...\n");
-   stop_topology_update();
+   pr_err_ratelimited("VPHN unsupported. Disabling polling...\n");
break;
case H_HARDWARE:
-   printk(KERN_ERR
-   "hcall_vphn() experienced a hardware fault "
+   pr_err_ratelimited("hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
-   stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   pr_err_ratelimited("hcall_vphn() was passed an invalid 
parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   pr_err_ratelimited("hcall_vphn() returned %ld. Disabling 
polling...\n"
+   , rc);
break;
}
 
+   stop_topology_update();
+out:
return rc;
 }
 
-- 
1.8.3.1



[PATCH v3 1/5] powerpc/vphn: Check for error from hcall_vphn

2019-09-06 Thread Srikar Dronamraju
There is no value in unpacking associativity if the
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
- Split the patch into 2(Suggested by Nathan).

 arch/powerpc/platforms/pseries/vphn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vphn.c b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6..cca474a 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
1.8.3.1



[PATCH v3 0/5] Early node associativity

2019-09-06 Thread Srikar Dronamraju
Abdul reported a warning on a shared lpar.
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because per node workqueue possible mask is set very early in the
boot process even before the system was querying the home node
associativity. However per node workqueue online cpumask gets updated
dynamically. Hence there is a chance when per node workqueue online cpumask
is a superset of per node workqueue possible mask.

Link for v1: https://patchwork.ozlabs.org/patch/1151658
Changelog: v1->v2
 - Handled comments from Nathan Lynch.

Link for v2: 
http://lkml.kernel.org/r/20190829055023.6171-1-sri...@linux.vnet.ibm.com
Changelog: v2->v3
 - Handled comments from Nathan Lynch.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (5):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Handle extra hcall_vphn error cases
  powerpc/numa: Use cpu node map of first sibling thread
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 --
 arch/powerpc/kernel/smp.c |  5 --
 arch/powerpc/mm/numa.c| 96 ++-
 arch/powerpc/platforms/pseries/vphn.c |  3 +-
 4 files changed, 74 insertions(+), 34 deletions(-)

-- 
1.8.3.1



Re: [PATCH v2 3/4] powerpc/numa: Early request for home node associativity

2019-09-06 Thread Srikar Dronamraju
> > Regardless, I have an annoying question :-) Isn't it possible that,
> > while Linux is calling vphn_get_nid() for each logical cpu in sequence,
> > the platform could change a virtual processor's node assignment,
> > potentially causing sibling threads to get different node assignments
> > and producing an incoherent topology (which then leads to sched domain
> > assertions etc)?
> > 
> 
> Right, its certainly possible for node assignment to change while we iterate
> through the siblings. Do you have an recommendations?
> 

One thing that I forgot to add was that we already cache the cpu_to_node
mapping. If the mapping is around, we don't look up the nid again.

However it's still possible that, in the first iteration where we cache the
nids, we still end up with different nids for different siblings if the
node associativity changes in between.
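
A rough sketch of the caching plus per-core lookup being discussed here
(assuming the numa_cpu_lookup_table[] and helpers used elsewhere in this
series; this is not the final patch):

	int fcpu = cpu_first_thread_sibling(lcpu);
	int nid = numa_cpu_lookup_table[fcpu];

	if (nid >= 0) {		/* cached: siblings reuse the first thread's nid */
		map_cpu_to_node(lcpu, nid);
		return nid;
	}

	/* First thread of the core: do the single VPHN lookup and cache it. */
	nid = vphn_get_nid(fcpu, get_hwid);
	if (nid != NUMA_NO_NODE)
		map_cpu_to_node(fcpu, nid);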

> > If so, I think more care is needed. The algorithm should make the vphn
> > call only once per cpu node, I think?
> 
> I didn't get "once per cpu node", How do we know which all cpus are part of
> that cpu node? Or did you mean once per cpu core?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v2 3/4] powerpc/numa: Early request for home node associativity

2019-09-05 Thread Srikar Dronamraju
> >
> > While here, fix a problem where of_node_put could be called even when
> > of_get_cpu_node was not successful.
> 
> of_node_put() handles NULL arguments, so this should not be necessary.
> 

Ok 

> > @@ -875,7 +908,7 @@ void __init mem_topology_setup(void)
> > reset_numa_cpu_lookup_table();
> >  
> > for_each_present_cpu(cpu)
> > -   numa_setup_cpu(cpu);
> > +   numa_setup_cpu(cpu, false);
> >  }
> 
> I'm open to other points of view here, but I would prefer two separate
> functions, something like vphn_get_nid() for runtime and
> vphn_get_nid_early() (which could be __init) for boot-time
> initialization. Propagating a somewhat unexpressive boolean flag through
> two levels of function calls in this code is unappealing...
> 

Somehow I am not convinced that we need to duplicate the function just to
avoid passing a bool.

If propagating a boolean flag through two levels of function calls is an
issue, we could handle the logic in numa_setup_cpu itself.

Something like this

static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
{
	...

	if (firmware_has_feature(FW_FEATURE_VPHN)) {
		long hwid;

		if (get_hwid)
			hwid = get_hard_smp_processor_id(lcpu);
		else
			hwid = cpu_to_phys_id[lcpu];

		nid = vphn_get_nid(lcpu, hwid);
	}


Would this help?

> Regardless, I have an annoying question :-) Isn't it possible that,
> while Linux is calling vphn_get_nid() for each logical cpu in sequence,
> the platform could change a virtual processor's node assignment,
> potentially causing sibling threads to get different node assignments
> and producing an incoherent topology (which then leads to sched domain
> assertions etc)?
> 

Right, it's certainly possible for the node assignment to change while we
iterate through the siblings. Do you have any recommendations?

> If so, I think more care is needed. The algorithm should make the vphn
> call only once per cpu node, I think?

I didn't get "once per cpu node". How do we know which cpus are part of
that cpu node? Or did you mean once per cpu core?

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v2 4/4] powerpc/numa: Remove late request for home node associativity

2019-08-28 Thread Srikar Dronamraju
With commit ("powerpc/numa: Early request for home node associativity"),
commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), which was requesting home node associativity,
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea..9bd396f 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf..cdd39a0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index de4a1a1..a20617a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1608,15 +1608,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
1.8.3.1



[PATCH v2 2/4] powerpc/numa: Handle extra hcall_vphn error cases

2019-08-28 Thread Srikar Dronamraju
Currently the code handles the H_FUNCTION, H_SUCCESS and H_HARDWARE return
codes. However hcall_vphn can return other return codes. Now it also
handles the H_PARAMETER return code, and the remaining return codes are
handled under the default case.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
 Handled comments from Nathan:
  - Split patch from patch 1.
  - Corrected a problem where I missed calling stop_topology_update().
  - Using pr_err_ratelimited instead of printk.

 arch/powerpc/mm/numa.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d2..8fbe57c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,23 +1191,30 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   goto out;
+
case H_FUNCTION:
-   printk_once(KERN_INFO
-   "VPHN is not supported. Disabling polling...\n");
-   stop_topology_update();
+   pr_err_ratelimited("VPHN unsupported. Disabling polling...\n");
break;
case H_HARDWARE:
-   printk(KERN_ERR
-   "hcall_vphn() experienced a hardware fault "
+   pr_err_ratelimited("hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
-   stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   pr_err_ratelimited("hcall_vphn() was passed an invalid 
parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   pr_err_ratelimited("hcall_vphn() returned %ld. Disabling 
polling...\n"
+   , rc);
break;
}
 
+   stop_topology_update();
+out:
return rc;
 }
 
-- 
1.8.3.1



[PATCH v2 3/4] powerpc/numa: Early request for home node associativity

2019-08-28 Thread Srikar Dronamraju
Currently the kernel detects if it's running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However between the time NUMA setup is initialized and the
request for home node associativity, workqueue initializes its per node
cpumask. The per node workqueue possible cpumask may turn invalid
after home node associativity resulting in weird situations like
workqueue possible cpumask being a subset of workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However at NUMA setup time, the kernel may not be in
a position to detect if it's running on a shared lpar platform. So
request home node associativity and, if the request fails, fall back
on the device tree property.

While here, fix a problem where of_node_put could be called even when
of_get_cpu_node was not successful.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
- Handled comments from Nathan Lynch
  * Dont depend on pacas to be setup for the hwid

 arch/powerpc/mm/numa.c | 43 ++-
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8fbe57c..de4a1a1 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,14 +461,33 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+static int vphn_get_nid(unsigned long cpu, bool get_hwid)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc, hwid;
+
+   if (get_hwid)
+   hwid = get_hard_smp_processor_id(cpu);
+   else
+   hwid = cpu_to_phys_id[cpu];
+
+   rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);
+   if (rc == H_SUCCESS)
+   return associativity_to_nid(associativity);
+
+   return NUMA_NO_NODE;
+}
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
+ * cpu_to_phys_id is only valid between smp_setup_cpu_maps() and
+ * smp_setup_pacas(). If called outside this window, set get_hwid to true.
  * Return the id of the domain used.
  */
-static int numa_setup_cpu(unsigned long lcpu)
+static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
 {
int nid = NUMA_NO_NODE;
-   struct device_node *cpu;
+   struct device_node *cpu = NULL;
 
/*
 * If a valid cpu-to-node mapping is already available, use it
@@ -480,6 +499,20 @@ static int numa_setup_cpu(unsigned long lcpu)
return nid;
}
 
+   /*
+* On a shared lpar, device tree will not have node associativity.
+* At this time lppaca, or its __old_status field may not be
+* updated. Hence kernel cannot detect if its on a shared lpar. So
+* request an explicit associativity irrespective of whether the
+* lpar is shared or dedicated. Use the device tree property as a
+* fallback.
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN))
+   nid = vphn_get_nid(lcpu, get_hwid);
+
+   if (nid != NUMA_NO_NODE)
+   goto out_present;
+
cpu = of_get_cpu_node(lcpu, NULL);
 
if (!cpu) {
@@ -491,13 +524,13 @@ static int numa_setup_cpu(unsigned long lcpu)
}
 
nid = of_node_to_nid_single(cpu);
+   of_node_put(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
map_cpu_to_node(lcpu, nid);
-   of_node_put(cpu);
 out:
return nid;
 }
@@ -528,7 +561,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
 {
int nid;
 
-   nid = numa_setup_cpu(cpu);
+   nid = numa_setup_cpu(cpu, true);
verify_cpu_node_mapping(cpu, nid);
return 0;
 }
@@ -875,7 +908,7 @@ void __init mem_topology_setup(void)
reset_numa_cpu_lookup_table();
 
for_each_present_cpu(cpu)
-   numa_setup_cpu(cpu);
+   numa_setup_cpu(cpu, false);
 }
 
 void __init initmem_init(void)
-- 
1.8.3.1



[PATCH v2 0/4] Early node associativity

2019-08-28 Thread Srikar Dronamraju
Abdul reported a warning on a shared lpar.
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because per node workqueue possible mask is set very early in the
boot process even before the system was querying the home node
associativity. However per node workqueue online cpumask gets updated
dynamically. Hence there is a chance when per node workqueue online cpumask
is a superset of per node workqueue possible mask.

Link for v1: https://patchwork.ozlabs.org/patch/1151658
Changelog: v1->v2
 - Handled comments from Nathan Lynch.

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (4):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Handle extra hcall_vphn error cases
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 --
 arch/powerpc/kernel/smp.c |  5 ---
 arch/powerpc/mm/numa.c| 77 ---
 arch/powerpc/platforms/pseries/vphn.c |  3 +-
 4 files changed, 56 insertions(+), 33 deletions(-)

-- 
1.8.3.1



[PATCH v2 1/4] powerpc/vphn: Check for error from hcall_vphn

2019-08-28 Thread Srikar Dronamraju
There is no value in unpacking associativity if the
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Reported-by: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
Changelog (v1->v2):
- Split the patch into 2(Suggested by Nathan).

 arch/powerpc/platforms/pseries/vphn.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vphn.c b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6..cca474a 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
1.8.3.1



Re: [PATCH 2/3] powerpc/numa: Early request for home node associativity

2019-08-27 Thread Srikar Dronamraju
Hi Satheesh,

> > Currently the kernel detects if its running on a shared lpar platform
> > and requests home node associativity before the scheduler sched_domains
> > are setup. However between the time NUMA setup is initialized and the
> > request for home node associativity, workqueue initializes its per node
> > cpumask. The per node workqueue possible cpumask may turn invalid
> > after home node associativity resulting in weird situations like
> > workqueue possible cpumask being a subset of workqueue online cpumask.
> 
> Env:
> HW: Power8
> Host/Guest Kernel: 5.3.0-rc5-00172-g13e3f1076e29 (linux master + this series)
> Qemu: 4.0.90 (v4.1.0-rc3)
> 
> Guest Config:
> ..
>  4
> ...
> /home/kvmci/linux/vmlinux
> root=/dev/sda2 rw console=tty0 console=ttyS0,115200 
> init=/sbin/init  initcall_debug numa=debug crashkernel=1024M 
> selinux=0
> ...
>   
> 
>   
>   
> 
> 
> Event: 
> vcpu hotplug
> 
> [root@atest-guest ~]# [   41.447170] random: crng init done
> [   41.448153] random: 7 urandom warning(s) missed due to ratelimiting
> [   51.727256] VPHN hcall succeeded. Reset polling...
> [   51.826301] adding cpu 2 to node 1
> [   51.856238] WARNING: workqueue cpumask: online intersect > possible 
> intersect
> [   51.916297] VPHN hcall succeeded. Reset polling...
> [   52.036272] adding cpu 3 to node 1
> 

Thanks for testing.

The fix in this patch series was to make sure the per node workqueue
possible cpumask is updated correctly at boot. However node hotplug on KVM
guests and dlpar on PowerVM lpars aren't covered by this patch series. On
systems that support shared processors, the associativity of the possible
cpus is not known at boot time. Hence we will not be able to update the
per node workqueue possible cpumask.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH 2/3] powerpc/numa: Early request for home node associativity

2019-08-22 Thread Srikar Dronamraju
* Nathan Lynch  [2019-08-22 12:17:48]:

> Hi Srikar,

Thanks Nathan for the review.

> 
> > However home node associativity requires cpu's hwid which is set in
> > smp_setup_pacas. Hence call smp_setup_pacas before numa_setup_cpus.
> 
> But this seems like it would negatively affect pacas' NUMA placements?
> 
> Would it be less risky to figure out a way to do "early" VPHN hcalls
> before mem_topology_setup, getting the hwids from the cpu_to_phys_id
> array perhaps?
> 

Do you mean that for calls from mem_topology_setup() we use cpu_to_phys_id,
but for the calls from ppc_numa_cpu_prepare() we use
get_hard_smp_processor_id()?

That's doable.
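
A sketch of what that would look like (essentially what the later v2
posting of this patch, earlier in this archive, ends up doing; get_hwid is
the caller-supplied hint):

	long hwid;

	if (get_hwid)
		/* pacas are set up; cpu hotplug path */
		hwid = get_hard_smp_processor_id(cpu);
	else
		/* boot path: cpu_to_phys_id[] is valid between
		 * smp_setup_cpu_maps() and smp_setup_pacas()
		 */
		hwid = cpu_to_phys_id[cpu];

	rc = hcall_vphn(hwid, VPHN_FLAG_VCPU, associativity);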

> 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 88b5157..7965d3b 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -461,6 +461,21 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
> > return nid;
> >  }
> >  
> > +static int vphn_get_nid(unsigned long cpu)
> > +{
> > +   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> > +   long rc;
> > +
> > +   /* Use associativity from first thread for all siblings */
> 
> I don't understand how this comment corresponds to the code it
> accompanies.


Okay will rephrase
> 
> 
> > +   rc = hcall_vphn(get_hard_smp_processor_id(cpu),
> > +   VPHN_FLAG_VCPU, associativity);
> > +
> > +   if (rc == H_SUCCESS)
> > +   return  associativity_to_nid(associativity);
>   ^^ extra space
> 
> > @@ -490,7 +505,18 @@ static int numa_setup_cpu(unsigned long lcpu)
> > goto out;
> > }
> >  
> > -   nid = of_node_to_nid_single(cpu);
> > +   /*
> > +* On a shared lpar, the device tree might not have the correct node
> > +* associativity.  At this time lppaca, or its __old_status field
> 
> Sorry but I'm going to quibble with this phrasing a bit. On SPLPAR the
> CPU nodes have no affinity information in the device tree at all. This
> comment implies that they may have incorrect information, which is
> AFAIK not the case.
> 

Okay will clarify.

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH 3/3] powerpc/numa: Remove late request for home node associativity

2019-08-22 Thread Srikar Dronamraju
With commit ("powerpc/numa: Early request for home node associativity"),
commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), which was requesting home node associativity,
becomes redundant.

Hence remove the late request for home node associativity.

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Reported-by: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/include/asm/topology.h | 4 
 arch/powerpc/kernel/smp.c   | 5 -
 arch/powerpc/mm/numa.c  | 9 -
 3 files changed, 18 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea..9bd396f 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,7 +98,6 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
-extern void __init shared_proc_topology_init(void);
 #else
 static inline int start_topology_update(void)
 {
@@ -121,9 +120,6 @@ static inline int timed_topology_update(int nsecs)
return 0;
 }
 
-#ifdef CONFIG_SMP
-static inline void shared_proc_topology_init(void) {}
-#endif
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf..cdd39a0 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1359,11 +1359,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
if (smp_ops && smp_ops->bringup_done)
smp_ops->bringup_done();
 
-   /*
-* On a shared LPAR, associativity needs to be requested.
-* Hence, get numa topology before dumping cpu topology
-*/
-   shared_proc_topology_init();
dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 7965d3b..2efeac8 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1604,15 +1604,6 @@ int prrn_is_enabled(void)
return prrn_enabled;
 }
 
-void __init shared_proc_topology_init(void)
-{
-   if (lppaca_shared_proc(get_lppaca())) {
-   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-   nr_cpumask_bits);
-   numa_update_cpu_topology(false);
-   }
-}
-
 static int topology_read(struct seq_file *file, void *v)
 {
if (vphn_enabled || prrn_enabled)
-- 
1.8.3.1



[PATCH 2/3] powerpc/numa: Early request for home node associativity

2019-08-22 Thread Srikar Dronamraju
Currently the kernel detects if it is running on a shared lpar platform
and requests home node associativity before the scheduler sched_domains
are set up. However, between the time NUMA setup is initialized and the
request for home node associativity, the workqueue subsystem initializes
its per-node cpumasks. The per-node workqueue possible cpumask may turn
invalid after the home node associativity update, resulting in weird
situations like the workqueue possible cpumask being a subset of the
workqueue online cpumask.

This can be fixed by requesting home node associativity earlier, just
before NUMA setup. However, at NUMA setup time the kernel may not be in
a position to detect if it is running on a shared lpar platform. So
request home node associativity unconditionally and, if the request
fails, fall back on the device tree property.

However, home node associativity requires the cpu's hwid, which is set
in smp_setup_pacas(). Hence call smp_setup_pacas() before
numa_setup_cpus().

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Reported-by: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/kernel/setup-common.c |  5 +++--
 arch/powerpc/mm/numa.c             | 28 +++++++++++++++++++++++++++-
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 1f8db66..9135dba 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -888,6 +888,9 @@ void __init setup_arch(char **cmdline_p)
/* Check the SMT related command line arguments (ppc64). */
check_smt_enabled();
 
+#ifdef CONFIG_SMP
+   smp_setup_pacas();
+#endif
/* Parse memory topology */
mem_topology_setup();
 
@@ -899,8 +902,6 @@ void __init setup_arch(char **cmdline_p)
 * so smp_release_cpus() does nothing for them.
 */
 #ifdef CONFIG_SMP
-   smp_setup_pacas();
-
/* On BookE, setup per-core TLB data structures. */
setup_tlb_core_data();
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 88b5157..7965d3b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -461,6 +461,21 @@ static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
return nid;
 }
 
+static int vphn_get_nid(unsigned long cpu)
+{
+   __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+   long rc;
+
+   /* Use associativity from first thread for all siblings */
+   rc = hcall_vphn(get_hard_smp_processor_id(cpu),
+   VPHN_FLAG_VCPU, associativity);
+
+   if (rc == H_SUCCESS)
+   return  associativity_to_nid(associativity);
+
+   return NUMA_NO_NODE;
+}
+
 /*
  * Figure out to which domain a cpu belongs and stick it there.
  * Return the id of the domain used.
@@ -490,7 +505,18 @@ static int numa_setup_cpu(unsigned long lcpu)
goto out;
}
 
-   nid = of_node_to_nid_single(cpu);
+   /*
+* On a shared lpar, the device tree might not have the correct node
+* associativity.  At this time lppaca, or its __old_status field
+* may not be updated. Hence request an explicit associativity
+* irrespective of whether the lpar is shared or dedicated.  Use the
+* device tree property as a fallback.
+*/
+   if (firmware_has_feature(FW_FEATURE_VPHN))
+   nid = vphn_get_nid(lcpu);
+
+   if (nid == NUMA_NO_NODE)
+   nid = of_node_to_nid_single(cpu);
 
 out_present:
if (nid < 0 || !node_possible(nid))
-- 
1.8.3.1



[PATCH 1/3] powerpc/vphn: Check for error from hcall_vphn

2019-08-22 Thread Srikar Dronamraju
There is no point in unpacking associativity if the
H_HOME_NODE_ASSOCIATIVITY hcall has returned an error.

Also add error messages for H_PARAMETER and the default case in
vphn_get_associativity().

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Reported-by: Satheesh Rajendran 
Reported-by: Abdul Haleem 
---
 arch/powerpc/mm/numa.c| 16 +++++++++++++---
 arch/powerpc/platforms/pseries/vphn.c |  3 ++-
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d2..88b5157 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1191,6 +1191,10 @@ static long vphn_get_associativity(unsigned long cpu,
VPHN_FLAG_VCPU, associativity);
 
switch (rc) {
+   case H_SUCCESS:
+   dbg("VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   break;
case H_FUNCTION:
printk_once(KERN_INFO
"VPHN is not supported. Disabling polling...\n");
@@ -1202,9 +1206,15 @@ static long vphn_get_associativity(unsigned long cpu,
"preventing VPHN. Disabling polling...\n");
stop_topology_update();
break;
-   case H_SUCCESS:
-   dbg("VPHN hcall succeeded. Reset polling...\n");
-   timed_topology_update(0);
+   case H_PARAMETER:
+   printk(KERN_ERR
+   "hcall_vphn() was passed an invalid parameter. "
+   "Disabling polling...\n");
+   break;
+   default:
+   printk(KERN_ERR
+   "hcall_vphn() returned %ld. Disabling polling \n", rc);
+   stop_topology_update();
break;
}
 
diff --git a/arch/powerpc/platforms/pseries/vphn.c b/arch/powerpc/platforms/pseries/vphn.c
index 3f07bf6..cca474a 100644
--- a/arch/powerpc/platforms/pseries/vphn.c
+++ b/arch/powerpc/platforms/pseries/vphn.c
@@ -82,7 +82,8 @@ long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
 
rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
-   vphn_unpack_associativity(retbuf, associativity);
+   if (rc == H_SUCCESS)
+   vphn_unpack_associativity(retbuf, associativity);
 
return rc;
 }
-- 
1.8.3.1



[PATCH 0/3] Early node associativity

2019-08-22 Thread Srikar Dronamraju
Abdul reported a warning on a shared lpar:
"WARNING: workqueue cpumask: online intersect > possible intersect".
This is because the per-node workqueue possible cpumask is set very
early in the boot process, even before the system has queried home node
associativity. However, the per-node workqueue online cpumask gets
updated dynamically. Hence there is a chance that the per-node workqueue
online cpumask becomes a superset of the per-node workqueue possible
cpumask.
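
For reference, the warning comes from wq_calc_node_cpumask() in
kernel/workqueue.c, which complains when a node has online CPUs that the
attrs want but the possible mask computed for that node comes up empty.
A trimmed paraphrase of the kernel code of that era (not part of this
series):

	/* does @node have any online CPUs @attrs wants? */
	cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
	if (cpumask_empty(cpumask))
		goto use_dfl;

	/* yes; return possible CPUs in @node that @attrs wants */
	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
	if (cpumask_empty(cpumask)) {
		pr_warn_once("WARNING: workqueue cpumask: online intersect > "
			     "possible intersect\n");
		return false;
	}

wq_numa_possible_cpumask[] is the early snapshot that goes stale when
the cpu-node mapping changes later in boot.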

The below patches try to fix this problem.
Reported at: https://github.com/linuxppc/issues/issues/167

Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Abdul Haleem 
Cc: Nathan Lynch 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Satheesh Rajendran 

Srikar Dronamraju (3):
  powerpc/vphn: Check for error from hcall_vphn
  powerpc/numa: Early request for home node associativity
  powerpc/numa: Remove late request for home node associativity

 arch/powerpc/include/asm/topology.h   |  4 ----
 arch/powerpc/kernel/setup-common.c    |  5 +++--
 arch/powerpc/kernel/smp.c             |  5 -----
 arch/powerpc/mm/numa.c                | 53 ++++++++++++++++++++++++++++++++++++++++-------------
 arch/powerpc/platforms/pseries/vphn.c |  3 ++-
 5 files changed, 45 insertions(+), 25 deletions(-)

-- 
1.8.3.1



[PATCH v2] powerpc/smp: Use nid as fallback for package_id

2019-08-22 Thread Srikar Dronamraju
Package_id is used to find out all cores that are part of the same chip.
On PowerNV machines, package_id defaults to chip_id. However the
ibm,chip-id property is not present in the device-tree of PowerVM lpars.
Hence lscpu output shows one core per socket and multiple sockets.

To overcome this, use nid as the package_id on PowerVM Lpars.

Before the patch.
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1   <--
Socket(s):   16  <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 -1
 #

After the patch
---
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  8  <--
Socket(s):           2  <--
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 #
 # cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
 0
 #
Now lscpu output is more in line with the system configuration.

Link to previous posting: https://patchwork.ozlabs.org/patch/1126145

Signed-off-by: Srikar Dronamraju 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Vasant Hegde 
Cc: Vaidyanathan Srinivasan 
---
Changelog from v1:
In V1, cpu_to_chip_id was overloaded to fall back on nid. Michael
Ellerman wasn't comfortable with nid showing up as chip_id.

 arch/powerpc/include/asm/topology.h |  3 ++-
 arch/powerpc/kernel/smp.c           | 46 +++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2f7e1ea5089e..f0c4b2f06665 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -134,7 +134,8 @@ static inline void shared_proc_topology_init(void) {}
 #ifdef CONFIG_PPC64
 #include 
 
-#define topology_physical_package_id(cpu)  (cpu_to_chip_id(cpu))
+extern int get_physical_package_id(int);
+#define topology_physical_package_id(cpu)  (get_physical_package_id(cpu))
 #define topology_sibling_cpumask(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define topology_core_cpumask(cpu) (per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)  (cpu_to_core_id(cpu))
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..4d1541cc5e95 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1185,10 +1185,50 @@ static inline void add_cpu_to_smallcore_masks(int cpu)
}
 }
 
+#ifdef CONFIG_PPC64
+int get_physical_package_id(int cpu)
+{
+   struct device_node *np, *root;
+   struct property *pp;
+   int gppid = cpu_to_chip_id(cpu);
+
+   /*
+* If the platform is PowerNV or a KVM guest, ibm,chip-id is
+* defined. Hence we would return the chip-id as the physical
+* package id.
+*/
+   if (gppid == -1 && firmware_has_feature(FW_FEATURE_LPAR) &&
+   machine_is(pseries)) {
+   /*
+* PowerVM hypervisor doesn't export ibm,chip-id property.
+* Currently only PowerVM hypervisor supports
+* /rtas/ibm,configure-kernel-dump property. Use this
+* property to identify PowerVM LPARs within pseries
+* platform.
+*/
+   root = of_find_node_by_path("/rtas");
+   if (root) {
+   pp = of_find_property(root,
+   "ibm,configure-kernel-dump", NULL);
+   if (pp) {
+   np = of_get_cpu_node(cpu, NULL);
+   if (np) {
+   gppid = of_node_to_nid(np);
+   of_node_put(np);
+   }
+   }
+   of_node_put(root);
+   }
+   }
+   return gppid;
+}
+EXPORT_SYMBOL(get_physical_package_id);
+#endif
+
 static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);

Re: [PATCH] powerpc: Use nid as fallback for chip_id

2019-07-31 Thread Srikar Dronamraju
* Michael Ellerman  [2019-07-29 22:41:55]:

> >  
> > +   chip_id = of_get_ibm_chip_id(np);
> > +   if (chip_id == -1)
> > +   chip_id = of_node_to_nid(np);
> > +
> > of_node_put(np);
> > -   return of_get_ibm_chip_id(np);
> > +   return chip_id;
> >  }
> 
> A nid is not a chip-id.
> 

Agree that nid is not a chip-id.

> This obviously happens to work for the case you've identified above but
> it's not something I'm happy to merge in general.
> 

Okay.

> We could do a similar change in the topology code, but I'd probably like
> it to be restricted to when we're running under PowerVM and there are no
> chip-ids found at all.
> 

So for the PowerNV case and KVM guests, of_get_ibm_chip_id() always
seems to return a valid chip-id. It's *only* in the PowerVM case that we
are returning nid as the fallback chip-id.


Do you think checking for OPAL firmware would help?

chip_id = of_get_ibm_chip_id(np);
if (chip_id == -1 && !firmware_has_feature(FW_FEATURE_OPAL))
	chip_id = of_node_to_nid(np);

of_node_put(np);


or should we do

int topology_physical_package_id(int cpu)
{
	int chip_id = cpu_to_chip_id(cpu);

	/* fall back to nid instead of chip-id */
	if (chip_id == -1 && !firmware_has_feature(FW_FEATURE_OPAL))
		chip_id = cpu_to_node(cpu);

	return chip_id;
}

> I'm also not clear how it will interact with migration.
> 

On migration, this function would be triggered when the cpumasks are
getting updated, so I would expect this to continue working.

Or am I missing some other migration-related quirk?

> cheers
> 


The other alternative that I see is 




-- 
Thanks and Regards
Srikar Dronamraju



[PATCH] powerpc: Use nid as fallback for chip_id

2019-07-02 Thread Srikar Dronamraju
One of the uses of chip_id is to find out all cores that are part of
the same chip. However the ibm,chip-id property is not present in the
device-tree of PowerVM lpars. Hence lscpu output shows one core per
socket and multiple sockets.

Before the patch.
# lscpu
Architecture:ppc64le
Byte Order:  Little Endian
CPU(s):  128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   16
NUMA node(s):2
Model:   2.2 (pvr 004e 0202)
Model name:  POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:   32K
L1i cache:   32K
L2 cache:512K
L3 cache:10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

# cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
-1

Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/kernel/prom.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 7159e791a70d..0b8918b43580 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -867,18 +867,24 @@ EXPORT_SYMBOL(of_get_ibm_chip_id);
  * @cpu: The logical cpu number.
  *
  * Return the value of the ibm,chip-id property corresponding to the given
- * logical cpu number. If the chip-id can not be found, returns -1.
+ * logical cpu number. If the chip-id can not be found, return nid.
+ *
  */
 int cpu_to_chip_id(int cpu)
 {
struct device_node *np;
+   int chip_id = -1;
 
np = of_get_cpu_node(cpu, NULL);
if (!np)
return -1;
 
+   chip_id = of_get_ibm_chip_id(np);
+   if (chip_id == -1)
+   chip_id = of_node_to_nid(np);
+
of_node_put(np);
-   return of_get_ibm_chip_id(np);
+   return chip_id;
 }
 EXPORT_SYMBOL(cpu_to_chip_id);
 
-- 
2.18.1



Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-18 Thread Srikar Dronamraju
* Laurent Vivier  [2019-03-15 12:12:45]:

> 
> Another way to avoid the nodes overlapping for the offline nodes at
> startup is to ensure the default values don't define a distance that
> merge all offline nodes into node 0.
> 
> A powerpc specific patch can workaround the kernel crash by doing this:
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 87f0dd0..3ba29bb 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
> struct device_node *memory;
> int default_nid = 0;
> unsigned long i;
> +   int nid, dist;
> 
> if (numa_enabled == 0) {
> printk(KERN_WARNING "NUMA disabled by user\n");
> @@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
> 
> dbg("NUMA associativity depth for CPU/Memory: %d\n",
> min_common_depth);
> 
> +   for (nid = 0; nid < MAX_NUMNODES; nid ++)
> +   for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
> +   distance_lookup_table[nid][dist] = nid;
> +

The only reason this would have worked in the specific case is that we
are overriding the distance_lookup_table with a unique distance per
node. So node_distance() for any node other than itself will return the
max distance, which is 40 in this case (since distance_ref_points_depth
is 2).

I am not sure if this will work if the node distance between two nodes
happens to be 20.
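
For reference, powerpc derives node_distance() from this table roughly
as follows (paraphrased from __node_distance() in arch/powerpc/mm/numa.c):

	int __node_distance(int a, int b)
	{
		int i;
		int distance = LOCAL_DISTANCE;	/* 10 */

		if (!form1_affinity)
			return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);

		for (i = 0; i < distance_ref_points_depth; i++) {
			if (distance_lookup_table[a][i] ==
			    distance_lookup_table[b][i])
				break;

			/* Double the distance for each NUMA level */
			distance *= 2;
		}

		return distance;
	}

With a unique value in every slot, no two distinct nodes ever match at
any level, so with depth 2 every off-node distance becomes 10 * 2 * 2 = 40.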

> /*
>  * Even though we connect cpus to numa domains later in SMP
>  * init, we need to know the node ids now. This is because
> 

-- 
Thanks and Regards
Srikar Dronamraju



Re: [RFC v3] sched/topology: fix kernel crash when a CPU is hotplugged in a memoryless node

2019-03-18 Thread Srikar Dronamraju
> > node 0 (because firmware doesn't provide the distance information for
> > memoryless/cpuless nodes):
> > 
> >   node   0   1   2   3
> > 0:  10  40  10  10
> > 1:  40  10  40  40
> > 2:  10  40  10  10
> > 3:  10  40  10  10
> 
> *groan*... what does it do for things like percpu memory? ISTR the
> per-cpu chunks are all allocated early too. Having them all use memory
> out of node-0 would seem sub-optimal.

In the specific failing case, there is only one node with memory; all
other nodes are cpu-only nodes.

However, in the generic case, since this is just a cpu hotplug
operation, the memory allocated early for per-cpu chunks would remain.

Maybe Michael Ellerman can correct me here.

> 
> > We should have:
> > 
> >   node   0   1   2   3
> > 0:  10  40  40  40
> > 1:  40  10  40  40
> > 2:  40  40  10  40
> > 3:  40  40  40  10
> 
> Can it happen that it introduces a new distance in the table? One that
> hasn't been seen before? This example only has 10 and 40, but suppose
> the new node lands at distance 20 (or 80); can such a thing happen?
> 
> If not; why not?

Yes, distances can be 20, 40 or 80. There is nothing that forces the
node distance to always be 40.
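
As a worked example of where those values come from, given the doubling
in __node_distance() quoted earlier and distance_ref_points_depth = 3:

	/*
	 * entries match at level 0  -> 10 (local)
	 * first match at level 1    -> 20
	 * first match at level 2    -> 40
	 * no match at any level     -> 80
	 */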

> So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
> but that in turn relies on the sched_domain_numa_levels thing to stay
> accurate.
> 
> This all seems very fragile and unfortunate.
> 

Any reasons why this is fragile?

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v03] powerpc/numa: Perform full re-add of CPU for PRRN/VPHN topology update

2019-02-07 Thread Srikar Dronamraju
> 
>  int arch_update_cpu_topology(void)
>  {
> - return numa_update_cpu_topology(true);
> + int changed = topology_changed;
> +
> + topology_changed = 0;
> + return changed;
>  }
> 

Do we need a powerpc override for arch_update_cpu_topology() now? The
topology_changed tracking from some time back doesn't seem to have
helped, and the scheduler now ignores whether the topology changed or
not.

Also, we can do away with the new topology_changed.
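
For context, the generic fallback such an override would replace is just
the weak default (in kernel/sched/topology.c these days):

	int __weak arch_update_cpu_topology(void)
	{
		return 0;
	}

so dropping the powerpc override entirely would make
arch_update_cpu_topology() always report "no change".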

>  static void topology_work_fn(struct work_struct *work)
>  {
> - rebuild_sched_domains();
> + lock_device_hotplug();
> + if (numa_update_cpu_topology(true))
> + rebuild_sched_domains();
> + unlock_device_hotplug();
>  }

Should this hunk be a separate patch by itself, with a changelog that
explains why rebuild_sched_domains() should be called under
lock_device_hotplug()? rebuild_sched_domains() already takes
cpuset_mutex, so I am not sure we need to take device_hotplug_lock as
well.

>  static DECLARE_WORK(topology_work, topology_work_fn);
> 
> -static void topology_schedule_update(void)
> +void topology_schedule_update(void)
>  {
> - schedule_work(&topology_work);
> + if (!topology_update_in_progress)
> + schedule_work(&topology_work);
>  }
> 
>  static void topology_timer_fn(struct timer_list *unused)
>  {
> + bool sdo = false;

Is sdo an abbreviation? If so, of what?

> +
> + if (topology_scans < 1)
> + bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
> + nr_cpumask_bits);

Why do we need topology_scans? Just to make sure
cpu_associativity_changes_mask is populated only once? Can't we use a
static bool inside the function for the same?
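
Something like this minimal sketch of the static-bool variant
(hypothetical, not the posted patch):

	static void topology_timer_fn(struct timer_list *unused)
	{
		static bool first_scan = true;

		/* populate the mask once, on the first timer tick */
		if (first_scan) {
			bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
				    nr_cpumask_bits);
			first_scan = false;
		}

		/* ... rest of the function as posted ... */
	}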


> +
>   if (prrn_enabled && cpumask_weight(&cpu_associativity_changes_mask))
> - topology_schedule_update();
> - else if (vphn_enabled) {
> + sdo =  true;
> + if (vphn_enabled) {

Any reason to remove the else above?

>   if (update_cpu_associativity_changes_mask() > 0)
> - topology_schedule_update();
> + sdo =  true;
>   reset_topology_timer();
>   }
> + if (sdo)
> + topology_schedule_update();
> +     topology_scans++;
>  }

Are the above two hunks necessary? I am not seeing how the current
changes differ from the previous behaviour.

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH] powerpc/numa: Perform full re-add of CPU for PRRN/VPHN topology update

2018-11-13 Thread Srikar Dronamraju
> -static void topology_work_fn(struct work_struct *work)
> -{
> - rebuild_sched_domains();
> + if (changed)
> + rebuild_sched_domains();
>  }
>  static DECLARE_WORK(topology_work, topology_work_fn);
> 
> @@ -1553,7 +1424,6 @@ void __init shared_proc_topology_init(void)
>   if (lppaca_shared_proc(get_lppaca())) {
>   bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
>   nr_cpumask_bits);
> - numa_update_cpu_topology(false);

Shouldn't we be calling topology_schedule_update() here?

>   }
>  }
> 
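
I.e., something along these lines (a sketch of the suggestion, not a
tested patch):

	void __init shared_proc_topology_init(void)
	{
		if (lppaca_shared_proc(get_lppaca())) {
			bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
				    nr_cpumask_bits);
			/* defer to the workqueue instead of updating inline */
			topology_schedule_update();
		}
	}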


-- 
Thanks and Regards
Srikar Dronamraju



[PATCH v2] powerpc/topology: Update numa mask when cpu node mapping changes

2018-10-16 Thread Srikar Dronamraju
Commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot") will update the cpu node topology for shared lpars
on PowerVM.

However, shared lpars on PowerVM also support VPHN and PRRN events.
On receiving VPHN or PRRN events, the cpu to node mapping might change.

The scheduler maintains sched_domains_numa_masks[], which is currently
not updated on cpu to node mapping changes. This can lead to machine
hangs or a performance hit.

Fix numa_update_cpu_topology() to update sched_domains_numa_masks[].
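
For reference, cycling each updated cpu works because the activate path
re-adds the cpu to the masks for its (possibly new) node. Paraphrased
from kernel/sched/core.c of that era:

	int sched_cpu_activate(unsigned int cpu)
	{
		set_cpu_active(cpu, true);

		if (sched_smp_initialized) {
			/* add cpu to sched_domains_numa_masks for its node */
			sched_domains_numa_masks_set(cpu);
			cpuset_cpu_active();
		}

		/* ... then attach the cpu to its root domain ... */
		return 0;
	}

sched_cpu_deactivate() correspondingly calls
sched_domains_numa_masks_clear(cpu).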

Signed-off-by: Srikar Dronamraju 
---
Changelog: v1->v2:
- Updated changelog.
- Updated comments to refer to topology_inited.

 arch/powerpc/mm/numa.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3ea5b05ee6be..767c363ebafa 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1424,6 +1425,20 @@ int numa_update_cpu_topology(bool cpus_locked)
changed = 1;
}
 
+   /*
+* Scheduler maintains a sched_domain_numa_masks table that needs to
+* be updated whenever cpu-node mapping changes. Cpu-node mapping
+* changes only with VPHN and PRRN. 'topology_inited' indicates if
+* cpu-hotplug infrastructure is initialized and good to use.
+*/
+   if (!(topology_inited && (vphn_enabled || prrn_enabled)))
+   goto out;
+
+   for_each_cpu(cpu, &updated_cpus) {
+   sched_cpu_deactivate(cpu);
+   sched_cpu_activate(cpu);
+   }
+
 out:
kfree(updates);
topology_update_needed = 0;
-- 
2.12.3



Re: [PATCH v10 2/3] powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

2018-10-12 Thread Srikar Dronamraju
* Gautham R. Shenoy  [2018-10-11 11:03:02]:

> From: "Gautham R. Shenoy" 
> 
> 
> +#ifdef CONFIG_SCHED_SMT
> + if (has_big_cores) {
> + pr_info("Using small cores at SMT level\n");
> + power9_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[0].mask = smallcore_smt_mask;
> + }
> +#endif
>   /*

I don't see a way a system can have has_big_cores set but use
powerpc_topology.



Re: [PATCH v10 1/3] powerpc: Detect the presence of big-cores via "ibm,thread-groups"

2018-10-12 Thread Srikar Dronamraju
> +DECLARE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> 
> +/*
> + * On big-cores system, cpu_l1_cache_map for each CPU corresponds to
> + * the set its siblings that share the L1-cache.
> + */
> +DEFINE_PER_CPU(cpumask_var_t, cpu_l1_cache_map);
> 

Nit:
Can you add a comment on how cpu_l1_cache_map differs from
cpu_smallcore_map?
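
One possible wording, as a suggestion of the intent (to be confirmed by
the author):

	/*
	 * cpu_l1_cache_map: siblings of this CPU that share its L1 cache,
	 * as reported by ibm,thread-groups.
	 *
	 * cpu_smallcore_map: the SMT sibling mask the scheduler uses on
	 * big-core systems; derived from cpu_l1_cache_map, restricted to
	 * online threads of the same core.
	 */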

Everything else looks okay to me.



Re: [PATCH] powerpc/topology: Update numa mask when cpu node mapping changes

2018-10-09 Thread Srikar Dronamraju
* Srikar Dronamraju  [2018-10-10 09:54:46]:

> Commit 2ea626306810 ("powerpc/topology: Get topology for shared
> processors at boot") will update the cpu node topology for shared lpars
> on PowerVM.
> 
> However shared lpars on PowerVM also support VPHN and PRRN events.
> On receiving a VPHN, PRRN events, cpu to node mapping might change.
> 
> Scheduler maintains sched_domains_numa_masks[], which is currently not
> updated on cpu to node mapping changes. This can lead to machine
> regressions and performance regressions.

s/regressions and performance regressions./hangs or performance hit./

> 
> Fix numa_update_cpu_topology() to update sched_domains_numa_masks[].
> 

-- 
Thanks and Regards
Srikar Dronamraju



[PATCH] powerpc/topology: Update numa mask when cpu node mapping changes

2018-10-09 Thread Srikar Dronamraju
Commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot") will update the cpu node topology for shared lpars
on PowerVM.

However shared lpars on PowerVM also support VPHN and PRRN events.
On receiving a VPHN, PRRN events, cpu to node mapping might change.

Scheduler maintains sched_domains_numa_masks[], which is currently not
updated on cpu to node mapping changes. This can lead to machine
regressions and performance regressions.

Fix numa_update_cpu_topology() to update sched_domains_numa_masks[].

Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3ea5b05ee6be..bf34384aa314 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1424,6 +1425,19 @@ int numa_update_cpu_topology(bool cpus_locked)
changed = 1;
}
 
+   /*
+* Scheduler maintains a sched_domain_numa_masks table that needs to
+* be updated whenever cpu-node mapping changes. Cpu-node mapping
+* changes only with vphn and prrn.
+*/
+   if (!(topology_inited && (vphn_enabled || prrn_enabled)))
+   goto out;
+
+   for_each_cpu(cpu, &updated_cpus) {
+   sched_cpu_deactivate(cpu);
+   sched_cpu_activate(cpu);
+   }
+
 out:
kfree(updates);
topology_update_needed = 0;
-- 
2.12.3



[PATCH] powerpc/numa: Skip onlining a offline node in kdump path

2018-09-27 Thread Srikar Dronamraju
 enough nodes as online and marks the rest as
offline at boot.  However kdump kernel boots with all available CPUs.
With Commit 2ea626306810 ("powerpc/topology: Get topology for shared
processors at boot"), all CPUs are onlined on their respective nodes at
boot time. try_online_node() tries to online the offline nodes but fails
as all needed subsystems are not yet initialized.

As part of the fix, detect and skip early onlining of an offline node.

Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
Reported-by: Pavithra Prakash 
Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index e94148a1d7e4..d88139acdfe6 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1217,9 +1217,10 @@ int find_and_online_cpu_nid(int cpu)
 * Need to ensure that NODE_DATA is initialized for a node from
 * available memory (see memblock_alloc_try_nid). If unable to
 * init the node, then default to nearest node that has memory
-* installed.
+* installed. Skip onlining a node if the subsystems are not
+* yet initialized.
 */
-   if (try_online_node(new_nid))
+   if (!topology_inited || try_online_node(new_nid))
new_nid = first_online_node;
 #else
/*
-- 
2.12.3



[PATCH] powerpc/numa: Use associativity if VPHN hcall is successful

2018-09-25 Thread Srikar Dronamraju
Currently associativity is used to look up the node-id even if the
preceding VPHN hcall failed. However, this can cause a CPU to be made
part of the wrong node (most likely node 0). This is because VPHN is not
enabled on kvm guests.

With 2ea6263 ("powerpc/topology: Get topology for shared processors at
boot"), the stale associativity buffer is then used to set the node, so
KVM guest topology is broken.

For example : A 4 node KVM guest before would have reported.

[root@localhost ~]#  numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 1746 MB
node 0 free: 1604 MB
node 1 cpus: 4 5 6 7
node 1 size: 2044 MB
node 1 free: 1765 MB
node 2 cpus: 8 9 10 11
node 2 size: 2044 MB
node 2 free: 1837 MB
node 3 cpus: 12 13 14 15
node 3 size: 2044 MB
node 3 free: 1903 MB
node distances:
node   0   1   2   3
  0:  10  40  40  40
  1:  40  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

would now report

[root@localhost ~]# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 1746 MB
node 0 free: 1244 MB
node 1 cpus:
node 1 size: 2044 MB
node 1 free: 2032 MB
node 2 cpus: 1
node 2 size: 2044 MB
node 2 free: 2028 MB
node 3 cpus:
node 3 size: 2044 MB
node 3 free: 2032 MB
node distances:
node   0   1   2   3
  0:  10  40  40  40
  1:  40  10  40  40
  2:  40  40  10  40
  3:  40  40  40  10

Fix this by skipping associativity lookup if the VPHN hcall failed.

Signed-off-by: Srikar Dronamraju 
---
 arch/powerpc/mm/numa.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 35ac5422903a..e94148a1d7e4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1204,7 +1204,9 @@ int find_and_online_cpu_nid(int cpu)
int new_nid;
 
/* Use associativity from first thread for all siblings */
-   vphn_get_associativity(cpu, associativity);
+   if (vphn_get_associativity(cpu, associativity))
+   return cpu_to_node(cpu);
+
new_nid = associativity_to_nid(associativity);
if (new_nid < 0 || !node_possible(new_nid))
new_nid = first_online_node;
-- 
2.12.3



[tip:sched/core] sched/topology: Set correct NUMA topology type

2018-09-10 Thread tip-bot for Srikar Dronamraju
Commit-ID:  e5e96fafd9028b1478b165db78c52d981c14f471
Gitweb: https://git.kernel.org/tip/e5e96fafd9028b1478b165db78c52d981c14f471
Author: Srikar Dronamraju 
AuthorDate: Fri, 10 Aug 2018 22:30:18 +0530
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 10:13:45 +0200

sched/topology: Set correct NUMA topology type

With the following commit:

  051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")

the scheduler introduced a new NUMA level. However, this leads to the
NUMA topology on 2-node systems no longer being marked as NUMA_DIRECT.

After this commit, it gets reported as NUMA_BACKPLANE, because
sched_domains_numa_levels is now 2 on 2-node systems.

Fix this by allowing systems with up to 2 NUMA levels to be marked as
NUMA_DIRECT.

While here, remove code that assumes that level can be 0.

Signed-off-by: Srikar Dronamraju 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andre Wild 
Cc: Heiko Carstens 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Michael Ellerman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Suravee Suthikulpanit 
Cc: Thomas Gleixner 
Cc: linuxppc-dev 
Fixes: 051f3ca02e46 "Introduce NUMA identity node sched domain"
Link: 
http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-sri...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/topology.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 56a0fed30c0a..505a41c42b96 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1295,7 +1295,7 @@ static void init_numa_topology_type(void)
 
n = sched_max_numa_distance;
 
-   if (sched_domains_numa_levels <= 1) {
+   if (sched_domains_numa_levels <= 2) {
sched_numa_topology_type = NUMA_DIRECT;
return;
}
@@ -1380,9 +1380,6 @@ void sched_init_numa(void)
break;
}
 
-   if (!level)
-   return;
-
/*
 * 'level' contains the number of unique distances
 *


Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch

2018-08-31 Thread Srikar Dronamraju
* Peter Zijlstra  [2018-08-31 13:26:39]:

> On Fri, Aug 31, 2018 at 01:12:53PM +0200, Peter Zijlstra wrote:
> > NAK, not until you've fixed every cpu_to_node() user in the kernel to
> > deal with that mask changing.
> 
> Also, what happens if userspace reads that information; uses libnuma and
> then you go and shift the world underneath their feet?
> 
> > This is absolutely insane.
> 

The topology events are supposed to be very rare. From whatever small
experiments I have done till now, unless tasks are bound to both cpu and
memory, they seem to cope well with topology updates. I know things
weren't optimal after a topology change, but they worked. Now after
051f3ca02e46 ("Introduce NUMA identity node sched domain"), systems
stall. I am only exploring ways to keep them working as much as they
were before that commit.

-- 
Thanks and Regards
Srikar Dronamraju


