Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-15 Thread Nishanth Aravamudan
On 15.07.2015 [16:35:16 -0400], Tejun Heo wrote:
 Hello,
 
 On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
  we currently emit at boot:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
  
  After this commit, we correctly emit:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
 
 JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
 allocation group numbers and they're not split according to nodes but
 percpu allocation units.  In both cases, there are two units each
 serving 0-3 and 4-7.  In the above case, because it wasn't being fed
 the correct NUMA information, both got assigned to the same group.  In
 the latter, they got assigned to different ones but even then if the
 group numbers match NUMA node numbers, that's just a coincidence.

Ok, thank you for clarifying! From a correctness perspective, even if
the numbers don't match NUMA nodes, should we expect the grouping to be
split along NUMA topology?

-Nish

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-15 Thread Tejun Heo
Hello,

On Wed, Jul 15, 2015 at 03:43:51PM -0700, Nishanth Aravamudan wrote:
 Ok, thank you for clarifying! From a correctness perspective, even if
 the numbers don't match NUMA nodes, should we expect the grouping to be
 split along NUMA topology?

Yeap, the groups get formed according to the node distances.  Nodes
which are not at LOCAL_DISTANCE are always put in different groups.

Thanks.

-- 
tejun

Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-15 Thread Tejun Heo
Hello,

On Thu, Jul 02, 2015 at 04:02:02PM -0700, Nishanth Aravamudan wrote:
 we currently emit at boot:
 
 [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
 
 After this commit, we correctly emit:
 
 [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 

JFYI, the numbers in the brackets aren't NUMA node numbers but percpu
allocation group numbers and they're not split according to nodes but
percpu allocation units.  In both cases, there are two units each
serving 0-3 and 4-7.  In the above case, because it wasn't being fed
the correct NUMA information, both got assigned to the same group.  In
the latter, they got assigned to different ones but even then if the
group numbers match NUMA node numbers, that's just a coincidence.

Thanks.

-- 
tejun

Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-14 Thread David Rientjes
On Fri, 10 Jul 2015, Nishanth Aravamudan wrote:

  After the percpu areas are initialized and cpu_to_node() is correct, it 
  would be really nice to be able to make numa_cpu_lookup_table[] be 
  __initdata since it shouldn't be necessary anymore.  That probably has cpu 
  callbacks that need to be modified to no longer look at 
  numa_cpu_lookup_table[] or pass the value in, but it would make it much 
  cleaner.  Then nobody will have to worry about figuring out whether 
  early_cpu_to_node() or cpu_to_node() is the right one to call.
 
 When I worked on the original pcpu patches for power, I wanted to do
 this, but got myself confused and never came back to it. Thank you for
 suggesting it and I'll work on it soon.
 

Great, thanks for taking it on!  I have powerpc machines so I can test 
this and try to help where possible.

Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-10 Thread Nishanth Aravamudan
On 08.07.2015 [18:22:09 -0700], David Rientjes wrote:
 On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:
 
  Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
  have an ordering issue during boot with early calls to cpu_to_node().
  The value returned by those calls now depends on the per-CPU area being
  set up, but that is not guaranteed to be the case during boot. Instead,
  we need to add an early_cpu_to_node() which doesn't use the per-CPU area
  and call that from certain spots that are known to invoke cpu_to_node()
  before the per-CPU areas are configured.
  
  On an example 2-node NUMA system with the following topology:
  
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3
  node 0 size: 2029 MB
  node 0 free: 1753 MB
  node 1 cpus: 4 5 6 7
  node 1 size: 2045 MB
  node 1 free: 1945 MB
  node distances:
  node   0   1 
    0:  10  40 
    1:  40  10 
  
  we currently emit at boot:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
  
  After this commit, we correctly emit:
  
  [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
  
  Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
  
  diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
  index 5f1048e..f2c4c89 100644
  --- a/arch/powerpc/include/asm/topology.h
  +++ b/arch/powerpc/include/asm/topology.h
  @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
   extern int __node_distance(int, int);
   #define node_distance(a, b) __node_distance(a, b)
   
  +extern int early_cpu_to_node(int);
  +
   extern void __init dump_numa_cpu_topology(void);
   
   extern int sysfs_add_device_to_node(struct device *dev, int nid);
  diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
  index c69671c..23a2cf3 100644
  --- a/arch/powerpc/kernel/setup_64.c
  +++ b/arch/powerpc/kernel/setup_64.c
  @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
   
   static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
   {
  -   return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
  -   __pa(MAX_DMA_ADDRESS));
  +   return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
  +   align, __pa(MAX_DMA_ADDRESS));
   }
   
   static void __init pcpu_fc_free(void *ptr, size_t size)
  @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
   
   static int pcpu_cpu_distance(unsigned int from, unsigned int to)
   {
  -   if (cpu_to_node(from) == cpu_to_node(to))
  +   if (early_cpu_to_node(from) == early_cpu_to_node(to))
  return LOCAL_DISTANCE;
  else
  return REMOTE_DISTANCE;
  diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
  index 5e80621..9ffabf4 100644
  --- a/arch/powerpc/mm/numa.c
  +++ b/arch/powerpc/mm/numa.c
  @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
  cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
   }
   
  +int early_cpu_to_node(int cpu)
  +{
  +   return numa_cpu_lookup_table[cpu];
  +}
  +
   #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
   static void unmap_cpu_from_node(unsigned long cpu)
   {
  
  
 
 early_cpu_to_node() looks like it's begging to be __init since we 
 shouldn't have a need to reference numa_cpu_lookup_table after boot, and 
 that looks like it can be done if pcpu_cpu_distance() is made __init in 
 this patch and smp_prepare_boot_cpu() is made __init in the next patch.  
 So I think this is fine, but those functions and things like 
 reset_numa_cpu_lookup_table() should be in init.text.

Yep, that makes total sense!

 After the percpu areas are initialized and cpu_to_node() is correct, it 
 would be really nice to be able to make numa_cpu_lookup_table[] be 
 __initdata since it shouldn't be necessary anymore.  That probably has cpu 
 callbacks that need to be modified to no longer look at 
 numa_cpu_lookup_table[] or pass the value in, but it would make it much 
 cleaner.  Then nobody will have to worry about figuring out whether 
 early_cpu_to_node() or cpu_to_node() is the right one to call.

When I worked on the original pcpu patches for power, I wanted to do
this, but got myself confused and never came back to it. Thank you for
suggesting it and I'll work on it soon.

-Nish


Re: [RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-08 Thread David Rientjes
On Thu, 2 Jul 2015, Nishanth Aravamudan wrote:

 Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
 have an ordering issue during boot with early calls to cpu_to_node().
 The value returned by those calls now depends on the per-CPU area being
 set up, but that is not guaranteed to be the case during boot. Instead,
 we need to add an early_cpu_to_node() which doesn't use the per-CPU area
 and call that from certain spots that are known to invoke cpu_to_node()
 before the per-CPU areas are configured.
 
 On an example 2-node NUMA system with the following topology:
 
 available: 2 nodes (0-1)
 node 0 cpus: 0 1 2 3
 node 0 size: 2029 MB
 node 0 free: 1753 MB
 node 1 cpus: 4 5 6 7
 node 1 size: 2045 MB
 node 1 free: 1945 MB
 node distances:
 node   0   1 
   0:  10  40 
   1:  40  10 
 
 we currently emit at boot:
 
 [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 
 
 After this commit, we correctly emit:
 
 [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 
 
 Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
 
 diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
 index 5f1048e..f2c4c89 100644
 --- a/arch/powerpc/include/asm/topology.h
 +++ b/arch/powerpc/include/asm/topology.h
 @@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
  extern int __node_distance(int, int);
  #define node_distance(a, b) __node_distance(a, b)
  
 +extern int early_cpu_to_node(int);
 +
  extern void __init dump_numa_cpu_topology(void);
  
  extern int sysfs_add_device_to_node(struct device *dev, int nid);
 diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
 index c69671c..23a2cf3 100644
 --- a/arch/powerpc/kernel/setup_64.c
 +++ b/arch/powerpc/kernel/setup_64.c
 @@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
  
  static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
  {
 - return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
 - __pa(MAX_DMA_ADDRESS));
 + return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
 + align, __pa(MAX_DMA_ADDRESS));
  }
  
  static void __init pcpu_fc_free(void *ptr, size_t size)
 @@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
  
  static int pcpu_cpu_distance(unsigned int from, unsigned int to)
  {
 - if (cpu_to_node(from) == cpu_to_node(to))
 + if (early_cpu_to_node(from) == early_cpu_to_node(to))
   return LOCAL_DISTANCE;
   else
   return REMOTE_DISTANCE;
 diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
 index 5e80621..9ffabf4 100644
 --- a/arch/powerpc/mm/numa.c
 +++ b/arch/powerpc/mm/numa.c
 @@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
   cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
  }
  
 +int early_cpu_to_node(int cpu)
 +{
 + return numa_cpu_lookup_table[cpu];
 +}
 +
  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
  static void unmap_cpu_from_node(unsigned long cpu)
  {
 
 

early_cpu_to_node() looks like it's begging to be __init since we 
shouldn't have a need to reference numa_cpu_lookup_table after boot, and 
that looks like it can be done if pcpu_cpu_distance() is made __init in 
this patch and smp_prepare_boot_cpu() is made __init in the next patch.  
So I think this is fine, but those functions and things like 
reset_numa_cpu_lookup_table() should be in init.text.

After the percpu areas are initialized and cpu_to_node() is correct, it 
would be really nice to be able to make numa_cpu_lookup_table[] be 
__initdata since it shouldn't be necessary anymore.  That probably has cpu 
callbacks that need to be modified to no longer look at 
numa_cpu_lookup_table[] or pass the value in, but it would make it much 
cleaner.  Then nobody will have to worry about figuring out whether 
early_cpu_to_node() or cpu_to_node() is the right one to call.

[RFC PATCH 1/2] powerpc/numa: fix cpu_to_node() usage during boot

2015-07-02 Thread Nishanth Aravamudan
Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID, we
have an ordering issue during boot with early calls to cpu_to_node().
The value returned by those calls now depends on the per-CPU area being
set up, but that is not guaranteed to be the case during boot. Instead,
we need to add an early_cpu_to_node() which doesn't use the per-CPU area
and call that from certain spots that are known to invoke cpu_to_node()
before the per-CPU areas are configured.

On an example 2-node NUMA system with the following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2029 MB
node 0 free: 1753 MB
node 1 cpus: 4 5 6 7
node 1 size: 2045 MB
node 1 free: 1945 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

we currently emit at boot:

[0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7 

After this commit, we correctly emit:

[0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7 

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 5f1048e..f2c4c89 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -39,6 +39,8 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
+extern int early_cpu_to_node(int);
+
 extern void __init dump_numa_cpu_topology(void);
 
 extern int sysfs_add_device_to_node(struct device *dev, int nid);
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index c69671c..23a2cf3 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -715,8 +715,8 @@ void __init setup_arch(char **cmdline_p)
 
 static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
 {
-   return __alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu)), size, align,
-   __pa(MAX_DMA_ADDRESS));
+   return __alloc_bootmem_node(NODE_DATA(early_cpu_to_node(cpu)), size,
+   align, __pa(MAX_DMA_ADDRESS));
 }
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
@@ -726,7 +726,7 @@ static void __init pcpu_fc_free(void *ptr, size_t size)
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
 {
-	if (cpu_to_node(from) == cpu_to_node(to))
+	if (early_cpu_to_node(from) == early_cpu_to_node(to))
 		return LOCAL_DISTANCE;
 	else
 		return REMOTE_DISTANCE;
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 5e80621..9ffabf4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -157,6 +157,11 @@ static void map_cpu_to_node(int cpu, int node)
 	cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
 }
 
+int early_cpu_to_node(int cpu)
+{
+   return numa_cpu_lookup_table[cpu];
+}
+
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
 static void unmap_cpu_from_node(unsigned long cpu)
 {
