Re: [tip:sched/urgent] sched: Fix crash in sched_init_numa()

2016-02-08 Thread Raghavendra K T

On 01/19/2016 07:08 PM, tip-bot for Raghavendra K T wrote:

Commit-ID:  9c03ee147193645be4c186d3688232fa438c57c7
Gitweb: http://git.kernel.org/tip/9c03ee147193645be4c186d3688232fa438c57c7
Author: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
AuthorDate: Sat, 16 Jan 2016 00:31:23 +0530
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 19 Jan 2016 08:42:20 +0100

sched: Fix crash in sched_init_numa()

The following PowerPC commit:

   c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for non-existent nodes.

But with DEBUG_PER_CPU_MAPS=y enabled, my PowerNV system failed to boot,
because sched_init_numa() performed a cpumask_or() operation on the masks
of unallocated nodes.

Fix that by performing the cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Tested-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Cc: <gk...@linux.vnet.ibm.com>
Cc: <grant.lik...@linaro.org>
Cc: <nik...@linux.vnet.ibm.com>
Cc: <vdavy...@parallels.com>
Cc: <linuxppc-dev@lists.ozlabs.org>
Cc: <linux...@kvack.org>
Cc: <pet...@infradead.org>
Cc: <b...@kernel.crashing.org>
Cc: <pau...@samba.org>
Cc: <m...@ellerman.id.au>
Cc: <an...@samba.org>
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
  kernel/sched/core.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)

sched_domains_numa_masks[i][j] = mask;

-   for (k = 0; k < nr_node_ids; k++) {
+   for_each_node(k) {
if (node_distance(j, k) > sched_domains_numa_distance[i])
	continue;
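
To see why the iterator choice matters: for_each_node() walks only the
node ids that are actually set in the possible-node map, while the replaced
loop visited every id below nr_node_ids, including the holes of a sparsely
numbered topology. A minimal user-space sketch of the difference (the mask
value and node ids below are illustrative, not the kernel's):

#include <stdio.h>

/* Illustrative possible-node mask for a sparse topology: nodes 0,1,16,17. */
#define MAX_NODES 32
static const unsigned long possible_mask =
	(1UL << 0) | (1UL << 1) | (1UL << 16) | (1UL << 17);
static const int nr_node_ids = 18;	/* highest node id + 1 */

static int node_exists(int n)
{
	return (possible_mask >> n) & 1;
}

int main(void)
{
	int k;

	/* Old loop: also visits ids 2..15, whose masks were never allocated. */
	for (k = 0; k < nr_node_ids; k++)
		printf("raw loop visits node %2d%s\n", k,
		       node_exists(k) ? "" : "  <-- unallocated");

	/* for_each_node() analogue: only existing nodes are visited. */
	for (k = 0; k < MAX_NODES; k++)
		if (node_exists(k))
			printf("for_each_node visits node %d\n", k);

	return 0;
}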






Hello Greg,
The above commit fixes a debug-kernel crash in the 4.4 kernel (when
DEBUG_PER_CPU_MAPS=y, to be precise). This is a regression in 4.4
relative to 4.3 and should ideally be present in 4.4-stable.

Could you please pull in this change?



Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)

2016-01-15 Thread Raghavendra K T
* Jan Stancek <jstan...@redhat.com> [2016-01-09 18:03:55]:

> Hi,
> 
> I'm seeing a bare-metal ppc64le system crash early during boot
> with the latest upstream kernel (4.4.0-rc8):
> 
> # git describe
> v4.4-rc8-96-g751e5f5
> 
> [0.625451] Unable to handle kernel paging request for data at address 0x
> [0.625586] Faulting instruction address: 0xc04ae000
> [0.625698] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.625789] SMP NR_CPUS=2048 NUMA PowerNV
> [0.625879] Modules linked in:
> [0.625973] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-rc8+ #6
> [0.626087] task: c02ff430 ti: c02ff6084000 task.ti: c02ff6084000
> [0.626224] NIP: c04ae000 LR: c090b9e4 CTR: 0003
> [0.626361] REGS: c02ff6087930 TRAP: 0300   Not tainted  (4.4.0-rc8+)
> [0.626475] MSR: 90019033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48002044  XER: 2000
> [0.626808] CFAR: c0008468 DAR:  DSISR: 4000 SOFTE: 1
> GPR00: c090b9ac c02ff6087bb0 c1700900 c03ff229e080
> GPR04: c03ff229e080  0003 0001
> GPR08:   0010 90011003
> GPR12: 2200 cfb4 c000bd68 0002
> GPR16: 0028 c0b25940 c173ffa4 
> GPR20: c0b259d8 c0b259e0 c0b259e8 
> GPR24: c03ff229e080  c189b180 
> GPR28:  c1740a94 0002 0002
> [0.627925] NIP [c04ae000] __bitmap_or+0x30/0x50
> [0.627973] LR [c090b9e4] sched_init_numa+0x440/0x7c8
> [0.628030] Call Trace:
> [0.628054] [c02ff6087bb0] [c090b9ac] sched_init_numa+0x408/0x7c8 (unreliable)
> [0.628136] [c02ff6087ca0] [c0c60718] sched_init_smp+0x60/0x238
> [0.628206] [c02ff6087d00] [c0c44294] kernel_init_freeable+0x1fc/0x3b4
> [0.628286] [c02ff6087dc0] [c000bd84] kernel_init+0x24/0x140
> [0.628356] [c02ff6087e30] [c0009544] ret_from_kernel_thread+0x5c/0x98
> [0.628435] Instruction dump:
> [0.628470] 38c6003f 78c9d183 4d820020 38c9 3920 78c60020 38c60001 7cc903a6
> [0.628587] 6000 6000 6000 6042 <7d05482a> 7d44482a 7d0a5378 7d43492a
> [0.628711] ---[ end trace b423f3e02b333fbf ]---
> [0.628757]
> [2.628822] Kernel panic - not syncing: Fatal exception
> [2.628969] Rebooting in 10 seconds..[0.00] OPAL V3 detected !
> 
 
> The crash goes away if I revert the following commit:
>   commit c118baf802562688d46e6002f2b5fe66b947da21
>   Author: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
>   Date:   Thu Nov 5 18:46:29 2015 -0800
>   arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes
>

Something like the patch below should fix it. I'll send it in a separate
email, CCing Peter and Ingo. Basically, the for_each_node() conversion
targeted only slow paths / run-once sort of functions, but it turns out
there was a cpumask_or() in sched_init_numa() that used an unallocated
node.

Sorry for getting back late.. I was overcautious, checking x86/power
with and without DEBUG_PER_CPU_MAPS.

---8<---
From 6680994a5a8dde7eccfbd2bffde341fdff2aed63 Mon Sep 17 00:00:00 2001
From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Date: Fri, 15 Jan 2016 18:19:56 +0530
Subject: [PATCH] Fix: PowerNV crash with 4.4.0-rc8 at sched_init_numa

Commit c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem
memory for non existing nodes") avoided bootmem memory allocation for
non-existent nodes.

When DEBUG_PER_CPU_MAPS is enabled, a PowerNV system failed to boot because
sched_init_numa() performed a cpumask_or() operation on unallocated nodes.
Fix that by performing the cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS on x86 and powerpc ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)
 
sched_domains_numa_masks[i][j] = mask;
 
-   for (k = 0; k < nr_node_ids; k++) {
+   for_each_node(k) {
if (node_distance(j, k) > sched_domains_numa_distance[i])
	continue;
 
-- 
1.7.11.7



Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)

2016-01-11 Thread Raghavendra K T

On 01/10/2016 04:33 AM, Jan Stancek wrote:

Hi,

I'm seeing a bare-metal ppc64le system crash early during boot
with the latest upstream kernel (4.4.0-rc8):



Jan,
Do you mind sharing the .config you used for the kernel?
I am not able to reproduce with the one that I have :(


Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)

2016-01-11 Thread Raghavendra K T

On 01/11/2016 05:22 PM, Raghavendra K T wrote:

On 01/10/2016 04:33 AM, Jan Stancek wrote:

Hi,

I'm seeing a bare-metal ppc64le system crash early during boot
with the latest upstream kernel (4.4.0-rc8):



Jan,
Do you mind sharing the .config you used for the kernel?
I am not able to reproduce with the one that I have :(



Never mind.. I enabled DEBUG_PER_CPU_MAPS to hit that..
/me goes back..


Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)

2016-01-09 Thread Raghavendra K T

On 01/10/2016 04:33 AM, Jan Stancek wrote:

Hi,

I'm seeing a bare-metal ppc64le system crash early during boot
with the latest upstream kernel (4.4.0-rc8):



Hi Jan,
Thanks for reporting. Let me try to reproduce the issue.

(By the way, if you think there is anything special in the .config
that I need for testing, please share.)

- Raghu


# git describe
v4.4-rc8-96-g751e5f5

[0.625451] Unable to handle kernel paging request for data at address 0x
[0.625586] Faulting instruction address: 0xc04ae000
[0.625698] Oops: Kernel access of bad area, sig: 11 [#1]
[0.625789] SMP NR_CPUS=2048 NUMA PowerNV
[0.625879] Modules linked in:
[0.625973] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-rc8+ #6
[0.626087] task: c02ff430 ti: c02ff6084000 task.ti: c02ff6084000
[0.626224] NIP: c04ae000 LR: c090b9e4 CTR: 0003
[0.626361] REGS: c02ff6087930 TRAP: 0300   Not tainted  (4.4.0-rc8+)
[0.626475] MSR: 90019033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48002044  XER: 2000
[0.626808] CFAR: c0008468 DAR:  DSISR: 4000 SOFTE: 1
GPR00: c090b9ac c02ff6087bb0 c1700900 c03ff229e080
GPR04: c03ff229e080  0003 0001
GPR08:   0010 90011003
GPR12: 2200 cfb4 c000bd68 0002
GPR16: 0028 c0b25940 c173ffa4 
GPR20: c0b259d8 c0b259e0 c0b259e8 
GPR24: c03ff229e080  c189b180 
GPR28:  c1740a94 0002 0002
[0.627925] NIP [c04ae000] __bitmap_or+0x30/0x50
[0.627973] LR [c090b9e4] sched_init_numa+0x440/0x7c8
[0.628030] Call Trace:
[0.628054] [c02ff6087bb0] [c090b9ac] sched_init_numa+0x408/0x7c8 (unreliable)
[0.628136] [c02ff6087ca0] [c0c60718] sched_init_smp+0x60/0x238
[0.628206] [c02ff6087d00] [c0c44294] kernel_init_freeable+0x1fc/0x3b4
[0.628286] [c02ff6087dc0] [c000bd84] kernel_init+0x24/0x140
[0.628356] [c02ff6087e30] [c0009544] ret_from_kernel_thread+0x5c/0x98
[0.628435] Instruction dump:
[0.628470] 38c6003f 78c9d183 4d820020 38c9 3920 78c60020 38c60001 7cc903a6
[0.628587] 6000 6000 6000 6042 <7d05482a> 7d44482a 7d0a5378 7d43492a
[0.628711] ---[ end trace b423f3e02b333fbf ]---
[0.628757]
[2.628822] Kernel panic - not syncing: Fatal exception
[2.628969] Rebooting in 10 seconds..[0.00] OPAL V3 detected !

# numactl -H
available: 4 nodes (0-1,16-17)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
node 0 size: 64941 MB
node 0 free: 64210 MB
node 1 cpus: 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 
62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 65456 MB
node 1 free: 62424 MB
node 16 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 16 size: 65457 MB
node 16 free: 65258 MB
node 17 cpus: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 
136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
node 17 size: 65186 MB
node 17 free: 65001 MB
node distances:
node   0   1  16  17
   0:  10  20  40  40
   1:  20  10  40  40
  16:  40  40  10  20
  17:  40  40  20  10

The crash goes away if I revert the following commit:
   commit c118baf802562688d46e6002f2b5fe66b947da21
   Author: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
   Date:   Thu Nov 5 18:46:29 2015 -0800
   arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes

Regards,
Jan






Re: [PATCH] block-mq:Fix the null memory access while setting tags cpumask

2015-10-13 Thread Raghavendra K T

On 10/13/2015 10:17 PM, Jeff Moyer wrote:

Raghavendra K T <raghavendra...@linux.vnet.ibm.com> writes:


In nr_hw_queues > 1 cases, when a certain number of cpus are onlined or
offlined, resulting in a change of the request_queue map in the block-mq
layer, we see the kernel dumping like:


What version is that patch against?  This problem should be fixed by the
patch set from Akinobu Mita.  I tested cpu hot-plugging on my system
with an nvme device, and it is fixed there on mainline.



Hi Jeff,

My patch was against (4.3.0-rc2) bcee19f424a.

I do see now that commit 1356aae08338 ("blk-mq: avoid setting
hctx->tags->cpumask before allocation") has an equivalent fix.

Thanks for the information.



[PATCH] block-mq:Fix the null memory access while setting tags cpumask

2015-10-12 Thread Raghavendra K T
In nr_hw_queues > 1 cases, when a certain number of cpus are onlined or
offlined, resulting in a change of the request_queue map in the block-mq
layer, we see the kernel dumping like:

 BUG: unable to handle kernel NULL pointer dereference at 0080
 IP: [] cpumask_set_cpu+0x6/0xd
 PGD 6d957067 PUD 7604c067 PMD 0
 Oops: 0002 [#1] SMP
 Modules linked in: null_blk
 CPU: 2 PID: 1926 Comm: bash Not tainted 4.3.0-rc2+ #24
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: 8800724cd1c0 ti: 880070a2c000 task.ti: 880070a2c000
 RIP: 0010:[]  [] cpumask_set_cpu+0x6/0xd
 RSP: 0018:880070a2fbc8  EFLAGS: 00010203
 RAX: 880073eedc00 RBX: 88006cc88000 RCX: 88006c06b000
 RDX: 0007 RSI: 0080 RDI: 0008
 RBP: 880070a2fbc8 R08: 88006c06ac00 R09: 88006c06ad48
 R10: 88004ea8 R11: 88006c069650 R12: 88007378fe28
 R13: 0008 R14: e8500200 R15: 81d2a630
 FS:  7fa34803b700() GS:88007cc4() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0080 CR3: 761d2000 CR4: 06e0
 Stack:
  880070a2fc18 8128edec  880073eedc00
  0039 88006cc88000 0007 ffe3
  81cef2c0  880070a2fc38 8129049a
 Call Trace:
  [] blk_mq_map_swqueue+0x9d/0x206
  [] blk_mq_queue_reinit_notify+0xe3/0x144
  [] notifier_call_chain+0x37/0x63
  [] __raw_notifier_call_chain+0xe/0x10
  [] __cpu_notify+0x20/0x32
  [] cpu_notify_nofail+0x13/0x1b
  [] _cpu_down+0x18a/0x264
  [] ? path_put+0x1f/0x23
  [] cpu_down+0x2d/0x3a
  [] cpu_subsys_offline+0x14/0x16
  [] device_offline+0x65/0x94
  [] online_store+0x48/0x68
  [] ? kernfs_fop_write+0x6f/0x143
  [] dev_attr_store+0x20/0x22
  [] sysfs_kf_write+0x3c/0x3e
  [] kernfs_fop_write+0xed/0x143
  [] __vfs_write+0x28/0xa6
  [] ? security_file_permission+0x3c/0x44
  [] ? percpu_down_read+0x21/0x42
  [] ? __sb_start_write+0x24/0x41
  [] vfs_write+0x8d/0xd1
  [] SyS_write+0x59/0x83
  [] entry_SYSCALL_64_fastpath+0x12/0x71
 Code: 03 75 06 65 48 ff 0a eb 1a f0 48 83 af 68 07 00 00 01 74 02 eb 0d 48 8d bf 68 07 00 00 ff 90 78 07 00 00 5d c3 55 89 ff 48 89 e5  48 0f ab 3e 5d c3 0f 1f 44 00 00 55 8b 4e 44 31 d2 8b b7 94
 RIP  [] cpumask_set_cpu+0x6/0xd
  RSP 
 CR2: 0080

How to reproduce:
1. create an 80-vcpu guest (10 cores, 8 threads)
2. modprobe null_blk submit_queues=64
3. for i in 72 73 74 75 76 77 78 79 ; do
 echo 0 > /sys/devices/system/cpu/cpu$i/online;
done

Reason:
We try to set the freed hwctx->tags->cpumask in blk_mq_map_swqueue().
This was introduced by commit f26cdc8536ad ("blk-mq: Shared tag enhancements").

What is happening:
When a certain number of cpus are onlined/offlined, resulting in
blk_mq_update_queue_map(), we could potentially end up with a new mapping
to a hwctx.

The subsequent blk_mq_map_swqueue() of the request_queue then tries to set
hwctx->tags->cpumask, which was already freed by blk_mq_free_rq_map() in an
earlier iteration when it was not mapped.

Fix:
Set hwctx->tags->cpumask only after blk_mq_init_rq_map() is done.

hwctx->tags->cpumask also did not follow hwctx->cpumask after a new mapping,
even in the cases where the new mapping does not cause a problem. That is
also fixed with this change.

This problem was originally found on PowerVM, which had 160 cpus (SMT8) and
128 nr_hw_queues. The dump was easily reproduced by offlining the last core,
and it has been a blocker issue because cpu hotplug is a common case for
DLPAR.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4..39a7834 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,7 +1811,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
hctx = q->mq_ops->map_queue(q, i);
cpumask_set_cpu(i, hctx->cpumask);
-   cpumask_set_cpu(i, hctx->tags->cpumask);
ctx->index_hw = hctx->nr_ctx;
hctx->ctxs[hctx->nr_ctx++] = ctx;
}
@@ -1836,6 +1835,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
if (!set->tags[i])
set->tags[i] = blk_mq_init_rq_map(set, i);
hctx->tags = set->tags[i];
+   cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
WARN_ON(!hctx->tags);
 
/*
-- 
1.7.11.7
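
As a rough user-space analogy of the ordering problem (the structure names
only mirror the kernel's; this is not kernel code): writing into
tags->cpumask before the tag map has been (re)allocated dereferences freed
or NULL memory, while copying from hctx->cpumask after the allocation step
is always safe and also keeps the two masks coherent.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel structures, for illustration only. */
struct tags { unsigned long cpumask; };
struct hctx { unsigned long cpumask; struct tags *tags; };

/* Analogue of blk_mq_init_rq_map(): allocates a fresh tag map. */
static struct tags *init_rq_map(void)
{
	return calloc(1, sizeof(struct tags));
}

int main(void)
{
	struct hctx hctx = { 0, NULL };

	/* Buggy order: hctx.tags is NULL here (or points at a freed map
	 * after a remap), so hctx.tags->cpumask |= ... would crash, just
	 * like cpumask_set_cpu() did in the unpatched blk_mq_map_swqueue(). */

	hctx.cpumask |= 1UL << 3;	/* cpu 3 maps to this hw queue */

	/* Fixed order: allocate the tag map first, then copy the final
	 * hctx mask into it (analogue of the added cpumask_copy()). */
	hctx.tags = init_rq_map();
	hctx.tags->cpumask = hctx.cpumask;

	printf("tags cpumask: %#lx\n", hctx.tags->cpumask);
	free(hctx.tags);
	return 0;
}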


Re: [RFC, 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table

2015-10-06 Thread Raghavendra K T

On 10/06/2015 03:47 PM, Michael Ellerman wrote:

On Sun, 2015-27-09 at 18:29:09 UTC, Raghavendra K T wrote:

We access numa_cpu_lookup_table array directly in all the places
to read/update numa cpu lookup information. Instead use a helper
function to update.

This is helpful for changing the way the numa<-->cpu mapping is done in a
single place when needed.

This is a cosmetic change, no change in functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
  arch/powerpc/include/asm/mmzone.h |  2 +-
  arch/powerpc/kernel/smp.c | 10 +-
  arch/powerpc/mm/numa.c| 28 +---
  3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index 7b58917..c24a5f4 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
   * Following are specific to this numa platform.
   */

-extern int numa_cpu_lookup_table[];
+extern int numa_cpu_lookup(int cpu);


Can you rename it better :)

Something like cpu_to_nid().


Good name. sure.



Although maybe nid is wrong given the rest of the series.


Maybe not. The current plan is to rename (after discussing with Nish):

chipid to pnid (physical nid)
and nid to vnid (virtual nid)

within powerpc numa.c.
[The reasoning: chipid is applicable only to OPAL; since we want to handle
PowerKVM, PowerVM and baremetal, we need a generic name.]

But the 'nid' naming, which is applicable for generic kernel interactions,
will be retained.




diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..d5e6eee 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
  cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
  struct pglist_data *node_data[MAX_NUMNODES];

-EXPORT_SYMBOL(numa_cpu_lookup_table);
  EXPORT_SYMBOL(node_to_cpumask_map);
  EXPORT_SYMBOL(node_data);

@@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
return 0;
  }

-static void reset_numa_cpu_lookup_table(void)
+int numa_cpu_lookup(int cpu)
  {
-   unsigned int cpu;
-
-   for_each_possible_cpu(cpu)
-   numa_cpu_lookup_table[cpu] = -1;
+   return numa_cpu_lookup_table[cpu];
  }
+EXPORT_SYMBOL(numa_cpu_lookup);


I don't see you changing any modular code that uses this, or any macros that
might be used by modules, so I don't see why this needs to be exported?

I think you just added it because num_cpu_lookup_table was exported?



arch/powerpc/kernel/smp.c uses it.


Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-10-06 Thread Raghavendra K T

On 10/06/2015 03:55 PM, Michael Ellerman wrote:

On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:

Problem description:
Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
numbered (possibly) as 0,1,16,17. At a lower level, the chipid we get from
the device tree is naturally mapped (directly) to the nid.

Potential side effect of that is:

1) There are several places in the kernel that assume serial node numbering,
and memory allocations assume that all the nodes from 0 to (highest nid)
exist, in turn ending up allocating memory for nodes that do not exist.


Is it several? Or lots?

If it's several, ie. more than two but not lots, then we should probably just
fix those places. Or is that /really/ hard for some reason?



It is several, and I did attempt to fix them. But the rest of the places
(like memcg, workqueue, scheduler and so on) are tricky to fix because
the memory allocations are glued to other things, and similar fixes may
be expected in the future too..



Do we ever get whole nodes hotplugged in under PowerVM? I don't think so, but I
don't remember for sure.



Even on PowerVM we do have discontiguous numa nodes. [Adding to that, we
could even end up creating a dummy node 0 just to make the kernel happy.]
For e.g.:
available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7
node 7 size: 10240 MB
node 7 free: 8174 MB
node distances:
node   0   7
  0:  10  40
  7:  40  10

Note that node zero has neither cpus nor memory.


2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
the sparse nids of the host system to the contiguous nids of a guest (numa
affinity, placement) could be a challenge.


Can you elaborate? That's a bit vague.


One example I can think of (though libvirt/openstack people will know more
about it): suppose one wishes to bind half of the vcpus to one physical
node and the rest of the vcpus to a second numa node; we can't say whether
the second node is 1, 8, or 16, and the same libvirt XML valid on one
two-node system may not be valid for another two-node system.
[I believe it may cause some migration problems too.]




Possible Solutions:
1) Handling the memory allocations in the kernel case by case: Though in some
cases it is easy to achieve, some cases may be intrusive/not trivial.
In the end it does not handle side effect (2) above.

2) Map the sparse chipid got from device tree to a serial nid at kernel
level (The idea proposed in this series).
Pro: It is more natural to handle at the kernel level than at the lower (OPAL) layer.
Con: The chipid in the device tree is no longer the same as the nid in the kernel.

3) Let the lower layer (OPAL) give the serial node ids after parsing the
chipid and the associativity etc [ either as a separate item in device tree
or by compacting the chipid numbers ]
Pros: kernel, device tree are on same page and less change in kernel
Con: is this the functionality expected of the lower layer?


...


3) Numactl tests from
ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz

(in fact there was more breakage before the patch because of the sparse nid
and memoryless node cases of powerpc)


This is probably the best argument for your series. ie. userspace is dumb and
fixing every broken app that assumes linear node numbering is not feasible.


So on the whole I think the concept is good. This series though is a bit
confusing because of all the renaming etc. etc. Nish made lots of good comments
so I'll wait for a v2 based on those.



Yes, I will be sending v2 soon, extending my patch to fix the PowerVM case too.


Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-09-30 Thread Raghavendra K T

On 09/30/2015 01:16 AM, Denis Kirjanov wrote:

On 9/29/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote:

On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote:

On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:

On 9/27/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote:

Problem description:
Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
got from device tree is naturally mapped (directly) to nid.


Interesting thing to play with, I'll try to test it on my POWER7 box,
but it doesn't have the OPAL layer :(


Hi Denis,
Thanks for your interest. I have pushed the patches to

https://github.com/ktraghavendra/linux/tree/serialnuma_v1 in case that
makes the patches easier to grab.


Thanks!
One sad thing is that I can't test the actual node id mapping now,
since I currently have access to a machine with only one memory node :/
Can we fake it through qemu?



Faking the sparse numa ids is possible with Nish's patch for qemu:

https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg05826.html


Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-09-29 Thread Raghavendra K T

On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote:

On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:

On 9/27/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote:

Problem description:
Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid
got from device tree is naturally mapped (directly) to nid.


Interesting thing to play with, I'll try to test it on my POWER7 box,
but it doesn't have the OPAL layer :(


Hi Denis,
Thanks for your interest. I have pushed the patches to

https://github.com/ktraghavendra/linux/tree/serialnuma_v1 in case that
makes the patches easier to grab.



Note that it's also interesting to try it under PowerVM, with odd NUMA
topologies and report any issues found :)



Thanks Nish, I'll also grab a PowerVM system and test.





Re: [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid

2015-09-29 Thread Raghavendra K T

On 09/28/2015 10:57 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote:

There is no change in the functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
  arch/powerpc/mm/numa.c | 42 +-
  1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d5e6eee..f84ed2f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
}
  }

-/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
+/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
   * info is found.
   */
-static int associativity_to_nid(const __be32 *associativity)
+static int associativity_to_chipid(const __be32 *associativity)


This is confusing to me. This function is also used by the DLPAR code
under PowerVM to indicate what node the CPU is on -- not a chip (which I
don't believe is exposed at all under PowerVM).



Good point.

Should I retain the name nid? Or any suggestions instead of chipid -> nid
that fit both cases? Or should I rename to something like nid -> vnid?
[...]

@@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)

/* Use associativity from first thread for all siblings */
vphn_get_associativity(cpu, associativity);
-   new_nid = associativity_to_nid(associativity);
+   new_nid = associativity_to_chipid(associativity);


If you are getting a chipid, shouldn't you be assigning it to a variable
called 'new_chipid'?


Yes, perhaps. My splitting idea was:
1. change the nid name in functions to chipid (without changing the nid
variables calling those functions)
2. rename the variables to chipid and assign nid = chipid (1:1 mapping)
3. now let nid = mapped chipid

But I see that it isn't consistent in some places. Do you think
merging step 1 and step 2 is okay?


Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid

2015-09-29 Thread Raghavendra K T

On 09/28/2015 10:58 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:

Once we have made the distinction between nid and chipid,
create a 1:1 mapping between them. This makes compacting the
nids easier later.



Didn't the previous patch just do the opposite of...



As per my thinking, it was:
1. rename functions to say loudly that it is a chipid (and not a nid)
2. and then assign nid = chipid, so that we are clear that
we made a 1:1 nid:chipid mapping, and compact the nids later.

But again, maybe I should combine patches 2 and 3.


Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-09-29 Thread Raghavendra K T

On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:

[...]


2) Map the sparse chipid got from device tree to a serial nid at kernel
level (The idea proposed in this series).
Pro: It is more natural to handle at the kernel level than at the lower (OPAL) layer.
Con: The chipid in the device tree is no longer the same as the nid in the kernel.


Is there any debugging/logging? Looks like not -- so how does a sysadmin
map from firmware-provided values to the Linux values? That's going to
make debugging of large systems (PowerVM or otherwise) less than
pleasant, it seems? Possibly you could put something in sysfs?


I see 2 things that could be done here:

1) While doing dump_numa_cpu_topology() we can dump nid_to_chipid()
as additional information.

2) sysfs:
Does /sys/devices/system/node/nodeX/*chipid* look good? Maybe we
should add it only for powerpc; otherwise we need to have chipid = nid
populated for the other archs. [I think this change can be done slowly.]






Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid

2015-09-29 Thread Raghavendra K T

On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:

Once we have made the distinction between nid and chipid,
create a 1:1 mapping between them. This makes compacting the
nids easier later.

No functionality change.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
  arch/powerpc/mm/numa.c | 36 +---
  1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f84ed2f..dd2073b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -264,6 +264,17 @@ out:
return chipid;
  }

+
+ /* Return the nid from associativity */
+static int associativity_to_nid(const __be32 *associativity)
+{
+   int nid;
+
+   nid = associativity_to_chipid(associativity);
+   return nid;
+}


This is ultimately confusing. You are assigning the semantic return
value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the
variable naming be consistent?



:( yes. will come up with some consistent naming.


Re: [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping

2015-09-29 Thread Raghavendra K T

On 09/28/2015 11:02 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote:

Create arrays that maps serial nids and sparse chipids.

Note: My original idea had only two arrays for the chipid to nid map. The
final code is inspired by drivers/acpi/numa.c, which maps a proximity node
to a logical node (by Takayoshi Kochi <t-ko...@bq.jp.nec.com>), and thus
uses an additional chipid_map nodemask. The mask helps find the first
unused nid easily, by knowing the first unset bit in the mask.

No change in functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
  arch/powerpc/mm/numa.c | 48 +++-
  1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index dd2073b..f015cad 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -63,6 +63,11 @@ static int form1_affinity;
  static int distance_ref_points_depth;
  static const __be32 *distance_ref_points;
  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static nodemask_t chipid_map = NODE_MASK_NONE;
+static int chipid_to_nid_map[MAX_NUMNODES]
+   = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };


Hrm, conceptually there are *more* chips than nodes, right? So what
guarantees we won't see > MAX_NUMNODES chips?


You are correct that #nids <= #chipids,
and #nids = #chipids when all possible slots are populated. Considering
we assume that the maximum number of chip slots is no more than MAX_NUMNODES,

how about having

#define MAX_CHIPNODES MAX_NUMNODES
and
chipid_to_nid_map[MAX_CHIPNODES] = { [0 ... MAX_CHIPNODES - 1] = ..




+static int nid_to_chipid_map[MAX_NUMNODES]
+   = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };

  /*
   * Allocate node_to_cpumask_map based on number of available nodes
@@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long 
end_pfn,
return 0;
  }

+int chipid_to_nid(int chipid)
+{
+   if (chipid < 0)
+   return NUMA_NO_NODE;


Do you really want to support these cases? Or should they be
bugs/warnings indicating that you got an unexpected input? Or at least
WARN_ON_ONCE?



Right. Querying the nid of an invalid chipid should at least be a
WARN_ON_ONCE(). But I'll check first whether there is any valid scenario
before making the change.


+   return chipid_to_nid_map[chipid];
+}
+
+int nid_to_chipid(int nid)
+{
+   if (nid < 0)
+   return NUMA_NO_NODE;
+   return nid_to_chipid_map[nid];
+}
+
+static void __map_chipid_to_nid(int chipid, int nid)
+{
+   if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
+|| nid < chipid_to_nid_map[chipid])
+   chipid_to_nid_map[chipid] = nid;
+   if (nid_to_chipid_map[nid] == NUMA_NO_NODE
+   || chipid < nid_to_chipid_map[nid])
+   nid_to_chipid_map[nid] = chipid;
+}


chip <-> node mapping is a static (physical) concept, right? Should we
emit some debugging if for some reason we get a runtime call to remap
an already mapped chip to a new node?



Good point. Mapping an already-mapped chipid to a different nid is
unexpected, whereas mapping a chipid to the same nid is expected (because
the mapping comes from cpus belonging to the same node).

Would WARN_ON() suffice here?



+
+int map_chipid_to_nid(int chipid)
+{
+   int nid;
+
+   if (chipid < 0 || chipid >= MAX_NUMNODES)
+   return NUMA_NO_NODE;
+
+   nid = chipid_to_nid_map[chipid];
+   if (nid == NUMA_NO_NODE) {
+   if (nodes_weight(chipid_map) >= MAX_NUMNODES)
+   return NUMA_NO_NODE;


If you create a KVM guest with a bogus topology, doesn't this just start
losing NUMA information for very high-noded guests?



I'll try to see if it is possible to hit this case; ideally we should
not allow more than MAX_NUMNODES chipids, and we should abort early.


+   nid = first_unset_node(chipid_map);
+   __map_chipid_to_nid(chipid, nid);
+   node_set(nid, chipid_map);
+   }
+   return nid;
+}
+
  int numa_cpu_lookup(int cpu)
  {
return numa_cpu_lookup_table[cpu];
@@ -264,7 +311,6 @@ out:
return chipid;
  }

-


stray change?



yep, will correct that.



Re: [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table

2015-09-27 Thread Raghavendra K T

On 09/27/2015 11:59 PM, Raghavendra K T wrote:

We access numa_cpu_lookup_table array directly in all the places
to read/update numa cpu lookup information. Instead use a helper
function to update.

This is helpful for changing the way the numa<-->cpu mapping is done in a
single place when needed.

This is a cosmetic change, no change in functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com>

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>

---

Sorry: a changelog edit messed it up.. :(


  arch/powerpc/include/asm/mmzone.h |  2 +-
  arch/powerpc/kernel/smp.c | 10 +-
  arch/powerpc/mm/numa.c| 28 +---
  3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index 7b58917..c24a5f4 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
   * Following are specific to this numa platform.
   */

-extern int numa_cpu_lookup_table[];
+extern int numa_cpu_lookup(int cpu);
  extern cpumask_var_t node_to_cpumask_map[];
  #ifdef CONFIG_MEMORY_HOTPLUG
  extern unsigned long max_pfn;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec9ec20..56fbe9e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 * numa_node_id() works after this.
 */
if (cpu_present(cpu)) {
-   set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+   set_cpu_numa_node(cpu, numa_cpu_lookup(cpu));
set_cpu_numa_mem(cpu,
-   local_memory_node(numa_cpu_lookup_table[cpu]));
+   local_memory_node(numa_cpu_lookup(cpu)));
}
}

@@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
  #ifdef CONFIG_PPC64
paca[boot_cpuid].__current = current;
  #endif
-   set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
+   set_numa_node(numa_cpu_lookup(boot_cpuid));
current_set[boot_cpuid] = task_thread_info(current);
  }

@@ -718,8 +718,8 @@ void start_secondary(void *unused)
}
traverse_core_siblings(cpu, true);

-   set_numa_node(numa_cpu_lookup_table[cpu]);
-   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+   set_numa_node(numa_cpu_lookup(cpu));
+   set_numa_mem(local_memory_node(numa_cpu_lookup(cpu)));

smp_wmb();
notify_cpu_starting(cpu);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..d5e6eee 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
  cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
  struct pglist_data *node_data[MAX_NUMNODES];

-EXPORT_SYMBOL(numa_cpu_lookup_table);
  EXPORT_SYMBOL(node_to_cpumask_map);
  EXPORT_SYMBOL(node_data);

@@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
return 0;
  }

-static void reset_numa_cpu_lookup_table(void)
+int numa_cpu_lookup(int cpu)
  {
-   unsigned int cpu;
-
-   for_each_possible_cpu(cpu)
-   numa_cpu_lookup_table[cpu] = -1;
+   return numa_cpu_lookup_table[cpu];
  }
+EXPORT_SYMBOL(numa_cpu_lookup);

-static void update_numa_cpu_lookup_table(unsigned int cpu, int node)
+static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node)
  {
numa_cpu_lookup_table[cpu] = node;
  }

+static void reset_numa_cpu_lookup_table(void)
+{
+   unsigned int cpu;
+
+   for_each_possible_cpu(cpu)
+   update_numa_cpu_lookup_table(cpu, -1);
+}
+
  static void map_cpu_to_node(int cpu, int node)
  {
update_numa_cpu_lookup_table(cpu, node);
@@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node)
  #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
  static void unmap_cpu_from_node(unsigned long cpu)
  {
-   int node = numa_cpu_lookup_table[cpu];
+   int node = numa_cpu_lookup(cpu);

dbg("removing cpu %lu from node %d\n", cpu, node);

@@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu)
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup(lcpu);
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void)
if (new_nid < 0 || !node_online(new_nid))
new_nid = first_online_node;

-   if (new_nid == numa_cpu_lookup_table[cpu]) {
+   if (new_nid == numa_cpu_lookup(cpu)) {

[PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table

2015-09-27 Thread Raghavendra K T
We access numa_cpu_lookup_table array directly in all the places
to read/update numa cpu lookup information. Instead use a helper
function to update.

This is helpful for changing the way the numa<-->cpu mapping is done in a
single place when needed.

This is a cosmetic change, no change in functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com>
---
 arch/powerpc/include/asm/mmzone.h |  2 +-
 arch/powerpc/kernel/smp.c | 10 +-
 arch/powerpc/mm/numa.c| 28 +---
 3 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
index 7b58917..c24a5f4 100644
--- a/arch/powerpc/include/asm/mmzone.h
+++ b/arch/powerpc/include/asm/mmzone.h
@@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
  * Following are specific to this numa platform.
  */
 
-extern int numa_cpu_lookup_table[];
+extern int numa_cpu_lookup(int cpu);
 extern cpumask_var_t node_to_cpumask_map[];
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern unsigned long max_pfn;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ec9ec20..56fbe9e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 * numa_node_id() works after this.
 */
if (cpu_present(cpu)) {
-   set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
+   set_cpu_numa_node(cpu, numa_cpu_lookup(cpu));
set_cpu_numa_mem(cpu,
-   local_memory_node(numa_cpu_lookup_table[cpu]));
+   local_memory_node(numa_cpu_lookup(cpu)));
}
}
 
@@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void)
 #ifdef CONFIG_PPC64
paca[boot_cpuid].__current = current;
 #endif
-   set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
+   set_numa_node(numa_cpu_lookup(boot_cpuid));
current_set[boot_cpuid] = task_thread_info(current);
 }
 
@@ -718,8 +718,8 @@ void start_secondary(void *unused)
}
traverse_core_siblings(cpu, true);
 
-   set_numa_node(numa_cpu_lookup_table[cpu]);
-   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+   set_numa_node(numa_cpu_lookup(cpu));
+   set_numa_mem(local_memory_node(numa_cpu_lookup(cpu)));
 
smp_wmb();
notify_cpu_starting(cpu);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..d5e6eee 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 struct pglist_data *node_data[MAX_NUMNODES];
 
-EXPORT_SYMBOL(numa_cpu_lookup_table);
 EXPORT_SYMBOL(node_to_cpumask_map);
 EXPORT_SYMBOL(node_data);
 
@@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
return 0;
 }
 
-static void reset_numa_cpu_lookup_table(void)
+int numa_cpu_lookup(int cpu)
 {
-   unsigned int cpu;
-
-   for_each_possible_cpu(cpu)
-   numa_cpu_lookup_table[cpu] = -1;
+   return numa_cpu_lookup_table[cpu];
 }
+EXPORT_SYMBOL(numa_cpu_lookup);
 
-static void update_numa_cpu_lookup_table(unsigned int cpu, int node)
+static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node)
 {
numa_cpu_lookup_table[cpu] = node;
 }
 
+static void reset_numa_cpu_lookup_table(void)
+{
+   unsigned int cpu;
+
+   for_each_possible_cpu(cpu)
+   update_numa_cpu_lookup_table(cpu, -1);
+}
+
 static void map_cpu_to_node(int cpu, int node)
 {
update_numa_cpu_lookup_table(cpu, node);
@@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node)
 #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR)
 static void unmap_cpu_from_node(unsigned long cpu)
 {
-   int node = numa_cpu_lookup_table[cpu];
+   int node = numa_cpu_lookup(cpu);
 
dbg("removing cpu %lu from node %d\n", cpu, node);
 
@@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu)
 * directly instead of querying the firmware, since it represents
 * the most recent mapping notified to us by the platform (eg: VPHN).
 */
-   if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) {
+   nid = numa_cpu_lookup(lcpu);
+   if (nid >= 0) {
map_cpu_to_node(lcpu, nid);
return nid;
}
@@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void)
if (new_nid < 0 || !node_online(new_nid))
new_nid = first_online_node;
 
-   if (new_nid == numa_cpu_lookup_table[cpu]) {
+   if (new_nid == numa_cpu_lookup(cpu)) {
cpumask_andnot(&cpu_associativity_changes_mask,
&cpu_associativity_changes_mask,
 

[PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid

2015-09-27 Thread Raghavendra K T
Once we have made the distinction between nid and chipid,
create a 1:1 mapping between them. This makes compacting the
nids easier later.

No functionality change.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 36 +---
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f84ed2f..dd2073b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -264,6 +264,17 @@ out:
return chipid;
 }
 
+
+ /* Return the nid from associativity */
+static int associativity_to_nid(const __be32 *associativity)
+{
+   int nid;
+
+   nid = associativity_to_chipid(associativity);
+   return nid;
+}
+
+
 /* Returns the chipid associated with the given device tree node,
  * or -1 if not found.
  */
@@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device)
return chipid;
 }
 
+/*
+ * Returns nid from the chipid associated with given tree node
+ */
+static int of_node_to_nid_single(struct device_node *device)
+{
+   int nid;
+
+   nid = of_node_to_chipid_single(device);
+   return nid;
+}
+
 /* Walk the device tree upwards, looking for an associativity id */
 int of_node_to_nid(struct device_node *device)
 {
@@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device)
 
of_node_get(device);
while (device) {
-   nid = of_node_to_chipid_single(device);
+   nid = of_node_to_nid_single(device);
if (nid != -1)
break;
 
@@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
 }
 
 /*
- * This is like of_node_to_chipid_single() for memory represented in the
+ * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
 static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
@@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu)
goto out;
}
 
-   nid = of_node_to_chipid_single(cpu);
+   nid = of_node_to_nid_single(cpu);
 
 out_present:
if (nid < 0 || !node_online(nid))
@@ -754,7 +776,7 @@ static int __init parse_numa_properties(void)
 
cpu = of_get_cpu_node(i, NULL);
BUG_ON(!cpu);
-   nid = of_node_to_chipid_single(cpu);
+   nid = of_node_to_nid_single(cpu);
of_node_put(cpu);
 
/*
@@ -796,7 +818,7 @@ new_range:
 * have associativity properties.  If none, then
 * everything goes to default_nid.
 */
-   nid = of_node_to_chipid_single(memory);
+   nid = of_node_to_nid_single(memory);
if (nid < 0)
nid = default_nid;
 
@@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
if ((scn_addr < start) || (scn_addr >= (start + size)))
continue;
 
-   nid = of_node_to_chipid_single(memory);
+   nid = of_node_to_nid_single(memory);
break;
}
 
@@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void)
 
/* Use associativity from first thread for all siblings */
vphn_get_associativity(cpu, associativity);
-   new_nid = associativity_to_chipid(associativity);
+   new_nid = associativity_to_nid(associativity);
if (new_nid < 0 || !node_online(new_nid))
new_nid = first_online_node;
 
-- 
1.7.11.7


[PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids

2015-09-27 Thread Raghavendra K T
Also properly initialize the numa distance table for serial nids.

Problem: Powerpc supports sparse nid numbering, which could affect
1) memory footprint 2) virtualization use cases

Current solution:
The patch maps the sparse chipid got from the device tree to serial
nids.

Result before:
node   0   1  16  17
  0:  10  20  40  40
  1:  20  10  40  40
 16:  40  40  10  20
 17:  40  40  20  10

After:
node   0   1   2   3
  0:  10  20  40  40
  1:  20  10  40  40
  2:  40  40  10  20
  3:  40  40  20  10

Testing: Scenarios tested on  baremetal and KVM guest with 4 nodes
1) offlining and onlining memory and cpus
2) Running the tests from numactl source.
3) Creating 1000s of docker containers stressing the system

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f015cad..873ac8c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity)
/*
 * Skip the length field and send start of associativity array
 */
-   initialize_distance_lookup_table(chipid, associativity + 1);
+   initialize_distance_lookup_table(chipid_to_nid(chipid),
+associativity + 1);
}
 
 out:
@@ -314,9 +315,10 @@ out:
  /* Return the nid from associativity */
 static int associativity_to_nid(const __be32 *associativity)
 {
-   int nid;
+   int chipid, nid;
 
-   nid = associativity_to_chipid(associativity);
+   chipid = associativity_to_chipid(associativity);
+   nid = map_chipid_to_nid(chipid);
return nid;
 }
 
@@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device)
  */
 static int of_node_to_nid_single(struct device_node *device)
 {
-   int nid;
+   int chipid, nid;
 
-   nid = of_node_to_chipid_single(device);
+   chipid = of_node_to_chipid_single(device);
+   nid = map_chipid_to_nid(chipid);
return nid;
 }
 
-- 
1.7.11.7


[PATCH RFC 0/5] powerpc:numa Add serial nid support

2015-09-27 Thread Raghavendra K T
Problem description:
Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
numbered (possibly) as 0,1,16,17. At a lower level, the chipid we get from
the device tree is naturally mapped (directly) to the nid.

Potential side effect of that is:

1) There are several places in the kernel that assume serial node numbering,
and memory allocations assume that all the nodes from 0 to (highest nid)
exist, in turn ending up allocating memory for nodes that do not exist.

2) For virtualization use cases (such as qemu, libvirt, openstack), mapping
the sparse nids of the host system to the contiguous nids of a guest (numa
affinity, placement) could be a challenge.

Possible Solutions:
1) Handling the memory allocations in the kernel case by case: Though in some
cases it is easy to achieve, some cases may be intrusive/not trivial.
In the end it does not handle side effect (2) above.

2) Map the sparse chipid got from device tree to a serial nid at kernel
level (The idea proposed in this series).
Pro: It is more natural to handle at the kernel level than at the lower (OPAL) layer.
Con: The chipid in the device tree is no longer the same as the nid in the kernel.

3) Let the lower layer (OPAL) give the serial node ids after parsing the
chipid and the associativity etc [ either as a separate item in device tree
or by compacting the chipid numbers ]
Pros: kernel, device tree are on same page and less change in kernel
Con: is this the functionality expected of the lower layer?

As mentioned above, the current patch series maps the chipid from the lower
layer to a contiguous nid at the kernel level, keeping the node distance
calculation and so on intact.
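
To make the mapping concrete, here is a minimal user-space sketch of the
compaction scheme (array sizes and names are illustrative, not the
kernel's): each newly seen chipid is handed the first unused serial nid,
which is what the chipid_map nodemask and first_unset_node() achieve in
patch 4.

#include <stdio.h>

#define MAX_NODES 32
#define NO_NODE   (-1)

static int chipid_to_nid_map[MAX_NODES];
static unsigned long nid_used_mask;	/* analogue of the chipid_map nodemask */

static int map_chipid_to_nid(int chipid)
{
	int nid;

	if (chipid < 0 || chipid >= MAX_NODES)
		return NO_NODE;
	if (chipid_to_nid_map[chipid] != NO_NODE)
		return chipid_to_nid_map[chipid];	/* already mapped */

	/* First unset bit = first unused serial nid. */
	for (nid = 0; nid < MAX_NODES && ((nid_used_mask >> nid) & 1); nid++)
		;
	if (nid == MAX_NODES)
		return NO_NODE;
	nid_used_mask |= 1UL << nid;
	chipid_to_nid_map[chipid] = nid;
	return nid;
}

int main(void)
{
	int sparse[] = { 0, 1, 16, 17 };	/* chipids from the device tree */
	int i;

	for (i = 0; i < MAX_NODES; i++)
		chipid_to_nid_map[i] = NO_NODE;
	for (i = 0; i < 4; i++)
		printf("chipid %2d -> nid %d\n", sparse[i],
		       map_chipid_to_nid(sparse[i]));
	return 0;	/* prints 0->0, 1->1, 16->2, 17->3 */
}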

Result:
Before the patch: numactl -H

available: 4 nodes (0-1,16-17)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 0 size: 31665 MB
node 0 free: 29836 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
54 55 56 57 58 59 60 61 62 63
node 1 size: 32722 MB
node 1 free: 32019 MB
node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 
86 87 88 89 90 91 92 93 94 95
node 16 size: 32571 MB
node 16 free: 31222 MB
node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 
113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 17 size: 0 MB
node 17 free: 0 MB
node distances:
node   0   1  16  17 
  0:  10  20  40  40 
  1:  20  10  40  40 
 16:  40  40  10  20 
 17:  40  40  20  10 

After the patch: numactl -H

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31
node 0 size: 31665 MB
node 0 free: 30657 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 
54 55 56 57 58 59 60 61 62 63
node 1 size: 32722 MB
node 1 free: 32566 MB
node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 
86 87 88 89 90 91 92 93 94 95
node 2 size: 32571 MB
node 2 free: 32401 MB
node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 
113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  20  40  40 
  1:  20  10  40  40 
  2:  40  40  10  20 
  3:  40  40  20  10 

(Note that the numa distances are intact.) Apart from this, the following
tests were done with the patched kernel (both baremetal and a KVM guest with
multiple nodes) to ensure there is no breakage.

1) offlining and onlining of memory in /sys/devices/system/node/nodeX path

2) offlining and onlining of cpus in /sys/devices/system/cpu/ path

3) Numactl tests from
ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz

(in fact there was more breakage before the patch because of the sparse nid
and memoryless node cases of powerpc)

4) Thousands of docker containers were spawned.

Please let me know your comments.

 patches 1-3: cleanup patches
 patch 4: adds helper functions to map nid and chipid
 patch 5: uses the mapping to get serial nids

Raghavendra K T (5):
  powerpc:numa Add numa_cpu_lookup function to update lookup table
  powerpc:numa Rename functions referring to nid as chipid
  powerpc:numa create 1:1 mapping between chipid and nid
  powerpc:numa Add helper functions to maintain chipid to nid mapping
  powerpc:numa Use chipid to nid mapping to get serial numa node ids

 arch/powerpc/include/asm/mmzone.h |   2 +-
 arch/powerpc/kernel/smp.c |  10 ++--
 arch/powerpc/mm/numa.c| 121 +++---
 3 files changed, 105 insertions(+), 28 deletions(-)

-- 
1.7.11.7


[PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid

2015-09-27 Thread Raghavendra K T
There is no change in functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d5e6eee..f84ed2f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
}
 }
 
-/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
+/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
  * info is found.
  */
-static int associativity_to_nid(const __be32 *associativity)
+static int associativity_to_chipid(const __be32 *associativity)
 {
-   int nid = -1;
+   int chipid = -1;
 
if (min_common_depth == -1)
goto out;
 
if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   chipid = of_read_number(&associativity[min_common_depth], 1);
 
 /* POWER4 LPAR uses 0xffff as invalid node */
-   if (nid == 0xffff || nid >= MAX_NUMNODES)
-   nid = -1;
+   if (chipid == 0xffff || chipid >= MAX_NUMNODES)
+   chipid = -1;
 
-   if (nid > 0 &&
+   if (chipid > 0 &&
of_read_number(associativity, 1) >= distance_ref_points_depth) {
/*
 * Skip the length field and send start of associativity array
 */
-   initialize_distance_lookup_table(nid, associativity + 1);
+   initialize_distance_lookup_table(chipid, associativity + 1);
}
 
 out:
-   return nid;
+   return chipid;
 }
 
-/* Returns the nid associated with the given device tree node,
+/* Returns the chipid associated with the given device tree node,
  * or -1 if not found.
  */
-static int of_node_to_nid_single(struct device_node *device)
+static int of_node_to_chipid_single(struct device_node *device)
 {
-   int nid = -1;
+   int chipid = -1;
const __be32 *tmp;
 
tmp = of_get_associativity(device);
if (tmp)
-   nid = associativity_to_nid(tmp);
-   return nid;
+   chipid = associativity_to_chipid(tmp);
+   return chipid;
 }
 
 /* Walk the device tree upwards, looking for an associativity id */
@@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device)
 
of_node_get(device);
while (device) {
-   nid = of_node_to_nid_single(device);
+   nid = of_node_to_chipid_single(device);
if (nid != -1)
break;
 
@@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory,
 }
 
 /*
- * This is like of_node_to_nid_single() for memory represented in the
+ * This is like of_node_to_chipid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
 static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
@@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu)
goto out;
}
 
-   nid = of_node_to_nid_single(cpu);
+   nid = of_node_to_chipid_single(cpu);
 
 out_present:
if (nid < 0 || !node_online(nid))
@@ -754,7 +754,7 @@ static int __init parse_numa_properties(void)
 
cpu = of_get_cpu_node(i, NULL);
BUG_ON(!cpu);
-   nid = of_node_to_nid_single(cpu);
+   nid = of_node_to_chipid_single(cpu);
of_node_put(cpu);
 
/*
@@ -796,7 +796,7 @@ new_range:
 * have associativity properties.  If none, then
 * everything goes to default_nid.
 */
-   nid = of_node_to_nid_single(memory);
+   nid = of_node_to_chipid_single(memory);
if (nid < 0)
nid = default_nid;
 
@@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr)
if ((scn_addr < start) || (scn_addr >= (start + size)))
continue;
 
-   nid = of_node_to_nid_single(memory);
+   nid = of_node_to_chipid_single(memory);
break;
}
 
@@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
 
/* Use associativity from first thread for all siblings */
vphn_get_associativity(cpu, associativity);
-   new_nid = associativity_to_nid(associativity);
+   new_nid = associativity_to_chipid(associativity);
if (new_nid < 0 || !node_online(new_nid))
new_nid = first_online_node;
 
-- 
1.7.11.7


[PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping

2015-09-27 Thread Raghavendra K T
Create arrays that map between serial nids and sparse chipids.

Note: My original idea had only the two chipid-to-nid map arrays. The final
code is inspired by drivers/acpi/numa.c, which maps a proximity node to a
logical node (by Takayoshi Kochi <t-ko...@bq.jp.nec.com>), and thus
uses an additional chipid_map nodemask. The mask helps in finding the first
unused nid easily, by looking for the first unset bit in the mask.

No change in functionality.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 48 +++-
 1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index dd2073b..f015cad 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -63,6 +63,11 @@ static int form1_affinity;
 static int distance_ref_points_depth;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
+static nodemask_t chipid_map = NODE_MASK_NONE;
+static int chipid_to_nid_map[MAX_NUMNODES]
+   = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
+static int nid_to_chipid_map[MAX_NUMNODES]
+   = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE };
 
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
@@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
return 0;
 }
 
+int chipid_to_nid(int chipid)
+{
+   if (chipid < 0)
+   return NUMA_NO_NODE;
+   return chipid_to_nid_map[chipid];
+}
+
+int nid_to_chipid(int nid)
+{
+   if (nid < 0)
+   return NUMA_NO_NODE;
+   return nid_to_chipid_map[nid];
+}
+
+static void __map_chipid_to_nid(int chipid, int nid)
+{
+   if (chipid_to_nid_map[chipid] == NUMA_NO_NODE
+       || nid < chipid_to_nid_map[chipid])
+   chipid_to_nid_map[chipid] = nid;
+   if (nid_to_chipid_map[nid] == NUMA_NO_NODE
+   || chipid < nid_to_chipid_map[nid])
+   nid_to_chipid_map[nid] = chipid;
+}
+
+int map_chipid_to_nid(int chipid)
+{
+   int nid;
+
+   if (chipid < 0 || chipid >= MAX_NUMNODES)
+   return NUMA_NO_NODE;
+
+   nid = chipid_to_nid_map[chipid];
+   if (nid == NUMA_NO_NODE) {
+   if (nodes_weight(chipid_map) >= MAX_NUMNODES)
+   return NUMA_NO_NODE;
+   nid = first_unset_node(chipid_map);
+   __map_chipid_to_nid(chipid, nid);
+   node_set(nid, chipid_map);
+   }
+   return nid;
+}
+
 int numa_cpu_lookup(int cpu)
 {
return numa_cpu_lookup_table[cpu];
@@ -264,7 +311,6 @@ out:
return chipid;
 }
 
-
  /* Return the nid from associativity */
 static int associativity_to_nid(const __be32 *associativity)
 {
-- 
1.7.11.7
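
For illustration, a hypothetical call sequence on the 0,1,16,17 system from
the cover letter (not actual output; nids are handed out in first-use order
via the first unset bit of chipid_map):

  map_chipid_to_nid(0);   /* first unset bit in chipid_map -> nid 0 */
  map_chipid_to_nid(1);   /* -> nid 1 */
  map_chipid_to_nid(16);  /* -> nid 2 */
  map_chipid_to_nid(17);  /* -> nid 3 */
  map_chipid_to_nid(16);  /* already mapped, returns nid 2 again */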


Re: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes

2015-09-22 Thread Raghavendra K T

On 09/22/2015 10:59 AM, Michael Ellerman wrote:

On Tue, 2015-09-15 at 07:38 +0530, Raghavendra K T wrote:


... nothing


Sure this patch looks obvious, but please give me a changelog that proves
you've thought about it thoroughly.

For example is it OK to use for_each_node() at this point in boot? Is there any
historical reason why we did it with a hard coded loop? If so what has changed.
What systems have you tested on? etc. etc.

cheers


Changelog:

With the setup_nr_nodes(), we have already initialized node_possible_map,
so it is safe to use for_each_node here.
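
For reference, this is roughly what for_each_node() expands to (paraphrased
from include/linux/nodemask.h; the exact form varies by kernel version):

  #define for_each_node_state(__node, __state) \
          for_each_node_mask((__node), node_states[__state])

  #define for_each_node(node)     for_each_node_state(node, N_POSSIBLE)

Since for_each_node_mask() walks only the set bits of node_possible_map, it
never visits a node id that was not marked possible, unlike a plain
0..nr_node_ids-1 'for' loop.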

There are many places in the kernel that use a hardcoded 'for' loop with
nr_node_ids, because all other architectures have numa nodes populated
serially. That is presumably why we maintained the same for powerpc.

But since on power we have sparse numa node ids possible, we
unnecessarily allocate memory for non existent numa nodes.

For e.g., on a system with 0,1,16,17 as numa nodes, nr_node_ids=18
and we allocate memory for the non existent nodes 2-15 as well.

The patch is boot tested on a 4 node Tuleta [ confirmed with printks ]
that it works as expected.




Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
  arch/powerpc/mm/numa.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..8d8a541 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void)
setup_nr_node_ids();

/* allocate the map */
-   for (node = 0; node < nr_node_ids; node++)
+   for_each_node(node)
alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);

/* cpumask_of_node() will now work */









Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru

2015-09-14 Thread Raghavendra K T

On 09/14/2015 02:30 PM, Vladimir Davydov wrote:

Hi,

On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote:

The functions used in the patch are in the slowpath, which gets called
whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with
sequential numa node ids, for powerpc, which can potentially have
sparse node ids (e.g., a 4 node system having numa ids 0,1,16,17
is common), this patch saves some unnecessary allocations for
non-existent numa nodes.

Even without that saving, the patch perhaps makes the code more readable.


Do I understand correctly that node 0 must always be in
node_possible_map? I ask, because we currently test
lru->node[0].memcg_lrus to determine if the list is memcg aware.



Yes, node 0 is always there. So it should not be a problem.



Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
  mm/list_lru.c | 23 +++
  1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 909eca2..5a97f83 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
  {
int i;

-   for (i = 0; i < nr_node_ids; i++) {
+   for_each_node(i) {
if (!memcg_aware)
lru->node[i].memcg_lrus = NULL;


So, we don't explicitly initialize memcg_lrus for nodes that are not in
node_possible_map. That's OK, because we allocate lru->node using
kzalloc. However, this partial nullifying in case !memcg_aware looks
confusing IMO. Let's drop it, I mean something like this:


Yes, you are right, and we do not have to have the memcg_aware check inside
the for loop either.
Will change as per your suggestion and send V2.
Thanks for the review.



static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
{
int i;

if (!memcg_aware)
return 0;

for_each_node(i) {
if (memcg_init_list_lru_node(&lru->node[i]))
goto fail;
}




Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru

2015-09-14 Thread Raghavendra K T

On 09/14/2015 05:34 PM, Vladimir Davydov wrote:

On Mon, Sep 14, 2015 at 05:09:31PM +0530, Raghavendra K T wrote:

On 09/14/2015 02:30 PM, Vladimir Davydov wrote:

On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote:

The functions used in the patch are in the slowpath, which gets called
whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with
sequential numa node ids, for powerpc, which can potentially have
sparse node ids (e.g., a 4 node system having numa ids 0,1,16,17
is common), this patch saves some unnecessary allocations for
non-existent numa nodes.

Even without that saving, the patch perhaps makes the code more readable.


Do I understand correctly that node 0 must always be in
node_possible_map? I ask, because we currently test
lru->node[0].memcg_lrus to determine if the list is memcg aware.



Yes, node 0 is always there. So it should not be a problem.


I think it should be mentioned in the comment to list_lru_memcg_aware
then.



Something like this?
static inline bool list_lru_memcg_aware(struct list_lru *lru)
{
/*
 * This needs node 0 to be always present, even
 * in the systems supporting sparse numa ids.
 */
return !!lru->node[0].memcg_lrus;
}



[PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru

2015-09-14 Thread Raghavendra K T
The functions used in the patch are in the slowpath, which gets called
whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with
sequential numa node ids, for powerpc, which can potentially have
sparse node ids (e.g., a 4 node system having numa ids 0,1,16,17
is common), this patch saves some unnecessary allocations for
non-existent numa nodes.

Even without that saving, the patch perhaps makes the code more readable.

[ Take memcg_aware check outside for_each loop: Vldimir]
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 mm/list_lru.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

 Changes in V2:
  - Take memcg_aware check outside for_each loop (Vldimir)
  - Add comment that node 0 should always be present (Vladimir)  
  
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 909eca2..60cd6dd 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -42,6 +42,10 @@ static void list_lru_unregister(struct list_lru *lru)
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
+   /*
+* This needs node 0 to be always present, even
+* in the systems supporting sparse numa ids.
+*/
return !!lru->node[0].memcg_lrus;
 }
 
@@ -377,16 +381,20 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
int i;
 
-   for (i = 0; i < nr_node_ids; i++) {
-   if (!memcg_aware)
-   lru->node[i].memcg_lrus = NULL;
-   else if (memcg_init_list_lru_node(&lru->node[i]))
+   if (!memcg_aware)
+   return 0;
+
+   for_each_node(i) {
+   if (memcg_init_list_lru_node(&lru->node[i]))
goto fail;
}
return 0;
 fail:
-   for (i = i - 1; i >= 0; i--)
+   for (i = i - 1; i >= 0; i--) {
+   if (!lru->node[i].memcg_lrus)
+   continue;
memcg_destroy_list_lru_node(&lru->node[i]);
+   }
return -ENOMEM;
 }
 
@@ -397,7 +405,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
if (!list_lru_memcg_aware(lru))
return;
 
-   for (i = 0; i < nr_node_ids; i++)
+   for_each_node(i)
memcg_destroy_list_lru_node(&lru->node[i]);
 }
 
@@ -409,16 +417,20 @@ static int memcg_update_list_lru(struct list_lru *lru,
if (!list_lru_memcg_aware(lru))
return 0;
 
-   for (i = 0; i < nr_node_ids; i++) {
+   for_each_node(i) {
if (memcg_update_list_lru_node(&lru->node[i],
   old_size, new_size))
goto fail;
}
return 0;
 fail:
-   for (i = i - 1; i >= 0; i--)
+   for (i = i - 1; i >= 0; i--) {
+   if (!lru->node[i].memcg_lrus)
+   continue;
+
memcg_cancel_update_list_lru_node(&lru->node[i],
  old_size, new_size);
+   }
return -ENOMEM;
 }
 
@@ -430,7 +442,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru,
if (!list_lru_memcg_aware(lru))
return;
 
-   for (i = 0; i < nr_node_ids; i++)
+   for_each_node(i)
memcg_cancel_update_list_lru_node(&lru->node[i],
  old_size, new_size);
 }
@@ -485,7 +497,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
if (!list_lru_memcg_aware(lru))
return;
 
-   for (i = 0; i < nr_node_ids; i++)
+   for_each_node(i)
memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
 }
 
@@ -522,7 +534,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
if (!lru->node)
goto out;
 
-   for (i = 0; i < nr_node_ids; i++) {
+   for_each_node(i) {
spin_lock_init(&lru->node[i].lock);
if (key)
lockdep_set_class(&lru->node[i].lock, key);
-- 
1.7.11.7


[PATCH V2 0/2] Replace nr_node_ids for loop with for_each_node

2015-09-14 Thread Raghavendra K T
Many places in the kernel use a 'for' loop with nr_node_ids. For architectures
which support sparse numa ids, this results in some unnecessary allocations
for non-existent nodes.
(e.g., numa node sets such as 0,1,16,17 are common on powerpc.)

So replace the for loop with for_each_node so that allocations happen only for
existing numa nodes.

Please note that, though there are many places where nr_node_ids is used, the
current patchset uses for_each_node only in the slowpath, since for_each_node
involves a find_next_bit traversal on each iteration.
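
As a standalone illustration of the saving (a hypothetical demo, not from the
patchset), consider the sparse 0,1,16,17 node set with nr_node_ids == 18:

  #include <stdio.h>

  int main(void)
  {
          unsigned long possible_map = (1UL << 0) | (1UL << 1) |
                                       (1UL << 16) | (1UL << 17);
          int nr_node_ids = 18, i, plain = 0, masked = 0;

          for (i = 0; i < nr_node_ids; i++)       /* plain loop */
                  plain++;

          for (i = 0; i < nr_node_ids; i++)       /* possible-map walk */
                  if (possible_map & (1UL << i))
                          masked++;

          printf("plain: %d iterations, for_each_node-style: %d\n",
                 plain, masked);
          return 0;
  }

The plain loop would have triggered 14 needless per-node allocations here; the
mask walk touches only the four real nodes.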

Changes in V2:
  - Take memcg_aware check outside for_each loop (Vldimir)
  - Add comment that node 0 should always be present (Vladimir)

Raghavendra K T (2):
  mm: Replace nr_node_ids for loop with for_each_node in list lru
  powerpc:numa Do not allocate bootmem memory for non existing nodes

 arch/powerpc/mm/numa.c |  2 +-
 mm/list_lru.c  | 34 +++---
 2 files changed, 24 insertions(+), 12 deletions(-)

-- 
1.7.11.7


[PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes

2015-09-14 Thread Raghavendra K T
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..8d8a541 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void)
setup_nr_node_ids();
 
/* allocate the map */
-   for (node = 0; node < nr_node_ids; node++)
+   for_each_node(node)
alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
 
/* cpumask_of_node() will now work */
-- 
1.7.11.7


Re: [PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru

2015-09-14 Thread Raghavendra K T

On 09/15/2015 07:38 AM, Raghavendra K T wrote:

The functions used in the patch are in the slowpath, which gets called
whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with
sequential numa node ids, for powerpc, which can potentially have
sparse node ids (e.g., a 4 node system having numa ids 0,1,16,17
is common), this patch saves some unnecessary allocations for
non-existent numa nodes.

Even without that saving, the patch perhaps makes the code more readable.

[ Take memcg_aware check outside for_each loop: Vladimir]
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---


Sorry that I had misspelled Vladimir above.


[PATCH 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes

2015-09-08 Thread Raghavendra K T
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..8d8a541 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void)
setup_nr_node_ids();
 
/* allocate the map */
-   for (node = 0; node < nr_node_ids; node++)
+   for_each_node(node)
alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
 
/* cpumask_of_node() will now work */
-- 
1.7.11.7


[PATCH 0/2] Replace nr_node_ids for loop with for_each_node

2015-09-08 Thread Raghavendra K T
Many places in the kernel use a 'for' loop with nr_node_ids. For architectures
which support sparse numa ids, this results in some unnecessary allocations
for non-existent nodes.
(e.g., numa node sets such as 0,1,16,17 are common on powerpc.)

So replace the for loop with for_each_node so that allocations happen only for
existing numa nodes.

Please note that, though there are many places where nr_node_ids is used, the
current patchset uses for_each_node only in the slowpath, since for_each_node
involves a find_next_bit traversal on each iteration.

Raghavendra K T (2):
  mm: Replace nr_node_ids for loop with for_each_node in list lru
  powerpc:numa Do not allocate bootmem memory for non existing nodes

 arch/powerpc/mm/numa.c |  2 +-
 mm/list_lru.c  | 23 +++
 2 files changed, 16 insertions(+), 9 deletions(-)

-- 
1.7.11.7


[PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru

2015-09-08 Thread Raghavendra K T
The functions used in the patch are in the slowpath, which gets called
whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with
sequential numa node ids, for powerpc, which can potentially have
sparse node ids (e.g., a 4 node system having numa ids 0,1,16,17
is common), this patch saves some unnecessary allocations for
non-existent numa nodes.

Even without that saving, the patch perhaps makes the code more readable.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 mm/list_lru.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 909eca2..5a97f83 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
int i;
 
-   for (i = 0; i < nr_node_ids; i++) {
+   for_each_node(i) {
if (!memcg_aware)
lru->node[i].memcg_lrus = NULL;
else if (memcg_init_list_lru_node(&lru->node[i]))
@@ -385,8 +385,11 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
}
return 0;
 fail:
-   for (i = i - 1; i >= 0; i--)
+   for (i = i - 1; i >= 0; i--) {
+   if (!lru->node[i].memcg_lrus)
+   continue;
memcg_destroy_list_lru_node(&lru->node[i]);
+   }
return -ENOMEM;
 }
 
@@ -397,7 +400,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
if (!list_lru_memcg_aware(lru))
return;
 
-   for (i = 0; i < nr_node_ids; i++)
+   for_each_node(i)
memcg_destroy_list_lru_node(&lru->node[i]);
 }
 
@@ -409,16 +412,20 @@ static int memcg_update_list_lru(struct list_lru *lru,
if (!list_lru_memcg_aware(lru))
return 0;
 
-   for (i = 0; i < nr_node_ids; i++) {
+   for_each_node(i) {
if (memcg_update_list_lru_node(&lru->node[i],
   old_size, new_size))
goto fail;
}
return 0;
 fail:
-   for (i = i - 1; i >= 0; i--)
+   for (i = i - 1; i >= 0; i--) {
+   if (!lru->node[i].memcg_lrus)
+   continue;
+
memcg_cancel_update_list_lru_node(&lru->node[i],
  old_size, new_size);
+   }
return -ENOMEM;
 }
 
@@ -430,7 +437,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru,
if (!list_lru_memcg_aware(lru))
return;
 
-   for (i = 0; i < nr_node_ids; i++)
+   for_each_node(i)
memcg_cancel_update_list_lru_node(&lru->node[i],
  old_size, new_size);
 }
@@ -485,7 +492,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
if (!list_lru_memcg_aware(lru))
return;
 
-   for (i = 0; i < nr_node_ids; i++)
+   for_each_node(i)
memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
 }
 
@@ -522,7 +529,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
if (!lru->node)
goto out;
 
-   for (i = 0; i < nr_node_ids; i++) {
+   for_each_node(i) {
spin_lock_init(&lru->node[i].lock);
if (key)
lockdep_set_class(&lru->node[i].lock, key);
-- 
1.7.11.7


Re: [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot

2015-03-06 Thread Raghavendra K T

On 03/06/2015 10:57 AM, Nishanth Aravamudan wrote:

On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:

On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:


So if we compare to x86:

arch/x86/mm/numa.c::numa_init():

 nodes_clear(numa_nodes_parsed);
 nodes_clear(node_possible_map);
 nodes_clear(node_online_map);
...
numa_register_memblks(...);

arch/x86/mm/numa.c::numa_register_memblks():

node_possible_map = numa_nodes_parsed;

Basically, it looks like x86 NUMA init clears out possible map and
online map, probably for a similar reason to what I gave in the
changelog that by default, the possible map seems to be based off
MAX_NUMNODES, rather than nr_node_ids or anything dynamic.

My patch was an attempt to emulate the same thing on powerpc. You are
right that there is a window in which the node_possible_map and
node_online_map are out of sync with my patch. It seems like it
shouldn't matter given how early in boot we are, but perhaps the
following would have been clearer:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..1a118b08fad2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)

 memblock_dump_all();

+   /*
+    * Reduce the possible NUMA nodes to the online NUMA nodes,
+    * since we do not support node hotplug. This ensures that we
+    * lower the maximum NUMA node ID to what is actually present.
+    */
+   nodes_and(node_possible_map, node_possible_map, node_online_map);


If you don't support node hotplug, then a node should always be possible
if it's online unless there are other tricks powerpc plays with
node_possible_map.  Shouldn't this just be
node_possible_map = node_online_map?


Yeah, but I was too dumb to think of that before sending :)

Updated version follows...

-Nish


---8---


Raghu noticed an issue with excessive memory allocation on power with a
simple cgroup test, specifically, in mem_cgroup_css_alloc ->
for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing
up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
directories).

Should we also add that after this patch it has reduced to around 2MB?
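
(Rough arithmetic, assuming roughly one kmalloc-2048 object per possible node
per cgroup: 400 cgroups * 256 possible nodes * 2 KB ~= 200 MB before the
patch, versus 400 * 4 online nodes * 2 KB ~= 3 MB after, which is in line
with the numbers quoted here.)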


The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
possible), which defines node_possible_map, which in turn defines the
value of nr_node_ids in setup_nr_node_ids and the iteration of
for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we
do not support node hotplug on power in the first place, so the nodes
that are online when we come up are the nodes that will be present for
the lifetime of this kernel. So let's, at least, drop the NUMA possible
map down to the online map at runtime. This is similar to what x86 does
in its initialization routines.
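
For context, this is roughly how nr_node_ids falls out of node_possible_map
(paraphrased from mm/page_alloc.c; the exact form varies by kernel version):

  void __init setup_nr_node_ids(void)
  {
          unsigned int node;
          unsigned int highest = 0;

          for_each_node_mask(node, node_possible_map)
                  highest = node;
          nr_node_ids = highest + 1;
  }

So dropping the possible map to the online map takes a 0,1,16,17 system from
nr_node_ids == 256 (NODES_SHIFT == 8) down to 18, and for_each_node users skip
the holes in between as well.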

mem_cgroup_css_alloc should also be fixed to only iterate over
memory-populated nodes and handle hotplug, but that is a separate
change.
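
A sketch of that follow-up idea (illustrative only, not part of this patch;
it assumes the existing N_MEMORY node state and the
alloc_mem_cgroup_per_zone_info() helper named above):

  int nid;

  for_each_node_state(nid, N_MEMORY) {
          if (alloc_mem_cgroup_per_zone_info(memcg, nid))
                  goto fail;
  }

This would walk only nodes that actually have memory, rather than every
possible node id.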


Maybe we could formally add
Reported-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com

Signed-off-by: Nishanth Aravamudan n...@linux.vnet.ibm.com
To: Michael Ellerman m...@ellerman.id.au
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Tejun Heo t...@kernel.org
Cc: David Rientjes rient...@google.com
Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
Cc: Paul Mackerras pau...@samba.org
Cc: Anton Blanchard an...@samba.org
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com

---
v1 - v2:
   Rather than clear node_possible_map and set it nid-by-nid, just
   directly assign node_online_map to it, as suggested by Michael
   Ellerman and Tejun Heo.

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0257a7d659ef..0c1716cd271f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -958,6 +958,13 @@ void __init initmem_init(void)

memblock_dump_all();

+   /*
+    * Reduce the possible NUMA nodes to the online NUMA nodes,
+    * since we do not support node hotplug. This ensures that we
+    * lower the maximum NUMA node ID to what is actually present.
+    */


 Hope we remember this change when we add hotplug :)


+   node_possible_map = node_online_map;
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;


