Re: [tip:sched/urgent] sched: Fix crash in sched_init_numa()
On 01/19/2016 07:08 PM, tip-bot for Raghavendra K T wrote:
> Commit-ID:  9c03ee147193645be4c186d3688232fa438c57c7
> Gitweb:     http://git.kernel.org/tip/9c03ee147193645be4c186d3688232fa438c57c7
> Author:     Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> AuthorDate: Sat, 16 Jan 2016 00:31:23 +0530
> Committer:  Ingo Molnar <mi...@kernel.org>
> CommitDate: Tue, 19 Jan 2016 08:42:20 +0100
>
> sched: Fix crash in sched_init_numa()
>
> [ Full commit message and patch trimmed; reproduced in the tip-bot mail below. ]

Hello Greg,

The above commit fixes a debug-kernel crash in the 4.4 kernel (when
DEBUG_PER_CPU_MAPS=y, to be precise). This is a regression in 4.4
relative to 4.3 and should ideally be present in 4.4-stable. Could you
please pull in this change?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
[tip:sched/urgent] sched: Fix crash in sched_init_numa()
Commit-ID:  9c03ee147193645be4c186d3688232fa438c57c7
Gitweb:     http://git.kernel.org/tip/9c03ee147193645be4c186d3688232fa438c57c7
Author:     Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
AuthorDate: Sat, 16 Jan 2016 00:31:23 +0530
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 19 Jan 2016 08:42:20 +0100

sched: Fix crash in sched_init_numa()

The following PowerPC commit:

  c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for non existent nodes.

But when DEBUG_PER_CPU_MAPS=y is enabled, my PowerNV system failed to
boot, because in sched_init_numa() the cpumask_or() operation was done
on unallocated nodes.

Fix that by doing the cpumask_or() operation only on existing nodes.

[ Tested with and without DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Tested-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Cc: <gk...@linux.vnet.ibm.com>
Cc: <grant.lik...@linaro.org>
Cc: <nik...@linux.vnet.ibm.com>
Cc: <vdavy...@parallels.com>
Cc: <linuxppc-dev@lists.ozlabs.org>
Cc: <linux...@kvack.org>
Cc: <pet...@infradead.org>
Cc: <b...@kernel.crashing.org>
Cc: <pau...@samba.org>
Cc: <m...@ellerman.id.au>
Cc: <an...@samba.org>
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)

 			sched_domains_numa_masks[i][j] = mask;

-			for (k = 0; k < nr_node_ids; k++) {
+			for_each_node(k) {
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
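The one-line change above matters because per-node structures are only allocated for nodes that actually exist. The following is a minimal userspace sketch (not the kernel code; node ids, array names, and the mask type are illustrative) of why iterating every id in `[0, nr_node_ids)` crashes on a sparsely numbered system like the reporter's (nodes 0, 1, 16, 17), while iterating only present nodes, as `for_each_node()` does, is safe:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: per-node masks exist only for present nodes
 * (0, 1, 16, 17 on the reporter's box); node_mask[] stays NULL for the
 * gaps, just as the per-node cpumasks are left unallocated. */
enum { NR_NODE_IDS = 18 };
static const int node_present[NR_NODE_IDS] = { [0] = 1, [1] = 1, [16] = 1, [17] = 1 };

static unsigned long mask_storage[NR_NODE_IDS];
static unsigned long *node_mask[NR_NODE_IDS]; /* NULL for absent nodes */

/* Returns how many masks were safely OR-ed into 'accum'. */
static int or_present_nodes(unsigned long *accum)
{
	int k, visited = 0;

	/* Allocation phase: only present nodes get a mask. */
	for (k = 0; k < NR_NODE_IDS; k++)
		if (node_present[k])
			node_mask[k] = &mask_storage[k];

	/* The fix: skip absent ids, like the kernel's for_each_node().
	 * A plain "for (k = 0; k < nr_node_ids; k++)" body that
	 * dereferenced node_mask[k] would hit NULL at k == 2. */
	for (k = 0; k < NR_NODE_IDS; k++) {
		if (!node_present[k])
			continue;
		*accum |= 1UL << k; /* stands in for cpumask_or() */
		visited++;
	}
	return visited;
}
```

The model visits exactly the four present nodes and never touches the NULL slots in between, which is the whole content of the `for_each_node(k)` conversion.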
Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)
* Jan Stancek <jstan...@redhat.com> [2016-01-09 18:03:55]:
> Hi,
>
> I'm seeing a bare metal ppc64le system crashing early during boot
> with the latest upstream kernel (4.4.0-rc8):
>
> # git describe
> v4.4-rc8-96-g751e5f5
>
> [0.625451] Unable to handle kernel paging request for data at address 0x
> [0.625586] Faulting instruction address: 0xc04ae000
> [0.625698] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.625789] SMP NR_CPUS=2048 NUMA PowerNV
> [0.625879] Modules linked in:
> [0.625973] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-rc8+ #6
> [0.626087] task: c02ff430 ti: c02ff6084000 task.ti: c02ff6084000
> [0.626224] NIP: c04ae000 LR: c090b9e4 CTR: 0003
> [0.626361] REGS: c02ff6087930 TRAP: 0300 Not tainted (4.4.0-rc8+)
> [0.626475] MSR: 90019033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 48002044 XER: 2000
> [0.626808] CFAR: c0008468 DAR: DSISR: 4000 SOFTE: 1
> GPR00: c090b9ac c02ff6087bb0 c1700900 c03ff229e080
> GPR04: c03ff229e080 0003 0001
> GPR08: 0010 90011003
> GPR12: 2200 cfb4 c000bd68 0002
> GPR16: 0028 c0b25940 c173ffa4
> GPR20: c0b259d8 c0b259e0 c0b259e8
> GPR24: c03ff229e080 c189b180
> GPR28: c1740a94 0002 0002
> [0.627925] NIP [c04ae000] __bitmap_or+0x30/0x50
> [0.627973] LR [c090b9e4] sched_init_numa+0x440/0x7c8
> [0.628030] Call Trace:
> [0.628054] [c02ff6087bb0] [c090b9ac] sched_init_numa+0x408/0x7c8 (unreliable)
> [0.628136] [c02ff6087ca0] [c0c60718] sched_init_smp+0x60/0x238
> [0.628206] [c02ff6087d00] [c0c44294] kernel_init_freeable+0x1fc/0x3b4
> [0.628286] [c02ff6087dc0] [c000bd84] kernel_init+0x24/0x140
> [0.628356] [c02ff6087e30] [c0009544] ret_from_kernel_thread+0x5c/0x98
> [0.628435] Instruction dump:
> [0.628470] 38c6003f 78c9d183 4d820020 38c9 3920 78c60020 38c60001 7cc903a6
> [0.628587] 6000 6000 6000 6042 <7d05482a> 7d44482a 7d0a5378 7d43492a
> [0.628711] ---[ end trace b423f3e02b333fbf ]---
> [0.628757]
> [2.628822] Kernel panic - not syncing: Fatal exception
> [2.628969] Rebooting in 10 seconds..
> [0.00] OPAL V3 detected !
> The crash goes away if I revert the following commit:
>
> commit c118baf802562688d46e6002f2b5fe66b947da21
> Author: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> Date:   Thu Nov 5 18:46:29 2015 -0800
>
>     arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes

Something like the patch below should fix it. I'll send it in a
separate email, marking Peter and Ingo. Basically, the for_each_node()
conversion targeted only slow-path / used-once sort of functions, but
it turns out there was a cpumask_or() in sched_init_numa() that used an
unallocated node. Sorry for getting back late; I was being
overcautious, checking x86 and Power with and without
DEBUG_PER_CPU_MAPS.

---8<---
From 6680994a5a8dde7eccfbd2bffde341fdff2aed63 Mon Sep 17 00:00:00 2001
From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Date: Fri, 15 Jan 2016 18:19:56 +0530
Subject: [PATCH] Fix: PowerNV crash with 4.4.0-rc8 at sched_init_numa

Commit c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem
memory for non existing nodes") avoided bootmem memory allocation for
non existent nodes. With DEBUG_PER_CPU_MAPS enabled, a PowerNV system
failed to boot, because in sched_init_numa() a cpumask_or() operation
was done on unallocated nodes. Fix that by doing the cpumask_or()
operation only on existing nodes.

[ Tested with and without DEBUG_PER_CPU_MAPS on x86 and powerpc ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)

 			sched_domains_numa_masks[i][j] = mask;

-			for (k = 0; k < nr_node_ids; k++) {
+			for_each_node(k) {
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
--
1.7.11.7
[PATCH] Fix: PowerNV crash with 4.4.0-rc8 at sched_init_numa
Commit c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem
memory for non existing nodes") avoided bootmem memory allocation for
non existent nodes. With DEBUG_PER_CPU_MAPS enabled, a PowerNV system
failed to boot, because in sched_init_numa() a cpumask_or() operation
was done on unallocated nodes. Fix that by doing the cpumask_or()
operation only on existing nodes.

[ Tested with and without DEBUG_PER_CPU_MAPS on x86 and powerpc ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)

 			sched_domains_numa_masks[i][j] = mask;

-			for (k = 0; k < nr_node_ids; k++) {
+			for_each_node(k) {
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
--
1.7.11.7
Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)
On 01/10/2016 04:33 AM, Jan Stancek wrote:
> Hi,
>
> I'm seeing a bare metal ppc64le system crashing early during boot
> with the latest upstream kernel (4.4.0-rc8):

Jan,

Do you mind sharing the .config you used for the kernel? I am not able
to reproduce with the one that I have. :(
Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)
On 01/11/2016 05:22 PM, Raghavendra K T wrote:
> On 01/10/2016 04:33 AM, Jan Stancek wrote:
>> Hi,
>>
>> I'm seeing a bare metal ppc64le system crashing early during boot
>> with the latest upstream kernel (4.4.0-rc8):
>
> Jan,
>
> Do you mind sharing the .config you used for the kernel? I am not able
> to reproduce with the one that I have. :(

Never mind.. I enabled DEBUG_PER_CPU_MAPS to hit that.

/me goes back..
Re: [BUG] PowerNV crash with 4.4.0-rc8 at sched_init_numa (related to commit c118baf80256)
On 01/10/2016 04:33 AM, Jan Stancek wrote:
> Hi,
>
> I'm seeing a bare metal ppc64le system crashing early during boot
> with the latest upstream kernel (4.4.0-rc8):

Hi Jan,

Thanks for reporting. Let me try to reproduce the issue. (By the way,
if you think there is anything special in the .config that I need for
testing, please share.)

- Raghu

> # git describe
> v4.4-rc8-96-g751e5f5
>
> [ Oops and register dump trimmed; identical to the trace quoted in the
>   earlier reply above. ]
>
> # numactl -H
> available: 4 nodes (0-1,16-17)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
> node 0 size: 64941 MB
> node 0 free: 64210 MB
> node 1 cpus: 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
> node 1 size: 65456 MB
> node 1 free: 62424 MB
> node 16 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
> node 16 size: 65457 MB
> node 16 free: 65258 MB
> node 17 cpus: 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
> node 17 size: 65186 MB
> node 17 free: 65001 MB
> node distances:
> node   0   1  16  17
>   0:  10  20  40  40
>   1:  20  10  40  40
>  16:  40  40  10  20
>  17:  40  40  20  10
>
> The crash goes away if I revert the following commit:
>
> commit c118baf802562688d46e6002f2b5fe66b947da21
> Author: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> Date:   Thu Nov 5 18:46:29 2015 -0800
>
>     arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes
>
> Regards,
> Jan
Re: [PATCH] block-mq:Fix the null memory access while setting tags cpumask
On 10/13/2015 10:17 PM, Jeff Moyer wrote:
> Raghavendra K T <raghavendra...@linux.vnet.ibm.com> writes:
>> In nr_hw_queues > 1 cases, when a certain number of cpus are onlined
>> or offlined, resulting in a change to the request_queue map in the
>> block-mq layer, we see the kernel dumping like: [...]
>
> What version is that patch against? This problem should be fixed by
> the patch set from Akinobu Mita. I tested cpu hot-plugging on my
> system with an nvme device, and it is fixed there on mainline.

Hi Jeff,

My patch was against (4.3.0-rc2) bcee19f424a. I do see now that commit
1356aae08338 ("blk-mq: avoid setting hctx->tags->cpumask before
allocation") has an equivalent fix. Thanks for the information.
[PATCH] block-mq:Fix the null memory access while setting tags cpumask
In nr_hw_queues > 1 cases, when a certain number of cpus are onlined or
offlined, resulting in a change to the request_queue map in the
block-mq layer, we see the kernel dumping like:

BUG: unable to handle kernel NULL pointer dereference at 0080
IP: [] cpumask_set_cpu+0x6/0xd
PGD 6d957067 PUD 7604c067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: null_blk
CPU: 2 PID: 1926 Comm: bash Not tainted 4.3.0-rc2+ #24
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: 8800724cd1c0 ti: 880070a2c000 task.ti: 880070a2c000
RIP: 0010:[] [] cpumask_set_cpu+0x6/0xd
RSP: 0018:880070a2fbc8 EFLAGS: 00010203
RAX: 880073eedc00 RBX: 88006cc88000 RCX: 88006c06b000
RDX: 0007 RSI: 0080 RDI: 0008
RBP: 880070a2fbc8 R08: 88006c06ac00 R09: 88006c06ad48
R10: 88004ea8 R11: 88006c069650 R12: 88007378fe28
R13: 0008 R14: e8500200 R15: 81d2a630
FS: 7fa34803b700() GS:88007cc4() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 0080 CR3: 761d2000 CR4: 06e0
Stack:
 880070a2fc18 8128edec 880073eedc00 0039
 88006cc88000 0007 ffe3 81cef2c0
 880070a2fc38 8129049a
Call Trace:
 [] blk_mq_map_swqueue+0x9d/0x206
 [] blk_mq_queue_reinit_notify+0xe3/0x144
 [] notifier_call_chain+0x37/0x63
 [] __raw_notifier_call_chain+0xe/0x10
 [] __cpu_notify+0x20/0x32
 [] cpu_notify_nofail+0x13/0x1b
 [] _cpu_down+0x18a/0x264
 [] ? path_put+0x1f/0x23
 [] cpu_down+0x2d/0x3a
 [] cpu_subsys_offline+0x14/0x16
 [] device_offline+0x65/0x94
 [] online_store+0x48/0x68
 [] ? kernfs_fop_write+0x6f/0x143
 [] dev_attr_store+0x20/0x22
 [] sysfs_kf_write+0x3c/0x3e
 [] kernfs_fop_write+0xed/0x143
 [] __vfs_write+0x28/0xa6
 [] ? security_file_permission+0x3c/0x44
 [] ? percpu_down_read+0x21/0x42
 [] ? __sb_start_write+0x24/0x41
 [] vfs_write+0x8d/0xd1
 [] SyS_write+0x59/0x83
 [] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 03 75 06 65 48 ff 0a eb 1a f0 48 83 af 68 07 00 00 01 74 02 eb 0d
48 8d bf 68 07 00 00 ff 90 78 07 00 00 5d c3 55 89 ff 48 89 e5 48
<7d05482a> 0f ab 3e 5d c3 0f 1f 44 00 00 55 8b 4e 44 31 d2 8b b7 94
RIP [] cpumask_set_cpu+0x6/0xd
RSP CR2: 0080

How to reproduce:
1. Create an 80-vcpu guest with 10 cores, 8 threads
2. modprobe null_blk submit_queues=64
3. for i in 72 73 74 75 76 77 78 79 ; do
       echo 0 > /sys/devices/system/cpu/cpu$i/online
   done

Reason: We try to set the freed hwctx->tags->cpumask in
blk_mq_map_swqueue(). Introduced by commit f26cdc8536ad ("blk-mq:
Shared tag enhancements").

What is happening: When a certain number of cpus are onlined/offlined,
resulting in blk_mq_update_queue_map(), we could potentially end up
with a new mapping to an hwctx. The subsequent blk_mq_map_swqueue() of
the request_queue tries to set hwctx->tags->cpumask, which was already
freed by blk_mq_free_rq_map() in an earlier iteration when it was not
mapped.

Fix: Set hwctx->tags->cpumask only after blk_mq_init_rq_map() is done.
hwctx->tags->cpumask does not follow hwctx->cpumask after a new
mapping, even in the cases where the new mapping does not cause a
problem. That is also fixed with this change.

This problem was originally found on PowerVM, which had 160 cpus (SMT8)
and 128 nr_hw_queues. The dump was easily reproduced by offlining the
last core, and it has been a blocker issue because cpu hotplug is a
common case for DLPAR.
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4..39a7834 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,7 +1811,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)

 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1836,6 +1835,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)

 		if (!set->tags[i])
 			set->tags[i] = blk_mq_init_rq_map(set, i);
 		hctx->tags = set->tags[i];
+		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 		WARN_ON(!hctx->tags);

 	/*
--
1.7.11.7
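The essence of the patch above is an ordering fix: write into `tags->cpumask` only after the tags for the mapped hardware context are known to be allocated. Here is a small userspace sketch of that ordering (the struct layout and function names are simplified stand-ins, not the blk-mq code):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of the bug: the old code wrote tags->cpumask while
 * walking the cpu->hctx map, before per-hctx tags were (re)allocated,
 * so it could scribble on a tags struct freed by an earlier remap. */
struct tags { unsigned long cpumask; };
struct hctx { unsigned long cpumask; struct tags *tags; };

static void map_swqueue_fixed(struct hctx *h, struct tags **set_tags, int idx,
                              unsigned long online_cpus)
{
	/* Step 1: build hctx->cpumask only (no tags access here). */
	h->cpumask = online_cpus;

	/* Step 2: (re)allocate tags for this mapped hctx if needed,
	 * mirroring "if (!set->tags[i]) set->tags[i] = blk_mq_init_rq_map(...)". */
	if (!set_tags[idx])
		set_tags[idx] = calloc(1, sizeof(struct tags));
	h->tags = set_tags[idx];

	/* Step 3: only now mirror the mask into the freshly valid tags,
	 * the role cpumask_copy() plays in the patch. */
	h->tags->cpumask = h->cpumask;
}
```

Copying the whole mask in step 3 (rather than setting bits incrementally in step 1) also keeps `tags->cpumask` in sync with `hctx->cpumask` after a remap, which is the secondary fix the changelog mentions.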
Re: [RFC, 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 10/06/2015 03:47 PM, Michael Ellerman wrote:
> On Sun, 2015-27-09 at 18:29:09 UTC, Raghavendra K T wrote:
>> We access the numa_cpu_lookup_table array directly in all the places
>> where we read/update numa cpu lookup information. Instead, use a
>> helper function for updates. This is helpful for changing the way the
>> numa<-->cpu mapping is done in a single place when needed.
>>
>> This is a cosmetic change, no change in functionality.
>>
>> Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/mmzone.h |  2 +-
>>  arch/powerpc/kernel/smp.c         | 10 +-
>>  arch/powerpc/mm/numa.c            | 28 +---
>>  3 files changed, 23 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h
>> index 7b58917..c24a5f4 100644
>> --- a/arch/powerpc/include/asm/mmzone.h
>> +++ b/arch/powerpc/include/asm/mmzone.h
>> @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[];
>>   * Following are specific to this numa platform.
>>   */
>>
>> -extern int numa_cpu_lookup_table[];
>> +extern int numa_cpu_lookup(int cpu);
>
> Can you rename it better :)
>
> Something like cpu_to_nid().

Good name, sure.

> Although maybe nid is wrong given the rest of the series.

Maybe not. The current plan is to rename (after discussing with Nish)
chipid to pnid (physical nid) and nid to vnid (virtual nid) within
powerpc numa.c [reasoning: chipid is applicable only to OPAL; since we
want to handle PowerKVM, PowerVM and bare metal, we need a generic
name]. But the 'nid' naming will be retained where it is applicable to
generic kernel interactions.
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 8b9502a..d5e6eee 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS];
>>  cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
>>  struct pglist_data *node_data[MAX_NUMNODES];
>>
>> -EXPORT_SYMBOL(numa_cpu_lookup_table);
>>  EXPORT_SYMBOL(node_to_cpumask_map);
>>  EXPORT_SYMBOL(node_data);
>>
>> @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn,
>>  	return 0;
>>  }
>>
>> -static void reset_numa_cpu_lookup_table(void)
>> +int numa_cpu_lookup(int cpu)
>>  {
>> -	unsigned int cpu;
>> -
>> -	for_each_possible_cpu(cpu)
>> -		numa_cpu_lookup_table[cpu] = -1;
>> +	return numa_cpu_lookup_table[cpu];
>>  }
>> +EXPORT_SYMBOL(numa_cpu_lookup);
>
> I don't see you changing any modular code that uses this, or any
> macros that might be used by modules, so I don't see why this needs to
> be exported? I think you just added it because numa_cpu_lookup_table
> was exported?

arch/powerpc/kernel/smp.c uses it.
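The accessor pattern being reviewed above is small but worth spelling out: wrapping the raw lookup array in read/write helpers means a later change to the cpu-to-node mapping scheme touches one place instead of every caller. A userspace sketch of the pattern (using the reviewer's suggested `cpu_to_nid` name; all names here are illustrative, not the final kernel code):

```c
#include <assert.h>

/* Hypothetical model of the helper-accessor pattern: callers never
 * touch the raw table, so its representation can change in one place. */
#define NR_CPUS 8
static int numa_cpu_lookup_table[NR_CPUS];

/* Read accessor (the role numa_cpu_lookup()/cpu_to_nid() plays). */
static int cpu_to_nid(int cpu)
{
	return numa_cpu_lookup_table[cpu];
}

/* Write accessor: the single place where the mapping is updated. */
static void update_numa_cpu_lookup_table(int cpu, int nid)
{
	numa_cpu_lookup_table[cpu] = nid;
}

/* Reset to "no node known", as the kernel helper does with -1. */
static void reset_numa_cpu_lookup_table(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		update_numa_cpu_lookup_table(cpu, -1);
}
```

If the table were later replaced by, say, a chipid-indirected lookup, only `cpu_to_nid()` and the write accessor would change; every caller keeps compiling unmodified.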
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 10/06/2015 03:55 PM, Michael Ellerman wrote:
> On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:
>> Problem description:
>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>> numbered (possibly) as 0,1,16,17. At a lower level, the chipid we get
>> from the device tree is naturally (directly) mapped to the nid.
>>
>> A potential side effect of that is:
>>
>> 1) There are several places in the kernel that assume serial node
>> numbering, and memory allocations assume that all the nodes from 0 to
>> (highest nid) exist, in turn ending up allocating memory for nodes
>> that do not exist.
>
> Is it several? Or lots? If it's several, ie. more than two but not
> lots, then we should probably just fix those places. Or is that
> /really/ hard for some reason?

It is several, and I did attempt to fix them. But the rest of the
places (like memcg, workqueue, the scheduler and so on) are tricky to
fix because the memory allocations are glued to other things, and
similar fixes may be expected in the future too.

> Do we ever get whole nodes hotplugged in under PowerVM? I don't think
> so, but I don't remember for sure.

Even on PowerVM we do have discontiguous numa nodes. [Adding more to
it, we could even end up creating a dummy node 0 just to make the
kernel happy.] For example:

available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7
node 7 size: 10240 MB
node 7 free: 8174 MB
node distances:
node   0   7
  0:  10  40
  7:  40  10

Note that node zero has neither cpus nor memory.

>> 2) For virtualization use cases (such as qemu, libvirt, openstack),
>> mapping the sparse nids of the host system to contiguous nids of the
>> guest (numa affinity, placement) could be a challenge.
>
> Can you elaborate? That's a bit vague.

One example I can think of (though the libvirt/openstack people will
know more about it): suppose one wishes to bind half of the vcpus to
one physical node and the rest of the vcpus to a second numa node; we
can't say whether the second node is 1, 8, or 16.
Also, the same libvirt XML valid on one two-node system may not be
valid on another two-node system. [I believe it may cause some
migration problems too.]

>> Possible solutions:
>>
>> 1) Handle the memory allocations in the kernel case by case: though
>> in some cases it is easy to achieve, some cases may be intrusive/not
>> trivial. In the end it does not handle side effect (2) above.
>>
>> 2) Map the sparse chipid got from the device tree to a serial nid at
>> the kernel level (the idea proposed in this series).
>> Pro: it is more natural to handle at the kernel level than at a lower
>> (OPAL) layer.
>> Con: the chipid in the device tree is no longer the same as the nid
>> in the kernel.
>>
>> 3) Let the lower layer (OPAL) give serial node ids after parsing the
>> chipid and the associativity etc. [either as a separate item in the
>> device tree or by compacting the chipid numbers].
>> Pros: kernel and device tree are on the same page, and less change in
>> the kernel.
>> Con: is that the functionality expected in the lower layer?
>
>> 3) Numactl tests from
>> ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz
>> (in fact there was more breakage before the patch because of the
>> sparse nid and memoryless node cases of powerpc)
>
> This is probably the best argument for your series. ie. userspace is
> dumb and fixing every broken app that assumes linear node numbering is
> not feasible.
>
> So on the whole I think the concept is good. This series though is a
> bit confusing because of all the renaming etc. etc. Nish made lots of
> good comments so I'll wait for a v2 based on those.

Yes, I will be sending a v2 soon, extending my patch to fix the
PowerVM case too.
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/30/2015 01:16 AM, Denis Kirjanov wrote:
> On 9/29/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote:
>> [...]
>>
>> I have pushed the patches to
>> https://github.com/ktraghavendra/linux/tree/serialnuma_v1
>> if it makes the patches easier to grab.
>
> Thanks! One sad thing is that I can't test the actual node id mapping
> now, since currently I only have access to a machine with one memory
> node :/ Can we fake it through qemu?

Faking the sparse numa ids is possible with Nish's patch for qemu:
https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg05826.html
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote:
> On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote:
>> On 9/27/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote:
>>> Problem description:
>>> Powerpc has sparse node numbering, i.e. on a 4 node system nodes are
>>> numbered (possibly) as 0,1,16,17. At a lower level, the chipid we
>>> get from the device tree is naturally (directly) mapped to the nid.
>>
>> Interesting thing to play with. I'll try to test it on my POWER7 box,
>> but it doesn't have the OPAL layer :(

Hi Denis,

Thanks for your interest. I have pushed the patches to
https://github.com/ktraghavendra/linux/tree/serialnuma_v1
if it makes the patches easier to grab.

> Note that it's also interesting to try it under PowerVM, with odd
> NUMA topologies, and report any issues found :)

Thanks Nish, I'll also grab a PowerVM system and test.
Re: [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
On 09/28/2015 10:57 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote:
>> There is no change in the functionality.
>>
>> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/mm/numa.c | 42 +-
>>  1 file changed, 21 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index d5e6eee..f84ed2f 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid,
>>  	}
>>  }
>>
>> -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>>   * info is found.
>>   */
>> -static int associativity_to_nid(const __be32 *associativity)
>> +static int associativity_to_chipid(const __be32 *associativity)
>
> This is confusing to me. This function is also used by the DLPAR code
> under PowerVM to indicate what node the CPU is on -- not a chip (which
> I don't believe is exposed at all under PowerVM).

Good point. Should I retain the name nid? Or any suggestions instead of
chipid -> nid which fit both cases? Or should I rename it to something
like nid -> vnid?

> [...]
>
>> @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void)
>>  		/* Use associativity from first thread for all siblings */
>>  		vphn_get_associativity(cpu, associativity);
>> -		new_nid = associativity_to_nid(associativity);
>> +		new_nid = associativity_to_chipid(associativity);
>
> If you are getting a chipid, shouldn't you be assigning it to a
> variable called 'new_chipid'?

Yes, perhaps. My idea for splitting the series was:

1. change the nid name in functions to chipid (without changing the nid
   variables calling those functions)
2. rename the variables to chipid and assign nid = chipid (1:1 mapping)
3. now let nid = mapped chipid

But I see that it isn't consistent in some places. Do you think merging
step 1 and step 2 is okay?
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 09/28/2015 10:58 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
>> Once we have made the distinction between nid and chipid, create a
>> 1:1 mapping between them. This makes compacting the nids easy later.
>
> Didn't the previous patch just do the opposite of...

As per my thoughts it was:

1. rename the functions to say loudly that it is a chipid (and not a
   nid)
2. then assign nid = chipid, so that we are clear that we made the
   nid:chipid 1:1 mapping, and compact the nids later.

But again, maybe I should combine patches 2 and 3.
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:
> [...]
>> 2) Map the sparse chipid got from the device tree to a serial nid at
>> the kernel level (the idea proposed in this series).
>> Pro: it is more natural to handle at the kernel level than at a lower
>> (OPAL) layer.
>> Con: the chipid in the device tree is no longer the same as the nid
>> in the kernel.
>
> Is there any debugging/logging? Looks like not -- so how does a
> sysadmin map from firmware-provided values to the Linux values? That's
> going to make debugging of large systems (PowerVM or otherwise) less
> than pleasant, it seems? Possibly you could put something in sysfs?

I see two things that could be done here:

1) While doing dump_numa_cpu_topology() we can dump nid_to_chipid() as
   additional information.

2) sysfs: does /sys/devices/system/node/nodeX/*chipid* look good?
   Maybe we should add it only for powerpc; otherwise we would need to
   have chipid = nid populated for the other archs. [I think this
   change can be done slowly.]
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote:
> On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:
>> Once we have made the distinction between nid and chipid, create a
>> 1:1 mapping between them. This makes compacting the nids easy later.
>>
>> No functionality change.
>>
>> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/mm/numa.c | 36 +---
>>  1 file changed, 29 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index f84ed2f..dd2073b 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -264,6 +264,17 @@ out:
>>  	return chipid;
>>  }
>>
>> + /* Return the nid from associativity */
>> +static int associativity_to_nid(const __be32 *associativity)
>> +{
>> +	int nid;
>> +
>> +	nid = associativity_to_chipid(associativity);
>> +	return nid;
>> +}
>
> This is ultimately confusing. You are assigning the semantic return
> value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the
> variable naming be consistent? :(

Yes, I will come up with some consistent naming.
Re: [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
On 09/28/2015 11:02 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote: Create arrays that map serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by drivers/acpi/numa.c, which maps a proximity node to a logical node, by Takayoshi Kochi <t-ko...@bq.jp.nec.com>, and thus uses an additional chipid_map nodemask. The mask helps in finding the first unused nid easily, by knowing the first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; Hrm, conceptually there are *more* chips than nodes, right? So what guarantees we won't see > MAX_NUMNODES chips? You are correct that #nids <= #chipids, and #nids = #chipids when all possible slots are populated. Considering we assume that the maximum chip slots are no more than MAX_NUMNODES, how about having #define MAX_CHIPNODES MAX_NUMNODES and chipid_to_nid_map[MAX_CHIPNODES] = { [0 ... MAX_CHIPNODES - 1] = .. +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; Do you really want to support these cases? 
Or should they be bugs/warnings indicating that you got an unexpected input? Or at least WARN_ON_ONCE? Right. Querying for the nid of an invalid chipid should be at least WARN_ON_ONCE(). But I'll check once if there is any valid scenario before the change. + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} chip <-> node mapping is a static (physical) concept, right? Should we emit some debugging if for some reason we get a runtime call to remap an already mapped chip to a new node? Good point. Mapping an already mapped chipid to a different nid is unexpected, whereas mapping a chipid to the same nid is expected (because the mapping comes from cpus belonging to the same node). WARN_ON() should suffice here? + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; If you create a KVM guest with a bogus topology, doesn't this just start losing NUMA information for very high-noded guests? I'll try to see if it is possible to hit this case; ideally we should not allow more than MAX_NUMNODES for chipids and we should abort early. + nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - stray change? yep, will correct that.
Re: [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 09/27/2015 11:59 PM, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- Sorry: changelog edit messed it.. :( arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. 
*/ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { 
update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { +
[PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. 
*/ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { 
update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_associativity_changes_mask, _associativity_changes_mask,
[PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} + + /* Returns the chipid associated with the given device tree node, * or -1 if not found. */ @@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device) return chipid; } +/* + * Returns nid from the chipid associated with given tree node + */ +static int of_node_to_nid_single(struct device_node *device) +{ + int nid; + + nid = of_node_to_chipid_single(device); + return nid; +} + /* Walk the device tree upwards, looking for an associativity id */ int of_node_to_nid(struct device_node *device) { @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_chipid_single(device); + nid = of_node_to_nid_single(device); if (nid != -1) break; @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_chipid_single() for memory represented in the + * This is like of_node_to_nid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. 
*/ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); of_node_put(cpu); /* @@ -796,7 +818,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. */ - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); break; } @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_chipid(associativity); + new_nid = associativity_to_nid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids
Also properly initialize the numa distance table for serial nids. Problem: Powerpc supports sparse nid numbering, which could affect 1) memory footprint 2) virtualization use cases. Current solution: The patch maps sparse chipids got from the device tree to serial nids. Result before: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Testing: Scenarios tested on baremetal and KVM guest with 4 nodes 1) offlining and onlining memory and cpus 2) Running the tests from numactl source. 3) Creating 1000s of docker containers stressing the system Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f015cad..873ac8c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity) /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(chipid, associativity + 1); + initialize_distance_lookup_table(chipid_to_nid(chipid), +associativity + 1); } out: @@ -314,9 +315,10 @@ out: /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { - int nid; + int chipid, nid; - nid = associativity_to_chipid(associativity); + chipid = associativity_to_chipid(associativity); + nid = map_chipid_to_nid(chipid); return nid; } @@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device) */ static int of_node_to_nid_single(struct device_node *device) { - int nid; - nid = of_node_to_chipid_single(device); + int chipid, nid; + chipid = of_node_to_chipid_single(device); + nid = map_chipid_to_nid(chipid); return nid; } -- 1.7.11.7
[PATCH RFC 0/5] powerpc:numa Add serial nid support
Problem description: Powerpc has sparse node numbering, i.e. on a 4-node system the nodes may be numbered as 0,1,16,17. At a lower level, the chipid got from the device tree is naturally mapped (directly) to the nid. Potential side effects of that are: 1) There are several places in the kernel that assume serial node numbering, and memory allocations assume that all the nodes from 0 to (highest nid) exist, in turn allocating memory for nodes that do not exist. 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping sparse nids of the host system to contiguous nids of the guest (numa affinity, placement) could be a challenge. Possible solutions: 1) Handle the memory allocations in the kernel case by case: though in some cases it is easy to achieve, some cases may be intrusive/not trivial; in the end it does not handle side effect (2) above. 2) Map the sparse chipid got from the device tree to a serial nid at kernel level (the idea proposed in this series). Pro: It is more natural to handle at kernel level than at a lower (OPAL) layer. Con: The chipid in the device tree is no longer the same as the nid in the kernel. 3) Let the lower layer (OPAL) give the serial node ids after parsing the chipid and the associativity etc. [either as a separate item in the device tree or by compacting the chipid numbers]. Pro: kernel and device tree are on the same page, and less change in the kernel. Con: is this the functionality expected in the lower layer? As mentioned above, the current patch series tries to map the chipid from the lower layer to a contiguous nid at kernel level, keeping the node distance calculation and so on intact. 
Result: Before the patch: numactl -H available: 4 nodes (0-1,16-17) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 29836 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32019 MB node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 16 size: 32571 MB node 16 free: 31222 MB node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 17 size: 0 MB node 17 free: 0 MB node distances: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After the patch: numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 30657 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32566 MB node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 2 size: 32571 MB node 2 free: 32401 MB node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 (note that numa distances are intact). Apart from this, The following tests are done with the patched kernel (both baremetal and KVM guest with multiple nodes) to ensure there is no breakage. 
1) offlining and onlining of memory in /sys/devices/system/node/nodeX path 2) offlining and onlining of cpus in /sys/devices/system/cpu/ path 3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases of powerpc) 4) Thousands of docker containers were spawned. Please let me know your comments. patch 1-3: cleanup patches patch 4: Adds helper function to map nid and chipid patch 5: Uses the mapping to get serial nid Raghavendra K T (5): powerpc:numa Add numa_cpu_lookup function to update lookup table powerpc:numa Rename functions referring to nid as chipid powerpc:numa create 1:1 mapping between chipid and nid powerpc:numa Add helper functions to maintain chipid to nid mapping powerpc:numa Use chipid to nid mapping to get serial numa node ids arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 ++-- arch/powerpc/mm/numa.c | 121 +++--- 3 files changed, 105 insertions(+), 28 deletions(-) -- 1.7.11.7
[PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
There is no change in the functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) { - int nid = -1; + int chipid = -1; if (min_common_depth == -1) goto out; if (of_read_number(associativity, 1) >= min_common_depth) - nid = of_read_number(&associativity[min_common_depth], 1); + chipid = of_read_number(&associativity[min_common_depth], 1); /* POWER4 LPAR uses 0xffff as invalid node */ - if (nid == 0xffff || nid >= MAX_NUMNODES) - nid = -1; + if (chipid == 0xffff || chipid >= MAX_NUMNODES) + chipid = -1; - if (nid > 0 && + if (chipid > 0 && of_read_number(associativity, 1) >= distance_ref_points_depth) { /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(nid, associativity + 1); + initialize_distance_lookup_table(chipid, associativity + 1); } out: - return nid; + return chipid; } -/* Returns the nid associated with the given device tree node, +/* Returns the chipid associated with the given device tree node, * or -1 if not found. 
*/ -static int of_node_to_nid_single(struct device_node *device) +static int of_node_to_chipid_single(struct device_node *device) { - int nid = -1; + int chipid = -1; const __be32 *tmp; tmp = of_get_associativity(device); if (tmp) - nid = associativity_to_nid(tmp); - return nid; + chipid = associativity_to_chipid(tmp); + return chipid; } /* Walk the device tree upwards, looking for an associativity id */ @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_nid_single(device); + nid = of_node_to_chipid_single(device); if (nid != -1) break; @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_nid_single() for memory represented in the + * This is like of_node_to_chipid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); of_node_put(cpu); /* @@ -796,7 +796,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. 
*/ - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); break; } @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
Create arrays that map serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by drivers/acpi/numa.c, which maps a proximity node to a logical node, by Takayoshi Kochi <t-ko...@bq.jp.nec.com>, and thus uses an additional chipid_map nodemask. The mask helps in finding the first unused nid easily, by knowing the first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... 
MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; + nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { -- 1.7.11.7 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
On 09/22/2015 10:59 AM, Michael Ellerman wrote: On Tue, 2015-09-15 at 07:38 +0530, Raghavendra K T wrote: ... nothing Sure this patch looks obvious, but please give me a changelog that proves you've thought about it thoroughly. For example is it OK to use for_each_node() at this point in boot? Is there any historical reason why we did it with a hard coded loop? If so what has changed. What systems have you tested on? etc. etc. cheers Changelog: With setup_nr_nodes(), we have already initialized node_possible_map, so it is safe to use for_each_node here. There are many places in the kernel that use a hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That is probably why we maintained the same for powerpc. But since on power we can have sparse numa node ids, we unnecessarily allocate memory for non-existent numa nodes. For e.g., on a system with 0,1,16,17 as numa nodes, nr_node_ids = 18 and we allocate memory for nodes 2-15. The patch is boot tested on a 4-node Tuleta [confirmed with printks] that it works as expected. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]); /* cpumask_of_node() will now work */
Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/14/2015 02:30 PM, Vladimir Davydov wrote: Hi, On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make a difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., a 4 node system having numa ids 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps the patch makes the code more readable. Do I understand correctly that node 0 must always be in node_possible_map? I ask, because we currently test lru->node[0].memcg_lrus to determine if the list is memcg aware. Yes, node 0 is always there. So it should not be a problem. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..5a97f83 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (!memcg_aware) lru->node[i].memcg_lrus = NULL; So, we don't explicitly initialize memcg_lrus for nodes that are not in node_possible_map. That's OK, because we allocate lru->node using kzalloc. However, this partial nullifying in case !memcg_aware looks confusing IMO. Let's drop it, I mean something like this: Yes, you are right, and we do not have to have the memcg_aware check inside the for loop too. Will change as per your suggestion and send V2. Thanks for the review. static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; if (!memcg_aware) return 0; for_each_node(i) { if (memcg_init_list_lru_node(&lru->node[i])) goto fail; }
Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/14/2015 05:34 PM, Vladimir Davydov wrote:
> On Mon, Sep 14, 2015 at 05:09:31PM +0530, Raghavendra K T wrote:
>> On 09/14/2015 02:30 PM, Vladimir Davydov wrote:
>>> On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote:
>>>> The functions used in the patch are in slowpath, which gets called
>>>> whenever alloc_super is called during mounts. Though this should
>>>> not make a difference for architectures with sequential numa node
>>>> ids, for powerpc, which can potentially have sparse node ids
>>>> (e.g., a 4-node system with numa ids 0,1,16,17 is common), this
>>>> patch saves some unnecessary allocations for non-existing numa
>>>> nodes. Even without that saving, the patch perhaps makes the code
>>>> more readable.
>>>
>>> Do I understand correctly that node 0 must always be in
>>> node_possible_map? I ask, because we currently test
>>> lru->node[0].memcg_lrus to determine if the list is memcg aware.
>>
>> Yes, node 0 is always there. So it should not be a problem.
>
> I think it should be mentioned in the comment to list_lru_memcg_aware
> then.

Something like this:?

 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
	/*
	 * This needs node 0 to be always present, even
	 * in the systems supporting sparse numa ids.
	 */
	return !!lru->node[0].memcg_lrus;
 }
[PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
The functions touched by this patch are in the slowpath, which gets called whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with sequential numa node ids, for powerpc, which can potentially have sparse node ids (e.g., a 4-node system with numa ids 0,1,16,17 is common), this patch saves some unnecessary allocations for non-existing numa nodes. Even without that saving, the patch perhaps makes the code more readable.

[ Take memcg_aware check outside for_each loop: Vldimir]

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 mm/list_lru.c | 34 +++---
 1 file changed, 23 insertions(+), 11 deletions(-)

Changes in V2:
 - Take memcg_aware check outside for_each loop (Vldimir)
 - Add comment that node 0 should always be present (Vladimir)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 909eca2..60cd6dd 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -42,6 +42,10 @@ static void list_lru_unregister(struct list_lru *lru)
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
+	/*
+	 * This needs node 0 to be always present, even
+	 * in the systems supporting sparse numa ids.
+	 */
 	return !!lru->node[0].memcg_lrus;
 }

@@ -377,16 +381,20 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
 	int i;

-	for (i = 0; i < nr_node_ids; i++) {
-		if (!memcg_aware)
-			lru->node[i].memcg_lrus = NULL;
-		else if (memcg_init_list_lru_node(&lru->node[i]))
+	if (!memcg_aware)
+		return 0;
+
+	for_each_node(i) {
+		if (memcg_init_list_lru_node(&lru->node[i]))
 			goto fail;
 	}
 	return 0;
 fail:
-	for (i = i - 1; i >= 0; i--)
+	for (i = i - 1; i >= 0; i--) {
+		if (!lru->node[i].memcg_lrus)
+			continue;
 		memcg_destroy_list_lru_node(&lru->node[i]);
+	}
 	return -ENOMEM;
 }

@@ -397,7 +405,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
 	if (!list_lru_memcg_aware(lru))
 		return;

-	for (i = 0; i < nr_node_ids; i++)
+	for_each_node(i)
 		memcg_destroy_list_lru_node(&lru->node[i]);
 }

@@ -409,16 +417,20 @@ static int memcg_update_list_lru(struct list_lru *lru,
 	if (!list_lru_memcg_aware(lru))
 		return 0;

-	for (i = 0; i < nr_node_ids; i++) {
+	for_each_node(i) {
 		if (memcg_update_list_lru_node(&lru->node[i],
 					       old_size, new_size))
 			goto fail;
 	}
 	return 0;
 fail:
-	for (i = i - 1; i >= 0; i--)
+	for (i = i - 1; i >= 0; i--) {
+		if (!lru->node[i].memcg_lrus)
+			continue;
+
 		memcg_cancel_update_list_lru_node(&lru->node[i],
 						  old_size, new_size);
+	}
 	return -ENOMEM;
 }

@@ -430,7 +442,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru,
 	if (!list_lru_memcg_aware(lru))
 		return;

-	for (i = 0; i < nr_node_ids; i++)
+	for_each_node(i)
 		memcg_cancel_update_list_lru_node(&lru->node[i],
 						  old_size, new_size);
 }

@@ -485,7 +497,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
 	if (!list_lru_memcg_aware(lru))
 		return;

-	for (i = 0; i < nr_node_ids; i++)
+	for_each_node(i)
 		memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
 }

@@ -522,7 +534,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 	if (!lru->node)
 		goto out;

-	for (i = 0; i < nr_node_ids; i++) {
+	for_each_node(i) {
 		spin_lock_init(&lru->node[i].lock);
 		if (key)
 			lockdep_set_class(&lru->node[i].lock, key);
--
1.7.11.7
[PATCH V2 0/2] Replace nr_node_ids for loop with for_each_node
Many places in the kernel use a 'for' loop with nr_node_ids. On architectures that support sparse numa ids, this results in some unnecessary allocations for non-existing nodes. (For example, numa node numbers such as 0,1,16,17 are common on powerpc.)

So replace the for loop with for_each_node so that allocations happen only for existing numa nodes.

Please note that, though there are many places where nr_node_ids is used, the current patchset uses for_each_node only in slowpaths, to avoid the find_next_bit traversal.

Changes in V2:
 - Take memcg_aware check outside for_each loop (Vldimir)
 - Add comment that node 0 should always be present (Vladimir)

Raghavendra K T (2):
  mm: Replace nr_node_ids for loop with for_each_node in list lru
  powerpc:numa Do not allocate bootmem memory for non existing nodes

 arch/powerpc/mm/numa.c |  2 +-
 mm/list_lru.c          | 34 +++---
 2 files changed, 24 insertions(+), 12 deletions(-)

--
1.7.11.7
[PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..8d8a541 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void)
 		setup_nr_node_ids();

 	/* allocate the map */
-	for (node = 0; node < nr_node_ids; node++)
+	for_each_node(node)
 		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);

 	/* cpumask_of_node() will now work */
--
1.7.11.7
Re: [PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/15/2015 07:38 AM, Raghavendra K T wrote:
> The functions used in the patch are in slowpath, which gets called
> whenever alloc_super is called during mounts. Though this should not
> make a difference for architectures with sequential numa node ids, for
> powerpc, which can potentially have sparse node ids (e.g., a 4-node
> system with numa ids 0,1,16,17 is common), this patch saves some
> unnecessary allocations for non-existing numa nodes. Even without that
> saving, the patch perhaps makes the code more readable.
>
> [ Take memcg_aware check outside for_each loop: Vldimir]
>
> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> ---

Sorry that I had misspelled Vladimir above.
[PATCH 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8b9502a..8d8a541 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void)
 		setup_nr_node_ids();

 	/* allocate the map */
-	for (node = 0; node < nr_node_ids; node++)
+	for_each_node(node)
 		alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);

 	/* cpumask_of_node() will now work */
--
1.7.11.7
[PATCH 0/2] Replace nr_node_ids for loop with for_each_node
Many places in the kernel use a 'for' loop with nr_node_ids. On architectures that support sparse numa ids, this results in some unnecessary allocations for non-existing nodes. (For example, numa node numbers such as 0,1,16,17 are common on powerpc.)

So replace the for loop with for_each_node so that allocations happen only for existing numa nodes.

Please note that, though there are many places where nr_node_ids is used, the current patchset uses for_each_node only in slowpaths, to avoid the find_next_bit traversal.

Raghavendra K T (2):
  mm: Replace nr_node_ids for loop with for_each_node in list lru
  powerpc:numa Do not allocate bootmem memory for non existing nodes

 arch/powerpc/mm/numa.c |  2 +-
 mm/list_lru.c          | 23 +++
 2 files changed, 16 insertions(+), 9 deletions(-)

--
1.7.11.7
[PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts.

Though this should not make a difference for architectures with sequential numa node ids, for powerpc, which can potentially have sparse node ids (e.g., a 4-node system with numa ids 0,1,16,17 is common), this patch saves some unnecessary allocations for non-existing numa nodes. Even without that saving, the patch perhaps makes the code more readable.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 mm/list_lru.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 909eca2..5a97f83 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
 	int i;

-	for (i = 0; i < nr_node_ids; i++) {
+	for_each_node(i) {
 		if (!memcg_aware)
 			lru->node[i].memcg_lrus = NULL;
 		else if (memcg_init_list_lru_node(&lru->node[i]))
@@ -385,8 +385,11 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 	}
 	return 0;
 fail:
-	for (i = i - 1; i >= 0; i--)
+	for (i = i - 1; i >= 0; i--) {
+		if (!lru->node[i].memcg_lrus)
+			continue;
 		memcg_destroy_list_lru_node(&lru->node[i]);
+	}
 	return -ENOMEM;
 }

@@ -397,7 +400,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru)
 	if (!list_lru_memcg_aware(lru))
 		return;

-	for (i = 0; i < nr_node_ids; i++)
+	for_each_node(i)
 		memcg_destroy_list_lru_node(&lru->node[i]);
 }

@@ -409,16 +412,20 @@ static int memcg_update_list_lru(struct list_lru *lru,
 	if (!list_lru_memcg_aware(lru))
 		return 0;

-	for (i = 0; i < nr_node_ids; i++) {
+	for_each_node(i) {
 		if (memcg_update_list_lru_node(&lru->node[i],
 					       old_size, new_size))
 			goto fail;
 	}
 	return 0;
 fail:
-	for (i = i - 1; i >= 0; i--)
+	for (i = i - 1; i >= 0; i--) {
+		if (!lru->node[i].memcg_lrus)
+			continue;
+
 		memcg_cancel_update_list_lru_node(&lru->node[i],
 						  old_size, new_size);
+	}
 	return -ENOMEM;
 }

@@ -430,7 +437,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru,
 	if (!list_lru_memcg_aware(lru))
 		return;

-	for (i = 0; i < nr_node_ids; i++)
+	for_each_node(i)
 		memcg_cancel_update_list_lru_node(&lru->node[i],
 						  old_size, new_size);
 }

@@ -485,7 +492,7 @@ static void memcg_drain_list_lru(struct list_lru *lru,
 	if (!list_lru_memcg_aware(lru))
 		return;

-	for (i = 0; i < nr_node_ids; i++)
+	for_each_node(i)
 		memcg_drain_list_lru_node(&lru->node[i], src_idx, dst_idx);
 }

@@ -522,7 +529,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware,
 	if (!lru->node)
 		goto out;

-	for (i = 0; i < nr_node_ids; i++) {
+	for_each_node(i) {
 		spin_lock_init(&lru->node[i].lock);
 		if (key)
 			lockdep_set_class(&lru->node[i].lock, key);
--
1.7.11.7
Re: [PATCH v2] powerpc/numa: set node_possible_map to only node_online_map during boot
On 03/06/2015 10:57 AM, Nishanth Aravamudan wrote:
> On 05.03.2015 [15:29:00 -0800], David Rientjes wrote:
>> On Thu, 5 Mar 2015, Nishanth Aravamudan wrote:
>>> So if we compare to x86:
>>>
>>> arch/x86/mm/numa.c::numa_init():
>>>
>>> 	nodes_clear(numa_nodes_parsed);
>>> 	nodes_clear(node_possible_map);
>>> 	nodes_clear(node_online_map);
>>> 	...
>>> 	numa_register_memblks(...);
>>>
>>> arch/x86/mm/numa.c::numa_register_memblks():
>>>
>>> 	node_possible_map = numa_nodes_parsed;
>>>
>>> Basically, it looks like x86 NUMA init clears out the possible map
>>> and the online map, probably for a similar reason to what I gave in
>>> the changelog: by default, the possible map seems to be based off
>>> MAX_NUMNODES, rather than nr_node_ids or anything dynamic. My patch
>>> was an attempt to emulate the same thing on powerpc.
>>>
>>> You are right that there is a window in which the node_possible_map
>>> and node_online_map are out of sync with my patch. It seems like it
>>> shouldn't matter given how early in boot we are, but perhaps the
>>> following would have been clearer:
>>>
>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>> index 0257a7d659ef..1a118b08fad2 100644
>>> --- a/arch/powerpc/mm/numa.c
>>> +++ b/arch/powerpc/mm/numa.c
>>> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>>>
>>>  	memblock_dump_all();
>>>
>>> +	/*
>>> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
>>> +	 * since we do not support node hotplug. This ensures that we
>>> +	 * lower the maximum NUMA node ID to what is actually present.
>>> +	 */
>>> +	nodes_and(node_possible_map, node_possible_map, node_online_map);
>>
>> If you don't support node hotplug, then a node should always be
>> possible if it's online unless there are other tricks powerpc plays
>> with node_possible_map. Shouldn't this just be
>> node_possible_map = node_online_map?
>
> Yeah, but I was too dumb to think of that before sending :) Updated
> version follows...
> -Nish
>
> ---8<---
>
> Raghu noticed an issue with excessive memory allocation on power with a
> simple cgroup test, specifically, in mem_cgroup_css_alloc ->
> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up
> blowing up the kmalloc-2048 slab (to the order of 200MB for 400 cgroup
> directories).

Should we also add that after this patch it has reduced to around 2MB?

> The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes
> possible), which defines node_possible_map, which in turn defines the
> value of nr_node_ids in setup_nr_node_ids and the iteration of
> for_each_node.
>
> In practice, we never see a system with 256 NUMA nodes, and in fact, we
> do not support node hotplug on power in the first place, so the nodes
> that are online when we come up are the nodes that will be present for
> the lifetime of this kernel.
>
> So let's, at least, drop the NUMA possible map down to the online map
> at runtime. This is similar to what x86 does in its initialization
> routines.
>
> mem_cgroup_css_alloc should also be fixed to only iterate over
> memory-populated nodes and handle hotplug, but that is a separate
> change.

Maybe we could formally add:
Reported-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>

> Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> To: Michael Ellerman <m...@ellerman.id.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Tejun Heo <t...@kernel.org>
> Cc: David Rientjes <rient...@google.com>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Cc: Paul Mackerras <pau...@samba.org>
> Cc: Anton Blanchard <an...@samba.org>
> Cc: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
> ---
> v1 -> v2:
>   Rather than clear node_possible_map and set it nid-by-nid, just
>   directly assign node_online_map to it, as suggested by Michael
>   Ellerman and Tejun Heo.
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0257a7d659ef..0c1716cd271f 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -958,6 +958,13 @@ void __init initmem_init(void)
>
>  	memblock_dump_all();
>
> +	/*
> +	 * Reduce the possible NUMA nodes to the online NUMA nodes,
> +	 * since we do not support node hotplug. This ensures that we
> +	 * lower the maximum NUMA node ID to what is actually present.
> +	 */

Hope we remember this change when we add hotplug :)

> +	node_possible_map = node_online_map;
> +
>  	for_each_online_node(nid) {
>  		unsigned long start_pfn, end_pfn;