Re: [tip:sched/urgent] sched: Fix crash in sched_init_numa()
On 01/19/2016 07:08 PM, tip-bot for Raghavendra K T wrote:

Commit-ID:  9c03ee147193645be4c186d3688232fa438c57c7
Gitweb:     http://git.kernel.org/tip/9c03ee147193645be4c186d3688232fa438c57c7
Author:     Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
AuthorDate: Sat, 16 Jan 2016 00:31:23 +0530
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 19 Jan 2016 08:42:20 +0100

sched: Fix crash in sched_init_numa()

The following PowerPC commit:

  c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes")

avoids allocating bootmem memory for non-existent nodes. But when DEBUG_PER_CPU_MAPS=y is enabled, my PowerNV system failed to boot, because in sched_init_numa() the cpumask_or() operation was done on unallocated nodes. Fix that by performing the cpumask_or() operation only on existing nodes.

[ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ]

Reported-by: Jan Stancek <jstan...@redhat.com>
Tested-by: Jan Stancek <jstan...@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Cc: <gk...@linux.vnet.ibm.com>
Cc: <grant.lik...@linaro.org>
Cc: <nik...@linux.vnet.ibm.com>
Cc: <vdavy...@parallels.com>
Cc: <linuxppc-...@lists.ozlabs.org>
Cc: <linux...@kvack.org>
Cc: <pet...@infradead.org>
Cc: <b...@kernel.crashing.org>
Cc: <pau...@samba.org>
Cc: <m...@ellerman.id.au>
Cc: <an...@samba.org>
Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra...@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44253ad..474658b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6840,7 +6840,7 @@ static void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
-			for (k = 0; k < nr_node_ids; k++) {
+			for_each_node(k) {
 				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;

Hello Greg,

The above commit fixes a debug-kernel crash in the 4.4 kernel (when DEBUG_PER_CPU_MAPS=y, to be precise). This is a regression in 4.4 relative to 4.3 and should ideally be present in 4.4-stable. Could you please pull in this change?
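For readers unfamiliar with the helper: for_each_node() iterates only over node IDs present in node_possible_map, while a plain 0..nr_node_ids-1 loop also visits the holes left by sparse numbering (e.g. nodes numbered 0, 1, 16, 17). A minimal sketch of the distinction, with do_something() standing in for the per-node work (it is not the actual scheduler code):

	/* Visits every index up to the highest node ID, including IDs for
	 * which no node (and hence no per-node allocation) exists: */
	for (k = 0; k < nr_node_ids; k++)
		do_something(k);

	/* for_each_node(k) expands to for_each_node_state(k, N_POSSIBLE)
	 * and therefore skips the holes in a sparse node numbering: */
	for_each_node(k)
		do_something(k);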
[PATCH] blk-mq: Avoid memoryless numa node encoded in hctx numa_node
On architectures like powerpc we can have CPUs without any local memory attached to them (a.k.a. memoryless nodes). In such cases the cpu-to-node mapping can populate the memory allocation hint in block's hctx->numa_node with a node value that has no real memory. Instead use local_memory_node(), which is guaranteed to point at a node with memory. local_memory_node() is a no-op on architectures that do not support memoryless nodes.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq-cpumap.c | 2 +-
 block/blk-mq.c        | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Validated the patch on a powerpc system with a memoryless node.

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 8764c24..d0634bc 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -113,7 +113,7 @@ int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
 
 	for_each_possible_cpu(i) {
 		if (index == mq_map[i])
-			return cpu_to_node(i);
+			return local_memory_node(cpu_to_node(i));
 	}
 
 	return NUMA_NO_NODE;

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ada3b4..74be735 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1794,7 +1794,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		 * not, we remain on the home node of the device
 		 */
 		if (nr_hw_queues > 1 && hctx->numa_node == NUMA_NO_NODE)
-			hctx->numa_node = cpu_to_node(i);
+			hctx->numa_node = local_memory_node(cpu_to_node(i));
 	}
 }
--
1.7.11.7
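For context, local_memory_node(nid) lives in mm/page_alloc.c; it returns nid itself when that node has memory, and otherwise the nearest node that does. The real helper walks the node's zonelist; the following is only a conceptual sketch of that behaviour, with a distance loop standing in for the zonelist walk:

	/* Conceptual sketch, not the kernel implementation. */
	static int local_memory_node_sketch(int nid)
	{
		int n, best = nid, best_dist = INT_MAX;

		if (node_state(nid, N_MEMORY))
			return nid;		/* node has memory: identity */

		for_each_node_state(n, N_MEMORY) {
			if (node_distance(nid, n) < best_dist) {
				best_dist = node_distance(nid, n);
				best = n;	/* closest node that has memory */
			}
		}
		return best;
	}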
[PATCH] blk-mq: Reuse hardware context cpumask for tags
hctx->cpumask is already populated, so let the tag cpumask simply follow it instead of being rebuilt by a separate loop over every software context.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

Nish had suggested putting the cpumask_copy() after the WARN_ON (instead of before).

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..6ada3b4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1854,6 +1854,7 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 
 		hctx->tags = set->tags[i];
 		WARN_ON(!hctx->tags);
+		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 
 		/*
 		 * Set the map size to the number of mapped software queues.
 		 * This is more accurate and more efficient than looping
@@ -1867,14 +1868,6 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		hctx->next_cpu = cpumask_first(hctx->cpumask);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
-
-	queue_for_each_ctx(q, ctx, i) {
-		if (!cpumask_test_cpu(i, online_mask))
-			continue;
-
-		hctx = q->mq_ops->map_queue(q, i);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
-	}
 }
 
 static void queue_set_hctx_shared(struct request_queue *q, bool shared)
--
1.7.11.7
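Why the single copy is enough, as a sketch (not the literal blk-mq code): earlier in blk_mq_map_swqueue() every online CPU i that maps to this hardware context already had cpumask_set_cpu(i, hctx->cpumask) done for it, so the set that the removed loop rebuilt bit by bit is exactly hctx->cpumask:

	/* removed loop, per online CPU i mapped to hctx:
	 *	cpumask_set_cpu(i, hctx->tags->cpumask);
	 * equivalent single operation once hctx->tags is valid: */
	cpumask_copy(hctx->tags->cpumask, hctx->cpumask);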
Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add
On 11/24/2015 02:43 AM, Tejun Heo wrote:

Hello,

On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:

While I was creating thousands of docker containers on a power8 bare-metal system (config: 4.3.0 kernel, 1TB RAM, 20 cores = 160 cpus), after creating around 5600 containers I hit the problem below. [This looks similar to https://bugzilla.kernel.org/show_bug.cgi?id=101011, but the kernel already had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9), since it is the 4.3.0 tagged kernel.] Any hints on how to go about the fix? Please let me know if you think any more information is needed. The docker daemon is device-mapper based (and it took a day to recreate the problem). [By disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to create 10k containers without any problem.]

Could be the same problem that Ilya is trying to fix, i.e. blkdev i_wb pointing to a stale wb. Can you please see whether the following patch resolves the issue? http://lkml.kernel.org/g/1448054554-24138-1-git-send-email-idryo...@gmail.com

Hi Tejun,

Thanks again for the pointer. I was now able to create more than 10k containers without any problem with CGROUP_WRITEBACK on, whereas earlier I had hit this problem a few times at around 5k+ containers. (Also replying on Ilya's thread.)
Re: BUG: Unable to handle kernel paging request for data at address __percpu_counter_add
On 11/24/2015 02:43 AM, Tejun Heo wrote:

Hello,

On Thu, Nov 19, 2015 at 03:54:35PM +0530, Raghavendra K T wrote:

While I was creating thousands of docker containers on a power8 bare-metal system (config: 4.3.0 kernel, 1TB RAM, 20 cores = 160 cpus), after creating around 5600 containers I hit the problem below. [This looks similar to https://bugzilla.kernel.org/show_bug.cgi?id=101011, but the kernel already had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9), since it is the 4.3.0 tagged kernel.] Any hints on how to go about the fix? Please let me know if you think any more information is needed. The docker daemon is device-mapper based (and it took a day to recreate the problem). [By disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to create 10k containers without any problem.]

Could be the same problem that Ilya is trying to fix, i.e. blkdev i_wb pointing to a stale wb. Can you please see whether the following patch resolves the issue? http://lkml.kernel.org/g/1448054554-24138-1-git-send-email-idryo...@gmail.com

Thanks, Tejun, for the pointer. Will check whether the patch resolves the issue. (Reproduction takes a long time, so it may take a while to report back.)
BUG: Unable to handle kernel paging request for data at address __percpu_counter_add
Hi, While I was creating thousands of docker container on a power8 baremetal (config: 4.3.0 kernel 1TB RAM, 20core (=160 cpu) system. After creating around 5600 container I have hit below problem. [This is looking similar to https://bugzilla.kernel.org/show_bug.cgi?id=101011, but kernel had Revert "ext4: remove block_device_ejected" (bdfe0cbd746aa9) since it is 4.3.0 tagged kernel] Any hints on how to go about the fix. Please let me know if you think any more information needed. docker daemon is device mapper based. (and it took a day to recreate the problem) [ by disabling CONFIG_BLK_CGROUP and CONFIG_CGROUP_WRITEBACK I am able to create 10k containers without any problem] Nov 14 17:27:00 docker5 kernel: [40161.570029] Unable to handle kernel paging request for data at address 0x3fedfa Nov 14 17:27:00 docker5 kernel: [40161.570125] Faulting instruction address: 0xc056de90 Nov 14 17:27:00 docker5 kernel: [40161.570136] Oops: Kernel access of bad area, sig: 11 [#1] Nov 14 17:27:00 docker5 kernel: [40161.570143] SMP NR_CPUS=256 NUMA PowerNV Nov 14 17:27:00 docker5 kernel: [40161.570177] Modules linked in: veth(E) xt_nat(E) xt_tcpudp(E) xt_addrtype(E) xt_conntrack(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) iptable_filter(E) ip_tables(E) x_tables(E) nf_nat(E) nf_conntrack(E) bridge(E) stp(E) llc(E) dm_thin_pool(E) dm_persistent_data(E) dm_bio_prison(E) dm_bufio(E) libcrc32c(E) uio_pdrv_genirq(E) powernv_rng(E) uio(E) autofs4(E) ses(E) enclosure(E) mlx4_en(E) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) lpfc(E) mlx4_core(E) scsi_transport_fc(E) ipr(E) Nov 14 17:27:00 docker5 kernel: [40161.570755] CPU: 154 PID: 77177 Comm: docker Tainted: GE 4.3.0+ #34 Nov 14 17:27:00 docker5 kernel: [40161.570830] task: c0eaec7f2780 ti: c0eaa4ac task.ti: c0eaa4ac Nov 14 17:27:00 docker5 kernel: [40161.570904] NIP: c056de90 LR: c02273e0 CTR: Nov 14 17:27:00 docker5 kernel: [40161.570978] REGS: c0eaa4ac3530 TRAP: 0300 Tainted: GE(4.3.0+) Nov 14 17:27:00 docker5 kernel: [40161.571051] MSR: 90019033 CR: 28028428 XER: 2000 Nov 14 17:27:00 docker5 kernel: [40161.571244] CFAR: c0008468 DAR: 003fedfa DSISR: 4000 SOFTE: 0 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR00: c02273e0 c0eaa4ac37b0 c14d6c00 c0f1f7603fb8 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR04: 0001 0040 0001 0001 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR08: 007d 003fedfa 003fedfa Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR12: c03a4700 c7fbb700 c0cff0f8 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR16: c0e790430400 c0e7a7e1a000 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR20: c0e7c9d16800 c176cfc4 0001 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR24: 0009eca9 0001 c0ffcf8cb800 Nov 14 17:27:00 docker5 kernel: [40161.571244] GPR28: c0f21a739af0 0001 c1505414 c0f1f7603fb8 Nov 14 17:27:00 docker5 kernel: [40161.572243] NIP [c056de90] __percpu_counter_add+0x30/0x100 Nov 14 17:27:00 docker5 kernel: [40161.572310] LR [c02273e0] account_page_dirtied+0x100/0x250 Nov 14 17:27:00 docker5 kernel: [40161.572373] Call Trace: Nov 14 17:27:00 docker5 kernel: [40161.572401] [c0eaa4ac37b0] [c0eaa4ac37f0] 0xc0eaa4ac37f0 (unreliable) Nov 14 17:27:00 docker5 kernel: [40161.572491] [c0eaa4ac37f0] [c02273e0] account_page_dirtied+0x100/0x250 Nov 14 17:27:00 docker5 kernel: [40161.572580] [c0eaa4ac3840] [c031031c] __set_page_dirty+0x7c/0x130 Nov 14 17:27:00 docker5 kernel: [40161.572656] [c0eaa4ac3890] [c03106f8] mark_buffer_dirty+0x178/0x1c0 Nov 14 17:27:00 docker5 
kernel: [40161.572746] [c0eaa4ac38d0] [c03a5c54] ext4_commit_super+0x1d4/0x340 Nov 14 17:27:00 docker5 kernel: [40161.572835] [c0eaa4ac3970] [c03a8d58] ext4_setup_super+0x118/0x250 Nov 14 17:27:00 docker5 kernel: [40161.572924] [c0eaa4ac3a00] [c03abce4] ext4_fill_super+0x1c04/0x3250 Nov 14 17:27:00 docker5 kernel: [40161.573013] [c0eaa4ac3b50] [c02c9964] mount_bdev+0x234/0x270 Nov 14 17:27:00 docker5 kernel: [40161.573089] [c0eaa4ac3bd0] [c03a3178] ext4_mount+0x48/0x60 Nov 14 17:27:00 docker5 kernel: [40161.573165] [c0eaa4ac3c10] [c02cad9c] mount_fs+0x8c/0x230 Nov 14 17:27:00 docker5 kernel: [40161.573242] [c0eaa4ac3cb0] [c02f0518] vfs_kern_mount+0x78/0x180 Nov 14 17:27:00 docker5 kernel: [40161.573319] [c0eaa4ac3d00] [c02f5150] do_mount+0x2e0/0xf60 Nov 14 17:27:00
Re: [PATCH] block-mq:Fix the null memory access while setting tags cpumask
On 10/13/2015 10:17 PM, Jeff Moyer wrote:

Raghavendra K T <raghavendra...@linux.vnet.ibm.com> writes:

In nr_hw_queues > 1 cases, when a certain number of cpus are onlined or offlined, resulting in a change of the request_queue map in the blk-mq layer, we see the kernel dumping like:

What version is that patch against? This problem should be fixed by the patch set from Akinobu Mita. I tested cpu hot-plugging on my system with an nvme device, and it is fixed there on mainline.

Hi Jeff,

My patch was against (4.3.0-rc2) bcee19f424a. I do see now that commit 1356aae08338 ("blk-mq: avoid setting hctx->tags->cpumask before allocation") has an equivalent fix. Thanks for the information.
[PATCH] block-mq:Fix the null memory access while setting tags cpumask
In nr_hw_queues >1 cases when certain number of cpus are onlined/or offlined, that results change in request_queue map in block-mq layer, we see the kernel dumping like: BUG: unable to handle kernel NULL pointer dereference at 0080 IP: [] cpumask_set_cpu+0x6/0xd PGD 6d957067 PUD 7604c067 PMD 0 Oops: 0002 [#1] SMP Modules linked in: null_blk CPU: 2 PID: 1926 Comm: bash Not tainted 4.3.0-rc2+ #24 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: 8800724cd1c0 ti: 880070a2c000 task.ti: 880070a2c000 RIP: 0010:[] [] cpumask_set_cpu+0x6/0xd RSP: 0018:880070a2fbc8 EFLAGS: 00010203 RAX: 880073eedc00 RBX: 88006cc88000 RCX: 88006c06b000 RDX: 0007 RSI: 0080 RDI: 0008 RBP: 880070a2fbc8 R08: 88006c06ac00 R09: 88006c06ad48 R10: 88004ea8 R11: 88006c069650 R12: 88007378fe28 R13: 0008 R14: e8500200 R15: 81d2a630 FS: 7fa34803b700() GS:88007cc4() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 0080 CR3: 761d2000 CR4: 06e0 Stack: 880070a2fc18 8128edec 880073eedc00 0039 88006cc88000 0007 ffe3 81cef2c0 880070a2fc38 8129049a Call Trace: [] blk_mq_map_swqueue+0x9d/0x206 [] blk_mq_queue_reinit_notify+0xe3/0x144 [] notifier_call_chain+0x37/0x63 [] __raw_notifier_call_chain+0xe/0x10 [] __cpu_notify+0x20/0x32 [] cpu_notify_nofail+0x13/0x1b [] _cpu_down+0x18a/0x264 [] ? path_put+0x1f/0x23 [] cpu_down+0x2d/0x3a [] cpu_subsys_offline+0x14/0x16 [] device_offline+0x65/0x94 [] online_store+0x48/0x68 [] ? kernfs_fop_write+0x6f/0x143 [] dev_attr_store+0x20/0x22 [] sysfs_kf_write+0x3c/0x3e [] kernfs_fop_write+0xed/0x143 [] __vfs_write+0x28/0xa6 [] ? security_file_permission+0x3c/0x44 [] ? percpu_down_read+0x21/0x42 [] ? __sb_start_write+0x24/0x41 [] vfs_write+0x8d/0xd1 [] SyS_write+0x59/0x83 [] entry_SYSCALL_64_fastpath+0x12/0x71 Code: 03 75 06 65 48 ff 0a eb 1a f0 48 83 af 68 07 00 00 01 74 02 eb 0d 48 8d bf 68 07 00 00 ff 90 78 07 00 00 5d c3 55 89 ff 48 89 e5 48 0f ab 3e 5d c3 0f 1f 44 00 00 55 8b 4e 44 31 d2 8b b7 94 RIP [] cpumask_set_cpu+0x6/0xd RSP CR2: 0080 How to reproduce: 1. create 80 vcpu guest with 10 core 8 threads 2. modprobe null_blk submit_queues=64 3. for i in 72 73 74 75 76 77 78 79 ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done Reason: We try to set freed hwctx->tag->cpumask in blk_mq_map_swqueue(). Introduced during commit f26cdc8536ad ("blk-mq: Shared tag enhancements"). What is happening: When certain number of cpus are onlined/offlined, that results in blk_mq_update_queue_map, we could potentially end up in new mapping to hwctx. Subsequent blk_mq_map_swqueue of request_queue, tries to set the hwctx->tags->cpumask which is already freed by blk_mq_free_rq_map in earlier itearation when it was not mapped. Fix: Set the hwctx->tags->cpumask only after blk_mq_init_rq_map() is done hwctx->tags->cpumask does not follow the hwctx->cpumask after new mapping even in the cases where new mapping does not cause problem. That is also fixed with this change. This problem is originally found in powervm which had 160 cpus (SMT8), 128 nr_hw_queues. The dump was easily reproduced with offlining last core and it has been a blocker issue because cpu hotplug is a common case for DLPAR. 
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4..39a7834 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,7 +1811,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1836,6 +1835,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		if (!set->tags[i])
 			set->tags[i] = blk_mq_init_rq_map(set, i);
 		hctx->tags = set->tags[i];
+		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 		WARN_ON(!hctx->tags);
 
 		/*
--
1.7.11.7
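A condensed sketch of the ordering problem the patch addresses (pseudo-flow, not the literal blk_mq_map_swqueue() code; "pass 1" and "pass 2" stand for the two loops in that function):

	/* pass 1: map each online software context to its hardware context */
	cpumask_set_cpu(i, hctx->cpumask);
	cpumask_set_cpu(i, hctx->tags->cpumask);	/* crash: hctx->tags may be NULL
							 * or point at a tag map already
							 * freed by blk_mq_free_rq_map()
							 * when this hctx lost its mapping */

	/* pass 2: (re)allocate tag maps for hardware contexts that are mapped */
	if (!set->tags[i])
		set->tags[i] = blk_mq_init_rq_map(set, i);
	hctx->tags = set->tags[i];

	/* the fix: touch the tag cpumask only after hctx->tags is known valid */
	cpumask_copy(hctx->tags->cpumask, hctx->cpumask);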
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 10/06/2015 03:55 PM, Michael Ellerman wrote:

On Sun, 2015-09-27 at 23:59 +0530, Raghavendra K T wrote:

Problem description: Powerpc has sparse node numbering, i.e. on a 4-node system nodes are (possibly) numbered 0,1,16,17. At a lower level, the chipid obtained from the device tree is naturally (directly) mapped to the nid. A potential side effect of that is: 1) There are several places in the kernel that assume serial node numbering, and memory allocations assume that all the nodes from 0 to the highest nid exist, in turn ending up allocating memory for nodes that do not exist.

Is it several? Or lots? If it's several, ie. more than two but not lots, then we should probably just fix those places. Or is that /really/ hard for some reason?

It is several, and I did attempt to fix them. But the rest of the places (like memcg, workqueue, scheduler and so on) are tricky to fix because the memory allocations are glued to other things, and similar fixes may be expected in future too.

Do we ever get whole nodes hotplugged in under PowerVM? I don't think so, but I don't remember for sure.

Even on PowerVM we do have discontiguous numa nodes. [Adding more to it, we could even end up creating a dummy node 0 just to make the kernel happy.] For example:

available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7
node 7 size: 10240 MB
node 7 free: 8174 MB
node distances:
node   0   7
  0:  10  40
  7:  40  10

Note that node zero has neither any cpu nor memory.

2) For virtualization use cases (such as qemu, libvirt, openstack), mapping the sparse nids of the host system to contiguous nids of the guest (numa affinity, placement) could be a challenge.

Can you elaborate? That's a bit vague.

One example I can think of (though libvirt/openstack people will know more about it): suppose one wishes to bind half of the vcpus to one physical node and the rest of the vcpus to the second numa node; we cannot say whether the second node is 1, 8, or 16, and the same libvirt XML valid on one two-node system may not be valid for another two-node system. [I believe it may cause some migration problems too.]

Possible solutions: 1) Handle the memory allocations in the kernel case by case: though in some cases it is easy to achieve, some cases may be intrusive/non-trivial, and in the end it does not handle side effect (2) above. 2) Map the sparse chipid obtained from the device tree to a serial nid at the kernel level (the idea proposed in this series). Pro: it is more natural to handle at the kernel level than at the lower (OPAL) layer. Con: the chipid in the device tree is no longer the same as the nid in the kernel. 3) Let the lower layer (OPAL) give serial node ids after parsing the chipid and the associativity etc. [either as a separate item in the device tree or by compacting the chipid numbers]. Pro: kernel and device tree stay on the same page, with less change in the kernel. Con: is this the functionality expected of the lower layer?

3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases of powerpc)

This is probably the best argument for your series. ie. userspace is dumb and fixing every broken app that assumes linear node numbering is not feasible. So on the whole I think the concept is good. This series though is a bit confusing because of all the renaming etc. etc. Nish made lots of good comments so I'll wait for a v2 based on those.

Yes, will be sending V2 soon, extending my patch to fix the PowerVM case too.
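To make the proposed compaction concrete, a small illustration with hypothetical values (the chip ids are made up; map_chipid_to_nid() is the helper introduced later in this series, and nids are handed out first-come-first-served as chips are discovered):

	/* device-tree chip ids (sparse):  0, 1, 16, 17
	 * kernel nids after compaction:   0, 1,  2,  3  */
	nid = map_chipid_to_nid(0);	/* -> 0 */
	nid = map_chipid_to_nid(16);	/* -> 2 (first unused nid), not 16 */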
Re: [RFC, 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 10/06/2015 03:47 PM, Michael Ellerman wrote: On Sun, 2015-27-09 at 18:29:09 UTC, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); Can you rename it better :) Something like cpu_to_nid(). Good name. sure. Although maybe nid is wrong given the rest of the series. May be not. The current plan is to rename (after discussing with Nish) chipid to pnid (physical nid) and nid to vnid (virtual nid) within powerpc numa.c [reasoning chipid is applicable only to OPAL, since we want to handle powerkvm, powervm and baremetal we need a generic name ] But 'nid' naming will be retained which is applicable for generic kernel interactions. diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); I don't see you changing any modular code that uses this, or any macros that might be used by modules, so I don't see why this needs to be exported? I think you just added it because num_cpu_lookup_table was exported? arch/powerpc/kernel/smp.c uses it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
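For reference, the rename being discussed is purely cosmetic; a sketch of the suggested cpu_to_nid(), under the assumption that it keeps the body of the posted numa_cpu_lookup():

	/* sketch of the suggested rename -- behaviour unchanged */
	int cpu_to_nid(int cpu)
	{
		return numa_cpu_lookup_table[cpu];
	}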
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/30/2015 01:16 AM, Denis Kirjanov wrote: On 9/29/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote: On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote: On 9/27/15, Raghavendra K T <raghavendra...@linux.vnet.ibm.com> wrote: Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid got from device tree is naturally mapped (directly) to nid. Interesting thing to play with, I'll try to test it on my POWER7 box, but it doesn't have the OPAL layer :( Hi Denis, Thanks for your interest. I have pushed the patches to https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if it makes patches easy to grab. Thanks! One sad thing is that I can't test the actual node id mapping now since currently I have an access to machine with only one memory node :/ Can we fake it through qemu? faking the sparse numa ids is possible with Nish's patch for qemu: https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg05826.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mappaing between chipid and nid
On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote:

Once we have made the distinction between nid and chipid, create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change.

Signed-off-by: Raghavendra K T
---
 arch/powerpc/mm/numa.c | 36 +++++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index f84ed2f..dd2073b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -264,6 +264,17 @@ out:
 	return chipid;
 }
 
+
+ /* Return the nid from associativity */
+static int associativity_to_nid(const __be32 *associativity)
+{
+	int nid;
+
+	nid = associativity_to_chipid(associativity);
+	return nid;
+}

This is ultimately confusing. You are assigning the semantic return value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the variable naming be consistent? :(

Yes, will come up with some consistent naming.
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote:

On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote:

[...]

2) Map the sparse chipid obtained from the device tree to a serial nid at the kernel level (the idea proposed in this series). Pro: it is more natural to handle at the kernel level than at the lower (OPAL) layer. Con: the chipid in the device tree is no longer the same as the nid in the kernel.

Is there any debugging/logging? Looks like not -- so how does a sysadmin map from firmware-provided values to the Linux values? That's going to make debugging of large systems (PowerVM or otherwise) less than pleasant, it seems? Possibly you could put something in sysfs?

I see two things that could be done here: 1) while doing dump_numa_cpu_topology() we can dump nid_to_chipid() as additional information; 2) sysfs: does /sys/devices/system/node/nodeX/*chipid* look good? Maybe we should add it only for powerpc, or otherwise we would need chipid = nid populated for other archs. [I think this change can be done slowly.]
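If option (2) were pursued, it would roughly be a read-only node attribute. A hypothetical sketch (the attribute name, its registration, and the powerpc-only gating are all assumptions, not something in the posted series; nid_to_chipid() is the helper from patch 4/5):

	/* hypothetical /sys/devices/system/node/nodeX/chipid attribute */
	static ssize_t chipid_show(struct device *dev,
				   struct device_attribute *attr, char *buf)
	{
		return sprintf(buf, "%d\n", nid_to_chipid(dev->id));
	}
	static DEVICE_ATTR_RO(chipid);
	/* ...would still need to be registered from the node sysfs init path. */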
Re: [PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
On 09/28/2015 11:02 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:12 +0530], Raghavendra K T wrote: Create arrays that maps serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by driver/acpi/numa.c that maps a proximity node with a logical node by Takayoshi Kochi , and thus uses an additional chipid_map nodemask. The mask helps in first unused nid easily by knowing first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; Hrm, conceptually there are *more* chips than nodes, right? So what guarantees we won't see > MAX_NUMNODES chips? You are correct that nid <= chipids. and #nids = #chipids when all possible slots are populated. Considering we assume that maximum chip slots are no more than MAX_NUMNODES, how about having #define MAX_CHIPNODES MAX_NUMNODES and chipid_to_nid_map[MAX_CHIPNODES] = { [0 ... MAX_CHIPNODES - 1] = .. +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; Do you really want to support these cases? Or should they be bugs/warnings indicating that you got an unexpected input? Or at least WARN_ON_ONCE? Right. Querying for nid of an invalid chipid should be atleast WARN_ON_ONCE(). But 'll check once if there is any valid scenario before the change. + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} chip <-> node mapping is a static (physical) concept, right? Should we emit some debugging if for some reason we get a runtime call to remap an already mapped chip to a new node? Good point. Already mapped chipid to a different nid is unexpected whereas mapping chipid to same nid is expected.(because mapping comes from cpus belonging to same node). WARN_ON() should suffice here? + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; If you create a KVM guest with a bogus topology, doesn't this just start losing NUMA information for very high-noded guests? 'll try to see if it is possible to hit this case, ideally we should not allow more than MAX_NUMNODES for chipids and we should abort early. 
+ nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - stray change? yep, will correct that.
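As a rough illustration of the hardening discussed above (bounding chipids and warning when an invalid chipid is queried or an already-mapped chipid shows up with a different nid), a minimal userspace sketch could look like the following. It only models the proposed checks; MAX_CHIPNODES, the warning messages and main() are assumptions for illustration, not kernel code.

/* Userspace model of the checks discussed above -- not kernel code. */
#include <stdio.h>

#define MAX_CHIPNODES 8            /* assumed stand-in for MAX_NUMNODES */
#define NUMA_NO_NODE  (-1)

static int chipid_to_nid_map[MAX_CHIPNODES] = {
	[0 ... MAX_CHIPNODES - 1] = NUMA_NO_NODE   /* same range initializer style as the patch */
};

static int chipid_to_nid(int chipid)
{
	/* models the suggested WARN_ON_ONCE() for an out-of-range chipid */
	if (chipid < 0 || chipid >= MAX_CHIPNODES) {
		fprintf(stderr, "warning: invalid chipid %d\n", chipid);
		return NUMA_NO_NODE;
	}
	return chipid_to_nid_map[chipid];
}

static void model_map_chipid_to_nid(int chipid, int nid)
{
	int cur = chipid_to_nid(chipid);

	/* models the suggested WARN_ON() when an already-mapped chipid
	 * is asked to map to a different nid */
	if (cur != NUMA_NO_NODE && cur != nid) {
		fprintf(stderr, "warning: chipid %d already mapped to nid %d, ignoring nid %d\n",
			chipid, cur, nid);
		return;
	}
	chipid_to_nid_map[chipid] = nid;
}

int main(void)
{
	model_map_chipid_to_nid(3, 1);   /* first mapping */
	model_map_chipid_to_nid(3, 1);   /* same nid again: expected, stays silent */
	model_map_chipid_to_nid(3, 2);   /* different nid: warns */
	chipid_to_nid(99);               /* out of range: warns */
	printf("chipid 3 -> nid %d\n", chipid_to_nid(3));
	return 0;
}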
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 09/28/2015 10:58 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote: Once we have made the distinction between nid and chipid, create a 1:1 mapping between them. This makes compacting the nids easy later. Didn't the previous patch just do the opposite of... My intent for the split was: 1. rename the functions to state loudly that they return a chipid (and not a nid), 2. then assign nid = chipid so that it is clear we have a 1:1 nid:chipid mapping, and compact the nids later. But maybe I should combine patches 2 and 3.
Re: [PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
On 09/28/2015 10:57 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:10 +0530], Raghavendra K T wrote: There is no change in the fuctionality Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) This is confusing to me. This function is also used by the DLPAR code under PowerVM to indicate what node the CPU is on -- not a chip (which I don't believe is exposed at all under PowerVM). Good point. should I retain the name nid? or any suggestions? instead of chipid -> nid which fits both the cases. or should I rename like nid->vnid something? [...] @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); If you are getting a chipid, shouldn't you be assigning it to a variable called 'new_chipid'? yes perhaps. my splitting idea was 1. change nid name in functions to chipid (without changing nid variable calling that function) 2. rename variables to chipid and assign nid=chipid (1:1 mapping) 3. now let nid = mapped chipid But I see that it isn't consistent in some places. do you think merging step 1 and step 2 is okay? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 10:34 PM, Nishanth Aravamudan wrote: On 28.09.2015 [13:44:42 +0300], Denis Kirjanov wrote: On 9/27/15, Raghavendra K T wrote: Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, the chipid obtained from the device tree is naturally mapped (directly) to the nid. Interesting thing to play with, I'll try to test it on my POWER7 box, but it doesn't have the OPAL layer :( Hi Denis, Thanks for your interest. I have pushed the patches to https://github.com/ktraghavendra/linux/tree/serialnuma_v1 if that makes the patches easier to grab. Note that it's also interesting to try it under PowerVM, with odd NUMA topologies, and report any issues found :) Thanks Nish, I'll also grab a PowerVM system and test there.
Re: [PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
On 09/28/2015 11:05 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:11 +0530], Raghavendra K T wrote: Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} This is ultimately confusing. You are assigning the semantic return value of a chipid to a nid -- is it a nid or a chipid? Shouldn't the variable naming be consistent? :( yes. will come up with some consistent naming. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC 0/5] powerpc:numa Add serial nid support
On 09/28/2015 11:04 PM, Nishanth Aravamudan wrote: On 27.09.2015 [23:59:08 +0530], Raghavendra K T wrote: [...] 2) Map the sparse chipid got from the device tree to a serial nid at kernel level (the idea proposed in this series). Pro: it is more natural to handle this at kernel level than at the lower (OPAL) layer. Con: the chipid in the device tree is no longer the same as the nid in the kernel. Is there any debugging/logging? Looks like not -- so how does a sysadmin map from firmware-provided values to the Linux values? That's going to make debugging of large systems (PowerVM or otherwise) less than pleasant, it seems? Possibly you could put something in sysfs? I see two things that could be done here: 1) while doing dump_numa_cpu_topology() we can dump nid_to_chipid() as additional information; 2) sysfs: would /sys/devices/system/node/nodeX/chipid look good? Maybe we should add it only for powerpc, or otherwise we would need to populate chipid = nid for the other archs. [ I think this change can be done gradually ]
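If such a per-node attribute were added (it does not exist today; the path below is purely hypothetical, taken from the suggestion above), an administrator could recover the nid to chipid mapping from userspace with something as simple as this sketch.

/* Sketch: walk /sys/devices/system/node and read a hypothetical per-node
 * "chipid" attribute, as proposed above. On current kernels the attribute
 * is absent, so the loop simply prints nothing.
 */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	const char *base = "/sys/devices/system/node";
	struct dirent *de;
	DIR *d = opendir(base);

	if (!d) {
		perror(base);
		return 1;
	}
	while ((de = readdir(d)) != NULL) {
		char path[512];
		int nid, chipid;
		FILE *f;

		if (sscanf(de->d_name, "node%d", &nid) != 1)
			continue;
		snprintf(path, sizeof(path), "%s/%s/chipid", base, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;       /* attribute absent on this kernel */
		if (fscanf(f, "%d", &chipid) == 1)
			printf("nid %d -> chipid %d\n", nid, chipid);
		fclose(f);
	}
	closedir(d);
	return 0;
}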
Re: [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 09/27/2015 11:59 PM, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T Signed-off-by: Raghavendra K T --- Sorry: changelog edit messed it.. :( arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = 
numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_a
[PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} + + /* Returns the chipid associated with the given device tree node, * or -1 if not found. */ @@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device) return chipid; } +/* + * Returns nid from the chipid associated with given tree node + */ +static int of_node_to_nid_single(struct device_node *device) +{ + int nid; + + nid = of_node_to_chipid_single(device); + return nid; +} + /* Walk the device tree upwards, looking for an associativity id */ int of_node_to_nid(struct device_node *device) { @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_chipid_single(device); + nid = of_node_to_nid_single(device); if (nid != -1) break; @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_chipid_single() for memory represented in the + * This is like of_node_to_nid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); of_node_put(cpu); /* @@ -796,7 +818,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. */ - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); break; } @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_chipid(associativity); + new_nid = associativity_to_nid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
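The wrappers above are deliberately trivial: they create a single seam where the chipid to nid translation can later be changed, which patch 5/5 does by swapping the identity step for map_chipid_to_nid(). A small standalone illustration of that staging follows; the names and values are invented for the example and only model the shape of the refactoring.

/* Illustration of the refactoring seam only; values are made up. */
#include <stdio.h>

static int associativity_to_chipid(int assoc_value)
{
	return assoc_value;            /* stands in for the device-tree parse */
}

/* Patch 3/5 stage: identity wrapper, nid == chipid. */
static int associativity_to_nid(int assoc_value)
{
	int chipid = associativity_to_chipid(assoc_value);

	return chipid;                 /* patch 5/5 replaces this line with map_chipid_to_nid(chipid) */
}

int main(void)
{
	printf("nid for associativity 16 = %d\n", associativity_to_nid(16));
	return 0;
}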
[PATCH RFC 0/5] powerpc:numa Add serial nid support
Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid got from device tree is naturally mapped (directly) to nid. Potential side effect of that is: 1) There are several places in kernel that assumes serial node numbering. and memory allocations assume that all the nodes from 0-(highest nid) exist inturn ending up allocating memory for the nodes that does not exist. 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping sparse nid of the host system to contiguous nids of guest (numa affinity, placement) could be a challenge. Possible Solutions: 1) Handling the memory allocations is kernel case by case: Though in some cases it is easy to achieve, some cases may be intrusive/not trivial. at the end it does not handle side effect (2) above. 2) Map the sparse chipid got from device tree to a serial nid at kernel level (The idea proposed in this series). Pro: It is more natural to handle at kernel level than at lower (OPAL) layer. con: The chipid is in device tree no longer the same as nid in kernel 3) Let the lower layer (OPAL) give the serial node ids after parsing the chipid and the associativity etc [ either as a separate item in device tree or by compacting the chipid numbers ] Pros: kernel, device tree are on same page and less change in kernel Con: is it the functionality expected in lower layer As mentioned above, current patch series tries to map chipid from lower layer to a contiguos nid at kernel level keeping the node distance calculation and so on intact. Result: Before the patch: numactl -H available: 4 nodes (0-1,16-17) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 29836 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32019 MB node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 16 size: 32571 MB node 16 free: 31222 MB node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 17 size: 0 MB node 17 free: 0 MB node distances: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After the patch: numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 30657 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32566 MB node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 2 size: 32571 MB node 2 free: 32401 MB node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 (note that numa distances are intact). Apart from this, The following tests are done with the patched kernel (both baremetal and KVM guest with multiple nodes) to ensure there is no breakage. 
1) offlining and onlining of memory in the /sys/devices/system/node/nodeX path 2) offlining and onlining of cpus in the /sys/devices/system/cpu/ path 3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases on powerpc) 4) Thousands of docker containers were spawned. Please let me know your comments. patch 1-3: cleanup patches patch 4: Adds helper functions to map nid and chipid patch 5: Uses the mapping to get serial nids Raghavendra K T (5): powerpc:numa Add numa_cpu_lookup function to update lookup table powerpc:numa Rename functions referring to nid as chipid powerpc:numa create 1:1 mapping between chipid and nid powerpc:numa Add helper functions to maintain chipid to nid mapping powerpc:numa Use chipid to nid mapping to get serial numa node ids arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 ++-- arch/powerpc/mm/numa.c| 121 +++--- 3 files changed, 105 insertions(+), 28 deletions(-) -- 1.7.11.7
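As a side note on the mechanism: the series compacts the sparse chipids (0,1,16,17 in the topology above) into serial nids (0-3) by handing each newly seen chipid the first unused nid, tracked in the chipid_map nodemask. A minimal userspace sketch of that idea follows; the bitmap helpers, the MAX_NODES value and the demo input are assumptions for illustration only, not the kernel implementation.

/* Userspace model of sparse chipid -> serial nid compaction. */
#include <stdio.h>

#define MAX_NODES 32
#define NUMA_NO_NODE (-1)

static unsigned long nid_used;                  /* models the chipid_map nodemask */
static int chipid_to_nid_map[MAX_NODES] = {
	[0 ... MAX_NODES - 1] = NUMA_NO_NODE
};

static int first_unset_bit(unsigned long mask)  /* models first_unset_node() */
{
	int bit;

	for (bit = 0; bit < MAX_NODES; bit++)
		if (!(mask & (1UL << bit)))
			return bit;
	return NUMA_NO_NODE;
}

static int map_chipid_to_nid(int chipid)
{
	int nid;

	if (chipid < 0 || chipid >= MAX_NODES)
		return NUMA_NO_NODE;

	nid = chipid_to_nid_map[chipid];
	if (nid == NUMA_NO_NODE) {
		nid = first_unset_bit(nid_used);
		if (nid == NUMA_NO_NODE)
			return NUMA_NO_NODE;    /* all nids already in use */
		chipid_to_nid_map[chipid] = nid;
		nid_used |= 1UL << nid;
	}
	return nid;
}

int main(void)
{
	int sparse_chipids[] = { 0, 1, 16, 17 };  /* topology from the cover letter */
	unsigned int i;

	for (i = 0; i < sizeof(sparse_chipids) / sizeof(sparse_chipids[0]); i++)
		printf("chipid %2d -> nid %d\n", sparse_chipids[i],
		       map_chipid_to_nid(sparse_chipids[i]));
	return 0;
}

Running it prints 0, 1, 16, 17 mapped to 0, 1, 2, 3 respectively, mirroring the before/after numactl -H output shown above.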
[PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids
Also properly initialize numa distance table for serial nids. Problem: Powerpc supports sparse nid numbering which could affect 1) memory footprint 2) virtualization use cases Current solution: The patch maps sparse chipid got from the device tree to serial nids. Result before: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Testing: Scenarios tested on baremetal and KVM guest with 4 nodes 1) offlining and onlining memory and cpus 2) Running the tests from numactl source. 3) Creating 1000s of docker containers stressing the system Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f015cad..873ac8c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity) /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(chipid, associativity + 1); + initialize_distance_lookup_table(chipid_to_nid(chipid), +associativity + 1); } out: @@ -314,9 +315,10 @@ out: /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { - int nid; + int chipid, nid; - nid = associativity_to_chipid(associativity); + chipid = associativity_to_chipid(associativity); + nid = map_chipid_to_nid(chipid); return nid; } @@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device) */ static int of_node_to_nid_single(struct device_node *device) { - int nid; + int chipid, nid; - nid = of_node_to_chipid_single(device); + chipid = of_node_to_chipid_single(device); + nid = map_chipid_to_nid(chipid); return nid; } -- 1.7.11.7
[PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ 
-536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_associativity_changes_mask, _associativity_changes_mask, cpu_sibling_mask(cpu));
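The motivation stated above (hide the raw numa_cpu_lookup_table[] behind an accessor so the cpu to node scheme can be changed in one place) is plain encapsulation. A tiny standalone illustration, with invented table contents, might look like this; it only models the pattern, not the kernel's actual table handling.

/* Standalone illustration of the accessor pattern; table contents invented. */
#include <stdio.h>

#define NR_CPUS 8

static int numa_cpu_lookup_table[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* All readers go through this helper, so changing the cpu->node scheme
 * (e.g. routing it through a chipid->nid map) touches only one place. */
static int numa_cpu_lookup(int cpu)
{
	return numa_cpu_lookup_table[cpu];
}

static void update_numa_cpu_lookup_table(int cpu, int node)
{
	numa_cpu_lookup_table[cpu] = node;
}

int main(void)
{
	update_numa_cpu_lookup_table(5, 2);
	printf("cpu 5 is on node %d\n", numa_cpu_lookup(5));
	return 0;
}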
[PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
Create arrays that maps serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by driver/acpi/numa.c that maps a proximity node with a logical node by Takayoshi Kochi , and thus uses an additional chipid_map nodemask. The mask helps in first unused nid easily by knowing first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; + nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
There is no change in the fuctionality Signed-off-by: Raghavendra K T --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) { - int nid = -1; + int chipid = -1; if (min_common_depth == -1) goto out; if (of_read_number(associativity, 1) >= min_common_depth) - nid = of_read_number([min_common_depth], 1); + chipid = of_read_number([min_common_depth], 1); /* POWER4 LPAR uses 0x as invalid node */ - if (nid == 0x || nid >= MAX_NUMNODES) - nid = -1; + if (chipid == 0x || chipid >= MAX_NUMNODES) + chipid = -1; - if (nid > 0 && + if (chipid > 0 && of_read_number(associativity, 1) >= distance_ref_points_depth) { /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(nid, associativity + 1); + initialize_distance_lookup_table(chipid, associativity + 1); } out: - return nid; + return chipid; } -/* Returns the nid associated with the given device tree node, +/* Returns the chipid associated with the given device tree node, * or -1 if not found. */ -static int of_node_to_nid_single(struct device_node *device) +static int of_node_to_chipid_single(struct device_node *device) { - int nid = -1; + int chipid = -1; const __be32 *tmp; tmp = of_get_associativity(device); if (tmp) - nid = associativity_to_nid(tmp); - return nid; + chipid = associativity_to_chipid(tmp); + return chipid; } /* Walk the device tree upwards, looking for an associativity id */ @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_nid_single(device); + nid = of_node_to_chipid_single(device); if (nid != -1) break; @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_nid_single() for memory represented in the + * This is like of_node_to_chipid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); of_node_put(cpu); /* @@ -796,7 +796,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. 
*/ - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); break; } @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.
[PATCH RFC 5/5] powerpc:numa Use chipid to nid mapping to get serial numa node ids
Also properly initialize numa distance table for serial nids. Problem: Powerpc supports sparse nid numbering which could affect 1) memory footprint 2) virtualization use cases Current solution: The patch maps sprase chipid got fromn device tree to serail nids. Result before: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 Testing: Scenarios tested on baremetal and KVM guest with 4 nodes 1) offlining and onlining memory and cpus 2) Running the tests from numactl source. 3) Creating 1000s of docker containers stressing the system Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f015cad..873ac8c 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -304,7 +304,8 @@ static int associativity_to_chipid(const __be32 *associativity) /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(chipid, associativity + 1); + initialize_distance_lookup_table(chipid_to_nid(chipid), +associativity + 1); } out: @@ -314,9 +315,10 @@ out: /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { - int nid; + int chipid, nid; - nid = associativity_to_chipid(associativity); + chipid = associativity_to_chipid(associativity); + nid = map_chipid_to_nid(chipid); return nid; } @@ -340,9 +342,10 @@ static int of_node_to_chipid_single(struct device_node *device) */ static int of_node_to_nid_single(struct device_node *device) { - int nid; + int chipid, nid; - nid = of_node_to_chipid_single(device); + chipid = of_node_to_chipid_single(device); + nid = map_chipid_to_nid(chipid); return nid; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 2/5] powerpc:numa Rename functions referring to nid as chipid
There is no change in the fuctionality Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 42 +- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index d5e6eee..f84ed2f 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -235,47 +235,47 @@ static void initialize_distance_lookup_table(int nid, } } -/* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa +/* Returns chipid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa * info is found. */ -static int associativity_to_nid(const __be32 *associativity) +static int associativity_to_chipid(const __be32 *associativity) { - int nid = -1; + int chipid = -1; if (min_common_depth == -1) goto out; if (of_read_number(associativity, 1) >= min_common_depth) - nid = of_read_number([min_common_depth], 1); + chipid = of_read_number([min_common_depth], 1); /* POWER4 LPAR uses 0x as invalid node */ - if (nid == 0x || nid >= MAX_NUMNODES) - nid = -1; + if (chipid == 0x || chipid >= MAX_NUMNODES) + chipid = -1; - if (nid > 0 && + if (chipid > 0 && of_read_number(associativity, 1) >= distance_ref_points_depth) { /* * Skip the length field and send start of associativity array */ - initialize_distance_lookup_table(nid, associativity + 1); + initialize_distance_lookup_table(chipid, associativity + 1); } out: - return nid; + return chipid; } -/* Returns the nid associated with the given device tree node, +/* Returns the chipid associated with the given device tree node, * or -1 if not found. */ -static int of_node_to_nid_single(struct device_node *device) +static int of_node_to_chipid_single(struct device_node *device) { - int nid = -1; + int chipid = -1; const __be32 *tmp; tmp = of_get_associativity(device); if (tmp) - nid = associativity_to_nid(tmp); - return nid; + chipid = associativity_to_chipid(tmp); + return chipid; } /* Walk the device tree upwards, looking for an associativity id */ @@ -286,7 +286,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_nid_single(device); + nid = of_node_to_chipid_single(device); if (nid != -1) break; @@ -498,7 +498,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_nid_single() for memory represented in the + * This is like of_node_to_chipid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +557,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +754,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_nid_single(cpu); + nid = of_node_to_chipid_single(cpu); of_node_put(cpu); /* @@ -796,7 +796,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. 
*/ - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1119,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_nid_single(memory); + nid = of_node_to_chipid_single(memory); break; } @@ -1415,7 +1415,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_nid(associativity); + new_nid = associativity_to_chipid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org M
[PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> --- arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu 
%lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { + if (new_nid == numa_cpu_lookup(cpu)) { cpumask_andnot(_associativity_changes_mask, _associativity_changes_mask,
[PATCH RFC 4/5] powerpc:numa Add helper functions to maintain chipid to nid mapping
Create arrays that maps serial nids and sparse chipids. Note: My original idea had only two arrays of chipid to nid map. Final code is inspired by driver/acpi/numa.c that maps a proximity node with a logical node by Takayoshi Kochi <t-ko...@bq.jp.nec.com>, and thus uses an additional chipid_map nodemask. The mask helps in first unused nid easily by knowing first unset bit in the mask. No change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 48 +++- 1 file changed, 47 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index dd2073b..f015cad 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -63,6 +63,11 @@ static int form1_affinity; static int distance_ref_points_depth; static const __be32 *distance_ref_points; static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS]; +static nodemask_t chipid_map = NODE_MASK_NONE; +static int chipid_to_nid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; +static int nid_to_chipid_map[MAX_NUMNODES] + = { [0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE }; /* * Allocate node_to_cpumask_map based on number of available nodes @@ -133,6 +138,48 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } +int chipid_to_nid(int chipid) +{ + if (chipid < 0) + return NUMA_NO_NODE; + return chipid_to_nid_map[chipid]; +} + +int nid_to_chipid(int nid) +{ + if (nid < 0) + return NUMA_NO_NODE; + return nid_to_chipid_map[nid]; +} + +static void __map_chipid_to_nid(int chipid, int nid) +{ + if (chipid_to_nid_map[chipid] == NUMA_NO_NODE +|| nid < chipid_to_nid_map[chipid]) + chipid_to_nid_map[chipid] = nid; + if (nid_to_chipid_map[nid] == NUMA_NO_NODE + || chipid < nid_to_chipid_map[nid]) + nid_to_chipid_map[nid] = chipid; +} + +int map_chipid_to_nid(int chipid) +{ + int nid; + + if (chipid < 0 || chipid >= MAX_NUMNODES) + return NUMA_NO_NODE; + + nid = chipid_to_nid_map[chipid]; + if (nid == NUMA_NO_NODE) { + if (nodes_weight(chipid_map) >= MAX_NUMNODES) + return NUMA_NO_NODE; + nid = first_unset_node(chipid_map); + __map_chipid_to_nid(chipid, nid); + node_set(nid, chipid_map); + } + return nid; +} + int numa_cpu_lookup(int cpu) { return numa_cpu_lookup_table[cpu]; @@ -264,7 +311,6 @@ out: return chipid; } - /* Return the nid from associativity */ static int associativity_to_nid(const __be32 *associativity) { -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC 0/5] powerpc:numa Add serial nid support
Problem description: Powerpc has sparse node numbering, i.e. on a 4 node system nodes are numbered (possibly) as 0,1,16,17. At a lower level, we map the chipid got from device tree is naturally mapped (directly) to nid. Potential side effect of that is: 1) There are several places in kernel that assumes serial node numbering. and memory allocations assume that all the nodes from 0-(highest nid) exist inturn ending up allocating memory for the nodes that does not exist. 2) For virtualization use cases (such as qemu, libvirt, openstack), mapping sparse nid of the host system to contiguous nids of guest (numa affinity, placement) could be a challenge. Possible Solutions: 1) Handling the memory allocations is kernel case by case: Though in some cases it is easy to achieve, some cases may be intrusive/not trivial. at the end it does not handle side effect (2) above. 2) Map the sparse chipid got from device tree to a serial nid at kernel level (The idea proposed in this series). Pro: It is more natural to handle at kernel level than at lower (OPAL) layer. con: The chipid is in device tree no longer the same as nid in kernel 3) Let the lower layer (OPAL) give the serial node ids after parsing the chipid and the associativity etc [ either as a separate item in device tree or by compacting the chipid numbers ] Pros: kernel, device tree are on same page and less change in kernel Con: is it the functionality expected in lower layer As mentioned above, current patch series tries to map chipid from lower layer to a contiguos nid at kernel level keeping the node distance calculation and so on intact. Result: Before the patch: numactl -H available: 4 nodes (0-1,16-17) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 29836 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32019 MB node 16 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 16 size: 32571 MB node 16 free: 31222 MB node 17 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 17 size: 0 MB node 17 free: 0 MB node distances: node 0 1 16 17 0: 10 20 40 40 1: 20 10 40 40 16: 40 40 10 20 17: 40 40 20 10 After the patch: numactl -H available: 4 nodes (0-3) node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 node 0 size: 31665 MB node 0 free: 30657 MB node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 node 1 size: 32722 MB node 1 free: 32566 MB node 2 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 node 2 size: 32571 MB node 2 free: 32401 MB node 3 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 node 3 size: 0 MB node 3 free: 0 MB node distances: node 0 1 2 3 0: 10 20 40 40 1: 20 10 40 40 2: 40 40 10 20 3: 40 40 20 10 (note that numa distances are intact). Apart from this, The following tests are done with the patched kernel (both baremetal and KVM guest with multiple nodes) to ensure there is no breakage. 
1) offlining and onlining of memory in /sys/devices/system/node/nodeX path
2) offlining and onlining of cpus in /sys/devices/system/cpu/ path
3) Numactl tests from ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.10.tar.gz (in fact there was more breakage before the patch because of the sparse nid and memoryless node cases of powerpc)
4) Thousands of docker containers were spawned.

Please let me know your comments.

patch 1-3: cleanup patches
patch 4: Adds helper functions to map nid and chipid
patch 5: Uses the mapping to get serial nid

Raghavendra K T (5):
  powerpc:numa Add numa_cpu_lookup function to update lookup table
  powerpc:numa Rename functions referring to nid as chipid
  powerpc:numa create 1:1 mapping between chipid and nid
  powerpc:numa Add helper functions to maintain chipid to nid mapping
  powerpc:numa Use chipid to nid mapping to get serial numa node ids

 arch/powerpc/include/asm/mmzone.h |   2 +-
 arch/powerpc/kernel/smp.c         |  10 ++--
 arch/powerpc/mm/numa.c            | 121 +++---
 3 files changed, 105 insertions(+), 28 deletions(-)

-- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
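[Editor's illustration] As an aside for readers not familiar with nr_node_ids: it is derived from the highest possible node id plus one, which is what makes the 0,1,16,17 layout above expensive for plain 0..nr_node_ids-1 loops. A toy calculation, with the node ids hard-coded from the example above; nothing here is kernel code.

	/* Which ids does a 0..nr_node_ids-1 loop visit that are not possible nodes? */
	#include <stdio.h>

	int main(void)
	{
		int possible[] = { 0, 1, 16, 17 };
		int npossible = 4, nr_node_ids = 0, wasted = 0;

		for (int i = 0; i < npossible; i++)
			if (possible[i] + 1 > nr_node_ids)
				nr_node_ids = possible[i] + 1;   /* highest id + 1 -> 18 */

		for (int id = 0; id < nr_node_ids; id++) {
			int exists = 0;

			for (int i = 0; i < npossible; i++)
				if (possible[i] == id)
					exists = 1;
			if (!exists)
				wasted++;                        /* would be a needless allocation */
		}
		printf("nr_node_ids=%d, non-existent ids visited=%d\n",
		       nr_node_ids, wasted);
		return 0;
	}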
[PATCH RFC 3/5] powerpc:numa create 1:1 mapping between chipid and nid
Once we have made the distinction between nid and chipid create a 1:1 mapping between them. This makes compacting the nids easy later. No functionality change. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index f84ed2f..dd2073b 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -264,6 +264,17 @@ out: return chipid; } + + /* Return the nid from associativity */ +static int associativity_to_nid(const __be32 *associativity) +{ + int nid; + + nid = associativity_to_chipid(associativity); + return nid; +} + + /* Returns the chipid associated with the given device tree node, * or -1 if not found. */ @@ -278,6 +289,17 @@ static int of_node_to_chipid_single(struct device_node *device) return chipid; } +/* + * Returns nid from the chipid associated with given tree node + */ +static int of_node_to_nid_single(struct device_node *device) +{ + int nid; + + nid = of_node_to_chipid_single(device); + return nid; +} + /* Walk the device tree upwards, looking for an associativity id */ int of_node_to_nid(struct device_node *device) { @@ -286,7 +308,7 @@ int of_node_to_nid(struct device_node *device) of_node_get(device); while (device) { - nid = of_node_to_chipid_single(device); + nid = of_node_to_nid_single(device); if (nid != -1) break; @@ -498,7 +520,7 @@ static int of_get_assoc_arrays(struct device_node *memory, } /* - * This is like of_node_to_chipid_single() for memory represented in the + * This is like of_node_to_nid_single() for memory represented in the * ibm,dynamic-reconfiguration-memory node. */ static int of_drconf_to_nid_single(struct of_drconf_cell *drmem, @@ -557,7 +579,7 @@ static int numa_setup_cpu(unsigned long lcpu) goto out; } - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); out_present: if (nid < 0 || !node_online(nid)) @@ -754,7 +776,7 @@ static int __init parse_numa_properties(void) cpu = of_get_cpu_node(i, NULL); BUG_ON(!cpu); - nid = of_node_to_chipid_single(cpu); + nid = of_node_to_nid_single(cpu); of_node_put(cpu); /* @@ -796,7 +818,7 @@ new_range: * have associativity properties. If none, then * everything goes to default_nid. */ - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); if (nid < 0) nid = default_nid; @@ -1119,7 +1141,7 @@ static int hot_add_node_scn_to_nid(unsigned long scn_addr) if ((scn_addr < start) || (scn_addr >= (start + size))) continue; - nid = of_node_to_chipid_single(memory); + nid = of_node_to_nid_single(memory); break; } @@ -1415,7 +1437,7 @@ int arch_update_cpu_topology(void) /* Use associativity from first thread for all siblings */ vphn_get_associativity(cpu, associativity); - new_nid = associativity_to_chipid(associativity); + new_nid = associativity_to_nid(associativity); if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
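[Editor's illustration] The wrappers added above look redundant on purpose: they give the series a single point where the chipid->nid translation can later change. Below is a hedged sketch of that layering; the *_model functions and the main() driver are invented for this example, and per the cover letter it is patch 5 of the series that would turn the identity step into a real mapping such as map_chipid_to_nid().

	#include <stdio.h>

	/* Simplified model: callers ask for a nid, the wrapper asks for a chipid,
	 * and today the two are the same.  A later patch can swap the identity
	 * step for a real translation without touching any caller. */
	static int associativity_to_chipid_model(const unsigned int *assoc)
	{
		return (int)assoc[0];                   /* stand-in for the real parsing */
	}

	static int associativity_to_nid_model(const unsigned int *assoc)
	{
		return associativity_to_chipid_model(assoc);   /* identity, for now */
	}

	int main(void)
	{
		unsigned int assoc[] = { 16 };          /* pretend firmware said chipid 16 */

		printf("nid = %d\n", associativity_to_nid_model(assoc));
		return 0;
	}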
Re: [PATCH RFC 1/5] powerpc:numa Add numa_cpu_lookup function to update lookup table
On 09/27/2015 11:59 PM, Raghavendra K T wrote: We access numa_cpu_lookup_table array directly in all the places to read/update numa cpu lookup information. Instead use a helper function to update. This is helpful in changing the way numa<-->cpu mapping in single place when needed. This is a cosmetic change, no change in functionality. Signed-off-by: Raghavendra K T <raghavendra...@linux.inet.ibm.com> Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- Sorry: changelog edit messed it.. :( arch/powerpc/include/asm/mmzone.h | 2 +- arch/powerpc/kernel/smp.c | 10 +- arch/powerpc/mm/numa.c| 28 +--- 3 files changed, 23 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/include/asm/mmzone.h b/arch/powerpc/include/asm/mmzone.h index 7b58917..c24a5f4 100644 --- a/arch/powerpc/include/asm/mmzone.h +++ b/arch/powerpc/include/asm/mmzone.h @@ -29,7 +29,7 @@ extern struct pglist_data *node_data[]; * Following are specific to this numa platform. */ -extern int numa_cpu_lookup_table[]; +extern int numa_cpu_lookup(int cpu); extern cpumask_var_t node_to_cpumask_map[]; #ifdef CONFIG_MEMORY_HOTPLUG extern unsigned long max_pfn; diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index ec9ec20..56fbe9e 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -381,9 +381,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus) * numa_node_id() works after this. */ if (cpu_present(cpu)) { - set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]); + set_cpu_numa_node(cpu, numa_cpu_lookup(cpu)); set_cpu_numa_mem(cpu, - local_memory_node(numa_cpu_lookup_table[cpu])); + local_memory_node(numa_cpu_lookup(cpu))); } } @@ -400,7 +400,7 @@ void smp_prepare_boot_cpu(void) #ifdef CONFIG_PPC64 paca[boot_cpuid].__current = current; #endif - set_numa_node(numa_cpu_lookup_table[boot_cpuid]); + set_numa_node(numa_cpu_lookup(boot_cpuid)); current_set[boot_cpuid] = task_thread_info(current); } @@ -718,8 +718,8 @@ void start_secondary(void *unused) } traverse_core_siblings(cpu, true); - set_numa_node(numa_cpu_lookup_table[cpu]); - set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu])); + set_numa_node(numa_cpu_lookup(cpu)); + set_numa_mem(local_memory_node(numa_cpu_lookup(cpu))); smp_wmb(); notify_cpu_starting(cpu); diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..d5e6eee 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -52,7 +52,6 @@ int numa_cpu_lookup_table[NR_CPUS]; cpumask_var_t node_to_cpumask_map[MAX_NUMNODES]; struct pglist_data *node_data[MAX_NUMNODES]; -EXPORT_SYMBOL(numa_cpu_lookup_table); EXPORT_SYMBOL(node_to_cpumask_map); EXPORT_SYMBOL(node_data); @@ -134,19 +133,25 @@ static int __init fake_numa_create_new_node(unsigned long end_pfn, return 0; } -static void reset_numa_cpu_lookup_table(void) +int numa_cpu_lookup(int cpu) { - unsigned int cpu; - - for_each_possible_cpu(cpu) - numa_cpu_lookup_table[cpu] = -1; + return numa_cpu_lookup_table[cpu]; } +EXPORT_SYMBOL(numa_cpu_lookup); -static void update_numa_cpu_lookup_table(unsigned int cpu, int node) +static inline void update_numa_cpu_lookup_table(unsigned int cpu, int node) { numa_cpu_lookup_table[cpu] = node; } +static void reset_numa_cpu_lookup_table(void) +{ + unsigned int cpu; + + for_each_possible_cpu(cpu) + update_numa_cpu_lookup_table(cpu, -1); +} + static void map_cpu_to_node(int cpu, int node) { update_numa_cpu_lookup_table(cpu, node); @@ -160,7 +165,7 @@ static void map_cpu_to_node(int cpu, int node) #if defined(CONFIG_HOTPLUG_CPU) || 
defined(CONFIG_PPC_SPLPAR) static void unmap_cpu_from_node(unsigned long cpu) { - int node = numa_cpu_lookup_table[cpu]; + int node = numa_cpu_lookup(cpu); dbg("removing cpu %lu from node %d\n", cpu, node); @@ -536,7 +541,8 @@ static int numa_setup_cpu(unsigned long lcpu) * directly instead of querying the firmware, since it represents * the most recent mapping notified to us by the platform (eg: VPHN). */ - if ((nid = numa_cpu_lookup_table[lcpu]) >= 0) { + nid = numa_cpu_lookup(lcpu); + if (nid >= 0) { map_cpu_to_node(lcpu, nid); return nid; } @@ -1413,7 +1419,7 @@ int arch_update_cpu_topology(void) if (new_nid < 0 || !node_online(new_nid)) new_nid = first_online_node; - if (new_nid == numa_cpu_lookup_table[cpu]) { +
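[Editor's illustration] The point of the numa_cpu_lookup() accessor in the patch above is purely structural: keep the table private and route every read through one function, so the representation can change in a single place. A small stand-alone model; NR_CPUS_MODEL and the main() driver are invented for this example and are not kernel code.

	#include <stdio.h>

	#define NR_CPUS_MODEL 8

	/* Keep the table private to this file; everyone else goes through the
	 * accessor, so swapping the representation later only touches here. */
	static int numa_cpu_lookup_table[NR_CPUS_MODEL];

	static int numa_cpu_lookup(int cpu)
	{
		return numa_cpu_lookup_table[cpu];
	}

	static void update_numa_cpu_lookup_table(int cpu, int node)
	{
		numa_cpu_lookup_table[cpu] = node;
	}

	int main(void)
	{
		for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
			update_numa_cpu_lookup_table(cpu, -1);   /* reset, as in the patch */

		update_numa_cpu_lookup_table(3, 1);
		printf("cpu 3 -> node %d\n", numa_cpu_lookup(3));
		return 0;
	}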
Re: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
* Michael Ellerman <m...@ellerman.id.au> [2015-09-22 15:29:03]:

> On Tue, 2015-09-15 at 07:38 +0530, Raghavendra K T wrote:
> >
> > ... nothing
>
> Sure this patch looks obvious, but please give me a changelog that proves
> you've thought about it thoroughly.
>
> For example is it OK to use for_each_node() at this point in boot? Is there any
> historical reason why we did it with a hard coded loop? If so what has changed.
> What systems have you tested on? etc. etc.
>
> cheers

Hi Michael, resending the patches with the changelog. Please note that the patch is in the -mm tree already.

---8<---
>From 86ead2520662c362d7b9ebd452ce1c33e156016f Mon Sep 17 00:00:00 2001
From: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
Date: Sun, 6 Sep 2015 12:54:40 +0530
Subject: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes

With the setup_nr_nodes(), we have already initialized node_possible_map. So it is safe to use for_each_node here.

There are many places in the kernel that use a hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That should be the reason we had maintained the same for powerpc. But, since sparse numa node ids are possible on powerpc, we unnecessarily allocate memory for non existent numa nodes. For e.g., on a system with 0,1,16,17 as numa nodes, nr_node_ids=18 and we allocate memory for the non existent nodes 2-15. With this patch we allocate memory only for existing numa nodes.

The patch is boot tested on a 4 node tuleta [ confirmed with printks ] that it works as expected.

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
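[Editor's illustration] To see what the one-line change buys on a sparse layout, here is a toy userspace model: the bitmap stands in for node_possible_map and the counters for alloc_bootmem_cpumask_var() calls; it is not kernel code.

	#include <stdio.h>

	#define MAX_NODES_MODEL 32

	/* Possible nodes 0,1,16,17 as a bitmap (stand-in for node_possible_map). */
	static unsigned long possible = (1UL << 0) | (1UL << 1) | (1UL << 16) | (1UL << 17);
	static int nr_node_ids_model = 18;      /* highest possible id + 1 */

	int main(void)
	{
		int allocs_old = 0, allocs_new = 0;

		/* Old loop: one allocation per id below nr_node_ids, holes included. */
		for (int node = 0; node < nr_node_ids_model; node++)
			allocs_old++;

		/* New loop: only ids that are actually set in the possible mask. */
		for (int node = 0; node < MAX_NODES_MODEL; node++)
			if (possible & (1UL << node))
				allocs_new++;

		printf("allocations: old=%d new=%d\n", allocs_old, allocs_new);  /* 18 vs 4 */
		return 0;
	}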
Re: [PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
On 09/22/2015 10:59 AM, Michael Ellerman wrote: On Tue, 2015-09-15 at 07:38 +0530, Raghavendra K T wrote: ... nothing Sure this patch looks obvious, but please give me a changelog that proves you've thought about it thoroughly. For example is it OK to use for_each_node() at this point in boot? Is there any historical reason why we did it with a hard coded loop? If so what has changed. What systems have you tested on? etc. etc. cheers Changelog: With the setup_nr_nodes(), we have already initialized node_possible_map. So it is safe to use for_each_node here. There are many places in the kernel that use hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That should be reason we had maintained same for powerpc. But since on power we have sparse numa node ids possible, we unnecessarily allocate memory for non existent numa nodes. For e.g., on a system with 0,1,16,17 as numa nodes nr_node_ids=18 and we allocate memory for nodes 2-14. The patch is boot tested on a 4 node tuleta [ confirming with printks ]. that it works as expected. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/14/2015 02:30 PM, Vladimir Davydov wrote: Hi, On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. Do I understand correctly that node 0 must always be in node_possible_map? I ask, because we currently test lru->node[0].memcg_lrus to determine if the list is memcg aware. Yes, node 0 is always there. So it should not be a problem. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..5a97f83 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (!memcg_aware) lru->node[i].memcg_lrus = NULL; So, we don't explicitly initialize memcg_lrus for nodes that are not in node_possible_map. That's OK, because we allocate lru->node using kzalloc. However, this partial nullifying in case !memcg_aware looks confusing IMO. Let's drop it, I mean something like this: Yes, you are right. and we do not have to have memcg_aware check inside for loop too. Will change as per your suggestion and send V2. Thanks for the review. static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; if (!memcg_aware) return 0; for_each_node(i) { if (memcg_init_list_lru_node(>node[i])) goto fail; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/14/2015 05:34 PM, Vladimir Davydov wrote: On Mon, Sep 14, 2015 at 05:09:31PM +0530, Raghavendra K T wrote: On 09/14/2015 02:30 PM, Vladimir Davydov wrote: On Wed, Sep 09, 2015 at 12:01:46AM +0530, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. Do I understand correctly that node 0 must always be in node_possible_map? I ask, because we currently test lru->node[0].memcg_lrus to determine if the list is memcg aware. Yes, node 0 is always there. So it should not be a problem. I think it should be mentioned in the comment to list_lru_memcg_aware then. Something like this: ? static inline bool list_lru_memcg_aware(struct list_lru *lru) { /* * This needs node 0 to be always present, even * in the systems supporting sparse numa ids. */ return !!lru->node[0].memcg_lrus; } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 0/2] Replace nr_node_ids for loop with for_each_node
Many places in the kernel use 'for' loop with nr_node_ids. For the architectures which supports sparse numa ids, this will result in some unnecessary allocations for non existing nodes. (for e.g., numa node numbers such as 0,1,16,17 is common in powerpc.) So replace the for loop with for_each_node so that allocations happen only for existing numa nodes. Please note that, though there are many places where nr_node_ids is used, current patchset uses for_each_node only for slowpath to avoid find_next_bit traversal. Changes in V2: - Take memcg_aware check outside for_each loop (Vldimir) - Add comment that node 0 should always be present (Vladimir) Raghavendra K T (2): mm: Replace nr_node_ids for loop with for_each_node in list lru powerpc:numa Do not allocate bootmem memory for non existing nodes arch/powerpc/mm/numa.c | 2 +- mm/list_lru.c | 34 +++--- 2 files changed, 24 insertions(+), 12 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
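[Editor's illustration] On the remark above about keeping for_each_node() to slow paths: for_each_node() roughly expands into repeated next-set-bit searches over node_possible_map, whereas the nr_node_ids loop is a plain increment. A naive userspace model of that traversal follows; next_set_bit() here is deliberately simple, not the kernel's optimized find_next_bit().

	#include <stdio.h>

	#define MAX_NODES_MODEL 32

	/* Each step of the for_each_node()-style loop is a bit search. */
	static int next_set_bit(unsigned long mask, int start)
	{
		for (int i = start; i < MAX_NODES_MODEL; i++)
			if (mask & (1UL << i))
				return i;
		return MAX_NODES_MODEL;
	}

	#define for_each_node_model(node, mask) \
		for ((node) = next_set_bit((mask), 0); \
		     (node) < MAX_NODES_MODEL;         \
		     (node) = next_set_bit((mask), (node) + 1))

	int main(void)
	{
		unsigned long possible = (1UL << 0) | (1UL << 1) | (1UL << 16) | (1UL << 17);
		int node;

		for_each_node_model(node, possible)
			printf("visiting node %d\n", node);      /* 0, 1, 16, 17 */
		return 0;
	}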
[PATCH V2 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
On 09/15/2015 07:38 AM, Raghavendra K T wrote: The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. [ Take memcg_aware check outside for_each loop: Vladimir] Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- Sorry that I had misspelled Vladimir above. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. [ Take memcg_aware check outside for_each loop: Vldimir] Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 34 +++--- 1 file changed, 23 insertions(+), 11 deletions(-) Changes in V2: - Take memcg_aware check outside for_each loop (Vldimir) - Add comment that node 0 should always be present (Vladimir) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..60cd6dd 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -42,6 +42,10 @@ static void list_lru_unregister(struct list_lru *lru) #ifdef CONFIG_MEMCG_KMEM static inline bool list_lru_memcg_aware(struct list_lru *lru) { + /* +* This needs node 0 to be always present, even +* in the systems supporting sparse numa ids. +*/ return !!lru->node[0].memcg_lrus; } @@ -377,16 +381,20 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { - if (!memcg_aware) - lru->node[i].memcg_lrus = NULL; - else if (memcg_init_list_lru_node(>node[i])) + if (!memcg_aware) + return 0; + + for_each_node(i) { + if (memcg_init_list_lru_node(>node[i])) goto fail; } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; memcg_destroy_list_lru_node(>node[i]); + } return -ENOMEM; } @@ -397,7 +405,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru) if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_destroy_list_lru_node(>node[i]); } @@ -409,16 +417,20 @@ static int memcg_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return 0; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (memcg_update_list_lru_node(>node[i], old_size, new_size)) goto fail; } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; + memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); + } return -ENOMEM; } @@ -430,7 +442,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); } @@ -485,7 +497,7 @@ static void memcg_drain_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_drain_list_lru_node(>node[i], src_idx, dst_idx); } @@ -522,7 +534,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, if (!lru->node) goto out; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { spin_lock_init(>node[i].lock); if (key) lockdep_set_class(>node[i].lock, key); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
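[Editor's illustration] One detail of the error path above is worth spelling out: because the init loop now uses for_each_node(), slots for impossible node ids are never initialized (they stay NULL from the kzalloc of lru->node), so the unwind loop has to skip them. A minimal userspace model of that init/unwind pattern; the possible[] mask, calloc() and the simulated failure are stand-ins, not the list_lru code.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define MAX_NODES_MODEL 8

	struct node_state {
		int *memcg_lrus;        /* NULL means "never initialized" */
	};

	static struct node_state nodes[MAX_NODES_MODEL];
	/* Possible nodes 0,1,6,7: the holes in between are never touched. */
	static const int possible[MAX_NODES_MODEL] = { 1, 1, 0, 0, 0, 0, 1, 1 };

	static int init_all(int fail_at)
	{
		for (int i = 0; i < MAX_NODES_MODEL; i++) {
			if (!possible[i])
				continue;                   /* for_each_node() skips holes */
			if (i == fail_at)
				goto fail;                  /* simulate -ENOMEM */
			nodes[i].memcg_lrus = calloc(4, sizeof(int));
		}
		return 0;
	fail:
		for (int i = fail_at - 1; i >= 0; i--) {
			if (!nodes[i].memcg_lrus)
				continue;                   /* hole, never initialized: skip */
			free(nodes[i].memcg_lrus);
			nodes[i].memcg_lrus = NULL;
		}
		return -1;
	}

	int main(void)
	{
		memset(nodes, 0, sizeof(nodes));            /* kzalloc()-like zeroing */
		printf("init_all(fail_at=6) -> %d\n", init_all(6));
		return 0;
	}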
[PATCH 0/2] Replace nr_node_ids for loop with for_each_node
Many places in the kernel use 'for' loop with nr_node_ids. For the architectures which supports sparse numa ids, this will result in some unnecessary allocations for non existing nodes. (for e.g., numa node numbers such as 0,1,16,17 is common in powerpc.) So replace the for loop with for_each_node so that allocations happen only for existing numa nodes. Please note that, though there are many places where nr_node_ids is used, current patchset uses for_each_node only for slowpath to avoid find_next_bit traversal. Raghavendra K T (2): mm: Replace nr_node_ids for loop with for_each_node in list lru powerpc:numa Do not allocate bootmem memory for non existing nodes arch/powerpc/mm/numa.c | 2 +- mm/list_lru.c | 23 +++ 2 files changed, 16 insertions(+), 9 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] powerpc:numa Do not allocate bootmem memory for non existing nodes
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8b9502a..8d8a541 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -80,7 +80,7 @@ static void __init setup_node_to_cpumask_map(void) setup_nr_node_ids(); /* allocate the map */ - for (node = 0; node < nr_node_ids; node++) + for_each_node(node) alloc_bootmem_cpumask_var(_to_cpumask_map[node]); /* cpumask_of_node() will now work */ -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] mm: Replace nr_node_ids for loop with for_each_node in list lru
The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com> --- mm/list_lru.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/mm/list_lru.c b/mm/list_lru.c index 909eca2..5a97f83 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -377,7 +377,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) { int i; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (!memcg_aware) lru->node[i].memcg_lrus = NULL; else if (memcg_init_list_lru_node(>node[i])) @@ -385,8 +385,11 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware) } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; memcg_destroy_list_lru_node(>node[i]); + } return -ENOMEM; } @@ -397,7 +400,7 @@ static void memcg_destroy_list_lru(struct list_lru *lru) if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_destroy_list_lru_node(>node[i]); } @@ -409,16 +412,20 @@ static int memcg_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return 0; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { if (memcg_update_list_lru_node(>node[i], old_size, new_size)) goto fail; } return 0; fail: - for (i = i - 1; i >= 0; i--) + for (i = i - 1; i >= 0; i--) { + if (!lru->node[i].memcg_lrus) + continue; + memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); + } return -ENOMEM; } @@ -430,7 +437,7 @@ static void memcg_cancel_update_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_cancel_update_list_lru_node(>node[i], old_size, new_size); } @@ -485,7 +492,7 @@ static void memcg_drain_list_lru(struct list_lru *lru, if (!list_lru_memcg_aware(lru)) return; - for (i = 0; i < nr_node_ids; i++) + for_each_node(i) memcg_drain_list_lru_node(>node[i], src_idx, dst_idx); } @@ -522,7 +529,7 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, if (!lru->node) goto out; - for (i = 0; i < nr_node_ids; i++) { + for_each_node(i) { spin_lock_init(>node[i].lock); if (key) lockdep_set_class(>node[i].lock, key); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V4 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 include/net/ip.h | 10 ++
 net/ipv4/af_inet.c | 41 +++--
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h index d5fe9f2..93bf12e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd) #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd) +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct); unsigned long snmp_fold_field(void __percpu *mib, int offt); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off); #else +static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, + size_t syncp_offset) +{ + return snmp_get_cpu_field(mib, cpu, offct); + +} + static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off) { return snmp_fold_field(mib, offt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee8..302e36b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt) +{ + return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt); +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field); + unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) - res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt); + res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt, +size_t syncp_offset) +{ + void *bhptr; + struct u64_stats_sync *syncp; + u64 v; + unsigned int start; + + bhptr = per_cpu_ptr(mib, cpu); + syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); + do { + start = u64_stats_fetch_begin_irq(syncp); + v = *(((u64 *)bhptr) + offt); + } while (u64_stats_fetch_retry_irq(syncp, start)); + + return v; +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); + u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { - void *bhptr; - struct u64_stats_sync *syncp; - u64 v; - unsigned int start; - - bhptr = per_cpu_ptr(mib, cpu); - syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); - do { - start = u64_stats_fetch_begin_irq(syncp); - v = *(((u64 *) bhptr) + offt); - } while (u64_stats_fetch_retry_irq(syncp, start)); - - res += v; + res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset); } return res; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
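[Editor's illustration] The helpers above address a field by its word offset into each CPU's copy of the mib structure. Below is a small stand-alone model of that access pattern: plain array indexing replaces per_cpu_ptr(), there is no u64_stats syncp handling, and the struct and field names are invented for the example.

	#include <stdio.h>

	#define NCPUS_MODEL 4

	struct mib_model {
		unsigned long in_pkts;      /* field offset 0 */
		unsigned long out_pkts;     /* field offset 1 */
		unsigned long in_errors;    /* field offset 2 */
	};

	static struct mib_model percpu_mib[NCPUS_MODEL] = {
		{ 10, 1, 0 }, { 20, 2, 1 }, { 30, 3, 0 }, { 40, 4, 2 },
	};

	/* Same idea as snmp_get_cpu_field(): treat one CPU's struct as an array
	 * of unsigned long and pick the field by its word offset (the kernel
	 * mibs rely on exactly this layout assumption). */
	static unsigned long get_cpu_field(int cpu, int offt)
	{
		return *(((unsigned long *)&percpu_mib[cpu]) + offt);
	}

	static unsigned long fold_field(int offt)
	{
		unsigned long res = 0;

		for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
			res += get_cpu_field(cpu, offt);
		return res;
	}

	int main(void)
	{
		printf("sum(in_pkts)  = %lu\n", fold_field(0));   /* 100 */
		printf("sum(out_pkts) = %lu\n", fold_field(1));   /* 10 */
		return 0;
	}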
[PATCH RFC V4 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 36 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.25s cache miss 2.7% 1.41% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one changes/ideas suggested: Using buffer in stack (Eric), Usage of memset (David), Using memcpy in place of unaligned_put (Joe). Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 26 -- 1 file changed, 16 insertions(+), 10 deletions(-) Changes in V4: - remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse warning (Eric) also remove 'item' parameter (Joe) - add missing memset of padding. Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..4abc10c 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,18 +4624,24 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int bytes, size_t syncpoff) { - int i; - int pad = bytes - sizeof(u64) * items; + int i, c; + u64 buff[IPSTATS_MIB_MAX]; + int pad = bytes - sizeof(u64) * IPSTATS_MIB_MAX; + BUG_ON(pad < 0); - /* Use put_unaligned() because stats may not be aligned for u64. */ - put_unaligned(items, [0]); - for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + memset(buff, 0, sizeof(buff)); + buff[0] = IPSTATS_MIB_MAX; - memset([items], 0, pad); + for_each_possible_cpu(c) { + for (i = 1; i < IPSTATS_MIB_MAX; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + } + + memcpy(stats, buff, IPSTATS_MIB_MAX * sizeof(u64)); + memset([IPSTATS_MIB_MAX], 0, pad); } static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, @@ -4643,8 +4649,8 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, { switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, bytes, +offsetof(struct ipstats_mib, syncp)); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
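[Editor's illustration] The heart of this patch is a loop interchange: instead of folding one field across every CPU and then starting over for the next field (roughly 36 passes over the per-cpu data), walk each CPU once and accumulate all of its fields into a stack buffer. A userspace model of the two orders follows; the cache effect is only described in comments, and both orders produce identical sums.

	#include <stdio.h>
	#include <string.h>

	#define NCPUS_MODEL  4
	#define NITEMS_MODEL 8

	static unsigned long percpu_stats[NCPUS_MODEL][NITEMS_MODEL];

	/* Old order: for every item, touch every CPU's data, so each CPU's cache
	 * lines are pulled in NITEMS_MODEL times. */
	static void fold_item_major(unsigned long *out)
	{
		for (int item = 0; item < NITEMS_MODEL; item++) {
			out[item] = 0;
			for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
				out[item] += percpu_stats[cpu][item];
		}
	}

	/* New order: walk each CPU once and accumulate all items into a local
	 * buffer, so each CPU's lines are pulled in roughly once. */
	static void fold_cpu_major(unsigned long *out)
	{
		unsigned long buff[NITEMS_MODEL];

		memset(buff, 0, sizeof(buff));
		for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
			for (int item = 0; item < NITEMS_MODEL; item++)
				buff[item] += percpu_stats[cpu][item];
		memcpy(out, buff, sizeof(buff));
	}

	int main(void)
	{
		unsigned long a[NITEMS_MODEL], b[NITEMS_MODEL];

		for (int cpu = 0; cpu < NCPUS_MODEL; cpu++)
			for (int item = 0; item < NITEMS_MODEL; item++)
				percpu_stats[cpu][item] = cpu * 100 + item;

		fold_item_major(a);
		fold_cpu_major(b);
		printf("results match: %s\n", memcmp(a, b, sizeof(a)) ? "no" : "yes");
		return 0;
	}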
[PATCH RFC V4 0/2] Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing lot of time spent in snmp_fold_field on a large cpu system. The current patch tries to improve by reordering the statistics gathering. Please note that similar overhead was also reported while creating veth pairs https://lkml.org/lkml/2013/3/19/556 Changes in V4: - remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse warning (Eric) also remove 'item' parameter (Joe) - add missing memset of padding. Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) Setup: 160 cpu (20 core) baremetal powerpc system with 1TB memory 1000 docker containers was created with command docker run -itd ubuntu:15.04 /bin/bash in loop observation: Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) perf data showed, creating veth interfaces resulting in the below code path was taking more time. rtnl_fill_ifinfo -> inet6_fill_link_af -> inet6_fill_ifla6_attrs -> snmp_fold_field proposed idea: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data to of an item (iteratively for around 36 items). The patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Performance of docker creation improved by around more than 2x after the patch. before the patch: #time docker run -itd ubuntu:15.04 /bin/bash 3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617 real0m6.836s user0m0.095s sys 0m0.011s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 32K of event 'cycles' # Event count (approx.): 24688700190 # Overhead Command Shared Object Symbol # ... .. 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock 1.37% docker docker [.] backtrace_qsort 1.31% docker docker [.] strings.FieldsFunc cache-misses: 2.7% after the patch: = #time docker run -itd ubuntu:15.04 /bin/bash 9178273e9df399c8290b6c196e4aef9273be2876225f63b14a60cf97eacfafb5 real0m3.249s user0m0.088s sys 0m0.020s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 18K of event 'cycles' # Event count (approx.): 13958958797 # Overhead Command Shared Object Symbol # ... .. 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one 3.96% docker docker[.] runtime_MSpan_Sweep 2.47% docker docker[.] strings.FieldsFunc cache-misses: 1.41 % Please let me know if you have suggestions/comments. Thanks Eric, Joe and David for the comments. Raghavendra K T (2): net: Introduce helper functions to get the per cpu data net: Optimize
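For contrast, a hedged sketch of the access pattern behind the original snmp_fold_field()-per-item loop, reusing the illustrative declarations from the sketch after the V4 2/2 patch above. Each field triggers a fresh pass over every CPU's per-cpu block, so with roughly 36 fields and 160 CPUs that is on the order of 5700 remote per-cpu reads per interface, and every step of the inner loop lands on a different CPU's cache lines:

/*
 * Old shape: one full walk over all CPUs per field.  The same remote
 * cache lines are re-fetched once per field, which is the cost that
 * dominates the "before" profile above.
 */
static void fill_stats_per_field(uint64_t *out)
{
        int field, cpu;

        for (field = 1; field < NR_FIELDS; field++) {
                uint64_t sum = 0;

                for (cpu = 0; cpu < nr_cpus; cpu++)
                        sum += per_cpu_data[cpu][field];
                out[field] = sum;
        }
}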
Re: [PATCH RFC V3 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/29/2015 08:51 PM, Joe Perches wrote: On Sat, 2015-08-29 at 07:32 -0700, Eric Dumazet wrote: On Sat, 2015-08-29 at 14:37 +0530, Raghavendra K T wrote: static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; + u64 buff[items]; + One last comment : using variable length arrays is confusing for the reader, and sparse as well. $ make C=2 net/ipv6/addrconf.o ... CHECK net/ipv6/addrconf.c net/ipv6/addrconf.c:4733:18: warning: Variable length array is used. net/ipv6/addrconf.c:4737:25: error: cannot size expression I suggest you remove 'items' parameter and replace it by IPSTATS_MIB_MAX, as __snmp6_fill_stats64() is called exactly once. If you respin, I suggest: o remove "items" from the __snmp6_fill_stats64 arguments and use IPSTATS_MIB_MAX in the function instead Yes, as also suggested by Eric. o add braces around the for_each_possible_cpu loop for_each_possible_cpu(c) { for (i = 1; i < items; i++) buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); } Sure. It makes it more readable. will respin V4 with these changes (+ memset 0 for pad which I realized). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
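The sparse complaint comes down to the variable-length array; a hedged sketch of the V3-to-V4 change being asked for, with FIELD_MAX as an illustrative stand-in for IPSTATS_MIB_MAX and fill_v3()/fill_v4() as made-up names:

#include <stdint.h>
#include <string.h>

/*
 * V3 shape: a VLA whose size is only known at run time.  sparse warns
 * ("Variable length array is used") and cannot size the expression, and
 * readers cannot see the stack cost at a glance.
 */
void fill_v3(int items)
{
        uint64_t buff[items];
        memset(buff, 0, sizeof(buff));  /* sizeof() on a VLA is evaluated at run time */
}

/*
 * V4 shape: a fixed upper bound known at compile time.  The function has
 * exactly one caller, which always passes the maximum, so the 'items'
 * parameter can simply be dropped.
 */
#define FIELD_MAX 36                    /* stand-in for IPSTATS_MIB_MAX */

void fill_v4(void)
{
        uint64_t buff[FIELD_MAX];
        memset(buff, 0, sizeof(buff));  /* constant-size clear */
}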
[PATCH RFC V3 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T --- include/net/ip.h | 10 ++ net/ipv4/af_inet.c | 41 +++-- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index d5fe9f2..93bf12e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd) #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd) +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct); unsigned long snmp_fold_field(void __percpu *mib, int offt); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off); #else +static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, + size_t syncp_offset) +{ + return snmp_get_cpu_field(mib, cpu, offct); + +} + static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off) { return snmp_fold_field(mib, offt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee8..302e36b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt) +{ + return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt); +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field); + unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) - res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt); + res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset) +{ + void *bhptr; + struct u64_stats_sync *syncp; + u64 v; + unsigned int start; + + bhptr = per_cpu_ptr(mib, cpu); + syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); + do { + start = u64_stats_fetch_begin_irq(syncp); + v = *(((u64 *)bhptr) + offt); + } while (u64_stats_fetch_retry_irq(syncp, start)); + + return v; +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); + u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { - void *bhptr; - struct u64_stats_sync *syncp; - u64 v; - unsigned int start; - - bhptr = per_cpu_ptr(mib, cpu); - syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); - do { - start = u64_stats_fetch_begin_irq(syncp); - v = *(((u64 *) bhptr) + offt); - } while (u64_stats_fetch_retry_irq(syncp, start)); - - res += v; + res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset); } return res; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
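Why the 32-bit variant of the new helper is more involved than the 64-bit one: a 32-bit CPU cannot load a u64 atomically, so each per-cpu read is protected by a sequence counter and retried if a writer raced with it. The kernel code uses the u64_stats_sync machinery (u64_stats_fetch_begin_irq() / u64_stats_fetch_retry_irq()) with proper barriers; the following is only a simplified, hedged userspace sketch of the retry idea, with made-up names and barriers omitted:

#include <stdint.h>

/*
 * A writer makes 'seq' odd, updates 'value', then makes 'seq' even again.
 * A reader retries until it sees the same even sequence before and after
 * the read, proving the 64-bit value was not torn.
 */
struct stat64 {
        unsigned int seq;
        uint64_t value;
};

static uint64_t read_stat64(const volatile struct stat64 *s)
{
        unsigned int start;
        uint64_t v;

        do {
                start = s->seq;
                v = s->value;
        } while ((start & 1) || start != s->seq);

        return v;
}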
[PATCH RFC V3 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 36 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.25s cache miss 2.7% 1.41% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one changes/ideas suggested: Using buffer in stack (Eric), Usage of memset (David), Using memcpy in place of unaligned_put (Joe). Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 22 +- 1 file changed, 13 insertions(+), 9 deletions(-) Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..379619a 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,18 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; + u64 buff[items]; + BUG_ON(pad < 0); - /* Use put_unaligned() because stats may not be aligned for u64. */ - put_unaligned(items, [0]); - for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + memset(buff, 0, sizeof(buff)); + buff[0] = items; - memset([items], 0, pad); + for_each_possible_cpu(c) + for (i = 1; i < items; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + + memcpy(stats, buff, items * sizeof(u64)); } static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, @@ -4643,8 +4647,8 @@ static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, { switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, +bytes, offsetof(struct ipstats_mib, syncp)); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V3 0/2] Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing lot of time spent in snmp_fold_field on a large cpu system. The current patch tries to improve by reordering the statistics gathering. Please note that similar overhead was also reported while creating veth pairs https://lkml.org/lkml/2013/3/19/556 Changes in V3: - use memset to initialize temp buffer in leaf function. (David) - use memcpy to copy the buffer data to stat instead of unalign_pu (Joe) - Move buffer definition to leaf function __snmp6_fill_stats64() (Eric) - Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) Setup: 160 cpu (20 core) baremetal powerpc system with 1TB memory 1000 docker containers was created with command docker run -itd ubuntu:15.04 /bin/bash in loop observation: Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) perf data showed, creating veth interfaces resulting in the below code path was taking more time. rtnl_fill_ifinfo -> inet6_fill_link_af -> inet6_fill_ifla6_attrs -> snmp_fold_field proposed idea: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data to of an item (iteratively for around 36 items). The patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Performance of docker creation improved by around more than 2x after the patch. before the patch: #time docker run -itd ubuntu:15.04 /bin/bash 3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617 real0m6.836s user0m0.095s sys 0m0.011s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 32K of event 'cycles' # Event count (approx.): 24688700190 # Overhead Command Shared Object Symbol # ... .. 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock 1.37% docker docker [.] backtrace_qsort 1.31% docker docker [.] strings.FieldsFunc cache-misses: 2.7% after the patch: = #time docker run -itd ubuntu:15.04 /bin/bash 9178273e9df399c8290b6c196e4aef9273be2876225f63b14a60cf97eacfafb5 real0m3.249s user0m0.088s sys 0m0.020s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 18K of event 'cycles' # Event count (approx.): 13958958797 # Overhead Command Shared Object Symbol # ... .. 10.57% docker docker[.] scanblock 8.37% swapper [kernel.kallsyms] [k] snooze_loop 6.91% docker [kernel.kallsyms] [k] snmp_get_cpu_field 6.67% docker [kernel.kallsyms] [k] veth_stats_one 3.96% docker docker[.] runtime_MSpan_Sweep 2.47% docker docker[.] strings.FieldsFunc cache-misses: 1.41 % Please let me know if you have suggestions/comments. Thanks Eric, Joe and David for comments on V1 and V2. Raghavendra K T (2): net: Introduce helper functions to get the per cpu data net: Optimize snmp stat aggregation by walking all the percpu data at once include/net/ip.h| 10 ++ net/ipv4/af_ine
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/29/2015 10:41 AM, David Miller wrote: From: Raghavendra K T Date: Sat, 29 Aug 2015 08:27:15 +0530 resending the patch with memset. Please let me know if you want to resend all the patches. Do not post patches as replies to existing discussion threads. Instead, make a new, fresh, patch posting, updating the Subject line as needed. Sure. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/29/2015 08:56 AM, Eric Dumazet wrote: On Sat, 2015-08-29 at 08:27 +0530, Raghavendra K T wrote: /* Use put_unaligned() because stats may not be aligned for u64. */ put_unaligned(items, &stats[0]); for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), &stats[i]); + put_unaligned(buff[i], &stats[i]); I believe Joe suggested the following code instead: buff[0] = items; memcpy(stats, buff, items * sizeof(u64)); Thanks. Sure, will use this. (I missed that. I thought it was applicable only when we have aligned data, and for Power, put_unaligned was not a no-op, unlike Intel.) Also please move the buff[] array into __snmp6_fill_stats64() to make it clear it is used in a 'leaf' function. Correct. (Even if calling memcpy()/memset() makes it not a leaf function.) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
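The alignment point behind this exchange: the netlink destination buffer has no 8-byte alignment guarantee, so either every store goes through put_unaligned() or the whole aligned scratch buffer is copied with memcpy(), which is defined byte-wise and therefore safe for any destination address. A hedged userspace illustration (copy_counters() is a made-up name):

#include <stdint.h>
#include <string.h>

/*
 * 'dst' may point anywhere inside a packed netlink attribute, e.g. at an
 * address ending in 0x4.  Copying the whole buffer with memcpy() is
 * correct for any alignment of dst.
 */
static void copy_counters(void *dst, const uint64_t *buff, size_t n)
{
        memcpy(dst, buff, n * sizeof(*buff));

        /*
         * By contrast, storing through a u64 pointer would be undefined
         * behaviour on a misaligned dst and can trap on strict-alignment
         * machines:
         *
         *     uint64_t *p = dst;
         *     for (size_t i = 0; i < n; i++)
         *             p[i] = buff[i];
         */
}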
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
* David Miller [2015-08-28 11:24:13]: > From: Raghavendra K T > Date: Fri, 28 Aug 2015 12:09:52 +0530 > > > On 08/28/2015 12:08 AM, David Miller wrote: > >> From: Raghavendra K T > >> Date: Wed, 26 Aug 2015 23:07:33 +0530 > >> > >>> @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 > >>> *stats, void __percpu *mib, > >>> static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int > >>> attrtype, > >>>int bytes) > >>> { > >>> + u64 buff[IPSTATS_MIB_MAX] = {0,}; > >>> + > ... > > hope you wanted to know the overhead than to change the current > > patch. please let me know.. > > I want you to change that variable initializer to an explicit memset(). > > The compiler is emitting a memset() or similar _anyways_. > > Not because it will have any impact at all upon performance, but because > of how it looks to people trying to read and understand the code. > > Hi David, resending the patch with memset. Please let me know if you want to resend all the patches. 8< From: Raghavendra K T Subject: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 90 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.357s cache miss 2.7% 1.38% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.56% swapper [kernel.kallsyms] [k] snooze_loop 8.72% docker [kernel.kallsyms] [k] snmp_get_cpu_field 7.59% docker [kernel.kallsyms] [k] veth_stats_one 3.65% swapper [kernel.kallsyms] [k] _raw_spin_lock Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) Change in V2: - Allocate stat calculation buffer in stack (Eric) - Use memset to zero temp buffer (David) Thanks David and Eric for coments on V1 and as both of them pointed, unfortunately we cannot get rid of buffer for calculation without avoiding unaligned op. diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..9bdfba3 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff, + u64 *buff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; BUG_ON(pad < 0); /* Use put_unaligned() because stats may not be aligned for u64. 
*/ put_unaligned(items, [0]); + + for_each_possible_cpu(c) + for (i = 1; i < items; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + put_unaligned(buff[i], [i]); memset([items], 0, pad); } @@ -4641,10 +4647,13 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, int bytes) { + u64 buff[IPSTATS_MIB_MAX]; + switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + memset(buff, 0, sizeof(buff)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes, +offsetof(struct ipstats_mib, syncp), buff); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsub
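The initializer request above is purely about making the cost visible: a "= {0,}" initializer and an explicit memset() compile to the same clearing of the array, but the memset() shows a reader that every call pays for a full clear of the scratch buffer. A small sketch with FIELD_MAX as an illustrative stand-in for IPSTATS_MIB_MAX:

#include <stdint.h>
#include <string.h>

#define FIELD_MAX 36    /* stand-in for IPSTATS_MIB_MAX */

void with_initializer(void)
{
        uint64_t buff[FIELD_MAX] = { 0 };   /* compiler emits a memset-like clear */
        (void)buff;
}

void with_memset(void)
{
        uint64_t buff[FIELD_MAX];

        memset(buff, 0, sizeof(buff));      /* same cost, but the clear is explicit */
        (void)buff;
}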
Re: [PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/28/2015 12:08 AM, David Miller wrote: From: Raghavendra K T Date: Wed, 26 Aug 2015 23:07:33 +0530 @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, int bytes) { + u64 buff[IPSTATS_MIB_MAX] = {0,}; + switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, I would suggest using an explicit memset() here, it makes the overhead incurred by this scheme clearer. I changed the code to look like below to measure fill_stat overhead: container creation now took: 3.012s it was: without patch : 6.86sec with current patch: 3.34sec and perf did not show the snmp6_fill_stats() parent traces. changed code: snmp6_fill_stats(...) { switch (attrtype) { case IFLA_INET6_STATS: put_unaligned(IPSTATS_MIB_MAX, [0]); memset([1], 0, IPSTATS_MIB_MAX-1); //__snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes, // offsetof(struct ipstats_mib, syncp), buff); . } So in summary: The current patch amounts to reduction in major overhead in fill_stat, though there is still percpu walk overhead (0.33sec difference). [ percpu walk overead grows when create for e.g. 3k containers]. cache miss: there was no major difference (around 1.4%) w.r.t patch Hi David, hope you wanted to know the overhead than to change the current patch. please let me know.. Eric, does V2 patch look good now.. please add your ack/review Details: time = time docker run -itd ubuntu:15.04 /bin/bash b6670c321b5957f004e281cbb14512deafd0c0be6a39707c2f3dc95649bbc394 real0m3.012s user0m0.093s sys 0m0.009s perf: == # Samples: 18K of event 'cycles' # Event count (approx.): 12838752009 # Overhead Command Shared Object Symbol # ... . # 15.29% swapper [kernel.kallsyms] [k] snooze_loop 9.37% docker docker [.] scanblock 6.47% docker [kernel.kallsyms] [k] veth_stats_one 3.87% swapper [kernel.kallsyms] [k] _raw_spin_lock 2.71% docker docker [.] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T --- include/net/ip.h | 10 ++ net/ipv4/af_inet.c | 41 +++-- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index d5fe9f2..93bf12e 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb, #define NET_ADD_STATS_BH(net, field, adnd) SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd) #define NET_ADD_STATS_USER(net, field, adnd) SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd) +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct); unsigned long snmp_fold_field(void __percpu *mib, int offt); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset); u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off); #else +static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, + size_t syncp_offset) +{ + return snmp_get_cpu_field(mib, cpu, offct); + +} + static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off) { return snmp_fold_field(mib, offt); diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 9532ee8..302e36b 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family, } EXPORT_SYMBOL_GPL(inet_ctl_sock_create); +u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt) +{ + return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt); +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field); + unsigned long snmp_fold_field(void __percpu *mib, int offt) { unsigned long res = 0; int i; for_each_possible_cpu(i) - res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt); + res += snmp_get_cpu_field(mib, i, offt); return res; } EXPORT_SYMBOL_GPL(snmp_fold_field); #if BITS_PER_LONG==32 +u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct, +size_t syncp_offset) +{ + void *bhptr; + struct u64_stats_sync *syncp; + u64 v; + unsigned int start; + + bhptr = per_cpu_ptr(mib, cpu); + syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); + do { + start = u64_stats_fetch_begin_irq(syncp); + v = *(((u64 *)bhptr) + offt); + } while (u64_stats_fetch_retry_irq(syncp, start)); + + return v; +} +EXPORT_SYMBOL_GPL(snmp_get_cpu_field64); + u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset) { u64 res = 0; int cpu; for_each_possible_cpu(cpu) { - void *bhptr; - struct u64_stats_sync *syncp; - u64 v; - unsigned int start; - - bhptr = per_cpu_ptr(mib, cpu); - syncp = (struct u64_stats_sync *)(bhptr + syncp_offset); - do { - start = u64_stats_fetch_begin_irq(syncp); - v = *(((u64 *) bhptr) + offt); - } while (u64_stats_fetch_retry_irq(syncp, start)); - - res += v; + res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset); } return res; } -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field. reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data of an item (iteratively for around 90 items). idea: This patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Docker creation got faster by more than 2x after the patch. Result: Before After Docker creation time 6.836s 3.357s cache miss 2.7% 1.38% perf before: 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock perf after: 10.56% swapper [kernel.kallsyms] [k] snooze_loop 8.72% docker [kernel.kallsyms] [k] snmp_get_cpu_field 7.59% docker [kernel.kallsyms] [k] veth_stats_one 3.65% swapper [kernel.kallsyms] [k] _raw_spin_lock Signed-off-by: Raghavendra K T --- net/ipv6/addrconf.c | 18 +- 1 file changed, 13 insertions(+), 5 deletions(-) Change in V2: - Allocate stat calculation buffer in stack (Eric) Thanks David and Eric for coments on V1 and as both of them pointed, unfortunately we cannot get rid of buffer for calculation without avoiding unaligned op. diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 21c2c81..0f6c7a5 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4624,16 +4624,22 @@ static inline void __snmp6_fill_statsdev(u64 *stats, atomic_long_t *mib, } static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, - int items, int bytes, size_t syncpoff) + int items, int bytes, size_t syncpoff, + u64 *buff) { - int i; + int i, c; int pad = bytes - sizeof(u64) * items; BUG_ON(pad < 0); /* Use put_unaligned() because stats may not be aligned for u64. */ put_unaligned(items, [0]); + + for_each_possible_cpu(c) + for (i = 1; i < items; i++) + buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff); + for (i = 1; i < items; i++) - put_unaligned(snmp_fold_field64(mib, i, syncpoff), [i]); + put_unaligned(buff[i], [i]); memset([items], 0, pad); } @@ -4641,10 +4647,12 @@ static inline void __snmp6_fill_stats64(u64 *stats, void __percpu *mib, static void snmp6_fill_stats(u64 *stats, struct inet6_dev *idev, int attrtype, int bytes) { + u64 buff[IPSTATS_MIB_MAX] = {0,}; + switch (attrtype) { case IFLA_INET6_STATS: - __snmp6_fill_stats64(stats, idev->stats.ipv6, -IPSTATS_MIB_MAX, bytes, offsetof(struct ipstats_mib, syncp)); + __snmp6_fill_stats64(stats, idev->stats.ipv6, IPSTATS_MIB_MAX, bytes, +offsetof(struct ipstats_mib, syncp), buff); break; case IFLA_INET6_ICMP6STATS: __snmp6_fill_statsdev(stats, idev->stats.icmpv6dev->mibs, ICMP6_MIB_MAX, bytes); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH RFC V2 0/2] Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing lot of time spent in snmp_fold_field on a large cpu system. The current patch tries to improve by reordering the statistics gathering. Please note that similar overhead was also reported while creating veth pairs https://lkml.org/lkml/2013/3/19/556 Changes in V2: - Allocate the stat calculation buffer in stack. (Eric) Setup: 160 cpu (20 core) baremetal powerpc system with 1TB memory 1000 docker containers was created with command docker run -itd ubuntu:15.04 /bin/bash in loop observation: Docker container creation linearly increased from around 1.6 sec to 7.5 sec (at 1000 containers) perf data showed, creating veth interfaces resulting in the below code path was taking more time. rtnl_fill_ifinfo -> inet6_fill_link_af -> inet6_fill_ifla6_attrs -> snmp_fold_field proposed idea: currently __snmp6_fill_stats64 calls snmp_fold_field that walks through per cpu data to of an item (iteratively for around 90 items). The patch tries to aggregate the statistics by going through all the items of each cpu sequentially which is reducing cache misses. Performance of docker creation improved by around more than 2x after the patch. before the patch: time docker run -itd ubuntu:15.04 /bin/bash 3f45ba571a42e925c4ec4aaee0e48d7610a9ed82a4c931f83324d41822cf6617 real0m6.836s user0m0.095s sys 0m0.011s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 32K of event 'cycles' # Event count (approx.): 24688700190 # Overhead Command Shared Object Symbol # ... .. 50.73% docker [kernel.kallsyms] [k] snmp_fold_field 9.07% swapper [kernel.kallsyms] [k] snooze_loop 3.49% docker [kernel.kallsyms] [k] veth_stats_one 2.85% swapper [kernel.kallsyms] [k] _raw_spin_lock 1.37% docker docker [.] backtrace_qsort 1.31% docker docker [.] strings.FieldsFunc cache-misses: 2.7% after the patch: = time docker run -itd ubuntu:15.04 /bin/bash 4e0619421332990bdea413fe455ab187607ed63d33d5c37aa5291bc2f5b35857 real0m3.357s user0m0.092s sys 0m0.010s perf record -a docker run -itd ubuntu:15.04 /bin/bash === # Samples: 15K of event 'cycles' # Event count (approx.): 11471830714 # Overhead Command Shared Object Symbol # ... . 10.56% swapper [kernel.kallsyms] [k] snooze_loop 8.72% docker [kernel.kallsyms] [k] snmp_get_cpu_field 7.59% docker [kernel.kallsyms] [k] veth_stats_one 3.65% swapper [kernel.kallsyms] [k] _raw_spin_lock 3.06% docker docker[.] strings.FieldsFunc 2.96% docker docker[.] backtrace_qsort cache-misses: 1.38 % Please let me know if you have suggestions/comments. Thanks Eric and David for comments on V1. Raghavendra K T (2): net: Introduce helper functions to get the per cpu data net: Optimize snmp stat aggregation by walking all the percpu data at once include/net/ip.h| 10 ++ net/ipv4/af_inet.c | 41 +++-- net/ipv6/addrconf.c | 18 +- 3 files changed, 50 insertions(+), 19 deletions(-) -- 1.7.11.7 -- To unsubscribe from this list: send the line "uns
Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
On 08/26/2015 07:39 PM, Eric Dumazet wrote:
On Wed, 2015-08-26 at 15:55 +0530, Raghavendra K T wrote:
On 08/26/2015 04:37 AM, David Miller wrote:
From: Raghavendra K T
Date: Tue, 25 Aug 2015 13:24:24 +0530

Please let me know if you have suggestions/comments.

Like Eric Dumazet said, the idea is good but needs some adjustments. You might want to see whether a per-cpu work buffer works for this.

Sure. Let me know if I understood correctly: we allocate the temp buffer, we have an "add_this_cpu_data" function, and we do

for_each_online_cpu(cpu)
	smp_call_function_single(cpu, add_this_cpu_data, buffer, 1);

If not, could you please point to an example you had in mind?

Sorry, I do not think it is a good idea. Sending an IPI is way more expensive and intrusive than reading 4 or 5 cache lines from memory (per cpu). Definitely not something we want.

Okay. Another problem I thought of here was that we could only loop over online cpus.

It's extremely unfortunate that we can't depend upon the destination buffer being properly aligned, because we wouldn't need a temporary scratch area if it were aligned properly.

True, but I think for 64 bit cpus, when (pad == 0) we can go ahead and use the stats array directly and get rid of put_unaligned(). Is that correct?

Nope. We have no alignment guarantee. It could be a 0x04 pointer value (i.e. not a multiple of 8).

(My internal initial patch had this version, but I thought it was ugly to have an #ifdef BITS_PER_LONG==64.)

This has nothing to do with the arch having 64-bit longs. It is about the alignment of a u64.

Okay. I'll send a V2 declaring the tmp buffer on the stack.
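For reference, the IPI-based variant discussed (and rejected) above would look roughly like the sketch below. This is hypothetical; nothing here was posted for merging, and everything except smp_call_function_single(), for_each_online_cpu(), smp_processor_id() and the snmp_get_cpu_field() helper from this series is made up for illustration.

#include <linux/types.h>
#include <linux/smp.h>
#include <linux/percpu.h>

struct fold_req {			/* hypothetical request structure */
	void __percpu *mib;
	u64 *buff;
	int items;
};

/* Runs on the target CPU via IPI and adds that CPU's counters into buff.
 * Because the caller below waits for each IPI, only one CPU writes buff
 * at a time.
 */
static void add_this_cpu_data(void *arg)
{
	struct fold_req *req = arg;
	int i;

	for (i = 0; i < req->items; i++)
		req->buff[i] += snmp_get_cpu_field(req->mib, smp_processor_id(), i);
}

static void fold_by_ipi(struct fold_req *req)
{
	int cpu;

	/* Interrupts every online CPU and waits for each in turn: far more
	 * expensive and intrusive than the caller simply reading a few cache
	 * lines of per-cpu data, and it also misses counts accumulated on
	 * CPUs that have since gone offline - both points raised above.
	 */
	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, add_this_cpu_data, req, 1);
}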
Re: [PATCH RFC 2/2] net: Optimize snmp stat aggregation by walking all the percpu data at once
On 08/25/2015 09:30 PM, Eric Dumazet wrote:
On Tue, 2015-08-25 at 21:17 +0530, Raghavendra K T wrote:
On 08/25/2015 07:58 PM, Eric Dumazet wrote:

This is a great idea, but kcalloc()/kmalloc() can fail and you'll crash the whole kernel at this point.

Good catch, and my bad. Though the system would already be in a bad memory condition, and the stats fill is not critical for the system, do you think silently returning from here is a good idea? Or should we propagate -ENOMEM all the way up?

Hmm... presumably these 288 bytes could be allocated in the inet6_fill_ifla6_attrs() stack frame.

Correct. Since we only need room for IPSTATS_MIB_MAX entries, we could even do this in the snmp6_fill_stats() stack frame.
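The trade-off being weighed here is mainly whether a failure path exists at all. The fragment below is only a sketch of the two options; neither piece is taken verbatim from a posted patch, and the gfp flag in the allocating variant is an arbitrary placeholder.

/* Option A (roughly what V1 implied): a runtime allocation that can fail
 * under memory pressure, forcing an error path in a purely informational
 * stats fill.
 */
u64 *buff = kcalloc(IPSTATS_MIB_MAX, sizeof(u64), GFP_ATOMIC);
if (!buff)
	return;			/* or propagate -ENOMEM up the call chain */
/* ... aggregate into buff, fill stats, then kfree(buff) ... */

/* Option B (what V2 adopts): IPSTATS_MIB_MAX * sizeof(u64) is about
 * 288 bytes, small enough for the snmp6_fill_stats() stack frame, so
 * there is no failure path at all.
 */
u64 buff[IPSTATS_MIB_MAX] = {0,};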
Re: [PATCH RFC 0/2] Optimize the snmp stat aggregation for large cpus
On 08/26/2015 04:37 AM, David Miller wrote:
From: Raghavendra K T
Date: Tue, 25 Aug 2015 13:24:24 +0530

Please let me know if you have suggestions/comments.

Like Eric Dumazet said, the idea is good but needs some adjustments. You might want to see whether a per-cpu work buffer works for this.

Sure. Let me know if I understood correctly: we allocate the temp buffer, we have an "add_this_cpu_data" function, and we do

for_each_online_cpu(cpu)
	smp_call_function_single(cpu, add_this_cpu_data, buffer, 1);

If not, could you please point to an example you had in mind?

It's extremely unfortunate that we can't depend upon the destination buffer being properly aligned, because we wouldn't need a temporary scratch area if it were aligned properly.

True, but I think for 64 bit cpus, when (pad == 0) we can go ahead and use the stats array directly and get rid of put_unaligned(). Is that correct? (My internal initial patch had this version, but I thought it was ugly to have an #ifdef BITS_PER_LONG==64.)
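The alignment constraint being discussed can be made concrete with a small sketch; the address value is only an example and the fragment is illustrative, not code from the series.

/* Why put_unaligned() stays even on 64-bit kernels: 'stats' points into a
 * netlink attribute payload, which only guarantees 4-byte alignment, so an
 * address ending in ...4 is perfectly legal (Eric's 0x04 example).  A plain
 * 'stats[i] = v' would then be a misaligned 8-byte store, which some
 * architectures trap on; note that pad == 0 says nothing about alignment.
 */
u64 v = buff[i];
put_unaligned(v, &stats[i]);	/* byte-safe store at any alignment */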
[PATCH RFC V2 1/2] net: Introduce helper functions to get the per cpu data
Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 include/net/ip.h   | 10 ++
 net/ipv4/af_inet.c | 41 +++--
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index d5fe9f2..93bf12e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -202,10 +202,20 @@ void ip_send_unicast_reply(struct sock *sk, struct sk_buff *skb,
 #define NET_ADD_STATS_BH(net, field, adnd)	SNMP_ADD_STATS_BH((net)->mib.net_statistics, field, adnd)
 #define NET_ADD_STATS_USER(net, field, adnd)	SNMP_ADD_STATS_USER((net)->mib.net_statistics, field, adnd)

+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offct);
 unsigned long snmp_fold_field(void __percpu *mib, int offt);
 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+			 size_t syncp_offset);
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t sync_off);
 #else
+static inline u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offct,
+				       size_t syncp_offset)
+{
+	return snmp_get_cpu_field(mib, cpu, offct);
+}
+
 static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_off)
 {
 	return snmp_fold_field(mib, offt);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 9532ee8..302e36b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1448,38 +1448,51 @@ int inet_ctl_sock_create(struct sock **sk, unsigned short family,
 }
 EXPORT_SYMBOL_GPL(inet_ctl_sock_create);

+u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
+{
+	return *(((unsigned long *)per_cpu_ptr(mib, cpu)) + offt);
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field);
+
 unsigned long snmp_fold_field(void __percpu *mib, int offt)
 {
 	unsigned long res = 0;
 	int i;

 	for_each_possible_cpu(i)
-		res += *(((unsigned long *) per_cpu_ptr(mib, i)) + offt);
+		res += snmp_get_cpu_field(mib, i, offt);
 	return res;
 }
 EXPORT_SYMBOL_GPL(snmp_fold_field);

 #if BITS_PER_LONG==32
+u64 snmp_get_cpu_field64(void __percpu *mib, int cpu, int offt,
+			 size_t syncp_offset)
+{
+	void *bhptr;
+	struct u64_stats_sync *syncp;
+	u64 v;
+	unsigned int start;
+
+	bhptr = per_cpu_ptr(mib, cpu);
+	syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
+	do {
+		start = u64_stats_fetch_begin_irq(syncp);
+		v = *(((u64 *)bhptr) + offt);
+	} while (u64_stats_fetch_retry_irq(syncp, start));
+
+	return v;
+}
+EXPORT_SYMBOL_GPL(snmp_get_cpu_field64);
+
 u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_offset)
 {
 	u64 res = 0;
 	int cpu;

 	for_each_possible_cpu(cpu) {
-		void *bhptr;
-		struct u64_stats_sync *syncp;
-		u64 v;
-		unsigned int start;
-
-		bhptr = per_cpu_ptr(mib, cpu);
-		syncp = (struct u64_stats_sync *)(bhptr + syncp_offset);
-		do {
-			start = u64_stats_fetch_begin_irq(syncp);
-			v = *(((u64 *) bhptr) + offt);
-		} while (u64_stats_fetch_retry_irq(syncp, start));
-
-		res += v;
+		res += snmp_get_cpu_field64(mib, cpu, offt, syncp_offset);
 	}
 	return res;
 }
--
1.7.11.7
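A usage sketch for the new helpers, in the cpu-major style that patch 2/2 adopts; it is illustrative only, and mib, items, syncpoff and buff stand for whatever the calling site provides.

/* Fold every item of a 64-bit mib into buff, one CPU at a time.
 * snmp_get_cpu_field64() hides the u64_stats_sync sequence handling on
 * 32-bit kernels and collapses to a plain per-cpu load on 64-bit kernels.
 * (The ipv6 caller in patch 2/2 starts at i = 1 because slot 0 carries
 * the item count.)
 */
int c, i;

for_each_possible_cpu(c)
	for (i = 0; i < items; i++)
		buff[i] += snmp_get_cpu_field64(mib, c, i, syncpoff);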