Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
Hello,

On Fri, Jul 10, 2015 at 09:15:47AM -0700, Nishanth Aravamudan wrote:
> On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
> > On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > > On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > > Much like on x86, now that powerpc is using
> > > > USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot
> > > > with early calls to cpu_to_node().
> > >
> > > "now that .." implies we changed something and broke this. What
> > > commit was it that changed the behaviour?
> >
> > Well, that's something I'm still trying to unearth. In the commits
> > before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
> > "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:
> >
> >     pcpu-alloc: [0] 0 1 2 3 4 5 6 7
>
> Ok, I did a bisection, and it seems like prior to commit
> 1a4d76076cda69b0abf15463a8cebc172406da25 ("percpu: implement
> asynchronous chunk population"), we emitted the above, e.g.:
>
>     pcpu-alloc: [0] 0 1 2 3 4 5 6 7
>
> And after that commit, we emitted:
>
>     pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
>
> I'm not exactly sure why that changed, but I'm still
> reading/understanding the commit. Tejun might be able to explain.
>
> Tejun, for reference, I noticed that on Power systems since the
> above-mentioned commit, pcpu-alloc does not reflect the topology of
> the system correctly -- that is, the pcpu areas are all on node 0
> unconditionally (based upon pcpu-alloc's output). Prior to that, it
> seems there was just one group, which completely ignored the NUMA
> topology. Is this just an ordering thing that changed with the
> introduction of the async code?

It's just each unit growing and the percpu allocator deciding to split
them into separate allocation units. Before, it was serving all cpus in
a single alloc unit, as they looked like they belonged to the same NUMA
node and were small enough to fit into one alloc unit. In the latter
case, the async commit added more reserve space, so the allocator
decided to split them into two alloc units while assigning them to the
same group, as the NUMA info still wasn't there.

Thanks.

--
tejun

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On Wed, 2015-07-08 at 16:16 -0700, Nishanth Aravamudan wrote:
> On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > we currently emit at boot:
> > >
> > >     [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> > >
> > > After this commit, we correctly emit:
> > >
> > >     [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
> >
> > So it looks fairly sane, and I guess it's a bug fix. But I'm a bit
> > reluctant to put it in straight away without some time in next.
>
> I'm fine with that -- it could use some more extensive testing,
> admittedly (so far I have only been able to verify that the pcpu
> areas are being correctly allocated on the right node). I still need
> to test with hotplug and things like that. Hence the RFC.
>
> > It looks like the symptom is that the per-cpu areas are all
> > allocated on node 0, is that all that goes wrong?
>
> Yes, that's the symptom. I cc'd a few folks to see if they could help
> indicate the performance implications of such a setup -- sorry, I
> should have been more explicit about that.

OK cool. I'm happy to put it in next if you send a non-RFC version.

cheers
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On 08.07.2015 [16:16:23 -0700], Nishanth Aravamudan wrote:
> On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> > On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > > Much like on x86, now that powerpc is using
> > > USE_PERCPU_NUMA_NODE_ID, we have an ordering issue during boot
> > > with early calls to cpu_to_node().
> >
> > "now that .." implies we changed something and broke this. What
> > commit was it that changed the behaviour?
>
> Well, that's something I'm still trying to unearth. In the commits
> before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
> "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:
>
>     pcpu-alloc: [0] 0 1 2 3 4 5 6 7

Ok, I did a bisection, and it seems like prior to commit
1a4d76076cda69b0abf15463a8cebc172406da25 ("percpu: implement
asynchronous chunk population"), we emitted the above, e.g.:

    pcpu-alloc: [0] 0 1 2 3 4 5 6 7

And after that commit, we emitted:

    pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7

I'm not exactly sure why that changed, but I'm still
reading/understanding the commit. Tejun might be able to explain.

Tejun, for reference, I noticed that on Power systems since the
above-mentioned commit, pcpu-alloc does not reflect the topology of the
system correctly -- that is, the pcpu areas are all on node 0
unconditionally (based upon pcpu-alloc's output). Prior to that, it
seems there was just one group, which completely ignored the NUMA
topology. Is this just an ordering thing that changed with the
introduction of the async code?

Thanks,
Nish
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On 08.07.2015 [14:00:56 +1000], Michael Ellerman wrote:
> On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> > Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID,
> > we have an ordering issue during boot with early calls to
> > cpu_to_node().
>
> "now that .." implies we changed something and broke this. What commit
> was it that changed the behaviour?

Well, that's something I'm still trying to unearth. In the commits
before and after adding USE_PERCPU_NUMA_NODE_ID (8c272261194d
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"), the dmesg reports:

    pcpu-alloc: [0] 0 1 2 3 4 5 6 7

At least prior to 8c272261194d, this might have been due to the old
powerpc-specific cpu_to_node():

    static inline int cpu_to_node(int cpu)
    {
        int nid;

        nid = numa_cpu_lookup_table[cpu];

        /*
         * During early boot, the numa-cpu lookup table might not have
         * been setup for all CPUs yet. In such cases, default to node 0.
         */
        return (nid < 0) ? 0 : nid;
    }

which might imply that no one cares, or simply that no one noticed.

> > The value returned by those calls now depends on the per-cpu area
> > being setup, but that is not guaranteed to be the case during boot.
> > Instead, we need to add an early_cpu_to_node() which doesn't use the
> > per-CPU area, and call that from certain spots that are known to
> > invoke cpu_to_node() before the per-CPU areas are configured.
> >
> > On an example 2-node NUMA system with the following topology:
> >
> >     available: 2 nodes (0-1)
> >     node 0 cpus: 0 1 2 3
> >     node 0 size: 2029 MB
> >     node 0 free: 1753 MB
> >     node 1 cpus: 4 5 6 7
> >     node 1 size: 2045 MB
> >     node 1 free: 1945 MB
> >     node distances:
> >     node   0   1
> >       0:  10  40
> >       1:  40  10
> >
> > we currently emit at boot:
> >
> >     [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
> >
> > After this commit, we correctly emit:
> >
> >     [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7
>
> So it looks fairly sane, and I guess it's a bug fix. But I'm a bit
> reluctant to put it in straight away without some time in next.

I'm fine with that -- it could use some more extensive testing,
admittedly (so far I have only been able to verify that the pcpu areas
are being correctly allocated on the right node). I still need to test
with hotplug and things like that. Hence the RFC.

> It looks like the symptom is that the per-cpu areas are all allocated
> on node 0, is that all that goes wrong?

Yes, that's the symptom. I cc'd a few folks to see if they could help
indicate the performance implications of such a setup -- sorry, I
should have been more explicit about that.

Thanks,
Nish
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On Wed, 8 Jul 2015, Nishanth Aravamudan wrote:
> > It looks like the symptom is that the per-cpu areas are all
> > allocated on node 0, is that all that goes wrong?
>
> Yes, that's the symptom. I cc'd a few folks to see if they could help
> indicate the performance implications of such a setup -- sorry, I
> should have been more explicit about that.

Yeah, I'm not sure it's really a bugfix, but rather a performance
optimization, since cpu_to_node() with CONFIG_USE_PERCPU_NUMA_NODE_ID
is only about performance.
Re: [RFC,1/2] powerpc/numa: fix cpu_to_node() usage during boot
On Thu, 2015-02-07 at 23:02:02 UTC, Nishanth Aravamudan wrote:
> Much like on x86, now that powerpc is using USE_PERCPU_NUMA_NODE_ID,
> we have an ordering issue during boot with early calls to
> cpu_to_node().

"now that .." implies we changed something and broke this. What commit
was it that changed the behaviour?

> The value returned by those calls now depends on the per-cpu area
> being setup, but that is not guaranteed to be the case during boot.
> Instead, we need to add an early_cpu_to_node() which doesn't use the
> per-CPU area, and call that from certain spots that are known to
> invoke cpu_to_node() before the per-CPU areas are configured.
>
> On an example 2-node NUMA system with the following topology:
>
>     available: 2 nodes (0-1)
>     node 0 cpus: 0 1 2 3
>     node 0 size: 2029 MB
>     node 0 free: 1753 MB
>     node 1 cpus: 4 5 6 7
>     node 1 size: 2045 MB
>     node 1 free: 1945 MB
>     node distances:
>     node   0   1
>       0:  10  40
>       1:  40  10
>
> we currently emit at boot:
>
>     [0.00] pcpu-alloc: [0] 0 1 2 3 [0] 4 5 6 7
>
> After this commit, we correctly emit:
>
>     [0.00] pcpu-alloc: [0] 0 1 2 3 [1] 4 5 6 7

So it looks fairly sane, and I guess it's a bug fix. But I'm a bit
reluctant to put it in straight away without some time in next.

It looks like the symptom is that the per-cpu areas are all allocated
on node 0, is that all that goes wrong?

cheers