Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
Hi Gerry,

On 25.07.2014 [09:50:01 +0800], Jiang Liu wrote:
> On 2014/7/25 7:32, Nishanth Aravamudan wrote:
> > On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
> > > On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> > > > On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> > > > > On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> > > > > > It seems like the issue is the order of onlining of resources
> > > > > > on a specific x86 platform?
> > > > >
> > > > > Yes. When we online a node the BIOS hits us with some ACPI
> > > > > hotplug events:
> > > > >
> > > > > First: Here are some new cpus
> > > >
> > > > Ok, so during this period, you might get some remote allocations.
> > > > Do you know the topology of these CPUs? That is, do they belong to
> > > > a (soon-to-exist) NUMA node? Can you online that currently offline
> > > > NUMA node at this point (so that NODE_DATA() resolves, etc.)?
> > > Hi Nishanth,
> > > We have a method to get the NUMA information about the CPU, and
> > > patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
> > > CPU hot-addition" tries to solve this issue by onlining the NUMA node
> > > as early as possible. Actually, we are trying to enable memoryless
> > > nodes as you have suggested.
> >
> > Ok, it seems like you have two sets of patches then? One is to fix the
> > NUMA information timing (30/30 only). The rest of the patches are
> > general discussions about where cpu_to_mem() might be used instead of
> > cpu_to_node(). However, based upon Tejun's feedback, it seems like
> > rather than force all callers to use cpu_to_mem(), we should be looking
> > at the core VM to ensure fallback is occurring appropriately when
> > memoryless nodes are present.
> >
> > Do you have a specific situation, once you've applied 30/30, where
> > kmalloc_node() leads to an Oops?
> Hi Nishanth,
> After following the two threads related to support of memoryless
> nodes and digging into more code, I realized my first version patch set
> is overkill.
> As Tejun has pointed out, we shouldn't expose the detail of memoryless
> nodes to normal users, but there are still some special users who need
> the detail. So I have tried to summarize it as:
> 1) Arch code should online the corresponding NUMA node before onlining
>    any CPU or memory, otherwise it may cause invalid memory access when
>    accessing NODE_DATA(nid).

I think that's reasonable. A related caveat is that NUMA topology
information should be stored as early as possible in boot for *all* CPUs
[I think only cpu_to_* is used, at least for now], not just the boot
CPU, etc. This is because (at least on my examination) pre-SMP initcalls
are not prevented from using cpu_to_node(), which will falsely return 0
for all CPUs until set_cpu_numa_node() is called.

> 2) For normal memory allocations without __GFP_THISNODE set in
>    gfp_flags, we should prefer numa_node_id()/cpu_to_node() to
>    numa_mem_id()/cpu_to_mem() because the latter loses hardware
>    topology information, as pointed out by Tejun:
>        A - B - X - C - D
>    Where X is the memoryless node. numa_mem_id() on X would return
>    either B or C, right? If B or C can't satisfy the allocation, the
>    allocator would fall back to A from B and to D from C, both of
>    which aren't optimal. It should first fall back to C or B
>    respectively, which the allocator can't do anymore because the
>    information is lost when the caller side performs numa_mem_id().

Yes, this seems like a very good description of the reasoning.

> 3) For memory allocations with __GFP_THISNODE set in gfp_flags,
>    numa_node_id()/cpu_to_node() should be used if the caller only wants
>    to allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
>    should be used if the caller wants to allocate from the nearest node.
>
> 4) numa_mem_id()/cpu_to_mem() should be used if the caller wants to
>    check whether a page is allocated from the nearest node.

I'm less clear on what you mean here; I'll look at your v2 patches.
I mean, numa_node_id()/cpu_to_node() should be used to indicate
node-local preference with appropriate failure handling. But I don't
know why one would prefer numa_node_id() to numa_mem_id() in such a
path? The only time they differ is if memoryless nodes are present,
which is when your local memory allocation would ideally be for those
nodes anyway?

Thanks,
Nish
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 2014/7/25 7:32, Nishanth Aravamudan wrote:
> On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
> > On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> > > On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> > > > On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> > > > > It seems like the issue is the order of onlining of resources
> > > > > on a specific x86 platform?
> > > >
> > > > Yes. When we online a node the BIOS hits us with some ACPI
> > > > hotplug events:
> > > >
> > > > First: Here are some new cpus
> > >
> > > Ok, so during this period, you might get some remote allocations.
> > > Do you know the topology of these CPUs? That is, do they belong to
> > > a (soon-to-exist) NUMA node? Can you online that currently offline
> > > NUMA node at this point (so that NODE_DATA() resolves, etc.)?
> > Hi Nishanth,
> > We have a method to get the NUMA information about the CPU, and
> > patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
> > CPU hot-addition" tries to solve this issue by onlining the NUMA node
> > as early as possible. Actually, we are trying to enable memoryless
> > nodes as you have suggested.
>
> Ok, it seems like you have two sets of patches then? One is to fix the
> NUMA information timing (30/30 only). The rest of the patches are
> general discussions about where cpu_to_mem() might be used instead of
> cpu_to_node(). However, based upon Tejun's feedback, it seems like
> rather than force all callers to use cpu_to_mem(), we should be looking
> at the core VM to ensure fallback is occurring appropriately when
> memoryless nodes are present.
>
> Do you have a specific situation, once you've applied 30/30, where
> kmalloc_node() leads to an Oops?
Hi Nishanth,
After following the two threads related to support of memoryless
nodes and digging into more code, I realized my first version patch set
is overkill. As Tejun has pointed out, we shouldn't expose the detail of
memoryless nodes to normal users, but there are still some special users
who need the detail.
So I have tried to summarize it as:
1) Arch code should online the corresponding NUMA node before onlining
   any CPU or memory, otherwise it may cause invalid memory access when
   accessing NODE_DATA(nid).
2) For normal memory allocations without __GFP_THISNODE set in
   gfp_flags, we should prefer numa_node_id()/cpu_to_node() to
   numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
   information, as pointed out by Tejun:
       A - B - X - C - D
   Where X is the memoryless node. numa_mem_id() on X would return
   either B or C, right? If B or C can't satisfy the allocation, the
   allocator would fall back to A from B and to D from C, both of which
   aren't optimal. It should first fall back to C or B respectively,
   which the allocator can't do anymore because the information is lost
   when the caller side performs numa_mem_id().
3) For memory allocations with __GFP_THISNODE set in gfp_flags,
   numa_node_id()/cpu_to_node() should be used if the caller only wants
   to allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
   should be used if the caller wants to allocate from the nearest node.
4) numa_mem_id()/cpu_to_mem() should be used if the caller wants to
   check whether a page is allocated from the nearest node.
And my v2 patch set is based on the above rules. Any suggestions here?
Regards!
Gerry
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> > On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> > > On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> > > > It seems like the issue is the order of onlining of resources on
> > > > a specific x86 platform?
> > >
> > > Yes. When we online a node the BIOS hits us with some ACPI hotplug
> > > events:
> > >
> > > First: Here are some new cpus
> >
> > Ok, so during this period, you might get some remote allocations. Do
> > you know the topology of these CPUs? That is, do they belong to a
> > (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> > node at this point (so that NODE_DATA() resolves, etc.)?
> Hi Nishanth,
> We have a method to get the NUMA information about the CPU, and
> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
> CPU hot-addition" tries to solve this issue by onlining the NUMA node
> as early as possible. Actually, we are trying to enable memoryless
> nodes as you have suggested.

Ok, it seems like you have two sets of patches then? One is to fix the
NUMA information timing (30/30 only). The rest of the patches are
general discussions about where cpu_to_mem() might be used instead of
cpu_to_node(). However, based upon Tejun's feedback, it seems like
rather than force all callers to use cpu_to_mem(), we should be looking
at the core VM to ensure fallback is occurring appropriately when
memoryless nodes are present.

Do you have a specific situation, once you've applied 30/30, where
kmalloc_node() leads to an Oops?

Thanks,
Nish
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> > On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> > > It seems like the issue is the order of onlining of resources on a
> > > specific x86 platform?
> >
> > Yes. When we online a node the BIOS hits us with some ACPI hotplug
> > events:
> >
> > First: Here are some new cpus
>
> Ok, so during this period, you might get some remote allocations. Do
> you know the topology of these CPUs? That is, do they belong to a
> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining the NUMA node
as early as possible. Actually, we are trying to enable memoryless
nodes as you have suggested.
Regards!
Gerry

> > Next: Here is some new memory
>
> And then update the NUMA topology at this point? That is,
> set_cpu_numa_node/mem as appropriate so the underlying allocators do
> the right thing?
>
> > Last: Here are some new I/O things (PCIe root ports, PCIe devices,
> > IOAPICs, IOMMUs, ...)
> >
> > So there is a period where the node is memoryless - although that
> > will generally be resolved when the memory hot plug event arrives ...
> > that isn't guaranteed to occur (there might not be any memory on the
> > node, or what memory there is may have failed self-test and been
> > disabled).
>
> Right, but the allocator(s) generally do the right thing already in
> the face of memoryless nodes -- they fall back to the nearest node.
> That leads to poor performance, but is functional. Based upon the
> previous thread Jiang pointed to, it seems like the real issue here
> isn't that the node is memoryless, but that it's not even online yet?
> So NODE_DATA access crashes?
> Thanks,
> Nish
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Mon, Jul 21, 2014 at 10:41:59AM -0700, Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
>
> Yes. When we online a node the BIOS hits us with some ACPI hotplug
> events:
>
> First: Here are some new cpus
> Next: Here is some new memory
> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
>
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but we could 'easily' capture that in arch code and make it look
like it was done in a 'sane' order. No need to wreck the rest of the
kernel to support this particular BIOS fuckup.
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
>
> Yes. When we online a node the BIOS hits us with some ACPI hotplug
> events:
>
> First: Here are some new cpus

Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?

> Next: Here is some new memory

And then update the NUMA topology at this point? That is,
set_cpu_numa_node/mem as appropriate so the underlying allocators do the
right thing?

> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
>
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but the allocator(s) generally do the right thing already in the
face of memoryless nodes -- they fall back to the nearest node. That
leads to poor performance, but is functional. Based upon the previous
thread Jiang pointed to, it seems like the real issue here isn't that
the node is memoryless, but that it's not even online yet? So NODE_DATA
access crashes?

Thanks,
Nish
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan wrote:
> It seems like the issue is the order of onlining of resources on a
> specific x86 platform?

Yes. When we online a node the BIOS hits us with some ACPI hotplug
events:

First: Here are some new cpus
Next: Here is some new memory
Last: Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)

So there is a period where the node is memoryless - although that will
generally be resolved when the memory hot plug event arrives ... that
isn't guaranteed to occur (there might not be any memory on the node, or
what memory there is may have failed self-test and been disabled).

-Tony
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
Hi Jiang,

On 11.07.2014 [15:37:17 +0800], Jiang Liu wrote:
> Previously we have posted a patch to fix a memory crash issue caused
> by a memoryless node on x86 platforms, please refer to
> http://comments.gmane.org/gmane.linux.kernel/1687425
>
> As suggested by David Rientjes, the most suitable fix for the issue
> should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
> So this is the patchset according to David's suggestion.

Hrm, that is initially what David said, but then later on in the thread,
he specifically says he doesn't think memoryless nodes are the problem.
It seems like the issue is the order of onlining of resources on a
specific x86 platform? Memoryless nodes in and of themselves don't cause
the kernel to crash. powerpc boots with them (both previously without
CONFIG_HAVE_MEMORYLESS_NODES and now with it) and is functional,
although it does lead to some performance issues I'm hoping to resolve.
In fact, David specifically says that the kernel crash you triggered
makes sense, as cpu_to_node() points to an offline node?

In any case, a blind s/cpu_to_node/cpu_to_mem/ is not always correct.
There is a semantic difference, and in some cases the allocator already
does the right thing under the covers (falls back to the nearest node)
and in some cases it doesn't.

Thanks,
Nish
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
Hi Jiang, On 11.07.2014 [15:37:17 +0800], Jiang Liu wrote: Previously we have posted a patch fix a memory crash issue caused by memoryless node on x86 platforms, please refer to http://comments.gmane.org/gmane.linux.kernel/1687425 As suggested by David Rientjes, the most suitable fix for the issue should be to use cpu_to_mem() rather than cpu_to_node() in the caller. So this is the patchset according to David's suggestion. Hrm, that is initially what David said, but then later on in the thread, he specifically says he doesn't think memoryless nodes are the problem. It seems like the issue is the order of onlining of resources on a specifix x86 platform? memoryless nodes in and of themselves don't cause the kernel to crash. powerpc boots with them (both previously without CONFIG_HAVE_MEMORYLESS_NODES and now with it) and is functional, although it does lead to some performance issues I'm hoping to resolve. In fact, David specifically says that the kernel crash you triggered makes sense as cpu_to_node() points to an offline node? In any case, a blind s/cpu_to_node/cpu_to_mem/ is not always correct. There is a semantic difference and in some cases the allocator already do the right thing under covers (falls back to nearest node) and in some cases it doesn't. Thanks, Nish -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan n...@linux.vnet.ibm.com wrote: It seems like the issue is the order of onlining of resources on a specific x86 platform? Yes. When we online a node the BIOS hits us with some ACPI hotplug events: First: Here are some new cpus Next: Here is some new memory Last; Here are some new I/O things (PCIe root ports, PCIe devices, IOAPICs, IOMMUs, ...) So there is a period where the node is memoryless - although that will generally be resolved when the memory hot plug event arrives ... that isn't guaranteed to occur (there might not be any memory on the node, or what memory there is may have failed self-test and been disabled). -Tony -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
> <n...@linux.vnet.ibm.com> wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
>
> Yes. When we online a node the BIOS hits us with some ACPI hotplug
> events:
>
> First: Here are some new cpus

Ok, so during this period, you might get some remote allocations. Do
you know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?

> Next: Here is some new memory

And then update the NUMA topology at this point? That is, call
set_cpu_numa_node()/set_cpu_numa_mem() as appropriate so the underlying
allocators do the right thing?

> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
>
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot-plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but the allocator(s) generally do the right thing already in the
face of memoryless nodes -- they fall back to the nearest node. That
leads to poor performance, but is functional. Based upon the previous
thread Jiang pointed to, it seems like the real issue here isn't that
the node is memoryless, but that it's not even online yet, so a
NODE_DATA() access crashes?

Thanks,
Nish
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Mon, Jul 21, 2014 at 10:41:59AM -0700, Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
> <n...@linux.vnet.ibm.com> wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
>
> Yes. When we online a node the BIOS hits us with some ACPI hotplug
> events:
>
> First: Here are some new cpus
> Next:  Here is some new memory
> Last:  Here are some new I/O things (PCIe root ports, PCIe devices,
>        IOAPICs, IOMMUs, ...)
>
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot-plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but we could 'easily' capture that in arch code and make it look
like it was done in a 'sane' order. No need to wreck the rest of the
kernel to support this particular BIOS fuckup.
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Sat, 12 Jul 2014, Jiri Kosina wrote:
> I am pretty sure I've seen ppc64 machine with memoryless NUMA node.

Yes, Nishanth Aravamudan (now cc'd) has been working diligently on the
problems that have been encountered, including problems in generic
kernel code, on powerpc with memoryless nodes.
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Fri, 11 Jul 2014, Peter Zijlstra wrote:
> > There are other cases too.
>
> Are there any sane ones?

They are specifically allowed by the ACPI specification to be able to
include only cpus, I/O, networking cards, etc.
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 07/11/2014 01:20 PM, Andi Kleen wrote:
> Greg KH writes:
>
>> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>>> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
>>>> Any comments are welcomed!
>>>
>>> Why would anybody _ever_ have a memoryless node? That's ridiculous.
>>
>> I'm with Peter here, why would this be a situation that we should
>> even support? Are there machines out there shipping like this?
>
> We've always had memoryless nodes.
>
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.
>
> There are other cases too.

Yes, like a node controller-based system where the system can be
populated with either memory cards or CPU cards, for example. Now you
can have both memoryless nodes and memory-only nodes...

Memory-only nodes also happen in real life. In some cases they are made
by permanently putting low-frequency CPUs to sleep while leaving their
memory controllers active.

	-hpa
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Fri, 11 Jul 2014, Greg KH wrote:
> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcomed!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support? Are there machines out there shipping like this?

I am pretty sure I've seen a ppc64 machine with a memoryless NUMA node.

--
Jiri Kosina
SUSE Labs
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Fri, Jul 11, 2014 at 10:51:06PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> > Greg KH writes:
> > > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > > > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > > > Any comments are welcomed!
> > > >
> > > > Why would anybody _ever_ have a memoryless node? That's
> > > > ridiculous.
> > >
> > > I'm with Peter here, why would this be a situation that we should
> > > even support? Are there machines out there shipping like this?
> >
> > We've always had memoryless nodes.
> >
> > A classic case in the old days was a two socket system where someone
> > didn't populate any DIMMs on the second socket.
>
> That's an obvious "don't do that then" case. It's silly.

True. We should recommend that anyone running Linux email you for
approval of their configuration first.

> > There are other cases too.
>
> Are there any sane ones?

Yes.

-Andi
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> Greg KH writes:
>
> > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > > Any comments are welcomed!
> > >
> > > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> >
> > I'm with Peter here, why would this be a situation that we should
> > even support? Are there machines out there shipping like this?
>
> We've always had memoryless nodes.
>
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.

That's an obvious "don't do that then" case. It's silly.

> There are other cases too.

Are there any sane ones?
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
Greg KH writes:

> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcomed!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support? Are there machines out there shipping like this?

We've always had memoryless nodes.

A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.

There are other cases too.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On 07/11/2014 08:33 AM, Greg KH wrote:
> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcomed!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support? Are there machines out there shipping like this?

This is orthogonal to the problem Jiang Liu is solving, but...

The IBM guys have been hitting the CPU-less and memoryless node issues
forever, but that's mostly because their (traditional) hypervisor had
good NUMA support and ran multi-node guests. I've never seen it in
practice on x86, mostly because the hypervisors don't have good NUMA
support.

I honestly think this is something x86 is going to have to handle
eventually anyway. It's essentially a resource fragmentation problem,
and there are going to be times where a guest needs to be spun up and
the hypervisor has nodes with either no spare memory or no spare CPUs.
The hypervisor has 3 choices in this case:
1. Lie about the NUMA layout
2. Waste the resources
3. Tell the guest how it's actually arranged
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > Any comments are welcomed!
>
> Why would anybody _ever_ have a memoryless node? That's ridiculous.

I'm with Peter here, why would this be a situation that we should even
support? Are there machines out there shipping like this?

greg k-h
Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms
On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> Any comments are welcomed!

Why would anybody _ever_ have a memoryless node? That's ridiculous.
[RFC Patch V1 00/30] Enable memoryless node on x86 platforms
Previously we posted a patch to fix a memory crash issue caused by a
memoryless node on x86 platforms, please refer to
http://comments.gmane.org/gmane.linux.kernel/1687425
As suggested by David Rientjes, the most suitable fix for the issue
should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
So this is the patchset according to David's suggestion.

Patches 1-26 prepare for enabling memoryless nodes on x86 platforms by
replacing cpu_to_node()/numa_node_id() with cpu_to_mem()/numa_mem_id().
Patches 27-29 enable support of memoryless nodes on x86 platforms.
Patch 30 tunes the order in which a NUMA node is onlined when doing CPU
hot-addition.

This patchset fixes the issue mentioned by Mike Galbraith that CPUs are
associated with the wrong node after adding memory to a memoryless
node. With support of memoryless nodes enabled, the system hardware
topology is correctly reported for nodes without memory installed:

root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

With memoryless node support enabled, CPUs are correctly associated
with node 2 after memory hot-addition to node 2:

root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

The patchset is based on the latest mainstream kernel and has been
tested on a 4-socket Intel platform with CPU/memory hot-addition
capability.

Any comments are welcomed!

Jiang Liu (30):
  mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, sched: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, net: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, netfilter: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, tracing: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, thp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, xfrm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, char/mspec.c: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, IB/qib: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40e: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40evf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, ixgbe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, bnx2fc: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, bnx2i: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, fcoe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86/platform/uv: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  x86, numa: Kill useless code to improve code readability
  mm: Update _mem_id_[] for every possible CPU when memory configuration changes
  mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  x86, NUMA: Online node earlier when doing CPU hot-addition

 arch/x86/Kconfig