Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-08-18 Thread Nishanth Aravamudan
Hi Gerry,

On 25.07.2014 [09:50:01 +0800], Jiang Liu wrote:
> 
> 
> On 2014/7/25 7:32, Nishanth Aravamudan wrote:
> > On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
> >>
> >>
> >> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> >>> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
>  On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>   wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
> 
>  Yes. When we online a node the BIOS hits us with some ACPI hotplug 
>  events:
> 
>  First: Here are some new cpus
> >>>
> >>> Ok, so during this period, you might get some remote allocations. Do you
> >>> know the topology of these CPUs? That is, do they belong to a
> >>> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> >>> node at this point (so that NODE_DATA() resolves, etc.)?
> >> Hi Nishanth,
> >>We have a method to get the NUMA information about the CPU, and
> >> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
> >> CPU hot-addition" tries to solve this issue by onlining NUMA node
> >> as early as possible. Actually we are trying to enable memoryless node
> >> as you have suggested.
> > 
> > Ok, it seems like you have two sets of patches then? One is to fix the
> > NUMA information timing (30/30 only). The rest of the patches are
> > general discussions about where cpu_to_mem() might be used instead of
> > cpu_to_node(). However, based upon Tejun's feedback, it seems like
> > rather than force all callers to use cpu_to_mem(), we should be looking
> > at the core VM to ensure fallback is occurring appropriately when
> > memoryless nodes are present. 
> > 
> > Do you have a specific situation, once you've applied 30/30, where
> > kmalloc_node() leads to an Oops?
> Hi Nishanth,
>   After following the two threads related to support of memoryless
> node and digging into more code, I realized my first version patch set
> is overkill. As Tejun has pointed out, we shouldn't expose the details
> of memoryless nodes to normal users, but there are still some special
> users who need them. So I have tried to summarize it as:
> 1) Arch code should online corresponding NUMA node before onlining any
>CPU or memory, otherwise it may cause invalid memory access when
>accessing NODE_DATA(nid).

I think that's reasonable.

A related caveat is that NUMA topology information should be stored as
early as possible in boot for *all* CPUs [I think only cpu_to_* is used,
at least for now], not just the boot CPU, etc. This is because (at least
on my examination) pre-SMP initcalls are not prevented from using
cpu_to_node, which will falsely return 0 for all CPUs until
set_cpu_numa_node() is called.
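
Concretely, something along these lines could run early in arch setup (a
hypothetical sketch, not code from this patch set; it assumes an arch
helper in the spirit of x86's early_cpu_to_node()):

#include <linux/cpumask.h>
#include <linux/topology.h>

static void __init store_numa_topology_early(void)
{
	int cpu;

	/*
	 * Record the node of every possible CPU before pre-SMP
	 * initcalls run, so cpu_to_node() returns the real node
	 * rather than 0 for CPUs that have not been onlined yet.
	 */
	for_each_possible_cpu(cpu)
		set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
}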

> 2) For normal memory allocations without __GFP_THISNODE setting in the
>gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
>numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
>information as pointed out by Tejun:
>A - B - X - C - D
> Where X is the memoryless node.  numa_mem_id() on X would return
> either B or C, right?  If B or C can't satisfy the allocation,
> the allocator would fall back to A from B and D for C, both of
> which aren't optimal. It should first fall back to C or B
> respectively, which the allocator can't do anymore because the
> information is lost when the caller side performs numa_mem_id().

Yes, this seems like a very good description of the reasoning.

> 3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
>numa_node_id()/cpu_to_node() should be used if caller only wants to
>allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
>should be used if caller wants to allocate from the nearest node.
>
> 4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
>whether a page is allocated from the nearest node.

I'm less clear on what you mean here; I'll look at your v2 patches. I
mean, numa_node_id()/cpu_to_node() should be used to indicate node-local
preference with appropriate failure handling. But I don't know why one
would prefer numa_node_id() over numa_mem_id() in such a path? The only
time they differ is when memoryless nodes are present, which is exactly
when your 'local' allocation would ideally come from the nearest node
anyway?
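
For reference, the two idioms side by side (a hypothetical sketch, not
taken from the patch set):

#include <linux/slab.h>
#include <linux/topology.h>

static void *alloc_example(size_t size)
{
	/*
	 * Without __GFP_THISNODE: pass the CPU's own node and let
	 * the zonelist fallback preserve the hardware topology,
	 * even if that node happens to be memoryless.
	 */
	void *a = kmalloc_node(size, GFP_KERNEL, numa_node_id());

	/*
	 * With __GFP_THISNODE the allocation is pinned to a single
	 * node, so on a memoryless node it can only succeed when
	 * given the nearest node that actually has memory.
	 */
	void *b = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
			       numa_mem_id());

	kfree(a);		/* kfree(NULL) is a no-op */
	return b;
}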

Thanks,
Nish



Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-24 Thread Jiang Liu


On 2014/7/25 7:32, Nishanth Aravamudan wrote:
> On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
>>
>>
>> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
>>> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
 On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
  wrote:
> It seems like the issue is the order of onlining of resources on a
> specific x86 platform?

 Yes. When we online a node the BIOS hits us with some ACPI hotplug events:

 First: Here are some new cpus
>>>
>>> Ok, so during this period, you might get some remote allocations. Do you
>>> know the topology of these CPUs? That is, do they belong to a
>>> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
>>> node at this point (so that NODE_DATA() resolves, etc.)?
>> Hi Nishanth,
>>	We have a method to get the NUMA information about the CPU, and
>> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
>> CPU hot-addition" tries to solve this issue by onlining NUMA node
>> as early as possible. Actually we are trying to enable memoryless node
>> as you have suggested.
> 
> Ok, it seems like you have two sets of patches then? One is to fix the
> NUMA information timing (30/30 only). The rest of the patches are
> general discussions about where cpu_to_mem() might be used instead of
> cpu_to_node(). However, based upon Tejun's feedback, it seems like
> rather than force all callers to use cpu_to_mem(), we should be looking
> at the core VM to ensure fallback is occurring appropriately when
> memoryless nodes are present. 
> 
> Do you have a specific situation, once you've applied 30/30, where
> kmalloc_node() leads to an Oops?
Hi Nishanth,
	After following the two threads related to support of memoryless
nodes and digging into more code, I realized my first version patch set
is overkill. As Tejun has pointed out, we shouldn't expose the details
of memoryless nodes to normal users, but there are still some special
users who need them. So I have tried to summarize it as:
1) Arch code should online corresponding NUMA node before onlining any
   CPU or memory, otherwise it may cause invalid memory access when
   accessing NODE_DATA(nid).
2) For normal memory allocations without __GFP_THISNODE setting in the
   gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
   numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
   information as pointed out by Tejun:
   A - B - X - C - D
Where X is the memoryless node.  numa_mem_id() on X would return
either B or C, right?  If B or C can't satisfy the allocation,
the allocator would fall back to A from B and D for C, both of
which aren't optimal. It should first fall back to C or B
respectively, which the allocator can't do anymore because the
information is lost when the caller side performs numa_mem_id().
3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
   numa_node_id()/cpu_to_node() should be used if caller only wants to
   allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
   should be used if caller wants to allocate from the nearest node.
4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
   whether a page is allocated from the nearest node.
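
Rule 4 as a sketch (a hypothetical helper for illustration, not code
from the patch set):

#include <linux/mm.h>
#include <linux/topology.h>

/*
 * On a memoryless node a page can never come from numa_node_id()
 * itself, so "is this page as local as it can get?" must compare
 * against numa_mem_id() instead.
 */
static bool page_is_nearest(const struct page *page)
{
	return page_to_nid(page) == numa_mem_id();
}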

And my v2 patch set is based on the above rules.
Any suggestions here?
Regards!
Gerry

> 
> Thanks,
> Nish
> 


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-24 Thread Nishanth Aravamudan
On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
> 
> 
> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> > On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> >> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
> >>  wrote:
> >>> It seems like the issue is the order of onlining of resources on a
> >>> specific x86 platform?
> >>
> >> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
> >>
> >> First: Here are some new cpus
> > 
> > Ok, so during this period, you might get some remote allocations. Do you
> > know the topology of these CPUs? That is, do they belong to a
> > (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> > node at this point (so that NODE_DATA() resolves, etc.)?
> Hi Nishanth,
>   We have a method to get the NUMA information about the CPU, and
> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
> CPU hot-addition" tries to solve this issue by onlining NUMA node
> as early as possible. Actually we are trying to enable memoryless node
> as you have suggested.

Ok, it seems like you have two sets of patches then? One is to fix the
NUMA information timing (30/30 only). The rest of the patches are
general discussions about where cpu_to_mem() might be used instead of
cpu_to_node(). However, based upon Tejun's feedback, it seems like
rather than force all callers to use cpu_to_mem(), we should be looking
at the core VM to ensure fallback is occurring appropriately when
memoryless nodes are present. 

Do you have a specific situation, once you've applied 30/30, where
kmalloc_node() leads to an Oops?

Thanks,
Nish



Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-23 Thread Jiang Liu


On 2014/7/22 1:57, Nishanth Aravamudan wrote:
> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
>> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>>  wrote:
>>> It seems like the issue is the order of onlining of resources on a
>>> specific x86 platform?
>>
>> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
>>
>> First: Here are some new cpus
> 
> Ok, so during this period, you might get some remote allocations. Do you
> know the topology of these CPUs? That is, do they belong to a
> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
> node at this point (so that NODE_DATA() resolves, etc.)?
Hi Nishanth,
	We have a method to get the NUMA information about the CPU, and
patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
CPU hot-addition" tries to solve this issue by onlining NUMA node
as early as possible. Actually we are trying to enable memoryless node
as you have suggested.

Regards!
Gerry

> 
>> Next: Here is some new memory
> 
> And then update the NUMA topology at this point? That is,
> set_cpu_numa_node/mem as appropriate so the underlying allocators do the
> right thing?
> 
>> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
>> IOAPICs, IOMMUs, ...)
>>
>> So there is a period where the node is memoryless - although that will
>> generally be resolved when the memory hot plug event arrives ... that
>> isn't guaranteed to occur (there might not be any memory on the node,
>> or what memory there is may have failed self-test and been disabled).
> 
> Right, but the allocator generally does the right thing already in
> the face of memoryless nodes -- it falls back to the nearest node. That
> leads to poor performance, but is functional. Based upon the previous
> thread Jiang pointed to, it seems like the real issue here isn't that
> the node is memoryless, but that it's not even online yet? So NODE_DATA
> access crashes?
> 
> Thanks,
> Nish
> 


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-21 Thread Peter Zijlstra
On Mon, Jul 21, 2014 at 10:41:59AM -0700, Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>  wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
> 
> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
> 
> First: Here are some new cpus
> Next: Here is some new memory
> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
> 
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but we could 'easily' capture that in arch code and make it look
like it was done in a 'sane' order. No need to wreck the rest of the
kernel to support this particular BIOS fuckup.


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-21 Thread Nishanth Aravamudan
On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>  wrote:
> > It seems like the issue is the order of onlining of resources on a
> > specific x86 platform?
> 
> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
> 
> First: Here are some new cpus

Ok, so during this period, you might get some remote allocations. Do you
know the topology of these CPUs? That is, do they belong to a
(soon-to-exist) NUMA node? Can you online that currently offline NUMA
node at this point (so that NODE_DATA() resolves, etc.)?

> Next: Here is some new memory

And then update the NUMA topology at this point? That is,
set_cpu_numa_node/mem as appropriate so the underlying allocators do the
right thing?

> Last: Here are some new I/O things (PCIe root ports, PCIe devices,
> IOAPICs, IOMMUs, ...)
> 
> So there is a period where the node is memoryless - although that will
> generally be resolved when the memory hot plug event arrives ... that
> isn't guaranteed to occur (there might not be any memory on the node,
> or what memory there is may have failed self-test and been disabled).

Right, but the allocator generally does the right thing already in
the face of memoryless nodes -- it falls back to the nearest node. That
leads to poor performance, but is functional. Based upon the previous
thread Jiang pointed to, it seems like the real issue here isn't that
the node is memoryless, but that it's not even online yet? So NODE_DATA
access crashes?
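
To make that failure mode concrete (a hypothetical sketch, not code from
the patch set):

#include <linux/mmzone.h>
#include <linux/nodemask.h>

static unsigned long present_pages_on(int nid)
{
	/*
	 * For a node that has never been onlined, NODE_DATA(nid)
	 * is still NULL, so dereferencing it faults.  A memoryless
	 * but *online* node has a valid, if empty, pg_data_t.
	 */
	if (!node_online(nid))
		return 0;
	return NODE_DATA(nid)->node_present_pages;
}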

Thanks,
Nish



Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-21 Thread Tony Luck
On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
 wrote:
> It seems like the issue is the order of onlining of resources on a
> specific x86 platform?

Yes. When we online a node the BIOS hits us with some ACPI hotplug events:

First: Here are some new cpus
Next: Here is some new memory
Last: Here are some new I/O things (PCIe root ports, PCIe devices,
IOAPICs, IOMMUs, ...)

So there is a period where the node is memoryless - although that will generally
be resolved when the memory hot plug event arrives ... that isn't guaranteed to
occur (there might not be any memory on the node, or what memory there is
may have failed self-test and been disabled).
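
The ordering fix discussed in this thread (patch 30/30) amounts to
roughly the following (a hypothetical sketch, not the actual patch;
try_online_node() allocates the pg_data_t for a node that has no
memory yet):

#include <linux/cpu.h>
#include <linux/memory_hotplug.h>
#include <linux/topology.h>

static int hot_add_cpu_on_node(unsigned int cpu, int nid)
{
	/* Bring the node online first so NODE_DATA(nid) is valid. */
	int ret = try_online_node(nid);

	if (ret)
		return ret;
	set_cpu_numa_node(cpu, nid);
	return cpu_up(cpu);
}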

-Tony


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-21 Thread Nishanth Aravamudan
Hi Jiang,

On 11.07.2014 [15:37:17 +0800], Jiang Liu wrote:
> Previously we have posted a patch to fix a memory crash issue caused by
> memoryless node on x86 platforms, please refer to
> http://comments.gmane.org/gmane.linux.kernel/1687425
> 
> As suggested by David Rientjes, the most suitable fix for the issue
> should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
> So this is the patchset according to David's suggestion.

Hrm, that is initially what David said, but then later on in the thread,
he specifically says he doesn't think memoryless nodes are the problem.
It seems like the issue is the order of onlining of resources on a
specific x86 platform?

Memoryless nodes in and of themselves don't cause the kernel to crash.
powerpc boots with them (both previously without
CONFIG_HAVE_MEMORYLESS_NODES and now with it) and is functional,
although it does lead to some performance issues I'm hoping to resolve.
In fact, David specifically says that the kernel crash you triggered
makes sense as cpu_to_node() points to an offline node?

In any case, a blind s/cpu_to_node/cpu_to_mem/ is not always correct.
There is a semantic difference, and in some cases the allocator already
does the right thing under the covers (falls back to the nearest node)
and in some cases it doesn't.

Thanks,
Nish



Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-14 Thread David Rientjes
On Sat, 12 Jul 2014, Jiri Kosina wrote:

> I am pretty sure I've seen ppc64 machine with memoryless NUMA node.
> 

Yes, Nishanth Aravamudan (now cc'd) has been working diligently on the 
problems that have been encountered, including problems in generic kernel 
code, on powerpc with memoryless nodes.


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-14 Thread David Rientjes
On Fri, 11 Jul 2014, Peter Zijlstra wrote:

> > There are other cases too.
> 
> Are there any sane ones?
> 

They are specifically allowed by the ACPI specification to be able to 
include only cpus, I/O, networking cards, etc.


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread H. Peter Anvin
On 07/11/2014 01:20 PM, Andi Kleen wrote:
> Greg KH  writes:
> 
>> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>>> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
 Any comments are welcomed!
>>>
>>> Why would anybody _ever_ have a memoryless node? That's ridiculous.
>>
>> I'm with Peter here, why would this be a situation that we should even
>> support?  Are there machines out there shipping like this?
> 
> We've always had memoryless nodes.
> 
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.
> 
> There are other cases too.
> 

Yes, like a node controller-based system where the system can be
populated with either memory cards or CPU cards, for example.  Now you
can have both memoryless nodes and memory-only nodes...

Memory-only nodes also happen in real life.  In some cases they are
created by permanently putting low-frequency CPUs to sleep for the sake
of their memory controllers.

-hpa




Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Jiri Kosina
On Fri, 11 Jul 2014, Greg KH wrote:

> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcomed!
> > 
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> 
> I'm with Peter here, why would this be a situation that we should even
> support?  Are there machines out there shipping like this?

I am pretty sure I've seen ppc64 machine with memoryless NUMA node.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Andi Kleen
On Fri, Jul 11, 2014 at 10:51:06PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> > Greg KH  writes:
> > 
> > > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > >> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > >> > Any comments are welcomed!
> > >> 
> > >> Why would anybody _ever_ have a memoryless node? That's ridiculous.
> > >
> > > I'm with Peter here, why would this be a situation that we should even
> > > support?  Are there machines out there shipping like this?
> > 
> > We've always had memoryless nodes.
> > 
> > A classic case in the old days was a two socket system where someone
> > didn't populate any DIMMs on the second socket.
> 
> That's an obvious "don't do that then" case. It's silly.

True. We should recommend that anyone running Linux email you for
approval of their configuration first.


> > There are other cases too.
> 
> > Are there any sane ones?

Yes.

-Andi


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Peter Zijlstra
On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> Greg KH  writes:
> 
> > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> >> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> >> > Any comments are welcomed!
> >> 
> >> Why would anybody _ever_ have a memoryless node? That's ridiculous.
> >
> > I'm with Peter here, why would this be a situation that we should even
> > support?  Are there machines out there shipping like this?
> 
> We've always had memoryless nodes.
> 
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.

That's an obvious "don't do that then" case. It's silly.

> There are other cases too.

Are there any sane ones?


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Andi Kleen
Greg KH  writes:

> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
>> > Any comments are welcomed!
>> 
>> Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support?  Are there machines out there shipping like this?

We've always had memoryless nodes.

A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.

There are other cases too.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Dave Hansen
On 07/11/2014 08:33 AM, Greg KH wrote:
> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
>> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
>>> > > Any comments are welcomed!
>> > 
>> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> I'm with Peter here, why would this be a situation that we should even
> support?  Are there machines out there shipping like this?

This is orthogonal to the problem Jiang Liu is solving, but...

The IBM guys have been hitting the CPU-less and memoryless node issues
forever, but that's mostly because their (traditional) hypervisor had
good NUMA support and ran multi-node guests.

I've never seen it in practice on x86 mostly because the hypervisors
don't have good NUMA support. I honestly think this is something x86 is
going to have to handle eventually anyway.  It's essentially a resource
fragmentation problem, and there are going to be times where a guest
needs to be spun up and hypervisor has nodes with either no spare memory
or no spare CPUs.

The hypervisor has 3 choices in this case:
1. Lie about the NUMA layout
2. Waste the resources
3. Tell the guest how it's actually arranged




Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Greg KH
On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > Any comments are welcomed!
> 
> Why would anybody _ever_ have a memoryless node? That's ridiculous.

I'm with Peter here, why would this be a situation that we should even
support?  Are there machines out there shipping like this?

greg k-h


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Peter Zijlstra
On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> Any comments are welcomed!

Why would anybody _ever_ have a memoryless node? That's ridiculous.


[RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Jiang Liu
Previously we have posted a patch to fix a memory crash issue caused by
memoryless nodes on x86 platforms; please refer to
http://comments.gmane.org/gmane.linux.kernel/1687425

As suggested by David Rientjes, the most suitable fix for the issue
should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
So this is the patchset according to David's suggestion.

Patches 1-26 prepare for enabling memoryless nodes on x86 platforms by
replacing cpu_to_node()/numa_node_id() with cpu_to_mem()/numa_mem_id().
Patches 27-29 enable support of memoryless nodes on x86 platforms.
Patch 30 tunes the order in which NUMA nodes are onlined during CPU
hot-addition.

This patchset fixes the issue mentioned by Mike Galbraith that CPUs
are associated with the wrong node after adding memory to a memoryless
node.

With support for memoryless nodes enabled, the kernel correctly reports
the system hardware topology for nodes without memory installed.
root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

With memoryless node support enabled, CPUs are correctly associated with node 2
after memory hot-addition to node 2.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

The patchset is based on the latest mainstream kernel and has been
tested on a 4-socket Intel platform with CPU/memory hot-addition
capability.

Any comments are welcomed!

Jiang Liu (30):
  mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, sched: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, net: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, netfilter: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
  mm, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, tracing: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
  mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, thp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, xfrm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, char/mspec.c: Use cpu_to_mem()/numa_mem_id() to support
memoryless node
  mm, IB/qib: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40e: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40evf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, ixgbe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support
memoryless node
  mm, bnx2fc: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, bnx2i: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, fcoe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
  mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86/platform/uv: Use cpu_to_mem()/numa_mem_id() to support
memoryless node
  mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
  mm, x86, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless
node
  x86, numa: Kill useless code to improve code readability
  mm: Update _mem_id_[] for every possible CPU when memory
configuration changes
  mm, x86: Enable memoryless node support to better support CPU/memory
hotplug
  x86, NUMA: Online node earlier when doing CPU hot-addition

 arch/x86/Kconfig   

[RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Jiang Liu
Previously we have posted a patch fix a memory crash issue caused by
memoryless node on x86 platforms, please refer to
http://comments.gmane.org/gmane.linux.kernel/1687425

As suggested by David Rientjes, the most suitable fix for the issue
should be to use cpu_to_mem() rather than cpu_to_node() in the caller.
So this is the patchset according to David's suggestion.

Patch 1-26 prepare for enabling memoryless node on x86 platforms by
replacing cpu_to_node()/numa_node_id() with cpu_to_mem()/numa_mem_id().
Patch 27-29 enable support of memoryless node on x86 platforms.
Patch 30 tunes order to online NUMA node when doing CPU hot-addition.

This patchset fixes the issue mentioned by Mike Galbraith that CPUs
are associated with wrong node after adding memory to a memoryless
node.

With support of memoryless node enabled, it will correctly report system
hardware topology for nodes without memory installed.
root@bkd01sdp:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 
70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 15129 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 
82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15627 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 
97 98 99 100 101 102 103 104
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 
110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

With memoryless node enabled, CPUs are correctly associated with node 2
after memory hot-addition to node 2.
root@bkd01sdp:/sys/devices/system/node/node2# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 60 61 62 63 64 65 66 67 68 69 
70 71 72 73 74
node 0 size: 15725 MB
node 0 free: 14872 MB
node 1 cpus: 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 75 76 77 78 79 80 81 
82 83 84 85 86 87 88 89
node 1 size: 15862 MB
node 1 free: 15641 MB
node 2 cpus: 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 90 91 92 93 94 95 96 
97 98 99 100 101 102 103 104
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus: 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 105 106 107 108 109 
110 111 112 113 114 115 116 117 118 119
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

The patchset is based on the latest mainstream kernel and has been
tested on a 4-socket Intel platform with CPU/memory hot-addition
capability.

Any comments are welcomed!

Jiang Liu (30):
  mm, kernel: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, sched: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, net: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, netfilter: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, tracing: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, thp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, memcg: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, xfrm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, char/mspec.c: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, IB/qib: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40e: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, i40evf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, igb: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, ixgbe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, intel_powerclamp: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, bnx2fc: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, bnx2i: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, fcoe: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, irqchip: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, of: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86/platform/uv: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86, kvm: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  mm, x86, perf: Use cpu_to_mem()/numa_mem_id() to support memoryless node
  x86, numa: Kill useless code to improve code readability
  mm: Update _mem_id_[] for every possible CPU when memory configuration changes
  mm, x86: Enable memoryless node support to better support CPU/memory hotplug
  x86, NUMA: Online node earlier when doing CPU hot-addition
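
The cpu_to_mem() conversions above all follow the same basic pattern; a
minimal before/after sketch is shown below. The allocation site is
hypothetical and not taken from any of the patches; kzalloc_node(),
cpu_to_node(), and cpu_to_mem() are the real kernel APIs involved.

#include <linux/slab.h>
#include <linux/topology.h>

/* Hypothetical per-CPU buffer setup, not taken from the series. */
static void *setup_percpu_buf(int cpu, size_t size)
{
        /*
         * Before: __GFP_THISNODE pinned the request to cpu's home
         * node, which cannot succeed when that node has no memory:
         *
         *      kzalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
         *                   cpu_to_node(cpu));
         *
         * After: cpu_to_mem() names the nearest node that actually has
         * memory, so the node-bound request can still be satisfied.
         */
        return kzalloc_node(size, GFP_KERNEL | __GFP_THISNODE,
                            cpu_to_mem(cpu));
}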

 arch/x86/Kconfig   

Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Peter Zijlstra
On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> Any comments are welcome!

Why would anybody _ever_ have a memoryless node? That's ridiculous.


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Greg KH
On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > Any comments are welcome!
>
> Why would anybody _ever_ have a memoryless node? That's ridiculous.

I'm with Peter here, why would this be a situation that we should even
support?  Are there machines out there shipping like this?

greg k-h


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Dave Hansen
On 07/11/2014 08:33 AM, Greg KH wrote:
> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcome!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> I'm with Peter here, why would this be a situation that we should even
> support?  Are there machines out there shipping like this?

This is orthogonal to the problem Jiang Liu is solving, but...

The IBM guys have been hitting the CPU-less and memoryless node issues
forever, but that's mostly because their (traditional) hypervisor had
good NUMA support and ran multi-node guests.

I've never seen it in practice on x86, mostly because the hypervisors
don't have good NUMA support. I honestly think this is something x86 is
going to have to handle eventually anyway.  It's essentially a resource
fragmentation problem, and there are going to be times when a guest
needs to be spun up and the hypervisor has nodes with either no spare
memory or no spare CPUs.

The hypervisor has 3 choices in this case:
1. Lie about the NUMA layout
2. Waste the resources
3. Tell the guest how it's actually arranged




Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Andi Kleen
Greg KH <gre...@linuxfoundation.org> writes:

> On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcome!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support?  Are there machines out there shipping like this?

We've always had memoryless nodes.

A classic case in the old days was a two socket system where someone
didn't populate any DIMMs on the second socket.

There are other cases too.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Peter Zijlstra
On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> Greg KH <gre...@linuxfoundation.org> writes:
>
> > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > > Any comments are welcome!
> > >
> > > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> >
> > I'm with Peter here, why would this be a situation that we should even
> > support?  Are there machines out there shipping like this?
>
> We've always had memoryless nodes.
>
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.

That's an obvious "don't do that then" case. It's silly.

> There are other cases too.

Are there any sane ones?


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Andi Kleen
On Fri, Jul 11, 2014 at 10:51:06PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 01:20:51PM -0700, Andi Kleen wrote:
> > Greg KH <gre...@linuxfoundation.org> writes:
> >
> > > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > > > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > > > Any comments are welcome!
> > > >
> > > > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> > >
> > > I'm with Peter here, why would this be a situation that we should even
> > > support?  Are there machines out there shipping like this?
> >
> > We've always had memoryless nodes.
> >
> > A classic case in the old days was a two socket system where someone
> > didn't populate any DIMMs on the second socket.
>
> That's an obvious "don't do that then" case. It's silly.

True. We should recommend that anyone running Linux email you
for approval of their configuration first.

> > There are other cases too.
>
> Are there any sane ones?

Yes.

-Andi


Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread Jiri Kosina
On Fri, 11 Jul 2014, Greg KH wrote:

> > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > Any comments are welcome!
> >
> > Why would anybody _ever_ have a memoryless node? That's ridiculous.
>
> I'm with Peter here, why would this be a situation that we should even
> support?  Are there machines out there shipping like this?

I am pretty sure I've seen a ppc64 machine with a memoryless NUMA node.

-- 
Jiri Kosina
SUSE Labs



Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

2014-07-11 Thread H. Peter Anvin
On 07/11/2014 01:20 PM, Andi Kleen wrote:
> Greg KH <gre...@linuxfoundation.org> writes:
>
> > On Fri, Jul 11, 2014 at 10:29:56AM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 03:37:17PM +0800, Jiang Liu wrote:
> > > > Any comments are welcome!
> > >
> > > Why would anybody _ever_ have a memoryless node? That's ridiculous.
> >
> > I'm with Peter here, why would this be a situation that we should even
> > support?  Are there machines out there shipping like this?
>
> We've always had memoryless nodes.
>
> A classic case in the old days was a two socket system where someone
> didn't populate any DIMMs on the second socket.
>
> There are other cases too.

Yes, like a node controller-based system where the system can be
populated with either memory cards or CPU cards, for example.  Now you
can have both memoryless nodes and memory-only nodes...

Memory-only nodes also happen in real life.  In some cases they are
created by permanently putting low-frequency CPUs to sleep so that only
their memory controllers remain in use.

-hpa

