Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-26 Thread Punit Agrawal
Jonathan Cameron  writes:

[...]

>
> I'll test it when back in the office, but I had a similar issue with
> memory only nodes when I moved the SRAT listing for cpus from the
> 4th node to the 3rd node to fake some memory I could hot unplug.
> This gave a memory only node for the last node on the system.
>
> When I instead moved cpus from the 3rd node to the 4th (so the node
> with only memory was now in the middle), everything worked.
>
> Was odd, and I'd been meaning to chase it down but hadn't gotten to it
> yet.  If I get time I'll put together some test firmwares and see if there
> are any other nasty corner cases we aren't handling.

If you get a chance, it'd be really helpful to test reversing the
ordering of entries in the SRAT and booting with a restricted
NR_CPUS.

This issue was found through code inspection.

Please make sure to use the updated patch from Lorenzo for your
tests[0].

[0] https://marc.info/?l=linux-acpi&m=152998665713983&w=2

>
> Jonathan
>



Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-22 Thread Jonathan Cameron
On Fri, 22 Jun 2018 11:24:38 +0100
Punit Agrawal  wrote:

> Michal Hocko  writes:
> 
> > On Fri 22-06-18 16:58:05, Hanjun Guo wrote:  
> >> On 2018/6/20 19:51, Punit Agrawal wrote:  
> >> > Xie XiuQi  writes:
> >> >   
> >> >> Hi Lorenzo, Punit,
> >> >>
> >> >>
> >> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:  
> >> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:  
> >>  Michal Hocko  writes:
> >>   
> >> > On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> >> > [...]  
> >> >> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> >> as a temporary fix (it'll also be easier to backport) while we work on
> >> >> fixing these other issues and enabling memoryless nodes.
> >> >
> >> > Well, x86 already does that but copying this antipattern is not really
> >> > nice. So it is good as a quick fix but it would be definitely much
> >> > better to have a robust fix. Who knows how many other places might hit
> >> > this. You certainly do not want to add a hack like this all over...
> >> 
> >>  Completely agree! I was only suggesting it as a temporary measure,
> >>  especially as it looked like a proper fix might be invasive.
> >> 
> >>  Another fix might be to change the node specific allocation to node
> >>  agnostic allocations. It isn't clear why the allocation is being
> >>  requested from a specific node. I think Lorenzo suggested this in one of
> >>  the threads.
> >> >>>
> >> >>> I think that code was just copypasted but it is better to fix the
> >> >>> underlying issue.
> >> >>>  
> >>  I've started putting together a set fixing the issues identified in this
> >>  thread. It should give a better idea on the best course of action.
> >> >>>
> >> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
> >> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >> >>> every SRAT GICC entry, feel free to pick it up if it works.
> >> >>>
> >> >>> Yes, we can take the original patch just because it is safer for an -rc
> >> >>> cycle even though, if the patch below would do, delaying the fix for a
> >> >>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
> >> >>> not a disaster.  
> >> >>
> >> >> I tested this patch on my arm board, it works.  
> >> > 
> >> > I am assuming you tried the patch without enabling support for
> >> > memory-less nodes.
> >> > 
> >> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> >> > from NR_CPUS restriction. When it comes to building zonelists, the node
> >> > referenced by the PCI controller also has zonelists initialised.
> >> > 
> >> > So it looks like a fallback node is set up even if we don't have
> >> > memory-less nodes enabled. I need to stare some more at the code to see
> >> > why we need memory-less nodes at all then ...  
> >> 
> >> Yes, please. From my limited MM knowledge, zonelists should not be
> >> initialised if there is no CPU and no memory on this node, correct me if I'm
> >> wrong.  
> >
> > Well, as long as there is code which can explicitly ask for a specific
> > node, it is safer to have zonelists configured. Otherwise you just
> > force callers to add hacks and figure out the proper placement there.
> > Zonelists should be cheap to configure for all possible nodes. It's not
> > like we are talking about a huge amount of resources.
> 
> I agree. The current problem stems from not configuring the zonelists
> for nodes that have neither onlined cpus nor memory. Lorenzo's patch fixes
> the configuration of such nodes.
> 
> For allocation requests targeting memory-less nodes, the allocator will
> take the slow path and fall back to one of the other nodes based on the
> zonelists.
> 
> I'm not sure how common such allocations are but I'll work on enabling
> CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
> config improves the fallback mechanism by starting the search from a
> nearby node with memory.

I'll test it when back in the office, but I had a similar issue with
memory only nodes when I moved the SRAT listing for cpus from the
4th node to the 3rd node to fake some memory I could hot unplug.
This gave a memory only node for the last node on the system.

When I instead moved cpus from the 3rd node to the 4th (so the node
with only memory was now in the middle), everything worked.

Was odd, and I'd been meaning to chase it down but hadn't gotten to it
yet.  If I get time I'll put together some test firmwares and see if there
are any other nasty corner cases we aren't handling.

Jonathan





Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-22 Thread Punit Agrawal
Michal Hocko  writes:

> On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
>> On 2018/6/20 19:51, Punit Agrawal wrote:
>> > Xie XiuQi  writes:
>> > 
>> >> Hi Lorenzo, Punit,
>> >>
>> >>
>> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>  Michal Hocko  writes:
>> 
>> > On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>> > [...]
>> >> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> >> as a temporary fix (it'll also be easier to backport) while we work on
>> >> fixing these other issues and enabling memoryless nodes.
>> >
>> > Well, x86 already does that but copying this antipattern is not really
>> > nice. So it is good as a quick fix but it would be definitely much
>> > better to have a robust fix. Who knows how many other places might hit
>> > this. You certainly do not want to add a hack like this all over...
>> 
>>  Completely agree! I was only suggesting it as a temporary measure,
>>  especially as it looked like a proper fix might be invasive.
>> 
>>  Another fix might be to change the node specific allocation to node
>>  agnostic allocations. It isn't clear why the allocation is being
>>  requested from a specific node. I think Lorenzo suggested this in one of
>>  the threads.
>> >>>
>> >>> I think that code was just copypasted but it is better to fix the
>> >>> underlying issue.
>> >>>
>>  I've started putting together a set fixing the issues identified in this
>>  thread. It should give a better idea on the best course of action.
>> >>>
>> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> >>> every SRAT GICC entry, feel free to pick it up if it works.
>> >>>
>> >>> Yes, we can take the original patch just because it is safer for an -rc
>> >>> cycle even though, if the patch below would do, delaying the fix for a
>> >>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
>> >>> not a disaster.
>> >>
>> >> I tested this patch on my arm board, it works.
>> > 
>> > I am assuming you tried the patch without enabling support for
>> > memory-less nodes.
>> > 
>> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
>> > from NR_CPUS restriction. When it comes to building zonelists, the node
>> > referenced by the PCI controller also has zonelists initialised.
>> > 
>> > So it looks like a fallback node is set up even if we don't have
>> > memory-less nodes enabled. I need to stare some more at the code to see
>> > why we need memory-less nodes at all then ...
>> 
>> Yes, please. From my limited MM knowledge, zonelists should not be
>> initialised if there is no CPU and no memory on this node, correct me if I'm
>> wrong.
>
> Well, as long as there is code which can explicitly ask for a specific
> node, it is safer to have zonelists configured. Otherwise you just
> force callers to add hacks and figure out the proper placement there.
> Zonelists should be cheap to configure for all possible nodes. It's not
> like we are talking about a huge amount of resources.

I agree. The current problem stems from not configuring the zonelists
for nodes that have neither onlined cpus nor memory. Lorenzo's patch fixes
the configuration of such nodes.

For allocation requests targeting memory-less nodes, the allocator will
take the slow path and fall back to one of the other nodes based on the
zonelists.

I'm not sure how common such allocations are but I'll work on enabling
CONFIG_HAVE_MEMORYLESS_NODES on top of Lorenzo's patch. AIUI, this
config improves the fallback mechanism by starting the search from a
nearby node with memory.
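
To make the fallback concrete, here is a minimal sketch (not code from
any posted patch; alloc_near_node() is a hypothetical helper):

        /*
         * Allocate preferably near 'nid', tolerating memory-less nodes.
         * With CONFIG_HAVE_MEMORYLESS_NODES, local_memory_node(nid)
         * returns the nearest node that actually has memory; without
         * it, the call returns 'nid' unchanged and the allocator falls
         * back through the zonelists on the slow path.
         */
        static void *alloc_near_node(size_t size, int nid)
        {
                if (nid != NUMA_NO_NODE)
                        nid = local_memory_node(nid);

                return kzalloc_node(size, GFP_KERNEL, nid);
        }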


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-22 Thread Michal Hocko
On Fri 22-06-18 16:58:05, Hanjun Guo wrote:
> On 2018/6/20 19:51, Punit Agrawal wrote:
> > Xie XiuQi  writes:
> > 
> >> Hi Lorenzo, Punit,
> >>
> >>
> >> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> >>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>  Michal Hocko  writes:
> 
> > On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> > [...]
> >> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> as a temporary fix (it'll also be easier to backport) while we work on
> >> fixing these other issues and enabling memoryless nodes.
> >
> > Well, x86 already does that but copying this antipattern is not really
> > nice. So it is good as a quick fix but it would be definitely much
> > better to have a robust fix. Who knows how many other places might hit
> > this. You certainly do not want to add a hack like this all over...
> 
>  Completely agree! I was only suggesting it as a temporary measure,
>  especially as it looked like a proper fix might be invasive.
> 
>  Another fix might be to change the node specific allocation to node
>  agnostic allocations. It isn't clear why the allocation is being
>  requested from a specific node. I think Lorenzo suggested this in one of
>  the threads.
> >>>
> >>> I think that code was just copypasted but it is better to fix the
> >>> underlying issue.
> >>>
>  I've started putting together a set fixing the issues identified in this
>  thread. It should give a better idea on the best course of action.
> >>>
> >>> On ACPI ARM64, this diff should do if I read the code correctly, it
> >>> should be (famous last words) just a matter of mapping PXMs to nodes for
> >>> every SRAT GICC entry, feel free to pick it up if it works.
> >>>
> >>> Yes, we can take the original patch just because it is safer for an -rc
> >>> cycle even though, if the patch below would do, delaying the fix for a
> >>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
> >>> not a disaster.
> >>
> >> I tested this patch on my arm board, it works.
> > 
> > I am assuming you tried the patch without enabling support for
> > memory-less nodes.
> > 
> > The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> > from NR_CPUS restriction. When it comes to building zonelists, the node
> > referenced by the PCI controller also has zonelists initialised.
> > 
> > So it looks like a fallback node is set up even if we don't have
> > memory-less nodes enabled. I need to stare some more at the code to see
> > why we need memory-less nodes at all then ...
> 
> Yes, please. From my limited MM knowledge, zonelists should not be
> initialised if there is no CPU and no memory on this node, correct me if I'm
> wrong.

Well, as long as there is code which can explicitly ask for a specific
node, it is safer to have zonelists configured. Otherwise you just
force callers to add hacks and figure out the proper placement there.
Zonelists should be cheap to configure for all possible nodes. It's not
like we are talking about a huge amount of resources.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-22 Thread Hanjun Guo
On 2018/6/20 19:51, Punit Agrawal wrote:
> Xie XiuQi  writes:
> 
>> Hi Lorenzo, Punit,
>>
>>
>> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
 Michal Hocko  writes:

> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> > Well, x86 already does that but copying this antipattern is not really
> nice. So it is good as a quick fix but it would be definitely much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...

 Completely agree! I was only suggesting it as a temporary measure,
 especially as it looked like a proper fix might be invasive.

 Another fix might be to change the node specific allocation to node
 agnostic allocations. It isn't clear why the allocation is being
 requested from a specific node. I think Lorenzo suggested this in one of
 the threads.
>>>
>>> I think that code was just copypasted but it is better to fix the
>>> underlying issue.
>>>
 I've started putting together a set fixing the issues identified in this
 thread. It should give a better idea on the best course of action.
>>>
>>> On ACPI ARM64, this diff should do if I read the code correctly, it
>>> should be (famous last words) just a matter of mapping PXMs to nodes for
>>> every SRAT GICC entry, feel free to pick it up if it works.
>>>
>>> Yes, we can take the original patch just because it is safer for an -rc
>>> cycle even though, if the patch below would do, delaying the fix for a
>>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
>>> not a disaster.
>>
>> I tested this patch on my arm board, it works.
> 
> I am assuming you tried the patch without enabling support for
> memory-less nodes.
> 
> The patch de-couples the onlining of numa nodes (as parsed from SRAT)
> from NR_CPUS restriction. When it comes to building zonelists, the node
> referenced by the PCI controller also has zonelists initialised.
> 
> So it looks like a fallback node is set up even if we don't have
> memory-less nodes enabled. I need to stare some more at the code to see
> why we need memory-less nodes at all then ...

Yes, please. From my limited MM knowledge, zonelists should not be
initialised if there is no CPU and no memory on this node, correct me if I'm
wrong.

Thanks
Hanjun



Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-20 Thread Punit Agrawal
Xie XiuQi  writes:

> Hi Lorenzo, Punit,
>
>
> On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
>> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>>> Michal Hocko  writes:
>>>
 On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
 [...]
> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> as a temporary fix (it'll also be easier to backport) while we work on
> fixing these other issues and enabling memoryless nodes.

 Well, x86 already does that but copying this antipattern is not really
 nice. So it is good as a quick fix but it would be definitely much
 better to have a robust fix. Who knows how many other places might hit
 this. You certainly do not want to add a hack like this all over...
>>>
>>> Completely agree! I was only suggesting it as a temporary measure,
>>> especially as it looked like a proper fix might be invasive.
>>>
>>> Another fix might be to change the node specific allocation to node
>>> agnostic allocations. It isn't clear why the allocation is being
>>> requested from a specific node. I think Lorenzo suggested this in one of
>>> the threads.
>> 
>> I think that code was just copypasted but it is better to fix the
>> underlying issue.
>> 
>>> I've started putting together a set fixing the issues identified in this
>>> thread. It should give a better idea on the best course of action.
>> 
>> On ACPI ARM64, this diff should do if I read the code correctly, it
>> should be (famous last words) just a matter of mapping PXMs to nodes for
>> every SRAT GICC entry, feel free to pick it up if it works.
>> 
>> Yes, we can take the original patch just because it is safer for an -rc
>> cycle even though, if the patch below would do, delaying the fix for a
>> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
>> not a disaster.
>
> I tested this patch on my arm board, it works.

I am assuming you tried the patch without enabling support for
memory-less nodes.

The patch de-couples the onlining of numa nodes (as parsed from SRAT)
from NR_CPUS restriction. When it comes to building zonelists, the node
referenced by the PCI controller also has zonelists initialised.

So it looks like a fallback node is set up even if we don't have
memory-less nodes enabled. I need to stare some more at the code to see
why we need memory-less nodes at all then ...



Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Xie XiuQi
Hi Lorenzo, Punit,


On 2018/6/20 0:32, Lorenzo Pieralisi wrote:
> On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
>> Michal Hocko  writes:
>>
>>> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
>>> [...]
 In terms of $SUBJECT, I wonder if it's worth taking the original patch
 as a temporary fix (it'll also be easier to backport) while we work on
 fixing these other issues and enabling memoryless nodes.
>>>
>>> Well, x86 already does that but copying this antipattern is not really
>>> nice. So it is good as a quick fix but it would be definitely much
>>> better to have a robust fix. Who knows how many other places might hit
>>> this. You certainly do not want to add a hack like this all over...
>>
>> Completely agree! I was only suggesting it as a temporary measure,
>> especially as it looked like a proper fix might be invasive.
>>
>> Another fix might be to change the node specific allocation to node
>> agnostic allocations. It isn't clear why the allocation is being
>> requested from a specific node. I think Lorenzo suggested this in one of
>> the threads.
> 
> I think that code was just copypasted but it is better to fix the
> underlying issue.
> 
>> I've started putting together a set fixing the issues identified in this
>> thread. It should give a better idea on the best course of action.
> 
> On ACPI ARM64, this diff should do if I read the code correctly, it
> should be (famous last words) just a matter of mapping PXMs to nodes for
> every SRAT GICC entry, feel free to pick it up if it works.
> 
> Yes, we can take the original patch just because it is safer for an -rc
> cycle even though, if the patch below would do, delaying the fix for a
> couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
> not a disaster.

I tested this patch on my arm board, it works.

-- 
Thanks,
Xie XiuQi

> 
> Lorenzo
> 
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..877b268ef9fa 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
>  		return;
>  
> -	if (cpus_in_srat >= NR_CPUS) {
> -		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> -			     NR_CPUS);
> -		return;
> -	}
> -
>  	pxm = pa->proximity_domain;
>  	node = acpi_map_pxm_to_node(pxm);
>  
> @@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  		return;
>  	}
>  
> +	node_set(node, numa_nodes_parsed);
> +
> +	if (cpus_in_srat >= NR_CPUS) {
> +		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
> +			     NR_CPUS);
> +		return;
> +	}
> +
>  	mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
>  	if (mpidr == PHYS_CPUID_INVALID) {
>  		pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
> @@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  
>  	early_node_cpu_hwid[cpus_in_srat].node_id = node;
>  	early_node_cpu_hwid[cpus_in_srat].cpu_hwid = mpidr;
> -	node_set(node, numa_nodes_parsed);
>  	cpus_in_srat++;
>  	pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
>  		pxm, mpidr, node);
> 
> .
> 





Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Lorenzo Pieralisi
On Tue, Jun 19, 2018 at 04:35:40PM +0100, Punit Agrawal wrote:
> Michal Hocko  writes:
> 
> > On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> > [...]
> >> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> >> as a temporary fix (it'll also be easier to backport) while we work on
> >> fixing these other issues and enabling memoryless nodes.
> >
> > Well, x86 already does that but copying this antipattern is not really
> > nice. So it is good as a quick fix but it would be definitely much
> > better to have a robust fix. Who knows how many other places might hit
> > this. You certainly do not want to add a hack like this all over...
> 
> Completely agree! I was only suggesting it as a temporary measure,
> especially as it looked like a proper fix might be invasive.
> 
> Another fix might be to change the node specific allocation to node
> agnostic allocations. It isn't clear why the allocation is being
> requested from a specific node. I think Lorenzo suggested this in one of
> the threads.

I think that code was just copypasted but it is better to fix the
underlying issue.

> I've started putting together a set fixing the issues identified in this
> thread. It should give a better idea on the best course of action.

On ACPI ARM64, this diff should do if I read the code correctly, it
should be (famous last words) just a matter of mapping PXMs to nodes for
every SRAT GICC entry, feel free to pick it up if it works.

Yes, we can take the original patch just because it is safer for an -rc
cycle even though, if the patch below would do, delaying the fix for a
couple of -rcs (to get it tested across ACPI ARM64 NUMA platforms) is
not a disaster.

Lorenzo

-- >8 --
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..877b268ef9fa 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,12 +70,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
 		return;
 
-	if (cpus_in_srat >= NR_CPUS) {
-		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
-			     NR_CPUS);
-		return;
-	}
-
 	pxm = pa->proximity_domain;
 	node = acpi_map_pxm_to_node(pxm);
 
@@ -85,6 +79,14 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 		return;
 	}
 
+	node_set(node, numa_nodes_parsed);
+
+	if (cpus_in_srat >= NR_CPUS) {
+		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
+			     NR_CPUS);
+		return;
+	}
+
 	mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
 	if (mpidr == PHYS_CPUID_INVALID) {
 		pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
@@ -95,7 +97,6 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 
 	early_node_cpu_hwid[cpus_in_srat].node_id = node;
 	early_node_cpu_hwid[cpus_in_srat].cpu_hwid = mpidr;
-	node_set(node, numa_nodes_parsed);
 	cpus_in_srat++;
 	pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
 		pxm, mpidr, node);


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Punit Agrawal
Michal Hocko  writes:

> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> Well, x86 already does that but copying this antipattern is not really
> nice. So it is good as a quick fix but it would be definitely much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...

Completely agree! I was only suggesting it as a temporary measure,
especially as it looked like a proper fix might be invasive.

Another fix might be to change the node specific allocation to node
agnostic allocations. It isn't clear why the allocation is being
requested from a specific node. I think Lorenzo suggested this in one of
the threads.

I've started putting together a set fixing the issues identified in this
thread. It should give a better idea on the best course of action.


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Michal Hocko
On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
[...]
> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> as a temporary fix (it'll also be easier to backport) while we work on
> fixing these other issues and enabling memoryless nodes.

Well, x86 already does that but copying this antipattern is not really
nice. So it is good as a quick fix but it would be definitely much
better to have a robust fix. Who knows how many other places might hit
this. You certainly do not want to add a hack like this all over...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Punit Agrawal
Lorenzo Pieralisi  writes:

> On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
>> Michal Hocko  writes:
>> 
>> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
>> > [...]
>> >> I tested on an arm board with 128 cores and 4 numa nodes, but I set
>> >> CONFIG_NR_CPUS=72.
>> >> Then node 3 is not created, because node 3 has no memory and no cpu.
>> >> But some pci devices may be related to node 3, as set in the ACPI table.
>> >
>> > Could you double check that zonelists for node 3 are generated
>> > correctly?
>> 
>> The cpus in node 3 aren't onlined and there's no memory attached - I
>> suspect that no zonelists are built for this node.
>> 
>> We skip creating a node if the number of SRAT entries parsed exceeds
>> NR_CPUS[0]. This in turn prevents onlining the numa node and so no
>> zonelists will be created for it.
>> 
>> I think the problem will go away if the cpus are restricted via the
>> kernel command line by setting nr_cpus.
>> 
>> Xie, can you try the below patch on top of the one enabling memoryless
>> nodes? I'm not sure this is the right solution but at least it'll
>> confirm the problem.
>
> This issue looks familiar (or at least related):
>
> git log d3bd058826aa

Indeed. Thanks for digging into this.

>
> The reason why the NR_CPUS guard is there is to avoid overflowing
> the early_node_cpu_hwid array.

Ah right... I missed that. The below patch is definitely not what we
want.

> IA64 does something different in
> that respect compared to x86, we have to have a look into this.
>
> Regardless, AFAICS the proximity domain to node mappings should not
> depend on CONFIG_NR_CPUS; it seems that there is something wrong with
> the ARM64 ACPI SRAT parsing.

Not only SRAT parsing - it looks like there is a similar restriction
while parsing the ACPI MADT in acpi_map_gic_cpu_interface().

The incomplete parsing introduces a dependency on the ordering of
entries being aligned between SRAT and MADT when NR_CPUS is
restricted. We want to parse the entire table in both cases so that the
code is robust to reordering of entries.

In terms of $SUBJECT, I wonder if it's worth taking the original patch
as a temporary fix (it'll also be easier to backport) while we work on
fixing these other issues and enabling memoryless nodes.


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Lorenzo Pieralisi
On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
> Michal Hocko  writes:
> 
> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> > [...]
> >> I tested on an arm board with 128 cores and 4 numa nodes, but I set
> >> CONFIG_NR_CPUS=72.
> >> Then node 3 is not created, because node 3 has no memory and no cpu.
> >> But some pci devices may be related to node 3, as set in the ACPI table.
> >
> > Could you double check that zonelists for node 3 are generated
> > correctly?
> 
> The cpus in node 3 aren't onlined and there's no memory attached - I
> suspect that no zonelists are built for this node.
> 
> We skip creating a node if the number of SRAT entries parsed exceeds
> NR_CPUS[0]. This in turn prevents onlining the numa node and so no
> zonelists will be created for it.
> 
> I think the problem will go away if the cpus are restricted via the
> kernel command line by setting nr_cpus.
> 
> Xie, can you try the below patch on top of the one enabling memoryless
> nodes? I'm not sure this is the right solution but at least it'll
> confirm the problem.

This issue looks familiar (or at least related):

git log d3bd058826aa

The reason why the NR_CPUS guard is there is to avoid overflowing
the early_node_cpu_hwid array. IA64 does something different in
that respect compared to x86, we have to have a look into this.

Regardless, AFAICS the proximity domain to node mappings should not
depend on CONFIG_NR_CPUS; it seems that there is something wrong with
the ARM64 ACPI SRAT parsing.

Lorenzo

> 
> Thanks,
> Punit
> 
> [0] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73
> 
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..fea0f7164f1a 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>  	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
>  		return;
>  
> -	if (cpus_in_srat >= NR_CPUS) {
> +	if (cpus_in_srat >= NR_CPUS)
>  		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
>  			     NR_CPUS);
> -		return;
> -	}
>  
>  	pxm = pa->proximity_domain;
>  	node = acpi_map_pxm_to_node(pxm);


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Punit Agrawal
Michal Hocko  writes:

> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on an arm board with 128 cores and 4 numa nodes, but I set
>> CONFIG_NR_CPUS=72.
>> Then node 3 is not created, because node 3 has no memory and no cpu.
>> But some pci devices may be related to node 3, as set in the ACPI table.
>
> Could you double check that zonelists for node 3 are generated
> correctly?

The cpus in node 3 aren't onlined and there's no memory attached - I
suspect that no zonelists are built for this node.

We skip creating a node if the number of SRAT entries parsed exceeds
NR_CPUS[0]. This in turn prevents onlining the numa node and so no
zonelists will be created for it.

I think the problem will go away if the cpus are restricted via the
kernel command line by setting nr_cpus.
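
(For example - purely illustrative - booting the same kernel with

	nr_cpus=72

appended to the kernel command line caps the CPU count at runtime
instead of at build time.)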

Xie, can you try the below patch on top of the one enabling memoryless
nodes? I'm not sure this is the right solution but at least it'll
confirm the problem.

Thanks,
Punit

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73

-- >8 --
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..fea0f7164f1a 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
 		return;
 
-	if (cpus_in_srat >= NR_CPUS) {
+	if (cpus_in_srat >= NR_CPUS)
 		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
 			     NR_CPUS);
-		return;
-	}
 
 	pxm = pa->proximity_domain;
 	node = acpi_map_pxm_to_node(pxm);


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Xie XiuQi
Hi Michal,

On 2018/6/19 20:07, Michal Hocko wrote:
> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on an arm board with 128 cores and 4 numa nodes, but I set
>> CONFIG_NR_CPUS=72.
>> Then node 3 is not created, because node 3 has no memory and no cpu.
>> But some pci devices may be related to node 3, as set in the ACPI table.
> 
> Could you double check that zonelists for node 3 are generated
> correctly?
> 

zonelists for node 3 are not created at all.

The kernel parses the SRAT table to create node info, but in this case
the SRAT table is not parsed completely. Only the first 72 items are parsed.
We never reach node 3's information in the SRAT table, because
cpu_to_node_map[72] is too small.

[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x3 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30001 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30002 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30003 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30100 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30101 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30102 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30103 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30200 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30201 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30202 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30203 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30300 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30301 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30302 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30303 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30400 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30401 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30402 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30403 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30500 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30501 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30502 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30503 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30600 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30601 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30602 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30603 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30700 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30701 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30702 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30703 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x1 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10001 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10002 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10003 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10100 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10101 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10102 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10103 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10200 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10201 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10202 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10203 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10300 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10301 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10302 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10303 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10400 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10401 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10402 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10403 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10500 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10501 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10502 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10503 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10600 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10601 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10602 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10603 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10700 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10701 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10702 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 1 -> MPIDR 0x10703 -> Node 1
[0.00] ACPI: NUMA: SRAT: PXM 2 -> MPIDR 0x7 -> Node 2
[

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Michal Hocko
On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
[...]
> I tested on an arm board with 128 cores and 4 numa nodes, but I set
> CONFIG_NR_CPUS=72.
> Then node 3 is not created, because node 3 has no memory and no cpu.
> But some pci devices may be related to node 3, as set in the ACPI table.

Could you double check that zonelists for node 3 are generated
correctly?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-19 Thread Xie XiuQi
Hi Punit,


On 2018/6/14 1:39, Punit Agrawal wrote:
> Punit Agrawal  writes:
> 
> 
> [...]
> 
>>
>> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
>> up returning the original node in the fallback path.
>>
>> Xie, does the below patch help? I can submit a proper patch if this
>> fixes the issue for you.
>>
>> -- >8 --
>> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>>
>> Signed-off-by: Punit Agrawal 
>> ---
>>  arch/arm64/Kconfig   | 4 ++++
>>  arch/arm64/mm/numa.c | 2 ++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index eb2cf4938f6d..5317e9aa93ab 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>>  	def_bool y
>>  	depends on NUMA
>>  
>> +config HAVE_MEMORYLESS_NODES
>> +	def_bool y
>> +	depends on NUMA
>> +
>>  config HAVE_SETUP_PER_CPU_AREA
>>  	def_bool y
>>  	depends on NUMA
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index dad128ba98bf..c699dcfe93de 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>>  static void map_cpu_to_node(unsigned int cpu, int nid)
>>  {
>>  	set_cpu_numa_node(cpu, nid);
>> +	set_numa_mem(local_memory_node(nid));
> 
> Argh, this should be
> 
> set_cpu_numa_mem(cpu, local_memory_node(nid));
> 
> There is no guarantee that map_cpu_to_node() will be called on the
> local cpu.
> 
> Hanjun, Xie - can you try with the update please?

I've tested this patch, but it does not help.
The boot message is attached.

I tested on an arm board with 128 cores and 4 numa nodes, but I set
CONFIG_NR_CPUS=72.
Then node 3 is not created, because node 3 has no memory and no cpu.
But some pci devices may be related to node 3, as set in the ACPI table.

/* Interface called from ACPI code to setup PCI host controller */
struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root *root)
{
	int node = acpi_get_node(root->device->handle);
	struct acpi_pci_generic_root_info *ri;
	struct pci_bus *bus, *child;
	struct acpi_pci_root_ops *root_ops;
	...
	// this node may not be created
	ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
	if (!ri)
		return NULL;

	root_ops = kzalloc_node(sizeof(*root_ops), GFP_KERNEL, node);
	if (!root_ops) {
		kfree(ri);
		return NULL;
	}

	ri->cfg = pci_acpi_setup_ecam_mapping(root);
	if (!ri->cfg) {
		kfree(ri);
		kfree(root_ops);
		return NULL;
	}
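
(A node-agnostic variant along the lines suggested earlier in the thread
would drop the node hint and let the allocator choose - sketch only, not
a posted patch:

	ri = kzalloc(sizeof(*ri), GFP_KERNEL);
	root_ops = kzalloc(sizeof(*root_ops), GFP_KERNEL);

at the cost of losing NUMA locality when the node is valid and online.)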


> 
> Thanks,
> Punit
> 
>> +
>>  	if (nid >= 0)
>>  		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>>  }
> 
> .
> 

-- 
Thanks,
Xie XiuQi
[0.00] Booting Linux on physical CPU 0x03 [0x480fd010]
[0.00] Linux version 4.16.0-rc1-00491-g204a6cc-dirty 
(xiexiuqi@localhost.localdomain) (gcc version 6.3.1 20170404 (Linaro GCC 
6.3-2017.05)) #17 SMP PREEMPT Tue Jun 19 16:33:32 CST 2018
[0.00] earlycon: pl11 at MMIO32 0x9408 (options '')
[0.00] bootconsole [pl11] enabled
[0.00] efi: Getting EFI parameters from FDT:
[0.00] efi: EFI v2.60 by EDK II
[0.00] efi:  SMBIOS 3.0=0x3eb6  ACPI 2.0=0x3971  
MEMATTR=0x3b106418 
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x3971 24 (v02 HISI  )
[0.00] ACPI: XSDT 0x3970 74 (v01 HISI   HIP08
  0113)
[0.00] ACPI: FACP 0x3963 000114 (v06 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: DSDT 0x395C 006A1A (v02 HISI   HIP08
 INTL 20170929)
[0.00] ACPI: GTDT 0x3962 60 (v02 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: DBG2 0x3961 5A (v00 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: MCFG 0x3960 3C (v01 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: SLIT 0x395F 3C (v01 HISI   HIP07
 INTL 20151124)
[0.00] ACPI: SRAT 0x395E 0009C0 (v03 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: APIC 0x395D 00286C (v04 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: IORT 0x395B 00110C (v00 HISI   HIP08
 INTL 20170929)
[0.00] ACPI: PPTT 0x311F 0037D0 (v01 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: SPMI 0x311E 41 (v05 HISI   HIP08
 INTL 20151124)
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x3 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30001 -> Node 0
[0.00] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30002 -> Node 0
[0.00

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-13 Thread Hanjun Guo
Hi Punit,

On 2018/6/14 1:39, Punit Agrawal wrote:
> Punit Agrawal  writes:
> 
> 
> [...]
> 
>>
>> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
>> up returning the original node in the fallback path.
>>
>> Xie, does the below patch help? I can submit a proper patch if this
>> fixes the issue for you.
>>
>> -- >8 --
>> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>>
>> Signed-off-by: Punit Agrawal 
>> ---
>>  arch/arm64/Kconfig   | 4 ++++
>>  arch/arm64/mm/numa.c | 2 ++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index eb2cf4938f6d..5317e9aa93ab 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>>  	def_bool y
>>  	depends on NUMA
>>  
>> +config HAVE_MEMORYLESS_NODES
>> +	def_bool y
>> +	depends on NUMA
>> +
>>  config HAVE_SETUP_PER_CPU_AREA
>>  	def_bool y
>>  	depends on NUMA
>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>> index dad128ba98bf..c699dcfe93de 100644
>> --- a/arch/arm64/mm/numa.c
>> +++ b/arch/arm64/mm/numa.c
>> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>>  static void map_cpu_to_node(unsigned int cpu, int nid)
>>  {
>>  	set_cpu_numa_node(cpu, nid);
>> +	set_numa_mem(local_memory_node(nid));
> 
> Argh, this should be
> 
> set_cpu_numa_mem(cpu, local_memory_node(nid));
> 
> There is no guarantee that map_cpu_to_node() will be called on the
> local cpu.
> 
> Hanjun, Xie - can you try with the update please?

Thanks for looking into this, we will try this tomorrow
(the hardware is occupied now) and update here.

Thanks
Hanjun



Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-13 Thread Punit Agrawal
Punit Agrawal  writes:


[...]

>
> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
> up returning the original node in the fallback path.
>
> Xie, does the below patch help? I can submit a proper patch if this
> fixes the issue for you.
>
> -- >8 --
> Subject: [PATCH] arm64/numa: Enable memoryless numa nodes
>
> Signed-off-by: Punit Agrawal 
> ---
>  arch/arm64/Kconfig   | 4 ++++
>  arch/arm64/mm/numa.c | 2 ++
>  2 files changed, 6 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index eb2cf4938f6d..5317e9aa93ab 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -756,6 +756,10 @@ config USE_PERCPU_NUMA_NODE_ID
>  	def_bool y
>  	depends on NUMA
>  
> +config HAVE_MEMORYLESS_NODES
> +	def_bool y
> +	depends on NUMA
> +
>  config HAVE_SETUP_PER_CPU_AREA
>  	def_bool y
>  	depends on NUMA
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index dad128ba98bf..c699dcfe93de 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -73,6 +73,8 @@ EXPORT_SYMBOL(cpumask_of_node);
>  static void map_cpu_to_node(unsigned int cpu, int nid)
>  {
>  	set_cpu_numa_node(cpu, nid);
> +	set_numa_mem(local_memory_node(nid));

Argh, this should be

set_cpu_numa_mem(cpu, local_memory_node(nid));

There is no guarantee that map_cpu_to_node() will be called on the
local cpu.

Hanjun, Xie - can you try with the update please?

Thanks,
Punit

> +
>  	if (nid >= 0)
>  		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>  }


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-12 Thread Michal Hocko
On Tue 12-06-18 16:08:03, Punit Agrawal wrote:
> Michal Hocko  writes:
[...]
> > Well, the standard way to handle memory-less NUMA nodes is to simply
> > fallback to the closest NUMA node. We even have an API for that
> > (numa_mem_id).
> 
> CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
> up returning the original node in the fallback path.

Yes this makes more sense.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-12 Thread Punit Agrawal
Michal Hocko  writes:

> On Mon 11-06-18 08:43:03, Bjorn Helgaas wrote:
>> On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
>> > Hi Michal,
>> > 
>> > On 2018/6/11 16:52, Michal Hocko wrote:
>> > > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> > >> Hi Michal,
>> > >>
>> > >> On 2018/6/7 20:21, Michal Hocko wrote:
>> > >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> >  On 2018/6/7 18:55, Michal Hocko wrote:
>> > >>> [...]
>> > > I am not sure I have the full context but pci_acpi_scan_root calls
>> > > kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>> > > and that should fall back to whatever node that is online. Offline node
>> > > shouldn't keep any pages behind. So there must be something else going
>> > > on here and the patch is not the right way to handle it. What does
>> > > faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?
>> > 
>> >  The whole context is:
>> > 
>> >  The system is booted with a NUMA node that has no memory attached to it
>> >  (memory-less NUMA node), also with NR_CPUS less than CPUs presented
>> >  in MADT, so CPUs on this memory-less node are not brought up, and
>> >  this NUMA node will not be online (but SRAT presents this NUMA node);
>> > 
>> >  Devices attaching to this NUMA node such as PCI host bridge still
>> >  return the valid NUMA node via _PXM, but actually that valid NUMA node
>> >  is not online, which leads to this issue.
>> > >>>
>> > >>> But we should have other numa nodes on the zonelists so the allocator
>> > >>> should fall back to another node. If the zonelist is not initialized
>> > >>> properly, though, then this can indeed show up as a problem. Knowing
>> > >>> which exact place has blown up would help get a better picture...
>> > >>>
>> > >>
>> > >> I specified a non-existent node to allocate memory using kzalloc_node,
>> > >> and got the following error message.
>> > >>
>> > >> And I found out there is just a VM_WARN, but it does not prevent the
>> > >> memory allocation from continuing.
>> > >>
>> > >> This nid would be used to access NODE_DATA(nid), so if nid is invalid,
>> > >> it would cause an oops here.
>> > >>
>> > >> /*
>> > >>  * Allocate pages, preferring the node given as nid. The node must be valid and
>> > >>  * online. For more general interface, see alloc_pages_node().
>> > >>  */
>> > >> static inline struct page *
>> > >> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> > >> {
>> > >> 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> > >> 	VM_WARN_ON(!node_online(nid));
>> > >>
>> > >> 	return __alloc_pages(gfp_mask, order, nid);
>> > >> }
>> > >>
>> > >> (I wrote a kernel module to allocate memory on a non-existent node
>> > >> using kzalloc_node().)
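
(A module along those lines is easy to reconstruct; the sketch below is
illustrative only - the actual module from this report was not posted:

	#include <linux/module.h>
	#include <linux/slab.h>

	static int bad_nid = 3;	/* deliberately: a node id that was never onlined */
	module_param(bad_nid, int, 0444);

	static int __init badnode_init(void)
	{
		/* trips VM_WARN_ON(!node_online(nid)) and can oops in
		 * NODE_DATA(nid) when the node was never initialised */
		void *p = kzalloc_node(64, GFP_KERNEL, bad_nid);

		pr_info("kzalloc_node(node %d) returned %p\n", bad_nid, p);
		kfree(p);
		return 0;
	}
	module_init(badnode_init);

	static void __exit badnode_exit(void)
	{
	}
	module_exit(badnode_exit);

	MODULE_LICENSE("GPL");
)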
>> > > 
>> > > OK, so this is artificially broken code, right. You shouldn't get a
>> > > non-existent node via standard APIs AFAICS. The original report was
>> > > about an existing node which is offline AFAIU. That would be a different
>> > > case. If I am missing something and there are legitimate users that try
>> > > to allocate from non-existing nodes then we should handle that in
>> > > node_zonelist.
>> > 
>> > I think hanjun's comments may help to understand this question:
>> >  - NUMA node will be built if CPUs and (or) memory are valid on this NUMA
>> >  node;
>> > 
>> >  - But if we boot the system with memory-less node and also with
>> >  CONFIG_NR_CPUS less than CPUs in SRAT, for example, 64 CPUs total with 4
>> >  NUMA nodes, 16 CPUs on each NUMA node, if we boot with
>> >  CONFIG_NR_CPUS=48, then we will not build a numa node for node 3, but with
>> >  devices on that numa node, allocating memory will panic because NUMA node
>> >  3 is not a valid node.
>
> Hmm, but this is not a memory-less node. It sounds like a misconfigured
> kernel to me or broken initialization. Each CPU should have a
> fallback numa node to be used.
>
>> > I triggered this BUG on the arm64 platform, and I found a similar bug had
>> > been fixed on the x86 platform. So I sent a similar patch for this bug.
>> > 
>> > Or, could we consider fixing it in the mm subsystem?
>> 
>> The patch below (b755de8dfdfe) seems like totally the wrong direction.
>> I don't think we want every caller of kzalloc_node() to have to check for
>> node_online().
>
> absolutely.
>
>> Why would memory on an off-line node even be in the allocation pool?
>> I wouldn't expect that memory to be put in the pool until the node
>> comes online and the memory is accessible, so this sounds like some
>> kind of setup issue.
>> 
>> But I'm definitely not an mm person.
>
> Well, the standard way to handle memory-less NUMA nodes is to simply
> fallback to the closest NUMA node. We even have an API for that
> (numa_mem_id).

CONFIG_HAVE_MEMORYLESS_NODES is not enabled on arm64, which means we end
up returning the original node in the fallback path.

Xie, does 

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-11 Thread Michal Hocko
On Mon 11-06-18 08:43:03, Bjorn Helgaas wrote:
> On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
> > Hi Michal,
> > 
> > On 2018/6/11 16:52, Michal Hocko wrote:
> > > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> > >> Hi Michal,
> > >>
> > >> On 2018/6/7 20:21, Michal Hocko wrote:
> > >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> >  On 2018/6/7 18:55, Michal Hocko wrote:
> > >>> [...]
> > > I am not sure I have the full context but pci_acpi_scan_root calls
> > > kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > > and that should fall back to whatever node that is online. Offline node
> > > shouldn't keep any pages behind. So there must be something else going
> > > on here and the patch is not the right way to handle it. What does
> > > faddr2line __alloc_pages_nodemask+0xf0 tell us on this kernel?
> > 
> >  The whole context is:
> > 
> >  The system is booted with a NUMA node that has no memory attached to it
> >  (memory-less NUMA node), also with NR_CPUS less than CPUs presented
> >  in MADT, so CPUs on this memory-less node are not brought up, and
> >  this NUMA node will not be online (but SRAT presents this NUMA node);
> > 
> >  Devices attaching to this NUMA node such as PCI host bridge still
> >  return the valid NUMA node via _PXM, but actually that valid NUMA node
> >  is not online, which leads to this issue.
> > >>>
> > >>> But we should have other numa nodes on the zonelists so the allocator
> > >>> should fall back to another node. If the zonelist is not initialized
> > >>> properly, though, then this can indeed show up as a problem. Knowing
> > >>> which exact place has blown up would help get a better picture...
> > >>>
> > >>
> > >> I specified a non-existent node to allocate memory using kzalloc_node,
> > >> and got the following error message.
> > >>
> > >> And I found out there is just a VM_WARN, but it does not prevent the
> > >> memory allocation from continuing.
> > >>
> > >> This nid would be used to access NODE_DATA(nid), so if nid is invalid,
> > >> it would cause an oops here.
> > >>
> > >> /*
> > >>  * Allocate pages, preferring the node given as nid. The node must be valid and
> > >>  * online. For more general interface, see alloc_pages_node().
> > >>  */
> > >> static inline struct page *
> > >> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> > >> {
> > >> 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> > >> 	VM_WARN_ON(!node_online(nid));
> > >>
> > >> 	return __alloc_pages(gfp_mask, order, nid);
> > >> }
> > >> 471
> > >>
> > >> (I wrote a kernel module to allocate memory on a non-existent node
> > >> using kzalloc_node().)
> > > 
> > > OK, so this is artificially broken code, right. You shouldn't get a
> > > non-existent node via standard APIs AFAICS. The original report was
> > > about an existing node which is offline AFAIU. That would be a different
> > > case. If I am missing something and there are legitimate users that try
> > > to allocate from non-existing nodes then we should handle that in
> > > node_zonelist.
> > 
> > I think hanjun's comments may help to understand this question:
> >  - NUMA node will be built if CPUs and (or) memory are valid on this NUMA
> >  node;
> > 
> >  - But if we boot the system with memory-less node and also with
> >  CONFIG_NR_CPUS less than CPUs in SRAT, for example, 64 CPUs total with 4
> >  NUMA nodes, 16 CPUs on each NUMA node, if we boot with
> >  CONFIG_NR_CPUS=48, then we will not build a numa node for node 3, but with
> >  devices on that numa node, allocating memory will panic because NUMA node
> >  3 is not a valid node.

Hmm, but this is not a memory-less node. It sounds like a misconfigured
kernel to me or broken initialization. Each CPU should have a
fallback numa node to be used.

> > I triggered this BUG on the arm64 platform, and I found a similar bug had
> > been fixed on the x86 platform. So I sent a similar patch for this bug.
> > 
> > Or, could we consider fixing it in the mm subsystem?
> 
> The patch below (b755de8dfdfe) seems like totally the wrong direction.
> I don't think we want every caller of kzalloc_node() to have to check for
> node_online().

absolutely.

> Why would memory on an off-line node even be in the allocation pool?
> I wouldn't expect that memory to be put in the pool until the node
> comes online and the memory is accessible, so this sounds like some
> kind of setup issue.
> 
> But I'm definitely not an mm person.

Well, the standard way to handle memory-less NUMA nodes is to simply
fall back to the closest NUMA node. We even have an API for that
(numa_mem_id).
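
As a sketch of that (sanitize_dev_node() here is a hypothetical helper,
not an existing kernel function), a caller holding a possibly-offline
node id from firmware could do something like:

static int sanitize_dev_node(int node)
{
        /* NUMA_NO_NODE already means "no preference" */
        if (node == NUMA_NO_NODE)
                return node;
        /* offline: fall back to the current CPU's nearest memory node */
        if (!node_online(node))
                return numa_mem_id();
        return node;
}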
 
> > From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
> > From: Yinghai Lu 
> > Date: Wed, 20 Feb 2008 12:41:52 -0800
> > Subject: [PATCH] x86: make dev_to_node return online node
> > 
> > a numa system (with multi HT chains) may return node without ram. Aka it
> 

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-11 Thread Bjorn Helgaas
On Mon, Jun 11, 2018 at 08:32:10PM +0800, Xie XiuQi wrote:
> Hi Michal,
> 
> On 2018/6/11 16:52, Michal Hocko wrote:
> > On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> >> Hi Michal,
> >>
> >> On 2018/6/7 20:21, Michal Hocko wrote:
> >>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>  On 2018/6/7 18:55, Michal Hocko wrote:
> >>> [...]
> > I am not sure I have the full context but pci_acpi_scan_root calls
> > kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > and that should fall back to whatever node is online. An offline node
> > shouldn't keep any pages behind. So there must be something else going
> > on here, and the patch is not the right way to handle it. What does
> > faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
> 
>  The whole context is:
> 
>  The system is booted with a NUMA node that has no memory attached to
>  it (a memory-less NUMA node), and with NR_CPUS less than the number of
>  CPUs presented in the MADT, so the CPUs on this memory-less node are
>  not brought up, and this NUMA node will not be online (though the SRAT
>  presents this NUMA node);
> 
>  Devices attached to this NUMA node, such as the PCI host bridge, still
>  return the valid NUMA node via _PXM, but that valid NUMA node is not
>  online, which leads to this issue.
> >>>
> >>> But we should have other numa nodes on the zonelists so the allocator
> >>> should fall back to another node. If the zonelist is not initialized
> >>> properly, though, then this can indeed show up as a problem. Knowing
> >>> which exact place has blown up would help get a better picture...
> >>>
> >>
> >> I specified a non-existent node when allocating memory with
> >> kzalloc_node(), and got the following error message.
> >>
> >> And I found out there is just a VM_WARN_ON(), which does not prevent
> >> the memory allocation from continuing.
> >>
> >> This nid is then used to access NODE_DATA(nid), so if nid is invalid,
> >> it causes an oops here.
> >>
> >> /*
> >>  * Allocate pages, preferring the node given as nid. The node must be
> >>  * valid and online. For more general interface, see alloc_pages_node().
> >>  */
> >> static inline struct page *
> >> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> >> {
> >>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> >>         VM_WARN_ON(!node_online(nid));
> >>
> >>         return __alloc_pages(gfp_mask, order, nid);
> >> }
> >>
> >> (I wrote a kernel module to allocate memory on a non-existent node
> >> using kzalloc_node().)
> > 
> > OK, so this is artificially broken code, right. You shouldn't get a
> > non-existent node via standard APIs AFAICS. The original report was
> > about an existing node which is offline AFAIU. That would be a different
> > case. If I am missing something and there are legitimate users that try
> > to allocate from non-existing nodes then we should handle that in
> > node_zonelist.
> 
> I think Hanjun's comments may help to understand this question:
>  - A NUMA node will be built if the CPUs and/or memory on it are valid;
> 
>  - But if we boot the system with a memory-less node and also with
>  CONFIG_NR_CPUS less than the number of CPUs in the SRAT (for example,
>  64 CPUs total across 4 NUMA nodes, 16 CPUs on each node, booted with
>  CONFIG_NR_CPUS=48), then we will not build the NUMA node for node 3;
>  but with devices on that NUMA node, allocating memory will panic
>  because NUMA node 3 is not a valid node.
> 
> I triggered this BUG on the arm64 platform, and I found that a similar
> bug had been fixed on the x86 platform, so I sent a similar patch for
> this bug.
> 
> Or could we consider fixing it in the mm subsystem?

The patch below (b755de8dfdfe) seems like totally the wrong direction.
I don't think we want every caller of kzalloc_node() to have a check for
node_online().

Why would memory on an off-line node even be in the allocation pool?
I wouldn't expect that memory to be put in the pool until the node
comes online and the memory is accessible, so this sounds like some
kind of setup issue.

But I'm definitely not an mm person.

> From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
> From: Yinghai Lu 
> Date: Wed, 20 Feb 2008 12:41:52 -0800
> Subject: [PATCH] x86: make dev_to_node return online node
> 
> a numa system (with multi HT chains) may return node without ram. Aka it
> is not online. Try to get an online node, otherwise return -1.
> 
> Signed-off-by: Yinghai Lu 
> Signed-off-by: Ingo Molnar 
> Signed-off-by: Thomas Gleixner 
> ---
>  arch/x86/pci/acpi.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index d95de2f..ea8685f 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct 
> acpi_device *device, int do
>   set_mp_bus_to_node(busnum, node);
>   else
>   node = get_mp_bus_to_node(busnu

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-11 Thread Xie XiuQi
Hi Michal,

On 2018/6/11 16:52, Michal Hocko wrote:
> On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
>> Hi Michal,
>>
>> On 2018/6/7 20:21, Michal Hocko wrote:
>>> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
 On 2018/6/7 18:55, Michal Hocko wrote:
>>> [...]
> I am not sure I have the full context but pci_acpi_scan_root calls
> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> and that should fall back to whatever node is online. An offline node
> shouldn't keep any pages behind. So there must be something else going
> on here, and the patch is not the right way to handle it. What does
> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?

 The whole context is:

 The system is booted with a NUMA node has no memory attaching to it
 (memory-less NUMA node), also with NR_CPUS less than CPUs presented
 in MADT, so CPUs on this memory-less node are not brought up, and
 this NUMA node will not be online (but SRAT presents this NUMA node);

 Devices attaching to this NUMA node such as PCI host bridge still
 return the valid NUMA node via _PXM, but actually that valid NUMA node
 is not online which lead to this issue.
>>>
>>> But we should have other numa nodes on the zonelists so the allocator
>>> should fall back to another node. If the zonelist is not initialized
>>> properly, though, then this can indeed show up as a problem. Knowing
>>> which exact place has blown up would help get a better picture...
>>>
>>
>> I specified a non-existent node when allocating memory with
>> kzalloc_node(), and got the following error message.
>>
>> And I found out there is just a VM_WARN_ON(), which does not prevent
>> the memory allocation from continuing.
>>
>> This nid is then used to access NODE_DATA(nid), so if nid is invalid,
>> it causes an oops here.
>>
>> /*
>>  * Allocate pages, preferring the node given as nid. The node must be
>>  * valid and online. For more general interface, see alloc_pages_node().
>>  */
>> static inline struct page *
>> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>> {
>>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>>         VM_WARN_ON(!node_online(nid));
>>
>>         return __alloc_pages(gfp_mask, order, nid);
>> }
>>
>> (I wrote a kernel module to allocate memory on a non-existent node
>> using kzalloc_node().)
> 
> OK, so this is artificially broken code, right. You shouldn't get a
> non-existent node via standard APIs AFAICS. The original report was
> about an existing node which is offline AFAIU. That would be a different
> case. If I am missing something and there are legitimate users that try
> to allocate from non-existing nodes then we should handle that in
> node_zonelist.

I think Hanjun's comments may help to understand this question:
 - A NUMA node will be built if the CPUs and/or memory on it are valid;

 - But if we boot the system with a memory-less node and also with
   CONFIG_NR_CPUS less than the number of CPUs in the SRAT (for example,
   64 CPUs total across 4 NUMA nodes, 16 CPUs on each node, booted with
   CONFIG_NR_CPUS=48), then we will not build the NUMA node for node 3;
   but with devices on that NUMA node, allocating memory will panic
   because NUMA node 3 is not a valid node.

I triggered this BUG on the arm64 platform, and I found that a similar
bug had been fixed on the x86 platform, so I sent a similar patch for
this bug.

Or could we consider fixing it in the mm subsystem?
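
If it were handled in mm, a guard along these lines might do it (only a
sketch with a hypothetical helper name, not a merged fix):

static inline int sanitize_nid(int nid)
{
        /* never dereference NODE_DATA() of a node that was not built */
        if (nid == NUMA_NO_NODE || !node_online(nid))
                return first_online_node;
        return nid;
}

with __alloc_pages_node() and friends applying it before touching
NODE_DATA(nid).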

From b755de8dfdfef97effaa91379ffafcb81f4d62a1 Mon Sep 17 00:00:00 2001
From: Yinghai Lu 
Date: Wed, 20 Feb 2008 12:41:52 -0800
Subject: [PATCH] x86: make dev_to_node return online node

a numa system (with multi HT chains) may return node without ram. Aka it
is not online. Try to get an online node, otherwise return -1.

Signed-off-by: Yinghai Lu 
Signed-off-by: Ingo Molnar 
Signed-off-by: Thomas Gleixner 
---
 arch/x86/pci/acpi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index d95de2f..ea8685f 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -172,6 +172,9 @@ struct pci_bus * __devinit pci_acpi_scan_root(struct 
acpi_device *device, int do
set_mp_bus_to_node(busnum, node);
else
node = get_mp_bus_to_node(busnum);
+
+   if (node != -1 && !node_online(node))
+   node = -1;
 #endif

/* Allocate per-root-bus (not per bus) arch-specific data.
-- 
1.8.3.1


> 
> [...]
> 

-- 
Thanks,
Xie XiuQi



Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-11 Thread Michal Hocko
On Mon 11-06-18 11:23:18, Xie XiuQi wrote:
> Hi Michal,
> 
> On 2018/6/7 20:21, Michal Hocko wrote:
> > On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> >> On 2018/6/7 18:55, Michal Hocko wrote:
> > [...]
> >>> I am not sure I have the full context but pci_acpi_scan_root calls
> >>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> >>> and that should fall back to whatever node is online. An offline node
> >>> shouldn't keep any pages behind. So there must be something else going
> >>> on here, and the patch is not the right way to handle it. What does
> >>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
> >>
> >> The whole context is:
> >>
> >> The system is booted with a NUMA node that has no memory attached to
> >> it (a memory-less NUMA node), and with NR_CPUS less than the number
> >> of CPUs presented in the MADT, so the CPUs on this memory-less node
> >> are not brought up, and this NUMA node will not be online (though the
> >> SRAT presents this NUMA node);
> >>
> >> Devices attached to this NUMA node, such as the PCI host bridge,
> >> still return the valid NUMA node via _PXM, but that valid NUMA node
> >> is not online, which leads to this issue.
> > 
> > But we should have other numa nodes on the zonelists so the allocator
> > should fall back to another node. If the zonelist is not initialized
> > properly, though, then this can indeed show up as a problem. Knowing
> > which exact place has blown up would help get a better picture...
> > 
> 
> I specified a non-existent node when allocating memory with
> kzalloc_node(), and got the following error message.
>
> And I found out there is just a VM_WARN_ON(), which does not prevent
> the memory allocation from continuing.
> 
> This nid is then used to access NODE_DATA(nid), so if nid is invalid,
> it causes an oops here.
> 
> /*
>  * Allocate pages, preferring the node given as nid. The node must be
>  * valid and online. For more general interface, see alloc_pages_node().
>  */
> static inline struct page *
> __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
> {
>         VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>         VM_WARN_ON(!node_online(nid));
> 
>         return __alloc_pages(gfp_mask, order, nid);
> }
> 
> (I wrote a kernel module to allocate memory on a non-existent node
> using kzalloc_node().)

OK, so this is artificially broken code, right. You shouldn't get a
non-existent node via standard APIs AFAICS. The original report was
about an existing node which is offline AFAIU. That would be a different
case. If I am missing something and there are legitimate users that try
to allocate from non-existing nodes then we should handle that in
node_zonelist.

[...]
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-10 Thread Xie XiuQi
Hi Michal,

On 2018/6/7 20:21, Michal Hocko wrote:
> On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
>> On 2018/6/7 18:55, Michal Hocko wrote:
> [...]
>>> I am not sure I have the full context but pci_acpi_scan_root calls
>>> kzalloc_node(sizeof(*info), GFP_KERNEL, node)
>>> and that should fall back to whatever node is online. An offline node
>>> shouldn't keep any pages behind. So there must be something else going
>>> on here, and the patch is not the right way to handle it. What does
>>> faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
>>
>> The whole context is:
>>
>> The system is booted with a NUMA node that has no memory attached to
>> it (a memory-less NUMA node), and with NR_CPUS less than the number of
>> CPUs presented in the MADT, so the CPUs on this memory-less node are
>> not brought up, and this NUMA node will not be online (though the SRAT
>> presents this NUMA node);
>>
>> Devices attached to this NUMA node, such as the PCI host bridge, still
>> return the valid NUMA node via _PXM, but that valid NUMA node is not
>> online, which leads to this issue.
> 
> But we should have other numa nodes on the zonelists so the allocator
> should fall back to another node. If the zonelist is not initialized
> properly, though, then this can indeed show up as a problem. Knowing
> which exact place has blown up would help get a better picture...
> 

I specified a non-existent node when allocating memory with
kzalloc_node(), and got the following error message.

And I found out there is just a VM_WARN_ON(), which does not prevent
the memory allocation from continuing.

This nid is then used to access NODE_DATA(nid), so if nid is invalid,
it causes an oops here.

/*
 * Allocate pages, preferring the node given as nid. The node must be
 * valid and online. For more general interface, see alloc_pages_node().
 */
static inline struct page *
__alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
{
        VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
        VM_WARN_ON(!node_online(nid));

        return __alloc_pages(gfp_mask, order, nid);
}

(I wrote a kernel module to allocate memory on a non-existent node
using kzalloc_node().)
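
A minimal sketch of such a test module (module and symbol names here are
hypothetical):

#include <linux/module.h>
#include <linux/slab.h>

/* a node id that exists in the SRAT but was never brought online */
static int bad_node = 3;

static int __init badnode_init(void)
{
        /* allocating on the never-onlined node trips the VM_WARN_ON */
        void *p = kzalloc_node(64, GFP_KERNEL, bad_node);

        kfree(p);
        return 0;
}

static void __exit badnode_exit(void)
{
}

module_init(badnode_init);
module_exit(badnode_exit);
MODULE_LICENSE("GPL");

Loading it on such a system produces the warning below.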

[  120.061693] WARNING: CPU: 6 PID: 3966 at ./include/linux/gfp.h:467 
allocate_slab+0x5fd/0x7e0
[  120.070095] Modules linked in: bench(OE+) nls_utf8 isofs loop xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c 
ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log 
dm_mod intel_rapl skx_edac nfit vfat libnvdimm fat x86_pkg_temp_thermal 
coretemp kvm_intel kvm irqbypass iTCO_wdt crct10dif_pclmul iTCO_vendor_support 
crc32_pclmul ghash_clmulni_intel ses pcbc enclosure aesni_intel 
scsi_transport_sas crypto_simd cryptd sg glue_helper ipmi_si joydev mei_me 
i2c_i801 ipmi_devintf ioatdma shpchp pcspkr ipmi_msghandler mei dca i2c_core 
lpc_ich acpi_power_meter nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
[  120.140992]  ext4 mbcache jbd2 sd_mod crc32c_intel i40e ahci libahci 
megaraid_sas libata
[  120.149053] CPU: 6 PID: 3966 Comm: insmod Tainted: G   OE 
4.17.0-rc2-RHEL74+ #5
[  120.157369] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.62 03/26/2018
[  120.164303] RIP: 0010:allocate_slab+0x5fd/0x7e0
[  120.168817] RSP: 0018:881196947af0 EFLAGS: 00010246
[  120.174022] RAX:  RBX: 014012c0 RCX: b4bc8173
[  120.181126] RDX:  RSI: 0008 RDI: 8817aefa7868
[  120.188233] RBP: 014000c0 R08: ed02f5df4f0e R09: ed02f5df4f0e
[  120.195338] R10: ed02f5df4f0d R11: 8817aefa786f R12: 0055
[  120.202444] R13: 0003 R14: 880107c0f800 R15: 
[  120.209550] FS:  7f6935d8c740() GS:8817aef8() 
knlGS:
[  120.217606] CS:  0010 DS:  ES:  CR0: 80050033
[  120.223330] CR2: 00c21b88 CR3: 001197fd0006 CR4: 007606e0
[  120.230435] DR0:  DR1:  DR2: 
[  120.237541] DR3:  DR6: fffe0ff0 DR7: 0400
[  120.244646] PKRU: 5554
[  120.247346] Call Trace:
[  120.249791]  ? __kasan_slab_free+0xff/0x150
[  120.253960]  ? mpidr_init+0x20/0x30 [bench]
[  120.258129]  new_slab+0x3d/0x90
[  120.261262]  ___slab_alloc+0x371/0x640
[  120.265002]  ? __wake_up_common+0x8a/0x150
[  120.269085]  ? mpidr_init+0x20/0x30 [bench]
[  120.273254]  ? mpidr_init+0x20/0x30 [bench]
[  120.277423]  __slab_alloc+0x40/0x66
[  120.280901]  kmem_cache_alloc_node_trace+0xbc/0x270
[  120.285762]  ? mpidr_init+0x20/0x30 [bench]
[  120.289931]  ? 0xc074
[  120.293236]  mpidr_init+0x20/0x30 [bench]
[  120.297236]  do_one_initcall+0x4b/0x1f5
[  120.301062]  ? do_init_module+0x22/0x233
[  120.304972]  ? kmem_cache_alloc_trace+0xf

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-07 Thread Michal Hocko
On Thu 07-06-18 19:55:53, Hanjun Guo wrote:
> On 2018/6/7 18:55, Michal Hocko wrote:
[...]
> > I am not sure I have the full context but pci_acpi_scan_root calls
> > kzalloc_node(sizeof(*info), GFP_KERNEL, node)
> > and that should fall back to whatever node is online. An offline node
> > shouldn't keep any pages behind. So there must be something else going
> > on here, and the patch is not the right way to handle it. What does
> > faddr2line __alloc_pages_nodemask+0xf0 tell on this kernel?
> 
> The whole context is:
> 
> The system is booted with a NUMA node that has no memory attached to
> it (a memory-less NUMA node), and with NR_CPUS less than the number of
> CPUs presented in the MADT, so the CPUs on this memory-less node are
> not brought up, and this NUMA node will not be online (though the SRAT
> presents this NUMA node);
> 
> Devices attached to this NUMA node, such as the PCI host bridge, still
> return the valid NUMA node via _PXM, but that valid NUMA node is not
> online, which leads to this issue.

But we should have other numa nodes on the zonelists so the allocator
should fall back to another node. If the zonelist is not initialized
properly, though, then this can indeed show up as a problem. Knowing
which exact place has blown up would help get a better picture...
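
For reference, the allocator picks the zonelist roughly like this
(simplified from include/linux/gfp.h):

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

The zonelist fallback only starts once we are inside a valid zonelist;
if NODE_DATA(nid) was never allocated for the node, the dereference
above is exactly the kind of place that blows up.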

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-07 Thread Hanjun Guo
On 2018/6/7 18:55, Michal Hocko wrote:
> On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
>> [+cc akpm, linux-mm, linux-pci]
>>
>> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon  wrote:
>>>
>>> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
 A NUMA system may return a node which is not online.
 For example, a NUMA node:
 1) without memory
 2) NR_CPUS is very small, and the CPUs on the node are not brought up

 In this situation, we use NUMA_NO_NODE to avoid an oops.

 [   25.732905] Unable to handle kernel NULL pointer dereference at virtual 
 address 1988
 [   25.740982] Mem abort info:
 [   25.743762]   ESR = 0x9605
 [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
 [   25.752711]   SET = 0, FnV = 0
 [   25.755751]   EA = 0, S1PTW = 0
 [   25.758878] Data abort info:
 [   25.761745]   ISV = 0, ISS = 0x0005
 [   25.765568]   CM = 0, WnR = 0
 [   25.768521] [1988] user address but active_mm is swapper
 [   25.774861] Internal error: Oops: 9605 [#1] SMP
 [   25.779724] Modules linked in:
 [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ 
 #115
 [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI 
 Nemo 2.0 RC0 - B305 05/28/2018
 [   25.798831] pstate: 80c9 (Nzcv daif +PAN +UAO)
 [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
 [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
 [   25.813252] sp : 0996f660
 [   25.816553] x29: 0996f660 x28: 
 [   25.821852] x27: 014012c0 x26: 
 [   25.827150] x25: 0003 x24: 08099eac
 [   25.832449] x23: 0040 x22: 
 [   25.837747] x21: 0001 x20: 
 [   25.843045] x19: 0040 x18: 00010e00
 [   25.848343] x17: 0437f790 x16: 0020
 [   25.853641] x15:  x14: 6549435020524541
 [   25.858939] x13: 20454d502067756c x12: 
 [   25.864237] x11: 0996f6f0 x10: 0006
 [   25.869536] x9 : 12a4 x8 : 8023c000ff90
 [   25.874834] x7 :  x6 : 08d73c08
 [   25.880132] x5 :  x4 : 0081
 [   25.885430] x3 :  x2 : 
 [   25.890728] x1 : 0001 x0 : 1980
 [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
 [   25.902712] Call trace:
 [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
 [   25.909577]  allocate_slab+0x94/0x590
 [   25.913225]  new_slab+0x68/0xc8
 [   25.916353]  ___slab_alloc+0x444/0x4f8
 [   25.920088]  __slab_alloc+0x50/0x68
 [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
 [   25.928426]  pci_acpi_scan_root+0x94/0x278
 [   25.932510]  acpi_pci_root_add+0x228/0x4b0
 [   25.936593]  acpi_bus_attach+0x10c/0x218
 [   25.940501]  acpi_bus_attach+0xac/0x218
 [   25.944323]  acpi_bus_attach+0xac/0x218
 [   25.948144]  acpi_bus_scan+0x5c/0xc0
 [   25.951708]  acpi_scan_init+0xf8/0x254
 [   25.955443]  acpi_init+0x310/0x37c
 [   25.958831]  do_one_initcall+0x54/0x208
 [   25.962653]  kernel_init_freeable+0x244/0x340
 [   25.966999]  kernel_init+0x18/0x118
 [   25.970474]  ret_from_fork+0x10/0x1c
 [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
 [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
 [   25.984765] Kernel panic - not syncing: Fatal exception

 Signed-off-by: Xie XiuQi 
 Tested-by: Huiqiang Wang 
 Cc: Hanjun Guo 
 Cc: Tomasz Nowicki 
 Cc: Xishi Qiu 
 ---
  arch/arm64/kernel/pci.c | 3 +++
  1 file changed, 3 insertions(+)

 diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
 index 0e2ea1c..e17cc45 100644
 --- a/arch/arm64/kernel/pci.c
 +++ b/arch/arm64/kernel/pci.c
 @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct 
 acpi_pci_root *root)
   struct pci_bus *bus, *child;
   struct acpi_pci_root_ops *root_ops;

 + if (node != NUMA_NO_NODE && !node_online(node))
 + node = NUMA_NO_NODE;
 +
>>>
>>> This really feels like a bodge, but it does appear to be what other
>>> architectures do, so:
>>>
>>> Acked-by: Will Deacon 
>>
>> I agree, this doesn't feel like something we should be working around
>> in the caller of kzalloc_node().
>>
>> I would not expect kzalloc_node() to return memory that's offline, no
>> matter what node we told it to allocate from.  I could imagine it
>> returning failure, or returning memory from a node that *is* online,
>> but returning a pointer to offline memory seems broken.
>>
>> Are we putting memory that's offline in the free list?  I don't know
>> where to look to figure this ou

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-07 Thread Michal Hocko
On Wed 06-06-18 15:39:34, Bjorn Helgaas wrote:
> [+cc akpm, linux-mm, linux-pci]
> 
> On Wed, Jun 6, 2018 at 10:44 AM Will Deacon  wrote:
> >
> > On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> > > A NUMA system may return a node which is not online.
> > > For example, a NUMA node:
> > > 1) without memory
> > > 2) NR_CPUS is very small, and the CPUs on the node are not brought up
> > >
> > > In this situation, we use NUMA_NO_NODE to avoid an oops.
> > >
> > > [   25.732905] Unable to handle kernel NULL pointer dereference at 
> > > virtual address 1988
> > > [   25.740982] Mem abort info:
> > > [   25.743762]   ESR = 0x9605
> > > [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
> > > [   25.752711]   SET = 0, FnV = 0
> > > [   25.755751]   EA = 0, S1PTW = 0
> > > [   25.758878] Data abort info:
> > > [   25.761745]   ISV = 0, ISS = 0x0005
> > > [   25.765568]   CM = 0, WnR = 0
> > > [   25.768521] [1988] user address but active_mm is swapper
> > > [   25.774861] Internal error: Oops: 9605 [#1] SMP
> > > [   25.779724] Modules linked in:
> > > [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ 
> > > #115
> > > [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI 
> > > Nemo 2.0 RC0 - B305 05/28/2018
> > > [   25.798831] pstate: 80c9 (Nzcv daif +PAN +UAO)
> > > [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
> > > [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
> > > [   25.813252] sp : 0996f660
> > > [   25.816553] x29: 0996f660 x28: 
> > > [   25.821852] x27: 014012c0 x26: 
> > > [   25.827150] x25: 0003 x24: 08099eac
> > > [   25.832449] x23: 0040 x22: 
> > > [   25.837747] x21: 0001 x20: 
> > > [   25.843045] x19: 0040 x18: 00010e00
> > > [   25.848343] x17: 0437f790 x16: 0020
> > > [   25.853641] x15:  x14: 6549435020524541
> > > [   25.858939] x13: 20454d502067756c x12: 
> > > [   25.864237] x11: 0996f6f0 x10: 0006
> > > [   25.869536] x9 : 12a4 x8 : 8023c000ff90
> > > [   25.874834] x7 :  x6 : 08d73c08
> > > [   25.880132] x5 :  x4 : 0081
> > > [   25.885430] x3 :  x2 : 
> > > [   25.890728] x1 : 0001 x0 : 1980
> > > [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x
> > > (ptrval))
> > > [   25.902712] Call trace:
> > > [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
> > > [   25.909577]  allocate_slab+0x94/0x590
> > > [   25.913225]  new_slab+0x68/0xc8
> > > [   25.916353]  ___slab_alloc+0x444/0x4f8
> > > [   25.920088]  __slab_alloc+0x50/0x68
> > > [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
> > > [   25.928426]  pci_acpi_scan_root+0x94/0x278
> > > [   25.932510]  acpi_pci_root_add+0x228/0x4b0
> > > [   25.936593]  acpi_bus_attach+0x10c/0x218
> > > [   25.940501]  acpi_bus_attach+0xac/0x218
> > > [   25.944323]  acpi_bus_attach+0xac/0x218
> > > [   25.948144]  acpi_bus_scan+0x5c/0xc0
> > > [   25.951708]  acpi_scan_init+0xf8/0x254
> > > [   25.955443]  acpi_init+0x310/0x37c
> > > [   25.958831]  do_one_initcall+0x54/0x208
> > > [   25.962653]  kernel_init_freeable+0x244/0x340
> > > [   25.966999]  kernel_init+0x18/0x118
> > > [   25.970474]  ret_from_fork+0x10/0x1c
> > > [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
> > > [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
> > > [   25.984765] Kernel panic - not syncing: Fatal exception
> > >
> > > Signed-off-by: Xie XiuQi 
> > > Tested-by: Huiqiang Wang 
> > > Cc: Hanjun Guo 
> > > Cc: Tomasz Nowicki 
> > > Cc: Xishi Qiu 
> > > ---
> > >  arch/arm64/kernel/pci.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > > index 0e2ea1c..e17cc45 100644
> > > --- a/arch/arm64/kernel/pci.c
> > > +++ b/arch/arm64/kernel/pci.c
> > > @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct 
> > > acpi_pci_root *root)
> > >   struct pci_bus *bus, *child;
> > >   struct acpi_pci_root_ops *root_ops;
> > >
> > > + if (node != NUMA_NO_NODE && !node_online(node))
> > > + node = NUMA_NO_NODE;
> > > +
> >
> > This really feels like a bodge, but it does appear to be what other
> > architectures do, so:
> >
> > Acked-by: Will Deacon 
> 
> I agree, this doesn't feel like something we should be working around
> in the caller of kzalloc_node().
> 
> I would not expect kzalloc_node() to return memory that's offline, no
> matter what node we told it to allocate from.  I could imagine it
> returning failure, or returning memory from a node that *is* online,
> but returning a pointer to offline memory seems broken.
> 
> Are we putting memory that's offline in the free list?  I 

Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-06 Thread Bjorn Helgaas
[+cc akpm, linux-mm, linux-pci]

On Wed, Jun 6, 2018 at 10:44 AM Will Deacon  wrote:
>
> On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> > A NUMA system may return a node which is not online.
> > For example, a NUMA node:
> > 1) without memory
> > 2) NR_CPUS is very small, and the CPUs on the node are not brought up
> >
> > In this situation, we use NUMA_NO_NODE to avoid an oops.
> >
> > [   25.732905] Unable to handle kernel NULL pointer dereference at virtual 
> > address 1988
> > [   25.740982] Mem abort info:
> > [   25.743762]   ESR = 0x9605
> > [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
> > [   25.752711]   SET = 0, FnV = 0
> > [   25.755751]   EA = 0, S1PTW = 0
> > [   25.758878] Data abort info:
> > [   25.761745]   ISV = 0, ISS = 0x0005
> > [   25.765568]   CM = 0, WnR = 0
> > [   25.768521] [1988] user address but active_mm is swapper
> > [   25.774861] Internal error: Oops: 9605 [#1] SMP
> > [   25.779724] Modules linked in:
> > [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ 
> > #115
> > [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI 
> > Nemo 2.0 RC0 - B305 05/28/2018
> > [   25.798831] pstate: 80c9 (Nzcv daif +PAN +UAO)
> > [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
> > [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
> > [   25.813252] sp : 0996f660
> > [   25.816553] x29: 0996f660 x28: 
> > [   25.821852] x27: 014012c0 x26: 
> > [   25.827150] x25: 0003 x24: 08099eac
> > [   25.832449] x23: 0040 x22: 
> > [   25.837747] x21: 0001 x20: 
> > [   25.843045] x19: 0040 x18: 00010e00
> > [   25.848343] x17: 0437f790 x16: 0020
> > [   25.853641] x15:  x14: 6549435020524541
> > [   25.858939] x13: 20454d502067756c x12: 
> > [   25.864237] x11: 0996f6f0 x10: 0006
> > [   25.869536] x9 : 12a4 x8 : 8023c000ff90
> > [   25.874834] x7 :  x6 : 08d73c08
> > [   25.880132] x5 :  x4 : 0081
> > [   25.885430] x3 :  x2 : 
> > [   25.890728] x1 : 0001 x0 : 1980
> > [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
> > [   25.902712] Call trace:
> > [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
> > [   25.909577]  allocate_slab+0x94/0x590
> > [   25.913225]  new_slab+0x68/0xc8
> > [   25.916353]  ___slab_alloc+0x444/0x4f8
> > [   25.920088]  __slab_alloc+0x50/0x68
> > [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
> > [   25.928426]  pci_acpi_scan_root+0x94/0x278
> > [   25.932510]  acpi_pci_root_add+0x228/0x4b0
> > [   25.936593]  acpi_bus_attach+0x10c/0x218
> > [   25.940501]  acpi_bus_attach+0xac/0x218
> > [   25.944323]  acpi_bus_attach+0xac/0x218
> > [   25.948144]  acpi_bus_scan+0x5c/0xc0
> > [   25.951708]  acpi_scan_init+0xf8/0x254
> > [   25.955443]  acpi_init+0x310/0x37c
> > [   25.958831]  do_one_initcall+0x54/0x208
> > [   25.962653]  kernel_init_freeable+0x244/0x340
> > [   25.966999]  kernel_init+0x18/0x118
> > [   25.970474]  ret_from_fork+0x10/0x1c
> > [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
> > [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
> > [   25.984765] Kernel panic - not syncing: Fatal exception
> >
> > Signed-off-by: Xie XiuQi 
> > Tested-by: Huiqiang Wang 
> > Cc: Hanjun Guo 
> > Cc: Tomasz Nowicki 
> > Cc: Xishi Qiu 
> > ---
> >  arch/arm64/kernel/pci.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > index 0e2ea1c..e17cc45 100644
> > --- a/arch/arm64/kernel/pci.c
> > +++ b/arch/arm64/kernel/pci.c
> > @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root 
> > *root)
> >   struct pci_bus *bus, *child;
> >   struct acpi_pci_root_ops *root_ops;
> >
> > + if (node != NUMA_NO_NODE && !node_online(node))
> > + node = NUMA_NO_NODE;
> > +
>
> This really feels like a bodge, but it does appear to be what other
> architectures do, so:
>
> Acked-by: Will Deacon 

I agree, this doesn't feel like something we should be working around in
the caller of kzalloc_node().

I would not expect kzalloc_node() to return memory that's offline, no
matter what node we told it to allocate from.  I could imagine it
returning failure, or returning memory from a node that *is* online,
but returning a pointer to offline memory seems broken.

Are we putting memory that's offline in the free list?  I don't know
where to look to figure this out.

Bjorn


Re: [PATCH 1/2] arm64: avoid alloc memory on offline node

2018-06-06 Thread Will Deacon
On Thu, May 31, 2018 at 08:14:38PM +0800, Xie XiuQi wrote:
> A NUMA system may return a node which is not online.
> For example, a NUMA node:
> 1) without memory
> 2) NR_CPUS is very small, and the CPUs on the node are not brought up
> 
> In this situation, we use NUMA_NO_NODE to avoid an oops.
> 
> [   25.732905] Unable to handle kernel NULL pointer dereference at virtual 
> address 1988
> [   25.740982] Mem abort info:
> [   25.743762]   ESR = 0x9605
> [   25.746803]   Exception class = DABT (current EL), IL = 32 bits
> [   25.752711]   SET = 0, FnV = 0
> [   25.755751]   EA = 0, S1PTW = 0
> [   25.758878] Data abort info:
> [   25.761745]   ISV = 0, ISS = 0x0005
> [   25.765568]   CM = 0, WnR = 0
> [   25.768521] [1988] user address but active_mm is swapper
> [   25.774861] Internal error: Oops: 9605 [#1] SMP
> [   25.779724] Modules linked in:
> [   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
> [   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 
> 2.0 RC0 - B305 05/28/2018
> [   25.798831] pstate: 80c9 (Nzcv daif +PAN +UAO)
> [   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
> [   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
> [   25.813252] sp : 0996f660
> [   25.816553] x29: 0996f660 x28: 
> [   25.821852] x27: 014012c0 x26: 
> [   25.827150] x25: 0003 x24: 08099eac
> [   25.832449] x23: 0040 x22: 
> [   25.837747] x21: 0001 x20: 
> [   25.843045] x19: 0040 x18: 00010e00
> [   25.848343] x17: 0437f790 x16: 0020
> [   25.853641] x15:  x14: 6549435020524541
> [   25.858939] x13: 20454d502067756c x12: 
> [   25.864237] x11: 0996f6f0 x10: 0006
> [   25.869536] x9 : 12a4 x8 : 8023c000ff90
> [   25.874834] x7 :  x6 : 08d73c08
> [   25.880132] x5 :  x4 : 0081
> [   25.885430] x3 :  x2 : 
> [   25.890728] x1 : 0001 x0 : 1980
> [   25.896027] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
> [   25.902712] Call trace:
> [   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
> [   25.909577]  allocate_slab+0x94/0x590
> [   25.913225]  new_slab+0x68/0xc8
> [   25.916353]  ___slab_alloc+0x444/0x4f8
> [   25.920088]  __slab_alloc+0x50/0x68
> [   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
> [   25.928426]  pci_acpi_scan_root+0x94/0x278
> [   25.932510]  acpi_pci_root_add+0x228/0x4b0
> [   25.936593]  acpi_bus_attach+0x10c/0x218
> [   25.940501]  acpi_bus_attach+0xac/0x218
> [   25.944323]  acpi_bus_attach+0xac/0x218
> [   25.948144]  acpi_bus_scan+0x5c/0xc0
> [   25.951708]  acpi_scan_init+0xf8/0x254
> [   25.955443]  acpi_init+0x310/0x37c
> [   25.958831]  do_one_initcall+0x54/0x208
> [   25.962653]  kernel_init_freeable+0x244/0x340
> [   25.966999]  kernel_init+0x18/0x118
> [   25.970474]  ret_from_fork+0x10/0x1c
> [   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
> [   25.980162] ---[ end trace 64f0893eb21ec283 ]---
> [   25.984765] Kernel panic - not syncing: Fatal exception
> 
> Signed-off-by: Xie XiuQi 
> Tested-by: Huiqiang Wang 
> Cc: Hanjun Guo 
> Cc: Tomasz Nowicki 
> Cc: Xishi Qiu 
> ---
>  arch/arm64/kernel/pci.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> index 0e2ea1c..e17cc45 100644
> --- a/arch/arm64/kernel/pci.c
> +++ b/arch/arm64/kernel/pci.c
> @@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root 
> *root)
>   struct pci_bus *bus, *child;
>   struct acpi_pci_root_ops *root_ops;
>  
> + if (node != NUMA_NO_NODE && !node_online(node))
> + node = NUMA_NO_NODE;
> +

This really feels like a bodge, but it does appear to be what other
architectures do, so:

Acked-by: Will Deacon 

Will


[PATCH 1/2] arm64: avoid alloc memory on offline node

2018-05-31 Thread Xie XiuQi
A NUMA system may return a node which is not online.
For example, a NUMA node:
1) without memory
2) NR_CPUS is very small, and the CPUs on the node are not brought up

In this situation, we use NUMA_NO_NODE to avoid an oops.

[   25.732905] Unable to handle kernel NULL pointer dereference at virtual 
address 1988
[   25.740982] Mem abort info:
[   25.743762]   ESR = 0x9605
[   25.746803]   Exception class = DABT (current EL), IL = 32 bits
[   25.752711]   SET = 0, FnV = 0
[   25.755751]   EA = 0, S1PTW = 0
[   25.758878] Data abort info:
[   25.761745]   ISV = 0, ISS = 0x0005
[   25.765568]   CM = 0, WnR = 0
[   25.768521] [1988] user address but active_mm is swapper
[   25.774861] Internal error: Oops: 9605 [#1] SMP
[   25.779724] Modules linked in:
[   25.782768] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc6-mpam+ #115
[   25.789714] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 EC UEFI Nemo 
2.0 RC0 - B305 05/28/2018
[   25.798831] pstate: 80c9 (Nzcv daif +PAN +UAO)
[   25.803612] pc : __alloc_pages_nodemask+0xf0/0xe70
[   25.808389] lr : __alloc_pages_nodemask+0x184/0xe70
[   25.813252] sp : 0996f660
[   25.816553] x29: 0996f660 x28: 
[   25.821852] x27: 014012c0 x26: 
[   25.827150] x25: 0003 x24: 08099eac
[   25.832449] x23: 0040 x22: 
[   25.837747] x21: 0001 x20: 
[   25.843045] x19: 0040 x18: 00010e00
[   25.848343] x17: 0437f790 x16: 0020
[   25.853641] x15:  x14: 6549435020524541
[   25.858939] x13: 20454d502067756c x12: 
[   25.864237] x11: 0996f6f0 x10: 0006
[   25.869536] x9 : 12a4 x8 : 8023c000ff90
[   25.874834] x7 :  x6 : 08d73c08
[   25.880132] x5 :  x4 : 0081
[   25.885430] x3 :  x2 : 
[   25.890728] x1 : 0001 x0 : 1980
[   25.896027] Process swapper/0 (pid: 1, stack limit = 0x(ptrval))
[   25.902712] Call trace:
[   25.905146]  __alloc_pages_nodemask+0xf0/0xe70
[   25.909577]  allocate_slab+0x94/0x590
[   25.913225]  new_slab+0x68/0xc8
[   25.916353]  ___slab_alloc+0x444/0x4f8
[   25.920088]  __slab_alloc+0x50/0x68
[   25.923562]  kmem_cache_alloc_node_trace+0xe8/0x230
[   25.928426]  pci_acpi_scan_root+0x94/0x278
[   25.932510]  acpi_pci_root_add+0x228/0x4b0
[   25.936593]  acpi_bus_attach+0x10c/0x218
[   25.940501]  acpi_bus_attach+0xac/0x218
[   25.944323]  acpi_bus_attach+0xac/0x218
[   25.948144]  acpi_bus_scan+0x5c/0xc0
[   25.951708]  acpi_scan_init+0xf8/0x254
[   25.955443]  acpi_init+0x310/0x37c
[   25.958831]  do_one_initcall+0x54/0x208
[   25.962653]  kernel_init_freeable+0x244/0x340
[   25.966999]  kernel_init+0x18/0x118
[   25.970474]  ret_from_fork+0x10/0x1c
[   25.974036] Code: 7100047f 321902a4 1a950095 b5000602 (b9400803)
[   25.980162] ---[ end trace 64f0893eb21ec283 ]---
[   25.984765] Kernel panic - not syncing: Fatal exception

Signed-off-by: Xie XiuQi 
Tested-by: Huiqiang Wang 
Cc: Hanjun Guo 
Cc: Tomasz Nowicki 
Cc: Xishi Qiu 
---
 arch/arm64/kernel/pci.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 0e2ea1c..e17cc45 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -170,6 +170,9 @@ struct pci_bus *pci_acpi_scan_root(struct acpi_pci_root 
*root)
struct pci_bus *bus, *child;
struct acpi_pci_root_ops *root_ops;
 
+   if (node != NUMA_NO_NODE && !node_online(node))
+   node = NUMA_NO_NODE;
+
ri = kzalloc_node(sizeof(*ri), GFP_KERNEL, node);
if (!ri)
return NULL;
-- 
1.8.3.1