Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
[...]
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f54..4dc497d 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
>  {
> int cpu;
> u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
> +   int node, nr;
> 
> BUG_ON(cpu_to_apicid == NULL);
> +   nr = cpumask_weight(cpu_possible_mask);
> +
> +   /* Bring up all possible nodes, since dev->numa_node may reference */
> +   /* them; should check ACPI works for possible nodes. */
> +   for_each_node(node)
> +   if (!node_online(node))
> +   init_memory_less_node(node);

I suspect there is no change if you replace for_each_node by
for_each_node_mask(nid, node_possible_map)

here. If that is the case then we are probably calling
free_area_init_node too early. I do not see it yet though.
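[A minimal sketch of why no change would be expected, assuming the nodemask
helpers of that era: for_each_node(node) expands to
for_each_node_state(node, N_POSSIBLE), and node_possible_map is
node_states[N_POSSIBLE], so both loops below walk the same mask.]

	int nid;

	/* These two loops should visit exactly the same node ids. */
	for_each_node(nid)
		pr_info("possible node %d\n", nid);

	for_each_node_mask(nid, node_possible_map)
		pr_info("possible node %d\n", nid);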
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 22:27:13, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 10:22 PM Michal Hocko  wrote:
> >
> > On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
> > [...]
> > > Hi Michal,
> > >
> > > As I mentioned in my previous email, I have manually applied the patch,
> > > and it does not work for a normal bootup.
> >
> > I am sorry, I have misread your previous response. Is there anything
> > interesting on the serial console by any chance?
> 
> Nothing. It needs more effort to debug. But as I mentioned, enabling all
> of the remaining possible nodes makes it work. Maybe that can be of some
> help to you.

I will have a look. Thanks!

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Pingfan Liu
On Fri, Dec 7, 2018 at 10:22 PM Michal Hocko  wrote:
>
> On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
> [...]
> > Hi Michal,
> >
> > As I mentioned in my previous email, I have manually applied the patch,
> > and it does not work for a normal bootup.
>
> I am sorry, I have misread your previous response. Is there anything
> interesting on the serial console by any chance?

Nothing. It needs more effort to debug. But as I mentioned, enabling all
of the remaining possible nodes makes it work. Maybe that can be of some
help to you.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..4dc497d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
 {
int cpu;
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
+   int node, nr;

BUG_ON(cpu_to_apicid == NULL);
+   nr = cpumask_weight(cpu_possible_mask);
+
+   /* Bring up all possible nodes, since dev->numa_node may reference */
+   /* them; should check ACPI works for possible nodes. */
+   for_each_node(node)
+   if (!node_online(node))
+   init_memory_less_node(node);

for_each_possible_cpu(cpu) {
-   int node = numa_cpu_node(cpu);
+   node = numa_cpu_node(cpu);

if (node == NUMA_NO_NODE)
continue;

-   if (!node_online(node))
-   init_memory_less_node(node);
-
numa_set_node(cpu, node);
}
 }

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 21:20:17, Pingfan Liu wrote:
[...]
> Hi Michal,
> 
> As I mentioned in my previous email, I have manually applied the patch,
> and it does not work for a normal bootup.

I am sorry, I have misread your previous response. Is there anything
interesting on the serial console by any chance?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Pingfan Liu
On Fri, Dec 7, 2018 at 7:30 PM Michal Hocko  wrote:
>
[...]
> On Fri 07-12-18 17:40:09, Pingfan Liu wrote:
> > On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko  wrote:
> > >
> > > On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> > > [...]
> > > In short, the fix should take two factors into account:
> > > the semantics of an online node and the effect on all architectures
> > >
> > > I am pretty sure there is a lot of room for unification in this area.
> > > Nevertheless I strongly believe the bug should be fixed first in the
> > > simplest way and all the cleanup should be done on top.
> > >
> > > Do I get it right that the diff worked for you and I can prepare a full
> > > patch?
> > >
> Sure, I am glad to test your new patch.
>
> From 46e68be89d9c299fd497b2b8bea3f2add144f17f Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 7 Dec 2018 12:23:32 +0100
> Subject: [PATCH] x86, numa: always initialize all possible nodes
>
> Pingfan Liu has reported the following splat
> [5.772742] BUG: unable to handle kernel paging request at 2088
> [5.773618] PGD 0 P4D 0
> [5.773618] Oops:  [#1] SMP NOPTI
> [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
> [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
> [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> [5.773618] RAX:  RBX: 006012c0 RCX: 
> [5.773618] RDX:  RSI: 0002 RDI: 2080
> [5.773618] RBP: 006012c0 R08:  R09: 0002
> [5.773618] R10: 006080c0 R11: 0002 R12: 
> [5.773618] R13: 0001 R14:  R15: 0002
> [5.773618] FS:  () GS:8c69afe0() knlGS:
> [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
> [5.773618] Call Trace:
> [5.773618]  new_slab+0xa9/0x570
> [5.773618]  ___slab_alloc+0x375/0x540
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  __slab_alloc+0x1c/0x38
> [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  devm_kmalloc+0x28/0x60
> [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  really_probe+0x73/0x420
> [5.773618]  driver_probe_device+0x115/0x130
> [5.773618]  __driver_attach+0x103/0x110
> [5.773618]  ? driver_probe_device+0x130/0x130
> [5.773618]  bus_for_each_dev+0x67/0xc0
> [5.773618]  ? klist_add_tail+0x3b/0x70
> [5.773618]  bus_add_driver+0x41/0x260
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  driver_register+0x5b/0xe0
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  do_one_initcall+0x4e/0x1d4
> [5.773618]  ? init_setup+0x25/0x28
> [5.773618]  kernel_init_freeable+0x1c1/0x26e
> [5.773618]  ? loglevel+0x5b/0x5b
> [5.773618]  ? rest_init+0xb0/0xb0
> [5.773618]  kernel_init+0xa/0x110
> [5.773618]  ret_from_fork+0x22/0x40
> [5.773618] Modules linked in:
> [5.773618] CR2: 2088
> [5.773618] ---[ end trace 1030c9120a03d081 ]---
>
> with his AMD machine with the following topology
>   NUMA node0 CPU(s): 0,8,16,24
>   NUMA node1 CPU(s): 2,10,18,26
>   NUMA node2 CPU(s): 4,12,20,28
>   NUMA node3 CPU(s): 6,14,22,30
>   NUMA node4 CPU(s): 1,9,17,25
>   NUMA node5 CPU(s): 3,11,19,27
>   NUMA node6 CPU(s): 5,13,21,29
>   NUMA node7 CPU(s): 7,15,23,31
>
> [0.007418] Early memory node ranges
> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> [0.007420]   node   1: [mem 0x0009-0x0009]
> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
>
> and nr_cpus set to 4. The underlying reason is that the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus, which nr_cpus restricts.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.
>
> Fix the issue by moving init_memory_less_node into numa_register_memblks
> and always initializing all possible nodes consistently in a single place.
>
> 

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Michal Hocko
On Fri 07-12-18 17:40:09, Pingfan Liu wrote:
> On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko  wrote:
> >
> > On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> > [...]
> > > In short, the fix should take two factors into account:
> > > the semantics of an online node and the effect on all architectures
> >
> > I am pretty sure there is a lot of room for unification in this area.
> > Nevertheless I strongly believe the bug should be fixed first in the
> > simplest way and all the cleanup should be done on top.
> >
> > Do I get it right that the diff worked for you and I can prepare a full
> > patch?
> >
> Sure, I am glad to test your new patch.

>From 46e68be89d9c299fd497b2b8bea3f2add144f17f Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 7 Dec 2018 12:23:32 +0100
Subject: [PATCH] x86, numa: always initialize all possible nodes

Pingfan Liu has reported the following splat
[5.772742] BUG: unable to handle kernel paging request at 2088
[5.773618] PGD 0 P4D 0
[5.773618] Oops:  [#1] SMP NOPTI
[5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
[5.773618] RAX:  RBX: 006012c0 RCX: 
[5.773618] RDX:  RSI: 0002 RDI: 2080
[5.773618] RBP: 006012c0 R08:  R09: 0002
[5.773618] R10: 006080c0 R11: 0002 R12: 
[5.773618] R13: 0001 R14:  R15: 0002
[5.773618] FS:  () GS:8c69afe0() knlGS:
[5.773618] CS:  0010 DS:  ES:  CR0: 80050033
[5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
[5.773618] Call Trace:
[5.773618]  new_slab+0xa9/0x570
[5.773618]  ___slab_alloc+0x375/0x540
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  __slab_alloc+0x1c/0x38
[5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  devm_kmalloc+0x28/0x60
[5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  really_probe+0x73/0x420
[5.773618]  driver_probe_device+0x115/0x130
[5.773618]  __driver_attach+0x103/0x110
[5.773618]  ? driver_probe_device+0x130/0x130
[5.773618]  bus_for_each_dev+0x67/0xc0
[5.773618]  ? klist_add_tail+0x3b/0x70
[5.773618]  bus_add_driver+0x41/0x260
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  driver_register+0x5b/0xe0
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  do_one_initcall+0x4e/0x1d4
[5.773618]  ? init_setup+0x25/0x28
[5.773618]  kernel_init_freeable+0x1c1/0x26e
[5.773618]  ? loglevel+0x5b/0x5b
[5.773618]  ? rest_init+0xb0/0xb0
[5.773618]  kernel_init+0xa/0x110
[5.773618]  ret_from_fork+0x22/0x40
[5.773618] Modules linked in:
[5.773618] CR2: 2088
[5.773618] ---[ end trace 1030c9120a03d081 ]---

with his AMD machine with the following topology
  NUMA node0 CPU(s): 0,8,16,24
  NUMA node1 CPU(s): 2,10,18,26
  NUMA node2 CPU(s): 4,12,20,28
  NUMA node3 CPU(s): 6,14,22,30
  NUMA node4 CPU(s): 1,9,17,25
  NUMA node5 CPU(s): 3,11,19,27
  NUMA node6 CPU(s): 5,13,21,29
  NUMA node7 CPU(s): 7,15,23,31

[0.007418] Early memory node ranges
[0.007419]   node   1: [mem 0x1000-0x0008efff]
[0.007420]   node   1: [mem 0x0009-0x0009]
[0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
[0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
[0.007423]   node   1: [mem 0x6c528000-0x6fff]
[0.007424]   node   1: [mem 0x0001-0x00047fff]
[0.007425]   node   5: [mem 0x00048000-0x00087eff]

and nr_cpus set to 4. The underlying reason is that the device is bound
to node 2 which doesn't have any memory and init_cpu_to_node only
initializes memory-less nodes for possible cpus, which nr_cpus restricts.
This in turn means that proper zonelists are not allocated and the page
allocator blows up.

Fix the issue by moving init_memory_less_node into numa_register_memblks
and always initializing all possible nodes consistently in a single place.

Reported-by: Pingfan Liu 
Signed-off-by: Michal Hocko 
---
 arch/x86/mm/numa.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..4575ae4d5449 100644
--- 
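[For context on the splat in the changelog above, a hedged reading that the
register dump supports (RDI is 2080 and the faulting address is 2088):
node_zonelist(), quoted elsewhere in this thread, performs no validation of
nid, so a node whose pg_data_t was never allocated turns into a
small-offset dereference of NULL.]

/* include/linux/gfp.h, as quoted later in this thread; no check on nid: */
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
/*
 * With NODE_DATA(2) == NULL (node 2 is never initialized under
 * nr_cpus=4), the returned pointer is just NULL plus the
 * node_zonelists offset, and the first read through it faults.
 */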

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-07 Thread Pingfan Liu
On Fri, Dec 7, 2018 at 3:53 PM Michal Hocko  wrote:
>
> On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
> [...]
> > In short, the fix should take two factors into account:
> > the semantics of an online node and the effect on all architectures
>
> I am pretty sure there is a lot of room for unification in this area.
> Nevertheless I strongly believe the bug should be fixed first in the
> simplest way and all the cleanup should be done on top.
>
> Do I get it right that the diff worked for you and I can prepare a full
> patch?
>
Sure, I am glad to test your new patch.

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Michal Hocko
On Fri 07-12-18 10:56:51, Pingfan Liu wrote:
[...]
> In short, the fix should take two factors into account:
> the semantics of an online node and the effect on all architectures

I am pretty sure there is a lot of room for unification in this area.
Nevertheless I strongly believe the bug should be fixed first in the
simplest way and all the cleanup should be done on top.

Do I get it right that the diff worked for you and I can prepare a full
patch?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Pingfan Liu
On Thu, Dec 6, 2018 at 8:11 PM Michal Hocko  wrote:
>
> On Thu 06-12-18 18:44:03, Pingfan Liu wrote:
> > On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu  wrote:
> [...]
> > > Which commit does this patch apply on? I cannot apply it on the latest
> > > Linux tree.
> > >
> > I applied it manually and will see the test result. I think it should
> > work since you instantiate all the nodes.
> > But there are two things worth considering:
> > -1st. why doesn't x86 bring up all nodes by default? Apparently it
> > would be simpler that way
>
> What do you mean? Why wasn't it brought up before? Or do you see some

Yes, this is what I mean. But maybe the author did not consider
nr_cpus; otherwise, using:
+   for_each_node(node)
+   if (!node_online(node))
+   init_memory_less_node(node);
in init_cpu_to_node() would be simpler.

> nodes not being brought up after this patch?
>
> > -2nd. there are other archs; do they obey the rules?
>
> I am afraid that each arch does its own initialization.

Then it is arguable whether to fix this issue in the memory core or let
each arch fix it. I checked the powerpc code; it should also
need a fix, and it may be the same on arm and mips.
BTW, your patch cannot work for a normal bootup, and the kernel hangs
without any kernel message.
I think it is due to the bug in the patch:
alloc_node_data(nid);
+   if (!end)
+   init_memory_less_node(nid); /* which calls alloc_node_data(nid) also */
How about the following:
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f54..4dc497d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -754,18 +754,23 @@ void __init init_cpu_to_node(void)
 {
int cpu;
u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
+   int node, nr;

BUG_ON(cpu_to_apicid == NULL);
+   nr = cpumask_weight(cpu_possible_mask);
+
+   /* Bring up all possible nodes, since dev->numa_node may reference */
+   /* them; should check ACPI works for possible nodes. */
+   for_each_node(node)
+   if (!node_online(node))
+   init_memory_less_node(node);

for_each_possible_cpu(cpu) {
-   int node = numa_cpu_node(cpu);
+   node = numa_cpu_node(cpu);

if (node == NUMA_NO_NODE)
continue;

-   if (!node_online(node))
-   init_memory_less_node(node);
-
numa_set_node(cpu, node);
}
 }

Although it works, I hesitate about the idea because of the semantics of
an online node: does an online node require either a cpu or memory inside
the node to be online?
In short, the fix should take two factors into account: the semantics of
an online node and the effect on all architectures

Thanks,
Pingfan

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 18:44:03, Pingfan Liu wrote:
> On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu  wrote:
[...]
> > Which commit does this patch apply on? I cannot apply it on the latest
> > Linux tree.
> >
> I applied it manually and will see the test result. I think it should
> work since you instantiate all the nodes.
> But there are two things worth considering:
> -1st. why doesn't x86 bring up all nodes by default? Apparently it
> would be simpler that way

What do you mean? Why wasn't it brought up before? Or do you see some
nodes not being brought up after this patch?

> -2nd. there are other archs; do they obey the rules?

I am afraid that each arch does its own initialization.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Pingfan Liu
On Thu, Dec 6, 2018 at 6:03 PM Pingfan Liu  wrote:
>
> [...]
> > Thanks for pointing this out. It made my life easier. So I think the
> > bug is that we call init_memory_less_node from this path. I suspect
> > numa_register_memblks is the right place to do this. I admit I
> > am not 100% sure, but could you give this a try please?
> >
> Sure.
>
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 1308f5408bf7..4575ae4d5449 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> > }
> >  }
> >
> > +static void __init init_memory_less_node(int nid)
> > +{
> > +   unsigned long zones_size[MAX_NR_ZONES] = {0};
> > +   unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > +
> > +   free_area_init_node(nid, zones_size, 0, zholes_size);
> > +
> > +   /*
> > +* All zonelists will be built later in start_kernel() after per cpu
> > +* areas are initialized.
> > +*/
> > +}
> > +
> >  static int __init numa_register_memblks(struct numa_meminfo *mi)
> >  {
> > unsigned long uninitialized_var(pfn_align);
> > @@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct 
> > numa_meminfo *mi)
> > continue;
> >
> > alloc_node_data(nid);
> > +   if (!end)
> > +   init_memory_less_node(nid);
> > }
> >
> > /* Dump memblock with node info and return. */
> > @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
> > numa_init(dummy_numa_init);
> >  }
> >
> > -static void __init init_memory_less_node(int nid)
> > -{
> > -   unsigned long zones_size[MAX_NR_ZONES] = {0};
> > -   unsigned long zholes_size[MAX_NR_ZONES] = {0};
> > -
> > -   /* Allocate and initialize node data. Memory-less node is now 
> > online.*/
> > -   alloc_node_data(nid);
> > -   free_area_init_node(nid, zones_size, 0, zholes_size);
> > -
> > -   /*
> > -* All zonelists will be built later in start_kernel() after per cpu
> > -* areas are initialized.
> > -*/
> > -}
> > -
> >  /*
> >   * Setup early cpu_to_node.
> >   *
> > @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
> > if (node == NUMA_NO_NODE)
> > continue;
> >
> > -   if (!node_online(node))
> > -   init_memory_less_node(node);
> > -
> > numa_set_node(cpu, node);
> > }
> >  }
> > --
> Which commit does this patch apply on? I cannot apply it on the latest
> Linux tree.
>
I applied it manually and will see the test result. I think it should
work since you instantiate all the nodes.
But there are two things worth considering:
-1st. why doesn't x86 bring up all nodes by default? Apparently it
would be simpler that way
-2nd. there are other archs; do they obey the rules?

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Pingfan Liu
[...]
> Thanks for pointing this out. It made my life easier. So I think the
> bug is that we call init_memory_less_node from this path. I suspect
> numa_register_memblks is the right place to do this. I admit I
> am not 100% sure, but could you give this a try please?
>
Sure.

> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 1308f5408bf7..4575ae4d5449 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
> }
>  }
>
> +static void __init init_memory_less_node(int nid)
> +{
> +   unsigned long zones_size[MAX_NR_ZONES] = {0};
> +   unsigned long zholes_size[MAX_NR_ZONES] = {0};
> +
> +   free_area_init_node(nid, zones_size, 0, zholes_size);
> +
> +   /*
> +* All zonelists will be built later in start_kernel() after per cpu
> +* areas are initialized.
> +*/
> +}
> +
>  static int __init numa_register_memblks(struct numa_meminfo *mi)
>  {
> unsigned long uninitialized_var(pfn_align);
> @@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct 
> numa_meminfo *mi)
> continue;
>
> alloc_node_data(nid);
> +   if (!end)
> +   init_memory_less_node(nid);
> }
>
> /* Dump memblock with node info and return. */
> @@ -721,21 +736,6 @@ void __init x86_numa_init(void)
> numa_init(dummy_numa_init);
>  }
>
> -static void __init init_memory_less_node(int nid)
> -{
> -   unsigned long zones_size[MAX_NR_ZONES] = {0};
> -   unsigned long zholes_size[MAX_NR_ZONES] = {0};
> -
> -   /* Allocate and initialize node data. Memory-less node is now 
> online.*/
> -   alloc_node_data(nid);
> -   free_area_init_node(nid, zones_size, 0, zholes_size);
> -
> -   /*
> -* All zonelists will be built later in start_kernel() after per cpu
> -* areas are initialized.
> -*/
> -}
> -
>  /*
>   * Setup early cpu_to_node.
>   *
> @@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
> if (node == NUMA_NO_NODE)
> continue;
>
> -   if (!node_online(node))
> -   init_memory_less_node(node);
> -
> numa_set_node(cpu, node);
> }
>  }
> --
Which commit does this patch apply on? I cannot apply it on the latest Linux tree.

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-06 Thread Michal Hocko
On Thu 06-12-18 11:07:33, Pingfan Liu wrote:
> On Wed, Dec 5, 2018 at 5:40 PM Vlastimil Babka  wrote:
> >
> > On 12/5/18 10:29 AM, Pingfan Liu wrote:
> > >> [0.007418] Early memory node ranges
> > >> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > >> [0.007420]   node   1: [mem 0x0009-0x0009]
> > >> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > >> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > >> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > >> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > >> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> > >>
> > >> There is clearly no node2. Where did the driver get the node2 from?
> >
> > I don't understand these tables too well, but it seems the other nodes
> > exist without them:
> >
> > [0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
> >
> > Maybe the nodes are hotpluggable or something?
> >
> I am also not sure about it; I just had a hurried look at the ACPI spec. I
> will reply in another email and Cc some ACPI folks about it
> 
> > Since nr_cpus=4 is used, node2 is not instantiated by the x86
> > initializing code.
> >
> > Indeed, nr_cpus seems to restrict what nodes we allocate and populate
> > zonelists for.
> 
> Yes, in init_cpu_to_node(), since nr_cpus limits the possible cpus,
> which affects the loop for_each_possible_cpu(cpu) and skips node2
> in this case.

Thanks for pointing this out. It made my life easier. So I think the
bug is that we call init_memory_less_node from this path. I suspect
numa_register_memblks is the right place to do this. I admit I
am not 100% sure, but could you give this a try please?

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1308f5408bf7..4575ae4d5449 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -527,6 +527,19 @@ static void __init numa_clear_kernel_node_hotplug(void)
}
 }
 
+static void __init init_memory_less_node(int nid)
+{
+   unsigned long zones_size[MAX_NR_ZONES] = {0};
+   unsigned long zholes_size[MAX_NR_ZONES] = {0};
+
+   free_area_init_node(nid, zones_size, 0, zholes_size);
+
+   /*
+* All zonelists will be built later in start_kernel() after per cpu
+* areas are initialized.
+*/
+}
+
 static int __init numa_register_memblks(struct numa_meminfo *mi)
 {
unsigned long uninitialized_var(pfn_align);
@@ -592,6 +605,8 @@ static int __init numa_register_memblks(struct numa_meminfo 
*mi)
continue;
 
alloc_node_data(nid);
+   if (!end)
+   init_memory_less_node(nid);
}
 
/* Dump memblock with node info and return. */
@@ -721,21 +736,6 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
 }
 
-static void __init init_memory_less_node(int nid)
-{
-   unsigned long zones_size[MAX_NR_ZONES] = {0};
-   unsigned long zholes_size[MAX_NR_ZONES] = {0};
-
-   /* Allocate and initialize node data. Memory-less node is now online.*/
-   alloc_node_data(nid);
-   free_area_init_node(nid, zones_size, 0, zholes_size);
-
-   /*
-* All zonelists will be built later in start_kernel() after per cpu
-* areas are initialized.
-*/
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -763,9 +763,6 @@ void __init init_cpu_to_node(void)
if (node == NUMA_NO_NODE)
continue;
 
-   if (!node_online(node))
-   init_memory_less_node(node);
-
numa_set_node(cpu, node);
}
 }
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Michal Hocko
On Thu 06-12-18 11:34:30, Pingfan Liu wrote:
[...]
> > I suspect we are looking at two issues here. The first one, and a more
> > important one is that there is a NUMA affinity configured for the device
> > to a non-existing node. The second one is that nr_cpus affects
> > initialization of possible nodes.
> 
> The dev->numa_node info is extracted from the ACPI table and does not
> depend on the NUMA node being instantiated, which may be limited by
> nr_cpus. Hence the node exists, it is just not instantiated.

Hmm, binding to a memory-less node is quite dubious. But OK. I am not sure
how much sanitization we can do. We need to fall back anyway, so we should
make sure that all possible nodes are initialized regardless of
nr_cpus. I will look into that.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Pingfan Liu
On Wed, Dec 5, 2018 at 5:43 PM Michal Hocko  wrote:
>
> On Wed 05-12-18 17:29:31, Pingfan Liu wrote:
> > On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko  wrote:
> > >
> > > On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > > > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
> > > > >
> > > > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > > > During my test on some AMD machine, with the kexec -l nr_cpus=x
> > > > > > > > option, the kernel failed to boot up, because some node's data
> > > > > > > > struct cannot be allocated,
> > > > > > > > e.g, on x86, initialized by 
> > > > > > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > > > This patch tries to fix the issue by falling back to the first 
> > > > > > > > online node,
> > > > > > > > when encountering such a corner case.
> > > > > > >
> > > > > > > We have seen similar issues already and the bug was usually that 
> > > > > > > the
> > > > > > > zonelists were not initialized yet or the node is completely 
> > > > > > > bogus.
> > > > > > > Zonelists should be initialized by build_all_zonelists quite 
> > > > > > > early so I
> > > > > > > am wondering whether the later is the case. What is the actual 
> > > > > > > node
> > > > > > > number the device is associated with?
> > > > > > >
> > > > > > The device's node num is 2. And in my case, I used the nr_cpus
> > > > > > param. Without this param, init_cpu_to_node() initializes all the
> > > > > > possible nodes, so it is hard for me to figure out how zonelists
> > > > > > would be accessed before the page allocator works.
> > > > >
> > > > > I believe we should focus on this. Why does the node have no zonelist
> > > > > even though all zonelists should be initialized already? Maybe this is
> > > > > an nr_cpus peculiarity and we do not initialize all the existing numa 
> > > > > nodes.
> > > > > Or maybe the device is associated to a non-existing node with that
> > > > > setup. A full dmesg might help us here.
> > > > >
> > > > Acquiring the machine again, I got the following without the nr_cpus
> > > > option
> > > > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > > > [root@dell-per7425-03 node]# ls
> > > > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > > > node4  node5  node6  node7  online  possible  power  uevent
> > > > [root@dell-per7425-03 node]# cat has_cpu
> > > > 0-7
> > > > [root@dell-per7425-03 node]# cat has_memory
> > > > 1,5
> > > > [root@dell-per7425-03 node]# cat online
> > > > 0-7
> > > > [root@dell-per7425-03 node]# cat possible
> > > > 0-7
> > > > And lscpu shows the following numa-cpu info:
> > > > NUMA node0 CPU(s): 0,8,16,24
> > > > NUMA node1 CPU(s): 2,10,18,26
> > > > NUMA node2 CPU(s): 4,12,20,28
> > > > NUMA node3 CPU(s): 6,14,22,30
> > > > NUMA node4 CPU(s): 1,9,17,25
> > > > NUMA node5 CPU(s): 3,11,19,27
> > > > NUMA node6 CPU(s): 5,13,21,29
> > > > NUMA node7 CPU(s): 7,15,23,31
> > > >
> > > > For the full panic message (I masked some hostname info with xx),
> > > > please see the attachment.
> > > > In short, it seems to be a problem with nr_cpus; without this
> > > > option, the kernel can boot up correctly.
> > >
> > > Yep.
> > > [0.007418] Early memory node ranges
> > > [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > > [0.007420]   node   1: [mem 0x0009-0x0009]
> > > [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > > [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > > [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > > [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > > [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> > >
> > > There is clearly no node2. Where did the driver get the node2 from?
> > Since nr_cpus=4 is used, node2 is not instantiated by the x86
> > initializing code.
> > For a normal bootup, I have the following:
> > [0.007704] Movable zone start for each node
> > [0.007707] Early memory node ranges
> > [0.007708]   node   1: [mem 0x1000-0x0008efff]
> > [0.007709]   node   1: [mem 0x0009-0x0009]
> > [0.007711]   node   1: [mem 0x0010-0x5c3d6fff]
> > [0.007712]   node   1: [mem 0x643df000-0x68ff7fff]
> > [0.007712]   node   1: [mem 0x6c528000-0x6fff]
> > [0.007713]   node   1: [mem 0x0001-0x00047fff]
> > [0.007714]   node   5: [mem 0x00048000-0x00087eff]

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Pingfan Liu
On Wed, Dec 5, 2018 at 5:40 PM Vlastimil Babka  wrote:
>
> On 12/5/18 10:29 AM, Pingfan Liu wrote:
> >> [0.007418] Early memory node ranges
> >> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> >> [0.007420]   node   1: [mem 0x0009-0x0009]
> >> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> >> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> >> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> >> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> >> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> >>
> >> There is clearly no node2. Where did the driver get the node2 from?
>
> I don't understand these tables too well, but it seems the other nodes
> exist without them:
>
> [0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2
>
> Maybe the nodes are hotpluggable or something?
>
I am also not sure about it; I just had a hurried look at the ACPI spec. I
will reply in another email and Cc some ACPI folks about it

> > Since nr_cpus=4 is used, node2 is not instantiated by the x86
> > initializing code.
>
> Indeed, nr_cpus seems to restrict what nodes we allocate and populate
> zonelists for.

Yes, in init_cpu_to_node(), since nr_cpus limits the possible cpus,
which affects the loop for_each_possible_cpu(cpu) and skips node2
in this case.
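[A hedged illustration of that restriction, reusing the pre-fix
init_cpu_to_node() loop quoted in the diffs and the lscpu table reported
earlier in the thread: under nr_cpus=4 only logical CPUs 0-3 stay
possible, and per that table they map to nodes 0, 4, 1 and 5.]

for_each_possible_cpu(cpu) {		/* only cpus 0,1,2,3 remain possible */
	int node = numa_cpu_node(cpu);	/* yields nodes 0, 4, 1, 5 */

	if (node == NUMA_NO_NODE)
		continue;

	if (!node_online(node))		/* node 2 is never seen here, so its */
		init_memory_less_node(node); /* pg_data_t is never allocated */

	numa_set_node(cpu, node);
}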

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread David Rientjes
On Wed, 5 Dec 2018, Pingfan Liu wrote:

> > > And rather than using first_online_node, would next_online_node() work?
> > >
> > What is the gain? Is it for memory pressure on node0?
> >
> Maybe I got your point now. Are you trying to give a cheap approximation
> of the nearest neighbor of this node?
> 

It's likely better than first_online_node, but probably going to be the 
same based on the node ids that you have reported since the nodemask will 
simply wrap around back to the first node.
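[A hedged sketch of the difference being weighed here, assuming the
standard nodemask helpers: next_online_node() returns MAX_NUMNODES past
the last online node, so callers wrap around to first_online_node.]

/* first_online_node always lands on the lowest online node id;
 * next_online_node(nid) prefers the next higher online id and only
 * wraps to the first one when nid was already the last online node. */
static int fallback_node(int nid)
{
	int target = next_online_node(nid);

	return (target == MAX_NUMNODES) ? first_online_node : target;
}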


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 17:29:31, Pingfan Liu wrote:
> On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko  wrote:
> >
> > On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
> > > >
> > > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > > During my test on some AMD machine, with the kexec -l nr_cpus=x
> > > > > > > option, the kernel failed to boot up, because some node's data
> > > > > > > struct cannot be allocated,
> > > > > > > e.g, on x86, initialized by 
> > > > > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > > This patch tries to fix the issue by falling back to the first 
> > > > > > > online node,
> > > > > > > when encountering such a corner case.
> > > > > >
> > > > > > We have seen similar issues already and the bug was usually that the
> > > > > > zonelists were not initialized yet or the node is completely bogus.
> > > > > > Zonelists should be initialized by build_all_zonelists quite early 
> > > > > > so I
> > > > > > am wondering whether the latter is the case. What is the actual node
> > > > > > number the device is associated with?
> > > > > >
> > > > > The device's node num is 2. And in my case, I used the nr_cpus
> > > > > param. Without this param, init_cpu_to_node() initializes all the
> > > > > possible nodes, so it is hard for me to figure out how zonelists
> > > > > would be accessed before the page allocator works.
> > > >
> > > > I believe we should focus on this. Why does the node have no zonelist
> > > > even though all zonelists should be initialized already? Maybe this is
> > > > an nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > > > Or maybe the device is associated to a non-existing node with that
> > > > setup. A full dmesg might help us here.
> > > >
> > > Acquiring the machine again, I got the following without the nr_cpus
> > > option
> > > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > > [root@dell-per7425-03 node]# ls
> > > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > > node4  node5  node6  node7  online  possible  power  uevent
> > > [root@dell-per7425-03 node]# cat has_cpu
> > > 0-7
> > > [root@dell-per7425-03 node]# cat has_memory
> > > 1,5
> > > [root@dell-per7425-03 node]# cat online
> > > 0-7
> > > [root@dell-per7425-03 node]# cat possible
> > > 0-7
> > > And lscpu shows the following numa-cpu info:
> > > NUMA node0 CPU(s): 0,8,16,24
> > > NUMA node1 CPU(s): 2,10,18,26
> > > NUMA node2 CPU(s): 4,12,20,28
> > > NUMA node3 CPU(s): 6,14,22,30
> > > NUMA node4 CPU(s): 1,9,17,25
> > > NUMA node5 CPU(s): 3,11,19,27
> > > NUMA node6 CPU(s): 5,13,21,29
> > > NUMA node7 CPU(s): 7,15,23,31
> > >
> > > For the full panic message (I masked some hostname info with xx),
> > > please see the attachment.
> > > In short, it seems to be a problem with nr_cpus; without this
> > > option, the kernel can boot up correctly.
> >
> > Yep.
> > [0.007418] Early memory node ranges
> > [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > [0.007420]   node   1: [mem 0x0009-0x0009]
> > [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> >
> > There is clearly no node2. Where did the driver get the node2 from?
> Since nr_cpus=4 is used, node2 is not instantiated by the x86 initializing code.
> For a normal bootup, I have the following:
> [0.007704] Movable zone start for each node
> [0.007707] Early memory node ranges
> [0.007708]   node   1: [mem 0x1000-0x0008efff]
> [0.007709]   node   1: [mem 0x0009-0x0009]
> [0.007711]   node   1: [mem 0x0010-0x5c3d6fff]
> [0.007712]   node   1: [mem 0x643df000-0x68ff7fff]
> [0.007712]   node   1: [mem 0x6c528000-0x6fff]
> [0.007713]   node   1: [mem 0x0001-0x00047fff]
> [0.007714]   node   5: [mem 0x00048000-0x00087eff]
> [0.008434] Zeroed struct page in unavailable ranges: 46490 pages

Hmm, this is even more interesting. So even a normal boot doesn't have
node 2. So where exactly does the device get its affinity from?

I suspect we are looking at two issues here. The first one, and a more
important one, is that there is a NUMA affinity configured for the device
to a non-existing node. The second one is that nr_cpus affects
initialization of possible nodes.

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Vlastimil Babka
On 12/5/18 10:29 AM, Pingfan Liu wrote:
>> [0.007418] Early memory node ranges
>> [0.007419]   node   1: [mem 0x1000-0x0008efff]
>> [0.007420]   node   1: [mem 0x0009-0x0009]
>> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
>> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
>> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
>> [0.007424]   node   1: [mem 0x0001-0x00047fff]
>> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
>>
>> There is clearly no node2. Where did the driver get the node2 from?

I don't understand these tables too well, but it seems the other nodes
exist without them:

[0.007393] SRAT: PXM 2 -> APIC 0x20 -> Node 2

Maybe the nodes are hotpluggable or something?

> Since nr_cpus=4 is used, node2 is not instantiated by the x86 initializing code.

Indeed, nr_cpus seems to restrict what nodes we allocate and populate
zonelists for.


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Pingfan Liu
On Wed, Dec 5, 2018 at 5:21 PM Michal Hocko  wrote:
>
> On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> > On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
> > >
> > > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> > > > >
> > > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > > During my test on some AMD machine, with the kexec -l nr_cpus=x
> > > > > > option, the kernel failed to boot up, because some node's data
> > > > > > struct cannot be allocated,
> > > > > > e.g, on x86, initialized by 
> > > > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > > > device->numa_node info is used as preferred_nid param for
> > > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > > This patch tries to fix the issue by falling back to the first 
> > > > > > online node,
> > > > > > when encountering such a corner case.
> > > > >
> > > > > We have seen similar issues already and the bug was usually that the
> > > > > zonelists were not initialized yet or the node is completely bogus.
> > > > > Zonelists should be initialized by build_all_zonelists quite early so 
> > > > > I
> > > > > am wondering whether the latter is the case. What is the actual node
> > > > > number the device is associated with?
> > > > >
> > > > The device's node num is 2. And in my case, I used the nr_cpus
> > > > param. Without this param, init_cpu_to_node() initializes all the
> > > > possible nodes, so it is hard for me to figure out how zonelists
> > > > would be accessed before the page allocator works.
> > >
> > > I believe we should focus on this. Why does the node have no zonelist
> > > even though all zonelists should be initialized already? Maybe this is
> > > an nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > > Or maybe the device is associated to a non-existing node with that
> > > setup. A full dmesg might help us here.
> > >
> > Acquiring the machine again, I got the following without the nr_cpus option
> > [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> > [root@dell-per7425-03 node]# ls
> > has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> > node4  node5  node6  node7  online  possible  power  uevent
> > [root@dell-per7425-03 node]# cat has_cpu
> > 0-7
> > [root@dell-per7425-03 node]# cat has_memory
> > 1,5
> > [root@dell-per7425-03 node]# cat online
> > 0-7
> > [root@dell-per7425-03 node]# cat possible
> > 0-7
> > And lscpu shows the following numa-cpu info:
> > NUMA node0 CPU(s): 0,8,16,24
> > NUMA node1 CPU(s): 2,10,18,26
> > NUMA node2 CPU(s): 4,12,20,28
> > NUMA node3 CPU(s): 6,14,22,30
> > NUMA node4 CPU(s): 1,9,17,25
> > NUMA node5 CPU(s): 3,11,19,27
> > NUMA node6 CPU(s): 5,13,21,29
> > NUMA node7 CPU(s): 7,15,23,31
> >
> > For the full panic message (I masked some hostname info with xx),
> > please see the attachment.
> > In short, it seems to be a problem with nr_cpus; without this
> > option, the kernel can boot up correctly.
>
> Yep.
> [0.007418] Early memory node ranges
> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> [0.007420]   node   1: [mem 0x0009-0x0009]
> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
>
> There is clearly no node2. Where did the driver get the node2 from?
Since nr_cpus=4 is used, node2 is not instantiated by the x86 initializing code.
For a normal bootup, I have the following:
[0.007704] Movable zone start for each node
[0.007707] Early memory node ranges
[0.007708]   node   1: [mem 0x1000-0x0008efff]
[0.007709]   node   1: [mem 0x0009-0x0009]
[0.007711]   node   1: [mem 0x0010-0x5c3d6fff]
[0.007712]   node   1: [mem 0x643df000-0x68ff7fff]
[0.007712]   node   1: [mem 0x6c528000-0x6fff]
[0.007713]   node   1: [mem 0x0001-0x00047fff]
[0.007714]   node   5: [mem 0x00048000-0x00087eff]
[0.008434] Zeroed struct page in unavailable ranges: 46490 pages
[0.008435] Initmem setup node 1 [mem 0x1000-0x00047fff]
[0.022826] Initmem setup node 5 [mem 0x00048000-0x00087eff]
[0.024303] ACPI: PM-Timer IO Port: 0x408
[0.024320] ACPI: LAPIC_NMI (acpi_id[0xff] high edge lint[0x1])
[0.024349] IOAPIC[0]: apic_id 128, version 33, address 0xfec0, GSI 0-23
[0.024354] IOAPIC[1]: apic_id 129, version 33, address 0xfd88, GSI 24-55
[0.024360] IOAPIC[2]: apic_id 

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-05 Thread Michal Hocko
On Wed 05-12-18 13:38:17, Pingfan Liu wrote:
> On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
> >
> > On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> > > >
> > > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > > During my test on some AMD machine, with the kexec -l nr_cpus=x
> > > > > option, the kernel failed to boot up, because some node's data
> > > > > struct cannot be allocated,
> > > > > e.g, on x86, initialized by 
> > > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > > device->numa_node info is used as preferred_nid param for
> > > > > __alloc_pages_nodemask(), which causes NULL reference
> > > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > > This patch tries to fix the issue by falling back to the first online 
> > > > > node,
> > > > > when encountering such a corner case.
> > > >
> > > > We have seen similar issues already and the bug was usually that the
> > > > zonelists were not initialized yet or the node is completely bogus.
> > > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > > am wondering whether the latter is the case. What is the actual node
> > > > number the device is associated with?
> > > >
> > > The device's node num is 2. And in my case, I used the nr_cpus
> > > param. Without this param, init_cpu_to_node() initializes all the
> > > possible nodes, so it is hard for me to figure out how zonelists
> > > would be accessed before the page allocator works.
> >
> > I believe we should focus on this. Why does the node have no zonelist
> > even though all zonelists should be initialized already? Maybe this is
> > an nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> > Or maybe the device is associated to a non-existing node with that
> > setup. A full dmesg might help us here.
> >
> Acquiring the machine again, I got the following without the nr_cpus option
> [root@dell-per7425-03 ~]# cd /sys/devices/system/node/
> [root@dell-per7425-03 node]# ls
> has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
> node4  node5  node6  node7  online  possible  power  uevent
> [root@dell-per7425-03 node]# cat has_cpu
> 0-7
> [root@dell-per7425-03 node]# cat has_memory
> 1,5
> [root@dell-per7425-03 node]# cat online
> 0-7
> [root@dell-per7425-03 node]# cat possible
> 0-7
> And lscpu shows the following numa-cpu info:
> NUMA node0 CPU(s): 0,8,16,24
> NUMA node1 CPU(s): 2,10,18,26
> NUMA node2 CPU(s): 4,12,20,28
> NUMA node3 CPU(s): 6,14,22,30
> NUMA node4 CPU(s): 1,9,17,25
> NUMA node5 CPU(s): 3,11,19,27
> NUMA node6 CPU(s): 5,13,21,29
> NUMA node7 CPU(s): 7,15,23,31
> 
> For the full panic message (I masked some hostname info with xx),
> please see the attachment.
> In short, it seems to be a problem with nr_cpus; without this
> option, the kernel can boot up correctly.

Yep.
[0.007418] Early memory node ranges
[0.007419]   node   1: [mem 0x1000-0x0008efff]
[0.007420]   node   1: [mem 0x0009-0x0009]
[0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
[0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
[0.007423]   node   1: [mem 0x6c528000-0x6fff]
[0.007424]   node   1: [mem 0x0001-0x00047fff]
[0.007425]   node   5: [mem 0x00048000-0x00087eff]

There is clearly no node2. Where did the driver get the node2 from?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 5:09 PM Wei Yang  wrote:
>
> On Tue, Dec 04, 2018 at 04:52:52PM +0800, Pingfan Liu wrote:
> >On Tue, Dec 4, 2018 at 4:34 PM Wei Yang  wrote:
> >>
> >> On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
> >> >On Tue, Dec 4, 2018 at 2:54 PM Wei Yang  wrote:
> >> >>
> >> >> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
> >> >> >During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> >> >> >kernel failed to boot up, because some node's data struct cannot be
> >> >> >allocated,
> >> >> >e.g, on x86, initialized by 
> >> >> >init_cpu_to_node()->init_memory_less_node(). But
> >> >> >device->numa_node info is used as preferred_nid param for
> >> >>
> >> >> could we fix the preferred_nid before it is passed to
> >> >> __alloc_pages_nodemask()?
> >> >>
> >Yes, we can do it too, but what is the gain?
> >>
> >> node_zonelist() is used in several places. If we are sure where the problem
> >> is, it is not necessary to spread the fix to other places.
> >>
> >> >
> >> >> BTW, I don't catch the function call flow to this point. Would you mind
> >> >> giving me some hint?
> >> >>
> >> >You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()
> >>
> >> slab_alloc() passes NUMA_NO_NODE down, so I am lost as to where the
> >> preferred_nid is assigned.
> >>
> >You can follow:
> >[5.773618]  new_slab+0xa9/0x570
> >[5.773618]  ___slab_alloc+0x375/0x540
> >[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> >where static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int 
> >node)
> >
>
> Well, thanks for your patience, but I still don't get it.
>
> new_slab(node)
>   allocate_slab(node)
>     alloc_slab_page(node)
>       if (node == NUMA_NO_NODE)
>         alloc_pages()
>       else
>         __alloc_pages_node(node)
>
> As you mentioned, this starts from slab_alloc() which passes NUMA_NO_NODE.
> This means it goes to alloc_pages() and then alloc_pages_current() ->
> __alloc_pages_nodemask(). Here we use policy_node() to get the
> preferred_nid.
>
> I didn't catch the relationship between policy_node() and
> device->numa_node. Maybe I got something wrong somewhere. Would you mind
> sharing more?
>
I have uploaded the full panic log. Enjoy it.

Regards,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 3:16 PM Pingfan Liu  wrote:
>
> On Tue, Dec 4, 2018 at 11:53 AM David Rientjes  wrote:
> >
> > On Tue, 4 Dec 2018, Pingfan Liu wrote:
> >
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 76f8db0..8324953 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
> > >   */
> > >  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > >  {
> > > + if (unlikely(!node_online(nid)))
> > > + nid = first_online_node;
> > >   return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > >  }
> > >
> >
> > So we're passing the node id from dev_to_node() to kmalloc which
> > interprets that as the preferred node and then does node_zonelist() to
> > find the zonelist at allocation time.
> >
> > What happens if we fix this in alloc_dr()?  Does anything else cause
> > problems?
> >
> I think it is better to fix it in mm, since that protects against any
> similar bug in the future, while fixing it in alloc_dr() only works for
> the present case
>
> > And rather than using first_online_node, would next_online_node() work?
> >
> What is the gain? Is it for memory pressure on node0?
>
Maybe I got your point now. Are you trying to give a cheap approximation
of the nearest neighbor of this node?
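[For reference, a hedged sketch of the alloc_dr() alternative David
floated above: clamping the device-affine node at the devres call site
only. The helper name is hypothetical and this was not a merged change.]

/* Hypothetical helper, illustrative only: validate the nid that
 * dev_to_node() would hand to the devres allocation before it reaches
 * the page allocator. */
static int sanitize_devres_nid(int nid)
{
	if (nid != NUMA_NO_NODE && !node_online(nid))
		return first_online_node;	/* cheap, but possibly distant */
	return nid;
}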

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 4:56 PM Michal Hocko  wrote:
>
> On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> > On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> > >
> > > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > > During my test on some AMD machine, with kexec -l nr_cpus=x option, the
> > > > kernel failed to boot up, because some node's data struct cannot be
> > > > allocated,
> > > > e.g, on x86, initialized by 
> > > > init_cpu_to_node()->init_memory_less_node(). But
> > > > device->numa_node info is used as preferred_nid param for
> > > > __alloc_pages_nodemask(), which causes NULL reference
> > > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > > This patch tries to fix the issue by falling back to the first online 
> > > > node,
> > > > when encountering such a corner case.
> > >
> > > We have seen similar issues already and the bug was usually that the
> > > zonelists were not initialized yet or the node is completely bogus.
> > > Zonelists should be initialized by build_all_zonelists quite early so I
> > > am wondering whether the latter is the case. What is the actual node
> > > number the device is associated with?
> > >
> > The device's node num is 2. And in my case, I used the nr_cpus
> > param. Without this param, init_cpu_to_node() initializes all the
> > possible nodes, so it is hard for me to figure out how zonelists
> > would be accessed before the page allocator works.
>
> I believe we should focus on this. Why does the node have no zonelist
> even though all zonelists should be initialized already? Maybe this is
> an nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> Or maybe the device is associated to a non-existing node with that
> setup. A full dmesg might help us here.
>
Acquiring the machine again, I got the following without the nr_cpus option
[root@dell-per7425-03 ~]# cd /sys/devices/system/node/
[root@dell-per7425-03 node]# ls
has_cpu  has_memory  has_normal_memory  node0  node1  node2  node3
node4  node5  node6  node7  online  possible  power  uevent
[root@dell-per7425-03 node]# cat has_cpu
0-7
[root@dell-per7425-03 node]# cat has_memory
1,5
[root@dell-per7425-03 node]# cat online
0-7
[root@dell-per7425-03 node]# cat possible
0-7
And lscpu shows the following numa-cpu info:
NUMA node0 CPU(s): 0,8,16,24
NUMA node1 CPU(s): 2,10,18,26
NUMA node2 CPU(s): 4,12,20,28
NUMA node3 CPU(s): 6,14,22,30
NUMA node4 CPU(s): 1,9,17,25
NUMA node5 CPU(s): 3,11,19,27
NUMA node6 CPU(s): 5,13,21,29
NUMA node7 CPU(s): 7,15,23,31

For the full panic message (I masked some hostname info with xx),
please see the attachment.
In short, it seems to be a problem with nr_cpus; without this
option, the kernel can boot up correctly.

> > > Your patch is not correct btw, because we want to fall back to a node
> > > in the distance order rather than to the first online node.
> > > --
> > What about this:
> > +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
> > +
> >  /*
> >   * We get the zone list from the current node and the gfp_mask.
> >   * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
> > @@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
> >   */
> >  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >  {
> > +   if (unlikely(!node_online(nid))) {
> > +   nodemask_t used_mask;
> > +   nodes_complement(used_mask, node_online_map);
> > +   nid = find_next_best_node(nid, &used_mask);
> > +   }
> > return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >  }
> >
> > I just finished compiling, but have not tested it yet, since the machine
> > is not on hand. It needs some time to get it again.
>
> This is clearly a no-go. nodemask_t can be giant and you cannot have it
> on the stack for allocation paths which might be called from a deep

What about a static variable:
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+   if (unlikely(!node_online(nid))) {
+   WARN_ONCE(1, "Try to alloc mem from not online node\n");
+   nid = find_best_node(nid);
+   }
return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }

+int find_best_node(int node)
+{
+   static nodemask_t used_mask;
+   static DEFINE_SPINLOCK(lock);
+   static int best_neigh[MAX_NUMNODES] = {0};
+   int nid;
+
+   spin_lock(&lock);
+   if (likely( best_neigh[node] != 0)) {
+   nid = best_neigh[node];
+   goto unlock;
+   }
+   nodes_complement(used_mask, node_online_map);
+   nid = find_next_best_node(node, &used_mask);
+   best_neigh[node] = nid;
+
+unlock:
+   spin_unlock(&lock);
+   return nid;
+}

> stack already. Also this is called from the allocator hot paths and each
> branch counts.
>
I am puzzled about this. Does unlikely() work for it?

Thanks,
Pingfan

Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Vlastimil Babka
On 12/4/18 9:56 AM, Michal Hocko wrote:
>> The device's node num is 2. And in my case, I used the nr_cpus
>> param. Without this param, init_cpu_to_node() initializes all the
>> possible nodes, so it is hard for me to figure out how zonelists
>> would be accessed before the page allocator works.
> I believe we should focus on this. Why does the node have no zonelist
> even though all zonelists should be initialized already? Maybe this is
> an nr_cpus peculiarity and we do not initialize all the existing numa nodes.
> Or maybe the device is associated to a non-existing node with that
> setup. A full dmesg might help us here.

Yes, a full dmesg should contain a line such as this one:

[0.137407] Built 1 zonelists, mobility grouping on.  Total pages: 6181664

That should at least tell us if nr_cpus=X resulted in some node's
zonelists not being built.


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Wei Yang
On Tue, Dec 04, 2018 at 04:52:52PM +0800, Pingfan Liu wrote:
>On Tue, Dec 4, 2018 at 4:34 PM Wei Yang  wrote:
>>
>> On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
>> >On Tue, Dec 4, 2018 at 2:54 PM Wei Yang  wrote:
>> >>
>> >> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
>> >> >During my test on some AMD machine, with the kexec -l nr_cpus=x option,
>> >> >the kernel failed to boot up, because some node's data struct cannot be
>> >> >allocated, e.g. on x86, initialized by
>> >> >init_cpu_to_node()->init_memory_less_node(). But
>> >> >device->numa_node info is used as the preferred_nid param for
>> >>
>> could we fix the preferred_nid before it is passed to
>> __alloc_pages_nodemask()?
>> >>
>Yes, we could do it there too, but what is the gain?
>>
>> node_zonelist() is used in several places. If we are sure where the problem
>> is, it is not necessary to spread the fix to other places.
>>
>> >
>> >> BTW, I don't catch the function call flow to this point. Would you mind
>> >> giving me some hint?
>> >>
>> >You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()
>>
>> slab_alloc() passes NUMA_NO_NODE down, so I am lost on where the
>> preferred_nid is assigned.
>>
>You can follow:
>[5.773618]  new_slab+0xa9/0x570
>[5.773618]  ___slab_alloc+0x375/0x540
>[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
>where static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
>

Well, thanks for your patience, but I still don't get it.

new_slab(node)
    allocate_slab(node)
        alloc_slab_page(node)
            if (node == NUMA_NO_NODE)
                alloc_pages()
            else
                __alloc_pages_node(node)

As you mentioned, this starts from slab_alloc(), which passes NUMA_NO_NODE.
This means it goes to alloc_pages() and then alloc_pages_current() ->
__alloc_pages_nodemask(). Here we use policy_node() to get the
preferred_nid.
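
If I read mm/mempolicy.c correctly, policy_node() looks roughly like this
(a simplified sketch, not the exact source):

static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
{
	/* only an explicit MPOL_PREFERRED policy overrides the nid ... */
	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
		nd = policy->v.preferred_node;
	/* ... otherwise the caller's nd (usually numa_node_id()) is kept */
	return nd;
}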

I didn't catch the relationship between policy_node() and
device->numa_node. Maybe I got something wrong somewhere. Would you mind
sharing more?

>Thanks,
>Pingfan

-- 
Wei Yang
Help you, Help me


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 4:40 PM Wei Yang  wrote:
>
> On Tue, Dec 04, 2018 at 04:20:32PM +0800, Pingfan Liu wrote:
> >On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> >>
> >> On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> >> > During my test on some AMD machine, with the kexec -l nr_cpus=x option,
> >> > the kernel failed to boot up, because some node's data struct cannot be
> >> > allocated, e.g. on x86, initialized by
> >> > init_cpu_to_node()->init_memory_less_node(). But
> >> > device->numa_node info is used as the preferred_nid param for
> >> > __alloc_pages_nodemask(), which causes a NULL reference:
> >> >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> >> > This patch tries to fix the issue by falling back to the first online
> >> > node when encountering such a corner case.
> >>
> >> We have seen similar issues already and the bug was usually that the
> >> zonelists were not initialized yet or the node is completely bogus.
> >> Zonelists should be initialized by build_all_zonelists quite early so I
> >> am wondering whether the latter is the case. What is the actual node
> >> number the device is associated with?
> >>
> >The device's node num is 2. And in my case, I used the nr_cpus param.
> >init_cpu_to_node() initializes all the possible nodes, so it is hard
> >for me to figure out how, without this param, zonelists could be
> >accessed before the page allocator works.
>
> If my understanding is correct, we can't do page alloc before zonelist
> is initialized.
>
> I guess Michal's point is to figure out this reason.
>
Yeah, I know. I just want to emphasize that I hit this bug using
nr_cpus, which may be rarely used by people. Hence it may be a
different bug from the ones Michal has seen. Sorry if my bad English
causes confusion.

Thanks

[...]


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Michal Hocko
On Tue 04-12-18 16:20:32, Pingfan Liu wrote:
> On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
> >
> > On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > > During my test on some AMD machine, with the kexec -l nr_cpus=x option,
> > > the kernel failed to boot up, because some node's data struct cannot be
> > > allocated, e.g. on x86, initialized by
> > > init_cpu_to_node()->init_memory_less_node(). But
> > > device->numa_node info is used as the preferred_nid param for
> > > __alloc_pages_nodemask(), which causes a NULL reference:
> > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > This patch tries to fix the issue by falling back to the first online
> > > node when encountering such a corner case.
> >
> > We have seen similar issues already and the bug was usually that the
> > zonelists were not initialized yet or the node is completely bogus.
> > Zonelists should be initialized by build_all_zonelists quite early so I
> > am wondering whether the latter is the case. What is the actual node
> > number the device is associated with?
> >
> The device's node num is 2. And in my case, I used the nr_cpus param.
> init_cpu_to_node() initializes all the possible nodes, so it is hard
> for me to figure out how, without this param, zonelists could be
> accessed before the page allocator works.

I believe we should focus on this. Why does the node have no zonelist
even though all zonelists should be initialized already? Maybe this is
a nr_cpus peculiarity and we do not initialize all the existing NUMA
nodes. Or maybe the device is associated with a non-existing node with
that setup. A full dmesg might help us here.

> > Your patch is not correct btw, because we want to fall back to a node in
> > the distance order rather than to the first online node.
> > --
> What about this:
> +extern int find_next_best_node(int node, nodemask_t *used_node_mask);
> +
>  /*
>   * We get the zone list from the current node and the gfp_mask.
>   * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
> @@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
>   */
>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  {
> +   if (unlikely(!node_online(nid))) {
> +   nodemask_t used_mask;
> +   nodes_complement(used_mask, node_online_map);
> +   nid = find_next_best_node(nid, &used_mask);
> +   }
> return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
> 
> I just finished compiling; I have not tested it yet, since the machine is
> not at hand. It will take some time to get it again.

This is clearly a no-go. nodemask_t can be giant and you cannot have it
on the stack for allocation paths which might be called from a deep
stack already. Also this is called from the allocator hot paths and each
branch counts.
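
For reference, nodemask_t is a fixed-size bitmap (include/linux/nodemask.h):

typedef struct { DECLARE_BITMAP(bits, MAX_NUMNODES); } nodemask_t;

With CONFIG_NODES_SHIFT=10, MAX_NUMNODES is 1024, so each instance is 128
bytes of stack.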

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 4:34 PM Wei Yang  wrote:
>
> On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
> >On Tue, Dec 4, 2018 at 2:54 PM Wei Yang  wrote:
> >>
> >> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
> >> >During my test on some AMD machine, with the kexec -l nr_cpus=x option,
> >> >the kernel failed to boot up, because some node's data struct cannot be
> >> >allocated, e.g. on x86, initialized by
> >> >init_cpu_to_node()->init_memory_less_node(). But
> >> >device->numa_node info is used as the preferred_nid param for
> >>
> >> could we fix the preferred_nid before it is passed to
> >> __alloc_pages_nodemask()?
> >>
>Yes, we could do it there too, but what is the gain?
>
> node_zonelist() is used in several places. If we are sure where the problem
> is, it is not necessary to spread the fix to other places.
>
> >
> >> BTW, I don't catch the function call flow to this point. Would you mind
> >> giving me some hint?
> >>
> >You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()
>
> slab_alloc() passes NUMA_NO_NODE down, so I am lost on where the
> preferred_nid is assigned.
>
You can follow:
[5.773618]  new_slab+0xa9/0x570
[5.773618]  ___slab_alloc+0x375/0x540
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
where static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Wei Yang
On Tue, Dec 04, 2018 at 04:20:32PM +0800, Pingfan Liu wrote:
>On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
>>
>> On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
>> > During my test on some AMD machine, with the kexec -l nr_cpus=x option,
>> > the kernel failed to boot up, because some node's data struct cannot be
>> > allocated, e.g. on x86, initialized by
>> > init_cpu_to_node()->init_memory_less_node(). But
>> > device->numa_node info is used as the preferred_nid param for
>> > __alloc_pages_nodemask(), which causes a NULL reference:
>> >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>> > This patch tries to fix the issue by falling back to the first online
>> > node when encountering such a corner case.
>>
>> We have seen similar issues already and the bug was usually that the
>> zonelists were not initialized yet or the node is completely bogus.
>> Zonelists should be initialized by build_all_zonelists quite early so I
>> am wondering whether the latter is the case. What is the actual node
>> number the device is associated with?
>>
>The device's node num is 2. And in my case, I used the nr_cpus param.
>init_cpu_to_node() initializes all the possible nodes, so it is hard
>for me to figure out how, without this param, zonelists could be
>accessed before the page allocator works.

If my understanding is correct, we can't do page alloc before zonelist
is initialized.

I guess Michal's point is to figure out this reason.

>
>> Your patch is not correct btw, because we want to fall back to a node in
>> the distance order rather than to the first online node.
>> --
>What about this:
>+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
>+
> /*
>  * We get the zone list from the current node and the gfp_mask.
>  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
>@@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
>  */
> static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> {
>+   if (unlikely(!node_online(nid))) {
>+   nodemask_t used_mask;
>+   nodes_complement(used_mask, node_online_map);
>+   nid = find_next_best_node(nid, &used_mask);
>+   }
>return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> }
>
>I just finished compiling; I have not tested it yet, since the machine is
>not at hand. It will take some time to get it again.
>
>Thanks,
>Pingfan

-- 
Wei Yang
Help you, Help me


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Wei Yang
On Tue, Dec 04, 2018 at 03:20:13PM +0800, Pingfan Liu wrote:
>On Tue, Dec 4, 2018 at 2:54 PM Wei Yang  wrote:
>>
>> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
>> >During my test on some AMD machine, with the kexec -l nr_cpus=x option,
>> >the kernel failed to boot up, because some node's data struct cannot be
>> >allocated, e.g. on x86, initialized by
>> >init_cpu_to_node()->init_memory_less_node(). But
>> >device->numa_node info is used as the preferred_nid param for
>>
>> could we fix the preferred_nid before it is passed to
>> __alloc_pages_nodemask()?
>>
>Yes, we could do it there too, but what is the gain?

node_zonelist() is used in several places. If we are sure where the problem
is, it is not necessary to spread the fix to other places.

>
>> BTW, I don't catch the function call flow to this point. Would you mind
>> giving me some hint?
>>
>You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()

slab_alloc() passes NUMA_NO_NODE down, so I am lost on where the
preferred_nid is assigned.

>
>Thanks,
>Pingfan

-- 
Wei Yang
Help you, Help me


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-04 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 3:22 PM Michal Hocko  wrote:
>
> On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> > During my test on some AMD machine, with the kexec -l nr_cpus=x option,
> > the kernel failed to boot up, because some node's data struct cannot be
> > allocated, e.g. on x86, initialized by
> > init_cpu_to_node()->init_memory_less_node(). But
> > device->numa_node info is used as the preferred_nid param for
> > __alloc_pages_nodemask(), which causes a NULL reference:
> >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > This patch tries to fix the issue by falling back to the first online
> > node when encountering such a corner case.
>
> We have seen similar issues already and the bug was usually that the
> zonelists were not initialized yet or the node is completely bogus.
> Zonelists should be initialized by build_all_zonelists quite early so I
> am wondering whether the latter is the case. What is the actual node
> number the device is associated with?
>
The device's node num is 2. And in my case, I used the nr_cpus param.
init_cpu_to_node() initializes all the possible nodes, so it is hard
for me to figure out how, without this param, zonelists could be
accessed before the page allocator works.

> Your patch is not correct btw, because we want to fall back to a node in
> the distance order rather than to the first online node.
> --
What about this:
+extern int find_next_best_node(int node, nodemask_t *used_node_mask);
+
 /*
  * We get the zone list from the current node and the gfp_mask.
  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
@@ -453,6 +455,11 @@ static inline int gfp_zonelist(gfp_t flags)
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+   if (unlikely(!node_online(nid))) {
+   nodemask_t used_mask;
+   nodes_complement(used_mask, node_online_map);
+   nid = find_next_best_node(nid, &used_mask);
+   }
return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }

I just finished compiling; I have not tested it yet, since the machine is
not at hand. It will take some time to get it again.

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread Michal Hocko
On Tue 04-12-18 11:05:57, Pingfan Liu wrote:
> During my test on some AMD machine, with the kexec -l nr_cpus=x option, the
> kernel failed to boot up, because some node's data struct cannot be
> allocated, e.g. on x86, initialized by
> init_cpu_to_node()->init_memory_less_node(). But
> device->numa_node info is used as the preferred_nid param for
> __alloc_pages_nodemask(), which causes a NULL reference:
>   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> This patch tries to fix the issue by falling back to the first online
> node when encountering such a corner case.

We have seen similar issues already and the bug was usually that the
zonelists were not initialized yet or the node is completely bogus.
Zonelists should be initialized by build_all_zonelists quite early so I
am wondering whether the latter is the case. What is the actual node
number the device is associated with?

Your patch is not correct btw, because we want to fall back to a node in
the distance order rather than to the first online node.
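
The distance order is what find_next_best_node() in mm/page_alloc.c is
about; conceptually something like this (a simplified sketch only; the
real function also honors a used_node_mask and CPU locality, and
nearest_online_node is a made-up name):

static int nearest_online_node(int node)
{
	int n, best = NUMA_NO_NODE, min_dist = INT_MAX;

	for_each_online_node(n) {
		int d = node_distance(node, n);
		if (d < min_dist) {
			min_dist = d;
			best = n;
		}
	}
	return best;
}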
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 2:54 PM Wei Yang  wrote:
>
> On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
> >During my test on some AMD machine, with the kexec -l nr_cpus=x option, the
> >kernel failed to boot up, because some node's data struct cannot be
> >allocated, e.g. on x86, initialized by
> >init_cpu_to_node()->init_memory_less_node(). But
> >device->numa_node info is used as the preferred_nid param for
>
> could we fix the preferred_nid before it is passed to
> __alloc_pages_nodemask()?
>
Yes, we could do it there too, but what is the gain?

> BTW, I don't catch the function call flow to this point. Would you mind
> giving me some hint?
>
You can track the code along slab_alloc() ->...->__alloc_pages_nodemask()

Thanks,
Pingfan


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread Pingfan Liu
On Tue, Dec 4, 2018 at 11:53 AM David Rientjes  wrote:
>
> On Tue, 4 Dec 2018, Pingfan Liu wrote:
>
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 76f8db0..8324953 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
> >   */
> >  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> >  {
> > + if (unlikely(!node_online(nid)))
> > + nid = first_online_node;
> >   return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> >  }
> >
>
> So we're passing the node id from dev_to_node() to kmalloc which
> interprets that as the preferred node and then does node_zonelist() to
> find the zonelist at allocation time.
>
> What happens if we fix this in alloc_dr()?  Does anything else cause
> problems?
>
I think it is better to fix it in mm, since that protects against any
similar new bugs in the future, while fixing it in alloc_dr() only works
for the present case.

> And rather than using first_online_node, would next_online_node() work?
>
What is the gain? Is it for memory pressure on node0?

Thanks,
Pingfan

> I'm thinking about this:
>
> diff --git a/drivers/base/devres.c b/drivers/base/devres.c
> --- a/drivers/base/devres.c
> +++ b/drivers/base/devres.c
> @@ -100,6 +100,8 @@ static __always_inline struct devres *alloc_dr(dr_release_t release,
> &tot_size)))
> return NULL;
>
> +   if (unlikely(!node_online(nid)))
> +   nid = next_online_node(nid);
> dr = kmalloc_node_track_caller(tot_size, gfp, nid);
> if (unlikely(!dr))
> return NULL;


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread Wei Yang
On Tue, Dec 04, 2018 at 11:05:57AM +0800, Pingfan Liu wrote:
>During my test on some AMD machine, with the kexec -l nr_cpus=x option, the
>kernel failed to boot up, because some node's data struct cannot be allocated,
>e.g. on x86, initialized by init_cpu_to_node()->init_memory_less_node(). But
>device->numa_node info is used as the preferred_nid param for

could we fix the preferred_nid before it is passed to
__alloc_pages_nodemask()?

BTW, I don't catch the function call flow to this point. Would you mind
giving me some hint?

-- 
Wei Yang
Help you, Help me


Re: [PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread David Rientjes
On Tue, 4 Dec 2018, Pingfan Liu wrote:

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 76f8db0..8324953 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
>   */
>  static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>  {
> + if (unlikely(!node_online(nid)))
> + nid = first_online_node;
>   return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
>  }
>  

So we're passing the node id from dev_to_node() to kmalloc, which
interprets that as the preferred node and then does node_zonelist() to
find the zonelist at allocation time.
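
For context, the devres entry point forwards the device node unchecked;
paraphrased from drivers/base/devres.c (may differ slightly by version):

void *devm_kmalloc(struct device *dev, size_t size, gfp_t gfp)
{
	struct devres *dr;

	/* dev_to_node(dev) may name a node that was never onlined */
	dr = alloc_dr(devm_kmalloc_release, size, gfp, dev_to_node(dev));
	if (unlikely(!dr))
		return NULL;
	devres_add(dev, dr->data);
	return dr->data;
}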

What happens if we fix this in alloc_dr()?  Does anything else cause 
problems?

And rather than using first_online_node, would next_online_node() work?

I'm thinking about this:

diff --git a/drivers/base/devres.c b/drivers/base/devres.c
--- a/drivers/base/devres.c
+++ b/drivers/base/devres.c
@@ -100,6 +100,8 @@ static __always_inline struct devres *alloc_dr(dr_release_t release,
&tot_size)))
return NULL;
 
+   if (unlikely(!node_online(nid)))
+   nid = next_online_node(nid);
dr = kmalloc_node_track_caller(tot_size, gfp, nid);
if (unlikely(!dr))
return NULL;


[PATCH] mm/alloc: fallback to first node if the wanted node offline

2018-12-03 Thread Pingfan Liu
During my test on some AMD machine, with the kexec -l nr_cpus=x option, the
kernel failed to boot up, because some node's data struct cannot be
allocated, e.g. on x86, initialized by
init_cpu_to_node()->init_memory_less_node(). But device->numa_node info
is used as the preferred_nid param for __alloc_pages_nodemask(), which
causes a NULL reference:
  ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
This patch tries to fix the issue by falling back to the first online
node when encountering such a corner case.

Notes about the crashing info:
-1. kexec -l with nr_cpus=4
-2. system info
  NUMA node0 CPU(s): 0,8,16,24
  NUMA node1 CPU(s): 2,10,18,26
  NUMA node2 CPU(s): 4,12,20,28
  NUMA node3 CPU(s): 6,14,22,30
  NUMA node4 CPU(s): 1,9,17,25
  NUMA node5 CPU(s): 3,11,19,27
  NUMA node6 CPU(s): 5,13,21,29
  NUMA node7 CPU(s): 7,15,23,31
-3. panic stack
[...]
[5.721547] atomic64_test: passed for x86-64 platform with CX8 and with SSE
[5.729187] pcieport :00:01.1: Signaling PME with IRQ 34
[5.735187] pcieport :00:01.2: Signaling PME with IRQ 35
[5.741168] pcieport :00:01.3: Signaling PME with IRQ 36
[5.747189] pcieport :00:07.1: Signaling PME with IRQ 37
[5.754061] pcieport :00:08.1: Signaling PME with IRQ 39
[5.760727] pcieport :20:07.1: Signaling PME with IRQ 40
[5.766955] pcieport :20:08.1: Signaling PME with IRQ 42
[5.772742] BUG: unable to handle kernel paging request at 2088
[5.773618] PGD 0 P4D 0
[5.773618] Oops:  [#1] SMP NOPTI
[5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
06/29/2018
[5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 
0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
e1 44 89 e6 89
[5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
[5.773618] RAX:  RBX: 006012c0 RCX: 
[5.773618] RDX:  RSI: 0002 RDI: 2080
[5.773618] RBP: 006012c0 R08:  R09: 0002
[5.773618] R10: 006080c0 R11: 0002 R12: 
[5.773618] R13: 0001 R14:  R15: 0002
[5.773618] FS:  () GS:8c69afe0() 
knlGS:
[5.773618] CS:  0010 DS:  ES:  CR0: 80050033
[5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
[5.773618] Call Trace:
[5.773618]  new_slab+0xa9/0x570
[5.773618]  ___slab_alloc+0x375/0x540
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  __slab_alloc+0x1c/0x38
[5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  devm_kmalloc+0x28/0x60
[5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  really_probe+0x73/0x420
[5.773618]  driver_probe_device+0x115/0x130
[5.773618]  __driver_attach+0x103/0x110
[5.773618]  ? driver_probe_device+0x130/0x130
[5.773618]  bus_for_each_dev+0x67/0xc0
[5.773618]  ? klist_add_tail+0x3b/0x70
[5.773618]  bus_add_driver+0x41/0x260
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  driver_register+0x5b/0xe0
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  do_one_initcall+0x4e/0x1d4
[5.773618]  ? init_setup+0x25/0x28
[5.773618]  kernel_init_freeable+0x1c1/0x26e
[5.773618]  ? loglevel+0x5b/0x5b
[5.773618]  ? rest_init+0xb0/0xb0
[5.773618]  kernel_init+0xa/0x110
[5.773618]  ret_from_fork+0x22/0x40
[5.773618] Modules linked in:
[5.773618] CR2: 2088
[5.773618] ---[ end trace 1030c9120a03d081 ]---
[...]

Other notes about the reproduction of this bug:
After applying the following patch:
commit 0d76bcc960e6057750fcf556b65da13f8bbdfd2b
Author: Bjorn Helgaas 
Date:   Tue Nov 13 08:38:17 2018 -0600

Revert "ACPI/PCI: Pay attention to device-specific _PXM node values"

This bug is masked and not triggered on my test AMD machine.
But it should still exist, since dev->numa_node info can be set by other
methods on other archs when using the nr_cpus param.

Signed-off-by: Pingfan Liu 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Mike Rapoport 
Cc: Bjorn Helgaas 
Cc: Jonathan Cameron 
---
 include/linux/gfp.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 76f8db0..8324953 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -453,6 +453,8 @@ static inline int gfp_zonelist(gfp_t flags)
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
+   if (unlikely(!node_online(nid)))
+   nid = first_online_node;
   return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }