Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-02-11 Thread Michal Hocko
On Mon 11-02-19 14:49:09, Ingo Molnar wrote:
> 
> * Michal Hocko  wrote:
> 
> > On Thu 24-01-19 11:10:50, Dave Hansen wrote:
> > > On 1/24/19 6:17 AM, Michal Hocko wrote:
> > > > and nr_cpus set to 4. The underlying reason is that the device is bound
> > > > to node 2 which doesn't have any memory and init_cpu_to_node only
> > > > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > > > This in turn means that proper zonelists are not allocated and the page
> > > > allocator blows up.
> > > 
> > > This looks OK to me.
> > > 
> > > Could we add a few DEBUG_VM checks that *look* for these invalid
> > > zonelists?  Or, would our existing list debugging have caught this?
> > 
> > Currently we simply blow up because those zonelists are NULL. I do not
> > think we have a way to check whether an existing zonelist is actually
> > _correct_ other than checking it for NULL. But what would we do in the
> > latter case?
> > 
> > > Basically, is this bug also a sign that we need better debugging around
> > > this?
> > 
> > My earlier patch had a debugging printk to display the zonelists and
> > that might be worthwhile I guess. Basically something like this
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e097f336126..c30d59f803fb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat)
> >  
> > build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
> > build_thisnode_zonelists(pgdat);
> > +
> > +   pr_info("node[%d] zonelist: ", pgdat->node_id);
> > +   for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
> > +   pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
> > +   pr_cont("\n");
> >  }
> 
> Looks like this patch fell through the cracks - any update on this?

I was waiting for some feedback. As there were no complaints about the
above debugging output I will make it a separate patch and post both
patches later this week. I just have to go through my backlog pile after
vacation.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-02-11 Thread Ingo Molnar


* Michal Hocko  wrote:

> On Thu 24-01-19 11:10:50, Dave Hansen wrote:
> > On 1/24/19 6:17 AM, Michal Hocko wrote:
> > > and nr_cpus set to 4. The underlying reason is that the device is bound
> > > to node 2 which doesn't have any memory and init_cpu_to_node only
> > > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > > This in turn means that proper zonelists are not allocated and the page
> > > allocator blows up.
> > 
> > This looks OK to me.
> > 
> > Could we add a few DEBUG_VM checks that *look* for these invalid
> > zonelists?  Or, would our existing list debugging have caught this?
> 
> Currently we simply blow up because those zonelists are NULL. I do not
> think we have a way to check whether an existing zonelist is actually
> _correct_ other than checking it for NULL. But what would we do in the
> latter case?
> 
> > Basically, is this bug also a sign that we need better debugging around
> > this?
> 
> My earlier patch had a debugging printk to display the zonelists and
> that might be worthwhile I guess. Basically something like this
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e097f336126..c30d59f803fb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>   build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
>   build_thisnode_zonelists(pgdat);
> +
> + pr_info("node[%d] zonelist: ", pgdat->node_id);
> +	for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
> + pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
> + pr_cont("\n");
>  }

Looks like this patch fell through the cracks - any update on this?

Thanks,

Ingo


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-25 Thread Mike Rapoport
On Fri, Jan 25, 2019 at 11:40:23AM +0100, Michal Hocko wrote:
> On Thu 24-01-19 19:51:44, Mike Rapoport wrote:
> > On Thu, Jan 24, 2019 at 03:17:27PM +0100, Michal Hocko wrote:
> > > a friendly ping for this. Does anybody see any problem with this
> > > approach?
> > 
> > FWIW, it looks fine to me.
> > 
> > It'd just be nice to have a few more words in the changelog about *how* the
> > x86 init was reworked ;-)
> 
> Heh, I thought it was there but nope... It probably just existed in my
> head. Sorry about that. What about the following paragraphs added?
> "
> The new code relies on the arch-specific initialization to allocate all
> possible NUMA nodes (including memoryless ones) - numa_register_memblks in
> this case. Generic code then initializes both zonelists (__build_all_zonelists)
> and allocator internals (free_area_init_nodes) for all non-null pgdats
> rather than only online ones.
> 
> For the x86-specific part, also do not make a new node online in
> alloc_node_data because it is too early to know that. numa_register_memblks
> knows that a node has some memory so it can make the node online
> appropriately. The init_memory_less_node hack can be safely removed
> altogether now.
> "

LGTM, thanks!
 
> -- 
> Michal Hocko
> SUSE Labs
> 

-- 
Sincerely yours,
Mike.



Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-25 Thread Michal Hocko
On Thu 24-01-19 11:10:50, Dave Hansen wrote:
> On 1/24/19 6:17 AM, Michal Hocko wrote:
> > and nr_cpus set to 4. The underlying reason is that the device is bound
> > to node 2 which doesn't have any memory and init_cpu_to_node only
> > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > This in turn means that proper zonelists are not allocated and the page
> > allocator blows up.
> 
> This looks OK to me.
> 
> Could we add a few DEBUG_VM checks that *look* for these invalid
> zonelists?  Or, would our existing list debugging have caught this?

Currently we simply blow up because those zonelists are NULL. I do not
think we have a way to check whether an existing zonelist is actually
_correct_ other than checking it for NULL. But what would we do in the
latter case?
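
For illustration only, such a NULL check could look roughly like the sketch
below. This is hypothetical kernel-style C, not part of any posted patch;
the function name check_node_zonelists is made up, and it assumes it would
be run once zonelists are supposed to exist:

/* Hypothetical sketch: complain if a node's pgdat or fallback
 * zonelist was never set up. */
static void __init check_node_zonelists(void)
{
        int nid;

        for_each_node(nid) {            /* all possible nodes */
                pg_data_t *pgdat = NODE_DATA(nid);

                /* the reported crash is a NULL pgdat dereference */
                VM_WARN_ON(!pgdat);
                if (!pgdat)
                        continue;
                /* an empty fallback zonelist is just as fatal */
                VM_WARN_ON(!pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs[0].zone);
        }
}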

> Basically, is this bug also a sign that we need better debugging around
> this?

My earlier patch had a debugging printk to display the zonelists and
that might be worthwhile I guess. Basically something like this

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e097f336126..c30d59f803fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat)
 
build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
build_thisnode_zonelists(pgdat);
+
+   pr_info("node[%d] zonelist: ", pgdat->node_id);
+	for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
+   pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
+   pr_cont("\n");
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
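
Note the hunk above also needs "struct zone *zone;" and "struct zoneref *z;"
declared in build_zonelists to compile; that part is not shown. With the
topology from Pingfan's report the dump would print one line per node along
the lines of this illustrative (not captured) output:

	node[1] zonelist: 1:Normal 1:DMA32 1:DMA 5:Normal
	node[5] zonelist: 5:Normal 1:Normal 1:DMA32 1:DMA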
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-25 Thread Michal Hocko
On Thu 24-01-19 19:51:44, Mike Rapoport wrote:
> On Thu, Jan 24, 2019 at 03:17:27PM +0100, Michal Hocko wrote:
> > a friendly ping for this. Does anybody see any problem with this
> > approach?
> 
> FWIW, it looks fine to me.
> 
> It'd just be nice to have a few more words in the changelog about *how* the
> x86 init was reworked ;-)

Heh, I thought it was there but nope... It probably just existed in my
head. Sorry about that. What about the following paragraphs added?
"
The new code relies on the arch-specific initialization to allocate all
possible NUMA nodes (including memoryless ones) - numa_register_memblks in
this case. Generic code then initializes both zonelists (__build_all_zonelists)
and allocator internals (free_area_init_nodes) for all non-null pgdats
rather than only online ones.

For the x86-specific part, also do not make a new node online in
alloc_node_data because it is too early to know that. numa_register_memblks
knows that a node has some memory so it can make the node online
appropriately. The init_memory_less_node hack can be safely removed
altogether now.
"

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Dave Hansen
On 1/24/19 6:17 AM, Michal Hocko wrote:
> and nr_cpus set to 4. The underlying reason is that the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restricts.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.

This looks OK to me.

Could we add a few DEBUG_VM checks that *look* for these invalid
zonelists?  Or, would our existing list debugging have caught this?

Basically, is this bug also a sign that we need better debugging around
this?


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Mike Rapoport
On Thu, Jan 24, 2019 at 03:17:27PM +0100, Michal Hocko wrote:
> a friendly ping for this. Does anybody see any problem with this
> approach?

FWIW, it looks fine to me.

It'd just be nice to have a few more words in the changelog about *how* the
x86 init was reworked ;-)
 
> On Mon 14-01-19 09:24:16, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Pingfan Liu has reported the following splat
> > [5.772742] BUG: unable to handle kernel paging request at 
> > 2088
> > [5.773618] PGD 0 P4D 0
> > [5.773618] Oops:  [#1] SMP NOPTI
> > [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> > [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
> > 06/29/2018
> > [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> > [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da 
> > c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 
> > <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
> > e1 44 89 e6 89
> > [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> > [5.773618] RAX:  RBX: 006012c0 RCX: 
> > 
> > [5.773618] RDX:  RSI: 0002 RDI: 
> > 2080
> > [5.773618] RBP: 006012c0 R08:  R09: 
> > 0002
> > [5.773618] R10: 006080c0 R11: 0002 R12: 
> > 
> > [5.773618] R13: 0001 R14:  R15: 
> > 0002
> > [5.773618] FS:  () GS:8c69afe0() 
> > knlGS:
> > [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> > [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 
> > 003406e0
> > [5.773618] Call Trace:
> > [5.773618]  new_slab+0xa9/0x570
> > [5.773618]  ___slab_alloc+0x375/0x540
> > [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  __slab_alloc+0x1c/0x38
> > [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> > [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  devm_kmalloc+0x28/0x60
> > [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  really_probe+0x73/0x420
> > [5.773618]  driver_probe_device+0x115/0x130
> > [5.773618]  __driver_attach+0x103/0x110
> > [5.773618]  ? driver_probe_device+0x130/0x130
> > [5.773618]  bus_for_each_dev+0x67/0xc0
> > [5.773618]  ? klist_add_tail+0x3b/0x70
> > [5.773618]  bus_add_driver+0x41/0x260
> > [5.773618]  ? pcie_port_setup+0x4d/0x4d
> > [5.773618]  driver_register+0x5b/0xe0
> > [5.773618]  ? pcie_port_setup+0x4d/0x4d
> > [5.773618]  do_one_initcall+0x4e/0x1d4
> > [5.773618]  ? init_setup+0x25/0x28
> > [5.773618]  kernel_init_freeable+0x1c1/0x26e
> > [5.773618]  ? loglevel+0x5b/0x5b
> > [5.773618]  ? rest_init+0xb0/0xb0
> > [5.773618]  kernel_init+0xa/0x110
> > [5.773618]  ret_from_fork+0x22/0x40
> > [5.773618] Modules linked in:
> > [5.773618] CR2: 2088
> > [5.773618] ---[ end trace 1030c9120a03d081 ]---
> > 
> > with his AMD machine with the following topology
> >   NUMA node0 CPU(s): 0,8,16,24
> >   NUMA node1 CPU(s): 2,10,18,26
> >   NUMA node2 CPU(s): 4,12,20,28
> >   NUMA node3 CPU(s): 6,14,22,30
> >   NUMA node4 CPU(s): 1,9,17,25
> >   NUMA node5 CPU(s): 3,11,19,27
> >   NUMA node6 CPU(s): 5,13,21,29
> >   NUMA node7 CPU(s): 7,15,23,31
> > 
> > [0.007418] Early memory node ranges
> > [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > [0.007420]   node   1: [mem 0x0009-0x0009]
> > [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> > 
> > and nr_cpus set to 4. The underlying reason is that the device is bound
> > to node 2 which doesn't have any memory and init_cpu_to_node only
> > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > This in turn means that proper zonelists are not allocated and the page
> > allocator blows up.
> > 
> > Fix the issue by reworking how x86 initializes memoryless nodes.
> > The current implementation is hacked into the workflow and it doesn't
> > allow any flexibility. init_memory_less_node is called for each
> > offline node that has a CPU, as already mentioned above. This will make
> > sure that we will have a new online node without any memory. Much later
> > on we build a zone list for this node and things seem to work, except
> > they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> > make much sense to consider an empty node as online, because we then
> > consider this node whenever we iterate over nodes to use, and an empty
> > node is obviously not the best candidate. This is all just too fragile.

Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Michal Hocko
a friendly ping for this. Does anybody see any problem with this
approach?

On Mon 14-01-19 09:24:16, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Pingfan Liu has reported the following splat
> [5.772742] BUG: unable to handle kernel paging request at 2088
> [5.773618] PGD 0 P4D 0
> [5.773618] Oops:  [#1] SMP NOPTI
> [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
> 06/29/2018
> [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
> ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 
> 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
> e1 44 89 e6 89
> [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> [5.773618] RAX:  RBX: 006012c0 RCX: 
> 
> [5.773618] RDX:  RSI: 0002 RDI: 
> 2080
> [5.773618] RBP: 006012c0 R08:  R09: 
> 0002
> [5.773618] R10: 006080c0 R11: 0002 R12: 
> 
> [5.773618] R13: 0001 R14:  R15: 
> 0002
> [5.773618] FS:  () GS:8c69afe0() 
> knlGS:
> [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 
> 003406e0
> [5.773618] Call Trace:
> [5.773618]  new_slab+0xa9/0x570
> [5.773618]  ___slab_alloc+0x375/0x540
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  __slab_alloc+0x1c/0x38
> [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  devm_kmalloc+0x28/0x60
> [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  really_probe+0x73/0x420
> [5.773618]  driver_probe_device+0x115/0x130
> [5.773618]  __driver_attach+0x103/0x110
> [5.773618]  ? driver_probe_device+0x130/0x130
> [5.773618]  bus_for_each_dev+0x67/0xc0
> [5.773618]  ? klist_add_tail+0x3b/0x70
> [5.773618]  bus_add_driver+0x41/0x260
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  driver_register+0x5b/0xe0
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  do_one_initcall+0x4e/0x1d4
> [5.773618]  ? init_setup+0x25/0x28
> [5.773618]  kernel_init_freeable+0x1c1/0x26e
> [5.773618]  ? loglevel+0x5b/0x5b
> [5.773618]  ? rest_init+0xb0/0xb0
> [5.773618]  kernel_init+0xa/0x110
> [5.773618]  ret_from_fork+0x22/0x40
> [5.773618] Modules linked in:
> [5.773618] CR2: 2088
> [5.773618] ---[ end trace 1030c9120a03d081 ]---
> 
> with his AMD machine with the following topology
>   NUMA node0 CPU(s): 0,8,16,24
>   NUMA node1 CPU(s): 2,10,18,26
>   NUMA node2 CPU(s): 4,12,20,28
>   NUMA node3 CPU(s): 6,14,22,30
>   NUMA node4 CPU(s): 1,9,17,25
>   NUMA node5 CPU(s): 3,11,19,27
>   NUMA node6 CPU(s): 5,13,21,29
>   NUMA node7 CPU(s): 7,15,23,31
> 
> [0.007418] Early memory node ranges
> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> [0.007420]   node   1: [mem 0x0009-0x0009]
> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> 
> and nr_cpus set to 4. The underlying reason is that the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restricts.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.
> 
> Fix the issue by reworking how x86 initializes memoryless nodes.
> The current implementation is hacked into the workflow and it doesn't
> allow any flexibility. init_memory_less_node is called for each
> offline node that has a CPU, as already mentioned above. This will make
> sure that we will have a new online node without any memory. Much later
> on we build a zone list for this node and things seem to work, except
> they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> make much sense to consider an empty node as online, because we then
> consider this node whenever we iterate over nodes to use, and an empty
> node is obviously not the best candidate. This is all just too fragile.
> 
> Reported-by: Pingfan Liu 
> Tested-by: Pingfan Liu 
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> I am sending this as an RFC because I am not sure this is the proper way
> to go myself. I am especially not sure about other architectures
> supporting memoryless nodes (ppc and ia64 AFAICS or are there more?).

Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-14 Thread Pingfan Liu
[...]
> >
> > I would appreciate help with those architectures because I couldn't
> > really grasp how the memoryless nodes are really initialized there. E.g.
> > ppc only seems to call setup_node_data for online nodes but I couldn't
> > find any special treatment for nodes without any memory.
>
> We have a somewhat dubious hack in our hotplug code, see:
>
> e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes")
>
> Which basically onlines the node when we hotplug a CPU into it.
>
This bug should be related to the state of the NUMA nodes at boot time.
On PowerNV and PSeries, the boot code does not seem to bring up
memoryless nodes, so it cannot avoid this bug.

Thanks,
Pingfan


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-14 Thread Michal Hocko
On Mon 14-01-19 21:26:39, Michael Ellerman wrote:
> Michal Hocko  writes:
> 
> > From: Michal Hocko 
> >
> > Pingfan Liu has reported the following splat
> > [5.772742] BUG: unable to handle kernel paging request at 
> > 2088
> > [5.773618] PGD 0 P4D 0
> > [5.773618] Oops:  [#1] SMP NOPTI
> > [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> > [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
> > 06/29/2018
> > [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> > [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da 
> > c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 
> > <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
> > e1 44 89 e6 89
> > [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> > [5.773618] RAX:  RBX: 006012c0 RCX: 
> > 
> > [5.773618] RDX:  RSI: 0002 RDI: 
> > 2080
> > [5.773618] RBP: 006012c0 R08:  R09: 
> > 0002
> > [5.773618] R10: 006080c0 R11: 0002 R12: 
> > 
> > [5.773618] R13: 0001 R14:  R15: 
> > 0002
> > [5.773618] FS:  () GS:8c69afe0() 
> > knlGS:
> > [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> > [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 
> > 003406e0
> > [5.773618] Call Trace:
> > [5.773618]  new_slab+0xa9/0x570
> > [5.773618]  ___slab_alloc+0x375/0x540
> > [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  __slab_alloc+0x1c/0x38
> > [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> > [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  devm_kmalloc+0x28/0x60
> > [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  really_probe+0x73/0x420
> > [5.773618]  driver_probe_device+0x115/0x130
> > [5.773618]  __driver_attach+0x103/0x110
> > [5.773618]  ? driver_probe_device+0x130/0x130
> > [5.773618]  bus_for_each_dev+0x67/0xc0
> > [5.773618]  ? klist_add_tail+0x3b/0x70
> > [5.773618]  bus_add_driver+0x41/0x260
> > [5.773618]  ? pcie_port_setup+0x4d/0x4d
> > [5.773618]  driver_register+0x5b/0xe0
> > [5.773618]  ? pcie_port_setup+0x4d/0x4d
> > [5.773618]  do_one_initcall+0x4e/0x1d4
> > [5.773618]  ? init_setup+0x25/0x28
> > [5.773618]  kernel_init_freeable+0x1c1/0x26e
> > [5.773618]  ? loglevel+0x5b/0x5b
> > [5.773618]  ? rest_init+0xb0/0xb0
> > [5.773618]  kernel_init+0xa/0x110
> > [5.773618]  ret_from_fork+0x22/0x40
> > [5.773618] Modules linked in:
> > [5.773618] CR2: 2088
> > [5.773618] ---[ end trace 1030c9120a03d081 ]---
> >
> > with his AMD machine with the following topology
> >   NUMA node0 CPU(s): 0,8,16,24
> >   NUMA node1 CPU(s): 2,10,18,26
> >   NUMA node2 CPU(s): 4,12,20,28
> >   NUMA node3 CPU(s): 6,14,22,30
> >   NUMA node4 CPU(s): 1,9,17,25
> >   NUMA node5 CPU(s): 3,11,19,27
> >   NUMA node6 CPU(s): 5,13,21,29
> >   NUMA node7 CPU(s): 7,15,23,31
> >
> > [0.007418] Early memory node ranges
> > [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > [0.007420]   node   1: [mem 0x0009-0x0009]
> > [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> >
> > and nr_cpus set to 4. The underlying reason is that the device is bound
> > to node 2 which doesn't have any memory and init_cpu_to_node only
> > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > This in turn means that proper zonelists are not allocated and the page
> > allocator blows up.
> >
> > Fix the issue by reworking how x86 initializes memoryless nodes.
> > The current implementation is hacked into the workflow and it doesn't
> > allow any flexibility. init_memory_less_node is called for each
> > offline node that has a CPU, as already mentioned above. This will make
> > sure that we will have a new online node without any memory. Much later
> > on we build a zone list for this node and things seem to work, except
> > they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> > make much sense to consider an empty node as online, because we then
> > consider this node whenever we iterate over nodes to use, and an empty
> > node is obviously not the best candidate. This is all just too fragile.
> >
> > Reported-by: Pingfan Liu 
> > Tested-by: Pingfan Liu 
> > 

Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-14 Thread Michael Ellerman
Michal Hocko  writes:

> From: Michal Hocko 
>
> Pingfan Liu has reported the following splat
> [5.772742] BUG: unable to handle kernel paging request at 2088
> [5.773618] PGD 0 P4D 0
> [5.773618] Oops:  [#1] SMP NOPTI
> [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
> 06/29/2018
> [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
> ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 
> 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
> e1 44 89 e6 89
> [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> [5.773618] RAX:  RBX: 006012c0 RCX: 
> 
> [5.773618] RDX:  RSI: 0002 RDI: 
> 2080
> [5.773618] RBP: 006012c0 R08:  R09: 
> 0002
> [5.773618] R10: 006080c0 R11: 0002 R12: 
> 
> [5.773618] R13: 0001 R14:  R15: 
> 0002
> [5.773618] FS:  () GS:8c69afe0() 
> knlGS:
> [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 
> 003406e0
> [5.773618] Call Trace:
> [5.773618]  new_slab+0xa9/0x570
> [5.773618]  ___slab_alloc+0x375/0x540
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  __slab_alloc+0x1c/0x38
> [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  devm_kmalloc+0x28/0x60
> [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  really_probe+0x73/0x420
> [5.773618]  driver_probe_device+0x115/0x130
> [5.773618]  __driver_attach+0x103/0x110
> [5.773618]  ? driver_probe_device+0x130/0x130
> [5.773618]  bus_for_each_dev+0x67/0xc0
> [5.773618]  ? klist_add_tail+0x3b/0x70
> [5.773618]  bus_add_driver+0x41/0x260
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  driver_register+0x5b/0xe0
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  do_one_initcall+0x4e/0x1d4
> [5.773618]  ? init_setup+0x25/0x28
> [5.773618]  kernel_init_freeable+0x1c1/0x26e
> [5.773618]  ? loglevel+0x5b/0x5b
> [5.773618]  ? rest_init+0xb0/0xb0
> [5.773618]  kernel_init+0xa/0x110
> [5.773618]  ret_from_fork+0x22/0x40
> [5.773618] Modules linked in:
> [5.773618] CR2: 2088
> [5.773618] ---[ end trace 1030c9120a03d081 ]---
>
> with his AMD machine with the following topology
>   NUMA node0 CPU(s): 0,8,16,24
>   NUMA node1 CPU(s): 2,10,18,26
>   NUMA node2 CPU(s): 4,12,20,28
>   NUMA node3 CPU(s): 6,14,22,30
>   NUMA node4 CPU(s): 1,9,17,25
>   NUMA node5 CPU(s): 3,11,19,27
>   NUMA node6 CPU(s): 5,13,21,29
>   NUMA node7 CPU(s): 7,15,23,31
>
> [0.007418] Early memory node ranges
> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> [0.007420]   node   1: [mem 0x0009-0x0009]
> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
>
> and nr_cpus set to 4. The underlying reason is that the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restricts.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.
>
> Fix the issue by reworking how x86 initializes memoryless nodes.
> The current implementation is hacked into the workflow and it doesn't
> allow any flexibility. init_memory_less_node is called for each
> offline node that has a CPU, as already mentioned above. This will make
> sure that we will have a new online node without any memory. Much later
> on we build a zone list for this node and things seem to work, except
> they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> make much sense to consider an empty node as online, because we then
> consider this node whenever we iterate over nodes to use, and an empty
> node is obviously not the best candidate. This is all just too fragile.
>
> Reported-by: Pingfan Liu 
> Tested-by: Pingfan Liu 
> Signed-off-by: Michal Hocko 
> ---
>
> Hi,
> I am sending this as an RFC because I am not sure this is the proper way
> to go myself. I am especially not sure about other architectures
> supporting memoryless nodes (ppc and ia64 AFAICS or are there more?).
>
> I would appreciate help with those architectures because I couldn't
> really grasp how the memoryless nodes are really initialized there.

[RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-14 Thread Michal Hocko
From: Michal Hocko 

Pingfan Liu has reported the following splat
[5.772742] BUG: unable to handle kernel paging request at 2088
[5.773618] PGD 0 P4D 0
[5.773618] Oops:  [#1] SMP NOPTI
[5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
06/29/2018
[5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 
0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
e1 44 89 e6 89
[5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
[5.773618] RAX:  RBX: 006012c0 RCX: 
[5.773618] RDX:  RSI: 0002 RDI: 2080
[5.773618] RBP: 006012c0 R08:  R09: 0002
[5.773618] R10: 006080c0 R11: 0002 R12: 
[5.773618] R13: 0001 R14:  R15: 0002
[5.773618] FS:  () GS:8c69afe0() 
knlGS:
[5.773618] CS:  0010 DS:  ES:  CR0: 80050033
[5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
[5.773618] Call Trace:
[5.773618]  new_slab+0xa9/0x570
[5.773618]  ___slab_alloc+0x375/0x540
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  __slab_alloc+0x1c/0x38
[5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  devm_kmalloc+0x28/0x60
[5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  really_probe+0x73/0x420
[5.773618]  driver_probe_device+0x115/0x130
[5.773618]  __driver_attach+0x103/0x110
[5.773618]  ? driver_probe_device+0x130/0x130
[5.773618]  bus_for_each_dev+0x67/0xc0
[5.773618]  ? klist_add_tail+0x3b/0x70
[5.773618]  bus_add_driver+0x41/0x260
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  driver_register+0x5b/0xe0
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  do_one_initcall+0x4e/0x1d4
[5.773618]  ? init_setup+0x25/0x28
[5.773618]  kernel_init_freeable+0x1c1/0x26e
[5.773618]  ? loglevel+0x5b/0x5b
[5.773618]  ? rest_init+0xb0/0xb0
[5.773618]  kernel_init+0xa/0x110
[5.773618]  ret_from_fork+0x22/0x40
[5.773618] Modules linked in:
[5.773618] CR2: 2088
[5.773618] ---[ end trace 1030c9120a03d081 ]---

with his AMD machine with the following topology
  NUMA node0 CPU(s): 0,8,16,24
  NUMA node1 CPU(s): 2,10,18,26
  NUMA node2 CPU(s): 4,12,20,28
  NUMA node3 CPU(s): 6,14,22,30
  NUMA node4 CPU(s): 1,9,17,25
  NUMA node5 CPU(s): 3,11,19,27
  NUMA node6 CPU(s): 5,13,21,29
  NUMA node7 CPU(s): 7,15,23,31

[0.007418] Early memory node ranges
[0.007419]   node   1: [mem 0x1000-0x0008efff]
[0.007420]   node   1: [mem 0x0009-0x0009]
[0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
[0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
[0.007423]   node   1: [mem 0x6c528000-0x6fff]
[0.007424]   node   1: [mem 0x0001-0x00047fff]
[0.007425]   node   5: [mem 0x00048000-0x00087eff]

and nr_cpus set to 4. The underlying reason is that the device is bound
to node 2 which doesn't have any memory and init_cpu_to_node only
initializes memory-less nodes for possible cpus which nr_cpus restricts.
This in turn means that proper zonelists are not allocated and the page
allocator blows up.
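
The shape of the problem, as a rough sketch of the pre-patch x86 code
(abbreviated from init_cpu_to_node; treat it as an illustration rather
than a verbatim copy):

/* Only nodes that own a *possible* cpu get set up here, and
 * nr_cpus=4 shrinks the possible mask to cpus 0-3. With the
 * topology above those live on nodes 0, 1, 4 and 5, so the
 * memoryless node 2 is never initialized. */
for_each_possible_cpu(cpu) {
        int node = numa_cpu_node(cpu);

        if (node == NUMA_NO_NODE)
                continue;
        if (!node_online(node))
                init_memory_less_node(node);
        numa_set_node(cpu, node);
}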

Fix the issue by reworking how x86 initializes memoryless nodes.
The current implementation is hacked into the workflow and it doesn't
allow any flexibility. init_memory_less_node is called for each
offline node that has a CPU, as already mentioned above. This will make
sure that we will have a new online node without any memory. Much later
on we build a zone list for this node and things seem to work, except
they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
make much sense to consider an empty node as online, because we then
consider this node whenever we iterate over nodes to use, and an empty
node is obviously not the best candidate. This is all just too fragile.

Reported-by: Pingfan Liu 
Tested-by: Pingfan Liu 
Signed-off-by: Michal Hocko 
---

Hi,
I am sending this as an RFC because I am not sure this is the proper way
to go myself. I am especially not sure about other architectures
supporting memoryless nodes (ppc and ia64 AFAICS or are there more?).

I would appreciate help with those architectures because I couldn't
really grasp how the memoryless nodes are really initialized there. E.g.
ppc only seems to call setup_node_data for online nodes but I couldn't
find any special treatment for nodes without any memory.