From: Michal Hocko
Pingfan Liu has reported the following splat:
[5.772742] BUG: unable to handle kernel paging request at 2088
[5.773618] PGD 0 P4D 0
[5.773618] Oops: [#1] SMP NOPTI
[5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 06/29/2018
[5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89 e1 44 89 e6 89
[5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
[5.773618] RAX: RBX: 006012c0 RCX:
[5.773618] RDX: RSI: 0002 RDI: 2080
[5.773618] RBP: 006012c0 R08: R09: 0002
[5.773618] R10: 006080c0 R11: 0002 R12:
[5.773618] R13: 0001 R14: R15: 0002
[5.773618] FS: () GS:8c69afe0() knlGS:
[5.773618] CS: 0010 DS: ES: CR0: 80050033
[5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
[5.773618] Call Trace:
[5.773618] new_slab+0xa9/0x570
[5.773618] ___slab_alloc+0x375/0x540
[5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618] __slab_alloc+0x1c/0x38
[5.773618] __kmalloc_node_track_caller+0xc8/0x270
[5.773618] ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618] devm_kmalloc+0x28/0x60
[5.773618] pinctrl_bind_pins+0x2b/0x2a0
[5.773618] really_probe+0x73/0x420
[5.773618] driver_probe_device+0x115/0x130
[5.773618] __driver_attach+0x103/0x110
[5.773618] ? driver_probe_device+0x130/0x130
[5.773618] bus_for_each_dev+0x67/0xc0
[5.773618] ? klist_add_tail+0x3b/0x70
[5.773618] bus_add_driver+0x41/0x260
[5.773618] ? pcie_port_setup+0x4d/0x4d
[5.773618] driver_register+0x5b/0xe0
[5.773618] ? pcie_port_setup+0x4d/0x4d
[5.773618] do_one_initcall+0x4e/0x1d4
[5.773618] ? init_setup+0x25/0x28
[5.773618] kernel_init_freeable+0x1c1/0x26e
[5.773618] ? loglevel+0x5b/0x5b
[5.773618] ? rest_init+0xb0/0xb0
[5.773618] kernel_init+0xa/0x110
[5.773618] ret_from_fork+0x22/0x40
[5.773618] Modules linked in:
[5.773618] CR2: 2088
[5.773618] ---[ end trace 1030c9120a03d081 ]---
on his AMD machine with the following topology:
NUMA node0 CPU(s): 0,8,16,24
NUMA node1 CPU(s): 2,10,18,26
NUMA node2 CPU(s): 4,12,20,28
NUMA node3 CPU(s): 6,14,22,30
NUMA node4 CPU(s): 1,9,17,25
NUMA node5 CPU(s): 3,11,19,27
NUMA node6 CPU(s): 5,13,21,29
NUMA node7 CPU(s): 7,15,23,31
[0.007418] Early memory node ranges
[0.007419] node 1: [mem 0x1000-0x0008efff]
[0.007420] node 1: [mem 0x0009-0x0009]
[0.007422] node 1: [mem 0x0010-0x5c3d6fff]
[0.007422] node 1: [mem 0x643df000-0x68ff7fff]
[0.007423] node 1: [mem 0x6c528000-0x6fff]
[0.007424] node 1: [mem 0x0001-0x00047fff]
[0.007425] node 5: [mem 0x00048000-0x00087eff]
and nr_cpus set to 4. The underlying reason is that the device is bound
to node 2, which doesn't have any memory, and init_cpu_to_node only
initializes memory-less nodes for possible CPUs, which nr_cpus restricts.
This in turn means that proper zonelists are not allocated and the page
allocator blows up.
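The failure mode can be illustrated with a small user-space model. This is
only a sketch, not kernel code: struct node_state, cpu_to_node_id, old_init
and can_allocate_on are hypothetical names, the CPU-to-node mapping is derived
from the topology listing above, and it assumes that nr_cpus=4 leaves CPUs 0-3
as the possible ones.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES      8
#define NR_CPUS_TOTAL 32

/* Hypothetical stand-in for the per-node state the allocator needs. */
struct node_state {
	bool has_memory;
	bool initialized;	/* pgdat and zonelists are set up */
};

/* CPU -> node mapping taken from the reported topology. */
static int cpu_to_node_id(int cpu)
{
	int slot = cpu % 8;

	return (slot % 2 == 0) ? slot / 2 : 4 + slot / 2;
}

/*
 * Model of the old behaviour: nodes with memory are always set up,
 * memory-less nodes only if one of the possible CPUs sits on them.
 */
static void old_init(struct node_state nodes[NR_NODES], int nr_cpus)
{
	for (int n = 0; n < NR_NODES; n++) {
		nodes[n].has_memory = (n == 1 || n == 5);
		nodes[n].initialized = nodes[n].has_memory;
	}

	/* Assumption: nr_cpus=N leaves CPUs 0..N-1 possible. */
	for (int cpu = 0; cpu < nr_cpus && cpu < NR_CPUS_TOTAL; cpu++)
		nodes[cpu_to_node_id(cpu)].initialized = true;
}

/* An allocation targeting nid can only work if the node was set up. */
static bool can_allocate_on(const struct node_state nodes[NR_NODES], int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	return nodes[nid].initialized;
}
```

In this model, old_init(nodes, 4) sets up nodes 0, 1, 4 and 5 only, so
can_allocate_on(nodes, 2) is false, which corresponds to the device bound to
node 2 finding no usable zonelists.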
Fix the issue by reworking how x86 initializes memory-less nodes.
The current implementation is hacked into the workflow and it doesn't
allow any flexibility. init_memory_less_node is called for each
offline node that has a CPU, as already mentioned above. This makes
sure that we have a new online node without any memory. Much later
on we build a zonelist for this node and things seem to work, except
they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
make much sense to consider an empty node as online, because we would
consider this node whenever we iterate over nodes to use, and an empty
node is obviously not the best candidate. This is all just too fragile.
The new code relies on the arch-specific initialization to allocate all
possible NUMA nodes (including memory-less ones) - numa_register_memblks in
this case. Generic code then initializes both zonelists (__build_all_zonelists)
and allocator internals (free_area_init_nodes) for all non-null pgdats
rather than online ones.
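The idea of walking all non-null pgdats instead of online nodes can be
sketched in user space as follows. This is an illustration under stated
assumptions, not the patch itself: struct pgdat_sketch, register_nodes,
build_all_zonelists_sketch and alloc_node_for are hypothetical names, and the
fallback order is simply the local node followed by node-id order, whereas the
kernel orders fallbacks by NUMA distance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 8

/* Hypothetical, heavily reduced stand-in for pg_data_t. */
struct pgdat_sketch {
	bool online;			/* node has memory */
	int nr_fallback;
	int fallback[NR_NODES];		/* nodes to try, in order */
};

static struct pgdat_sketch node_storage[NR_NODES];
static struct pgdat_sketch *node_data[NR_NODES];

/* Arch side: allocate every possible node, online only those with memory. */
static void register_nodes(const bool possible[NR_NODES],
			   const bool has_memory[NR_NODES])
{
	for (int n = 0; n < NR_NODES; n++) {
		if (!possible[n]) {
			node_data[n] = NULL;
			continue;
		}
		node_data[n] = &node_storage[n];
		node_data[n]->online = has_memory[n];
		node_data[n]->nr_fallback = 0;
	}
}

/* Generic side: build a fallback list for every non-NULL pgdat. */
static void build_all_zonelists_sketch(void)
{
	for (int n = 0; n < NR_NODES; n++) {
		struct pgdat_sketch *pgdat = node_data[n];

		if (!pgdat)
			continue;

		pgdat->nr_fallback = 0;
		if (pgdat->online)	/* local memory comes first */
			pgdat->fallback[pgdat->nr_fallback++] = n;
		for (int m = 0; m < NR_NODES; m++)
			if (m != n && node_data[m] && node_data[m]->online)
				pgdat->fallback[pgdat->nr_fallback++] = m;
	}
}

/* Returns the node that would serve a request for nid, or -1 if none. */
static int alloc_node_for(int nid)
{
	assert(nid >= 0 && nid < NR_NODES);

	struct pgdat_sketch *pgdat = node_data[nid];

	if (!pgdat || pgdat->nr_fallback == 0)
		return -1;
	return pgdat->fallback[0];
}
```

With all eight nodes possible and memory only on nodes 1 and 5, node 2 stays
offline in this model but still gets a fallback list, so a request for node 2
is served from a node that has memory instead of dereferencing missing state.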
For the x86-specific part, also do not make a new node online in alloc_node_data
because this is too early to know that. numa_register_memblks knows whether
a node has some memory, so it can make the node online appropriately.