Re: [PATCH 1/2] x86, numa: always initialize all possible nodes

2019-06-26 Thread Michal Hocko
On Thu 02-05-19 09:00:31, Michal Hocko wrote:
> On Wed 01-05-19 15:12:32, Barret Rhoden wrote:
> [...]
> > A more elegant solution may be to avoid registering with sysfs during early
> > boot, or something else entirely.  But I figured I'd ask for help at this
> > point.  =)
> 
> Thanks for the report and an excellent analysis! This is really helpful.
> I will think about this some more but I am traveling this week. It seems
> really awkward to register a sysfs file for an empty range. That looks
> like a bug to me.

I am sorry, but I didn't get to this for a long time and I am still
busy. The patch has been dropped from the mm tree (thus linux-next). I
hope I can revisit this or somebody else will take over and finish this
work. This is much more trickier than I anticipated unfortunately.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] x86, numa: always initialize all possible nodes

2019-05-02 Thread Michal Hocko
On Wed 01-05-19 15:12:32, Barret Rhoden wrote:
[...]
> A more elegant solution may be to avoid registering with sysfs during early
> boot, or something else entirely.  But I figured I'd ask for help at this
> point.  =)

Thanks for the report and an excellent analysis! This is really helpful.
I will think about this some more but I am traveling this week. It seems
really awkward to register a sysfs file for an empty range. That looks
like a bug to me.

-- 
Michal Hocko
SUSE Labs


[PATCH 1/2] x86, numa: always initialize all possible nodes

2019-02-12 Thread Michal Hocko
From: Michal Hocko 

Pingfan Liu has reported the following splat
[5.772742] BUG: unable to handle kernel paging request at 2088
[5.773618] PGD 0 P4D 0
[5.773618] Oops:  [#1] SMP NOPTI
[5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
[5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
06/29/2018
[5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
[5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 08 
0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
e1 44 89 e6 89
[5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
[5.773618] RAX:  RBX: 006012c0 RCX: 
[5.773618] RDX:  RSI: 0002 RDI: 2080
[5.773618] RBP: 006012c0 R08:  R09: 0002
[5.773618] R10: 006080c0 R11: 0002 R12: 
[5.773618] R13: 0001 R14:  R15: 0002
[5.773618] FS:  () GS:8c69afe0() 
knlGS:
[5.773618] CS:  0010 DS:  ES:  CR0: 80050033
[5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 003406e0
[5.773618] Call Trace:
[5.773618]  new_slab+0xa9/0x570
[5.773618]  ___slab_alloc+0x375/0x540
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  __slab_alloc+0x1c/0x38
[5.773618]  __kmalloc_node_track_caller+0xc8/0x270
[5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  devm_kmalloc+0x28/0x60
[5.773618]  pinctrl_bind_pins+0x2b/0x2a0
[5.773618]  really_probe+0x73/0x420
[5.773618]  driver_probe_device+0x115/0x130
[5.773618]  __driver_attach+0x103/0x110
[5.773618]  ? driver_probe_device+0x130/0x130
[5.773618]  bus_for_each_dev+0x67/0xc0
[5.773618]  ? klist_add_tail+0x3b/0x70
[5.773618]  bus_add_driver+0x41/0x260
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  driver_register+0x5b/0xe0
[5.773618]  ? pcie_port_setup+0x4d/0x4d
[5.773618]  do_one_initcall+0x4e/0x1d4
[5.773618]  ? init_setup+0x25/0x28
[5.773618]  kernel_init_freeable+0x1c1/0x26e
[5.773618]  ? loglevel+0x5b/0x5b
[5.773618]  ? rest_init+0xb0/0xb0
[5.773618]  kernel_init+0xa/0x110
[5.773618]  ret_from_fork+0x22/0x40
[5.773618] Modules linked in:
[5.773618] CR2: 2088
[5.773618] ---[ end trace 1030c9120a03d081 ]---

with his AMD machine with the following topology
  NUMA node0 CPU(s): 0,8,16,24
  NUMA node1 CPU(s): 2,10,18,26
  NUMA node2 CPU(s): 4,12,20,28
  NUMA node3 CPU(s): 6,14,22,30
  NUMA node4 CPU(s): 1,9,17,25
  NUMA node5 CPU(s): 3,11,19,27
  NUMA node6 CPU(s): 5,13,21,29
  NUMA node7 CPU(s): 7,15,23,31

[0.007418] Early memory node ranges
[0.007419]   node   1: [mem 0x1000-0x0008efff]
[0.007420]   node   1: [mem 0x0009-0x0009]
[0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
[0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
[0.007423]   node   1: [mem 0x6c528000-0x6fff]
[0.007424]   node   1: [mem 0x0001-0x00047fff]
[0.007425]   node   5: [mem 0x00048000-0x00087eff]

and nr_cpus set to 4. The underlying reason is tha the device is bound
to node 2 which doesn't have any memory and init_cpu_to_node only
initializes memory-less nodes for possible cpus which nr_cpus restrics.
This in turn means that proper zonelists are not allocated and the page
allocator blows up.

Fix the issue by reworking how x86 initializes the memory less nodes.
The current implementation is hacked into the workflow and it doesn't
allow any flexibility. There is init_memory_less_node called for each
offline node that has a CPU as already mentioned above. This will make
sure that we will have a new online node without any memory. Much later
on we build a zone list for this node and things seem to work, except
they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
make much sense to consider an empty node as online because we just
consider this node whenever we want to iterate nodes to use and empty
node is obviously not the best candidate. This is all just too fragile.

The new code relies on the arch specific initialization to allocate all
possible NUMA nodes (including memory less) - numa_register_memblks in
this case. Generic code then initializes both zonelists (__build_all_zonelists)
and allocator internals (free_area_init_nodes) for all non-null pgdats
rather than online ones.

For the x86 specific part also do not make new node online in alloc_node_data
because this is too early to know that. numa_register_memblks knows that
a node has some memory so it can make the node online appropriately.