Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-06 Thread Andi Kleen
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen 
> Date:   Tue Jan 11 15:35:48 2005 -0800

I don't remember all the details, and I can't even find the commit
(is it in linux-historic?).

But AFAIK there's no guarantee PXMs are small and continuous, so it
seemed better to have a clean zero based space.

Back then we had a lot of problems with buggy SRAT tables in BIOS,
so we really tried to avoid trusting the BIOS as much as possible.

-Andi



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Srikar Dronamraju
* Michal Hocko  [2020-07-03 12:59:44]:

> > Honestly, I do not have any idea. I've traced it down to
> > Author: Andi Kleen 
> > Date:   Tue Jan 11 15:35:48 2005 -0800
> > 
> > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> > 
> > Fix fallout from the recent nodemask_t changes. The node ids assigned
> > in the SRAT parser were off by one.
> > 
> > I added a new first_unset_node() function to nodemask.h to allocate
> > IDs sanely.
> > 
> > Signed-off-by: Andi Kleen 
> > Signed-off-by: Linus Torvalds 
> > 
> > which doesn't really tell all that much. The historical baggage and a
> > long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

JFYI,
Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
hotplug on memoryless node. 

https://bugzilla.kernel.org/show_bug.cgi?id=202187

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>  On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> > Yep, looks like it.
> >
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> 
>  This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen 
> >> Date:   Tue Jan 11 15:35:48 2005 -0800
> >>
> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >> Fix fallout from the recent nodemask_t changes. The node ids assigned
> >> in the SRAT parser were off by one.
> >>
> >> I added a new first_unset_node() function to nodemask.h to allocate
> >> IDs sanely.
> >>
> >> Signed-off-by: Andi Kleen 
> >> Signed-off-by: Linus Torvalds 
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> > 
> > Thinking about this some more, this logic makes some sense afterall.
> > Especially in the world without memory hotplug which was very likely the
> > case back then. It is much better to have compact node mask rather than
> > sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> > 
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> > 
> 
> I think the ACPI code takes care of properly mapping PXM to nodes.
> 
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
> 
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U 
> /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U 
> /var/tmp/monitor
> 
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
> 
> [   50.174435] [ cut here ]
> [   50.175436] node 1 was absent from the node_possible_map
> [   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 
> add_memory_resource+0x8c/0x290

This would mean that the ACPI code or whoever does the remaping is not
adding the new node into possible nodes.

[...]
> I remember that we added that check just recently (due to powerpc if I am not 
> wrong).
> Not sure why that triggers here.

This was a misbehaving Qemu IIRC providing a garbage map.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread David Hildenbrand
On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
 On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
> Yep, looks like it.
>
> [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]

 This begs a question whether ppc can do the same thing?
>>> Or x86 stop doing it so that you can see on what node you are running?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen 
>> Date:   Tue Jan 11 15:35:48 2005 -0800
>>
>> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>> Fix fallout from the recent nodemask_t changes. The node ids assigned
>> in the SRAT parser were off by one.
>>
>> I added a new first_unset_node() function to nodemask.h to allocate
>> IDs sanely.
>>
>> Signed-off-by: Andi Kleen 
>> Signed-off-by: Linus Torvalds 
>>
>> which doesn't really tell all that much. The historical baggage and a
>> long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

I think the ACPI code takes care of properly mapping PXM to nodes.

So if I start with PXM 0 empty and PXM 1 populated, I will get
PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U 
/var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U 
/var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[   50.174435] [ cut here ]
[   50.175436] node 1 was absent from the node_possible_map
[   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 
add_memory_resource+0x8c/0x290
[   50.176844] Modules linked in:
[   50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[   50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[   50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 
0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[   50.176849] RSP: 0018:a7a1c0043d48 EFLAGS: 00010296
[   50.176850] RAX: 002c RBX: 8bc633e63b80 RCX: 
[   50.176851] RDX: 8bc63bc27060 RSI: 8bc63bc18d00 RDI: 8bc63bc18d00
[   50.176851] RBP: 0001 R08: 01e1 R09: a7a1c0043bd8
[   50.176852] R10: 0005 R11:  R12: 00014000
[   50.176852] R13: 00017fff R14: 4000 R15: 00018000
[   50.176853] FS:  () GS:8bc63bc0() 
knlGS:
[   50.176853] CS:  0010 DS:  ES:  CR0: 80050033
[   50.176855] CR2: 55dfcbfc5ee8 CR3: aca0a000 CR4: 06f0
[   50.176855] DR0:  DR1:  DR2: 
[   50.176856] DR3:  DR6: fffe0ff0 DR7: 0400
[   50.176856] Call Trace:
[   50.176856]  __add_memory+0x33/0x70
[   50.176857]  acpi_memory_device_add+0x132/0x2f2
[   50.176857]  acpi_bus_attach+0xd2/0x200
[   50.176858]  acpi_bus_scan+0x33/0x70
[   50.176858]  acpi_device_hotplug+0x298/0x390
[   50.176858]  

Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> [Cc Andi]
> 
> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> [...]
> > > > Yep, looks like it.
> > > > 
> > > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > > 
> > > This begs a question whether ppc can do the same thing?
> > Or x86 stop doing it so that you can see on what node you are running?
> > 
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen 
> Date:   Tue Jan 11 15:35:48 2005 -0800
> 
> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> 
> Fix fallout from the recent nodemask_t changes. The node ids assigned
> in the SRAT parser were off by one.
> 
> I added a new first_unset_node() function to nodemask.h to allocate
> IDs sanely.
> 
> Signed-off-by: Andi Kleen 
> Signed-off-by: Linus Torvalds 
> 
> which doesn't really tell all that much. The historical baggage and a
> long term behavior which is not really trivial to fix I suspect.

Thinking about this some more, this logic makes some sense afterall.
Especially in the world without memory hotplug which was very likely the
case back then. It is much better to have compact node mask rather than
sparse one. After all node numbers shouldn't really matter as long as
you have a clear mapping to the HW. I am not sure we export that
information (except for the kernel ring buffer) though.

The memory hotplug changes that somehow because you can hotremove numa
nodes and therefore make the nodemask sparse but that is not a common
case. I am not sure what would happen if a completely new node was added
and its corresponding node was already used by the renumbered one
though. It would likely conflate the two I am afraid. But I am not sure
this is really possible with x86 and a lack of a bug report would
suggest that nobody is doing that at least.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Hocko
[Cc Andi]

On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
[...]
> > > Yep, looks like it.
> > > 
> > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > 
> > This begs a question whether ppc can do the same thing?
> Or x86 stop doing it so that you can see on what node you are running?
> 
> What's the point of this indirection other than another way of avoiding
> empty node 0?

Honestly, I do not have any idea. I've traced it down to
Author: Andi Kleen 
Date:   Tue Jan 11 15:35:48 2005 -0800

[PATCH] x86_64: Fix ACPI SRAT NUMA parsing

Fix fallout from the recent nodemask_t changes. The node ids assigned
in the SRAT parser were off by one.

I added a new first_unset_node() function to nodemask.h to allocate
IDs sanely.

Signed-off-by: Andi Kleen 
Signed-off-by: Linus Torvalds 

which doesn't really tell all that much. The historical baggage and a
long term behavior which is not really trivial to fix I suspect.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-03 Thread Michal Suchánek
On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand  [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >  * Michal Hocko  [2020-07-01 10:42:00]:
> > 
> > >
> > >>
> > >> 2. Also existence of dummy node also leads to inconsistent 
> > >> information. The
> > >> number of online nodes is inconsistent with the information in the
> > >> device-tree and resource-dump
> > >>
> > >> 3. When the dummy node is present, single node non-Numa systems end 
> > >> up showing
> > >> up as NUMA systems and numa_balancing gets enabled. This will mean 
> > >> we take
> > >> the hit from the unnecessary numa hinting faults.
> > >
> > > I have to say that I dislike the node online/offline state and 
> > > directly
> > > exporting that to the userspace. Users should only care whether the 
> > > node
> > > has memory/cpus. Numa nodes can be online without any memory. Just
> > > offline all the present memory blocks but do not physically hot remove
> > > them and you are in the same situation. If users are confused by an
> > > output of tools like numactl -H then those could be updated and hide
> > > nodes without any memory
> > >
> > > The autonuma problem sounds interesting but again this patch doesn't
> > > really solve the underlying problem because I strongly suspect that 
> > > the
> > > problem is still there when a numa node gets all its memory offline as
> > > mentioned above.
> 
> I would really appreciate a feedback to these two as well.
> 
> > > While I completely agree that making node 0 special is wrong, I have
> > > still hard time to review this very simply looking patch because all 
> > > the
> > > numa initialization is so spread around that this might just blow up
> > > at unexpected places. IIRC we have discussed testing in the previous
> > > version and David has provided a way to emulate these configurations
> > > on x86. Did you manage to use those instruction for additional testing
> > > on other than ppc architectures?
> > >
> > 
> >  I have tried all the steps that David mentioned and reported back at
> >  https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> > 
> >  As a summary, David's steps are still not creating a 
> >  memoryless/cpuless on
> >  x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless 
> > >> at
> > >> boot.  The code in question tries to handle a cpuless/memoryless node 0 
> > >> at
> > >> boot.
> > > 
> > > I was just correcting your statement, because it was wrong.
> > > 
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > > have CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > > 
> > 
> > Yep, looks like it.
> > 
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> 
> This begs a question whether ppc can do the same thing?
Or x86 stop doing it so that you can see on what node you are running?

What's the point of this indirection other than another way of avoiding
empty node 0?

Thanks

Michal


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-02 Thread Srikar Dronamraju
* Michal Hocko  [2020-07-02 10:41:23]:

> On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> > * Michal Hocko  [2020-07-01 14:21:10]:
> > 
> > > > > The autonuma problem sounds interesting but again this patch 
> > > > > doesn't
> > > > > really solve the underlying problem because I strongly suspect 
> > > > > that the
> > > > > problem is still there when a numa node gets all its memory 
> > > > > offline as
> > > > > mentioned above.
> > > 
> > > I would really appreciate a feedback to these two as well.
> > 
> > 1. Its not just numactl that's to be fixed but all tools/utilities that
> > depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> > in the output given by the kernel but do further verification?  
> 
> No, what we are saying is that even an online node might have zero
> number of online pages/cpus. So the online status is not really
> something that matters. If people are confused by that output then user
> space tools can make their confusion go away. I really do not understand
> why the kernel should do any logic there.

The user facing teams are saying they are getting queries from the users who
are unable to understand from the tools/sysfs files why a node is online and
but has no attached resources. Its the amount of time that is being spent on
these issues that triggered the patch. Initially even I was skeptical that
this was a non-issue.

> 
> > Also how would the user space differentiate between the case where the
> > Kernel missed marking a node as offline to the case where the memory was
> > offlined on a cpuless node but node wasn't offline?.
> 
> What I am arguing is that those two shouldn't be any different. Really!
> 
> > 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> > so if there is a performance hit, its something that's driven by his
> > user/admin actions. Also how often do we see users offline complete memory
> > of cpuless node on a 2 node system?
> 
> How often do we see crippled HW configurations like that? Really if
> autonuma should be made more clever for one case it should recognize the
> other as well.
> 

Lets take a 16 socket PowerVM system and assume that 32 lpars are created
on that socket, i.e 2 lpars for each socket. (PowerVM has the final say on
how the lpars are created.) In such a case, we can expect 30 out of the 32
lpars to face this problem, with the only 2 lpars that actually run on
socket 0 having the correct configuration.

> > > 
> > > This begs a question whether ppc can do the same thing?
> > 
> > Certainly ppc can be made to adapt to this situation but that would be a
> > workaround. Do we have a reason why we think node 0 is unique and special?
> 
> It is not. As replied in other email in this thread. I would hope for
> having less hacks in the numa initialization. Cleaning up the mess is
> would be a lot of work and testing on all NUMA capable architectures.
> This is a heritage from the past I am afraid. All that I am arguing here
> is that your touch to the generic code with a very simple looking patch
> might have side effects which are pretty much impossible to review.
> Moreover it seems that nothing but ppc really needs this treatment.
> So fixing it in ppc specific code sounds much more safe.
> 
> Normally I would really push for a generic solution but after getting
> burned several times in this area I do not dare anymore. The problem is
> not in the code complexity but in how spread it is in places where you
> do not expect side effects.
> 

I do understand and respect your viewpoint.

> -- 
> Michal Hocko
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-02 Thread Michal Hocko
On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> * Michal Hocko  [2020-07-01 14:21:10]:
> 
> > > >>
> > > >> 2. Also existence of dummy node also leads to inconsistent 
> > > >> information. The
> > > >> number of online nodes is inconsistent with the information in the
> > > >> device-tree and resource-dump
> > > >>
> > > >> 3. When the dummy node is present, single node non-Numa systems 
> > > >> end up showing
> > > >> up as NUMA systems and numa_balancing gets enabled. This will mean 
> > > >> we take
> > > >> the hit from the unnecessary numa hinting faults.
> > > >
> > > > I have to say that I dislike the node online/offline state and 
> > > > directly
> > > > exporting that to the userspace. Users should only care whether the 
> > > > node
> > > > has memory/cpus. Numa nodes can be online without any memory. Just
> > > > offline all the present memory blocks but do not physically hot 
> > > > remove
> > > > them and you are in the same situation. If users are confused by an
> > > > output of tools like numactl -H then those could be updated and hide
> > > > nodes without any memory
> > > >
> > > > The autonuma problem sounds interesting but again this patch doesn't
> > > > really solve the underlying problem because I strongly suspect that 
> > > > the
> > > > problem is still there when a numa node gets all its memory offline 
> > > > as
> > > > mentioned above.
> > 
> > I would really appreciate a feedback to these two as well.
> 
> 1. Its not just numactl that's to be fixed but all tools/utilities that
> depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> in the output given by the kernel but do further verification?  

No, what we are saying is that even an online node might have zero
number of online pages/cpus. So the online status is not really
something that matters. If people are confused by that output then user
space tools can make their confusion go away. I really do not understand
why the kernel should do any logic there.

> Also how would the user space differentiate between the case where the
> Kernel missed marking a node as offline to the case where the memory was
> offlined on a cpuless node but node wasn't offline?.

What I am arguing is that those two shouldn't be any different. Really!

> 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> so if there is a performance hit, its something that's driven by his
> user/admin actions. Also how often do we see users offline complete memory
> of cpuless node on a 2 node system?

How often do we see crippled HW configurations like that? Really if
autonuma should be made more clever for one case it should recognize the
other as well.

> > > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> > 
> > This begs a question whether ppc can do the same thing?
> 
> Certainly ppc can be made to adapt to this situation but that would be a
> workaround. Do we have a reason why we think node 0 is unique and special?

It is not. As replied in other email in this thread. I would hope for
having less hacks in the numa initialization. Cleaning up the mess is
would be a lot of work and testing on all NUMA capable architectures.
This is a heritage from the past I am afraid. All that I am arguing here
is that your touch to the generic code with a very simple looking patch
might have side effects which are pretty much impossible to review.
Moreover it seems that nothing but ppc really needs this treatment.
So fixing it in ppc specific code sounds much more safe.

Normally I would really push for a generic solution but after getting
burned several times in this area I do not dare anymore. The problem is
not in the code complexity but in how spread it is in places where you
do not expect side effects.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-02 Thread Srikar Dronamraju
* Michal Hocko  [2020-07-01 14:21:10]:

> > >>
> > >> 2. Also existence of dummy node also leads to inconsistent 
> > >> information. The
> > >> number of online nodes is inconsistent with the information in the
> > >> device-tree and resource-dump
> > >>
> > >> 3. When the dummy node is present, single node non-Numa systems end 
> > >> up showing
> > >> up as NUMA systems and numa_balancing gets enabled. This will mean 
> > >> we take
> > >> the hit from the unnecessary numa hinting faults.
> > >
> > > I have to say that I dislike the node online/offline state and 
> > > directly
> > > exporting that to the userspace. Users should only care whether the 
> > > node
> > > has memory/cpus. Numa nodes can be online without any memory. Just
> > > offline all the present memory blocks but do not physically hot remove
> > > them and you are in the same situation. If users are confused by an
> > > output of tools like numactl -H then those could be updated and hide
> > > nodes without any memory
> > >
> > > The autonuma problem sounds interesting but again this patch doesn't
> > > really solve the underlying problem because I strongly suspect that 
> > > the
> > > problem is still there when a numa node gets all its memory offline as
> > > mentioned above.
> 
> I would really appreciate a feedback to these two as well.

1. Its not just numactl that's to be fixed but all tools/utilities that
depend on /sys/devices/system/node/online. Are we saying to not rely/believe
in the output given by the kernel but do further verification?  

Also how would the user space differentiate between the case where the
Kernel missed marking a node as offline to the case where the memory was
offlined on a cpuless node but node wasn't offline?.

2. Regarding the autonuma, the case of offline memory is user/admin driven,
so if there is a performance hit, its something that's driven by his
user/admin actions. Also how often do we see users offline complete memory
of cpuless node on a 2 node system?

> 
> > [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> > [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> > [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]
> 
> This begs a question whether ppc can do the same thing?

Certainly ppc can be made to adapt to this situation but that would be a
workaround. Do we have a reason why we think node 0 is unique and special?
If yes can we document it so that in future also people know why we consider
node 0 to be special. I do understand the *fear of the unknown* but when we
are unable to theoretically or practically come up a case, then it may
probably be better we hit the situation to understand what that unknown is?

> I would swear that we've had x86 system with node 0 but I cannot really
> find it and it is possible that it was not x86 after all...

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko
On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote:
> * Christopher Lameter  [2020-06-29 14:58:40]:
> 
> > On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Maybe add something to explain why you are not simply mapping the
> > existing memory to NUMA node 0 which is after all just a numbering scheme
> > used by the kernel and can be used arbitrarily?
> > 
> 
> I thought Michal Hocko already gave a clear picture on why mapping is a bad
> idea. https://lore.kernel.org/lkml/20200316085425.gb11...@dhcp22.suse.cz/t/#u
> Are you suggesting that we add that as part of the changelog?

Well, I was not aware x86 already does renumber. So there is a certain
precendence. As I've said I do not really like that but this is what
already is happening. If renumbering is not an option then just handle
that in the ppc code explicitly. Generic solution would be preferable of
course but as I've said it is really hard to check for correctness and
potential subtle issues.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko
On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> On 01.07.20 13:06, David Hildenbrand wrote:
> > On 01.07.20 13:01, Srikar Dronamraju wrote:
> >> * David Hildenbrand  [2020-07-01 12:15:54]:
> >>
> >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>  * Michal Hocko  [2020-07-01 10:42:00]:
> 
> >
> >>
> >> 2. Also existence of dummy node also leads to inconsistent 
> >> information. The
> >> number of online nodes is inconsistent with the information in the
> >> device-tree and resource-dump
> >>
> >> 3. When the dummy node is present, single node non-Numa systems end up 
> >> showing
> >> up as NUMA systems and numa_balancing gets enabled. This will mean we 
> >> take
> >> the hit from the unnecessary numa hinting faults.
> >
> > I have to say that I dislike the node online/offline state and directly
> > exporting that to the userspace. Users should only care whether the node
> > has memory/cpus. Numa nodes can be online without any memory. Just
> > offline all the present memory blocks but do not physically hot remove
> > them and you are in the same situation. If users are confused by an
> > output of tools like numactl -H then those could be updated and hide
> > nodes without any memory
> >
> > The autonuma problem sounds interesting but again this patch doesn't
> > really solve the underlying problem because I strongly suspect that the
> > problem is still there when a numa node gets all its memory offline as
> > mentioned above.

I would really appreciate a feedback to these two as well.

> > While I completely agree that making node 0 special is wrong, I have
> > still hard time to review this very simply looking patch because all the
> > numa initialization is so spread around that this might just blow up
> > at unexpected places. IIRC we have discussed testing in the previous
> > version and David has provided a way to emulate these configurations
> > on x86. Did you manage to use those instruction for additional testing
> > on other than ppc architectures?
> >
> 
>  I have tried all the steps that David mentioned and reported back at
>  https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> 
>  As a summary, David's steps are still not creating a memoryless/cpuless 
>  on
>  x86 VM.
> >>>
> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> >>> online*. Once you hotplug some memory, it will switch online. Once you
> >>> remove memory, it will switch back offline.
> >>>
> >>
> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> >> boot.
> > 
> > I was just correcting your statement, because it was wrong.
> > 
> > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > have CPUs nor memory. That would imply that we can, in fact, never have
> > node 0 offline during boot.
> > 
> 
> Yep, looks like it.
> 
> [0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
> [0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
> [0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]

This begs a question whether ppc can do the same thing?

I would swear that we've had x86 system with node 0 but I cannot really
find it and it is possible that it was not x86 after all...
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread David Hildenbrand
On 01.07.20 13:06, David Hildenbrand wrote:
> On 01.07.20 13:01, Srikar Dronamraju wrote:
>> * David Hildenbrand  [2020-07-01 12:15:54]:
>>
>>> On 01.07.20 12:04, Srikar Dronamraju wrote:
 * Michal Hocko  [2020-07-01 10:42:00]:

>
>>
>> 2. Also existence of dummy node also leads to inconsistent information. 
>> The
>> number of online nodes is inconsistent with the information in the
>> device-tree and resource-dump
>>
>> 3. When the dummy node is present, single node non-Numa systems end up 
>> showing
>> up as NUMA systems and numa_balancing gets enabled. This will mean we 
>> take
>> the hit from the unnecessary numa hinting faults.
>
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory
>
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
>
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
>

 I have tried all the steps that David mentioned and reported back at
 https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u

 As a summary, David's steps are still not creating a memoryless/cpuless on
 x86 VM.
>>>
>>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>>> online*. Once you hotplug some memory, it will switch online. Once you
>>> remove memory, it will switch back offline.
>>>
>>
>> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
>> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
>> boot.
> 
> I was just correcting your statement, because it was wrong.
> 
> Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> have CPUs nor memory. That would imply that we can, in fact, never have
> node 0 offline during boot.
> 

Yep, looks like it.

[0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
[0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
[0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
[0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
[0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x-0x0009]
[0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x0010-0xbfff]
[0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x1-0x13fff]



-- 
Thanks,

David / dhildenb



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread David Hildenbrand
On 01.07.20 13:01, Srikar Dronamraju wrote:
> * David Hildenbrand  [2020-07-01 12:15:54]:
> 
>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>> * Michal Hocko  [2020-07-01 10:42:00]:
>>>

>
> 2. Also existence of dummy node also leads to inconsistent information. 
> The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
>
> 3. When the dummy node is present, single node non-Numa systems end up 
> showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

 I have to say that I dislike the node online/offline state and directly
 exporting that to the userspace. Users should only care whether the node
 has memory/cpus. Numa nodes can be online without any memory. Just
 offline all the present memory blocks but do not physically hot remove
 them and you are in the same situation. If users are confused by an
 output of tools like numactl -H then those could be updated and hide
 nodes without any memory

 The autonuma problem sounds interesting but again this patch doesn't
 really solve the underlying problem because I strongly suspect that the
 problem is still there when a numa node gets all its memory offline as
 mentioned above.

 While I completely agree that making node 0 special is wrong, I have
 still hard time to review this very simply looking patch because all the
 numa initialization is so spread around that this might just blow up
 at unexpected places. IIRC we have discussed testing in the previous
 version and David has provided a way to emulate these configurations
 on x86. Did you manage to use those instruction for additional testing
 on other than ppc architectures?

>>>
>>> I have tried all the steps that David mentioned and reported back at
>>> https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
>>>
>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>> x86 VM.
>>
>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>> online*. Once you hotplug some memory, it will switch online. Once you
>> remove memory, it will switch back offline.
>>
> 
> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> boot.

I was just correcting your statement, because it was wrong.

Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
have CPUs nor memory. That would imply that we can, in fact, never have
node 0 offline during boot.

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Srikar Dronamraju
* David Hildenbrand  [2020-07-01 12:15:54]:

> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > * Michal Hocko  [2020-07-01 10:42:00]:
> > 
> >>
> >>>
> >>> 2. Also existence of dummy node also leads to inconsistent information. 
> >>> The
> >>> number of online nodes is inconsistent with the information in the
> >>> device-tree and resource-dump
> >>>
> >>> 3. When the dummy node is present, single node non-Numa systems end up 
> >>> showing
> >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>> the hit from the unnecessary numa hinting faults.
> >>
> >> I have to say that I dislike the node online/offline state and directly
> >> exporting that to the userspace. Users should only care whether the node
> >> has memory/cpus. Numa nodes can be online without any memory. Just
> >> offline all the present memory blocks but do not physically hot remove
> >> them and you are in the same situation. If users are confused by an
> >> output of tools like numactl -H then those could be updated and hide
> >> nodes without any memory
> >>
> >> The autonuma problem sounds interesting but again this patch doesn't
> >> really solve the underlying problem because I strongly suspect that the
> >> problem is still there when a numa node gets all its memory offline as
> >> mentioned above.
> >>
> >> While I completely agree that making node 0 special is wrong, I have
> >> still hard time to review this very simply looking patch because all the
> >> numa initialization is so spread around that this might just blow up
> >> at unexpected places. IIRC we have discussed testing in the previous
> >> version and David has provided a way to emulate these configurations
> >> on x86. Did you manage to use those instruction for additional testing
> >> on other than ppc architectures?
> >>
> > 
> > I have tried all the steps that David mentioned and reported back at
> > https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> > 
> > As a summary, David's steps are still not creating a memoryless/cpuless on
> > x86 VM.
> 
> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> online*. Once you hotplug some memory, it will switch online. Once you
> remove memory, it will switch back offline.
> 

Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
boot.  The code in question tries to handle a cpuless/memoryless node 0 at
boot.

With the steps that you gave the node 0 was always populated, node 1 or
some other node would be memoryless/cpuless and offline. But that should
have no impact by patch.

I don't see how adding/hotplugging/removing memory to a node after boot is
going to affect the changes that I have made. Please do correct me if I have
misunderstood.


-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread David Hildenbrand
On 01.07.20 12:04, Srikar Dronamraju wrote:
> * Michal Hocko  [2020-07-01 10:42:00]:
> 
>>
>>>
>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>> number of online nodes is inconsistent with the information in the
>>> device-tree and resource-dump
>>>
>>> 3. When the dummy node is present, single node non-Numa systems end up 
>>> showing
>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>> the hit from the unnecessary numa hinting faults.
>>
>> I have to say that I dislike the node online/offline state and directly
>> exporting that to the userspace. Users should only care whether the node
>> has memory/cpus. Numa nodes can be online without any memory. Just
>> offline all the present memory blocks but do not physically hot remove
>> them and you are in the same situation. If users are confused by an
>> output of tools like numactl -H then those could be updated and hide
>> nodes without any memory
>>
>> The autonuma problem sounds interesting but again this patch doesn't
>> really solve the underlying problem because I strongly suspect that the
>> problem is still there when a numa node gets all its memory offline as
>> mentioned above.
>>
>> While I completely agree that making node 0 special is wrong, I have
>> still hard time to review this very simply looking patch because all the
>> numa initialization is so spread around that this might just blow up
>> at unexpected places. IIRC we have discussed testing in the previous
>> version and David has provided a way to emulate these configurations
>> on x86. Did you manage to use those instruction for additional testing
>> on other than ppc architectures?
>>
> 
> I have tried all the steps that David mentioned and reported back at
> https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u
> 
> As a summary, David's steps are still not creating a memoryless/cpuless on
> x86 VM.

Now, that is wrong. You get a memoryless/cpuless node, which is *not
online*. Once you hotplug some memory, it will switch online. Once you
remove memory, it will switch back offline.

-- 
Thanks,

David / dhildenb



Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Srikar Dronamraju
* Michal Hocko  [2020-07-01 10:42:00]:

> 
> > 
> > 2. Also existence of dummy node also leads to inconsistent information. The
> > number of online nodes is inconsistent with the information in the
> > device-tree and resource-dump
> > 
> > 3. When the dummy node is present, single node non-Numa systems end up 
> > showing
> > up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > the hit from the unnecessary numa hinting faults.
> 
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory
> 
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
> 
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
> 

I have tried all the steps that David mentioned and reported back at
https://lore.kernel.org/lkml/20200511174731.gd1...@linux.vnet.ibm.com/t/#u

As a summary, David's steps are still not creating a memoryless/cpuless on
x86 VM. I have tried booting with Numa/non-numa on all the x86 machines that
I could get to.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-07-01 Thread Michal Hocko


On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.
> 
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
> 
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
> 
> v5.8-rc2
>  available: 2 nodes (0,2)
>  node 0 cpus:
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31490 MB
>  node distances:
>  node   0   2
>0:  10  20
>2:  20  10
> 
> proc and sys files
> --
>  /sys/devices/system/node/online:0,2
>  /proc/sys/kernel/numa_balancing:1
>  /sys/devices/system/node/has_cpu:   2
>  /sys/devices/system/node/has_memory:2
>  /sys/devices/system/node/has_normal_memory: 2
>  /sys/devices/system/node/possible:  0-31
> 
> v5.8-rc2 + patch
> --
>  available: 1 nodes (2)
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31487 MB
>  node distances:
>  node   2
>2:  10
> 
> proc and sys files
> --
> /sys/devices/system/node/online:2
> /proc/sys/kernel/numa_balancing:0
> /sys/devices/system/node/has_cpu:   2
> /sys/devices/system/node/has_memory:2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible:  0-31
> 
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
> 
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
> 
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
> 
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to the userspace. Users should only care whether the node
has memory/cpus. Numa nodes can be online without any memory. Just
offline all the present memory blocks but do not physically hot remove
them and you are in the same situation. If users are confused by an
output of tools like numactl -H then those could be updated and hide
nodes without any memory

The autonuma problem sounds interesting but again this patch doesn't
really solve the underlying problem because I strongly suspect that the
problem is still there when a numa node gets all its memory offline as
mentioned above.

While I completely agree that making node 0 special is wrong, I have
still hard time to review this very simply looking patch because all the
numa initialization is so spread around that this might just blow up
at unexpected places. IIRC we have discussed testing in the previous
version and David has provided a way to emulate these configurations
on x86. Did you manage to use those instruction for additional testing
on other than ppc architectures?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Cc: Michal Hocko 
> Cc: Mel Gorman 
> Cc: Vlastimil Babka 
> Cc: "Kirill A. Shutemov" 
> Cc: Christopher Lameter 
> Cc: Michael Ellerman 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: David Hildenbrand 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: 
> http://lore.kernel.org/lkml/20200512132937.19295-1-sri...@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: 
> https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-sri...@linux.vnet.ibm.com/t/#u
> 
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..5187664558e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>   [N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> + [N_ONLINE] = 

Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-06-29 Thread Srikar Dronamraju
* Christopher Lameter  [2020-06-29 14:58:40]:

> On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> 
> > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > possible nodes, marks node 0 as online at boot.  However in practice,
> > there are systems which have node 0 as memoryless and cpuless.
> 
> Maybe add something to explain why you are not simply mapping the
> existing memory to NUMA node 0 which is after all just a numbering scheme
> used by the kernel and can be used arbitrarily?
> 

I thought Michal Hocko already gave a clear picture on why mapping is a bad
idea. https://lore.kernel.org/lkml/20200316085425.gb11...@dhcp22.suse.cz/t/#u
Are you suggesting that we add that as part of the changelog?

> This could be seen more as a bug in the arch code during the setup of NUMA
> nodes. The two nodes are created by the firmwware / bootstrap code after
> all. Just do not do it?
> 

- The arch/setup code in powerpc is not onlining these nodes. 
- Later on cpus/memory in node 0 can be onlined.
- Firmware in this case Phyp is an independent code by itself.

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-06-29 Thread Christopher Lameter
On Wed, 24 Jun 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Maybe add something to explain why you are not simply mapping the
existing memory to NUMA node 0 which is after all just a numbering scheme
used by the kernel and can be used arbitrarily?

This could be seen more as a bug in the arch code during the setup of NUMA
nodes. The two nodes are created by the firmwware / bootstrap code after
all. Just do not do it?