Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-15 Thread Michal Hocko
On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> > On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > > >OK, so while I was playing with this setup some more I probably got why
> > > >this is done this way. All new memblocks are added to the zone Normal
> > > >where they are accounted as spanned but not present.
> > > 
> > > It's not always zone Normal. See zone_for_memory(). This leads to a
> > > workaround for having to do online_movable in descending block order.
> > > Instead of this:
> > > 
> > > 1. probe block 34, probe block 33, probe block 32, ...
> > > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > > 
> > > you can online_movable the first block before adding the rest:
> > 
> > I do I enforce that behavior when the probe happens automagically?
> 
> What I provided as guide to online as movable in current and older
> kernels is:
> 
> 1) Remove udev rule
> 2) After adding memory with qemu/libvirt API run in the guest
> 
> --- workaround start 
> #!/bin/bash
> for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do 
> if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; 
> fi; done
> --- workaround end 
> 
> That's how bad is onlining as movable for memory hotunplug.

Yeah, this is really yucky. Fortunately, I already have a working prototype
which removes this restriction altogether.

> > > 1. probe block 32, online_movable 32
> > > 2. probe block 33, probe block 34, ...
> > >   - zone_for_memory() will cause these to start Movable
> > > 3. online 33, online 34, ...
> > >   - they're already in Movable, so online_movable is equivalentr
> > > 
> > > I agree with your general sentiment that this stuff is very nonintuitive.
> > 
> > My criterion for nonintuitive is probably different because I would call
> > this _completely_unusable_. Sorry for being so loud about this but the
> > more I look into this area the more WTF code I see. This has seen close
> > to zero review and seems to be building up more single usecase code on
> > top of previous. We need to change this, seriously!
> 
> It's not a bug, but when I found it I called it a "constraint" and
> when I debugged it (IIRC) it originated here:
> 
>   } else if (online_type == MMOP_ONLINE_MOVABLE) {
>   if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift))
>   return -EINVAL;
>   }
> 
> Fixing it so you could online as movable even if it wasn't the last
> block was in my todo list but then we had other plans.
> 
> Unfortunately unless pfn+nr_pages of the newly created Movable zone
> matches the end of the normal zone (or whatever was there that has to
> be converted to Movable), it will fail onlining as movable.

Well, I think the main problem is that we associate pages added in the
first phase (probe) to the Normal zone. The everything else is just a
fallout and fiddling to make it work somehow.
 
[...]

> To be clear, I'm not necessarily against eventually removing
> _DEFFAULT_ONLINE, but an equally reliable and comparably fast
> alternative should be provided first and there's no such alternative
> right now.

As pointed in other reply that is not an intention here. I primarily
wanted to understand the scope of the problem. I believe this config
option was rushed into the kernel without considering other alternatives
which would make the hotplug more usable by others as well.
 
> If s390 has special issues requiring admin or a software hotplug
> manager app to enable blocks of memory by hand, the _DEFFAULT_ONLINE
> could be then an option selected or not selected by
> arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1
> with the in-kernel feature. If the in-kernel automatic onlining is not
> workable on s390 I would assume the udev rule is also not workable.

But this is not about s390. It is about different usecases which require
different onlining policy and that is the main problem of the hardcoded
one in the kernel. See the other reply.
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-15 Thread Michal Hocko
On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> > On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > > >OK, so while I was playing with this setup some more I probably got why
> > > >this is done this way. All new memblocks are added to the zone Normal
> > > >where they are accounted as spanned but not present.
> > > 
> > > It's not always zone Normal. See zone_for_memory(). This leads to a
> > > workaround for having to do online_movable in descending block order.
> > > Instead of this:
> > > 
> > > 1. probe block 34, probe block 33, probe block 32, ...
> > > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > > 
> > > you can online_movable the first block before adding the rest:
> > 
> > I do I enforce that behavior when the probe happens automagically?
> 
> What I provided as guide to online as movable in current and older
> kernels is:
> 
> 1) Remove udev rule
> 2) After adding memory with qemu/libvirt API run in the guest
> 
> --- workaround start 
> #!/bin/bash
> for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do 
> if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; 
> fi; done
> --- workaround end 
> 
> That's how bad is onlining as movable for memory hotunplug.

Yeah, this is really yucky. Fortunately, I already have a working prototype
which removes this restriction altogether.

> > > 1. probe block 32, online_movable 32
> > > 2. probe block 33, probe block 34, ...
> > >   - zone_for_memory() will cause these to start Movable
> > > 3. online 33, online 34, ...
> > >   - they're already in Movable, so online_movable is equivalentr
> > > 
> > > I agree with your general sentiment that this stuff is very nonintuitive.
> > 
> > My criterion for nonintuitive is probably different because I would call
> > this _completely_unusable_. Sorry for being so loud about this but the
> > more I look into this area the more WTF code I see. This has seen close
> > to zero review and seems to be building up more single usecase code on
> > top of previous. We need to change this, seriously!
> 
> It's not a bug, but when I found it I called it a "constraint" and
> when I debugged it (IIRC) it originated here:
> 
>   } else if (online_type == MMOP_ONLINE_MOVABLE) {
>   if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift))
>   return -EINVAL;
>   }
> 
> Fixing it so you could online as movable even if it wasn't the last
> block was in my todo list but then we had other plans.
> 
> Unfortunately unless pfn+nr_pages of the newly created Movable zone
> matches the end of the normal zone (or whatever was there that has to
> be converted to Movable), it will fail onlining as movable.

Well, I think the main problem is that we associate pages added in the
first phase (probe) to the Normal zone. The everything else is just a
fallout and fiddling to make it work somehow.
 
[...]

> To be clear, I'm not necessarily against eventually removing
> _DEFFAULT_ONLINE, but an equally reliable and comparably fast
> alternative should be provided first and there's no such alternative
> right now.

As pointed in other reply that is not an intention here. I primarily
wanted to understand the scope of the problem. I believe this config
option was rushed into the kernel without considering other alternatives
which would make the hotplug more usable by others as well.
 
> If s390 has special issues requiring admin or a software hotplug
> manager app to enable blocks of memory by hand, the _DEFFAULT_ONLINE
> could be then an option selected or not selected by
> arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1
> with the in-kernel feature. If the in-kernel automatic onlining is not
> workable on s390 I would assume the udev rule is also not workable.

But this is not about s390. It is about different usecases which require
different onlining policy and that is the main problem of the hardcoded
one in the kernel. See the other reply.
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-14 Thread Andrea Arcangeli
Hello,

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > >OK, so while I was playing with this setup some more I probably got why
> > >this is done this way. All new memblocks are added to the zone Normal
> > >where they are accounted as spanned but not present.
> > 
> > It's not always zone Normal. See zone_for_memory(). This leads to a
> > workaround for having to do online_movable in descending block order.
> > Instead of this:
> > 
> > 1. probe block 34, probe block 33, probe block 32, ...
> > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > 
> > you can online_movable the first block before adding the rest:
> 
> I do I enforce that behavior when the probe happens automagically?

What I provided as guide to online as movable in current and older
kernels is:

1) Remove udev rule
2) After adding memory with qemu/libvirt API run in the guest

--- workaround start 
#!/bin/bash
for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do 
if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; 
done
--- workaround end 

That's how bad is onlining as movable for memory hotunplug.

> > 1. probe block 32, online_movable 32
> > 2. probe block 33, probe block 34, ...
> > - zone_for_memory() will cause these to start Movable
> > 3. online 33, online 34, ...
> > - they're already in Movable, so online_movable is equivalentr
> > 
> > I agree with your general sentiment that this stuff is very nonintuitive.
> 
> My criterion for nonintuitive is probably different because I would call
> this _completely_unusable_. Sorry for being so loud about this but the
> more I look into this area the more WTF code I see. This has seen close
> to zero review and seems to be building up more single usecase code on
> top of previous. We need to change this, seriously!

It's not a bug, but when I found it I called it a "constraint" and
when I debugged it (IIRC) it originated here:

} else if (online_type == MMOP_ONLINE_MOVABLE) {
if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift))
return -EINVAL;
}

Fixing it so you could online as movable even if it wasn't the last
block was in my todo list but then we had other plans.

Unfortunately unless pfn+nr_pages of the newly created Movable zone
matches the end of the normal zone (or whatever was there that has to
be converted to Movable), it will fail onlining as movable.

To fix it is enough to check that everything from pfn to the end of
whatever zone existed there (or the end of the node perhaps safer) can
be converted as movable and just convert the whole thing as movable
instead of stopping at pfn+nr_pages.

Also note, onlining highmem physical ranges as movable requires yet
another config option to be set for the below check to pass
(CONFIG_MOVABLE_NODE=y), which I'm not exactly sure why anybody would
want to set =n, and perhaps would be candidate for dropping well
before considering to drop _DEFAULT_ONLINE and at best it should be
replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y
behavior.

if ((zone_idx(zone) > ZONE_NORMAL ||
online_type == MMOP_ONLINE_MOVABLE) &&
!can_online_high_movable(zone))
return -EINVAL;

I would suggest to drop the constraints in online_pages and perhaps
also the CONFIG_MOVABLE_NODE option before going to drop the automatic
onlining in kernel.

Because of the above constraint the udev rule is unusable anyway for
onlining memory as movable so that it can be hotunplugged reliably
(well not so reliably, page_migrate does a loop and tries many times
but it occurred to me it failed once to offline and at next try it
worked, temporary page pins with O_DIRECT screw with page_migration,
rightfully so and no easy fix).

After the above constraint is fixed I suggest to look into fixing the
async onlining by making the udev rule run synchronously. Adding 4T to
a 8G guest is perfectly reasonable and normal operation, no excuse for
it to fail as long as you don't pretend such 4T to be unpluggable too
later (which we won't).

As I understand it, the whole point of _DEFFAULT_ONLINE=y is precisely
that it's easier to fix it in kernel than fixing the udev
rule. Furthermore the above constraint for the zone shifting which
breaks online_movable in udev is not an issue for _DEFFAULT_ONLINE=y
because kernel onlining is synchronous by design (no special
synchronous udev rule fix is required) so it can cope fine with the
existing above constraint by onlining as movable from the end of the
last zone in each node so that such constraint never gets in the way.

Extending _DEFFAULT_ONLINE=y so that it can also online as movable has
been worked on.

On a side note, kernelcore=xxx passed to a kernel with
_DEFFAULT_ONLINE=y should already allow to 

Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-14 Thread Andrea Arcangeli
Hello,

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
> On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> > >OK, so while I was playing with this setup some more I probably got why
> > >this is done this way. All new memblocks are added to the zone Normal
> > >where they are accounted as spanned but not present.
> > 
> > It's not always zone Normal. See zone_for_memory(). This leads to a
> > workaround for having to do online_movable in descending block order.
> > Instead of this:
> > 
> > 1. probe block 34, probe block 33, probe block 32, ...
> > 2. online_movable 34, online_movable 33, online_movable 32, ...
> > 
> > you can online_movable the first block before adding the rest:
> 
> I do I enforce that behavior when the probe happens automagically?

What I provided as guide to online as movable in current and older
kernels is:

1) Remove udev rule
2) After adding memory with qemu/libvirt API run in the guest

--- workaround start 
#!/bin/bash
for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do 
if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; 
done
--- workaround end 

That's how bad is onlining as movable for memory hotunplug.

> > 1. probe block 32, online_movable 32
> > 2. probe block 33, probe block 34, ...
> > - zone_for_memory() will cause these to start Movable
> > 3. online 33, online 34, ...
> > - they're already in Movable, so online_movable is equivalentr
> > 
> > I agree with your general sentiment that this stuff is very nonintuitive.
> 
> My criterion for nonintuitive is probably different because I would call
> this _completely_unusable_. Sorry for being so loud about this but the
> more I look into this area the more WTF code I see. This has seen close
> to zero review and seems to be building up more single usecase code on
> top of previous. We need to change this, seriously!

It's not a bug, but when I found it I called it a "constraint" and
when I debugged it (IIRC) it originated here:

} else if (online_type == MMOP_ONLINE_MOVABLE) {
if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift))
return -EINVAL;
}

Fixing it so you could online as movable even if it wasn't the last
block was in my todo list but then we had other plans.

Unfortunately unless pfn+nr_pages of the newly created Movable zone
matches the end of the normal zone (or whatever was there that has to
be converted to Movable), it will fail onlining as movable.

To fix it is enough to check that everything from pfn to the end of
whatever zone existed there (or the end of the node perhaps safer) can
be converted as movable and just convert the whole thing as movable
instead of stopping at pfn+nr_pages.

Also note, onlining highmem physical ranges as movable requires yet
another config option to be set for the below check to pass
(CONFIG_MOVABLE_NODE=y), which I'm not exactly sure why anybody would
want to set =n, and perhaps would be candidate for dropping well
before considering to drop _DEFAULT_ONLINE and at best it should be
replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y
behavior.

if ((zone_idx(zone) > ZONE_NORMAL ||
online_type == MMOP_ONLINE_MOVABLE) &&
!can_online_high_movable(zone))
return -EINVAL;

I would suggest to drop the constraints in online_pages and perhaps
also the CONFIG_MOVABLE_NODE option before going to drop the automatic
onlining in kernel.

Because of the above constraint the udev rule is unusable anyway for
onlining memory as movable so that it can be hotunplugged reliably
(well not so reliably, page_migrate does a loop and tries many times
but it occurred to me it failed once to offline and at next try it
worked, temporary page pins with O_DIRECT screw with page_migration,
rightfully so and no easy fix).

After the above constraint is fixed I suggest to look into fixing the
async onlining by making the udev rule run synchronously. Adding 4T to
a 8G guest is perfectly reasonable and normal operation, no excuse for
it to fail as long as you don't pretend such 4T to be unpluggable too
later (which we won't).

As I understand it, the whole point of _DEFFAULT_ONLINE=y is precisely
that it's easier to fix it in kernel than fixing the udev
rule. Furthermore the above constraint for the zone shifting which
breaks online_movable in udev is not an issue for _DEFFAULT_ONLINE=y
because kernel onlining is synchronous by design (no special
synchronous udev rule fix is required) so it can cope fine with the
existing above constraint by onlining as movable from the end of the
last zone in each node so that such constraint never gets in the way.

Extending _DEFFAULT_ONLINE=y so that it can also online as movable has
been worked on.

On a side note, kernelcore=xxx passed to a kernel with
_DEFFAULT_ONLINE=y should already allow to 

Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Andi Kleen
> and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was
> the reason to add this code back in 2006 is not true anymore. So I am
> really wondering. Do we absolutely need to assign pages which are not
> onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
> out of any zone and wait for memory online operation to put them where
> requested?

I don't remember any specific reason. May have just been simplification.
Should be fine to use any zone.

-Andi


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Andi Kleen
> and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was
> the reason to add this code back in 2006 is not true anymore. So I am
> really wondering. Do we absolutely need to assign pages which are not
> onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
> out of any zone and wait for memory online operation to put them where
> requested?

I don't remember any specific reason. May have just been simplification.
Should be fine to use any zone.

-Andi


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
Let's add Andi

On Fri 10-03-17 16:53:33, Michal Hocko wrote:
> On Fri 10-03-17 14:58:07, Michal Hocko wrote:
> [...]
> > This would explain why onlining from the last block actually works but
> > to me this sounds like a completely crappy behavior. All we need to
> > guarantee AFAICS is that Normal and Movable zones do not overlap. I
> > believe there is even no real requirement about ordering of the physical
> > memory in Normal vs. Movable zones as long as they do not overlap. But
> > let's keep it simple for the start and always enforce the current status
> > quo that Normal zone is physically preceeding Movable zone.
> > Can somebody explain why we cannot have a simple rule for Normal vs.
> > Movable which would be:
> > - block [pfn, pfn+block_size] can be Normal if
> >   !zone_populated(MOVABLE) || pfn+block_size < 
> > ZONE_MOVABLE->zone_start_pfn
> > - block [pfn, pfn+block_size] can be Movable if
> >   !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
> 
> OK, so while I was playing with this setup some more I probably got why
> this is done this way. All new memblocks are added to the zone Normal
> where they are accounted as spanned but not present. When we do
> online_movable we just cut from the end of the Normal zone and move it
> to Movable zone. This sounds really awkward. What was the reason to go
> this way? Why cannot we simply add those pages to the zone at the online
> time?

Answering to myself. So the reason seems to be 9d99aaa31f59 ("[PATCH]
x86_64: Support memory hotadd without sparsemem") which is no longer
true because 
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on COMPILE_TEST || !KASAN

so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled.
config X86_64_ACPI_NUMA
def_bool y
prompt "ACPI NUMA detection"
depends on X86_64 && NUMA && ACPI && PCI
select ACPI_NUMA

But I do not see any way how to enable anything but SPARSEMEM for x86_64
choice
prompt "Memory model"
depends on SELECT_MEMORY_MODEL
default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
default FLATMEM_MANUAL

ARCH_SPARSEMEM_DEFAULT is 32b only
config ARCH_DISCONTIGMEM_DEFAULT
def_bool y
depends on NUMA && X86_32

and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was
the reason to add this code back in 2006 is not true anymore. So I am
really wondering. Do we absolutely need to assign pages which are not
onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
out of any zone and wait for memory online operation to put them where
requested?
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
Let's add Andi

On Fri 10-03-17 16:53:33, Michal Hocko wrote:
> On Fri 10-03-17 14:58:07, Michal Hocko wrote:
> [...]
> > This would explain why onlining from the last block actually works but
> > to me this sounds like a completely crappy behavior. All we need to
> > guarantee AFAICS is that Normal and Movable zones do not overlap. I
> > believe there is even no real requirement about ordering of the physical
> > memory in Normal vs. Movable zones as long as they do not overlap. But
> > let's keep it simple for the start and always enforce the current status
> > quo that Normal zone is physically preceeding Movable zone.
> > Can somebody explain why we cannot have a simple rule for Normal vs.
> > Movable which would be:
> > - block [pfn, pfn+block_size] can be Normal if
> >   !zone_populated(MOVABLE) || pfn+block_size < 
> > ZONE_MOVABLE->zone_start_pfn
> > - block [pfn, pfn+block_size] can be Movable if
> >   !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
> 
> OK, so while I was playing with this setup some more I probably got why
> this is done this way. All new memblocks are added to the zone Normal
> where they are accounted as spanned but not present. When we do
> online_movable we just cut from the end of the Normal zone and move it
> to Movable zone. This sounds really awkward. What was the reason to go
> this way? Why cannot we simply add those pages to the zone at the online
> time?

Answering to myself. So the reason seems to be 9d99aaa31f59 ("[PATCH]
x86_64: Support memory hotadd without sparsemem") which is no longer
true because 
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on COMPILE_TEST || !KASAN

so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled.
config X86_64_ACPI_NUMA
def_bool y
prompt "ACPI NUMA detection"
depends on X86_64 && NUMA && ACPI && PCI
select ACPI_NUMA

But I do not see any way how to enable anything but SPARSEMEM for x86_64
choice
prompt "Memory model"
depends on SELECT_MEMORY_MODEL
default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT
default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT
default FLATMEM_MANUAL

ARCH_SPARSEMEM_DEFAULT is 32b only
config ARCH_DISCONTIGMEM_DEFAULT
def_bool y
depends on NUMA && X86_32

and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was
the reason to add this code back in 2006 is not true anymore. So I am
really wondering. Do we absolutely need to assign pages which are not
onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them
out of any zone and wait for memory online operation to put them where
requested?
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Reza Arbab

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
I agree with your general sentiment that this stuff is very 
nonintuitive.


My criterion for nonintuitive is probably different because I would 
call this _completely_unusable_. Sorry for being so loud about this but 
the more I look into this area the more WTF code I see. This has seen 
close to zero review and seems to be building up more single usecase 
code on top of previous. We need to change this, seriously!


No argument here. I'm happy to help however I can.

--
Reza Arbab



Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Reza Arbab

On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote:
I agree with your general sentiment that this stuff is very 
nonintuitive.


My criterion for nonintuitive is probably different because I would 
call this _completely_unusable_. Sorry for being so loud about this but 
the more I look into this area the more WTF code I see. This has seen 
close to zero review and seems to be building up more single usecase 
code on top of previous. We need to change this, seriously!


No argument here. I'm happy to help however I can.

--
Reza Arbab



Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 14:57:12, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 11:43:02 +0100
> Michal Hocko  wrote:
> 
> > On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > > On Fri, 10 Mar 2017 14:58:07 +0100  
> > [...]
> > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > > > hotplug
> > > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > > > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > > > [0.00] Zone ranges:
> > > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > > > [0.00]   Normal   empty
> > > > [0.00] Movable zone start for each node
> > > > [0.00] Early memory node ranges
> > > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > > [0.00]   node   0: [mem 0x0010-0x3fff]
> > > > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > > > 
> > > > so there is neither any normal zone nor movable one at the boot time.  
> > > it could be if hotpluggable memory were present at boot time in E802 table
> > > (if I remember right when running on hyperv there is movable zone at boot 
> > > time),
> > > 
> > > but in qemu hotpluggable memory isn't put into E820,
> > > so zone is allocated later when memory is enumerated
> > > by ACPI subsystem and onlined.
> > > It causes less issues wrt movable zone and works for
> > > different versions of linux/windows as well.
> > > 
> > > That's where in kernel auto-onlining could be also useful,
> > > since user would be able to start-up with with small
> > > non removable memory plus several removable DIMMs
> > > and have all the memory onlined/available by the time
> > > initrd is loaded. (missing piece here is onling
> > > removable memory as movable by default).  
> > 
> > Why we should even care to online that memory that early rather than
> > making it available via e820?
> 
> It's not forbidden by spec and has less complications
> when it comes to removable memory. Declaring it in E820
> would add following limitations/drawbacks:
>  - firmware should be able to exclude removable memory
>from its usage (currently SeaBIOS nor EFI have to
>know/care about it) => less qemu-guest ABI to maintain.
>  - OS should be taught to avoid/move (early) nonmovable
>allocations from removable address ranges.
>There were patches targeting that in recent kernels,
>but it won't work with older kernels that don't have it.
>So limiting a range of OSes that could run on QEMU
>and do memory removal.
> 
> E820 less approach works reasonably well with wide range
> of guest OSes and less complex that if removable memory
> were present it E820. Hence I don't have a compelling
> reason to introduce removable memory in E820 as it
> only adds to hot(un)plug issues.

OK I see and that sounds like an argument to not put those ranges to
E820. I still fail to see why we haeve to online the memory early during
the boot and cannot wait for userspace to run?

-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 14:57:12, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 11:43:02 +0100
> Michal Hocko  wrote:
> 
> > On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > > On Fri, 10 Mar 2017 14:58:07 +0100  
> > [...]
> > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > > > hotplug
> > > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > > > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > > > [0.00] Zone ranges:
> > > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > > > [0.00]   Normal   empty
> > > > [0.00] Movable zone start for each node
> > > > [0.00] Early memory node ranges
> > > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > > [0.00]   node   0: [mem 0x0010-0x3fff]
> > > > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > > > 
> > > > so there is neither any normal zone nor movable one at the boot time.  
> > > it could be if hotpluggable memory were present at boot time in E802 table
> > > (if I remember right when running on hyperv there is movable zone at boot 
> > > time),
> > > 
> > > but in qemu hotpluggable memory isn't put into E820,
> > > so zone is allocated later when memory is enumerated
> > > by ACPI subsystem and onlined.
> > > It causes less issues wrt movable zone and works for
> > > different versions of linux/windows as well.
> > > 
> > > That's where in kernel auto-onlining could be also useful,
> > > since user would be able to start-up with with small
> > > non removable memory plus several removable DIMMs
> > > and have all the memory onlined/available by the time
> > > initrd is loaded. (missing piece here is onling
> > > removable memory as movable by default).  
> > 
> > Why we should even care to online that memory that early rather than
> > making it available via e820?
> 
> It's not forbidden by spec and has less complications
> when it comes to removable memory. Declaring it in E820
> would add following limitations/drawbacks:
>  - firmware should be able to exclude removable memory
>from its usage (currently SeaBIOS nor EFI have to
>know/care about it) => less qemu-guest ABI to maintain.
>  - OS should be taught to avoid/move (early) nonmovable
>allocations from removable address ranges.
>There were patches targeting that in recent kernels,
>but it won't work with older kernels that don't have it.
>So limiting a range of OSes that could run on QEMU
>and do memory removal.
> 
> E820 less approach works reasonably well with wide range
> of guest OSes and less complex that if removable memory
> were present it E820. Hence I don't have a compelling
> reason to introduce removable memory in E820 as it
> only adds to hot(un)plug issues.

OK I see and that sounds like an argument to not put those ranges to
E820. I still fail to see why we haeve to online the memory early during
the boot and cannot wait for userspace to run?

-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Igor Mammedov
On Mon, 13 Mar 2017 11:43:02 +0100
Michal Hocko  wrote:

> On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > On Fri, 10 Mar 2017 14:58:07 +0100  
> [...]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > > hotplug
> > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > > [0.00] Zone ranges:
> > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > > [0.00]   Normal   empty
> > > [0.00] Movable zone start for each node
> > > [0.00] Early memory node ranges
> > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > [0.00]   node   0: [mem 0x0010-0x3fff]
> > > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > > 
> > > so there is neither any normal zone nor movable one at the boot time.  
> > it could be if hotpluggable memory were present at boot time in E802 table
> > (if I remember right when running on hyperv there is movable zone at boot 
> > time),
> > 
> > but in qemu hotpluggable memory isn't put into E820,
> > so zone is allocated later when memory is enumerated
> > by ACPI subsystem and onlined.
> > It causes less issues wrt movable zone and works for
> > different versions of linux/windows as well.
> > 
> > That's where in kernel auto-onlining could be also useful,
> > since user would be able to start-up with with small
> > non removable memory plus several removable DIMMs
> > and have all the memory onlined/available by the time
> > initrd is loaded. (missing piece here is onling
> > removable memory as movable by default).  
> 
> Why we should even care to online that memory that early rather than
> making it available via e820?

It's not forbidden by spec and has less complications
when it comes to removable memory. Declaring it in E820
would add following limitations/drawbacks:
 - firmware should be able to exclude removable memory
   from its usage (currently SeaBIOS nor EFI have to
   know/care about it) => less qemu-guest ABI to maintain.
 - OS should be taught to avoid/move (early) nonmovable
   allocations from removable address ranges.
   There were patches targeting that in recent kernels,
   but it won't work with older kernels that don't have it.
   So limiting a range of OSes that could run on QEMU
   and do memory removal.

E820 less approach works reasonably well with wide range
of guest OSes and less complex that if removable memory
were present it E820. Hence I don't have a compelling
reason to introduce removable memory in E820 as it
only adds to hot(un)plug issues.

I have an off-tree QEMU hack that puts hotremovable
memory added with "-device pc-dimm" on CLI into E820
to experiment with. It could be useful to play
with zone layouts at boot time, so if you are
interested I can rebase it on top of current master
and post it here to play with.


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Igor Mammedov
On Mon, 13 Mar 2017 11:43:02 +0100
Michal Hocko  wrote:

> On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > On Fri, 10 Mar 2017 14:58:07 +0100  
> [...]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > > hotplug
> > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > > [0.00] Zone ranges:
> > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > > [0.00]   Normal   empty
> > > [0.00] Movable zone start for each node
> > > [0.00] Early memory node ranges
> > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > [0.00]   node   0: [mem 0x0010-0x3fff]
> > > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > > 
> > > so there is neither any normal zone nor movable one at the boot time.  
> > it could be if hotpluggable memory were present at boot time in E802 table
> > (if I remember right when running on hyperv there is movable zone at boot 
> > time),
> > 
> > but in qemu hotpluggable memory isn't put into E820,
> > so zone is allocated later when memory is enumerated
> > by ACPI subsystem and onlined.
> > It causes less issues wrt movable zone and works for
> > different versions of linux/windows as well.
> > 
> > That's where in kernel auto-onlining could be also useful,
> > since user would be able to start-up with with small
> > non removable memory plus several removable DIMMs
> > and have all the memory onlined/available by the time
> > initrd is loaded. (missing piece here is onling
> > removable memory as movable by default).  
> 
> Why we should even care to online that memory that early rather than
> making it available via e820?

It's not forbidden by spec and has less complications
when it comes to removable memory. Declaring it in E820
would add following limitations/drawbacks:
 - firmware should be able to exclude removable memory
   from its usage (currently SeaBIOS nor EFI have to
   know/care about it) => less qemu-guest ABI to maintain.
 - OS should be taught to avoid/move (early) nonmovable
   allocations from removable address ranges.
   There were patches targeting that in recent kernels,
   but it won't work with older kernels that don't have it.
   So limiting a range of OSes that could run on QEMU
   and do memory removal.

E820 less approach works reasonably well with wide range
of guest OSes and less complex that if removable memory
were present it E820. Hence I don't have a compelling
reason to introduce removable memory in E820 as it
only adds to hot(un)plug issues.

I have an off-tree QEMU hack that puts hotremovable
memory added with "-device pc-dimm" on CLI into E820
to experiment with. It could be useful to play
with zone layouts at boot time, so if you are
interested I can rebase it on top of current master
and post it here to play with.


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> On Fri, 10 Mar 2017 14:58:07 +0100
[...]
> > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > hotplug
> > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x1000-0x00ff]
> > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > [0.00]   Normal   empty
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > [0.00]   node   0: [mem 0x0010-0x3fff]
> > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > 
> > so there is neither any normal zone nor movable one at the boot time.
> it could be if hotpluggable memory were present at boot time in E802 table
> (if I remember right when running on hyperv there is movable zone at boot 
> time),
> 
> but in qemu hotpluggable memory isn't put into E820,
> so zone is allocated later when memory is enumerated
> by ACPI subsystem and onlined.
> It causes less issues wrt movable zone and works for
> different versions of linux/windows as well.
> 
> That's where in kernel auto-onlining could be also useful,
> since user would be able to start-up with with small
> non removable memory plus several removable DIMMs
> and have all the memory onlined/available by the time
> initrd is loaded. (missing piece here is onling
> removable memory as movable by default).

Why we should even care to online that memory that early rather than
making it available via e820?
 
> > Then I hotplugged 1G slot
> > (qemu) object_add memory-backend-ram,id=mem1,size=1G
> > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> You can also specify node a pc-dimm goes to with 'node' property
> if it should go to other then node 0.
> 
> device_add pc-dimm,id=dimm1,memdev=mem1,node=1

thanks for the tip

> > unfortunatelly the memory didn't show up automatically and I got
> > [  116.375781] acpi PNP0C80:00: Enumeration failure
> it should work,
> do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?

No I didn't. Thanks, good to know!
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> On Fri, 10 Mar 2017 14:58:07 +0100
[...]
> > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] 
> > hotplug
> > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> > 0x0010-0x3fff] -> [mem 0x-0x3fff]
> > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x1000-0x00ff]
> > [0.00]   DMA32[mem 0x0100-0x7ffd]
> > [0.00]   Normal   empty
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > [0.00]   node   0: [mem 0x0010-0x3fff]
> > [0.00]   node   1: [mem 0x4000-0x7ffd]
> > 
> > so there is neither any normal zone nor movable one at the boot time.
> it could be if hotpluggable memory were present at boot time in E802 table
> (if I remember right when running on hyperv there is movable zone at boot 
> time),
> 
> but in qemu hotpluggable memory isn't put into E820,
> so zone is allocated later when memory is enumerated
> by ACPI subsystem and onlined.
> It causes less issues wrt movable zone and works for
> different versions of linux/windows as well.
> 
> That's where in kernel auto-onlining could be also useful,
> since user would be able to start-up with with small
> non removable memory plus several removable DIMMs
> and have all the memory onlined/available by the time
> initrd is loaded. (missing piece here is onling
> removable memory as movable by default).

Why we should even care to online that memory that early rather than
making it available via e820?
 
> > Then I hotplugged 1G slot
> > (qemu) object_add memory-backend-ram,id=mem1,size=1G
> > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
> You can also specify node a pc-dimm goes to with 'node' property
> if it should go to other then node 0.
> 
> device_add pc-dimm,id=dimm1,memdev=mem1,node=1

thanks for the tip

> > unfortunatelly the memory didn't show up automatically and I got
> > [  116.375781] acpi PNP0C80:00: Enumeration failure
> it should work,
> do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?

No I didn't. Thanks, good to know!
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Igor Mammedov
On Fri, 10 Mar 2017 14:58:07 +0100
Michal Hocko  wrote:

> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while movable blocks are allowed from
> last blocks. More below.
> 
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> > On Tue 07-03-17 13:40:04, Igor Mammedov wrote:  
> > > On Mon, 6 Mar 2017 15:54:17 +0100
> > > Michal Hocko  wrote:
> > >   
> > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:  
> > [...]  
> > > > > in current mainline kernel it triggers following code path:
> > > > > 
> > > > > online_pages()
> > > > >   ...
> > > > >if (online_type == MMOP_ONLINE_KERNEL) {   
> > > > >   
> > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, 
> > > > > _shift))
> > > > > return -EINVAL;
> > > > 
> > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here  
> > > pretty much, reproducer is above so try and see for yourself  
> > 
> > I will play with this...  
> 
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G 
> which generated
'mem' here distributes boot memory specified by "-m 2G" and does not
include memory specified by -device pc-dimm.

> [...]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug
> [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> 0x0010-0x3fff] -> [mem 0x-0x3fff]
> [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x1000-0x00ff]
> [0.00]   DMA32[mem 0x0100-0x7ffd]
> [0.00]   Normal   empty
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x1000-0x0009efff]
> [0.00]   node   0: [mem 0x0010-0x3fff]
> [0.00]   node   1: [mem 0x4000-0x7ffd]
> 
> so there is neither any normal zone nor movable one at the boot time.
it could be if hotpluggable memory were present at boot time in E802 table
(if I remember right when running on hyperv there is movable zone at boot time),

but in qemu hotpluggable memory isn't put into E820,
so zone is allocated later when memory is enumerated
by ACPI subsystem and onlined.
It causes less issues wrt movable zone and works for
different versions of linux/windows as well.

That's where in kernel auto-onlining could be also useful,
since user would be able to start-up with with small
non removable memory plus several removable DIMMs
and have all the memory onlined/available by the time
initrd is loaded. (missing piece here is onling
removable memory as movable by default).


> Then I hotplugged 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
You can also specify node a pc-dimm goes to with 'node' property
if it should go to other then node 0.

device_add pc-dimm,id=dimm1,memdev=mem1,node=1


> unfortunatelly the memory didn't show up automatically and I got
> [  116.375781] acpi PNP0C80:00: Enumeration failure
it should work,
do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?
 
> so I had to probe it manually (prbably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x1



Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Igor Mammedov
On Fri, 10 Mar 2017 14:58:07 +0100
Michal Hocko  wrote:

> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while movable blocks are allowed from
> last blocks. More below.
> 
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> > On Tue 07-03-17 13:40:04, Igor Mammedov wrote:  
> > > On Mon, 6 Mar 2017 15:54:17 +0100
> > > Michal Hocko  wrote:
> > >   
> > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:  
> > [...]  
> > > > > in current mainline kernel it triggers following code path:
> > > > > 
> > > > > online_pages()
> > > > >   ...
> > > > >if (online_type == MMOP_ONLINE_KERNEL) {   
> > > > >   
> > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, 
> > > > > _shift))
> > > > > return -EINVAL;
> > > > 
> > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here  
> > > pretty much, reproducer is above so try and see for yourself  
> > 
> > I will play with this...  
> 
> OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G 
> which generated
'mem' here distributes boot memory specified by "-m 2G" and does not
include memory specified by -device pc-dimm.

> [...]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
> [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
> [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug
> [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
> 0x0010-0x3fff] -> [mem 0x-0x3fff]
> [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
> [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x1000-0x00ff]
> [0.00]   DMA32[mem 0x0100-0x7ffd]
> [0.00]   Normal   empty
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x1000-0x0009efff]
> [0.00]   node   0: [mem 0x0010-0x3fff]
> [0.00]   node   1: [mem 0x4000-0x7ffd]
> 
> so there is neither any normal zone nor movable one at the boot time.
it could be if hotpluggable memory were present at boot time in E802 table
(if I remember right when running on hyperv there is movable zone at boot time),

but in qemu hotpluggable memory isn't put into E820,
so zone is allocated later when memory is enumerated
by ACPI subsystem and onlined.
It causes less issues wrt movable zone and works for
different versions of linux/windows as well.

That's where in kernel auto-onlining could be also useful,
since user would be able to start-up with with small
non removable memory plus several removable DIMMs
and have all the memory onlined/available by the time
initrd is loaded. (missing piece here is onling
removable memory as movable by default).


> Then I hotplugged 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
You can also specify node a pc-dimm goes to with 'node' property
if it should go to other then node 0.

device_add pc-dimm,id=dimm1,memdev=mem1,node=1


> unfortunatelly the memory didn't show up automatically and I got
> [  116.375781] acpi PNP0C80:00: Enumeration failure
it should work,
do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled?
 
> so I had to probe it manually (prbably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway the SRAT
> table printed during the boot told that we should start at 0x1



Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> >OK, so while I was playing with this setup some more I probably got why
> >this is done this way. All new memblocks are added to the zone Normal
> >where they are accounted as spanned but not present.
> 
> It's not always zone Normal. See zone_for_memory(). This leads to a
> workaround for having to do online_movable in descending block order.
> Instead of this:
> 
> 1. probe block 34, probe block 33, probe block 32, ...
> 2. online_movable 34, online_movable 33, online_movable 32, ...
> 
> you can online_movable the first block before adding the rest:

I do I enforce that behavior when the probe happens automagically?

> 1. probe block 32, online_movable 32
> 2. probe block 33, probe block 34, ...
>   - zone_for_memory() will cause these to start Movable
> 3. online 33, online 34, ...
>   - they're already in Movable, so online_movable is equivalentr
> 
> I agree with your general sentiment that this stuff is very nonintuitive.

My criterion for nonintuitive is probably different because I would call
this _completely_unusable_. Sorry for being so loud about this but the
more I look into this area the more WTF code I see. This has seen close
to zero review and seems to be building up more single usecase code on
top of previous. We need to change this, seriously!
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-13 Thread Michal Hocko
On Fri 10-03-17 13:00:37, Reza Arbab wrote:
> On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:
> >OK, so while I was playing with this setup some more I probably got why
> >this is done this way. All new memblocks are added to the zone Normal
> >where they are accounted as spanned but not present.
> 
> It's not always zone Normal. See zone_for_memory(). This leads to a
> workaround for having to do online_movable in descending block order.
> Instead of this:
> 
> 1. probe block 34, probe block 33, probe block 32, ...
> 2. online_movable 34, online_movable 33, online_movable 32, ...
> 
> you can online_movable the first block before adding the rest:

I do I enforce that behavior when the probe happens automagically?

> 1. probe block 32, online_movable 32
> 2. probe block 33, probe block 34, ...
>   - zone_for_memory() will cause these to start Movable
> 3. online 33, online 34, ...
>   - they're already in Movable, so online_movable is equivalentr
> 
> I agree with your general sentiment that this stuff is very nonintuitive.

My criterion for nonintuitive is probably different because I would call
this _completely_unusable_. Sorry for being so loud about this but the
more I look into this area the more WTF code I see. This has seen close
to zero review and seems to be building up more single usecase code on
top of previous. We need to change this, seriously!
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-10 Thread Reza Arbab

On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:

OK, so while I was playing with this setup some more I probably got why
this is done this way. All new memblocks are added to the zone Normal
where they are accounted as spanned but not present.


It's not always zone Normal. See zone_for_memory(). This leads to a 
workaround for having to do online_movable in descending block order. 
Instead of this:


1. probe block 34, probe block 33, probe block 32, ...
2. online_movable 34, online_movable 33, online_movable 32, ...

you can online_movable the first block before adding the rest:

1. probe block 32, online_movable 32
2. probe block 33, probe block 34, ...
- zone_for_memory() will cause these to start Movable
3. online 33, online 34, ...
- they're already in Movable, so online_movable is equivalentr

I agree with your general sentiment that this stuff is very 
nonintuitive.


--
Reza Arbab



Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-10 Thread Reza Arbab

On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote:

OK, so while I was playing with this setup some more I probably got why
this is done this way. All new memblocks are added to the zone Normal
where they are accounted as spanned but not present.


It's not always zone Normal. See zone_for_memory(). This leads to a 
workaround for having to do online_movable in descending block order. 
Instead of this:


1. probe block 34, probe block 33, probe block 32, ...
2. online_movable 34, online_movable 33, online_movable 32, ...

you can online_movable the first block before adding the rest:

1. probe block 32, online_movable 32
2. probe block 33, probe block 34, ...
- zone_for_memory() will cause these to start Movable
3. online 33, online 34, ...
- they're already in Movable, so online_movable is equivalentr

I agree with your general sentiment that this stuff is very 
nonintuitive.


--
Reza Arbab



Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-10 Thread Michal Hocko
On Fri 10-03-17 14:58:07, Michal Hocko wrote:
[...]
> This would explain why onlining from the last block actually works but
> to me this sounds like a completely crappy behavior. All we need to
> guarantee AFAICS is that Normal and Movable zones do not overlap. I
> believe there is even no real requirement about ordering of the physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that Normal zone is physically preceeding Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
>   - block [pfn, pfn+block_size] can be Normal if
> !zone_populated(MOVABLE) || pfn+block_size < 
> ZONE_MOVABLE->zone_start_pfn
>   - block [pfn, pfn+block_size] can be Movable if
> !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

OK, so while I was playing with this setup some more I probably got why
this is done this way. All new memblocks are added to the zone Normal
where they are accounted as spanned but not present. When we do
online_movable we just cut from the end of the Normal zone and move it
to Movable zone. This sounds really awkward. What was the reason to go
this way? Why cannot we simply add those pages to the zone at the online
time?
-- 
Michal Hocko
SUSE Labs


Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-10 Thread Michal Hocko
On Fri 10-03-17 14:58:07, Michal Hocko wrote:
[...]
> This would explain why onlining from the last block actually works but
> to me this sounds like a completely crappy behavior. All we need to
> guarantee AFAICS is that Normal and Movable zones do not overlap. I
> believe there is even no real requirement about ordering of the physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that Normal zone is physically preceeding Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
>   - block [pfn, pfn+block_size] can be Normal if
> !zone_populated(MOVABLE) || pfn+block_size < 
> ZONE_MOVABLE->zone_start_pfn
>   - block [pfn, pfn+block_size] can be Movable if
> !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

OK, so while I was playing with this setup some more I probably got why
this is done this way. All new memblocks are added to the zone Normal
where they are accounted as spanned but not present. When we do
online_movable we just cut from the end of the Normal zone and move it
to Movable zone. This sounds really awkward. What was the reason to go
this way? Why cannot we simply add those pages to the zone at the online
time?
-- 
Michal Hocko
SUSE Labs


WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-10 Thread Michal Hocko
Let's CC people touching this logic. A short summary is that onlining
memory via udev is currently unusable for online_movable because blocks
are added from lower addresses while movable blocks are allowed from
last blocks. More below.

On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> > On Mon, 6 Mar 2017 15:54:17 +0100
> > Michal Hocko  wrote:
> > 
> > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> [...]
> > > > in current mainline kernel it triggers following code path:
> > > > 
> > > > online_pages()
> > > >   ...
> > > >if (online_type == MMOP_ONLINE_KERNEL) { 
> > > > 
> > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, 
> > > > _shift))
> > > > return -EINVAL;  
> > > 
> > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> > pretty much, reproducer is above so try and see for yourself
> 
> I will play with this...

OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G 
which generated
[...]
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
[0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug
[0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
0x0010-0x3fff] -> [mem 0x-0x3fff]
[0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
[0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   DMA32[mem 0x0100-0x7ffd]
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x1000-0x0009efff]
[0.00]   node   0: [mem 0x0010-0x3fff]
[0.00]   node   1: [mem 0x4000-0x7ffd]

so there is neither any normal zone nor movable one at the boot time.
Then I hotplugged 1G slot
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

unfortunatelly the memory didn't show up automatically and I got
[  116.375781] acpi PNP0C80:00: Enumeration failure

so I had to probe it manually (prbably the BIOS my qemu uses doesn't
support auto probing - I haven't really dug further). Anyway the SRAT
table printed during the boot told that we should start at 0x1

# echo 0x1 > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

which looks reasonably right? Both Normal and Movable zones are allowed

# echo $((0x1+(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

Huh, so our valid_zones have changed under our feet...

# echo $((0x1+2*(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

and again. So only the last memblock is considered movable. Let's try to
online them now.

# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This would explain why onlining from the last block actually works but
to me this sounds like a completely crappy behavior. All we need to
guarantee AFAICS is that Normal and Movable zones do not overlap. I
believe there is even no real requirement about ordering of the physical
memory in Normal vs. Movable zones as long as they do not overlap. But
let's keep it simple for the start and always enforce the current status
quo that Normal zone is physically preceeding Movable zone.
Can somebody explain why we cannot have a simple rule for Normal vs.
Movable which would be:
- block [pfn, pfn+block_size] can be Normal if
  !zone_populated(MOVABLE) || pfn+block_size < 
ZONE_MOVABLE->zone_start_pfn
- block [pfn, pfn+block_size] can be Movable if
  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

I haven't fully grokked all the restrictions on the movable zone size
based on the kernel parameters (find_zone_movable_pfns_for_nodes) but
this shouldn't really make the situation really much more complicated I
believe because those parameters should be mostly about static
initialization rather 

WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)

2017-03-10 Thread Michal Hocko
Let's CC people touching this logic. A short summary is that onlining
memory via udev is currently unusable for online_movable because blocks
are added from lower addresses while movable blocks are allowed from
last blocks. More below.

On Thu 09-03-17 13:54:00, Michal Hocko wrote:
> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
> > On Mon, 6 Mar 2017 15:54:17 +0100
> > Michal Hocko  wrote:
> > 
> > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
> [...]
> > > > in current mainline kernel it triggers following code path:
> > > > 
> > > > online_pages()
> > > >   ...
> > > >if (online_type == MMOP_ONLINE_KERNEL) { 
> > > > 
> > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, 
> > > > _shift))
> > > > return -EINVAL;  
> > > 
> > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here
> > pretty much, reproducer is above so try and see for yourself
> 
> I will play with this...

OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G 
which generated
[...]
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009]
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff]
[0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff]
[0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug
[0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
0x0010-0x3fff] -> [mem 0x-0x3fff]
[0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff]
[0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd]
[0.00] Zone ranges:
[0.00]   DMA  [mem 0x1000-0x00ff]
[0.00]   DMA32[mem 0x0100-0x7ffd]
[0.00]   Normal   empty
[0.00] Movable zone start for each node
[0.00] Early memory node ranges
[0.00]   node   0: [mem 0x1000-0x0009efff]
[0.00]   node   0: [mem 0x0010-0x3fff]
[0.00]   node   1: [mem 0x4000-0x7ffd]

so there is neither any normal zone nor movable one at the boot time.
Then I hotplugged 1G slot
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1

unfortunatelly the memory didn't show up automatically and I got
[  116.375781] acpi PNP0C80:00: Enumeration failure

so I had to probe it manually (prbably the BIOS my qemu uses doesn't
support auto probing - I haven't really dug further). Anyway the SRAT
table printed during the boot told that we should start at 0x1

# echo 0x1 > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory32/valid_zones
Normal Movable

which looks reasonably right? Both Normal and Movable zones are allowed

# echo $((0x1+(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable

Huh, so our valid_zones have changed under our feet...

# echo $((0x1+2*(128<<20))) > /sys/devices/system/memory/probe
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal
/sys/devices/system/memory/memory34/valid_zones:Normal Movable

and again. So only the last memblock is considered movable. Let's try to
online them now.

# echo online_movable > /sys/devices/system/memory/memory34/state
# grep . /sys/devices/system/memory/memory3?/valid_zones
/sys/devices/system/memory/memory32/valid_zones:Normal
/sys/devices/system/memory/memory33/valid_zones:Normal Movable
/sys/devices/system/memory/memory34/valid_zones:Movable Normal

This would explain why onlining from the last block actually works but
to me this sounds like a completely crappy behavior. All we need to
guarantee AFAICS is that Normal and Movable zones do not overlap. I
believe there is even no real requirement about ordering of the physical
memory in Normal vs. Movable zones as long as they do not overlap. But
let's keep it simple for the start and always enforce the current status
quo that Normal zone is physically preceeding Movable zone.
Can somebody explain why we cannot have a simple rule for Normal vs.
Movable which would be:
- block [pfn, pfn+block_size] can be Normal if
  !zone_populated(MOVABLE) || pfn+block_size < 
ZONE_MOVABLE->zone_start_pfn
- block [pfn, pfn+block_size] can be Movable if
  !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn

I haven't fully grokked all the restrictions on the movable zone size
based on the kernel parameters (find_zone_movable_pfns_for_nodes) but
this shouldn't really make the situation really much more complicated I
believe because those parameters should be mostly about static
initialization rather than hotplug but I