Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote: > Hello, > > On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: > > On Fri 10-03-17 13:00:37, Reza Arbab wrote: > > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > > > >OK, so while I was playing with this setup some more I probably got why > > > >this is done this way. All new memblocks are added to the zone Normal > > > >where they are accounted as spanned but not present. > > > > > > It's not always zone Normal. See zone_for_memory(). This leads to a > > > workaround for having to do online_movable in descending block order. > > > Instead of this: > > > > > > 1. probe block 34, probe block 33, probe block 32, ... > > > 2. online_movable 34, online_movable 33, online_movable 32, ... > > > > > > you can online_movable the first block before adding the rest: > > > > I do I enforce that behavior when the probe happens automagically? > > What I provided as guide to online as movable in current and older > kernels is: > > 1) Remove udev rule > 2) After adding memory with qemu/libvirt API run in the guest > > --- workaround start > #!/bin/bash > for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do > if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; > fi; done > --- workaround end > > That's how bad is onlining as movable for memory hotunplug. Yeah, this is really yucky. Fortunately, I already have a working prototype which removes this restriction altogether. > > > 1. probe block 32, online_movable 32 > > > 2. probe block 33, probe block 34, ... > > > - zone_for_memory() will cause these to start Movable > > > 3. online 33, online 34, ... > > > - they're already in Movable, so online_movable is equivalentr > > > > > > I agree with your general sentiment that this stuff is very nonintuitive. > > > > My criterion for nonintuitive is probably different because I would call > > this _completely_unusable_. Sorry for being so loud about this but the > > more I look into this area the more WTF code I see. This has seen close > > to zero review and seems to be building up more single usecase code on > > top of previous. We need to change this, seriously! > > It's not a bug, but when I found it I called it a "constraint" and > when I debugged it (IIRC) it originated here: > > } else if (online_type == MMOP_ONLINE_MOVABLE) { > if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift)) > return -EINVAL; > } > > Fixing it so you could online as movable even if it wasn't the last > block was in my todo list but then we had other plans. > > Unfortunately unless pfn+nr_pages of the newly created Movable zone > matches the end of the normal zone (or whatever was there that has to > be converted to Movable), it will fail onlining as movable. Well, I think the main problem is that we associate pages added in the first phase (probe) to the Normal zone. The everything else is just a fallout and fiddling to make it work somehow. [...] > To be clear, I'm not necessarily against eventually removing > _DEFFAULT_ONLINE, but an equally reliable and comparably fast > alternative should be provided first and there's no such alternative > right now. As pointed in other reply that is not an intention here. I primarily wanted to understand the scope of the problem. I believe this config option was rushed into the kernel without considering other alternatives which would make the hotplug more usable by others as well. > If s390 has special issues requiring admin or a software hotplug > manager app to enable blocks of memory by hand, the _DEFFAULT_ONLINE > could be then an option selected or not selected by > arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1 > with the in-kernel feature. If the in-kernel automatic onlining is not > workable on s390 I would assume the udev rule is also not workable. But this is not about s390. It is about different usecases which require different onlining policy and that is the main problem of the hardcoded one in the kernel. See the other reply. -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Tue 14-03-17 20:35:21, Andrea Arcangeli wrote: > Hello, > > On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: > > On Fri 10-03-17 13:00:37, Reza Arbab wrote: > > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > > > >OK, so while I was playing with this setup some more I probably got why > > > >this is done this way. All new memblocks are added to the zone Normal > > > >where they are accounted as spanned but not present. > > > > > > It's not always zone Normal. See zone_for_memory(). This leads to a > > > workaround for having to do online_movable in descending block order. > > > Instead of this: > > > > > > 1. probe block 34, probe block 33, probe block 32, ... > > > 2. online_movable 34, online_movable 33, online_movable 32, ... > > > > > > you can online_movable the first block before adding the rest: > > > > I do I enforce that behavior when the probe happens automagically? > > What I provided as guide to online as movable in current and older > kernels is: > > 1) Remove udev rule > 2) After adding memory with qemu/libvirt API run in the guest > > --- workaround start > #!/bin/bash > for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do > if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; > fi; done > --- workaround end > > That's how bad is onlining as movable for memory hotunplug. Yeah, this is really yucky. Fortunately, I already have a working prototype which removes this restriction altogether. > > > 1. probe block 32, online_movable 32 > > > 2. probe block 33, probe block 34, ... > > > - zone_for_memory() will cause these to start Movable > > > 3. online 33, online 34, ... > > > - they're already in Movable, so online_movable is equivalentr > > > > > > I agree with your general sentiment that this stuff is very nonintuitive. > > > > My criterion for nonintuitive is probably different because I would call > > this _completely_unusable_. Sorry for being so loud about this but the > > more I look into this area the more WTF code I see. This has seen close > > to zero review and seems to be building up more single usecase code on > > top of previous. We need to change this, seriously! > > It's not a bug, but when I found it I called it a "constraint" and > when I debugged it (IIRC) it originated here: > > } else if (online_type == MMOP_ONLINE_MOVABLE) { > if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift)) > return -EINVAL; > } > > Fixing it so you could online as movable even if it wasn't the last > block was in my todo list but then we had other plans. > > Unfortunately unless pfn+nr_pages of the newly created Movable zone > matches the end of the normal zone (or whatever was there that has to > be converted to Movable), it will fail onlining as movable. Well, I think the main problem is that we associate pages added in the first phase (probe) to the Normal zone. The everything else is just a fallout and fiddling to make it work somehow. [...] > To be clear, I'm not necessarily against eventually removing > _DEFFAULT_ONLINE, but an equally reliable and comparably fast > alternative should be provided first and there's no such alternative > right now. As pointed in other reply that is not an intention here. I primarily wanted to understand the scope of the problem. I believe this config option was rushed into the kernel without considering other alternatives which would make the hotplug more usable by others as well. > If s390 has special issues requiring admin or a software hotplug > manager app to enable blocks of memory by hand, the _DEFFAULT_ONLINE > could be then an option selected or not selected by > arch/*/Kconfig. The udev rule is still an automatic action so it's 1:1 > with the in-kernel feature. If the in-kernel automatic onlining is not > workable on s390 I would assume the udev rule is also not workable. But this is not about s390. It is about different usecases which require different onlining policy and that is the main problem of the hardcoded one in the kernel. See the other reply. -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Hello, On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: > On Fri 10-03-17 13:00:37, Reza Arbab wrote: > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > > >OK, so while I was playing with this setup some more I probably got why > > >this is done this way. All new memblocks are added to the zone Normal > > >where they are accounted as spanned but not present. > > > > It's not always zone Normal. See zone_for_memory(). This leads to a > > workaround for having to do online_movable in descending block order. > > Instead of this: > > > > 1. probe block 34, probe block 33, probe block 32, ... > > 2. online_movable 34, online_movable 33, online_movable 32, ... > > > > you can online_movable the first block before adding the rest: > > I do I enforce that behavior when the probe happens automagically? What I provided as guide to online as movable in current and older kernels is: 1) Remove udev rule 2) After adding memory with qemu/libvirt API run in the guest --- workaround start #!/bin/bash for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; done --- workaround end That's how bad is onlining as movable for memory hotunplug. > > 1. probe block 32, online_movable 32 > > 2. probe block 33, probe block 34, ... > > - zone_for_memory() will cause these to start Movable > > 3. online 33, online 34, ... > > - they're already in Movable, so online_movable is equivalentr > > > > I agree with your general sentiment that this stuff is very nonintuitive. > > My criterion for nonintuitive is probably different because I would call > this _completely_unusable_. Sorry for being so loud about this but the > more I look into this area the more WTF code I see. This has seen close > to zero review and seems to be building up more single usecase code on > top of previous. We need to change this, seriously! It's not a bug, but when I found it I called it a "constraint" and when I debugged it (IIRC) it originated here: } else if (online_type == MMOP_ONLINE_MOVABLE) { if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift)) return -EINVAL; } Fixing it so you could online as movable even if it wasn't the last block was in my todo list but then we had other plans. Unfortunately unless pfn+nr_pages of the newly created Movable zone matches the end of the normal zone (or whatever was there that has to be converted to Movable), it will fail onlining as movable. To fix it is enough to check that everything from pfn to the end of whatever zone existed there (or the end of the node perhaps safer) can be converted as movable and just convert the whole thing as movable instead of stopping at pfn+nr_pages. Also note, onlining highmem physical ranges as movable requires yet another config option to be set for the below check to pass (CONFIG_MOVABLE_NODE=y), which I'm not exactly sure why anybody would want to set =n, and perhaps would be candidate for dropping well before considering to drop _DEFAULT_ONLINE and at best it should be replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y behavior. if ((zone_idx(zone) > ZONE_NORMAL || online_type == MMOP_ONLINE_MOVABLE) && !can_online_high_movable(zone)) return -EINVAL; I would suggest to drop the constraints in online_pages and perhaps also the CONFIG_MOVABLE_NODE option before going to drop the automatic onlining in kernel. Because of the above constraint the udev rule is unusable anyway for onlining memory as movable so that it can be hotunplugged reliably (well not so reliably, page_migrate does a loop and tries many times but it occurred to me it failed once to offline and at next try it worked, temporary page pins with O_DIRECT screw with page_migration, rightfully so and no easy fix). After the above constraint is fixed I suggest to look into fixing the async onlining by making the udev rule run synchronously. Adding 4T to a 8G guest is perfectly reasonable and normal operation, no excuse for it to fail as long as you don't pretend such 4T to be unpluggable too later (which we won't). As I understand it, the whole point of _DEFFAULT_ONLINE=y is precisely that it's easier to fix it in kernel than fixing the udev rule. Furthermore the above constraint for the zone shifting which breaks online_movable in udev is not an issue for _DEFFAULT_ONLINE=y because kernel onlining is synchronous by design (no special synchronous udev rule fix is required) so it can cope fine with the existing above constraint by onlining as movable from the end of the last zone in each node so that such constraint never gets in the way. Extending _DEFFAULT_ONLINE=y so that it can also online as movable has been worked on. On a side note, kernelcore=xxx passed to a kernel with _DEFFAULT_ONLINE=y should already allow to
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Hello, On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: > On Fri 10-03-17 13:00:37, Reza Arbab wrote: > > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > > >OK, so while I was playing with this setup some more I probably got why > > >this is done this way. All new memblocks are added to the zone Normal > > >where they are accounted as spanned but not present. > > > > It's not always zone Normal. See zone_for_memory(). This leads to a > > workaround for having to do online_movable in descending block order. > > Instead of this: > > > > 1. probe block 34, probe block 33, probe block 32, ... > > 2. online_movable 34, online_movable 33, online_movable 32, ... > > > > you can online_movable the first block before adding the rest: > > I do I enforce that behavior when the probe happens automagically? What I provided as guide to online as movable in current and older kernels is: 1) Remove udev rule 2) After adding memory with qemu/libvirt API run in the guest --- workaround start #!/bin/bash for i in `ls -d1 /sys/devices/system/memory/memory* | sort -nr -t y -k 5`; do if [ "`cat $i/state`" == "offline" ]; then echo online_movable > $i/state ; fi; done --- workaround end That's how bad is onlining as movable for memory hotunplug. > > 1. probe block 32, online_movable 32 > > 2. probe block 33, probe block 34, ... > > - zone_for_memory() will cause these to start Movable > > 3. online 33, online 34, ... > > - they're already in Movable, so online_movable is equivalentr > > > > I agree with your general sentiment that this stuff is very nonintuitive. > > My criterion for nonintuitive is probably different because I would call > this _completely_unusable_. Sorry for being so loud about this but the > more I look into this area the more WTF code I see. This has seen close > to zero review and seems to be building up more single usecase code on > top of previous. We need to change this, seriously! It's not a bug, but when I found it I called it a "constraint" and when I debugged it (IIRC) it originated here: } else if (online_type == MMOP_ONLINE_MOVABLE) { if (!zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, _shift)) return -EINVAL; } Fixing it so you could online as movable even if it wasn't the last block was in my todo list but then we had other plans. Unfortunately unless pfn+nr_pages of the newly created Movable zone matches the end of the normal zone (or whatever was there that has to be converted to Movable), it will fail onlining as movable. To fix it is enough to check that everything from pfn to the end of whatever zone existed there (or the end of the node perhaps safer) can be converted as movable and just convert the whole thing as movable instead of stopping at pfn+nr_pages. Also note, onlining highmem physical ranges as movable requires yet another config option to be set for the below check to pass (CONFIG_MOVABLE_NODE=y), which I'm not exactly sure why anybody would want to set =n, and perhaps would be candidate for dropping well before considering to drop _DEFAULT_ONLINE and at best it should be replaced with a kernel parameter to turn off the CONFIG_MOVABLE_NODE=y behavior. if ((zone_idx(zone) > ZONE_NORMAL || online_type == MMOP_ONLINE_MOVABLE) && !can_online_high_movable(zone)) return -EINVAL; I would suggest to drop the constraints in online_pages and perhaps also the CONFIG_MOVABLE_NODE option before going to drop the automatic onlining in kernel. Because of the above constraint the udev rule is unusable anyway for onlining memory as movable so that it can be hotunplugged reliably (well not so reliably, page_migrate does a loop and tries many times but it occurred to me it failed once to offline and at next try it worked, temporary page pins with O_DIRECT screw with page_migration, rightfully so and no easy fix). After the above constraint is fixed I suggest to look into fixing the async onlining by making the udev rule run synchronously. Adding 4T to a 8G guest is perfectly reasonable and normal operation, no excuse for it to fail as long as you don't pretend such 4T to be unpluggable too later (which we won't). As I understand it, the whole point of _DEFFAULT_ONLINE=y is precisely that it's easier to fix it in kernel than fixing the udev rule. Furthermore the above constraint for the zone shifting which breaks online_movable in udev is not an issue for _DEFFAULT_ONLINE=y because kernel onlining is synchronous by design (no special synchronous udev rule fix is required) so it can cope fine with the existing above constraint by onlining as movable from the end of the last zone in each node so that such constraint never gets in the way. Extending _DEFFAULT_ONLINE=y so that it can also online as movable has been worked on. On a side note, kernelcore=xxx passed to a kernel with _DEFFAULT_ONLINE=y should already allow to
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
> and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was > the reason to add this code back in 2006 is not true anymore. So I am > really wondering. Do we absolutely need to assign pages which are not > onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them > out of any zone and wait for memory online operation to put them where > requested? I don't remember any specific reason. May have just been simplification. Should be fine to use any zone. -Andi
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
> and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was > the reason to add this code back in 2006 is not true anymore. So I am > really wondering. Do we absolutely need to assign pages which are not > onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them > out of any zone and wait for memory online operation to put them where > requested? I don't remember any specific reason. May have just been simplification. Should be fine to use any zone. -Andi
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Let's add Andi On Fri 10-03-17 16:53:33, Michal Hocko wrote: > On Fri 10-03-17 14:58:07, Michal Hocko wrote: > [...] > > This would explain why onlining from the last block actually works but > > to me this sounds like a completely crappy behavior. All we need to > > guarantee AFAICS is that Normal and Movable zones do not overlap. I > > believe there is even no real requirement about ordering of the physical > > memory in Normal vs. Movable zones as long as they do not overlap. But > > let's keep it simple for the start and always enforce the current status > > quo that Normal zone is physically preceeding Movable zone. > > Can somebody explain why we cannot have a simple rule for Normal vs. > > Movable which would be: > > - block [pfn, pfn+block_size] can be Normal if > > !zone_populated(MOVABLE) || pfn+block_size < > > ZONE_MOVABLE->zone_start_pfn > > - block [pfn, pfn+block_size] can be Movable if > > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn > > OK, so while I was playing with this setup some more I probably got why > this is done this way. All new memblocks are added to the zone Normal > where they are accounted as spanned but not present. When we do > online_movable we just cut from the end of the Normal zone and move it > to Movable zone. This sounds really awkward. What was the reason to go > this way? Why cannot we simply add those pages to the zone at the online > time? Answering to myself. So the reason seems to be 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem") which is no longer true because config MEMORY_HOTPLUG bool "Allow for memory hot-add" depends on SPARSEMEM || X86_64_ACPI_NUMA depends on ARCH_ENABLE_MEMORY_HOTPLUG depends on COMPILE_TEST || !KASAN so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled. config X86_64_ACPI_NUMA def_bool y prompt "ACPI NUMA detection" depends on X86_64 && NUMA && ACPI && PCI select ACPI_NUMA But I do not see any way how to enable anything but SPARSEMEM for x86_64 choice prompt "Memory model" depends on SELECT_MEMORY_MODEL default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT default FLATMEM_MANUAL ARCH_SPARSEMEM_DEFAULT is 32b only config ARCH_DISCONTIGMEM_DEFAULT def_bool y depends on NUMA && X86_32 and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was the reason to add this code back in 2006 is not true anymore. So I am really wondering. Do we absolutely need to assign pages which are not onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them out of any zone and wait for memory online operation to put them where requested? -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Let's add Andi On Fri 10-03-17 16:53:33, Michal Hocko wrote: > On Fri 10-03-17 14:58:07, Michal Hocko wrote: > [...] > > This would explain why onlining from the last block actually works but > > to me this sounds like a completely crappy behavior. All we need to > > guarantee AFAICS is that Normal and Movable zones do not overlap. I > > believe there is even no real requirement about ordering of the physical > > memory in Normal vs. Movable zones as long as they do not overlap. But > > let's keep it simple for the start and always enforce the current status > > quo that Normal zone is physically preceeding Movable zone. > > Can somebody explain why we cannot have a simple rule for Normal vs. > > Movable which would be: > > - block [pfn, pfn+block_size] can be Normal if > > !zone_populated(MOVABLE) || pfn+block_size < > > ZONE_MOVABLE->zone_start_pfn > > - block [pfn, pfn+block_size] can be Movable if > > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn > > OK, so while I was playing with this setup some more I probably got why > this is done this way. All new memblocks are added to the zone Normal > where they are accounted as spanned but not present. When we do > online_movable we just cut from the end of the Normal zone and move it > to Movable zone. This sounds really awkward. What was the reason to go > this way? Why cannot we simply add those pages to the zone at the online > time? Answering to myself. So the reason seems to be 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem") which is no longer true because config MEMORY_HOTPLUG bool "Allow for memory hot-add" depends on SPARSEMEM || X86_64_ACPI_NUMA depends on ARCH_ENABLE_MEMORY_HOTPLUG depends on COMPILE_TEST || !KASAN so it is either SPARSEMEM or X86_64_ACPI_NUMA that would have to be enabled. config X86_64_ACPI_NUMA def_bool y prompt "ACPI NUMA detection" depends on X86_64 && NUMA && ACPI && PCI select ACPI_NUMA But I do not see any way how to enable anything but SPARSEMEM for x86_64 choice prompt "Memory model" depends on SELECT_MEMORY_MODEL default DISCONTIGMEM_MANUAL if ARCH_DISCONTIGMEM_DEFAULT default SPARSEMEM_MANUAL if ARCH_SPARSEMEM_DEFAULT default FLATMEM_MANUAL ARCH_SPARSEMEM_DEFAULT is 32b only config ARCH_DISCONTIGMEM_DEFAULT def_bool y depends on NUMA && X86_32 and ARCH_SPARSEMEM_DEFAULT is enabeld on 64b. So I guess whatever was the reason to add this code back in 2006 is not true anymore. So I am really wondering. Do we absolutely need to assign pages which are not onlined yet to the ZONE_NORMAL unconditionally? Why cannot we put them out of any zone and wait for memory online operation to put them where requested? -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: I agree with your general sentiment that this stuff is very nonintuitive. My criterion for nonintuitive is probably different because I would call this _completely_unusable_. Sorry for being so loud about this but the more I look into this area the more WTF code I see. This has seen close to zero review and seems to be building up more single usecase code on top of previous. We need to change this, seriously! No argument here. I'm happy to help however I can. -- Reza Arbab
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon, Mar 13, 2017 at 10:21:45AM +0100, Michal Hocko wrote: I agree with your general sentiment that this stuff is very nonintuitive. My criterion for nonintuitive is probably different because I would call this _completely_unusable_. Sorry for being so loud about this but the more I look into this area the more WTF code I see. This has seen close to zero review and seems to be building up more single usecase code on top of previous. We need to change this, seriously! No argument here. I'm happy to help however I can. -- Reza Arbab
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon 13-03-17 14:57:12, Igor Mammedov wrote: > On Mon, 13 Mar 2017 11:43:02 +0100 > Michal Hockowrote: > > > On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > > > On Fri, 10 Mar 2017 14:58:07 +0100 > > [...] > > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > > > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] > > > > hotplug > > > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > > > > 0x0010-0x3fff] -> [mem 0x-0x3fff] > > > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > > > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > > > > [0.00] Zone ranges: > > > > [0.00] DMA [mem 0x1000-0x00ff] > > > > [0.00] DMA32[mem 0x0100-0x7ffd] > > > > [0.00] Normal empty > > > > [0.00] Movable zone start for each node > > > > [0.00] Early memory node ranges > > > > [0.00] node 0: [mem 0x1000-0x0009efff] > > > > [0.00] node 0: [mem 0x0010-0x3fff] > > > > [0.00] node 1: [mem 0x4000-0x7ffd] > > > > > > > > so there is neither any normal zone nor movable one at the boot time. > > > it could be if hotpluggable memory were present at boot time in E802 table > > > (if I remember right when running on hyperv there is movable zone at boot > > > time), > > > > > > but in qemu hotpluggable memory isn't put into E820, > > > so zone is allocated later when memory is enumerated > > > by ACPI subsystem and onlined. > > > It causes less issues wrt movable zone and works for > > > different versions of linux/windows as well. > > > > > > That's where in kernel auto-onlining could be also useful, > > > since user would be able to start-up with with small > > > non removable memory plus several removable DIMMs > > > and have all the memory onlined/available by the time > > > initrd is loaded. (missing piece here is onling > > > removable memory as movable by default). > > > > Why we should even care to online that memory that early rather than > > making it available via e820? > > It's not forbidden by spec and has less complications > when it comes to removable memory. Declaring it in E820 > would add following limitations/drawbacks: > - firmware should be able to exclude removable memory >from its usage (currently SeaBIOS nor EFI have to >know/care about it) => less qemu-guest ABI to maintain. > - OS should be taught to avoid/move (early) nonmovable >allocations from removable address ranges. >There were patches targeting that in recent kernels, >but it won't work with older kernels that don't have it. >So limiting a range of OSes that could run on QEMU >and do memory removal. > > E820 less approach works reasonably well with wide range > of guest OSes and less complex that if removable memory > were present it E820. Hence I don't have a compelling > reason to introduce removable memory in E820 as it > only adds to hot(un)plug issues. OK I see and that sounds like an argument to not put those ranges to E820. I still fail to see why we haeve to online the memory early during the boot and cannot wait for userspace to run? -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon 13-03-17 14:57:12, Igor Mammedov wrote: > On Mon, 13 Mar 2017 11:43:02 +0100 > Michal Hocko wrote: > > > On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > > > On Fri, 10 Mar 2017 14:58:07 +0100 > > [...] > > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > > > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] > > > > hotplug > > > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > > > > 0x0010-0x3fff] -> [mem 0x-0x3fff] > > > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > > > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > > > > [0.00] Zone ranges: > > > > [0.00] DMA [mem 0x1000-0x00ff] > > > > [0.00] DMA32[mem 0x0100-0x7ffd] > > > > [0.00] Normal empty > > > > [0.00] Movable zone start for each node > > > > [0.00] Early memory node ranges > > > > [0.00] node 0: [mem 0x1000-0x0009efff] > > > > [0.00] node 0: [mem 0x0010-0x3fff] > > > > [0.00] node 1: [mem 0x4000-0x7ffd] > > > > > > > > so there is neither any normal zone nor movable one at the boot time. > > > it could be if hotpluggable memory were present at boot time in E802 table > > > (if I remember right when running on hyperv there is movable zone at boot > > > time), > > > > > > but in qemu hotpluggable memory isn't put into E820, > > > so zone is allocated later when memory is enumerated > > > by ACPI subsystem and onlined. > > > It causes less issues wrt movable zone and works for > > > different versions of linux/windows as well. > > > > > > That's where in kernel auto-onlining could be also useful, > > > since user would be able to start-up with with small > > > non removable memory plus several removable DIMMs > > > and have all the memory onlined/available by the time > > > initrd is loaded. (missing piece here is onling > > > removable memory as movable by default). > > > > Why we should even care to online that memory that early rather than > > making it available via e820? > > It's not forbidden by spec and has less complications > when it comes to removable memory. Declaring it in E820 > would add following limitations/drawbacks: > - firmware should be able to exclude removable memory >from its usage (currently SeaBIOS nor EFI have to >know/care about it) => less qemu-guest ABI to maintain. > - OS should be taught to avoid/move (early) nonmovable >allocations from removable address ranges. >There were patches targeting that in recent kernels, >but it won't work with older kernels that don't have it. >So limiting a range of OSes that could run on QEMU >and do memory removal. > > E820 less approach works reasonably well with wide range > of guest OSes and less complex that if removable memory > were present it E820. Hence I don't have a compelling > reason to introduce removable memory in E820 as it > only adds to hot(un)plug issues. OK I see and that sounds like an argument to not put those ranges to E820. I still fail to see why we haeve to online the memory early during the boot and cannot wait for userspace to run? -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon, 13 Mar 2017 11:43:02 +0100 Michal Hockowrote: > On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > > On Fri, 10 Mar 2017 14:58:07 +0100 > [...] > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] > > > hotplug > > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > > > 0x0010-0x3fff] -> [mem 0x-0x3fff] > > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > > > [0.00] Zone ranges: > > > [0.00] DMA [mem 0x1000-0x00ff] > > > [0.00] DMA32[mem 0x0100-0x7ffd] > > > [0.00] Normal empty > > > [0.00] Movable zone start for each node > > > [0.00] Early memory node ranges > > > [0.00] node 0: [mem 0x1000-0x0009efff] > > > [0.00] node 0: [mem 0x0010-0x3fff] > > > [0.00] node 1: [mem 0x4000-0x7ffd] > > > > > > so there is neither any normal zone nor movable one at the boot time. > > it could be if hotpluggable memory were present at boot time in E802 table > > (if I remember right when running on hyperv there is movable zone at boot > > time), > > > > but in qemu hotpluggable memory isn't put into E820, > > so zone is allocated later when memory is enumerated > > by ACPI subsystem and onlined. > > It causes less issues wrt movable zone and works for > > different versions of linux/windows as well. > > > > That's where in kernel auto-onlining could be also useful, > > since user would be able to start-up with with small > > non removable memory plus several removable DIMMs > > and have all the memory onlined/available by the time > > initrd is loaded. (missing piece here is onling > > removable memory as movable by default). > > Why we should even care to online that memory that early rather than > making it available via e820? It's not forbidden by spec and has less complications when it comes to removable memory. Declaring it in E820 would add following limitations/drawbacks: - firmware should be able to exclude removable memory from its usage (currently SeaBIOS nor EFI have to know/care about it) => less qemu-guest ABI to maintain. - OS should be taught to avoid/move (early) nonmovable allocations from removable address ranges. There were patches targeting that in recent kernels, but it won't work with older kernels that don't have it. So limiting a range of OSes that could run on QEMU and do memory removal. E820 less approach works reasonably well with wide range of guest OSes and less complex that if removable memory were present it E820. Hence I don't have a compelling reason to introduce removable memory in E820 as it only adds to hot(un)plug issues. I have an off-tree QEMU hack that puts hotremovable memory added with "-device pc-dimm" on CLI into E820 to experiment with. It could be useful to play with zone layouts at boot time, so if you are interested I can rebase it on top of current master and post it here to play with.
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon, 13 Mar 2017 11:43:02 +0100 Michal Hocko wrote: > On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > > On Fri, 10 Mar 2017 14:58:07 +0100 > [...] > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] > > > hotplug > > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > > > 0x0010-0x3fff] -> [mem 0x-0x3fff] > > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > > > [0.00] Zone ranges: > > > [0.00] DMA [mem 0x1000-0x00ff] > > > [0.00] DMA32[mem 0x0100-0x7ffd] > > > [0.00] Normal empty > > > [0.00] Movable zone start for each node > > > [0.00] Early memory node ranges > > > [0.00] node 0: [mem 0x1000-0x0009efff] > > > [0.00] node 0: [mem 0x0010-0x3fff] > > > [0.00] node 1: [mem 0x4000-0x7ffd] > > > > > > so there is neither any normal zone nor movable one at the boot time. > > it could be if hotpluggable memory were present at boot time in E802 table > > (if I remember right when running on hyperv there is movable zone at boot > > time), > > > > but in qemu hotpluggable memory isn't put into E820, > > so zone is allocated later when memory is enumerated > > by ACPI subsystem and onlined. > > It causes less issues wrt movable zone and works for > > different versions of linux/windows as well. > > > > That's where in kernel auto-onlining could be also useful, > > since user would be able to start-up with with small > > non removable memory plus several removable DIMMs > > and have all the memory onlined/available by the time > > initrd is loaded. (missing piece here is onling > > removable memory as movable by default). > > Why we should even care to online that memory that early rather than > making it available via e820? It's not forbidden by spec and has less complications when it comes to removable memory. Declaring it in E820 would add following limitations/drawbacks: - firmware should be able to exclude removable memory from its usage (currently SeaBIOS nor EFI have to know/care about it) => less qemu-guest ABI to maintain. - OS should be taught to avoid/move (early) nonmovable allocations from removable address ranges. There were patches targeting that in recent kernels, but it won't work with older kernels that don't have it. So limiting a range of OSes that could run on QEMU and do memory removal. E820 less approach works reasonably well with wide range of guest OSes and less complex that if removable memory were present it E820. Hence I don't have a compelling reason to introduce removable memory in E820 as it only adds to hot(un)plug issues. I have an off-tree QEMU hack that puts hotremovable memory added with "-device pc-dimm" on CLI into E820 to experiment with. It could be useful to play with zone layouts at boot time, so if you are interested I can rebase it on top of current master and post it here to play with.
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > On Fri, 10 Mar 2017 14:58:07 +0100 [...] > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] > > hotplug > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > > 0x0010-0x3fff] -> [mem 0x-0x3fff] > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > > [0.00] Zone ranges: > > [0.00] DMA [mem 0x1000-0x00ff] > > [0.00] DMA32[mem 0x0100-0x7ffd] > > [0.00] Normal empty > > [0.00] Movable zone start for each node > > [0.00] Early memory node ranges > > [0.00] node 0: [mem 0x1000-0x0009efff] > > [0.00] node 0: [mem 0x0010-0x3fff] > > [0.00] node 1: [mem 0x4000-0x7ffd] > > > > so there is neither any normal zone nor movable one at the boot time. > it could be if hotpluggable memory were present at boot time in E802 table > (if I remember right when running on hyperv there is movable zone at boot > time), > > but in qemu hotpluggable memory isn't put into E820, > so zone is allocated later when memory is enumerated > by ACPI subsystem and onlined. > It causes less issues wrt movable zone and works for > different versions of linux/windows as well. > > That's where in kernel auto-onlining could be also useful, > since user would be able to start-up with with small > non removable memory plus several removable DIMMs > and have all the memory onlined/available by the time > initrd is loaded. (missing piece here is onling > removable memory as movable by default). Why we should even care to online that memory that early rather than making it available via e820? > > Then I hotplugged 1G slot > > (qemu) object_add memory-backend-ram,id=mem1,size=1G > > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > You can also specify node a pc-dimm goes to with 'node' property > if it should go to other then node 0. > > device_add pc-dimm,id=dimm1,memdev=mem1,node=1 thanks for the tip > > unfortunatelly the memory didn't show up automatically and I got > > [ 116.375781] acpi PNP0C80:00: Enumeration failure > it should work, > do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled? No I didn't. Thanks, good to know! -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Mon 13-03-17 11:31:10, Igor Mammedov wrote: > On Fri, 10 Mar 2017 14:58:07 +0100 [...] > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] > > hotplug > > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > > 0x0010-0x3fff] -> [mem 0x-0x3fff] > > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > > [0.00] Zone ranges: > > [0.00] DMA [mem 0x1000-0x00ff] > > [0.00] DMA32[mem 0x0100-0x7ffd] > > [0.00] Normal empty > > [0.00] Movable zone start for each node > > [0.00] Early memory node ranges > > [0.00] node 0: [mem 0x1000-0x0009efff] > > [0.00] node 0: [mem 0x0010-0x3fff] > > [0.00] node 1: [mem 0x4000-0x7ffd] > > > > so there is neither any normal zone nor movable one at the boot time. > it could be if hotpluggable memory were present at boot time in E802 table > (if I remember right when running on hyperv there is movable zone at boot > time), > > but in qemu hotpluggable memory isn't put into E820, > so zone is allocated later when memory is enumerated > by ACPI subsystem and onlined. > It causes less issues wrt movable zone and works for > different versions of linux/windows as well. > > That's where in kernel auto-onlining could be also useful, > since user would be able to start-up with with small > non removable memory plus several removable DIMMs > and have all the memory onlined/available by the time > initrd is loaded. (missing piece here is onling > removable memory as movable by default). Why we should even care to online that memory that early rather than making it available via e820? > > Then I hotplugged 1G slot > > (qemu) object_add memory-backend-ram,id=mem1,size=1G > > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 > You can also specify node a pc-dimm goes to with 'node' property > if it should go to other then node 0. > > device_add pc-dimm,id=dimm1,memdev=mem1,node=1 thanks for the tip > > unfortunatelly the memory didn't show up automatically and I got > > [ 116.375781] acpi PNP0C80:00: Enumeration failure > it should work, > do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled? No I didn't. Thanks, good to know! -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri, 10 Mar 2017 14:58:07 +0100 Michal Hockowrote: > Let's CC people touching this logic. A short summary is that onlining > memory via udev is currently unusable for online_movable because blocks > are added from lower addresses while movable blocks are allowed from > last blocks. More below. > > On Thu 09-03-17 13:54:00, Michal Hocko wrote: > > On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > > > On Mon, 6 Mar 2017 15:54:17 +0100 > > > Michal Hocko wrote: > > > > > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > > [...] > > > > > in current mainline kernel it triggers following code path: > > > > > > > > > > online_pages() > > > > > ... > > > > >if (online_type == MMOP_ONLINE_KERNEL) { > > > > > > > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, > > > > > _shift)) > > > > > return -EINVAL; > > > > > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > > > pretty much, reproducer is above so try and see for yourself > > > > I will play with this... > > OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G > which generated 'mem' here distributes boot memory specified by "-m 2G" and does not include memory specified by -device pc-dimm. > [...] > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > 0x0010-0x3fff] -> [mem 0x-0x3fff] > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > [0.00] Zone ranges: > [0.00] DMA [mem 0x1000-0x00ff] > [0.00] DMA32[mem 0x0100-0x7ffd] > [0.00] Normal empty > [0.00] Movable zone start for each node > [0.00] Early memory node ranges > [0.00] node 0: [mem 0x1000-0x0009efff] > [0.00] node 0: [mem 0x0010-0x3fff] > [0.00] node 1: [mem 0x4000-0x7ffd] > > so there is neither any normal zone nor movable one at the boot time. it could be if hotpluggable memory were present at boot time in E802 table (if I remember right when running on hyperv there is movable zone at boot time), but in qemu hotpluggable memory isn't put into E820, so zone is allocated later when memory is enumerated by ACPI subsystem and onlined. It causes less issues wrt movable zone and works for different versions of linux/windows as well. That's where in kernel auto-onlining could be also useful, since user would be able to start-up with with small non removable memory plus several removable DIMMs and have all the memory onlined/available by the time initrd is loaded. (missing piece here is onling removable memory as movable by default). > Then I hotplugged 1G slot > (qemu) object_add memory-backend-ram,id=mem1,size=1G > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 You can also specify node a pc-dimm goes to with 'node' property if it should go to other then node 0. device_add pc-dimm,id=dimm1,memdev=mem1,node=1 > unfortunatelly the memory didn't show up automatically and I got > [ 116.375781] acpi PNP0C80:00: Enumeration failure it should work, do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled? > so I had to probe it manually (prbably the BIOS my qemu uses doesn't > support auto probing - I haven't really dug further). Anyway the SRAT > table printed during the boot told that we should start at 0x1
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri, 10 Mar 2017 14:58:07 +0100 Michal Hocko wrote: > Let's CC people touching this logic. A short summary is that onlining > memory via udev is currently unusable for online_movable because blocks > are added from lower addresses while movable blocks are allowed from > last blocks. More below. > > On Thu 09-03-17 13:54:00, Michal Hocko wrote: > > On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > > > On Mon, 6 Mar 2017 15:54:17 +0100 > > > Michal Hocko wrote: > > > > > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > > [...] > > > > > in current mainline kernel it triggers following code path: > > > > > > > > > > online_pages() > > > > > ... > > > > >if (online_type == MMOP_ONLINE_KERNEL) { > > > > > > > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, > > > > > _shift)) > > > > > return -EINVAL; > > > > > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > > > pretty much, reproducer is above so try and see for yourself > > > > I will play with this... > > OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G > which generated 'mem' here distributes boot memory specified by "-m 2G" and does not include memory specified by -device pc-dimm. > [...] > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] > [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] > [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug > [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem > 0x0010-0x3fff] -> [mem 0x-0x3fff] > [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] > [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] > [0.00] Zone ranges: > [0.00] DMA [mem 0x1000-0x00ff] > [0.00] DMA32[mem 0x0100-0x7ffd] > [0.00] Normal empty > [0.00] Movable zone start for each node > [0.00] Early memory node ranges > [0.00] node 0: [mem 0x1000-0x0009efff] > [0.00] node 0: [mem 0x0010-0x3fff] > [0.00] node 1: [mem 0x4000-0x7ffd] > > so there is neither any normal zone nor movable one at the boot time. it could be if hotpluggable memory were present at boot time in E802 table (if I remember right when running on hyperv there is movable zone at boot time), but in qemu hotpluggable memory isn't put into E820, so zone is allocated later when memory is enumerated by ACPI subsystem and onlined. It causes less issues wrt movable zone and works for different versions of linux/windows as well. That's where in kernel auto-onlining could be also useful, since user would be able to start-up with with small non removable memory plus several removable DIMMs and have all the memory onlined/available by the time initrd is loaded. (missing piece here is onling removable memory as movable by default). > Then I hotplugged 1G slot > (qemu) object_add memory-backend-ram,id=mem1,size=1G > (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 You can also specify node a pc-dimm goes to with 'node' property if it should go to other then node 0. device_add pc-dimm,id=dimm1,memdev=mem1,node=1 > unfortunatelly the memory didn't show up automatically and I got > [ 116.375781] acpi PNP0C80:00: Enumeration failure it should work, do you have CONFIG_ACPI_HOTPLUG_MEMORY enabled? > so I had to probe it manually (prbably the BIOS my qemu uses doesn't > support auto probing - I haven't really dug further). Anyway the SRAT > table printed during the boot told that we should start at 0x1
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri 10-03-17 13:00:37, Reza Arbab wrote: > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > >OK, so while I was playing with this setup some more I probably got why > >this is done this way. All new memblocks are added to the zone Normal > >where they are accounted as spanned but not present. > > It's not always zone Normal. See zone_for_memory(). This leads to a > workaround for having to do online_movable in descending block order. > Instead of this: > > 1. probe block 34, probe block 33, probe block 32, ... > 2. online_movable 34, online_movable 33, online_movable 32, ... > > you can online_movable the first block before adding the rest: I do I enforce that behavior when the probe happens automagically? > 1. probe block 32, online_movable 32 > 2. probe block 33, probe block 34, ... > - zone_for_memory() will cause these to start Movable > 3. online 33, online 34, ... > - they're already in Movable, so online_movable is equivalentr > > I agree with your general sentiment that this stuff is very nonintuitive. My criterion for nonintuitive is probably different because I would call this _completely_unusable_. Sorry for being so loud about this but the more I look into this area the more WTF code I see. This has seen close to zero review and seems to be building up more single usecase code on top of previous. We need to change this, seriously! -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri 10-03-17 13:00:37, Reza Arbab wrote: > On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: > >OK, so while I was playing with this setup some more I probably got why > >this is done this way. All new memblocks are added to the zone Normal > >where they are accounted as spanned but not present. > > It's not always zone Normal. See zone_for_memory(). This leads to a > workaround for having to do online_movable in descending block order. > Instead of this: > > 1. probe block 34, probe block 33, probe block 32, ... > 2. online_movable 34, online_movable 33, online_movable 32, ... > > you can online_movable the first block before adding the rest: I do I enforce that behavior when the probe happens automagically? > 1. probe block 32, online_movable 32 > 2. probe block 33, probe block 34, ... > - zone_for_memory() will cause these to start Movable > 3. online 33, online 34, ... > - they're already in Movable, so online_movable is equivalentr > > I agree with your general sentiment that this stuff is very nonintuitive. My criterion for nonintuitive is probably different because I would call this _completely_unusable_. Sorry for being so loud about this but the more I look into this area the more WTF code I see. This has seen close to zero review and seems to be building up more single usecase code on top of previous. We need to change this, seriously! -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: OK, so while I was playing with this setup some more I probably got why this is done this way. All new memblocks are added to the zone Normal where they are accounted as spanned but not present. It's not always zone Normal. See zone_for_memory(). This leads to a workaround for having to do online_movable in descending block order. Instead of this: 1. probe block 34, probe block 33, probe block 32, ... 2. online_movable 34, online_movable 33, online_movable 32, ... you can online_movable the first block before adding the rest: 1. probe block 32, online_movable 32 2. probe block 33, probe block 34, ... - zone_for_memory() will cause these to start Movable 3. online 33, online 34, ... - they're already in Movable, so online_movable is equivalentr I agree with your general sentiment that this stuff is very nonintuitive. -- Reza Arbab
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri, Mar 10, 2017 at 04:53:33PM +0100, Michal Hocko wrote: OK, so while I was playing with this setup some more I probably got why this is done this way. All new memblocks are added to the zone Normal where they are accounted as spanned but not present. It's not always zone Normal. See zone_for_memory(). This leads to a workaround for having to do online_movable in descending block order. Instead of this: 1. probe block 34, probe block 33, probe block 32, ... 2. online_movable 34, online_movable 33, online_movable 32, ... you can online_movable the first block before adding the rest: 1. probe block 32, online_movable 32 2. probe block 33, probe block 34, ... - zone_for_memory() will cause these to start Movable 3. online 33, online 34, ... - they're already in Movable, so online_movable is equivalentr I agree with your general sentiment that this stuff is very nonintuitive. -- Reza Arbab
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri 10-03-17 14:58:07, Michal Hocko wrote: [...] > This would explain why onlining from the last block actually works but > to me this sounds like a completely crappy behavior. All we need to > guarantee AFAICS is that Normal and Movable zones do not overlap. I > believe there is even no real requirement about ordering of the physical > memory in Normal vs. Movable zones as long as they do not overlap. But > let's keep it simple for the start and always enforce the current status > quo that Normal zone is physically preceeding Movable zone. > Can somebody explain why we cannot have a simple rule for Normal vs. > Movable which would be: > - block [pfn, pfn+block_size] can be Normal if > !zone_populated(MOVABLE) || pfn+block_size < > ZONE_MOVABLE->zone_start_pfn > - block [pfn, pfn+block_size] can be Movable if > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn OK, so while I was playing with this setup some more I probably got why this is done this way. All new memblocks are added to the zone Normal where they are accounted as spanned but not present. When we do online_movable we just cut from the end of the Normal zone and move it to Movable zone. This sounds really awkward. What was the reason to go this way? Why cannot we simply add those pages to the zone at the online time? -- Michal Hocko SUSE Labs
Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
On Fri 10-03-17 14:58:07, Michal Hocko wrote: [...] > This would explain why onlining from the last block actually works but > to me this sounds like a completely crappy behavior. All we need to > guarantee AFAICS is that Normal and Movable zones do not overlap. I > believe there is even no real requirement about ordering of the physical > memory in Normal vs. Movable zones as long as they do not overlap. But > let's keep it simple for the start and always enforce the current status > quo that Normal zone is physically preceeding Movable zone. > Can somebody explain why we cannot have a simple rule for Normal vs. > Movable which would be: > - block [pfn, pfn+block_size] can be Normal if > !zone_populated(MOVABLE) || pfn+block_size < > ZONE_MOVABLE->zone_start_pfn > - block [pfn, pfn+block_size] can be Movable if > !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn OK, so while I was playing with this setup some more I probably got why this is done this way. All new memblocks are added to the zone Normal where they are accounted as spanned but not present. When we do online_movable we just cut from the end of the Normal zone and move it to Movable zone. This sounds really awkward. What was the reason to go this way? Why cannot we simply add those pages to the zone at the online time? -- Michal Hocko SUSE Labs
WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Let's CC people touching this logic. A short summary is that onlining memory via udev is currently unusable for online_movable because blocks are added from lower addresses while movable blocks are allowed from last blocks. More below. On Thu 09-03-17 13:54:00, Michal Hocko wrote: > On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > > On Mon, 6 Mar 2017 15:54:17 +0100 > > Michal Hockowrote: > > > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > [...] > > > > in current mainline kernel it triggers following code path: > > > > > > > > online_pages() > > > > ... > > > >if (online_type == MMOP_ONLINE_KERNEL) { > > > > > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, > > > > _shift)) > > > > return -EINVAL; > > > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > > pretty much, reproducer is above so try and see for yourself > > I will play with this... OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated [...] [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 0x0010-0x3fff] -> [mem 0x-0x3fff] [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] [0.00] Zone ranges: [0.00] DMA [mem 0x1000-0x00ff] [0.00] DMA32[mem 0x0100-0x7ffd] [0.00] Normal empty [0.00] Movable zone start for each node [0.00] Early memory node ranges [0.00] node 0: [mem 0x1000-0x0009efff] [0.00] node 0: [mem 0x0010-0x3fff] [0.00] node 1: [mem 0x4000-0x7ffd] so there is neither any normal zone nor movable one at the boot time. Then I hotplugged 1G slot (qemu) object_add memory-backend-ram,id=mem1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 unfortunatelly the memory didn't show up automatically and I got [ 116.375781] acpi PNP0C80:00: Enumeration failure so I had to probe it manually (prbably the BIOS my qemu uses doesn't support auto probing - I haven't really dug further). Anyway the SRAT table printed during the boot told that we should start at 0x1 # echo 0x1 > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable which looks reasonably right? Both Normal and Movable zones are allowed # echo $((0x1+(128<<20))) > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable Huh, so our valid_zones have changed under our feet... # echo $((0x1+2*(128<<20))) > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable and again. So only the last memblock is considered movable. Let's try to online them now. # echo online_movable > /sys/devices/system/memory/memory34/state # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This would explain why onlining from the last block actually works but to me this sounds like a completely crappy behavior. All we need to guarantee AFAICS is that Normal and Movable zones do not overlap. I believe there is even no real requirement about ordering of the physical memory in Normal vs. Movable zones as long as they do not overlap. But let's keep it simple for the start and always enforce the current status quo that Normal zone is physically preceeding Movable zone. Can somebody explain why we cannot have a simple rule for Normal vs. Movable which would be: - block [pfn, pfn+block_size] can be Normal if !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn - block [pfn, pfn+block_size] can be Movable if !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn I haven't fully grokked all the restrictions on the movable zone size based on the kernel parameters (find_zone_movable_pfns_for_nodes) but this shouldn't really make the situation really much more complicated I believe because those parameters should be mostly about static initialization rather
WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Let's CC people touching this logic. A short summary is that onlining memory via udev is currently unusable for online_movable because blocks are added from lower addresses while movable blocks are allowed from last blocks. More below. On Thu 09-03-17 13:54:00, Michal Hocko wrote: > On Tue 07-03-17 13:40:04, Igor Mammedov wrote: > > On Mon, 6 Mar 2017 15:54:17 +0100 > > Michal Hocko wrote: > > > > > On Fri 03-03-17 18:34:22, Igor Mammedov wrote: > [...] > > > > in current mainline kernel it triggers following code path: > > > > > > > > online_pages() > > > > ... > > > >if (online_type == MMOP_ONLINE_KERNEL) { > > > > > > > > if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, > > > > _shift)) > > > > return -EINVAL; > > > > > > Are you sure? I would expect MMOP_ONLINE_MOVABLE here > > pretty much, reproducer is above so try and see for yourself > > I will play with this... OK so I did with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G which generated [...] [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x-0x0009] [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x0010-0x3fff] [0.00] ACPI: SRAT: Node 1 PXM 1 [mem 0x4000-0x7fff] [0.00] ACPI: SRAT: Node 0 PXM 0 [mem 0x1-0x27fff] hotplug [0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 0x0010-0x3fff] -> [mem 0x-0x3fff] [0.00] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fff] [0.00] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffd] [0.00] Zone ranges: [0.00] DMA [mem 0x1000-0x00ff] [0.00] DMA32[mem 0x0100-0x7ffd] [0.00] Normal empty [0.00] Movable zone start for each node [0.00] Early memory node ranges [0.00] node 0: [mem 0x1000-0x0009efff] [0.00] node 0: [mem 0x0010-0x3fff] [0.00] node 1: [mem 0x4000-0x7ffd] so there is neither any normal zone nor movable one at the boot time. Then I hotplugged 1G slot (qemu) object_add memory-backend-ram,id=mem1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 unfortunatelly the memory didn't show up automatically and I got [ 116.375781] acpi PNP0C80:00: Enumeration failure so I had to probe it manually (prbably the BIOS my qemu uses doesn't support auto probing - I haven't really dug further). Anyway the SRAT table printed during the boot told that we should start at 0x1 # echo 0x1 > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable which looks reasonably right? Both Normal and Movable zones are allowed # echo $((0x1+(128<<20))) > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable Huh, so our valid_zones have changed under our feet... # echo $((0x1+2*(128<<20))) > /sys/devices/system/memory/probe # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable and again. So only the last memblock is considered movable. Let's try to online them now. # echo online_movable > /sys/devices/system/memory/memory34/state # grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This would explain why onlining from the last block actually works but to me this sounds like a completely crappy behavior. All we need to guarantee AFAICS is that Normal and Movable zones do not overlap. I believe there is even no real requirement about ordering of the physical memory in Normal vs. Movable zones as long as they do not overlap. But let's keep it simple for the start and always enforce the current status quo that Normal zone is physically preceeding Movable zone. Can somebody explain why we cannot have a simple rule for Normal vs. Movable which would be: - block [pfn, pfn+block_size] can be Normal if !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn - block [pfn, pfn+block_size] can be Movable if !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn I haven't fully grokked all the restrictions on the movable zone size based on the kernel parameters (find_zone_movable_pfns_for_nodes) but this shouldn't really make the situation really much more complicated I believe because those parameters should be mostly about static initialization rather than hotplug but I