Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-30 Thread Vlastimil Babka
On 03/20/2017 07:33 AM, Joonsoo Kim wrote:
>> The fact that sticky movable pageblocks aren't ideal for CMA doesn't mean
>> they're not ideal for memory hotunplug though.
>>
>> With CMA there's no point in having the sticky movable pageblocks
>> scattered around, and it's purely a misfeature to use sticky movable
>> pageblocks because you need the whole CMA area contiguous, hence a
>> ZONE_CMA is ideal.
> No. CMA ranges can be registered multiple times, one per device, and they
> can be scattered due to devices' H/W limitations. So, in the current kernel
> implementation, MIGRATE_CMA pageblocks are sometimes scattered.
> 
>> With memory hotplug, by contrast, the sticky movable pageblocks would
>> allow the kernel to satisfy the current /sys API, and they would
>> provide no downside, unlike in the CMA case where the size of the
>> allocation is unknown.
> No, the same downside also exists in this case. The downside is not related
> to the case where a device uses that range; it is related to how the VM
> manages the range, and the problems are the same. For example, with sticky
> movable pageblocks, we need to subtract the number of free pages in sticky
> movable pageblocks when the watermark is checked for a non-movable
> allocation, and that causes some problems.

Agree. Right now for CMA we have to account NR_FREE_CMA_PAGES (the number
of free pages within MIGRATE_CMA pageblocks), which brings all those hooks
and other troubles to keep the accounting precise (there used to be
various races in there). This goes against the rest of the page grouping
by mobility design, which wasn't meant to be precise, for performance
reasons (e.g. when you change a pageblock's type and move pages between
freelists, any pcpu-cached pages are left on their previous type's list).

We also can't ignore this accounting, as then the watermark check could
pass for e.g. an UNMOVABLE allocation, which would proceed to find
that the only free pages available are within the MIGRATE_CMA (or
sticky-movable) pageblocks, which it's not allowed to fall back to. If
we only started reclaiming at that point, the zone balance checks would
also consider the zone balanced, even though unmovable allocations would
still not be possible.
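
To make the failure mode concrete, here is a minimal Python model of the
watermark problem described above (the names and numbers are purely
illustrative, not the kernel's actual API):

def watermark_ok(free_pages, nr_free_sticky, watermark, movable_alloc):
    # A non-movable allocation must not count free pages sitting in
    # CMA/sticky-movable pageblocks, since it cannot fall back into them.
    if not movable_alloc:
        free_pages -= nr_free_sticky
    return free_pages > watermark

# Zone with 1000 free pages, 900 of them inside sticky-movable pageblocks.
print(watermark_ok(1000, 900, 200, movable_alloc=True))   # True
print(watermark_ok(1000, 900, 200, movable_alloc=False))  # False: without
# the subtraction this would wrongly pass, and the allocation would later
# find only pages it is not allowed to use.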

Even with this extra accounting, things are not perfect, because reclaim
doesn't guarantee freeing the pages in the right pageblocks, so we can
easily overreclaim. That's mainly why I agreed that ZONE_CMA should be
better than the current implementation, and I'm skeptical about the
sticky-movable pageblock idea. Note the conversion to node-lru reclaim
has changed things somewhat, as we can't reclaim a single zone anymore,
but the accounting troubles remain.


Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-20 Thread Joonsoo Kim
2017-03-17 4:01 GMT+09:00 Andrea Arcangeli:
> Hello Joonsoo,

Hello, Andrea.

> On Thu, Mar 16, 2017 at 02:31:22PM +0900, Joonsoo Kim wrote:
>> I haven't followed the previous discussion, so please let me know if I've
>> missed something. I'd just like to comment on sticky pageblocks.
>
> The part of the previous discussion relevant to the
> sticky movable pageblocks is this part from Vitaly:
>
> === quote ===
> Now we have
>
> [Normal][Normal][Normal][Movable][Movable][Movable]
>
> we could have
>
> [Normal][Normal][Movable][Normal][Movable][Normal]
> === quote ===
>
> Suppose you're an admin; here's what you can try, starting from
> all-offlined hotplug memory:
>
> kvm ~ # cat /sys/devices/system/memory/memory3[6-9]/online
> 0
> 0
> 0
> 0
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
> Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
>
> All hotplug memory is offline, no Movable zone.
>
> Then you online interleaved:
>
> kvm ~ # echo online_movable > /sys/devices/system/memory/memory39/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
> Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
> Zone: Movable  Present: 128M   Managed: 128M   Start: 4.9G  End: 5.0G
> kvm ~ # echo online > /sys/devices/system/memory/memory38/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
> Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
> Zone: Normal   Present: 128M   Managed: 128M   Start: 4.0G  End: 4.9G
> Zone: Movable  Present: 128M   Managed: 128M   Start: 4.9G  End: 5.0G
>
> So far so good.
>
> kvm ~ # echo online_movable > /sys/devices/system/memory/memory37/online
> kvm ~ # python ~andrea/zoneinfo.py
> Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
> Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
> Zone: Normal   Present: 256M   Managed: 256M   Start: 4.0G  End: 4.9G
> Zone: Movable  Present: 128M   Managed: 128M   Start: 4.9G  End: 5.0G
>
> Oops, you thought you onlined movable memory37, but instead it silently
> went into the normal zone (without even erroring out), so it's definitely
> not going to be unpluggable and it's definitely not movable; it all
> falls apart here. An admin won't run my zoneinfo.py script, which I had
> to write specifically to understand the mess that was happening with
> online_movable interleaved.
>
> The admin is much better off never touching
> /sys/devices/system/memory/memory37 and just using the in-kernel
> onlining, at the very least until udev and the /sys interface are fixed
> for both movable and non-movable hotplug onlining.

Thanks for the explanation. Now I understand the issue correctly.

>> Before that, I'd like to say that a lot of code already deals with zone
>> overlap. Zone overlap has existed for a long time, although I don't know
>> the exact history. IIRC, Mel fixed such a case before, and the compaction
>> code has a check for it. And I added the overlap check to some pfn
>> iterators which didn't have one, in preparation for introducing a new
>> zone, ZONE_CMA, which has the zone-range-overlap property. See the
>> following commits:
>>
>> 'ba6b097', '9d43f5a', 'a91c43c'.
>>
>
> So you suggest creating a full overlap like:
>
>  --- Movable --
>  --- Normal  --
>
> Then search for pages in the Movable zone's buddy lists, which will only
> contain those that were onlined with echo online_movable?

Yes. Full overlap would be the worst case, but it's possible and it
would work well(?) even in the current kernel.

>> Coming to my main topic, I disagree that sticky pageblocks would be
>> superior to the current separate-zone approach. There are some reasons
>> for the objection to sticky movable pageblocks in the following link.
>>
>> Sticky movable pageblocks are conceptually the same as MIGRATE_CMA, and
>> they will cause many subtle issues, as MIGRATE_CMA did for CMA users.
>> MIGRATE_CMA introduces many hooks in various code paths and, to fix the
>> remaining issues, it needs more hooks. I don't think it is a
>
> I'm not saying the sticky movable pageblocks are the way to go; on the
> contrary, we're saying the Movable zone constraints can better be
> satisfied by the in-kernel onlining mechanism, and it's overall much
> simpler for the user to use the in-kernel onlining than to try to make
> udev synchronous and implement sticky movable pageblocks
> to make the /sys interface usable without unexpected side effects. And
> I would suggest looking into dropping the MOVABLE_NODE config option
> first (and turning it into a kernel parameter if anything).

Okay.

> I agree sticky movable pageblocks may slow things down and increase
> complexity, so it'd be better not to have to implement those.
>
>> maintainable approach.

Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-17 Thread Igor Mammedov
On Thu, 16 Mar 2017 20:01:25 +0100
Andrea Arcangeli  wrote:

[...]
> If we can make zone overlap work with a 100% overlap across the whole
> node that would be a fine alternative, the zoneinfo.py output will
> look weird, but if that's the only downside it's no big deal. With
> sticky movable pageblocks it'll all be ZONE_NORMAL, with overlap it'll
> all be both ZONE_NORMAL and ZONE_MOVABLE at the same time.
It looks like I'm not getting the idea of zone overlap, so...

We potentially have a flag saying that a hotplugged block is removable,
so on hotplug we could register such blocks with zone MOVABLE by default;
however, here comes the zone balance issue, so we can't do that until
it's solved.

As Vitaly suggested, we could steal (convert) existing blocks from
the border of the MOVABLE zone into zone NORMAL when there isn't enough
memory in zone NORMAL to accommodate the page table extension for a
just-arrived new memory block. That would make the memory module
containing the stolen block non-removable, but that may be an acceptable
sacrifice to keep the system alive. On an attempt to remove it, the
kernel could potentially even inform the hardware (hypervisor) that the
memory module has become non-removable, using the _OST ACPI method.
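
As a toy Python illustration of the border-steal idea (block numbers and
names are purely hypothetical; the real conversion would first migrate the
block's movable pages away):

movable = [37, 38, 39]     # Movable zone as a queue of blocks, border first
normal = [34, 35, 36]
non_removable = set(normal)

def steal_for_normal():
    # Convert the block at the Movable zone's border into NORMAL and
    # remember that its memory module can no longer be hot-removed.
    if not movable:
        raise MemoryError("no Movable blocks left to steal")
    block = movable.pop(0)
    normal.append(block)
    non_removable.add(block)
    return block

print(steal_for_normal())   # 37
print(movable, normal)      # [38, 39] [34, 35, 36, 37]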


> Thanks,
> Andrea



Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-16 Thread Andrea Arcangeli
Hello Joonsoo,

On Thu, Mar 16, 2017 at 02:31:22PM +0900, Joonsoo Kim wrote:
> I haven't followed the previous discussion, so please let me know if I've
> missed something. I'd just like to comment on sticky pageblocks.

The part of the previous discussion relevant to the
sticky movable pageblocks is this part from Vitaly:

=== quote ===
Now we have

[Normal][Normal][Normal][Movable][Movable][Movable]

we could have

[Normal][Normal][Movable][Normal][Movable][Normal]
=== quote ===

Suppose you're an admin; here's what you can try, starting from
all-offlined hotplug memory:

kvm ~ # cat /sys/devices/system/memory/memory3[6-9]/online
0
0
0
0
kvm ~ # python ~andrea/zoneinfo.py 
Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M

All hotplug memory is offline, no Movable zone.

Then you online interleaved:

kvm ~ # echo online_movable > /sys/devices/system/memory/memory39/online
kvm ~ # python ~andrea/zoneinfo.py 
Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
Zone: Movable  Present: 128M   Managed: 128M   Start: 4.9G  End: 5.0G
kvm ~ # echo online > /sys/devices/system/memory/memory38/online
kvm ~ # python ~andrea/zoneinfo.py 
Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
Zone: Normal   Present: 128M   Managed: 128M   Start: 4.0G  End: 4.9G
Zone: Movable  Present: 128M   Managed: 128M   Start: 4.9G  End: 5.0G

So far so good.

kvm ~ # echo online_movable > /sys/devices/system/memory/memory37/online
kvm ~ # python ~andrea/zoneinfo.py 
Zone: DMA      Present: 15M    Managed: 15M    Start: 0M    End: 16M
Zone: DMA32    Present: 2031M  Managed: 1892M  Start: 16M   End: 2047M
Zone: Normal   Present: 256M   Managed: 256M   Start: 4.0G  End: 4.9G
Zone: Movable  Present: 128M   Managed: 128M   Start: 4.9G  End: 5.0G

Oops, you thought you onlined movable memory37, but instead it silently
went into the normal zone (without even erroring out), so it's definitely
not going to be unpluggable and it's definitely not movable; it all
falls apart here. An admin won't run my zoneinfo.py script, which I had
to write specifically to understand the mess that was happening with
online_movable interleaved.

The admin is much better off never touching
/sys/devices/system/memory/memory37 and just using the in-kernel
onlining, at the very least until udev and the /sys interface are fixed
for both movable and non-movable hotplug onlining.
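
(The zoneinfo.py script itself isn't included in the thread; a minimal
Python re-creation of that kind of summary, assuming the usual
/proc/zoneinfo layout and a 4 KiB page size, could look like this:)

import re

PAGE = 4096  # assumed page size

def human(pages):
    b = pages * PAGE
    return "%.1fG" % (b / 2**30) if b >= 2**30 else "%dM" % (b // 2**20)

zones, cur = [], None
for line in open("/proc/zoneinfo"):
    m = re.match(r"Node \d+, zone\s+(\S+)", line)
    if m:
        cur = {"zone": m.group(1)}
        zones.append(cur)
        continue
    if cur is None:
        continue
    for key in ("present", "managed", "spanned"):
        m = re.search(r"\b%s\s+(\d+)" % key, line)
        if m:
            cur[key] = int(m.group(1))
    m = re.search(r"start_pfn:\s+(\d+)", line)
    if m:
        cur["start_pfn"] = int(m.group(1))

for z in zones:
    if z.get("present"):
        start = z.get("start_pfn", 0)
        print("Zone: %-8s Present: %-7s Managed: %-7s Start: %-5s End: %s" % (
            z["zone"], human(z["present"]), human(z["managed"]),
            human(start), human(start + z["spanned"])))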

> Before that, I'd like to say that a lot of code already deals with zone
> overlap. Zone overlap has existed for a long time, although I don't know
> the exact history. IIRC, Mel fixed such a case before, and the compaction
> code has a check for it. And I added the overlap check to some pfn
> iterators which didn't have one, in preparation for introducing a new
> zone, ZONE_CMA, which has the zone-range-overlap property. See the
> following commits:
> 
> 'ba6b097', '9d43f5a', 'a91c43c'.
> 

So you suggest creating a full overlap like:

 --- Movable --
 --- Normal  --

Then search for pages in the Movable zone's buddy lists, which will only
contain those that were onlined with echo online_movable?

> Coming to my main topic, I disagree that sticky pageblocks would be
> superior to the current separate-zone approach. There are some reasons
> for the objection to sticky movable pageblocks in the following link.
> 
> Sticky movable pageblocks are conceptually the same as MIGRATE_CMA, and
> they will cause many subtle issues, as MIGRATE_CMA did for CMA users.
> MIGRATE_CMA introduces many hooks in various code paths and, to fix the
> remaining issues, it needs more hooks. I don't think it is a

I'm not saying the sticky movable pageblocks are the way to go; on the
contrary, we're saying the Movable zone constraints can better be
satisfied by the in-kernel onlining mechanism, and it's overall much
simpler for the user to use the in-kernel onlining than to try to make
udev synchronous and implement sticky movable pageblocks
to make the /sys interface usable without unexpected side effects. And
I would suggest looking into dropping the MOVABLE_NODE config option
first (and turning it into a kernel parameter if anything).

I agree sticky movable pageblocks may slow things down and increase
complexity, so it'd be better not to have to implement those.

> maintainable approach. If you look at the following link, which implements
> the ZONE approach, you can see that many hooks are removed in the end.
> 
> lkml.kernel.org/r/1476414196-3514-1-git-send-email-iamjoonsoo@lge.com
> 
> I don't know the exact requirements of memory hotplug, so it's possible
> that the ZONE approach is not suitable for it. But, anyway, sticky
> pageblocks don't seem like a good solution to me.

The fact that sticky movable pageblocks aren't ideal for CMA doesn't mean
they're not ideal for memory hotunplug though.

With CMA there's no point in having the sticky movable pageblocks
scattered around, and it's purely a misfeature to use sticky movable
pageblocks because you need the whole CMA area contiguous, hence a
ZONE_CMA is ideal.

With memory hotplug, by contrast, the sticky movable pageblocks would
allow the kernel to satisfy the current /sys API, and they would
provide no downside, unlike in the CMA case where the size of the
allocation is unknown.

Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-15 Thread Joonsoo Kim
On Wed, Mar 15, 2017 at 05:37:29PM +0100, Andrea Arcangeli wrote:
> On Wed, Mar 15, 2017 at 02:11:40PM +0100, Michal Hocko wrote:
> > OK, I see now. I am afraid there is quite a lot of code which expects
> > that zones do not overlap. We can have holes in zones but not different
> > zones interleaving. Probably something which could be addressed but far
> > from trivial IMHO.
> > 
> > All that being said, I do not want to discourage you from experiments in
> > those areas. Just be prepared all those are far from trivial and
> > something for a long project ;)
> 
> This constraint has been known for quite some time, so when I talked about
> this very constraint with Mel at last year's LSF/MM, he suggested sticky
> pageblocks would be superior to the current movable zone.
> 
> So instead of having a Movable zone, we could use pageblocks but
> make them sticky-movable so they only accept __GFP_MOVABLE
> allocations. It would still be quite a large change indeed,
> but it looks simpler and with fewer drawbacks than trying to make the
> zones overlap.

Hello,

I haven't followed the previous discussion, so please let me know if I've
missed something. I'd just like to comment on sticky pageblocks.

Before that, I'd like to say that a lot of code already deals with zone
overlap. Zone overlap has existed for a long time, although I don't know
the exact history. IIRC, Mel fixed such a case before, and the compaction
code has a check for it. And I added the overlap check to some pfn
iterators which didn't have one, in preparation for introducing a new
zone, ZONE_CMA, which has the zone-range-overlap property. See the
following commits:

'ba6b097', '9d43f5a', 'a91c43c'.
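
As a toy Python model of why those pfn walkers need such a check once
zone pfn ranges may overlap (the layout and numbers here are made up): a
zone's [start, end) span can contain pages that actually belong to another
zone, so each page's own zone has to be compared against the zone being
walked.

class Page:
    def __init__(self, pfn, zone):
        self.pfn, self.zone = pfn, zone

# Interleaved layout: Normal and Movable share one pfn span.
pages = {pfn: Page(pfn, "Movable" if pfn % 2 else "Normal")
         for pfn in range(8)}

def pages_in_zone(zone_name, start_pfn, end_pfn):
    for pfn in range(start_pfn, end_pfn):
        page = pages[pfn]
        if page.zone != zone_name:   # the overlap check the commits add
            continue
        yield page

print([p.pfn for p in pages_in_zone("Movable", 0, 8)])  # [1, 3, 5, 7]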

Coming to my main topic, I disagree that sticky pageblocks would be
superior to the current separate-zone approach. There are some reasons
for the objection to sticky movable pageblocks in the following link.

Sticky movable pageblocks are conceptually the same as MIGRATE_CMA, and
they will cause many subtle issues, as MIGRATE_CMA did for CMA users.
MIGRATE_CMA introduces many hooks in various code paths and, to fix the
remaining issues, it needs more hooks. I don't think it is a
maintainable approach. If you look at the following link, which implements
the ZONE approach, you can see that many hooks are removed in the end.

lkml.kernel.org/r/1476414196-3514-1-git-send-email-iamjoonsoo@lge.com

I don't know the exact requirements of memory hotplug, so it's possible
that the ZONE approach is not suitable for it. But, anyway, sticky
pageblocks don't seem like a good solution to me.

Thanks.



Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-15 Thread Andrea Arcangeli
On Wed, Mar 15, 2017 at 02:11:40PM +0100, Michal Hocko wrote:
> OK, I see now. I am afraid there is quite a lot of code which expects
> that zones do not overlap. We can have holes in zones but not different
> zones interleaving. Probably something which could be addressed but far
> from trivial IMHO.
> 
> All that being said, I do not want to discourage you from experiments in
> those areas. Just be prepared all those are far from trivial and
> something for a long project ;)

This constraint has been known for quite some time, so when I talked about
this very constraint with Mel at last year's LSF/MM, he suggested sticky
pageblocks would be superior to the current movable zone.

So instead of having a Movable zone, we could use pageblocks but
make them sticky-movable so they only accept __GFP_MOVABLE
allocations. It would still be quite a large change indeed,
but it looks simpler and with fewer drawbacks than trying to make the
zones overlap.

Currently, when you online memory as movable, you're shifting the movable
zone boundary down, not just onlining the memory, and that complexity you
have to deal with would go away with sticky movable pageblocks.

One other option could be to boot as with _DEFAULT_ONLINE=n, and of
course without the udev rule. Then, after booting with the base memory,
run one of the two echoes below:

$ cat /sys/devices/system/memory/removable_hotplug_default
[disabled] online online_movable
$ echo online > /sys/devices/system/memory/removable_hotplug_default
$ echo online_movable > /sys/devices/system/memory/removable_hotplug_default

Then the "echo online/online_movable" would activate the in-kernel
hotplug mechanism that is faster and more reliable than udev and it
won't risk to run into the movable zone shift "constraint". After the
"echo" the kernel would behave like if it booted with _DEFAULT_ONLINE=y.

If you still want to do it by hand and leave it disabled, or even try
to fix the udev movable-shift constraints, sticky pageblocks and
lack of synchronicity (and deal with the resulting slower
performance compared to in-kernel onlining), you could.

The in-kernel onlining would use the exact same code as
_DEFAULT_ONLINE=y, but it would be configured with a file like
/etc/sysctl.conf. And then to switch to the _movable model you
would just need to edit the file, like you have to edit the udev rule
today (the one that currently breaks if you edit it to use online_movable).

From a usability perspective it would be like udev, but without all the
drawbacks of doing the onlining in userland.
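
For comparison, what the userland (udev-style) onlining amounts to is
roughly the following Python sketch, mirroring the echo commands shown
earlier in the thread (on some kernels the writable attribute is "state"
rather than "online"; treat the exact paths as illustrative):

import glob

MODE = "online_movable"   # or "online"

for path in sorted(glob.glob("/sys/devices/system/memory/memory*/online")):
    try:
        with open(path) as f:
            if f.read().strip() == "1":
                continue          # block is already online
        with open(path, "w") as f:
            f.write(MODE)         # may still land in Normal, as shown above
    except OSError as e:
        print("failed to online %s: %s" % (path, e))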

Checking whether the memory should become movable or not depending on
acpi_has_method(handle, "_EJ0") isn't flexible enough, I think; on bare
metal especially we couldn't change the ACPI tables like we can with the
hypervisor, but the admin still has to be free to decide whether he wants
to risk early OOM and movable zone imbalance, or whether he prefers never
being able to hotunplug the memory again. So it would need to become a
grub boot option, which is probably less friendly than editing
sysctl.conf or something like that (especially given grub-mkconfig
output..).

Thanks,
Andrea


Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-15 Thread Michal Hocko
On Wed 15-03-17 13:53:09, Vitaly Kuznetsov wrote:
> Michal Hocko  writes:
> 
> > On Wed 15-03-17 11:48:37, Vitaly Kuznetsov wrote:
[...]
> >> What actually stops us from having the following approach:
> >> 1) Everything is added to MOVABLE
> >> 2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
> >> the first MOVABLE block and 'convert' it to NORMAL. It may happen that
> >> there are no free pages in this block, but it was MOVABLE, which means we
> >> can move all its allocations somewhere else.
> >> 3) Freeing the whole 128MB memblock takes time, but we don't need to wait
> >> till it finishes; we just need to satisfy the currently pending
> >> allocation, and we can continue moving everything else in the background.
> >
> > Although it sounds like a good idea at first sight, there are many tiny
> > details which will make it much more complicated. First of all, how
> > do we know that the lowmem (resp. all the normal zones) is under
> > pressure, so that we should reduce the movable zone? Getting OOM for a
> > ~__GFP_MOVABLE request? Isn't that too late already?
> 
> Yes, I was basically thinking about OOM handling. It can also be a sort
> of watermark-based decision.
> 
> >  Sync migration at that point might
> > be really non-trivial (pages might be dirty, pinned etc...).
> 
> Non-trivial, yes, but we already have the code to move all allocations
> away from a MOVABLE block when we try to offline it; we can probably
> leverage it.

Sure, I am not saying this is impossible. I am just saying there are
many subtle details to be solved.

> 
> >  What about
> > user expectation to hotremove that memory later, should we just break
> > it?  How do we inflate movable zone back?
> 
> I think that it's OK to leave this block non-offlineable for the future.
> As Andrea already pointed out, it is not practical to try to guarantee
> that we can unplug everything we plugged in; we're talking about a 'best
> effort' service here anyway.

Well, my understanding of the movable zone is closer to a requirement
than a best-effort thing. You have to sacrifice a lot - higher memory
pressure on other zones with the resulting performance consequences,
potential latencies to access remote memory when the data (locks etc.)
are on a remote non-movable node. It would be really bad to find out that
all of that was in vain just because lowmem pressure has stolen your
movable memory.
 
> >> An alternative approach would be to have lists of memblocks which
> >> constitute ZONE_NORMAL and ZONE_MOVABLE instead of a simple 'NORMAL
> >> before MOVABLE' rule we have now but I'm not sure this is a viable
> >> approach with the current code base.
> >
> > I am not sure I understand.
> 
> Now we have 
> 
> [Normal][Normal][Normal][Movable][Movable][Movable]
> 
> we could have
> [Normal][Normal][Movable][Normal][Movable][Normal]
> 
> so when a new block comes in, we make a decision about which zone to
> online it into (based on memory usage in these zones), and a zone becomes
> a list of the memblocks which constitute it, not a simple [from..to] range.

OK, I see now. I am afraid there is quite a lot of code which expects
that zones do not overlap. We can have holes in zones but not different
zones interleaving. Probably something which could be addressed but far
from trivial IMHO.

All that being said, I do not want to discourage you from experiments in
those areas. Just be prepared all those are far from trivial and
something for a long project ;)
-- 
Michal Hocko
SUSE Labs


Re: ZONE_NORMAL vs. ZONE_MOVABLE

2017-03-15 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Wed 15-03-17 11:48:37, Vitaly Kuznetsov wrote:
>> Michal Hocko  writes:
> [...]
>> Speaking about long term approach,
>
> Not really related to the patch but ok (I hope this will not distract
> from the original intention here)...
>

Yes, not directly related to your patch.

>> (I'm not really familiar with the history of memory zones code so please
>> bear with me if my questions are stupid)
>> 
>> Currently when we online memory blocks we need to know where to put the
>> boundary between NORMAL and MOVABLE, and this is a very hard decision to
>> make, no matter whether we do it from the kernel or from userspace. In
>> theory, we just want to avoid needless limitations on future unplug, but
>> we don't really know how much memory we'll need for kernel allocations
>> in the future.
>
> yes, and that is why I am not really all that happy about the whole
> movable zones concept. It is basically reintroducing highmem issues from
> 32b times. But this is the only concept we currently have to provide a
> reliable memory hotremove right now.
>
>> What actually stops us from having the following approach:
>> 1) Everything is added to MOVABLE
>> 2) When we're out of memory for kernel allocations in NORMAL we 'harvest'
>> the first MOVABLE block and 'convert' it to NORMAL. It may happen that
>> there are no free pages in this block, but it was MOVABLE, which means we
>> can move all its allocations somewhere else.
>> 3) Freeing the whole 128MB memblock takes time, but we don't need to wait
>> till it finishes; we just need to satisfy the currently pending
>> allocation, and we can continue moving everything else in the background.
>
> Although it sounds like a good idea at first sight, there are many tiny
> details which will make it much more complicated. First of all, how
> do we know that the lowmem (resp. all the normal zones) is under
> pressure, so that we should reduce the movable zone? Getting OOM for a
> ~__GFP_MOVABLE request? Isn't that too late already?

Yes, I was basically thinking about OOM handling. It can also be a sort
of watermark-based decision.

>  Sync migration at that point might
> be really non-trivial (pages might be dirty, pinned etc...).

Non-trivial, yes, but we already have the code to move all allocations
away from a MOVABLE block when we try to offline it; we can probably
leverage it.

>  What about
> user expectation to hotremove that memory later, should we just break
> it?  How do we inflate movable zone back?

I think that it's OK to leave this block non-offlineable for the future.
As Andrea already pointed out, it is not practical to try to guarantee
that we can unplug everything we plugged in; we're talking about a 'best
effort' service here anyway.

>
>> An alternative approach would be to have lists of memblocks which
>> constitute ZONE_NORMAL and ZONE_MOVABLE instead of a simple 'NORMAL
>> before MOVABLE' rule we have now but I'm not sure this is a viable
>> approach with the current code base.
>
> I am not sure I understand.

Now we have 

[Normal][Normal][Normal][Movable][Movable][Movable]

we could have
[Normal][Normal][Movable][Normal][Movable][Normal]

so when a new block comes in, we make a decision about which zone to
online it into (based on memory usage in these zones), and a zone becomes
a list of the memblocks which constitute it, not a simple [from..to] range.
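
As a toy Python model of that idea (the policy and block numbers are
purely illustrative): each zone keeps the list of memblocks it owns
instead of a single [from..to] range, and each newly onlined block is
assigned to whichever zone needs it.

zones = {"Normal": [], "Movable": []}

def online_block(block_id, kernel_pressure):
    # Illustrative policy: feed Normal while kernel allocations are under
    # pressure, otherwise keep the block unpluggable by making it Movable.
    target = "Normal" if kernel_pressure else "Movable"
    zones[target].append(block_id)
    return target

for block, pressure in [(36, True), (37, False), (38, True), (39, False)]:
    online_block(block, pressure)

print(zones)   # {'Normal': [36, 38], 'Movable': [37, 39]} -- interleaved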

-- 
  Vitaly

