Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-06-14 Thread David Hildenbrand

On 02.06.21 17:56, Zi Yan wrote:

On 10 May 2021, at 10:36, Zi Yan wrote:


On 7 May 2021, at 10:00, David Hildenbrand wrote:


On 07.05.21 13:55, Michal Hocko wrote:

[I haven't read through respective patches due to lack of time but let
   me comment on the general idea and the underlying justification]

On Thu 06-05-21 17:31:09, David Hildenbrand wrote:

On 06.05.21 17:26, Zi Yan wrote:

From: Zi Yan 

Hi all,

This patchset tries to remove the restriction on memory hotplug/hotremove
granularity, which is always greater than or equal to the memory section size[1].
With the patchset, the kernel is able to online/offline memory at a size independent
of the memory section size, as small as 2MB (the subsection size).


... which doesn't make any sense as we can only online/offline whole memory
block devices.


Agreed. The subsection thingy is just a hack to work around pmem
alignment problems. For real memory hotplug it is quite hard to
argue for reasonable hotplug scenarios for very small physical memory
ranges wrt the existing sparsemem memory model.


The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
size without increasing memory hotplug/hotremove granularity at the same time,


Gah, no. Please no. No.


Agreed. Those are completely independent concepts. MAX_ORDER can be
really arbitrary irrespective of the section size with the vmemmap sparse
model. The existing restriction is due to the old sparse model not being
able to do page pointer arithmetic across memory sections. Is there any
reason to stick with that memory model for an advanced feature you are
working on?


No. I just want to increase MAX_ORDER. If the existing restriction can
be removed, that will be great.



I gave it some more thought yesterday. I guess the first thing we should look 
into is increasing MAX_ORDER and leaving pageblock_order and section size as is 
-- finding out what we have to tweak to get that up and running. Once we have 
that in place, we can actually look into better fragmentation avoidance etc. 
One step at a time.


It makes sense to me.



Because that change itself might require some thought. Requiring that bigger 
MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.


OK, if with SPARSE_VMEMMAP MAX_ORDER can be set to be bigger than
SECTION_SIZE, it is perfectly OK with me. 1GB THP support, which I
want to add ultimately, will require SPARSE_VMEMMAP too (otherwise,
all page++ would need to be changed to nth_page(page, 1)).
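
As a quick illustration of that last point (a made-up loop, not from the patchset):
nth_page() goes through PFN arithmetic, so it also works when the memmap is not
virtually contiguous, while page++ does not:

	/* only correct when struct pages are virtually contiguous (vmemmap) */
	for (i = 0; i < nr_pages; i++, page++)
		do_something(page);

	/* works with any sparse memory model, at the cost of PFN arithmetic */
	for (i = 0; i < nr_pages; i++)
		do_something(nth_page(page, i));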



As stated somewhere here already, we'll have to look into making 
alloc_contig_range() (and main users CMA and virtio-mem) independent of 
MAX_ORDER and mainly rely on pageblock_order. The current handling in 
alloc_contig_range() is far from optimal as we have to isolate a whole 
MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains 
something unmovable although we don't even want to allocate that part. I 
actually have that on my list (to be able to fully support pageblock_order 
instead of MAX_ORDER -1 chunks in virtio-mem), however didn't have time to look 
into it.
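
(For reference, a rough sketch of a pageblock-granularity user of that interface;
start_pfn and the flags are purely illustrative:

	int ret = alloc_contig_range(start_pfn, start_pfn + pageblock_nr_pages,
				     MIGRATE_MOVABLE, GFP_KERNEL);
	if (!ret)
		free_contig_range(start_pfn, pageblock_nr_pages);

Today this still isolates and migrates in MAX_ORDER - 1 sized chunks internally,
which is the limitation described above.)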


So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
should be used instead of the buddy allocator while pageblock_order is kept at a
small granularity like 2MB. Is that the case? Isn't it going to have a high failure
rate when any of the pageblocks within a gigantic page range (like 1GB) becomes
unmovable? Are you thinking of an additional mechanism/policy to prevent that from
happening as an additional step for gigantic page allocation? Like your
ZONE_PREFER_MOVABLE idea?



Further, page onlining / offlining code and early init code most probably also 
needs care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly 
have in MAX_ORDER - 1 pages might become a problem and will have to be handled. 
Not sure which other code has to be tweaked (compaction? page isolation?).


Can you elaborate a little more? From what I understand, memory holes mean valid
PFNs are not contiguous before and after a hole, so pfn++ will not work, but
struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
needed instead of a simple pfn++. Is there anything I am missing?

BTW, to test a system with memory holes, do you know if there is an easy way of
adding random memory holes to an x86_64 VM, which can help reveal potential missing
pieces in the code? Changing the BIOS e820 table might be one way, but I have no
idea how to do it on QEMU.



Figuring out what needs care itself might take quite some effort.

One thing I was thinking about as well: the bigger our MAX_ORDER, the slower it
could be to allocate smaller pages. If we have 1G pages, splitting them down to
4k then takes 8 additional steps if I'm not wrong. Of course, that's the worst
case. Would be interesting to evaluate.
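
(Rough arithmetic, assuming 4k base pages: a 1G page is order 18 and today's
MAX_ORDER - 1 on x86-64 is order 10, i.e. 4M. Splitting order 18 all the way down
to order 0 takes 18 halving steps versus 10 today -- hence the 8 extra steps in the
worst case.)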


Sure. I am planning to check it too. As a simple start, I am going 

Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-06-02 Thread Zi Yan
On 10 May 2021, at 10:36, Zi Yan wrote:

> On 7 May 2021, at 10:00, David Hildenbrand wrote:
>
>> On 07.05.21 13:55, Michal Hocko wrote:
>>> [I haven't read through respective patches due to lack of time but let
>>>   me comment on the general idea and the underlying justification]
>>>
>>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
 On 06.05.21 17:26, Zi Yan wrote:
> From: Zi Yan 
>
> Hi all,
>
> This patchset tries to remove the restriction on memory hotplug/hotremove
> granularity, which is always greater than or equal to the memory section size[1].
> With the patchset, the kernel is able to online/offline memory at a size
> independent of the memory section size, as small as 2MB (the subsection size).

 ... which doesn't make any sense as we can only online/offline whole memory
 block devices.
>>>
>>> Agreed. The subsection thingy is just a hack to work around pmem
>>> alignment problems. For real memory hotplug it is quite hard to
>>> argue for reasonable hotplug scenarios for very small physical memory
>>> ranges wrt the existing sparsemem memory model.
>>>
> The motivation is to increase MAX_ORDER of the buddy allocator and 
> pageblock
> size without increasing memory hotplug/hotremove granularity at the same 
> time,

 Gah, no. Please no. No.
>>>
>>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>>> really arbitrary irrespective of the section size with the vmemmap sparse
>>> model. The existing restriction is due to the old sparse model not being
>>> able to do page pointer arithmetic across memory sections. Is there any
>>> reason to stick with that memory model for an advanced feature you are
>>> working on?
>
> No. I just want to increase MAX_ORDER. If the existing restriction can
> be removed, that will be great.
>
>>
>> I gave it some more thought yesterday. I guess the first thing we should 
>> look into is increasing MAX_ORDER and leaving pageblock_order and section 
>> size as is -- finding out what we have to tweak to get that up and running. 
>> Once we have that in place, we can actually look into better fragmentation 
>> avoidance etc. One step at a time.
>
> It makes sense to me.
>
>>
>> Because that change itself might require some thought. Requiring that bigger 
>> MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.
>
> OK, if with SPARSE_VMEMMAP MAX_ORDER can be set to be bigger than
> SECTION_SIZE, it is perfectly OK with me. 1GB THP support, which I
> want to add ultimately, will require SPARSE_VMEMMAP too (otherwise,
> all page++ would need to be changed to nth_page(page, 1)).
>
>>
>> As stated somewhere here already, we'll have to look into making 
>> alloc_contig_range() (and main users CMA and virtio-mem) independent of 
>> MAX_ORDER and mainly rely on pageblock_order. The current handling in 
>> alloc_contig_range() is far from optimal as we have to isolate a whole 
>> MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part 
>> contains something unmovable although we don't even want to allocate that 
>> part. I actually have that on my list (to be able to fully support 
>> pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however 
>> didn't have time to look into it.
>
> So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
> should be used instead of the buddy allocator while pageblock_order is kept at a
> small granularity like 2MB. Is that the case? Isn't it going to have a high failure
> rate when any of the pageblocks within a gigantic page range (like 1GB) becomes
> unmovable? Are you thinking of an additional mechanism/policy to prevent that from
> happening as an additional step for gigantic page allocation? Like your
> ZONE_PREFER_MOVABLE idea?
>
>>
>> Further, page onlining / offlining code and early init code most probably 
>> also needs care if MAX_ORDER - 1 crosses sections. Memory holes we might 
>> suddenly have in MAX_ORDER - 1 pages might become a problem and will have to 
>> be handled. Not sure which other code has to be tweaked (compaction? page 
>> isolation?).
>
> Can you elaborate a little more? From what I understand, memory holes mean valid
> PFNs are not contiguous before and after a hole, so pfn++ will not work, but
> struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
> would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
> needed instead of a simple pfn++. Is there anything I am missing?
>
> BTW, to test a system with memory holes, do you know if there is an easy way of
> adding random memory holes to an x86_64 VM, which can help reveal potential missing
> pieces in the code? Changing the BIOS e820 table might be one way, but I have no
> idea how to do it on QEMU.
>
>>
>> Figuring out what needs care itself might take quite some effort.
>>
>> One thing I was thinking about as well: The bigger our MAX_ORDER, the sl

Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-12 Thread David Hildenbrand


As stated somewhere here already, we'll have to look into making 
alloc_contig_range() (and main users CMA and virtio-mem) independent of 
MAX_ORDER and mainly rely on pageblock_order. The current handling in 
alloc_contig_range() is far from optimal as we have to isolate a whole 
MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part contains 
something unmovable although we don't even want to allocate that part. I 
actually have that on my list (to be able to fully support pageblock_order 
instead of MAX_ORDER -1 chunks in virtio-mem), however didn't have time to look 
into it.


So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
should be used instead of the buddy allocator while pageblock_order is kept at a
small granularity like 2MB. Is that the case? Isn't it going to have a high failure
rate when any of the pageblocks within a gigantic page range (like 1GB) becomes
unmovable? Are you thinking of an additional mechanism/policy to prevent that from
happening as an additional step for gigantic page allocation? Like your
ZONE_PREFER_MOVABLE idea?



I am not fully sure yet where the journey will go; I guess nobody
knows. Ultimately, having buddy support for >= current MAX_ORDER (IOW,
increasing MAX_ORDER) will most probably happen, so it would be worth
investigating what has to be done to get that running as a first step.


Of course, we could temporarily think about wiring it up in the buddy like

if (order < MAX_ORDER)
	__alloc_pages()...
else
	alloc_contig_pages()

but it doesn't really improve the situation IMHO, just an API change.
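
For illustration only (not code from this thread; the helper name is made up), such
wiring could look roughly like the following, reusing the existing
alloc_contig_pages() interface for orders the buddy cannot serve:

	/* hypothetical helper, just to visualize the "API change" above */
	static struct page *alloc_gigantic_or_buddy(gfp_t gfp, unsigned int order,
						    int nid)
	{
		if (order < MAX_ORDER)
			return alloc_pages_node(nid, gfp, order);
		/* >= MAX_ORDER: fall back to the contiguous range allocator */
		return alloc_contig_pages(1UL << order, gfp, nid, NULL);
	}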

So I think we should look into increasing MAX_ORDER, seeing what needs 
to be done to have that part running while keeping the section size and 
the pageblock order as is. I know that at least memory
onlining/offlining, CMA, alloc_contig_range(), ... need tweaking,
especially when we don't increase the section size (but also if we would,
due to the way page isolation is currently handled). Having a MAX_ORDER - 1
page being partially in different nodes might be another thing to
look into (I heard that it can already happen right now, but I don't 
remember the details).


The next step after that would then be better fragmentation avoidance 
for larger granularity like 1G THP.




Further, page onlining / offlining code and early init code most probably also 
needs care if MAX_ORDER - 1 crosses sections. Memory holes we might suddenly 
have in MAX_ORDER - 1 pages might become a problem and will have to be handled. 
Not sure which other code has to be tweaked (compaction? page isolation?).


Can you elaborate a little more? From what I understand, memory holes mean valid
PFNs are not contiguous before and after a hole, so pfn++ will not work, but
struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
needed instead of a simple pfn++. Is there anything I am missing?


I think there are two cases when talking about MAX_ORDER and memory holes:

1. Hole with a valid memmap: the memmap is initialized to PageReserved()
   and the pages are not given to the buddy. pfn_valid() and
   pfn_to_page() work as expected.
2. Hole without a valid memmap: we have that CONFIG_HOLES_IN_ZONE thing
   already, see include/linux/mmzone.h. pfn_valid_within() checks are
   required. Doesn't win a beauty contest, but gets the job done in
   existing setups that seem to care.

"If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we 
need to check pfn validity within that MAX_ORDER_NR_PAGES block. 
pfn_valid_within() should be used in this case; we optimise this away 
when we have no holes within a MAX_ORDER_NR_PAGES block."


CONFIG_HOLES_IN_ZONE is just a bad name for this.

(Increasing the section size implies that we waste more memory for the
memmap in holes. Increasing MAX_ORDER means that we might have to deal
with holes within MAX_ORDER chunks.)


We don't have too many pfn_valid_within() checks. I wonder if we could
add something that is optimized for "holes are a power of two and
properly aligned", because pfn_valid_within() right now deals with holes
of any kind, which makes it somewhat inefficient IIRC.
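
For completeness, a minimal sketch of the pattern in question, assuming today's
helpers (block_start_pfn and the loop body are hypothetical):

	unsigned long pfn;

	/* walking a MAX_ORDER-sized block that may contain holes */
	for (pfn = block_start_pfn; pfn < block_start_pfn + MAX_ORDER_NR_PAGES; pfn++) {
		/* compiles away unless CONFIG_HOLES_IN_ZONE is set */
		if (!pfn_valid_within(pfn))
			continue;
		/* only now is pfn_to_page(pfn) safe to touch */
	}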




BTW, to test a system with memory holes, do you know if there is an easy way of
adding random memory holes to an x86_64 VM, which can help reveal potential missing
pieces in the code? Changing the BIOS e820 table might be one way, but I have no
idea how to do it on QEMU.


It might not be very easy that way. But I heard that some arm64 systems 
have crazy memory layouts -- maybe there, it's easier to get something 
nasty running? :)


https://lkml.kernel.org/r/yjpewf2cgjs5m...@kernel.org

I remember there was a way to define the e820 completely on the kernel
cmdline, but I might be wrong ...
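
(If I recall correctly, the memmap= parameter can be abused for this: something like
memmap=512M$0x100000000 on the guest kernel command line should mark 512M starting
at 4G as reserved, effectively punching a hole into usable memory; the '$' typically
needs escaping in the bootloader or QEMU -append string. Untested suggestion, so
take it with a grain of salt.)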


--
Thanks,

David / dhildenb



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-10 Thread Zi Yan
On 7 May 2021, at 10:00, David Hildenbrand wrote:

> On 07.05.21 13:55, Michal Hocko wrote:
>> [I haven't read through respective patches due to lack of time but let
>>   me comment on the general idea and the underlying justification]
>>
>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>> On 06.05.21 17:26, Zi Yan wrote:
 From: Zi Yan 

 Hi all,

 This patchset tries to remove the restriction on memory hotplug/hotremove
 granularity, which is always greater than or equal to the memory section size[1].
 With the patchset, the kernel is able to online/offline memory at a size
 independent of the memory section size, as small as 2MB (the subsection size).
>>>
>>> ... which doesn't make any sense as we can only online/offline whole memory
>>> block devices.
>>
>> Agreed. The subsection thingy is just a hack to work around pmem
>> alignment problems. For real memory hotplug it is quite hard to
>> argue for reasonable hotplug scenarios for very small physical memory
>> ranges wrt the existing sparsemem memory model.
>>
 The motivation is to increase MAX_ORDER of the buddy allocator and 
 pageblock
 size without increasing memory hotplug/hotremove granularity at the same 
 time,
>>>
>>> Gah, no. Please no. No.
>>
>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>> really arbitrary irrespective of the section size with the vmemmap sparse
>> model. The existing restriction is due to the old sparse model not being
>> able to do page pointer arithmetic across memory sections. Is there any
>> reason to stick with that memory model for an advanced feature you are
>> working on?

No. I just want to increase MAX_ORDER. If the existing restriction can
be removed, that will be great.

>
> I gave it some more thought yesterday. I guess the first thing we should look 
> into is increasing MAX_ORDER and leaving pageblock_order and section size as 
> is -- finding out what we have to tweak to get that up and running. Once we 
> have that in place, we can actually look into better fragmentation avoidance 
> etc. One step at a time.

It makes sense to me.

>
> Because that change itself might require some thought. Requiring that bigger 
> MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.

OK, if with SPARSE_VMEMMAP MAX_ORDER can be set to be bigger than
SECTION_SIZE, it is perfectly OK with me. 1GB THP support, which I
want to add ultimately, will require SPARSE_VMEMMAP too (otherwise,
all page++ would need to be changed to nth_page(page, 1)).

>
> As stated somewhere here already, we'll have to look into making 
> alloc_contig_range() (and main users CMA and virtio-mem) independent of 
> MAX_ORDER and mainly rely on pageblock_order. The current handling in 
> alloc_contig_range() is far from optimal as we have to isolate a whole 
> MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part 
> contains something unmovable although we don't even want to allocate that 
> part. I actually have that on my list (to be able to fully support 
> pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however didn't 
> have time to look into it.

So in your mind, for gigantic page allocation (> MAX_ORDER), alloc_contig_range()
should be used instead of the buddy allocator while pageblock_order is kept at a
small granularity like 2MB. Is that the case? Isn't it going to have a high failure
rate when any of the pageblocks within a gigantic page range (like 1GB) becomes
unmovable? Are you thinking of an additional mechanism/policy to prevent that from
happening as an additional step for gigantic page allocation? Like your
ZONE_PREFER_MOVABLE idea?

>
> Further, page onlining / offlining code and early init code most probably 
> also needs care if MAX_ORDER - 1 crosses sections. Memory holes we might 
> suddenly have in MAX_ORDER - 1 pages might become a problem and will have to 
> be handled. Not sure which other code has to be tweaked (compaction? page 
> isolation?).

Can you elaborate a little more? From what I understand, memory holes mean valid
PFNs are not contiguous before and after a hole, so pfn++ will not work, but
struct pages are still virtually contiguous assuming SPARSE_VMEMMAP, meaning page++
would still work. So when MAX_ORDER - 1 crosses sections, additional code would be
needed instead of a simple pfn++. Is there anything I am missing?

BTW, to test a system with memory holes, do you know if there is an easy way of
adding random memory holes to an x86_64 VM, which can help reveal potential missing
pieces in the code? Changing the BIOS e820 table might be one way, but I have no
idea how to do it on QEMU.

>
> Figuring out what needs care itself might take quite some effort.
>
> One thing I was thinking about as well: the bigger our MAX_ORDER, the slower
> it could be to allocate smaller pages. If we have 1G pages, splitting them
> down to 4k then takes 8 additional steps if I'm not wrong. Of course, that's
> the worst case. Woul

Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-07 Thread Michal Hocko
[I haven't read through respective patches due to lack of time but let
 me comment on the general idea and the underlying justification]

On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
> On 06.05.21 17:26, Zi Yan wrote:
> > From: Zi Yan 
> > 
> > Hi all,
> > 
> > This patchset tries to remove the restriction on memory hotplug/hotremove
> > granularity, which is always greater than or equal to the memory section size[1].
> > With the patchset, the kernel is able to online/offline memory at a size
> > independent of the memory section size, as small as 2MB (the subsection size).
> 
> ... which doesn't make any sense as we can only online/offline whole memory
> block devices.

Agreed. The subsection thingy is just a hack to work around pmem
alignment problems. For real memory hotplug it is quite hard to
argue for reasonable hotplug scenarios for very small physical memory
ranges wrt the existing sparsemem memory model.
 
> > The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
> > size without increasing memory hotplug/hotremove granularity at the same 
> > time,
> 
> Gah, no. Please no. No.

Agreed. Those are completely independent concepts. MAX_ORDER can be
really arbitrary irrespective of the section size with the vmemmap sparse
model. The existing restriction is due to the old sparse model not being
able to do page pointer arithmetic across memory sections. Is there any
reason to stick with that memory model for an advanced feature you are
working on?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-07 Thread David Hildenbrand

On 07.05.21 13:55, Michal Hocko wrote:

[I haven't read through respective patches due to lack of time but let
  me comment on the general idea and the underlying justification]

On Thu 06-05-21 17:31:09, David Hildenbrand wrote:

On 06.05.21 17:26, Zi Yan wrote:

From: Zi Yan 

Hi all,

This patchset tries to remove the restriction on memory hotplug/hotremove
granularity, which is always greater than or equal to the memory section size[1].
With the patchset, the kernel is able to online/offline memory at a size independent
of the memory section size, as small as 2MB (the subsection size).


... which doesn't make any sense as we can only online/offline whole memory
block devices.


Agreed. The subsection thingy is just a hack to work around pmem
alignment problems. For real memory hotplug it is quite hard to
argue for reasonable hotplug scenarios for very small physical memory
ranges wrt the existing sparsemem memory model.
  

The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
size without increasing memory hotplug/hotremove granularity at the same time,


Gah, no. Please no. No.


Agreed. Those are completely independent concepts. MAX_ORDER can be
really arbitrary irrespective of the section size with the vmemmap sparse
model. The existing restriction is due to the old sparse model not being
able to do page pointer arithmetic across memory sections. Is there any
reason to stick with that memory model for an advanced feature you are
working on?



I gave it some more thought yesterday. I guess the first thing we should 
look into is increasing MAX_ORDER and leaving pageblock_order and 
section size as is -- finding out what we have to tweak to get that up 
and running. Once we have that in place, we can actually look into 
better fragmentation avoidance etc. One step at a time.


Because that change itself might require some thought. Requiring that 
bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable to do.


As stated somewhere here already, we'll have to look into making 
alloc_contig_range() (and main users CMA and virtio-mem) independent of 
MAX_ORDER and mainly rely on pageblock_order. The current handling in 
alloc_contig_range() is far from optimal as we have to isolate a whole 
MAX_ORDER - 1 page -- and on ZONE_NORMAL we'll fail easily if any part 
contains something unmovable although we don't even want to allocate 
that part. I actually have that on my list (to be able to fully support 
pageblock_order instead of MAX_ORDER -1 chunks in virtio-mem), however 
didn't have time to look into it.


Further, page onlining / offlining code and early init code most 
probably also needs care if MAX_ORDER - 1 crosses sections. Memory holes 
we might suddenly have in MAX_ORDER - 1 pages might become a problem and 
will have to be handled. Not sure which other code has to be tweaked 
(compaction? page isolation?).


Figuring out what needs care itself might take quite some effort.


One thing I was thinking about as well: the bigger our MAX_ORDER, the
slower it could be to allocate smaller pages. If we have 1G pages,
splitting them down to 4k then takes 8 additional steps if I'm not
wrong. Of course, that's the worst case. Would be interesting to evaluate.


--
Thanks,

David / dhildenb



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread David Hildenbrand

On 06.05.21 21:30, Matthew Wilcox wrote:

On Thu, May 06, 2021 at 09:10:52PM +0200, David Hildenbrand wrote:

I have to admit that I am not really a friend of that. I still think our
target goal should be to have gigantic THP *in addition to* ordinary THP.
Use gigantic THP where enabled and possible, and just use ordinary THP
everywhere else. Having one pageblock granularity is a real limitation IMHO
and requires us to hack the system to support it to some degree.


You're thinking too small with only two THP sizes ;-)  I'm aiming to


Well, I raised in my other mail that we will have multiple different use
cases, including multiple different THP sizes, e.g., on aarch64 ;)



support arbitrary power-of-two memory allocations.  I think there's a
fruitful discussion to be had about how that works for anonymous memory --
with page cache, we have readahead to tell us when our predictions of use
are actually fulfilled.  It doesn't tell us what percentage of the pages


Right, and I think we have to think about a better approach than just 
increasing the pageblock_order.



allocated were actually used, but it's a hint.  It's a big lift to go from
2MB all the way to 1GB ... if you can look back to see that the previous
1GB was basically fully populated, then maybe jump up from allocating
2MB folios to allocating a 1GB folio, but wow, that's a big step.

This goal really does mean that we want to allocate from the page
allocator, and so we do want to grow MAX_ORDER.  I suppose we could
do something ugly like

if (order <= MAX_ORDER)
	alloc_page()
else
	alloc_really_big_page()

but that feels like unnecessary hardship to place on the user.


I had something similar for the short term in mind, relying on 
alloc_contig_pages() (and maybe ZONE_MOVABLE to make allocations more 
likely to succeed). Devil's in the details (page migration, ...).



--
Thanks,

David / dhildenb



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread Matthew Wilcox
On Thu, May 06, 2021 at 09:10:52PM +0200, David Hildenbrand wrote:
> I have to admit that I am not really a friend of that. I still think our
> target goal should be to have gigantic THP *in addition to* ordinary THP.
> Use gigantic THP where enabled and possible, and just use ordinary THP
> everywhere else. Having one pageblock granularity is a real limitation IMHO
> and requires us to hack the system to support it to some degree.

You're thinking too small with only two THP sizes ;-)  I'm aiming to
support arbitrary power-of-two memory allocations.  I think there's a
fruitful discussion to be had about how that works for anonymous memory --
with page cache, we have readahead to tell us when our predictions of use
are actually fulfilled.  It doesn't tell us what percentage of the pages
allocated were actually used, but it's a hint.  It's a big lift to go from
2MB all the way to 1GB ... if you can look back to see that the previous
1GB was basically fully populated, then maybe jump up from allocating
2MB folios to allocating a 1GB folio, but wow, that's a big step.

This goal really does mean that we want to allocate from the page
allocator, and so we do want to grow MAX_ORDER.  I suppose we could
do something ugly like

if (order <= MAX_ORDER)
	alloc_page()
else
	alloc_really_big_page()

but that feels like unnecessary hardship to place on the user.

I know that for the initial implementation, we're going to rely on hints
from the user to use 1GB pages, but it'd be nice to not do that.


Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread David Hildenbrand


1. Pageblock size

There are a couple of features that rely on the pageblock size to be reasonably small to 
work as expected. One example is virtio-balloon free page reporting, then there is 
virtio-mem (still also glued to MAX_ORDER) and we have CMA (still also glued to MAX_ORDER). 
Most probably there are more. We track movability/page isolation per pageblock; it's the 
smallest granularity at which you can effectively isolate pages or mark them as CMA 
(MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP / huge pages 
most of our applications use and will use, especially on smallish systems.

Assume you bump up the pageblock order to 1G. Small VMs won't be able to report any free 
pages to the hypervisor. You'll take the "fine-grained" out of virtio-mem. Each 
CMA area will have to be at least 1G big, which turns CMA essentially useless on smallish 
systems (like we have on arm64 with 64k base pages -- pageblock_size is 512MB and I hate 
it).


I understand the issue of having a large pageblock in small systems. My plan for
this issue is to make MAX_ORDER a variable (pageblock size would be set
according to MAX_ORDER) that can be adjusted based on total memory and via a
boot-time parameter. My apologies, since I did not state this clearly in my cover
letter and it confused you. When we have a boot-time adjustable MAX_ORDER, a
large pageblock like 1GB would only appear for systems with large memory. For
small VMs, pageblock size would stay at 2MB, so all your concerns on smallish
systems should go away.


I have to admit that I am not really a friend of that. I still think our 
target goal should be to have gigantic THP *in addition to* ordinary 
THP. Use gigantic THP where enabled and possible, and just use ordinary 
THP everywhere else. Having one pageblock granularity is a real 
limitation IMHO and requires us to hack the system to support it to some 
degree.






Then, imagine systems that have like 4G of main memory. By stopping grouping at 
2M and instead grouping at 1G you can very easily find yourself in the system 
where all your 4 pageblocks are unmovable and you essentially don't optimize 
for huge pages in that environment any more.

Long story short: we need a different mechanism on top and shall leave the 
pageblock size untouched, it's too tightly integrated with page isolation, 
ordinary THP, and CMA.


I think it is better to make pageblock size adjustable based on total memory of a 
system. It is not reasonable to have the same pageblock size across systems with 
memory sizes from <1GB to several TBs. Do you agree?



I suggest an additional mechanism on top. Please bear in mind that 
ordinary THP will most probably be still the default for 99.9% of all 
application/library cases, even when you have gigantic THP around.




2. Section size

I assume the only reason you want to touch that is because pageblock_size <= 
section_size, and I guess that's one of the reasons I dislike it so much. Messing 
with the section size really only makes sense when we want to manage metadata for 
larger granularity within a section.


Perhaps it is worth checking if it is feasible to make pageblock_size >
section_size, so we can still have small sections when pageblock_size is large.
One potential issue with that is that when PFNs are discontiguous at a section
boundary, we might have a partial pageblock when pageblock_size is big. I guess
supporting partial pageblocks (or different pageblock sizes like you mentioned
below) would be the right solution.



We allocate metadata per section. We mark whole sections 
early/online/present/ Yes, in case of vmemmap, we manage the memmap in 
smaller granularity using the sub-section map, some kind of hack to support 
some ZONE_DEVICE cases better.

Let's assume we introduce something new "gigapage_order", corresponding to 1G. 
We could either decide to squeeze the metadata into sections, having to increase the 
section size, or manage that metadata differently.

Managing it differently certainly makes the necessary changes easier. Instead 
of adding more hacks into sections, rather manage that metadata at a different 
place / in a different way.


Can you elaborate on managing it differently?


Let's keep it simple. Assume you track, per 1G gigapageblock, MOVABLE vs.
!movable in addition to existing pageblocks. A 64 TB system would have
64*1024 gigapageblocks. One bit per gigapageblock would require 8k, a.k.a.
2 pages. If you need more states, it would maybe double. No need to
manage that using sparse memory sections IMHO. Just allocate 2/4 pages
during boot for the bitmap.
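
A minimal sketch of that boot-time allocation, with hypothetical names (nothing here
exists in the kernel today):

	/* one movability bit per 1G "gigapageblock" */
	static unsigned long *giga_bitmap __ro_after_init;

	static void __init giga_bitmap_init(void)
	{
		unsigned long nr_giga = DIV_ROUND_UP(max_pfn, SZ_1G / PAGE_SIZE);

		/* 64 TB / 1 GB = 65536 blocks -> 8 KB of bitmap, i.e. two 4k pages */
		giga_bitmap = memblock_alloc(BITS_TO_LONGS(nr_giga) * sizeof(long),
					     SMP_CACHE_BYTES);
	}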






See [1] for an alternative. Not necessarily what I would dream off, but just to 
showcase that there might be alternative to group pages.


I saw this patch too. It is an interesting idea to separate different 
allocation orders into different regions, but it would not work for gigantic 
page allocations unless we have a large pageblock size to utilize the existing 
anti-fragmentation mechani

Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread Zi Yan
On 6 May 2021, at 12:28, David Hildenbrand wrote:

> On 06.05.21 17:50, Zi Yan wrote:
>> On 6 May 2021, at 11:40, David Hildenbrand wrote:
>>
>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of 
>> memory
>> hotplug/hotremove subsection, but is not intended to be merged as is. It 
>> is
>> there in case one wants to try this out and will be removed during the 
>> final
>> submission.
>>
>> Feel free to give suggestions and comments. I am looking forward to your
>> feedback.
>
> Please not like this.

 Do you mind sharing more useful feedback instead of just saying a lot of 
 No?
>>>
>>> I remember reasoning about this already in another thread, no? Either 
>>> you're ignoring my previous feedback or my mind is messing with me.
>>
>> I definitely remember all your suggestions:
>>
>> 1. do not use CMA allocation for 1GB THP.
>> 2. section size defines the minimum size in which we can add_memory(), so we 
>> cannot increase it.
>>
>> I am trying an alternative here. I am not using CMA allocation and not 
>> increasing the minimum size of add_memory() by decoupling the memory block 
>> size from section size, so that add_memory() can add a memory block smaller 
>> (as small as 2MB, the subsection size) than section size. In this way, 
>> section size can be increased freely. I do not see the strong tie between 
>> add_memory() and section size, especially since we have subsection bitmap support.
>
> Okay, let me express my thoughts, I could have sworn I explained back then 
> why I am not a friend of messing with the existing pageblock size:

Thanks for writing down your thoughts in detail. I will clarify my high-level 
plan below too.

>
> 1. Pageblock size
>
> There are a couple of features that rely on the pageblock size to be 
> reasonably small to work as expected. One example is virtio-balloon free page 
> reporting, then there is virtio-mem (still also glued to MAX_ORDER) and we have 
> CMA (still also glued to MAX_ORDER). Most probably there are more. We track 
> movability/page isolation per pageblock; it's the smallest granularity at which you 
> can effectively isolate pages or mark them as CMA (MIGRATE_ISOLATE, 
> MIGRATE_CMA). Well, and there are "ordinary" THP / huge pages most of our 
> applications use and will use, especially on smallish systems.
>
> Assume you bump up the pageblock order to 1G. Small VMs won't be able to 
> report any free pages to the hypervisor. You'll take the "fine-grained" out 
> of virtio-mem. Each CMA area will have to be at least 1G big, which turns CMA 
> essentially useless on smallish systems (like we have on arm64 with 64k base 
> pages -- pageblock_size is 512MB and I hate it).

I understand the issue of having a large pageblock in small systems. My plan for
this issue is to make MAX_ORDER a variable (pageblock size would be set
according to MAX_ORDER) that can be adjusted based on total memory and via a
boot-time parameter. My apologies, since I did not state this clearly in my cover
letter and it confused you. When we have a boot-time adjustable MAX_ORDER, a
large pageblock like 1GB would only appear for systems with large memory. For
small VMs, pageblock size would stay at 2MB, so all your concerns on smallish
systems should go away.
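
As a rough illustration of such a knob (entirely hypothetical -- MAX_ORDER is a
compile-time constant today and "buddy_max_order" is a made-up parameter name), the
boot-time part could be as simple as:

	static unsigned int buddy_max_order __ro_after_init = 11;	/* current default */

	static int __init set_buddy_max_order(char *s)
	{
		return kstrtouint(s, 0, &buddy_max_order);
	}
	early_param("buddy_max_order", set_buddy_max_order);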

>
> Then, imagine systems that have like 4G of main memory. By stopping grouping 
> at 2M and instead grouping at 1G you can very easily find yourself in the 
> system where all your 4 pageblocks are unmovable and you essentially don't 
> optimize for huge pages in that environment any more.
>
> Long story short: we need a different mechanism on top and shall leave the 
> pageblock size untouched, it's too tightly integrated with page isolation, 
> ordinary THP, and CMA.

I think it is better to make pageblock size adjustable based on total memory of 
a system. It is not reasonable to have the same pageblock size across systems 
with memory sizes from <1GB to several TBs. Do you agree?

>
> 2. Section size
>
> I assume the only reason you want to touch that is because pageblock_size <= 
> section_size, and I guess that's one of the reasons I dislike it so much. 
> Messing with the section size really only makes sense when we want to manage 
> metadata for larger granularity within a section.

Perhaps it is worth checking if it is feasible to make pageblock_size >
section_size, so we can still have small sections when pageblock_size is
large. One potential issue with that is that when PFNs are discontiguous at a
section boundary, we might have a partial pageblock when pageblock_size is big.
I guess supporting partial pageblocks (or different pageblock sizes like you
mentioned below) would be the right solution.

>
> We allocate metadata per section. We mark whole sections 
> early/online/present/ Yes, in case of vmemmap, we manage the memmap in 
> smaller granularity using the sub-section map, some kind of hack to support 
> some ZONE_DEVICE cases better.
>
> Let's assume we i

Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread David Hildenbrand

On 06.05.21 17:50, Zi Yan wrote:

On 6 May 2021, at 11:40, David Hildenbrand wrote:


The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
hotplug/hotremove subsection, but is not intended to be merged as is. It is
there in case one wants to try this out and will be removed during the final
submission.

Feel free to give suggestions and comments. I am looking forward to your
feedback.


Please not like this.


Do you mind sharing more useful feedback instead of just saying a lot of No?


I remember reasoning about this already in another thread, no? Either you're 
ignoring my previous feedback or my mind is messing with me.


I definitely remember all your suggestions:

1. do not use CMA allocation for 1GB THP.
2. section size defines the minimum size in which we can add_memory(), so we 
cannot increase it.

I am trying an alternative here. I am not using CMA allocation and not 
increasing the minimum size of add_memory() by decoupling the memory block size 
from section size, so that add_memory() can add a memory block smaller (as 
small as 2MB, the subsection size) than section size. In this way, section size 
can be increased freely. I do not see the strong tie between add_memory() and 
section size, especially since we have subsection bitmap support.


Okay, let me express my thoughts, I could have sworn I explained back 
then why I am not a friend of messing with the existing pageblock size:


1. Pageblock size

There are a couple of features that rely on the pageblock size to be 
reasonably small to work as expected. One example is virtio-balloon free 
page reporting, then there is virtio-mem (still also glued to MAX_ORDER) 
and we have CMA (still also glued to MAX_ORDER). Most probably there are 
more. We track movability/page isolation per pageblock; it's the 
smallest granularity at which you can effectively isolate pages or mark them as 
CMA (MIGRATE_ISOLATE, MIGRATE_CMA). Well, and there are "ordinary" THP / 
huge pages most of our applications use and will use, especially on 
smallish systems.


Assume you bump up the pageblock order to 1G. Small VMs won't be able to 
report any free pages to the hypervisor. You'll take the "fine-grained" 
out of virtio-mem. Each CMA area will have to be at least 1G big, which 
turns CMA essentially useless on smallish systems (like we have on arm64 
with 64k base pages -- pageblock_size is 512MB and I hate it).
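
(For context, and assuming I have the arithmetic right: with 64k base pages on
arm64, pageblock_order follows HUGETLB_PAGE_ORDER = PMD_SHIFT - PAGE_SHIFT =
29 - 16 = 13, so a pageblock is 2^13 * 64k = 512MB.)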


Then, imagine systems that have like 4G of main memory. By stopping 
grouping at 2M and instead grouping at 1G you can very easily find 
yourself in the system where all your 4 pageblocks are unmovable and you 
essentially don't optimize for huge pages in that environment any more.


Long story short: we need a different mechanism on top and shall leave 
the pageblock size untouched, it's too tightly integrated with page 
isolation, ordinary THP, and CMA.


2. Section size

I assume the only reason you want to touch that is because 
pageblock_size <= section_size, and I guess that's one of the reasons I 
dislike it so much. Messing with the section size really only makes 
sense when we want to manage metadata for larger granularity within a 
section.


We allocate metadata per section. We mark whole sections 
early/online/present/ Yes, in case of vmemmap, we manage the memmap 
in smaller granularity using the sub-section map, some kind of hack to 
support some ZONE_DEVICE cases better.


Let's assume we introduce something new "gigapage_order", corresponding 
to 1G. We could either decide to squeeze the metadata into sections, 
having to increase the section size, or manage that metadata differently.


Managing it differently certainly makes the necessary changes easier. 
Instead of adding more hacks into sections, rather manage that metadata 
at a different place / in a different way.


See [1] for an alternative. Not necessarily what I would dream off, but 
just to showcase that there might be alternative to group pages.


3. Grouping pages > pageblock_order

There are other approaches that would benefit from grouping at > 
pageblock_order and having a bigger MAX_ORDER. And that doesn't 
necessarily mean forming gigantic pages only; we might want to group at 
multiple granularities on a single system. Memory hot(un)plug is one 
example, but also optimizing memory consumption by powering down DIMM 
banks. Also, some architectures support differing huge page sizes 
(aarch64) that could be improved without CMA. Why not have more than 2 
THP sizes on these systems?


Ideally, we'd have a mechanism that tries grouping at different 
granularities, like for every order in pageblock_order ... 
max_pageblock_order (e.g., 1 GiB), and not only adding one new level of 
grouping (or increasing the single grouping size).


[1] https://lkml.kernel.org/r/20210414023803.937-1-lipeif...@oppo.com

--
Thanks,

David / dhildenb



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread Zi Yan
On 6 May 2021, at 11:40, David Hildenbrand wrote:

 The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
 hotplug/hotremove subsection, but is not intended to be merged as is. It is
 there in case one wants to try this out and will be removed during the 
 final
 submission.

 Feel free to give suggestions and comments. I am looking forward to your
 feedback.
>>>
>>> Please not like this.
>>
>> Do you mind sharing more useful feedback instead of just saying a lot of No?
>
> I remember reasoning about this already in another thread, no? Either you're 
> ignoring my previous feedback or my mind is messing with me.

I definitely remember all your suggestions:

1. do not use CMA allocation for 1GB THP.
2. section size defines the minimum size in which we can add_memory(), so we 
cannot increase it.

I am trying an alternative here. I am not using CMA allocation and not 
increasing the minimum size of add_memory() by decoupling the memory block size 
from section size, so that add_memory() can add a memory block smaller (as 
small as 2MB, the subsection size) than section size. In this way, section size 
can be increased freely. I do not see the strong tie between add_memory() and 
section size, especially since we have subsection bitmap support.


—
Best Regards,
Yan Zi




Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread Zi Yan
On 6 May 2021, at 11:26, Zi Yan wrote:

> From: Zi Yan 
>
> Hi all,
>
> This patchset tries to remove the restriction on memory hotplug/hotremove
> granularity, which is always greater than or equal to the memory section size[1].
> With the patchset, the kernel is able to online/offline memory at a size
> independent of the memory section size, as small as 2MB (the subsection size).
>
> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
> size without increasing memory hotplug/hotremove granularity at the same time,
> so that the kernel can allocate 1GB pages using the buddy allocator and utilize
> existing pageblock-based anti-fragmentation, paving the road for 1GB THP
> support[2].
>
> The patchset utilizes the existing subsection support[3] and changes the
> section size alignment checks to subsection size alignment checks. There are
> also changes to pageblock code to support partial pageblocks, when pageblock
> size is increased along with MAX_ORDER. Increasing pageblock size can enable
> kernel to utilize existing anti-fragmentation mechanism for gigantic page
> allocations.
>
> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
> hotplug/hotremove subsection, but is not intended to be merged as is. It is
> there in case one wants to try this out and will be removed during the final
> submission.
>
> Feel free to give suggestions and comments. I am looking forward to your
> feedback.
>
> Thanks.

Added the missing references.

[1] 
https://lore.kernel.org/linux-mm/4b3006cf-3391-6839-904e-b41561319...@redhat.com/
[2] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi@sent.com/
[3] 
https://patchwork.kernel.org/project/linux-nvdimm/cover/156092349300.979959.17603710711957735135.st...@dwillia2-desk3.amr.corp.intel.com/

>
> Zi Yan (7):
>   mm: sparse: set/clear subsection bitmap when pages are
> onlined/offlined.
>   mm: set pageblock_order to the max of HUGETLB_PAGE_ORDER and
> MAX_ORDER-1
>   mm: memory_hotplug: decouple memory_block size with section size.
>   mm: pageblock: allow set/unset migratetype for partial pageblock
>   mm: memory_hotplug, sparse: enable memory hotplug/hotremove
> subsections
>   arch: x86: no MAX_ORDER exceeds SECTION_SIZE check for 32bit vdso.
>   [not for merge] mm: increase SECTION_SIZE_BITS to 31
>
>  arch/ia64/Kconfig|   1 -
>  arch/powerpc/Kconfig |   1 -
>  arch/x86/Kconfig |  15 +++
>  arch/x86/entry/vdso/Makefile |   1 +
>  arch/x86/include/asm/sparsemem.h |   2 +-
>  drivers/base/memory.c| 176 +++
>  drivers/base/node.c  |   2 +-
>  include/linux/memory.h   |   8 +-
>  include/linux/mmzone.h   |   2 +
>  include/linux/page-isolation.h   |   8 +-
>  include/linux/pageblock-flags.h  |   9 --
>  mm/Kconfig   |   7 --
>  mm/memory_hotplug.c  |  22 ++--
>  mm/page_alloc.c  |  40 ---
>  mm/page_isolation.c  |  30 +++---
>  mm/sparse.c  |  55 --
>  16 files changed, 219 insertions(+), 160 deletions(-)
>
> -- 
> 2.30.2


—
Best Regards,
Yan Zi




Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread David Hildenbrand

The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
hotplug/hotremove subsection, but is not intended to be merged as is. It is
there in case one wants to try this out and will be removed during the final
submission.

Feel free to give suggestions and comments. I am looking forward to your
feedback.


Please not like this.


Do you mind sharing more useful feedback instead of just saying a lot of No?


I remember reasoning about this already in another thread, no? Either 
you're ignoring my previous feedback or my mind is messing with me.


--
Thanks,

David / dhildenb



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread David Hildenbrand

On 06.05.21 17:31, David Hildenbrand wrote:

On 06.05.21 17:26, Zi Yan wrote:

From: Zi Yan 

Hi all,

This patchset tries to remove the restriction on memory hotplug/hotremove
granularity, which is always greater than or equal to the memory section size[1].
With the patchset, the kernel is able to online/offline memory at a size independent
of the memory section size, as small as 2MB (the subsection size).


... which doesn't make any sense as we can only online/offline whole
memory block devices.



The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
size without increasing memory hotplug/hotremove granularity at the same time,


Gah, no. Please no. No.


so that the kernel can allocate 1GB pages using the buddy allocator and utilize
existing pageblock-based anti-fragmentation, paving the road for 1GB THP
support[2].


Not like this, please no.



The patchset utilizes the existing subsection support[3] and changes the
section size alignment checks to subsection size alignment checks. There are
also changes to pageblock code to support partial pageblocks, when pageblock
size is increased along with MAX_ORDER. Increasing pageblock size can enable
kernel to utilize existing anti-fragmentation mechanism for gigantic page
allocations.


Please not like this.



The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
hotplug/hotremove subsection, but is not intended to be merged as is. It is
there in case one wants to try this out and will be removed during the final
submission.

Feel free to give suggestions and comments. I am looking forward to your
feedback.


Please not like this.



And just to be clear (I think I mentioned this already to you?): Nack to 
increasing the section size. Nack to increasing the pageblock order. 
Please find different ways to group on gigantic-pages level. There are 
alternative ideas floating around.


Semi-nack to increasing MAX_ORDER. I first want to see 
alloc_contig_range() be able to fully and cleanly handle allocations < 
MAX_ORDER in all cases (especially !CMA and !ZONE_MOVABLE) before we go 
down that path.


--
Thanks,

David / dhildenb



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread Zi Yan
On 6 May 2021, at 11:31, David Hildenbrand wrote:

> On 06.05.21 17:26, Zi Yan wrote:
>> From: Zi Yan 
>>
>> Hi all,
>>
>> This patchset tries to remove the restriction on memory hotplug/hotremove
>> granularity, which is always greater than or equal to the memory section size[1].
>> With the patchset, the kernel is able to online/offline memory at a size
>> independent of the memory section size, as small as 2MB (the subsection size).
>
> ... which doesn't make any sense as we can only online/offline whole memory 
> block devices.

Why limit the memory block size to the section size? Patch 3 removes the restriction
by using (start_pfn, nr_pages) to allow the memory block size to go below the
section size. Also, we have the subsection bitmap available to tell us which
subsection is online, so there is no reason to force the memory block size to match
the section size.

>
>>
>> The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
>> size without increasing memory hotplug/hotremove granularity at the same 
>> time,
>
> Gah, no. Please no. No.
>
>> so that the kernel can allocate 1GB pages using the buddy allocator and utilize
>> existing pageblock-based anti-fragmentation, paving the road for 1GB THP
>> support[2].
>
> Not like this, please no.
>
>>
>> The patchset utilizes the existing subsection support[3] and changes the
>> section size alignment checks to subsection size alignment checks. There are
>> also changes to pageblock code to support partial pageblocks, when pageblock
>> size is increased along with MAX_ORDER. Increasing pageblock size can enable
>> kernel to utilize existing anti-fragmentation mechanism for gigantic page
>> allocations.
>
> Please not like this.
>
>>
>> The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
>> hotplug/hotremove subsection, but is not intended to be merged as is. It is
>> there in case one wants to try this out and will be removed during the final
>> submission.
>>
>> Feel free to give suggestions and comments. I am looking forward to your
>> feedback.
>
> Please not like this.

Do you mind sharing more useful feedback instead of just saying a lot of No?

Thanks.


—
Best Regards,
Yan Zi




[RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread Zi Yan
From: Zi Yan 

Hi all,

This patchset tries to remove the restriction on memory hotplug/hotremove
granularity, which is always greater than or equal to the memory section size[1].
With the patchset, the kernel is able to online/offline memory at a size independent
of the memory section size, as small as 2MB (the subsection size).

The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
size without increasing memory hotplug/hotremove granularity at the same time,
so that the kernel can allocate 1GB pages using the buddy allocator and utilize
existing pageblock-based anti-fragmentation, paving the road for 1GB THP
support[2].

The patchset utilizes the existing subsection support[3] and changes the
section size alignment checks to subsection size alignment checks. There are
also changes to pageblock code to support partial pageblocks, when pageblock
size is increased along with MAX_ORDER. Increasing pageblock size can enable
kernel to utilize existing anti-fragmentation mechanism for gigantic page
allocations.

The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
hotplug/hotremove subsection, but is not intended to be merged as is. It is
there in case one wants to try this out and will be removed during the final
submission.

Feel free to give suggestions and comments. I am looking forward to your
feedback.

Thanks.

Zi Yan (7):
  mm: sparse: set/clear subsection bitmap when pages are
onlined/offlined.
  mm: set pageblock_order to the max of HUGETLB_PAGE_ORDER and
MAX_ORDER-1
  mm: memory_hotplug: decouple memory_block size with section size.
  mm: pageblock: allow set/unset migratetype for partial pageblock
  mm: memory_hotplug, sparse: enable memory hotplug/hotremove
subsections
  arch: x86: no MAX_ORDER exceeds SECTION_SIZE check for 32bit vdso.
  [not for merge] mm: increase SECTION_SIZE_BITS to 31

 arch/ia64/Kconfig|   1 -
 arch/powerpc/Kconfig |   1 -
 arch/x86/Kconfig |  15 +++
 arch/x86/entry/vdso/Makefile |   1 +
 arch/x86/include/asm/sparsemem.h |   2 +-
 drivers/base/memory.c| 176 +++
 drivers/base/node.c  |   2 +-
 include/linux/memory.h   |   8 +-
 include/linux/mmzone.h   |   2 +
 include/linux/page-isolation.h   |   8 +-
 include/linux/pageblock-flags.h  |   9 --
 mm/Kconfig   |   7 --
 mm/memory_hotplug.c  |  22 ++--
 mm/page_alloc.c  |  40 ---
 mm/page_isolation.c  |  30 +++---
 mm/sparse.c  |  55 --
 16 files changed, 219 insertions(+), 160 deletions(-)

-- 
2.30.2



Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size

2021-05-06 Thread David Hildenbrand

On 06.05.21 17:26, Zi Yan wrote:

From: Zi Yan 

Hi all,

This patchset tries to remove the restriction on memory hotplug/hotremove
granularity, which is always greater than or equal to the memory section size[1].
With the patchset, the kernel is able to online/offline memory at a size independent
of the memory section size, as small as 2MB (the subsection size).


... which doesn't make any sense as we can only online/offline whole 
memory block devices.




The motivation is to increase MAX_ORDER of the buddy allocator and pageblock
size without increasing memory hotplug/hotremove granularity at the same time,


Gah, no. Please no. No.


so that the kernel can allocate 1GB pages using the buddy allocator and utilize
existing pageblock-based anti-fragmentation, paving the road for 1GB THP
support[2].


Not like this, please no.



The patchset utilizes the existing subsection support[3] and changes the
section size alignment checks to subsection size alignment checks. There are
also changes to pageblock code to support partial pageblocks, when pageblock
size is increased along with MAX_ORDER. Increasing pageblock size can enable
kernel to utilize existing anti-fragmentation mechanism for gigantic page
allocations.


Please not like this.



The last patch increases SECTION_SIZE_BITS to demonstrate the use of memory
hotplug/hotremove subsection, but is not intended to be merged as is. It is
there in case one wants to try this out and will be removed during the final
submission.

Feel free to give suggestions and comments. I am looking forward to your
feedback.


Please not like this.

--
Thanks,

David / dhildenb