Re: [PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-20 Thread Michal Hocko
On Fri 16-04-21 13:24:07, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)

I would extend this slightly. This can even lead to extreme cases where
the system goes OOM because the physically hotplugged memory depletes the
available memory before it is onlined.

>  b) if the whole node is movable then we have off-node struct pages
> which have performance drawbacks.
>  c) It might be that there are no PMD_ALIGNED chunks so the memmap array gets
> populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemmap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.

Again this can be confusing, because it is not what really happens in
practice: we are going to have a multi-section memory block where all
sections are backed by a common reserved space rather than per-section
sparse space. I would go with

"
Vmemmap page tables can map arbitrary memory. That means that we can
reserve a part of the physically hotadded memory to back vmemmap page
tables. This implementation uses the beginning of the hotplugged memory
for that purpose.
"

> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This patch also introduces the following functions:

There is quite a large leap from __populate_section_memmap to the
memory_block that deserves explaining, so as not to lose all the subtle
things discussed in the past. I think it should be made clear why all the
fuss. I would structure it as follows:
"
There are some non-obvious things to consider though. Vmemmap
pages are allocated/freed during the memory hotplug events
(add_memory_resource, try_remove_memory) when the memory is
added/removed. This means that the reserved physical range is not online
even though it is used. The most obvious side effect is that
pfn_to_online_page returns NULL for those pfns. The current design expects
that this should be OK, as the hotplugged memory is considered garbage
until it is onlined. For example, hibernation wouldn't save the content of
those vmemmaps into the image, so it wouldn't be restored on resume, but
this should be OK as there is no real content to recover anyway, while the
metadata is reachable from other data structures (e.g. vmemmap page tables).

The reserved space is therefore (de)initialized during the {on,off}line
events (mhp_{de}init_memmap_on_memory). That is done by extracting page
allocator independent initialization from the regular onlining path.
The primary reason to handle the reserved space outside of {on,off}line_pages
is to make each initialization specific to its purpose rather than
special-casing them in a single function.
"

> Adjusting of present_pages is done at the end, once we know that online_pages()
> succeeded.
> 
> On offline, memory_block_offline() needs to unaccount vmemmap pages from
> present_pages() before calling offline_pages().
> This is necessary because offline_pages() tears down some structures based
> on whether the node or the zone becomes empty.
> If offline_pages() fails, we account back vmemmap pages.
> If it succeeds, we call mhp_deinit_memmap_on_memory().
> 
> Hot-remove:
> 
>  We need to be careful when removing memory, as adding and
>  removing memory needs to be done with the same granularity.
>  To check that this assumption is not violated, we check the
>  memory range we want to remove and if a) any memory block has
>  vmemmap pages and b) the range spans more than a single memory
>  block, we scream out loud and refuse to proceed.
> 
>  If all is good and the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does the right
>  thing and calls vmem_altmap_free instead of free_pagetable.
> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 
> ---
>  drivers/base/memory.c  |  71 --
>  include/linux/memory.h |   8 ++-
>  include/linux/memory_hotplug.h |  15 +++-
>  include/linux/memremap.h   |   2 +-
>  include/linux/mmzone.h |   7 +-
>  mm/Kconfig |   5 ++
>  mm/memory_hotplug.c| 159 ++---
>  mm/sparse.c|   2 -
>  8 files changed, 247 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index f209925a5d4e..2e2b2f654f0a 100644

[PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-16 Thread Oscar Salvador
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
which have performance drawbacks.
 c) It might be that there are no PMD_ALIGNED chunks so the memmap array gets
populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This patch also introduces the following functions:

 - mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(),
   calls kasan_add_zero_shadow(), and onlines as many sections
   as vmemmap pages fully span.
 - mhp_deinit_memmap_on_memory:
   Undoes what mhp_init_memmap_on_memory did.

The new function memory_block_online() calls mhp_init_memmap_on_memory() before
doing the actual online_pages(). Should online_pages() fail, we clean up
by calling mhp_deinit_memmap_on_memory().
Adjusting of present_pages is done at the end, once we know that online_pages()
succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages().
This is necessary because offline_pages() tears down some structures based
on whether the node or the zone becomes empty.
If offline_pages() fails, we account back vmemmap pages.
If it succeeds, we call mhp_deinit_memmap_on_memory().

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 drivers/base/memory.c  |  71 --
 include/linux/memory.h |   8 ++-
 include/linux/memory_hotplug.h |  15 +++-
 include/linux/memremap.h   |   2 +-
 include/linux/mmzone.h |   7 +-
 mm/Kconfig |   5 ++
 mm/memory_hotplug.c| 159 ++---
 mm/sparse.c|   2 -
 8 files changed, 247 insertions(+), 22 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f209925a5d4e..2e2b2f654f0a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+   struct zone *zone;
+   int ret;
+
+   zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+
+   /*
+* Although vmemmap pages have a different lifecycle than the pages
+* they describe (they remain until the memory is unplugged), doing
+* their initialization and accounting at memory onlining/offlining
+* stage simplifies things a lot.
+*/
+   if (nr_vmemmap_pages) {
+   ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+   if (ret)
+   return ret;
+   }
+
+   ret = online_pages(start_pfn + nr_vmemmap_pages,
+  nr_pages - nr_vmemmap_pages, zone);
+   if (ret) {
+   if (nr_vmemmap_pages)
+   mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+   return ret;
+   }
+
+   /*
+* Account once onlining succeeded. If the zone was unpopulated, it is
+* now already properly populated.
+*/
+   if (nr_vmemmap_pages)
+   adjust_present_page_count(zone, nr_vmemmap_pages);
 
-   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+   return ret;
 }
 
 static int memory_block_offline(struct memory_block *mem)
 {
un