Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range

2020-12-09 Thread Oscar Salvador
On Wed, Dec 09, 2020 at 11:32:26AM +0100, Oscar Salvador wrote:
> On Wed, Dec 02, 2020 at 11:05:53AM +0100, David Hildenbrand wrote:
> > If you take a look at generic_online_page() there are some things that
> > won't be done for our vmemmap pages
> > 
> > 1. kernel_map_pages(page, 1 << order, 1);
> > 
> > We're accessing these pages already when initializing the memmap. We
> > might have to explicitly map these vmemmap pages at some point. Might
> > require some thought. Did you test with debug pagealloc?
> 
> I always try to run with all debug stuff enabled, but I definitely
> did not enable debug_pagealloc.
> I will have a look at it.
> 
> > 2. totalram_pages_add(1UL << order);
> > 
> > We should add/remove the vmemmap pages manually from totalram I guess.
> 
> Yes, we should. That was a clear oversight.

Looking closer, I do not think we have to account those into totalram.
I might be mistaken but looking at memblock_free_all, it seems we only
account to totalram_pages those ranges laying in memblock.memory filtering
out memblock.reserved.

And it seems that the pages we use for pglist_data structs (the ones we handle
in register_page_bootmem_info_node) fall in memblock.reserved.

-- 
Oscar Salvador
SUSE L3


Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range

2020-12-09 Thread Oscar Salvador
On Wed, Dec 02, 2020 at 11:05:53AM +0100, David Hildenbrand wrote:
> If you take a look at generic_online_page() there are some things that
> won't be done for our vmemmap pages
> 
> 1. kernel_map_pages(page, 1 << order, 1);
> 
> We're accessing these pages already when initializing the memmap. We
> might have to explicitly map these vmemmap pages at some point. Might
> require some thought. Did you test with debug pagealloc?

I always try to run with all debug stuff enabled, but I definitely
did not enable debug_pagealloc.
I will have a look at it.

> 2. totalram_pages_add(1UL << order);
> 
> We should add/remove the vmemmap pages manually from totalram I guess.

Yes, we should. That was a clear oversight.

> Also, I wonder if we should add these pages to the zone managed page
> count. Not quite sure yet what the benefit would be and if we care at
> all. (it's a change in behavior when hotplugging memory, though, that's
> why I am poining it out)

I would rather not since they are not "manageable" per se.
The same we do not account the pages used for bootmem stuff into zone->managed.

> > Hot-remove:
> > 
> >  If the range was using memmap on memory (aka vmemmap pages),
> >  we construct an altmap structure so free_hugepage_table does
> >  the right thing and calls vmem_altmap_free instead of
> >  free_pagetable.
> > 
> 
> You might want to update the comment in include/linux/mmzone.h for
> ZONE_MOVABLE, adding this special case when onlining such memory to the
> MOVABLE zone. The pages are movable for offlining only, not for allocations.

Sure, let us document that.

> > -int create_memory_block_devices(unsigned long start, unsigned long size)
> > +int create_memory_block_devices(unsigned long start, unsigned long size,
> > +   unsigned long vmemmap_pages)
> >  {
> > const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
> > unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
> > @@ -645,9 +649,10 @@ int create_memory_block_devices(unsigned long start, 
> > unsigned long size)
> > return -EINVAL;
> >  
> > for (block_id = start_block_id; block_id != end_block_id; block_id++) {
> > -   ret = init_memory_block(block_id, MEM_OFFLINE);
> > +   ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
> > if (ret)
> > break;
> > +   vmemmap_pages = 0;
> 
> We should even bail out if someone would have vmemmap_pages set when
> creating more than one memory block device.

That cannot really happen as we would have bailed out earlier prior to
call create_memory_block_devices, by the checks in 
mhp_supports_memmap_on_memory.

> > -   for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES)
> > -   (*online_page_callback)(pfn_to_page(pfn), MAX_ORDER - 1);
> > +   for (pfn = buddy_start_pfn; pfn < end_pfn; pfn += (1 << order)) {
> > +   order = MAX_ORDER - 1;
> > +   while (pfn & ((1 << order) - 1))
> > +   order--;
> > +   (*online_page_callback)(pfn_to_page(pfn), order);
> > +   }
> 
> I think it would be nicer to handle the unaligned part in a separate loop.

No strong opinion here, I can certainly do it.

> > /* mark all involved sections as online */
> > online_mem_sections(start_pfn, end_pfn);
> > @@ -678,6 +684,7 @@ static void __meminit resize_pgdat_range(struct 
> > pglist_data *pgdat, unsigned lon
> > pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - 
> > pgdat->node_start_pfn;
> >  
> >  }
> > +
> 
> Unrelated change.

Ups ^^

> > +static int get_memblock_vmemmap_pages_cb(struct memory_block *mem, void 
> > *arg)
> > +{
> > +   unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> > +   int ret = !mem->nr_vmemmap_pages;
> > +
> > +   if (!ret)
> > +   *nr_vmemmap_pages += mem->nr_vmemmap_pages;
> > +   return ret;
> 
> When removing in the same granularity than adding, all we care about it
> the first memory block.
> 
> 
> unsigned long *nr_vmemmap_pages = (unsigned long *)arg;
> 
> *nr_vmemmap_pages = mem->nr_vmemmap_pages;
> /* We only care about the first memory block. */
> return 1;

Yeah, cleaner, thanks

> > +   /*
> > +* Prepare a vmem_altmap struct if we used it at hot-add, so
> > +* remove_pmd_table->free_hugepage_table does the right thing.
> > +*/
> > +   (void)walk_memory_blocks(start, size, _vmemmap_pages,
> > +get_memblock_vmemmap_pages_cb);
> > +   if (nr_vmemmap_pages) {
> > +   mhp_altmap.alloc = nr_vmemmap_pages;
> > +   altmap = _altmap;
> > +   }
> > +
> 
> If someone would remove_memory() in a different granularity than
> add_memory(), this would no longer work. How can we catch that
> efficiently? Or at least document that this is not supported with
> memmap_on_memory).

Well, we can check whether the size spans more than a single memory block.
And if it does, we can print a warning with pr_warn and refuse to 

Re: [RFC PATCH v3 2/4] mm,memory_hotplug: Allocate memmap from the added memory range

2020-12-02 Thread David Hildenbrand
On 01.12.20 12:51, Oscar Salvador wrote:
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
> 
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
> (eg: ~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
> which has performance drawbacks.
>  c) It might be there are no PMD_ALIGNED chunks so memmap array gets
> populated with base pages.
> 
> This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.
> 
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
> 
> Implementation wise we will reuse vmem_altmap infrastructure to override
> the default allocator used by __populate_section_memmap.
> Part of the implementation also relies on memory_block structure gaining
> a new field which specifies the number of vmemmap_pages at the beginning.
> This comes in handy as in {online,offline}_pages, all the isolation and
> migration is being one on (buddy_start_pfn, end_pfn] range,
> being buddy_start_pfn = start_pfn + nr_vmemmap_pages.
> 
> In this way, we have:
> 
> (start_pfn, buddy_start_pfn - 1] = Initialized and PageReserved
> (buddy_start_pfn, end_pfn]   = Initialized and sent to buddy

If you take a look at generic_online_page() there are some things that
won't be done for our vmemmap pages

1. kernel_map_pages(page, 1 << order, 1);

We're accessing these pages already when initializing the memmap. We
might have to explicitly map these vmemmap pages at some point. Might
require some thought. Did you test with debug pagealloc?

2. totalram_pages_add(1UL << order);

We should add/remove the vmemmap pages manually from totalram I guess.


Also, I wonder if we should add these pages to the zone managed page
count. Not quite sure yet what the benefit would be and if we care at
all. (it's a change in behavior when hotplugging memory, though, that's
why I am poining it out)

> 
> Hot-remove:
> 
>  If the range was using memmap on memory (aka vmemmap pages),
>  we construct an altmap structure so free_hugepage_table does
>  the right thing and calls vmem_altmap_free instead of
>  free_pagetable.
> 

You might want to update the comment in include/linux/mmzone.h for
ZONE_MOVABLE, adding this special case when onlining such memory to the
MOVABLE zone. The pages are movable for offlining only, not for allocations.

[...]
> -int create_memory_block_devices(unsigned long start, unsigned long size)
> +int create_memory_block_devices(unsigned long start, unsigned long size,
> + unsigned long vmemmap_pages)
>  {
>   const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
>   unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
> @@ -645,9 +649,10 @@ int create_memory_block_devices(unsigned long start, 
> unsigned long size)
>   return -EINVAL;
>  
>   for (block_id = start_block_id; block_id != end_block_id; block_id++) {
> - ret = init_memory_block(block_id, MEM_OFFLINE);
> + ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
>   if (ret)
>   break;
> + vmemmap_pages = 0;

We should even bail out if someone would have vmemmap_pages set when
creating more than one memory block device.

>   }
>   if (ret) {
>   end_block_id = block_id;
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 439a89e758d8..b910d2aea879 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -30,6 +30,11 @@ struct memory_block {
>   int phys_device;/* to which fru does this belong? */
>   struct device dev;
>   int nid;/* NID for this memory block */
> + unsigned long nr_vmemmap_pages; /*
> +  * Number of vmemmap pages. These pages
> +  * lay at the beginning of the memory
> +  * block.
> +  */
>  };
>  
>  int arch_get_memory_phys_device(unsigned long start_pfn);
> @@ -81,7 +86,7 @@ static inline int memory_notify(unsigned long val, void *v)
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> -int create_memory_block_devices(unsigned long start, unsigned long size);
> +int create_memory_block_devices(unsigned long start, unsigned long size, 
> unsigned long vmemmap_pages);
>  void remove_memory_block_devices(unsigned long start, unsigned long size);
>  extern void memory_dev_init(void);
>  extern int memory_notify(unsigned long