Re: [linus:master] [mm] d99e3140a4: BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report
On Tue, May 28, 2024 at 09:43:39AM +0200, David Hildenbrand wrote: > Likely that's just a page_type check racing against concurrent > mapcount changes. > > In __folio_rmap_sanity_checks() we check > VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio); Yeah, and that "collides" with last = atomic_add_negative(-1, >_mapcount) from __folio_remove_rmap. > To make sure we don't get hugetlb folios in the wrong rmap code path. That > can easily race with concurrent mapcount changes, just like any other > page_type checks that end up in folio_test_type/page_has_type e.g., from > PFN walkers. > > Load tearing in these functions shouldn't really result in false positives > (what we care about), but READ_ONCE shouldn't hurt or make a difference. > > > From b03dc9bf27571442d886d8da624a4e4f737433f2 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand > Date: Tue, 28 May 2024 09:37:20 +0200 > Subject: [PATCH] mm: read page_type using READ_ONCE > > KCSAN complains about possible data races: while we check for a > page_type -- for example for sanity checks -- we might concurrently > modify the mapcount that overlays page_type. > > Let's use READ_ONCE to avoid laod tearing (shouldn't make a difference) > and to make KCSAN happy. > > Note: nothing should really be broken besides wrong KCSAN complaints. > > Reported-by: kernel test robot > Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-...@intel.com > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador Thanks! -- Oscar Salvador SUSE Labs
Re: [PATCH v3 5/9] riscv: mm: Add memory hotplugging support
On Tue, May 21, 2024 at 03:19:37PM +0200, Alexandre Ghiti wrote: > On Tue, May 21, 2024 at 1:49 PM Björn Töpel wrote: > > + if (PageReserved(page)) { > > + __ClearPageReserved(page); > > What's the difference between __ClearPageReserved() and > ClearPageReserved()? Because it seems like free_reserved_page() calls > the latter already, so why would you need to call > __ClearPageReserved() on the first page? __{Set,Clear}Page are the non-atomic version. Usually used when you know that no one else can fiddle with the page, which should be the case here since we are removing the memory. As to why we have __ClearPageReserved and then having free_reserved_page() call ClearPageReserved I do not really know. Looking at the history, it has always been like this. I remember I looked at this a few years ago but I cannot remember the outcome of that. Maybe David remembers better, but I think we could remove that __ClearPageReserved. Looking at powerpc implementation code, it does not do the __ClearPageReserved and relies only on free_reserved_page(). I will have a look. -- Oscar Salvador SUSE Labs
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On Tue, May 14, 2024 at 04:04:44PM +0200, Björn Töpel wrote: > From: Björn Töpel > > Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for > RISC-V. > > Signed-off-by: Björn Töpel > --- > arch/riscv/Kconfig | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > index 6bec1bce6586..b9398b64bb69 100644 > --- a/arch/riscv/Kconfig > +++ b/arch/riscv/Kconfig > @@ -16,6 +16,8 @@ config RISCV > select ACPI_REDUCED_HARDWARE_ONLY if ACPI > select ARCH_DMA_DEFAULT_COHERENT > select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION > + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU Hopefully this should be SPARSEMEM_VMEMMAP. We are trying to deprecate memory-hotplug on !SPARSEMEM_VMEMMAP. And it is always easier to do it now that when the code goes already in, so please consider if you really need SPARSEMEM and why (I do not think you do). -- Oscar Salvador SUSE Labs
Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
On Tue, May 14, 2024 at 04:04:43PM +0200, Björn Töpel wrote: > From: Björn Töpel > > During memory hot remove, the ptdump functionality can end up touching > stale data. Avoid any potential crashes (or worse), by holding the > memory hotplug read-lock while traversing the page table. > > This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm: > Hold memory hotplug lock while walking for kernel page table dump"). > > Signed-off-by: Björn Töpel Reviewed-by: Oscar Salvador funny enough, it seems arm64 and riscv are the only ones holding the hotplug lock here. I think we have the same problem on the other arches as well (at least on x86_64 that I can see). If we happen to finally need the lock in those, I would rather have a centric function in the generic mm code with the locking and then calling an arch specific ptdump_show function, so the lock is not scattered. But that is another story. -- Oscar Salvador SUSE Labs
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote: > +static void __meminit free_vmemmap_storage(struct page *page, size_t size, > +struct vmem_altmap *altmap) > +{ > + if (altmap) > + vmem_altmap_free(altmap, size >> PAGE_SHIFT); > + else > + free_pages((unsigned long)page_address(page), get_order(size)); David already pointed this out, but can check arch/x86/mm/init_64.c:free_pagetable(). You will see that we have to do some magic for bootmem memory (DIMMs which were not hotplugged but already present) > +#ifdef CONFIG_SPARSEMEM_VMEMMAP > +void __ref vmemmap_free(unsigned long start, unsigned long end, struct > vmem_altmap *altmap) > +{ > + remove_pgd_mapping(start, end, true, altmap); > +} > +#endif /* CONFIG_SPARSEMEM_VMEMMAP */ > +#endif /* CONFIG_MEMORY_HOTPLUG */ I will comment on the patch where you add support for hotplug and the dependency, but on a track in LSFMM today, we decided that most likely we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP environments. So, since you are adding this plain fresh, please consider to tight the hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP. As a bonus, you will only have to maintain one flavour of functions. -- Oscar Salvador SUSE Labs
Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
On Tue, May 14, 2024 at 04:04:41PM +0200, Björn Töpel wrote: > From: Björn Töpel > > Add a parameter to the direct map setup function, so it can be used in > arch_add_memory() later. > > Signed-off-by: Björn Töpel Reviewed-by: Oscar Salvador > --- > arch/riscv/mm/init.c | 15 ++- > 1 file changed, 6 insertions(+), 9 deletions(-) > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > index c969427eab88..6f72b0b2b854 100644 > --- a/arch/riscv/mm/init.c > +++ b/arch/riscv/mm/init.c > @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa) > } > > static void __meminit create_linear_mapping_range(phys_addr_t start, > phys_addr_t end, > - uintptr_t fixed_map_size) > + uintptr_t fixed_map_size, > const pgprot_t *pgprot) > { > phys_addr_t pa; > uintptr_t va, map_size; > @@ -1238,7 +1238,7 @@ static void __meminit > create_linear_mapping_range(phys_addr_t start, phys_addr_t > best_map_size(pa, va, end - pa); > > create_pgd_mapping(swapper_pg_dir, va, pa, map_size, > -pgprot_from_va(va)); > +pgprot ? *pgprot : pgprot_from_va(va)); > } > } > > @@ -1282,22 +1282,19 @@ static void __init > create_linear_mapping_page_table(void) > if (end >= __pa(PAGE_OFFSET) + memory_limit) > end = __pa(PAGE_OFFSET) + memory_limit; > > - create_linear_mapping_range(start, end, 0); > + create_linear_mapping_range(start, end, 0, NULL); > } > > #ifdef CONFIG_STRICT_KERNEL_RWX > - create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0); > - create_linear_mapping_range(krodata_start, > - krodata_start + krodata_size, 0); > + create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, > NULL); > + create_linear_mapping_range(krodata_start, krodata_start + > krodata_size, 0, NULL); > > memblock_clear_nomap(ktext_start, ktext_size); > memblock_clear_nomap(krodata_start, krodata_size); > #endif > > #ifdef CONFIG_KFENCE > - create_linear_mapping_range(kfence_pool, > - kfence_pool + KFENCE_POOL_SIZE, > - PAGE_SIZE); > + create_linear_mapping_range(kfence_pool, kfence_pool + > KFENCE_POOL_SIZE, PAGE_SIZE, NULL); > > memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE); > #endif > -- > 2.40.1 > -- Oscar Salvador SUSE Labs
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote: > From: Björn Töpel > > Prepare for memory hotplugging support by changing from __init to > __meminit for the page table functions that are used by the upcoming > architecture specific callbacks. > > Changing the __init attribute to __meminit, avoids that the functions > are removed after init. The __meminit attribute makes sure the > functions are kept in the kernel text post init, but only if memory > hotplugging is enabled for the build. > > Also, make sure that the altmap parameter is properly passed on to > vmemmap_populate_hugepages(). > > Signed-off-by: Björn Töpel Reviewed-by: Oscar Salvador > +static void __meminit create_linear_mapping_range(phys_addr_t start, > phys_addr_t end, > + uintptr_t fixed_map_size) > { > phys_addr_t pa; > uintptr_t va, map_size; > @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, > unsigned long end, int node, >* memory hotplug, we are not able to update all the page tables with >* the new PMDs. >*/ > - return vmemmap_populate_hugepages(start, end, node, NULL); > + return vmemmap_populate_hugepages(start, end, node, altmap); I would have put this into a separate patch. -- Oscar Salvador SUSE Labs
Re: [PATCH v20 8/9] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled
On Thu, Apr 15, 2021 at 04:40:04PM +0800, Muchun Song wrote: > bool mhp_supports_memmap_on_memory(unsigned long size) > { > + bool supported; > unsigned long nr_vmemmap_pages = size / PAGE_SIZE; > unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); > unsigned long remaining_size = size - vmemmap_size; > @@ -1011,11 +1012,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size) >* altmap as an alternative source of memory, and we do not > exactly >* populate a single PMD. >*/ > - return memmap_on_memory && > -IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && > -size == memory_block_size_bytes() && > -IS_ALIGNED(vmemmap_size, PMD_SIZE) && > -IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT); > + supported = memmap_on_memory && > + IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && > + size == memory_block_size_bytes() && > + IS_ALIGNED(vmemmap_size, PMD_SIZE) && > + IS_ALIGNED(remaining_size, pageblock_nr_pages << > PAGE_SHIFT); > + > + if (supported && is_hugetlb_free_vmemmap_enabled()) { > + pr_info("Cannot enable memory_hotplug.memmap_on_memory, it is > not compatible with hugetlb_free_vmemmap\n"); > + supported = false; > + } I would not print anything and rather have return memmap_on_memory && !is_hugetlb_free_vmemmap_enabled && IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && size == memory_block_size_bytes() && IS_ALIGNED(vmemmap_size, PMD_SIZE) && IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT); Documentation/admin-guide/kernel-parameters.txt already provides an explanation on memory_hotplug.memmap_on_memory parameter that states that the feature cannot be enabled when using hugetlb-vmemmap optimization. Users can always check whether the feature is enabled via /sys/modules/memory_hotplug/parameters/memmap_on_memory. Also, I did not check if it is, but if not, the fact about hugetlb-vmemmmap vs hotplug-vmemmap should also be called out in the hugetlb-vmemmap kernel parameter. Thanks -- Oscar Salvador SUSE L3
[PATCH v10 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
pfn_range_valid_contig() bails out when it finds an in-use page or a hugetlb page, among other things. We can drop the in-use page check since __alloc_contig_pages can migrate away those pages, and the hugetlb page check can go too since isolate_migratepages_range is now capable of dealing with hugetlb pages. Either way, those checks are racy so let the end function handle it when the time comes. Signed-off-by: Oscar Salvador Suggested-by: David Hildenbrand Reviewed-by: David Hildenbrand Acked-by: Mike Kravetz Acked-by: Michal Hocko --- mm/page_alloc.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b5a94de3cdde..c5338e912ace 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, if (PageReserved(page)) return false; - - if (page_count(page) > 0) - return false; - - if (PageHuge(page)) - return false; } return true; } -- 2.16.3
[PATCH v10 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page
Pages allocated via the page allocator or CMA get its private field cleared by means of post_alloc_hook(). Pages allocated during boot, that is directly from the memblock allocator, get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page() before any memblock allocation. Based on this ground, let us remove the clearing of the flag from prep_new_huge_page() as it is not needed. This was a leftover from 6c0371490140 ("hugetlb: convert PageHugeFreed to HPageFreed flag"). Previously the explicit clearing was necessary because compound allocations do not get this initialization (see prep_compound_page). Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Reviewed-by: Mike Kravetz --- mm/hugetlb.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 54d81d5947ed..2cb9fa79cbaa 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) spin_lock_irq(_lock); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; - ClearHPageFreed(page); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v10 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Signed-off-by: Oscar Salvador Reviewed-by: Mike Kravetz Acked-by: Michal Hocko Acked-by: David Hildenbrand --- include/linux/hugetlb.h | 5 +++-- mm/compaction.c | 12 +++- mm/hugetlb.c| 22 +- mm/vmscan.c | 5 +++-- 4 files changed, 34 insertions(+), 10 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b2d2118bfd1a..b92f25ccef58 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,7 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; -int isolate_or_dissolve_huge_page(struct page *page); +int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; -static inline int isolate_or_dissolve_huge_page(struct page *page) +static inline int isolate_or_dissolve_huge_page(struct page *page, + struct list_head *list) { return -ENOMEM; } diff --git a/mm/compaction.c b/mm/compaction.c index af2e8e194e50..84fde270ae74 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -907,7 +907,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, } if (PageHuge(page) && cc->alloc_contig) { - ret = isolate_or_dissolve_huge_page(page); + ret = isolate_or_dissolve_huge_page(page, >migratepages); /* * Fail isolation in case isolate_or_dissolve_huge_page() @@ -921,6 +921,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; } + if (PageHuge(page)) { + /* +* Hugepage was successfully isolated and placed +* on the cc->migratepages list. +*/ + low_pfn += compound_nr(page) - 1; + goto isolate_success_no_list; + } + /* * Ok, the hugepage was dissolved. Now these pages are * Buddy and cannot be re-allocated because they are @@ -1062,6 +1071,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, isolate_success: list_add(>lru, >migratepages); +isolate_success_no_list: cc->nr_migratepages += compound_nr(page); nr_isolated += compound_nr(page); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index cc2463953fab..3db405dea3dc 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2270,9 +2270,11 @@ static void restore_reserve_on_error(struct hstate *h, * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one * @h: struct hstate old page belongs to * @old_page: Old page to dissolve + * @list: List to isolate the page in case we need to * Returns 0 on success, otherwise negated error. */ -static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, + struct list_head *list) { gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; int nid = page_to_nid(old_page); @@ -2299,9 +2301,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) goto free_new; } else if (page_count(old_page)) { /* -* Someone has grabbed the page, fail for now. +* Someone has grabbed the page, try to isolate it here. +* Fail with -EBUSY if not possible. */ - ret = -EBUSY; + spin_unlock_irq(_lock); + if (!isolate_huge_page(old_page, list)) + ret = -EBUSY; + spin_lock_irq(_lock); goto free_new; } else if (!HPageFreed(old_page)) { /* @@ -2351,10 +2357,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) return ret; } -int iso
[PATCH v10 5/7] mm: Make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the range we are trying to allocate, even when that page is free and can be easily reallocated. This has proved to be problematic for some users of alloc_contic_range, e.g: CMA and virtio-mem, where those would fail the call even when those pages lay in ZONE_MOVABLE and are free. We can do better by trying to replace such page. Free hugepages are tricky to handle so as to no userspace application notices disruption, we need to replace the current free hugepage with a new one. In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugepage, and if it succeeds, it will replace the old one in the free hugepage pool. The free page replacement is done under hugetlb_lock, so no external users of hugetlb will notice the change. To allocate the new huge page, we use alloc_buddy_huge_page(), so we do not have to deal with any counters, and prep_new_huge_page() is not called. This is valulable because in case we need to free the new page, we only need to call __free_pages(). Once we know that the page to be replaced is a genuine 0-refcounted huge page, we remove the old page from the freelist by remove_hugetlb_page(). Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page() for the new huge page to properly initialize it and increment the hstate->nr_huge_pages counter (previously decremented by remove_hugetlb_page()). Once done, the page is enqueued by enqueue_huge_page() and it is ready to be used. There is one tricky case when page's refcount is 0 because it is in the process of being released. A missing PageHugeFreed bit will tell us that freeing is in flight so we retry after dropping the hugetlb_lock. The race window should be small and the next retry should make a forward progress. E.g: CPU0CPU1 free_huge_page()isolate_or_dissolve_huge_page PageHuge() == T alloc_and_dissolve_huge_page alloc_buddy_huge_page() spin_lock_irq(hugetlb_lock) // PageHuge() && !PageHugeFreed && // !PageCount() spin_unlock_irq(hugetlb_lock) spin_lock_irq(hugetlb_lock) 1) update_and_free_page PageHuge() == F __free_pages() 2) enqueue_huge_page SetPageHugeFreed() spin_unlock_irq(_lock) spin_lock_irq(hugetlb_lock) 1) PageHuge() == F (freed by case#1 from CPU0) 2) PageHuge() == T PageHugeFreed() == T - proceed with replacing the page In the case above we retry as the window race is quite small and we have high chances to succeed next time. With regard to the allocation, we restrict it to the node the page belongs to with __GFP_THISNODE, meaning we do not fallback on other node's zones. Note that gigantic hugetlb pages are fenced off since there is a cyclic dependency between them and alloc_contig_range. Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Acked-by: David Hildenbrand Reviewed-by: Mike Kravetz --- include/linux/hugetlb.h | 6 +++ mm/compaction.c | 33 -- mm/hugetlb.c| 116 3 files changed, 152 insertions(+), 3 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 09f1fd12a6fa..b2d2118bfd1a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,6 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; +int isolate_or_dissolve_huge_page(struct page *page); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; +static inline int isolate_or_dissolve_huge_page(struct page *page) +{ + return -ENOMEM; +} + static inline struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) diff --git a/mm/compaction.c b/mm/compaction.c index 110b90d5fff5..af2e8e194e50 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat) * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. * Returns errno, like -EAGAIN or -EINTR
[PATCH v10 4/7] mm,hugetlb: Split prep_new_huge_page functionality
Currently, prep_new_huge_page() performs two functions. It sets the right state for a new hugetlb, and increases the hstate's counters to account for the new page. Let us split its functionality into two separate functions, decoupling the handling of the counters from initializing a hugepage. The outcome is having __prep_new_huge_page(), which only initializes the page , and __prep_account_new_huge_page(), which adds the new page to the hstate's counters. This allows us to be able to set a hugetlb without having to worry about the counter/locking. It will prove useful in the next patch. prep_new_huge_page() still calls both functions. Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Reviewed-by: Mike Kravetz Reviewed-by: David Hildenbrand --- mm/hugetlb.c | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 2cb9fa79cbaa..6f39ec79face 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1483,16 +1483,30 @@ void free_huge_page(struct page *page) } } -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +/* + * Must be called with the hugetlb lock held + */ +static void __prep_account_new_huge_page(struct hstate *h, int nid) +{ + lockdep_assert_held(_lock); + h->nr_huge_pages++; + h->nr_huge_pages_node[nid]++; +} + +static void __prep_new_huge_page(struct page *page) { INIT_LIST_HEAD(>lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); hugetlb_set_page_subpool(page, NULL); set_hugetlb_cgroup(page, NULL); set_hugetlb_cgroup_rsvd(page, NULL); +} + +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +{ + __prep_new_huge_page(page); spin_lock_irq(_lock); - h->nr_huge_pages++; - h->nr_huge_pages_node[nid]++; + __prep_account_new_huge_page(h, nid); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v10 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was any error during isolation. This does not work as soon as we need to start reporting different error codes and make sure we pass them down the chain, so they are properly interpreted by functions like e.g: alloc_contig_range. Let us rework isolate_migratepages_{range,block} so we can report error codes. Since isolate_migratepages_block will stop returning the next pfn to be scanned, we reuse the cc->migrate_pfn field to keep track of that. Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Acked-by: Mike Kravetz Reviewed-by: David Hildenbrand --- mm/compaction.c | 52 +--- mm/internal.h | 10 -- mm/page_alloc.c | 7 +++ 3 files changed, 36 insertions(+), 33 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 8c5028bfbd56..110b90d5fff5 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -787,15 +787,14 @@ static bool too_many_isolated(pg_data_t *pgdat) * * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. - * Returns zero if there is a fatal signal pending, otherwise PFN of the - * first page that was not scanned (which may be both less, equal to or more - * than end_pfn). + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, + * or 0. + * cc->migrate_pfn will contain the next pfn to scan. * * The pages are isolated on cc->migratepages list (not required to be empty), - * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field - * is neither read nor updated. + * and cc->nr_migratepages is updated accordingly. */ -static unsigned long +static int isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, unsigned long end_pfn, isolate_mode_t isolate_mode) { @@ -809,6 +808,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, bool skip_on_failure = false; unsigned long next_skip_pfn = 0; bool skip_updated = false; + int ret = 0; + + cc->migrate_pfn = low_pfn; /* * Ensure that there are not too many pages isolated from the LRU @@ -818,16 +820,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, while (unlikely(too_many_isolated(pgdat))) { /* stop isolation if there are still pages not migrated */ if (cc->nr_migratepages) - return 0; + return -EAGAIN; /* async migration should just abort */ if (cc->mode == MIGRATE_ASYNC) - return 0; + return -EAGAIN; congestion_wait(BLK_RW_ASYNC, HZ/10); if (fatal_signal_pending(current)) - return 0; + return -EINTR; } cond_resched(); @@ -875,8 +877,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (fatal_signal_pending(current)) { cc->contended = true; + ret = -EINTR; - low_pfn = 0; goto fatal_pending; } @@ -1130,7 +1132,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (nr_isolated) count_compact_events(COMPACTISOLATED, nr_isolated); - return low_pfn; + cc->migrate_pfn = low_pfn; + + return ret; } /** @@ -1139,15 +1143,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, * @start_pfn: The first PFN to start isolating. * @end_pfn: The one-past-last PFN. * - * Returns zero if isolation fails fatally due to e.g. pending signal. - * Otherwise, function returns one-past-the-last PFN of isolated page - * (which may be greater than end_pfn if end fell in a middle of a THP page). + * Returns -EAGAIN when contented, -EINTR in case of a signal pending or 0. */ -unsigned long +int isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn, block_start_pfn, block_end_pfn; + int ret = 0; /* Scan block by block. First and last block may be incomplete */ pfn = start_pfn; @@ -1166,17 +1169,17 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, block_end_pfn, cc->zone)) continue; - pfn = isolate_migra
[PATCH v10 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY, and report them down the chain. The problem is that when migrate_pages() reports -ENOMEM, we keep going till we exhaust all the try-attempts (5 at the moment) instead of bailing out. migrate_pages() bails out right away on -ENOMEM because it is considered a fatal error. Do the same here instead of keep going and retrying. Note that this is not fixing a real issue, just a cosmetic change. Although we can save some cycles by backing off ealier Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Mike Kravetz --- mm/page_alloc.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1c67c99603a3..689454692de1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, } tries = 0; } else if (++tries == 5) { - ret = ret < 0 ? ret : -EBUSY; + ret = -EBUSY; break; } @@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, ret = migrate_pages(>migratepages, alloc_migration_target, NULL, (unsigned long), cc->mode, MR_CONTIG_RANGE); + + /* +* On -ENOMEM, migrate_pages() bails out right away. It is pointless +* to retry again over this error, so do the same here. +*/ + if (ret == -ENOMEM) + break; } lru_cache_enable(); -- 2.16.3
[PATCH v10 0/7] Make alloc_contig_range handle Hugetlb pages
loc_contig_range: [1b8000, 1c) PFNs busy [ 180.061972] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.063413] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.064838] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy" And then with this patchset running: "Same experiment with ZONE_MOVABLE: a) Free huge pages: all memory can get unplugged again. b) Allocated/populated but idle huge pages: all memory can get unplugged again. c) Allocated/populated but all 512 huge pages are read/written in a loop: all memory can get unplugged again, but I get a single [ 121.192345] alloc_contig_range: [18, 188000) PFNs busy Most probably because it happened to try migrating a huge page while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal with this temporary failure. Last but not least, I did something extreme: # cat /proc/meminfo MemTotal:5061568 kB MemFree: 186560 kB MemAvailable: 354524 kB ... HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd:0 HugePages_Surp:0 Triggering unplug would require to dissolve+alloc - which now fails when trying to allocate an additional ~512 huge pages (1G). As expected, I can properly see memory unplug not fully succeeding. + I get a fairly continuous stream of [ 226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy ... But more importantly, the hugepage count remains stable, as configured by the admin (me): HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd:0 HugePages_Surp:0" Oscar Salvador (7): mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range mm,compaction: Let isolate_migratepages_{range,block} return error codes mm,hugetlb: Drop clearing of flag from prep_new_huge_page mm,hugetlb: Split prep_new_huge_page functionality mm: Make alloc_contig_range handle free hugetlb pages mm: Make alloc_contig_range handle in-use hugetlb pages mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig include/linux/hugetlb.h | 7 +++ mm/compaction.c | 91 - mm/hugetlb.c| 149 ++-- mm/internal.h | 10 +++- mm/page_alloc.c | 22 +++ mm/vmscan.c | 5 +- 6 files changed, 237 insertions(+), 47 deletions(-) -- 2.16.3
Re: [PATCH v9 5/7] mm: Make alloc_contig_range handle free hugetlb pages
On Fri, Apr 16, 2021 at 05:49:20PM +0800, Baoquan He wrote: > On 04/16/21 at 09:00am, Oscar Salvador wrote: > ... > > +/* > > + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old > > one > > + * @h: struct hstate old page belongs to > > + * @old_page: Old page to dissolve > > + * Returns 0 on success, otherwise negated error. > > + */ > > +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page > > *old_page) > > +{ > > + gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; > > + int nid = page_to_nid(old_page); > > + struct page *new_page; > > + int ret = 0; > > + > > + /* > > +* Before dissolving the page, we need to allocate a new one for the > > +* pool to remain stable. Using alloc_buddy_huge_page() allows us to > > +* not having to deal with prep_new_page() and avoids dealing of any > ~ prep_new_huge_page() ? Yep, clumsy. @Andrew: could you please amend that or should I send a new version? Thanks -- Oscar Salvador SUSE L3
[PATCH v9 8/8] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- arch/arm64/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e4e1b6550115..68735831b236 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + def_bool y + config SMP def_bool y -- 2.16.3
[PATCH v9 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- arch/x86/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2792879d398e..9f0211df1746 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y depends on MEMORY_HOTPLUG +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + def_bool y + config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA -- 2.16.3
[PATCH v9 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
Self stored memmap leads to a sparse memory situation which is unsuitable for workloads that requires large contiguous memory chunks, so make this an opt-in which needs to be explicitly enabled. To control this, let memory_hotplug have its own memory space, as suggested by David, so we can add memory_hotplug.memmap_on_memory parameter. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- Documentation/admin-guide/kernel-parameters.txt | 17 + mm/Makefile | 5 - mm/memory_hotplug.c | 10 +- 3 files changed, 30 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..af32c17cd4eb 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2794,6 +2794,23 @@ seconds. Use this parameter to check at some other rate. 0 disables periodic checking. + memory_hotplug.memmap_on_memory + [KNL,X86,ARM] Boolean flag to enable this feature. + Format: {on | off (default)} + When enabled, runtime hotplugged memory will + allocate its internal metadata (struct pages) + from the hotadded memory which will allow to + hotadd a lot of memory without requiring + additional memory to do so. + This feature is disabled by default because it + has some implication on large (e.g. GB) + allocations in some configurations (e.g. small + memory blocks). + The state of the flag can be read in + /sys/module/memory_hotplug/parameters/memmap_on_memory. + Note that even when enabled, there are a few cases where + the feature is not effective. + memtest=[KNL,X86,ARM,PPC] Enable memtest Format: default : 0 diff --git a/mm/Makefile b/mm/Makefile index 72227b24a616..82ae9482f5e3 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o oom_kill.o fadvise.o \ page-alloc-y := page_alloc.o page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o +# Give 'memory_hotplug' its own module-parameter namespace +memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o + obj-y += page-alloc.o obj-y += init-mm.o obj-y += memblock.o +obj-y += $(memory-hotplug-y) ifdef CONFIG_MMU obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o @@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o obj-$(CONFIG_KASAN)+= kasan/ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o -obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 5ef626926449..b93949a84d4a 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -42,7 +42,15 @@ #include "internal.h" #include "shuffle.h" -static bool memmap_on_memory; + +/* + * memory_hotplug.memmap_on_memory parameter + */ +static bool memmap_on_memory __ro_after_init; +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +module_param(memmap_on_memory, bool, 0444); +MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); +#endif /* * online_page_callback contains pointer to current page onlining function. -- 2.16.3
[PATCH v9 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by checking mhp_supports_memmap_on_memory(). MHP_MEMMAP_ON_MEMORY can only be set in case ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports altmap, and the range to be added spans a single memory block. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- drivers/acpi/acpi_memhotplug.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index b02fd51e5589..8cc195c4c861 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) acpi_handle handle = mem_device->device->handle; int result, num_enabled = 0; struct acpi_memory_info *info; + mhp_t mhp_flags = MHP_NONE; int node; node = acpi_get_node(handle); @@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (node < 0) node = memory_add_physaddr_to_nid(info->start_addr); + if (mhp_supports_memmap_on_memory(info->length)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; result = __add_memory(node, info->start_addr, info->length, - MHP_NONE); + mhp_flags); /* * If the memory block has been used by the kernel, add_memory() -- 2.16.3
[PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for the newly added memory section. Currently, alloc_pages_node() is used for those allocations. This has some disadvantages: a) an existing memory is consumed for that purpose (eg: ~2MB per 128MB memory section on x86_64) b) if the whole node is movable then we have off-node struct pages which has performance drawbacks. c) It might be there are no PMD_ALIGNED chunks so memmap array gets populated with base pages. This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled. Vmemap page tables can map arbitrary memory. That means that we can simply use the beginning of each memory section and map struct pages there. struct pages which back the allocated space then just need to be treated carefully. Implementation wise we will reuse vmem_altmap infrastructure to override the default allocator used by __populate_section_memmap. Part of the implementation also relies on memory_block structure gaining a new field which specifies the number of vmemmap_pages at the beginning. This patch also introduces the following functions: - mhp_init_memmap_on_memory: Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages fully span. - mhp_deinit_memmap_on_memory: Undoes what mhp_init_memmap_on_memory. The new function memory_block_online() calls mhp_init_memmap_on_memory() before doing the actual online_pages(). Should online_pages() fail, we clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of present_pages is done at the end once we know that online_pages() succedeed. On offline, memory_block_offline() needs to unaccount vmemmap pages from present_pages() before calling offline_pages(). This is necessary because offline_pages() tears down some structures based on the fact whether the node or the zone become empty. If offline_pages() fails, we account back vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory(). Hot-remove: We need to be careful when removing memory, as adding and removing memory needs to be done with the same granularity. To check that this assumption is not violated, we check the memory range we want to remove and if a) any memory block has vmemmap pages and b) the range spans more than a single memory block, we scream out loud and refuse to proceed. If all is good and the range was using memmap on memory (aka vmemmap pages), we construct an altmap structure so free_hugepage_table does the right thing and calls vmem_altmap_free instead of free_pagetable. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- drivers/base/memory.c | 71 -- include/linux/memory.h | 8 ++- include/linux/memory_hotplug.h | 15 +++- include/linux/memremap.h | 2 +- include/linux/mmzone.h | 7 +- mm/Kconfig | 5 ++ mm/memory_hotplug.c| 159 ++--- mm/sparse.c| 2 - 8 files changed, 247 insertions(+), 22 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f209925a5d4e..2e2b2f654f0a 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + struct zone *zone; + int ret; + + zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages); + + /* +* Although vmemmap pages have a different lifecycle than the pages +* they describe (they remain until the memory is unplugged), doing +* their initialization and accounting at memory onlining/offlining +* stage simplifies things a lot. +*/ + if (nr_vmemmap_pages) { + ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); + if (ret) + return ret; + } + + ret = online_pages(start_pfn + nr_vmemmap_pages, + nr_pages - nr_vmemmap_pages, zone); + if (ret) { + if (nr_vmemmap_pages) + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); + return ret; + } + + /* +* Account once onlining succeeded. If the zone was unpopulated, it is +* now already properly populated. +*/ + if (nr_vmemmap_pages) + adjust_present_page_count(zone, nr_vmemmap_pages); - return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); + return ret; } static int memory_block_offline(struct
[PATCH v9 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()
From: David Hildenbrand Let's have a single place (inspired by adjust_managed_page_count()) where we adjust present pages. In contrast to adjust_managed_page_count(), only memory onlining/offlining is allowed to modify the number of present pages. Signed-off-by: David Hildenbrand Signed-off-by: Oscar Salvador Reviewed-by: Oscar Salvador --- mm/memory_hotplug.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 25e59d5dc13c..d05056b3c173 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, return default_zone_for_pfn(nid, start_pfn, nr_pages); } +static void adjust_present_page_count(struct zone *zone, long nr_pages) +{ + unsigned long flags; + + zone->present_pages += nr_pages; + pgdat_resize_lock(zone->zone_pgdat, ); + zone->zone_pgdat->node_present_pages += nr_pages; + pgdat_resize_unlock(zone->zone_pgdat, ); +} + int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type, int nid) { @@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, } online_pages_range(pfn, nr_pages); - zone->present_pages += nr_pages; - - pgdat_resize_lock(zone->zone_pgdat, ); - zone->zone_pgdat->node_present_pages += nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, ); + adjust_present_page_count(zone, nr_pages); node_states_set_node(nid, ); if (need_zonelists_rebuild) @@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) /* removal success */ adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); - zone->present_pages -= nr_pages; - - pgdat_resize_lock(zone->zone_pgdat, ); - zone->zone_pgdat->node_present_pages -= nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, ); + adjust_present_page_count(zone, -nr_pages); init_per_zone_wmark_min(); -- 2.16.3
[PATCH v9 0/8] Allocate memmap from hotadded memory (per device)
mory_hotplug: Factor out adjusting present pages into adjust_present_page_count() Oscar Salvador (7): drivers/base/memory: Introduce memory_block_{online,offline} mm,memory_hotplug: Relax fully spanned sections check mm,memory_hotplug: Allocate memmap from the added memory range acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE Documentation/admin-guide/kernel-parameters.txt | 17 ++ arch/arm64/Kconfig | 3 + arch/x86/Kconfig| 3 + drivers/acpi/acpi_memhotplug.c | 5 +- drivers/base/memory.c | 100 ++-- include/linux/memory.h | 8 +- include/linux/memory_hotplug.h | 15 +- include/linux/memremap.h| 2 +- include/linux/mmzone.h | 7 +- mm/Kconfig | 5 + mm/Makefile | 5 +- mm/memory_hotplug.c | 205 +--- mm/sparse.c | 2 - 13 files changed, 330 insertions(+), 47 deletions(-) -- 2.16.3
[PATCH v9 2/8] mm,memory_hotplug: Relax fully spanned sections check
When using self-hosted vmemmap pages, the number of pages passed to {online,offline}_pages might not fully span sections, but they always fully span pageblocks. Relax the check account for that case. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- mm/memory_hotplug.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0cdbbfbc5757..25e59d5dc13c 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int ret; struct memory_notify arg; - /* We can only online full sections (e.g., SECTION_IS_ONLINE) */ + /* We can only offline full sections (e.g., SECTION_IS_ONLINE). +* However, when using e.g: memmap_on_memory, some pages are initialized +* prior to calling in here. The remaining amount of pages must be +* pageblock aligned. +*/ if (WARN_ON_ONCE(!nr_pages || -!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))) +!IS_ALIGNED(pfn, pageblock_nr_pages) || +!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); @@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) int ret, node; char *reason; - /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */ + /* We can only offline full sections (e.g., SECTION_IS_ONLINE). +* However, when using e.g: memmap_on_memory, some pages are initialized +* prior to calling in here. The remaining amount of pages must be +* pageblock aligned. +*/ if (WARN_ON_ONCE(!nr_pages || -!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION))) +!IS_ALIGNED(start_pfn, pageblock_nr_pages) || +!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); -- 2.16.3
[PATCH v9 1/8] drivers/base/memory: Introduce memory_block_{online,offline}
This is a preparatory patch that introduces two new functions: memory_block_online() and memory_block_offline(). For now, these functions will only call online_pages() and offline_pages() respectively, but they will be later in charge of preparing the vmemmap pages, carrying out the initialization and proper accounting of such pages. Since memory_block struct contains all the information, pass this struct down the chain till the end functions. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- drivers/base/memory.c | 33 + 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f35298425575..f209925a5d4e 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v) return blocking_notifier_call_chain(_chain, val, v); } +static int memory_block_online(struct memory_block *mem) +{ + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + + return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); +} + +static int memory_block_offline(struct memory_block *mem) +{ + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + + return offline_pages(start_pfn, nr_pages); +} + /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. */ static int -memory_block_action(unsigned long start_section_nr, unsigned long action, - int online_type, int nid) +memory_block_action(struct memory_block *mem, unsigned long action) { - unsigned long start_pfn; - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; int ret; - start_pfn = section_nr_to_pfn(start_section_nr); - switch (action) { case MEM_ONLINE: - ret = online_pages(start_pfn, nr_pages, online_type, nid); + ret = memory_block_online(mem); break; case MEM_OFFLINE: - ret = offline_pages(start_pfn, nr_pages); + ret = memory_block_offline(mem); break; default: WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: " -"%ld\n", __func__, start_section_nr, action, action); +"%ld\n", __func__, mem->start_section_nr, action, action); ret = -EINVAL; } @@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block *mem, if (to_state == MEM_OFFLINE) mem->state = MEM_GOING_OFFLINE; - ret = memory_block_action(mem->start_section_nr, to_state, - mem->online_type, mem->nid); - + ret = memory_block_action(mem, to_state); mem->state = ret ? from_state_req : to_state; return ret; -- 2.16.3
Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
es; BUG_ON(check_hotplug_memory_range(start, size)); @@ -1848,6 +1966,31 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) if (rc) return rc; + /* +* We only support removing memory added with MHP_MEMMAP_ON_MEMORY in +* the same granularity it was added - a single memory block. +*/ + if (memmap_on_memory) { + nr_vmemmap_pages = walk_memory_blocks(start, size, NULL, + get_nr_vmemmap_pages_cb); + if (nr_vmemmap_pages) { + if (size != memory_block_size_bytes()) { + pr_warn("Refuse to remove %#llx - %#llx," + "wrong granularity\n", + start, start + size); + return -EINVAL; + } + + /* +* Let remove_pmd_table->free_hugepage_table do the +* right thing if we used vmem_altmap when hot-adding +* the range. +*/ + mhp_altmap.alloc = nr_vmemmap_pages; + altmap = _altmap; + } + } + /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); @@ -1859,7 +2002,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) mem_hotplug_begin(); - arch_remove_memory(nid, start, size, NULL); + arch_remove_memory(nid, start, size, altmap); if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) { memblock_free(start, size); diff --git a/mm/sparse.c b/mm/sparse.c index 7bd23f9d6cef..8e96cf00536b 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -623,7 +623,6 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn) } } -#ifdef CONFIG_MEMORY_HOTREMOVE /* Mark all memory sections within the pfn range as offline */ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn) { @@ -644,7 +643,6 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn) ms->section_mem_map &= ~SECTION_IS_ONLINE; } } -#endif #ifdef CONFIG_SPARSEMEM_VMEMMAP static struct page * __meminit populate_section_memmap(unsigned long pfn, -- Oscar Salvador SUSE L3
Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
On Fri, Apr 16, 2021 at 12:33:34PM +0200, David Hildenbrand wrote: > IIRC, we have to add the zero shadow first, before touching the memory. This > is also what mm/memremap.c does. > > In mhp_deinit_memmap_on_memory(), you already remove in the proper > (reversed) order :) But looking at online_pages(), we do it after the move_pfn_range_to_zone(), right? AFAIK (I might be well wrong here), memory_notify() will eventually call kasan_add_zero_shadow? So that comes after the move_pfn_range_to_zone? Or is my understanding incorrect? -- Oscar Salvador SUSE L3
[PATCH v8 8/8] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- arch/arm64/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e4e1b6550115..68735831b236 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + def_bool y + config SMP def_bool y -- 2.16.3
[PATCH v8 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- arch/x86/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2792879d398e..9f0211df1746 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y depends on MEMORY_HOTPLUG +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + def_bool y + config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA -- 2.16.3
[PATCH v8 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
Self stored memmap leads to a sparse memory situation which is unsuitable for workloads that requires large contiguous memory chunks, so make this an opt-in which needs to be explicitly enabled. To control this, let memory_hotplug have its own memory space, as suggested by David, so we can add memory_hotplug.memmap_on_memory parameter. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- Documentation/admin-guide/kernel-parameters.txt | 17 + mm/Makefile | 5 - mm/memory_hotplug.c | 10 +- 3 files changed, 30 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..af32c17cd4eb 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2794,6 +2794,23 @@ seconds. Use this parameter to check at some other rate. 0 disables periodic checking. + memory_hotplug.memmap_on_memory + [KNL,X86,ARM] Boolean flag to enable this feature. + Format: {on | off (default)} + When enabled, runtime hotplugged memory will + allocate its internal metadata (struct pages) + from the hotadded memory which will allow to + hotadd a lot of memory without requiring + additional memory to do so. + This feature is disabled by default because it + has some implication on large (e.g. GB) + allocations in some configurations (e.g. small + memory blocks). + The state of the flag can be read in + /sys/module/memory_hotplug/parameters/memmap_on_memory. + Note that even when enabled, there are a few cases where + the feature is not effective. + memtest=[KNL,X86,ARM,PPC] Enable memtest Format: default : 0 diff --git a/mm/Makefile b/mm/Makefile index 72227b24a616..82ae9482f5e3 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o oom_kill.o fadvise.o \ page-alloc-y := page_alloc.o page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o +# Give 'memory_hotplug' its own module-parameter namespace +memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o + obj-y += page-alloc.o obj-y += init-mm.o obj-y += memblock.o +obj-y += $(memory-hotplug-y) ifdef CONFIG_MMU obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o @@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o obj-$(CONFIG_KASAN)+= kasan/ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o -obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 7efee41268b6..54c78c486f1f 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -42,7 +42,15 @@ #include "internal.h" #include "shuffle.h" -static bool memmap_on_memory; + +/* + * memory_hotplug.memmap_on_memory parameter + */ +static bool memmap_on_memory __ro_after_init; +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +module_param(memmap_on_memory, bool, 0444); +MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); +#endif /* * online_page_callback contains pointer to current page onlining function. -- 2.16.3
[PATCH v8 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by checking mhp_supports_memmap_on_memory(). MHP_MEMMAP_ON_MEMORY can only be set in case ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports altmap, and the range to be added spans a single memory block. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- drivers/acpi/acpi_memhotplug.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index b02fd51e5589..8cc195c4c861 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) acpi_handle handle = mem_device->device->handle; int result, num_enabled = 0; struct acpi_memory_info *info; + mhp_t mhp_flags = MHP_NONE; int node; node = acpi_get_node(handle); @@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (node < 0) node = memory_add_physaddr_to_nid(info->start_addr); + if (mhp_supports_memmap_on_memory(info->length)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; result = __add_memory(node, info->start_addr, info->length, - MHP_NONE); + mhp_flags); /* * If the memory block has been used by the kernel, add_memory() -- 2.16.3
[PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for the newly added memory section. Currently, alloc_pages_node() is used for those allocations. This has some disadvantages: a) an existing memory is consumed for that purpose (eg: ~2MB per 128MB memory section on x86_64) b) if the whole node is movable then we have off-node struct pages which has performance drawbacks. c) It might be there are no PMD_ALIGNED chunks so memmap array gets populated with base pages. This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled. Vmemap page tables can map arbitrary memory. That means that we can simply use the beginning of each memory section and map struct pages there. struct pages which back the allocated space then just need to be treated carefully. Implementation wise we will reuse vmem_altmap infrastructure to override the default allocator used by __populate_section_memmap. Part of the implementation also relies on memory_block structure gaining a new field which specifies the number of vmemmap_pages at the beginning. This patch also introduces the following functions: - mhp_init_memmap_on_memory: Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls kasan_add_zero_shadow(), and onlines as many sections as vmemmap pages fully span. - mhp_deinit_memmap_on_memory: Undoes what mhp_init_memmap_on_memory. The new function memory_block_online() calls mhp_init_memmap_on_memory() before doing the actual online_pages(). Should online_pages() fail, we clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of present_pages is done at the end once we know that online_pages() succedeed. On offline, memory_block_offline() needs to unaccount vmemmap pages from present_pages() before calling offline_pages(). This is necessary because offline_pages() tears down some structures based on the fact whether the node or the zone become empty. If offline_pages() fails, we account back vmemmap pages. If it succeeds, we call mhp_deinit_memmap_on_memory(). Hot-remove: We need to be careful when removing memory, as adding and removing memory needs to be done with the same granularity. To check that this assumption is not violated, we check the memory range we want to remove and if a) any memory block has vmemmap pages and b) the range spans more than a single memory block, we scream out loud and refuse to proceed. If all is good and the range was using memmap on memory (aka vmemmap pages), we construct an altmap structure so free_hugepage_table does the right thing and calls vmem_altmap_free instead of free_pagetable. Signed-off-by: Oscar Salvador --- drivers/base/memory.c | 75 -- include/linux/memory.h | 8 +- include/linux/memory_hotplug.h | 17 +++- include/linux/memremap.h | 2 +- include/linux/mmzone.h | 7 +- mm/Kconfig | 5 ++ mm/memory_hotplug.c| 171 ++--- mm/sparse.c| 2 - 8 files changed, 265 insertions(+), 22 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f209925a5d4e..179857d53982 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + struct zone *zone; + int ret; + + zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid, + mem->online_type); + + /* +* Although vmemmap pages have a different lifecycle than the pages +* they describe (they remain until the memory is unplugged), doing +* its initialization and accounting at hot-{online,offline} stage +* simplifies things a lot +*/ + if (nr_vmemmap_pages) { + ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); + if (ret) + return ret; + } + + ret = online_pages(start_pfn + nr_vmemmap_pages, + nr_pages - nr_vmemmap_pages, zone); + if (ret) { + if (nr_vmemmap_pages) + mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); + return ret; + } + + /* +* Account once onlining succeeded. If the page was unpopulated, it is +* now already properly populated. +*/ + if (nr_vmemmap_pages) + adjust_present_page_count(zone, nr_vmemmap_pages); - return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); + return ret; } static int memory_block_offline(struct
[PATCH v8 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()
From: David Hildenbrand Let's have a single place (inspired by adjust_managed_page_count()) where we adjust present pages. In contrast to adjust_managed_page_count(), only memory onlining/offlining is allowed to modify the number of present pages. Signed-off-by: David Hildenbrand Signed-off-by: Oscar Salvador Reviewed-by: Oscar Salvador --- mm/memory_hotplug.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 25e59d5dc13c..d05056b3c173 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, return default_zone_for_pfn(nid, start_pfn, nr_pages); } +static void adjust_present_page_count(struct zone *zone, long nr_pages) +{ + unsigned long flags; + + zone->present_pages += nr_pages; + pgdat_resize_lock(zone->zone_pgdat, ); + zone->zone_pgdat->node_present_pages += nr_pages; + pgdat_resize_unlock(zone->zone_pgdat, ); +} + int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type, int nid) { @@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, } online_pages_range(pfn, nr_pages); - zone->present_pages += nr_pages; - - pgdat_resize_lock(zone->zone_pgdat, ); - zone->zone_pgdat->node_present_pages += nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, ); + adjust_present_page_count(zone, nr_pages); node_states_set_node(nid, ); if (need_zonelists_rebuild) @@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) /* removal success */ adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); - zone->present_pages -= nr_pages; - - pgdat_resize_lock(zone->zone_pgdat, ); - zone->zone_pgdat->node_present_pages -= nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, ); + adjust_present_page_count(zone, -nr_pages); init_per_zone_wmark_min(); -- 2.16.3
[PATCH v8 1/8] drivers/base/memory: Introduce memory_block_{online,offline}
This is a preparatory patch that introduces two new functions: memory_block_online() and memory_block_offline(). For now, these functions will only call online_pages() and offline_pages() respectively, but they will be later in charge of preparing the vmemmap pages, carrying out the initialization and proper accounting of such pages. Since memory_block struct contains all the information, pass this struct down the chain till the end functions. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- drivers/base/memory.c | 33 + 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f35298425575..f209925a5d4e 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v) return blocking_notifier_call_chain(_chain, val, v); } +static int memory_block_online(struct memory_block *mem) +{ + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + + return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); +} + +static int memory_block_offline(struct memory_block *mem) +{ + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + + return offline_pages(start_pfn, nr_pages); +} + /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. */ static int -memory_block_action(unsigned long start_section_nr, unsigned long action, - int online_type, int nid) +memory_block_action(struct memory_block *mem, unsigned long action) { - unsigned long start_pfn; - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; int ret; - start_pfn = section_nr_to_pfn(start_section_nr); - switch (action) { case MEM_ONLINE: - ret = online_pages(start_pfn, nr_pages, online_type, nid); + ret = memory_block_online(mem); break; case MEM_OFFLINE: - ret = offline_pages(start_pfn, nr_pages); + ret = memory_block_offline(mem); break; default: WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: " -"%ld\n", __func__, start_section_nr, action, action); +"%ld\n", __func__, mem->start_section_nr, action, action); ret = -EINVAL; } @@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block *mem, if (to_state == MEM_OFFLINE) mem->state = MEM_GOING_OFFLINE; - ret = memory_block_action(mem->start_section_nr, to_state, - mem->online_type, mem->nid); - + ret = memory_block_action(mem, to_state); mem->state = ret ? from_state_req : to_state; return ret; -- 2.16.3
[PATCH v8 2/8] mm,memory_hotplug: Relax fully spanned sections check
When using self-hosted vmemmap pages, the number of pages passed to {online,offline}_pages might not fully span sections, but they always fully span pageblocks. Relax the check account for that case. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- mm/memory_hotplug.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0cdbbfbc5757..25e59d5dc13c 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int ret; struct memory_notify arg; - /* We can only online full sections (e.g., SECTION_IS_ONLINE) */ + /* We can only offline full sections (e.g., SECTION_IS_ONLINE). +* However, when using e.g: memmap_on_memory, some pages are initialized +* prior to calling in here. The remaining amount of pages must be +* pageblock aligned. +*/ if (WARN_ON_ONCE(!nr_pages || -!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))) +!IS_ALIGNED(pfn, pageblock_nr_pages) || +!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); @@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) int ret, node; char *reason; - /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */ + /* We can only offline full sections (e.g., SECTION_IS_ONLINE). +* However, when using e.g: memmap_on_memory, some pages are initialized +* prior to calling in here. The remaining amount of pages must be +* pageblock aligned. +*/ if (WARN_ON_ONCE(!nr_pages || -!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION))) +!IS_ALIGNED(start_pfn, pageblock_nr_pages) || +!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); -- 2.16.3
[PATCH v8 0/8] Allocate memmap from hotadded memory (per device)
Changes from v7 -> v8: - Addressed feedback from David for patch#4 Changes from v6 -> v7: - Fix check from "mm,memory_hotplug: Relax fully spanned sections check" - Add fixup from "mm,memory_hotplug: Allocate memmap from the added memory range" - Add Reviewed-by from David for patch#2 - Fix changelog in "mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()" Changes from v5 -> v6: - Create memory_block_{online,offline} functions - Create vmemmap_* functions to deal with vmemmap stuff, so {online,offline}_pages remain untouched - Add adjust_present_page_count's patch from David - Relax check in {offline,online}_pages - Rework changelogs Changes from v4 -> v5: - Addressed feedback from David (patch#1) - Tested on x86_64 with different struct page sizes and on large/small memory blocks - Tested on arm64 with 4K, 64K (with and without THP) and with different struct page sizes NOTE: We might need to make this feature and hugetlb-vmemmap feature [1] mutually exclusive. I raised an issue I see in [2]. Hugetlb-vmemmap feature has been withdrawn for the time being due to the need in further changes wrt. locking/freeing context. I will keep an eye, and when the time comes again I will see how the two features play together and how one another can be disabled when needed. Changes from v3 -> v4: - Addressed feedback from David - Wrap memmap_on_memory module thingy with #ifdef on MHP_MEMMAP_ON_MEMORY - Move "depend on MEMORY_HOTPLUG" to MHP_MEMMAP_ON_MEMORY in generic mm/Kconfig - Collect David's Reviewed-bys Changes from v2 -> v3: - Addressed feedback from David - Squash former patch#4 and and patch#5 into patch#1 - Fix config dependency CONFIR_SPARSE_VMEMMAP vs CONFIG_SPARSE_VMEMMAP_ENABLE - Simplify module parameter functions Changes from v1 -> v2 - Addressed feedback from David - Fence off the feature in case struct page size is not multiple of PMD size or pageblock alignment cannot be guaranted - Tested on x86_64 small and large memory_blocks - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot my VM with 16KB page size). Arm64 with 4KB page size behaves like x86_64 after [1], which made section size smaller. With 64KB, the feature gets fenced off due to pageblock alignment. Changes from RFCv3 -> v1: - Addressed feedback from David - Re-order patches Changes from v2 -> v3 (RFC): - Re-order patches (Michal) - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1 - Add kernel boot option to enable this feature (Michal) Changes from v1 -> v2 (RFC): - Addressed feedback provided by David - Add a arch_support_memmap_on_memory to be called from mhp_supports_memmap_on_memory, as atm, only ARM, powerpc and x86_64 have altmat support. [1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraj... Original cover letter: The primary goal of this patchset is to reduce memory overhead of the hot-added memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use to populate memmap (struct page array) has two main drawbacks: a) it consumes an additional memory until the hotadded memory itself is onlined and b) memmap might end up on a different numa node which is especially true for movable_node configuration. c) due to fragmentation we might end up populating memmap with base pages One way to mitigate all these issues is to simply allocate memmap array (which is the largest memory footprint of the physical memory hotplug) from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows us to map any pfn range so the memory doesn't need to be online to be usable for the array. See patch 4 for more details. This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set. [Overall design]: Implementation wise we reuse vmem_altmap infrastructure to override the default allocator used by vmemap_populate. memory_block structure gains a new field called nr_vmemmap_pages, which accounts for the number of vmemmap pages used by that memory_block. E.g: On x86_64, that is 512 vmemmap pages on small memory bloks and 4096 on large memory blocks (1GB) We also introduce new two functions: memory_block_{online,offline}. These functions take care of initializing/unitializing vmemmap pages prior to calling {online,offline}_pages, so the latter functions can remain totally untouched. More details can be found in the respective changelogs. David Hildenbrand (1): mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count() Oscar Salvador (7): drivers/base/memory: Introduce memory_block_{online,offline} mm,memory_hotplug: Relax fully spanned sections check mm,memory_hotplug: Allocate memmap from the added memory range acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when
Re: [PATCH v7 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
On Thu, Apr 15, 2021 at 01:19:59PM +0200, David Hildenbrand wrote: > > Implementation wise we will reuse vmem_altmap infrastructure to override > > the default allocator used by __populate_section_memmap. > > Part of the implementation also relies on memory_block structure gaining > > a new field which specifies the number of vmemmap_pages at the beginning. > > This patch also introduces the following functions: > > > > - vmemmap_init_space: Initializes vmemmap pages by calling > > move_pfn_range_to_zone(), > >calls kasan_add_zero_shadow() or the vmemmap range and > > marks > >online as many sections as vmemmap pages fully span. > > - vmemmap_adjust_pages: Accounts/substract vmemmap_pages to node and zone > > present_pages > > - vmemmap_deinit_space: Undoes what vmemmap_init_space does. > > > > This is a bit asynchronous; and the function names are not really expressing > what is being done :) I'll try to come up with better names below. Yeah, was not happy either with the names but at that time I could not come up with anything better. > It is worth mentioning that the real "mess" is that we want offline_pages() > to properly handle zone->present_pages going to 0. Therefore, we want to > manually mess with the present page count. This should be explained by this: "On offline, memory_block_offline() calls vmemmap_adjust_pages() prior to calling offline_pages(), because offline_pages() performs the tearing-down of kthreads and the rebuilding of the zonelists if the node/zone become empty." Is not that clear? > I suggest detecting the zone in here and just passing it down to > online_pages(). Ok, makes sense. > My take would be doing the present page adjustment after onlining succeeded: > > static int memory_block_online(struct memory_block *mem) > { > unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); > unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; > unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; > struct zone *zone; > int ret; > > zone = mhp_get_target_zone(mem->nid, mem->online_type); > > if (nr_vmemmap_pages) { > ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, > zone); > if (ret) > return ret; > } > > ret = online_pages(start_pfn + nr_vmemmap_pages, nr_pages - > nr_vmemmap_pages, zone); > if (ret) { > if (nr_vmemmap_pages) > mhp_deinit_memmap_on_memory(start_pfn, > nr_vmemmap_pages); > return ret; > } > > /* >* Account once onlining succeeded. If the page was unpopulated, >* it is now already properly populated. >*/ > if (nr_vmemmap_pages) > adjust_present_page_count(zone, nr_vmemmap_pages); > return 0; > } > > And the opposite: > > static int memory_block_offline(struct memory_block *mem) > { > unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); > unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; > unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; > struct zone *zone; > int ret; > > zone = page_zone(pfn_to_page(start_pfn)); > > > /* >* Unaccount before offlining, such that unpopulated zones can >* properly be torn down in offline_pages(). >*/ > if (nr_vmemmap_pages) > adjust_present_page_count(zone, -nr_vmemmap_pages); > > ret = offline_pages(start_pfn + nr_vmemmap_pages, nr_pages - > nr_vmemmap_pages); > if (ret) { > if (nr_vmemmap_pages) > adjust_present_page_count(zone, +nr_vmemmap_pages); > return ret; > } > > if (nr_vmemmap_pages) > mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages); > return 0; > } > > Having to do the present page adjustment manually is not completely nice, > but it's easier than manually having to mess with zones becomming > populated/unpopulated > outside of online_pages()/offline_pages(). Ok, I like this, and the naming is much better. I will work on this shortly. thanks David! -- Oscar Salvador SUSE L3
[PATCH v9 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
pfn_range_valid_contig() bails out when it finds an in-use page or a hugetlb page, among other things. We can drop the in-use page check since __alloc_contig_pages can migrate away those pages, and the hugetlb page check can go too since isolate_migratepages_range is now capable of dealing with hugetlb pages. Either way, those checks are racy so let the end function handle it when the time comes. Signed-off-by: Oscar Salvador Suggested-by: David Hildenbrand Reviewed-by: David Hildenbrand Acked-by: Mike Kravetz Acked-by: Michal Hocko --- mm/page_alloc.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b5a94de3cdde..c5338e912ace 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, if (PageReserved(page)) return false; - - if (page_count(page) > 0) - return false; - - if (PageHuge(page)) - return false; } return true; } -- 2.16.3
[PATCH v9 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Signed-off-by: Oscar Salvador Reviewed-by: Mike Kravetz Acked-by: Michal Hocko --- include/linux/hugetlb.h | 5 +++-- mm/compaction.c | 12 +++- mm/hugetlb.c| 24 ++-- mm/vmscan.c | 5 +++-- 4 files changed, 35 insertions(+), 11 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b2d2118bfd1a..b92f25ccef58 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,7 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; -int isolate_or_dissolve_huge_page(struct page *page); +int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; -static inline int isolate_or_dissolve_huge_page(struct page *page) +static inline int isolate_or_dissolve_huge_page(struct page *page, + struct list_head *list) { return -ENOMEM; } diff --git a/mm/compaction.c b/mm/compaction.c index af2e8e194e50..84fde270ae74 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -907,7 +907,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, } if (PageHuge(page) && cc->alloc_contig) { - ret = isolate_or_dissolve_huge_page(page); + ret = isolate_or_dissolve_huge_page(page, >migratepages); /* * Fail isolation in case isolate_or_dissolve_huge_page() @@ -921,6 +921,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; } + if (PageHuge(page)) { + /* +* Hugepage was successfully isolated and placed +* on the cc->migratepages list. +*/ + low_pfn += compound_nr(page) - 1; + goto isolate_success_no_list; + } + /* * Ok, the hugepage was dissolved. Now these pages are * Buddy and cannot be re-allocated because they are @@ -1062,6 +1071,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, isolate_success: list_add(>lru, >migratepages); +isolate_success_no_list: cc->nr_migratepages += compound_nr(page); nr_isolated += compound_nr(page); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index c588eed97c5a..6eb3b6ca4264 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2270,9 +2270,11 @@ static void restore_reserve_on_error(struct hstate *h, * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one * @h: struct hstate old page belongs to * @old_page: Old page to dissolve + * @list: List to isolate the page in case we need to * Returns 0 on success, otherwise negated error. */ -static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, + struct list_head *list) { gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; int nid = page_to_nid(old_page); @@ -2299,9 +2301,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) goto free_new; } else if (page_count(old_page)) { /* -* Someone has grabbed the page, fail for now. +* Someone has grabbed the page, try to isolate it here. +* Fail with -EBUSY if not possible. */ - ret = -EBUSY; + spin_unlock_irq(_lock); + if (!isolate_huge_page(old_page, list)) + ret = -EBUSY; + spin_lock_irq(_lock); goto free_new; } else if (!HPageFreed(old_page)) { /* @@ -2351,10 +2357,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) return ret; } -int isolate_or_dissolve_huge_pa
[PATCH v9 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page
Pages allocated via the page allocator or CMA get its private field cleared by means of post_alloc_hook(). Pages allocated during boot, that is directly from the memblock allocator, get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page() before any memblock allocation. Based on this ground, let us remove the clearing of the flag from prep_new_huge_page() as it is not needed. This was a leftover from 6c0371490140 ("hugetlb: convert PageHugeFreed to HPageFreed flag"). Previously the explicit clearing was necessary because compound allocations do not get this initialization (see prep_compound_page). Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Reviewed-by: David Hildenbrand Reviewed-by: Mike Kravetz --- mm/hugetlb.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 54d81d5947ed..2cb9fa79cbaa 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) spin_lock_irq(_lock); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; - ClearHPageFreed(page); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v9 5/7] mm: Make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the range we are trying to allocate, even when that page is free and can be easily reallocated. This has proved to be problematic for some users of alloc_contic_range, e.g: CMA and virtio-mem, where those would fail the call even when those pages lay in ZONE_MOVABLE and are free. We can do better by trying to replace such page. Free hugepages are tricky to handle so as to no userspace application notices disruption, we need to replace the current free hugepage with a new one. In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugepage, and if it succeeds, it will replace the old one in the free hugepage pool. The free page replacement is done under hugetlb_lock, so no external users of hugetlb will notice the change. To allocate the new huge page, we use alloc_buddy_huge_page(), so we do not have to deal with any counters, and prep_new_huge_page() is not called. This is valulable because in case we need to free the new page, we only need to call __free_pages(). Once we know that the page to be replaced is a genuine 0-refcounted huge page, we remove the old page from the freelist by remove_hugetlb_page(). Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page() for the new huge page to properly initialize it and increment the hstate->nr_huge_pages counter (previously decremented by remove_hugetlb_page()). Once done, the page is enqueued by enqueue_huge_page() and it is ready to be used. There is one tricky case when page's refcount is 0 because it is in the process of being released. A missing PageHugeFreed bit will tell us that freeing is in flight so we retry after dropping the hugetlb_lock. The race window should be small and the next retry should make a forward progress. E.g: CPU0CPU1 free_huge_page()isolate_or_dissolve_huge_page PageHuge() == T alloc_and_dissolve_huge_page alloc_buddy_huge_page() spin_lock_irq(hugetlb_lock) // PageHuge() && !PageHugeFreed && // !PageCount() spin_unlock_irq(hugetlb_lock) spin_lock_irq(hugetlb_lock) 1) update_and_free_page PageHuge() == F __free_pages() 2) enqueue_huge_page SetPageHugeFreed() spin_unlock_irq(_lock) spin_lock_irq(hugetlb_lock) 1) PageHuge() == F (freed by case#1 from CPU0) 2) PageHuge() == T PageHugeFreed() == T - proceed with replacing the page In the case above we retry as the window race is quite small and we have high chances to succeed next time. With regard to the allocation, we restrict it to the node the page belongs to with __GFP_THISNODE, meaning we do not fallback on other node's zones. Note that gigantic hugetlb pages are fenced off since there is a cyclic dependency between them and alloc_contig_range. Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Acked-by: David Hildenbrand Reviewed-by: Mike Kravetz --- include/linux/hugetlb.h | 6 +++ mm/compaction.c | 33 -- mm/hugetlb.c| 116 3 files changed, 152 insertions(+), 3 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 09f1fd12a6fa..b2d2118bfd1a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,6 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; +int isolate_or_dissolve_huge_page(struct page *page); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; +static inline int isolate_or_dissolve_huge_page(struct page *page) +{ + return -ENOMEM; +} + static inline struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) diff --git a/mm/compaction.c b/mm/compaction.c index 110b90d5fff5..af2e8e194e50 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat) * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. * Returns errno, like -EAGAIN or -EINTR
[PATCH v9 4/7] mm,hugetlb: Split prep_new_huge_page functionality
Currently, prep_new_huge_page() performs two functions. It sets the right state for a new hugetlb, and increases the hstate's counters to account for the new page. Let us split its functionality into two separate functions, decoupling the handling of the counters from initializing a hugepage. The outcome is having __prep_new_huge_page(), which only initializes the page , and __prep_account_new_huge_page(), which adds the new page to the hstate's counters. This allows us to be able to set a hugetlb without having to worry about the counter/locking. It will prove useful in the next patch. prep_new_huge_page() still calls both functions. Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Reviewed-by: Mike Kravetz Reviewed-by: David Hildenbrand --- mm/hugetlb.c | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 2cb9fa79cbaa..6f39ec79face 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1483,16 +1483,30 @@ void free_huge_page(struct page *page) } } -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +/* + * Must be called with the hugetlb lock held + */ +static void __prep_account_new_huge_page(struct hstate *h, int nid) +{ + lockdep_assert_held(_lock); + h->nr_huge_pages++; + h->nr_huge_pages_node[nid]++; +} + +static void __prep_new_huge_page(struct page *page) { INIT_LIST_HEAD(>lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); hugetlb_set_page_subpool(page, NULL); set_hugetlb_cgroup(page, NULL); set_hugetlb_cgroup_rsvd(page, NULL); +} + +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +{ + __prep_new_huge_page(page); spin_lock_irq(_lock); - h->nr_huge_pages++; - h->nr_huge_pages_node[nid]++; + __prep_account_new_huge_page(h, nid); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v9 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was any error during isolation. This does not work as soon as we need to start reporting different error codes and make sure we pass them down the chain, so they are properly interpreted by functions like e.g: alloc_contig_range. Let us rework isolate_migratepages_{range,block} so we can report error codes. Since isolate_migratepages_block will stop returning the next pfn to be scanned, we reuse the cc->migrate_pfn field to keep track of that. Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Acked-by: Mike Kravetz Reviewed-by: David Hildenbrand --- mm/compaction.c | 52 +--- mm/internal.h | 10 -- mm/page_alloc.c | 7 +++ 3 files changed, 36 insertions(+), 33 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 8c5028bfbd56..110b90d5fff5 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -787,15 +787,14 @@ static bool too_many_isolated(pg_data_t *pgdat) * * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. - * Returns zero if there is a fatal signal pending, otherwise PFN of the - * first page that was not scanned (which may be both less, equal to or more - * than end_pfn). + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, + * or 0. + * cc->migrate_pfn will contain the next pfn to scan. * * The pages are isolated on cc->migratepages list (not required to be empty), - * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field - * is neither read nor updated. + * and cc->nr_migratepages is updated accordingly. */ -static unsigned long +static int isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, unsigned long end_pfn, isolate_mode_t isolate_mode) { @@ -809,6 +808,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, bool skip_on_failure = false; unsigned long next_skip_pfn = 0; bool skip_updated = false; + int ret = 0; + + cc->migrate_pfn = low_pfn; /* * Ensure that there are not too many pages isolated from the LRU @@ -818,16 +820,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, while (unlikely(too_many_isolated(pgdat))) { /* stop isolation if there are still pages not migrated */ if (cc->nr_migratepages) - return 0; + return -EAGAIN; /* async migration should just abort */ if (cc->mode == MIGRATE_ASYNC) - return 0; + return -EAGAIN; congestion_wait(BLK_RW_ASYNC, HZ/10); if (fatal_signal_pending(current)) - return 0; + return -EINTR; } cond_resched(); @@ -875,8 +877,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (fatal_signal_pending(current)) { cc->contended = true; + ret = -EINTR; - low_pfn = 0; goto fatal_pending; } @@ -1130,7 +1132,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (nr_isolated) count_compact_events(COMPACTISOLATED, nr_isolated); - return low_pfn; + cc->migrate_pfn = low_pfn; + + return ret; } /** @@ -1139,15 +1143,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, * @start_pfn: The first PFN to start isolating. * @end_pfn: The one-past-last PFN. * - * Returns zero if isolation fails fatally due to e.g. pending signal. - * Otherwise, function returns one-past-the-last PFN of isolated page - * (which may be greater than end_pfn if end fell in a middle of a THP page). + * Returns -EAGAIN when contented, -EINTR in case of a signal pending or 0. */ -unsigned long +int isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn, block_start_pfn, block_end_pfn; + int ret = 0; /* Scan block by block. First and last block may be incomplete */ pfn = start_pfn; @@ -1166,17 +1169,17 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, block_end_pfn, cc->zone)) continue; - pfn = isolate_migra
[PATCH v9 0/7] Make alloc_contig_range handle Hugetlb pages
180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy" And then with this patchset running: "Same experiment with ZONE_MOVABLE: a) Free huge pages: all memory can get unplugged again. b) Allocated/populated but idle huge pages: all memory can get unplugged again. c) Allocated/populated but all 512 huge pages are read/written in a loop: all memory can get unplugged again, but I get a single [ 121.192345] alloc_contig_range: [18, 188000) PFNs busy Most probably because it happened to try migrating a huge page while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal with this temporary failure. Last but not least, I did something extreme: # cat /proc/meminfo MemTotal:5061568 kB MemFree: 186560 kB MemAvailable: 354524 kB ... HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd:0 HugePages_Surp:0 Triggering unplug would require to dissolve+alloc - which now fails when trying to allocate an additional ~512 huge pages (1G). As expected, I can properly see memory unplug not fully succeeding. + I get a fairly continuous stream of [ 226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy ... But more importantly, the hugepage count remains stable, as configured by the admin (me): HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd: 0 HugePages_Surp:0" Oscar Salvador (7): mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range mm,compaction: Let isolate_migratepages_{range,block} return error codes mm,hugetlb: Drop clearing of flag from prep_new_huge_page mm,hugetlb: Split prep_new_huge_page functionality mm: Make alloc_contig_range handle free hugetlb pages mm: Make alloc_contig_range handle in-use hugetlb pages mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig include/linux/hugetlb.h | 7 +++ mm/compaction.c | 91 - mm/hugetlb.c| 149 ++-- mm/internal.h | 10 +++- mm/page_alloc.c | 22 +++ mm/vmscan.c | 5 +- 6 files changed, 237 insertions(+), 47 deletions(-) -- 2.16.3
[PATCH v9 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY, and report them down the chain. The problem is that when migrate_pages() reports -ENOMEM, we keep going till we exhaust all the try-attempts (5 at the moment) instead of bailing out. migrate_pages() bails out right away on -ENOMEM because it is considered a fatal error. Do the same here instead of keep going and retrying. Note that this is not fixing a real issue, just a cosmetic change. Although we can save some cycles by backing off ealier Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Mike Kravetz --- mm/page_alloc.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1c67c99603a3..689454692de1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, } tries = 0; } else if (++tries == 5) { - ret = ret < 0 ? ret : -EBUSY; + ret = -EBUSY; break; } @@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, ret = migrate_pages(>migratepages, alloc_migration_target, NULL, (unsigned long), cc->mode, MR_CONTIG_RANGE); + + /* +* On -ENOMEM, migrate_pages() bails out right away. It is pointless +* to retry again over this error, so do the same here. +*/ + if (ret == -ENOMEM) + break; } lru_cache_enable(); -- 2.16.3
Re: [PATCH v7 0/8] Allocate memmap from hotadded memory (per device)
On Thu, Apr 08, 2021 at 02:17:56PM +0200, Oscar Salvador wrote: > Hi, > > I decided to send another version with the fixups included as it seemed a bit > awkward otherwise. It should ease the review. > Sorry for the spam. Gentle ping :-) hint: patch#4 is the one that needs some looking -- Oscar Salvador SUSE L3
[PATCH v8 5/7] mm: Make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the range we are trying to allocate, even when that page is free and can be easily reallocated. This has proved to be problematic for some users of alloc_contic_range, e.g: CMA and virtio-mem, where those would fail the call even when those pages lay in ZONE_MOVABLE and are free. We can do better by trying to replace such page. Free hugepages are tricky to handle so as to no userspace application notices disruption, we need to replace the current free hugepage with a new one. In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugepage, and if it succeeds, it will replace the old one in the free hugepage pool. The free page replacement is done under hugetlb_lock, so no external users of hugetlb will notice the change. To allocate the new huge page, we use alloc_buddy_huge_page(), so we do not have to deal with any counters, and prep_new_huge_page() is not called. This is valulable because in case we need to free the new page, we only need to call __free_pages(). Once we know that the page to be replaced is a genuine 0-refcounted huge page, we remove the old page from the freelist by remove_hugetlb_page(). Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page() for the new huge page to properly initialize it and increment the hstate->nr_huge_pages counter (previously decremented by remove_hugetlb_page()). Once done, the page is enqueued by enqueue_huge_page() and it is ready to be used. There is one tricky case when page's refcount is 0 because it is in the process of being released. A missing PageHugeFreed bit will tell us that freeing is in flight so we retry after dropping the hugetlb_lock. The race window should be small and the next retry should make a forward progress. E.g: CPU0CPU1 free_huge_page()isolate_or_dissolve_huge_page PageHuge() == T alloc_and_dissolve_huge_page alloc_buddy_huge_page() spin_lock_irq(hugetlb_lock) // PageHuge() && !PageHugeFreed && // !PageCount() spin_unlock_irq(hugetlb_lock) spin_lock_irq(hugetlb_lock) 1) update_and_free_page PageHuge() == F __free_pages() 2) enqueue_huge_page SetPageHugeFreed() spin_unlock_irq(_lock) spin_lock_irq(hugetlb_lock) 1) PageHuge() == F (freed by case#1 from CPU0) 2) PageHuge() == T PageHugeFreed() == T - proceed with replacing the page In the case above we retry as the window race is quite small and we have high chances to succeed next time. With regard to the allocation, we restrict it to the node the page belongs to with __GFP_THISNODE, meaning we do not fallback on other node's zones. Note that gigantic hugetlb pages are fenced off since there is a cyclic dependency between them and alloc_contig_range. Signed-off-by: Oscar Salvador Acked-by: Michal Hocko --- include/linux/hugetlb.h | 6 +++ mm/compaction.c | 35 +-- mm/hugetlb.c| 117 3 files changed, 155 insertions(+), 3 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 09f1fd12a6fa..b2d2118bfd1a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,6 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; +int isolate_or_dissolve_huge_page(struct page *page); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; +static inline int isolate_or_dissolve_huge_page(struct page *page) +{ + return -ENOMEM; +} + static inline struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) diff --git a/mm/compaction.c b/mm/compaction.c index 110b90d5fff5..73f00f36b6e7 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat) * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, - * or 0. + *
[PATCH v8 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Signed-off-by: Oscar Salvador Reviewed-by: Mike Kravetz Acked-by: Michal Hocko --- include/linux/hugetlb.h | 5 +++-- mm/compaction.c | 12 +++- mm/hugetlb.c| 24 ++-- mm/vmscan.c | 5 +++-- 4 files changed, 35 insertions(+), 11 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b2d2118bfd1a..b92f25ccef58 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,7 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; -int isolate_or_dissolve_huge_page(struct page *page); +int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; -static inline int isolate_or_dissolve_huge_page(struct page *page) +static inline int isolate_or_dissolve_huge_page(struct page *page, + struct list_head *list) { return -ENOMEM; } diff --git a/mm/compaction.c b/mm/compaction.c index 73f00f36b6e7..8149ffc6bccd 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -907,7 +907,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, } if (PageHuge(page) && cc->alloc_contig) { - ret = isolate_or_dissolve_huge_page(page); + ret = isolate_or_dissolve_huge_page(page, >migratepages); /* * Fail isolation in case isolate_or_dissolve_huge_page() @@ -923,6 +923,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; } + if (PageHuge(page)) { + /* +* Hugepage was successfully isolated and placed +* on the cc->migratepages list. +*/ + low_pfn += compound_nr(page) - 1; + goto isolate_success_no_list; + } + /* * Ok, the hugepage was dissolved. Now these pages are * Buddy and cannot be re-allocated because they are @@ -1064,6 +1073,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, isolate_success: list_add(>lru, >migratepages); +isolate_success_no_list: cc->nr_migratepages += compound_nr(page); nr_isolated += compound_nr(page); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4abd4c4565dd..efe2f1fed74e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h, * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one * @h: struct hstate old page belongs to * @old_page: Old page to dissolve + * @list: List to isolate the page in case we need to * Returns 0 on success, otherwise negated error. */ -static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, + struct list_head *list) { gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; int nid = page_to_nid(old_page); @@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) goto free_new; } else if (page_count(old_page)) { /* -* Someone has grabbed the page, fail for now. +* Someone has grabbed the page, try to isolate it here. +* Fail with -EBUSY if not possible. */ - ret = -EBUSY; + spin_unlock_irq(_lock); + if (!isolate_huge_page(old_page, list)) + ret = -EBUSY; + spin_lock_irq(_lock); goto free_new; } else if (!HPageFreed(old_page)) { /* @@ -2352,10 +2358,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) return ret; } -int isolate_or_dissolve_huge_pa
[PATCH v8 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
pfn_range_valid_contig() bails out when it finds an in-use page or a hugetlb page, among other things. We can drop the in-use page check since __alloc_contig_pages can migrate away those pages, and the hugetlb page check can go too since isolate_migratepages_range is now capable of dealing with hugetlb pages. Either way, those checks are racy so let the end function handle it when the time comes. Signed-off-by: Oscar Salvador Suggested-by: David Hildenbrand Reviewed-by: David Hildenbrand Acked-by: Mike Kravetz --- mm/page_alloc.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b5a94de3cdde..c5338e912ace 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, if (PageReserved(page)) return false; - - if (page_count(page) > 0) - return false; - - if (PageHuge(page)) - return false; } return true; } -- 2.16.3
[PATCH v8 4/7] mm,hugetlb: Split prep_new_huge_page functionality
Currently, prep_new_huge_page() performs two functions. It sets the right state for a new hugetlb, and increases the hstate's counters to account for the new page. Let us split its functionality into two separate functions, decoupling the handling of the counters from initializing a hugepage. The outcome is having __prep_new_huge_page(), which only initializes the page , and __prep_account_new_huge_page(), which adds the new page to the hstate's counters. This allows us to be able to set a hugetlb without having to worry about the counter/locking. It will prove useful in the next patch. prep_new_huge_page() still calls both functions. Signed-off-by: Oscar Salvador Acked-by: Michal Hocko Reviewed-by: Mike Kravetz --- mm/hugetlb.c | 20 +--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 2cb9fa79cbaa..6f39ec79face 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1483,16 +1483,30 @@ void free_huge_page(struct page *page) } } -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +/* + * Must be called with the hugetlb lock held + */ +static void __prep_account_new_huge_page(struct hstate *h, int nid) +{ + lockdep_assert_held(_lock); + h->nr_huge_pages++; + h->nr_huge_pages_node[nid]++; +} + +static void __prep_new_huge_page(struct page *page) { INIT_LIST_HEAD(>lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); hugetlb_set_page_subpool(page, NULL); set_hugetlb_cgroup(page, NULL); set_hugetlb_cgroup_rsvd(page, NULL); +} + +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +{ + __prep_new_huge_page(page); spin_lock_irq(_lock); - h->nr_huge_pages++; - h->nr_huge_pages_node[nid]++; + __prep_account_new_huge_page(h, nid); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v8 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page
Pages allocated after boot get its private field cleared by means of post_alloc_hook(). Pages allocated during boot, that is directly from the memblock allocator, get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page() before any memblock allocation. Based on this ground, let us remove the clearing of the flag from prep_new_huge_page() as it is not needed. Signed-off-by: Oscar Salvador --- mm/hugetlb.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 54d81d5947ed..2cb9fa79cbaa 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) spin_lock_irq(_lock); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; - ClearHPageFreed(page); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v8 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was any error during isolation. This does not work as soon as we need to start reporting different error codes and make sure we pass them down the chain, so they are properly interpreted by functions like e.g: alloc_contig_range. Let us rework isolate_migratepages_{range,block} so we can report error codes. Since isolate_migratepages_block will stop returning the next pfn to be scanned, we reuse the cc->migrate_pfn field to keep track of that. Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Acked-by: Mike Kravetz --- mm/compaction.c | 52 +--- mm/internal.h | 10 -- mm/page_alloc.c | 7 +++ 3 files changed, 36 insertions(+), 33 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 8c5028bfbd56..110b90d5fff5 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -787,15 +787,14 @@ static bool too_many_isolated(pg_data_t *pgdat) * * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. - * Returns zero if there is a fatal signal pending, otherwise PFN of the - * first page that was not scanned (which may be both less, equal to or more - * than end_pfn). + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, + * or 0. + * cc->migrate_pfn will contain the next pfn to scan. * * The pages are isolated on cc->migratepages list (not required to be empty), - * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field - * is neither read nor updated. + * and cc->nr_migratepages is updated accordingly. */ -static unsigned long +static int isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, unsigned long end_pfn, isolate_mode_t isolate_mode) { @@ -809,6 +808,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, bool skip_on_failure = false; unsigned long next_skip_pfn = 0; bool skip_updated = false; + int ret = 0; + + cc->migrate_pfn = low_pfn; /* * Ensure that there are not too many pages isolated from the LRU @@ -818,16 +820,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, while (unlikely(too_many_isolated(pgdat))) { /* stop isolation if there are still pages not migrated */ if (cc->nr_migratepages) - return 0; + return -EAGAIN; /* async migration should just abort */ if (cc->mode == MIGRATE_ASYNC) - return 0; + return -EAGAIN; congestion_wait(BLK_RW_ASYNC, HZ/10); if (fatal_signal_pending(current)) - return 0; + return -EINTR; } cond_resched(); @@ -875,8 +877,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (fatal_signal_pending(current)) { cc->contended = true; + ret = -EINTR; - low_pfn = 0; goto fatal_pending; } @@ -1130,7 +1132,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (nr_isolated) count_compact_events(COMPACTISOLATED, nr_isolated); - return low_pfn; + cc->migrate_pfn = low_pfn; + + return ret; } /** @@ -1139,15 +1143,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, * @start_pfn: The first PFN to start isolating. * @end_pfn: The one-past-last PFN. * - * Returns zero if isolation fails fatally due to e.g. pending signal. - * Otherwise, function returns one-past-the-last PFN of isolated page - * (which may be greater than end_pfn if end fell in a middle of a THP page). + * Returns -EAGAIN when contented, -EINTR in case of a signal pending or 0. */ -unsigned long +int isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn, block_start_pfn, block_end_pfn; + int ret = 0; /* Scan block by block. First and last block may be incomplete */ pfn = start_pfn; @@ -1166,17 +1169,17 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, block_end_pfn, cc->zone)) continue; - pfn = isolate_migratepages_block(cc, pfn, block_end_pfn, - ISOLA
[PATCH v8 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY, and report them down the chain. The problem is that when migrate_pages() reports -ENOMEM, we keep going till we exhaust all the try-attempts (5 at the moment) instead of bailing out. migrate_pages() bails out right away on -ENOMEM because it is considered a fatal error. Do the same here instead of keep going and retrying. Note that this is not fixing a real issue, just a cosmetic change. Although we can save some cycles by backing off ealier Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Acked-by: Michal Hocko Acked-by: Mike Kravetz --- mm/page_alloc.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1c67c99603a3..689454692de1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, } tries = 0; } else if (++tries == 5) { - ret = ret < 0 ? ret : -EBUSY; + ret = -EBUSY; break; } @@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, ret = migrate_pages(>migratepages, alloc_migration_target, NULL, (unsigned long), cc->mode, MR_CONTIG_RANGE); + + /* +* On -ENOMEM, migrate_pages() bails out right away. It is pointless +* to retry again over this error, so do the same here. +*/ + if (ret == -ENOMEM) + break; } lru_cache_enable(); -- 2.16.3
[PATCH v8 0/7] Make alloc_contig_range handle Hugetlb pages
80.060531] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.061972] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.063413] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.064838] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy" And then with this patchset running: "Same experiment with ZONE_MOVABLE: a) Free huge pages: all memory can get unplugged again. b) Allocated/populated but idle huge pages: all memory can get unplugged again. c) Allocated/populated but all 512 huge pages are read/written in a loop: all memory can get unplugged again, but I get a single [ 121.192345] alloc_contig_range: [18, 188000) PFNs busy Most probably because it happened to try migrating a huge page while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal with this temporary failure. Last but not least, I did something extreme: # cat /proc/meminfo MemTotal:5061568 kB MemFree: 186560 kB MemAvailable: 354524 kB ... HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd:0 HugePages_Surp:0 Triggering unplug would require to dissolve+alloc - which now fails when trying to allocate an additional ~512 huge pages (1G). As expected, I can properly see memory unplug not fully succeeding. + I get a fairly continuous stream of [ 226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy ... But more importantly, the hugepage count remains stable, as configured by the admin (me): HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd: 0 HugePages_Surp:0" Oscar Salvador (7): mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range mm,compaction: Let isolate_migratepages_{range,block} return error codes mm,hugetlb: Drop clearing of flag from prep_new_huge_page mm,hugetlb: Split prep_new_huge_page functionality mm: Make alloc_contig_range handle free hugetlb pages mm: Make alloc_contig_range handle in-use hugetlb pages mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig include/linux/hugetlb.h | 7 +++ mm/compaction.c | 93 +- mm/hugetlb.c| 150 ++-- mm/internal.h | 10 +++- mm/page_alloc.c | 22 +++ mm/vmscan.c | 5 +- 6 files changed, 240 insertions(+), 47 deletions(-) -- 2.16.3
Re: [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
On Wed, Apr 14, 2021 at 01:54:17PM +0200, David Hildenbrand wrote: > >* Isolate all pages that can be migrated from the range specified by > >* [low_pfn, end_pfn). The range is expected to be within same pageblock. > > - * Returns zero if there is a fatal signal pending, otherwise PFN of the > > - * first page that was not scanned (which may be both less, equal to or > > more > > - * than end_pfn). > > + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or > > congestion, > > + * or 0. > > + * cc->migrate_pfn will contain the next pfn to scan (which may be both > > less, > > + * equal to or more that end_pfn). > > I failed to parse the last sentence -- e.g., using "both" and then listing > three options. Also, s/than/than/? Can we simplify to > > "cc->migrate_pfn will contain the next pfn to scan" Ok, will simplify > > /** > > @@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control > > *cc, unsigned long low_pfn, > >* @start_pfn: The first PFN to start isolating. > >* @end_pfn: The one-past-last PFN. > >* > > - * Returns zero if isolation fails fatally due to e.g. pending signal. > > - * Otherwise, function returns one-past-the-last PFN of isolated page > > - * (which may be greater than end_pfn if end fell in a middle of a THP > > page). > > + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or > > congestion, > > errno is usually positive. > > > + * or 0. > > I'd be more specific here. Something like > > " > Returns 0 if isolation succeeded. Returns -EINTR if a fatal signal is > pending. Returns -EAGAIN when contended and the caller should retry. > > In any case, cc->migrate_pfn is set to one-past-the-last PFN that has been > isolated. > " I will try to reword it. Although note that 0 does not mean the isolation succeeded. Compaction tracks whether we could isolate pages by means of cc->nr_migratepages. The thing is that alloc_contig_range and compaction code need different things. Compaction wants to isolate and migrate as much as it needs to in order to form a higher-order page (or in case if it is trying to compact the whole zone, it goes through the zone). alloc_contig_range() needs to isolate the whole range we specified in order to be able to migrate it and make it free for whoever wants to use it. Let us say that the interface between alloc_contig_range() and compaction still needs some more love that I plan to work on when this goes in. -- Oscar Salvador SUSE L3
Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
On Tue, Apr 13, 2021 at 03:40:18PM +0200, Michal Hocko wrote: > > + /* > > +* Call __prep_new_huge_page() to construct the hugetlb page, > > and > > +* enqueue it then to place it in the freelists. After this, > > +* counters are back on track. Free hugepages have a refcount > > of 0, > > +* so we need to decrease new_page's count as well. > > +*/ > > + __prep_new_huge_page(new_page); > > + __prep_account_new_huge_page(h, nid); > > I think it would help to put something like the following into the > comment above this really strange construct. > > /* >* new_page needs to be initialized with the standard >* hugetlb state. This is normally done by >* prep_new_huge_page but that takes hugetlb_lock which >* is already held so we need to open code it here. >* Reference count trick is needed because allocator >* gives us referenced page but the pool requires pages > * with 0 refcount. >*/ Ok, I will try to add more info, thanks Michal! -- Oscar Salvador SUSE L3
Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
On Wed, Apr 14, 2021 at 02:26:21PM +0200, David Hildenbrand wrote: > > +static inline int isolate_or_dissolve_huge_page(struct page *page) > > +{ > > + return -ENOMEM; > > Without CONFIG_HUGETLB_PAGE, there is no way someone could possible pass in > something valid. Although it doesn't matter too much, -EINVAL or similar > sounds a bit better. I guess we could, was just to make it consistent with the kind of error we return when we have it enabled. > > @@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, > > unsigned long low_pfn, > > bool skip_on_failure = false; > > unsigned long next_skip_pfn = 0; > > bool skip_updated = false; > > + bool fatal_error = false; > > Can't we use "ret == -ENOMEM" instead of fatal_error? Yes, we can, I will see how it looks. [...] > > + /* > > +* Fence off gigantic pages as there is a cyclic dependency between > > +* alloc_contig_range and them. Return -ENOME as this has the effect > > s/-ENOME/-ENOMEM/ thanks, I missed that one. > > > +* of bailing out right away without further retrying. > > +*/ > > + if (hstate_is_gigantic(h)) > > + return -ENOMEM; > > + > > + return alloc_and_dissolve_huge_page(h, head); > > +} > > + > > struct page *alloc_huge_page(struct vm_area_struct *vma, > > unsigned long addr, int avoid_reserve) > > { > > > > Complicated stuff, but looks good to me. Thanks for having a look! -- Oscar Salvador SUSE L3
Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote: > Well, to be precise it does the very same thing with memamp struct pages > but that is before the initialization code you have pointed out above. > In this context it just poisons the allocated content which is the GB > page storage. Right. > > I checked, and when we get there in __alloc_bootmem_huge_page, > > page->private is > > still zeroed, so I guess it should be safe to assume that we do not really > > need > > to clear the flag in __prep_new_huge_page() routine? > > It would be quite nasty if the struct pages content would be undefined. > Maybe that is possible but then I would rather stick the initialization > into __alloc_bootmem_huge_page. Yes, but I do not think that is really possible unless I missed something. Let us see what Mike thinks of it, if there are no objections, we can get rid of the clearing flag right there. -- Oscar Salvador SUSE L3
Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
On Wed, Apr 14, 2021 at 12:01:47PM +0200, Oscar Salvador wrote: > but it seems that does not the memmap struct page. that sould read as "but it seems that that does not affect the memmap struct page" -- Oscar Salvador SUSE L3
Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote: > You are right it doesn't do it there. But all struct pages, even those > that are allocated by the bootmem allocator should initialize its struct > pages. They would be poisoned otherwise, right? I would have to look at > the exact code path but IIRC this should be around the time bootmem > allocator state transitions to the page allocator. Ok, you are right. struct pages are initialized a bit earlier through: start_kernel setup_arch paging_init zone_sizes_init free_area_init free_area_init_node free_area_init_core memmap_init_zone memmap_init_range __init_single_page While the allocation of bootmem hugetlb happens start_kernel parse_args ... hugepages_setup ... hugetlb_hstate_alloc_pages __alloc_bootmem_huge_page which is after the setup_arch() call. So by the time we get the page from __alloc_bootmem_huge_page(), fields are zeroed. I thought we might get in trouble because memblock_alloc_try_nid_raw() calls page_init_poison() which poisons the chunk with 0xff,e.g: [1.955471] boot: [1.955476] boot: but it seems that does not the memmap struct page. I checked, and when we get there in __alloc_bootmem_huge_page, page->private is still zeroed, so I guess it should be safe to assume that we do not really need to clear the flag in __prep_new_huge_page() routine? -- Oscar Salvador SUSE L3
Re: [PATCH 02/10] mm/numa: automatically generate node migration order
On Wed, Apr 14, 2021 at 10:12:29AM +0200, David Hildenbrand wrote: > My guest best is that fast class is something like HBM (High Bandwidth > Memory), medium class is ordinary RAM, slow class is PMEM. I see, thanks for the hint David ;-) -- Oscar Salvador SUSE L3
Re: [PATCH 02/10] mm/numa: automatically generate node migration order
On Wed, Apr 14, 2021 at 10:08:54AM +0200, Oscar Salvador wrote: > In Dave's example, list is created in a way that stays local to the socket, > and we go from the fast one to the slow one. Or maybe it is just because find_next_best_node() does not know any better and creates the list that way? -- Oscar Salvador SUSE L3
Re: [PATCH 02/10] mm/numa: automatically generate node migration order
On Fri, Apr 09, 2021 at 08:07:08PM -0700, Wei Xu wrote: > On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen > wrote: > > + * When Node 0 fills up, its memory should be migrated to > > + * Node 1. When Node 1 fills up, it should be migrated to > > + * Node 2. The migration path start on the nodes with the > > + * processors (since allocations default to this node) and > > + * fast memory, progress through medium and end with the > > + * slow memory: > > + * > > + * 0 -> 1 -> 2 -> stop > > + * 3 -> 4 -> 5 -> stop > > + * > > + * This is represented in the node_demotion[] like this: > > + * > > + * { 1, // Node 0 migrates to 1 > > + *2, // Node 1 migrates to 2 > > + * -1, // Node 2 does not migrate > > + *4, // Node 3 migrates to 4 > > + *5, // Node 4 migrates to 5 > > + * -1} // Node 5 does not migrate > > + */ > > In this example, if we want to support multiple nodes as the demotion > target of a source node, we can group these nodes into three tiers > (classes): > > fast class: > 0 -> {1, 4} // 1 is the preferred > 3 -> {4, 1} // 4 is the preferred > > medium class: > 1 -> {2, 5} // 2 is the preferred > 4 -> {5, 2} // 5 is the preferred > > slow class: > 2 -> stop > 5 -> stop Hi Wei Xu, I have some questions about it Fast class/memory are pictured as those nodes with CPUs, while Slow class/memory are PMEM, right? Then, what stands for medium class/memory? In Dave's example, list is created in a way that stays local to the socket, and we go from the fast one to the slow one. In yours, lists are created taking the fastest nodes from all sockets and we work our way down, which means have cross-socket nodes in the list. How much of a penalty is that? And while I get your point, I am not sure if that is what we pretend here. This patchset aims to place cold pages that are about to be reclaim in slower nodes to give them a second chance, while your design seems more to have kind of different memory clases and be able to place applications in one of those tiers depending on its demands or sysadmin-demand. Could you expand some more? -- Oscar Salvador SUSE L3
Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote: > On Tue 13-04-21 14:19:03, Mike Kravetz wrote: > > On 4/13/21 6:23 AM, Michal Hocko wrote: > > The only place where page->private may not be initialized is when we do > > allocations at boot time from memblock. In this case, we will add the > > pages to the free list via put_page/free_huge_page so the appropriate > > flags will be cleared before anyone notices. > > Pages allocated by the bootmem should be pre initialized from the boot, > no? I guess Mike means: hugetlb_hstate_alloc_pages alloc_bootmem_huge_page __alloc_bootmem_huge_page memblock_alloc_try_nid_raw and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory. Then these pages are initialized in: gather_bootmem_prealloc prep_compound_huge_page prep_new_huge_page But as can be noticed, no one touches page->private when coming from that path. -- Oscar Salvador SUSE L3
Re: [PATCH v2 resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove
On Mon, Apr 12, 2021 at 01:08:42PM +0100, Mel Gorman wrote: > zone_pcp_reset allegedly protects against a race with drain_pages > using local_irq_save but this is bogus. local_irq_save only operates > on the local CPU. If memory hotplug is running on CPU A and drain_pages > is running on CPU B, disabling IRQs on CPU A does not affect CPU B and > offers no protection. > > This patch deletes IRQ disable/enable on the grounds that IRQs protect > nothing and assumes the existing hotplug paths guarantees the PCP cannot be > used after zone_pcp_enable(). That should be the case already because all > the pages have been freed and there is no page to put on the PCP lists. > > Signed-off-by: Mel Gorman Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
On Tue, Apr 13, 2021 at 02:33:41PM -0700, Mike Kravetz wrote: > > -static void prep_new_huge_page(struct hstate *h, struct page *page, int > > nid) > > +/* > > + * Must be called with the hugetlb lock held > > + */ > > +static void __prep_account_new_huge_page(struct hstate *h, int nid) > > +{ > > + h->nr_huge_pages++; > > + h->nr_huge_pages_node[nid]++; > > I would prefer if we also move setting the destructor to this routine. > set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); Uhm, but that is the routine that does the accounting, it feels wrong here, plus... > > That way, PageHuge() will be false until it 'really' is a huge page. > If not, we could potentially go into that retry loop in > dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5. ...I do not follow here, could you please elaborate some more? Unless I am missing something, behaviour should not be any different with this patch. Thanks -- Oscar Salvador SUSE L3
Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
On Tue, Apr 13, 2021 at 03:29:02PM -0700, Mike Kravetz wrote: > > spin_lock_irq(hugetlb_lock) > > 1) update_and_free_page > >PageHuge() == F > >__free_pages() > > 2) enqueue_huge_page > >SetPageHugeFreed() > > spin_unlock(_lock) > > Very small nit, the above should be spin_unlock_irq(_lock) Right, I missed it somehow. > > + /* > > +* The page might have been dissolved from under our feet, so make sure > > +* to carefully check the state under the lock. > > +* Return success when racing as if we dissolved the page ourselves. > > +*/ > > + spin_lock_irq(_lock); > > + if (PageHuge(page)) { > > + head = compound_head(page); > > + h = page_hstate(head); > > + } else { > > + spin_unlock(_lock); > > Should be be spin_unlock_irq(_lock); > > Other than that, it looks good. Yeah, I will amend it in the next version. Thanks Mike! -- Oscar Salvador SUSE L3
Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
On Tue, Apr 13, 2021 at 03:48:53PM -0700, Mike Kravetz wrote: > The label free_new is: > > free_new: > spin_unlock_irq(_lock); > __free_pages(new_page, huge_page_order(h)); > > return ret; > > So, we are locking and immediately unlocking without any code in > between. Usually, I don't like like multiple labels before return. > However, perhaps we should add another to avoid this unnecessary > cycle. On the other hand, this is an uncommon race condition so the > simple code may be acceptable. I guess we could have something like: free_new: spin_unlock_irq(_lock); free_new_nolock: __free_pages(new_page, huge_page_order(h)); return ret; And let the retry go to there without locking. But as you said, the racecondition is rare enough, so I am not sure if this buys us much. But I can certainly add it if you feel strong about it. -- Oscar Salvador SUSE L3
[PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages
alloc_contig_range will fail if it ever sees a HugeTLB page within the range we are trying to allocate, even when that page is free and can be easily reallocated. This has proved to be problematic for some users of alloc_contic_range, e.g: CMA and virtio-mem, where those would fail the call even when those pages lay in ZONE_MOVABLE and are free. We can do better by trying to replace such page. Free hugepages are tricky to handle so as to no userspace application notices disruption, we need to replace the current free hugepage with a new one. In order to do that, a new function called alloc_and_dissolve_huge_page is introduced. This function will first try to get a new fresh hugepage, and if it succeeds, it will replace the old one in the free hugepage pool. The free page replacement is done under hugetlb_lock, so no external users of hugetlb will notice the change. To allocate the new huge page, we use alloc_buddy_huge_page(), so we do not have to deal with any counters, and prep_new_huge_page() is not called. This is valulable because in case we need to free the new page, we only need to call __free_pages(). Once we know that the page to be replaced is a genuine 0-refcounted huge page, we remove the old page from the freelist by remove_hugetlb_page(). Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page() for the new huge page to properly initialize it and increment the hstate->nr_huge_pages counter (previously decremented by remove_hugetlb_page()). Once done, the page is enqueued by enqueue_huge_page() and it is ready to be used. There is one tricky case when page's refcount is 0 because it is in the process of being released. A missing PageHugeFreed bit will tell us that freeing is in flight so we retry after dropping the hugetlb_lock. The race window should be small and the next retry should make a forward progress. E.g: CPU0CPU1 free_huge_page()isolate_or_dissolve_huge_page PageHuge() == T alloc_and_dissolve_huge_page alloc_buddy_huge_page() spin_lock_irq(hugetlb_lock) // PageHuge() && !PageHugeFreed && // !PageCount() spin_unlock_irq(hugetlb_lock) spin_lock_irq(hugetlb_lock) 1) update_and_free_page PageHuge() == F __free_pages() 2) enqueue_huge_page SetPageHugeFreed() spin_unlock(_lock) spin_lock_irq(hugetlb_lock) 1) PageHuge() == F (freed by case#1 from CPU0) 2) PageHuge() == T PageHugeFreed() == T - proceed with replacing the page In the case above we retry as the window race is quite small and we have high chances to succeed next time. With regard to the allocation, we restrict it to the node the page belongs to with __GFP_THISNODE, meaning we do not fallback on other node's zones. Note that gigantic hugetlb pages are fenced off since there is a cyclic dependency between them and alloc_contig_range. Signed-off-by: Oscar Salvador --- include/linux/hugetlb.h | 6 +++ mm/compaction.c | 37 ++-- mm/hugetlb.c| 115 3 files changed, 155 insertions(+), 3 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 09f1fd12a6fa..b2d2118bfd1a 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,6 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; +int isolate_or_dissolve_huge_page(struct page *page); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; +static inline int isolate_or_dissolve_huge_page(struct page *page) +{ + return -ENOMEM; +} + static inline struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve) diff --git a/mm/compaction.c b/mm/compaction.c index eeba4668c22c..89426b6d1ea3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat) * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, - * or 0. + * -ENOMEM in case we could not al
[PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages
alloc_contig_range() will fail if it finds a HugeTLB page within the range, without a chance to handle them. Since HugeTLB pages can be migrated as any LRU or Movable page, it does not make sense to bail out without trying. Enable the interface to recognize in-use HugeTLB pages so we can migrate them, and have much better chances to succeed the call. Signed-off-by: Oscar Salvador Reviewed-by: Mike Kravetz Acked-by: Michal Hocko --- include/linux/hugetlb.h | 5 +++-- mm/compaction.c | 12 +++- mm/hugetlb.c| 22 +- mm/vmscan.c | 5 +++-- 4 files changed, 34 insertions(+), 10 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index b2d2118bfd1a..b92f25ccef58 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -595,7 +595,7 @@ struct huge_bootmem_page { struct hstate *hstate; }; -int isolate_or_dissolve_huge_page(struct page *page); +int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list); struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, @@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; -static inline int isolate_or_dissolve_huge_page(struct page *page) +static inline int isolate_or_dissolve_huge_page(struct page *page, + struct list_head *list) { return -ENOMEM; } diff --git a/mm/compaction.c b/mm/compaction.c index 89426b6d1ea3..bb8ff3543972 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -909,7 +909,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, } if (PageHuge(page) && cc->alloc_contig) { - ret = isolate_or_dissolve_huge_page(page); + ret = isolate_or_dissolve_huge_page(page, >migratepages); /* * Fail isolation in case isolate_or_dissolve_huge_page @@ -927,6 +927,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, goto isolate_fail; } + if (PageHuge(page)) { + /* +* Hugepage was successfully isolated and placed +* on the cc->migratepages list. +*/ + low_pfn += compound_nr(page) - 1; + goto isolate_success_no_list; + } + /* * Ok, the hugepage was dissolved. Now these pages are * Buddy and cannot be re-allocated because they are @@ -1068,6 +1077,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, isolate_success: list_add(>lru, >migratepages); +isolate_success_no_list: cc->nr_migratepages += compound_nr(page); nr_isolated += compound_nr(page); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4a664d6e82c1..24a453ff47f2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h, * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one * @h: struct hstate old page belongs to * @old_page: Old page to dissolve + * @list: List to isolate the page in case we need to * Returns 0 on success, otherwise negated error. */ -static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, + struct list_head *list) { gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; int nid = page_to_nid(old_page); @@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) goto free_new; } else if (page_count(old_page)) { /* -* Someone has grabbed the page, fail for now. +* Someone has grabbed the page, try to isolate it here. +* Fail with -EBUSY if not possible. */ - ret = -EBUSY; + spin_unlock_irq(_lock); + if (!isolate_huge_page(old_page, list)) + ret = -EBUSY; + spin_lock_irq(_lock); goto free_new; } else if (!HPageFreed(old_page)) { /* @@ -2350,10 +2356,11 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page) return ret; } -int isolate_or_dissolve_huge_pa
[PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig
pfn_range_valid_contig() bails out when it finds an in-use page or a hugetlb page, among other things. We can drop the in-use page check since __alloc_contig_pages can migrate away those pages, and the hugetlb page check can go too since isolate_migratepages_range is now capable of dealing with hugetlb pages. Either way, those checks are racy so let the end function handle it when the time comes. Signed-off-by: Oscar Salvador Suggested-by: David Hildenbrand Reviewed-by: David Hildenbrand --- mm/page_alloc.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b5a94de3cdde..c5338e912ace 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, unsigned long start_pfn, if (PageReserved(page)) return false; - - if (page_count(page) > 0) - return false; - - if (PageHuge(page)) - return false; } return true; } -- 2.16.3
[PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality
Currently, prep_new_huge_page() performs two functions. It sets the right state for a new hugetlb, and increases the hstate's counters to account for the new page. Let us split its functionality into two separate functions, decoupling the handling of the counters from initializing a hugepage. The outcome is having __prep_new_huge_page(), which only initializes the page , and __prep_account_new_huge_page(), which adds the new page to the hstate's counters. This allows us to be able to set a hugetlb without having to worry about the counter/locking. It will prove useful in the next patch. prep_new_huge_page() still calls both functions. Signed-off-by: Oscar Salvador --- mm/hugetlb.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e40d5fe5c63c..0607b2b71ac6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1483,7 +1483,16 @@ void free_huge_page(struct page *page) } } -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +/* + * Must be called with the hugetlb lock held + */ +static void __prep_account_new_huge_page(struct hstate *h, int nid) +{ + h->nr_huge_pages++; + h->nr_huge_pages_node[nid]++; +} + +static void __prep_new_huge_page(struct page *page) { INIT_LIST_HEAD(>lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); @@ -1491,9 +1500,13 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) set_hugetlb_cgroup(page, NULL); set_hugetlb_cgroup_rsvd(page, NULL); ClearHPageFreed(page); +} + +static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) +{ + __prep_new_huge_page(page); spin_lock_irq(_lock); - h->nr_huge_pages++; - h->nr_huge_pages_node[nid]++; + __prep_account_new_huge_page(h, nid); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock
Currently, the clearing of the flag is done under the lock, but this is unnecessary as we just allocated the page and we did not give it away yet, so no one should be messing with it. Also, this helps making clear that here the lock is only protecting the counter. Signed-off-by: Oscar Salvador --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 54d81d5947ed..e40d5fe5c63c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) hugetlb_set_page_subpool(page, NULL); set_hugetlb_cgroup(page, NULL); set_hugetlb_cgroup_rsvd(page, NULL); + ClearHPageFreed(page); spin_lock_irq(_lock); h->nr_huge_pages++; h->nr_huge_pages_node[nid]++; - ClearHPageFreed(page); spin_unlock_irq(_lock); } -- 2.16.3
[PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes
Currently, isolate_migratepages_{range,block} and their callers use a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was any error during isolation. This does not work as soon as we need to start reporting different error codes and make sure we pass them down the chain, so they are properly interpreted by functions like e.g: alloc_contig_range. Let us rework isolate_migratepages_{range,block} so we can report error codes. Since isolate_migratepages_block will stop returning the next pfn to be scanned, we reuse the cc->migrate_pfn field to keep track of that. Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka --- mm/compaction.c | 54 +++--- mm/internal.h | 10 -- mm/page_alloc.c | 7 +++ 3 files changed, 38 insertions(+), 33 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 8c5028bfbd56..eeba4668c22c 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -787,15 +787,15 @@ static bool too_many_isolated(pg_data_t *pgdat) * * Isolate all pages that can be migrated from the range specified by * [low_pfn, end_pfn). The range is expected to be within same pageblock. - * Returns zero if there is a fatal signal pending, otherwise PFN of the - * first page that was not scanned (which may be both less, equal to or more - * than end_pfn). + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, + * or 0. + * cc->migrate_pfn will contain the next pfn to scan (which may be both less, + * equal to or more that end_pfn). * * The pages are isolated on cc->migratepages list (not required to be empty), - * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field - * is neither read nor updated. + * and cc->nr_migratepages is updated accordingly. */ -static unsigned long +static int isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, unsigned long end_pfn, isolate_mode_t isolate_mode) { @@ -809,6 +809,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, bool skip_on_failure = false; unsigned long next_skip_pfn = 0; bool skip_updated = false; + int ret = 0; + + cc->migrate_pfn = low_pfn; /* * Ensure that there are not too many pages isolated from the LRU @@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, while (unlikely(too_many_isolated(pgdat))) { /* stop isolation if there are still pages not migrated */ if (cc->nr_migratepages) - return 0; + return -EAGAIN; /* async migration should just abort */ if (cc->mode == MIGRATE_ASYNC) - return 0; + return -EAGAIN; congestion_wait(BLK_RW_ASYNC, HZ/10); if (fatal_signal_pending(current)) - return 0; + return -EINTR; } cond_resched(); @@ -875,8 +878,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (fatal_signal_pending(current)) { cc->contended = true; + ret = -EINTR; - low_pfn = 0; goto fatal_pending; } @@ -1130,7 +1133,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, if (nr_isolated) count_compact_events(COMPACTISOLATED, nr_isolated); - return low_pfn; + cc->migrate_pfn = low_pfn; + + return ret; } /** @@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, * @start_pfn: The first PFN to start isolating. * @end_pfn: The one-past-last PFN. * - * Returns zero if isolation fails fatally due to e.g. pending signal. - * Otherwise, function returns one-past-the-last PFN of isolated page - * (which may be greater than end_pfn if end fell in a middle of a THP page). + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or congestion, + * or 0. */ -unsigned long +int isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn, block_start_pfn, block_end_pfn; + int ret = 0; /* Scan block by block. First and last block may be incomplete */ pfn = start_pfn; @@ -1166,17 +1171,17 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn, block_end_pfn, cc->zone)) continue; - pfn = isolate_migra
[PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY, and report them down the chain. The problem is that when migrate_pages() reports -ENOMEM, we keep going till we exhaust all the try-attempts (5 at the moment) instead of bailing out. migrate_pages() bails out right away on -ENOMEM because it is considered a fatal error. Do the same here instead of keep going and retrying. Note that this is not fixing a real issue, just a cosmetic change. Although we can save some cycles by backing off ealier Signed-off-by: Oscar Salvador Acked-by: Vlastimil Babka Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- mm/page_alloc.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1c67c99603a3..689454692de1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, } tries = 0; } else if (++tries == 5) { - ret = ret < 0 ? ret : -EBUSY; + ret = -EBUSY; break; } @@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, ret = migrate_pages(>migratepages, alloc_migration_target, NULL, (unsigned long), cc->mode, MR_CONTIG_RANGE); + + /* +* On -ENOMEM, migrate_pages() bails out right away. It is pointless +* to retry again over this error, so do the same here. +*/ + if (ret == -ENOMEM) + break; } lru_cache_enable(); -- 2.16.3
[PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages
064838] alloc_contig_range: [1b8000, 1c) PFNs busy [ 180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy [ 180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy" And then with this patchset running: "Same experiment with ZONE_MOVABLE: a) Free huge pages: all memory can get unplugged again. b) Allocated/populated but idle huge pages: all memory can get unplugged again. c) Allocated/populated but all 512 huge pages are read/written in a loop: all memory can get unplugged again, but I get a single [ 121.192345] alloc_contig_range: [18, 188000) PFNs busy Most probably because it happened to try migrating a huge page while it was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it can deal with this temporary failure. Last but not least, I did something extreme: # cat /proc/meminfo MemTotal:5061568 kB MemFree: 186560 kB MemAvailable: 354524 kB ... HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd:0 HugePages_Surp:0 Triggering unplug would require to dissolve+alloc - which now fails when trying to allocate an additional ~512 huge pages (1G). As expected, I can properly see memory unplug not fully succeeding. + I get a fairly continuous stream of [ 226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy ... But more importantly, the hugepage count remains stable, as configured by the admin (me): HugePages_Total:2048 HugePages_Free: 2048 HugePages_Rsvd:0 HugePages_Surp:0" Oscar Salvador (7): mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range mm,compaction: Let isolate_migratepages_{range,block} return error codes mm,hugetlb: Clear HPageFreed outside of the lock mm,hugetlb: Split prep_new_huge_page functionality mm: Make alloc_contig_range handle free hugetlb pages mm: Make alloc_contig_range handle in-use hugetlb pages mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig include/linux/hugetlb.h | 7 +++ mm/compaction.c | 97 ++- mm/hugetlb.c| 148 ++-- mm/internal.h | 10 +++- mm/page_alloc.c | 22 +++ mm/vmscan.c | 5 +- 6 files changed, 242 insertions(+), 47 deletions(-) -- 2.16.3
Re: [PATCH V2 4/6] mm: Drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION
On Thu, Apr 01, 2021 at 12:14:06PM +0530, Anshuman Khandual wrote: > ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION configs have duplicate definitions on > platforms that subscribe them. Drop these reduntant definitions and instead > just select them appropriately. > > Cc: Catalin Marinas > Cc: Will Deacon > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: "H. Peter Anvin" > Cc: Andrew Morton > Cc: x...@kernel.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linuxppc-...@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: linux-kernel@vger.kernel.org > Acked-by: Catalin Marinas (arm64) > Signed-off-by: Anshuman Khandual Hi Anshuman, X86 needs fixing, see below: > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 503d8b2e8676..10702ef1eb57 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -60,8 +60,10 @@ config X86 > select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI > select ARCH_32BIT_OFF_T if X86_32 > select ARCH_CLOCKSOURCE_INIT > + select ARCH_ENABLE_HUGEPAGE_MIGRATION if x86_64 && HUGETLB_PAGE && > MIGRATION > select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM) > select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG > + select ARCH_ENABLE_THP_MIGRATION if x86_64 && TRANSPARENT_HUGEPAGE you need s/x86_64/X86_64/, otherwise we are left with no migration :-) -- Oscar Salvador SUSE L3
Re: [PATCH v5 4/8] hugetlb: create remove_hugetlb_page() to separate functionality
On Fri, Apr 09, 2021 at 01:52:50PM -0700, Mike Kravetz wrote: > The new remove_hugetlb_page() routine is designed to remove a hugetlb > page from hugetlbfs processing. It will remove the page from the active > or free list, update global counters and set the compound page > destructor to NULL so that PageHuge() will return false for the 'page'. > After this call, the 'page' can be treated as a normal compound page or > a collection of base size pages. > > update_and_free_page no longer decrements h->nr_huge_pages{_node} as > this is performed in remove_hugetlb_page. The only functionality > performed by update_and_free_page is to free the base pages to the lower > level allocators. > > update_and_free_page is typically called after remove_hugetlb_page. > > remove_hugetlb_page is to be called with the hugetlb_lock held. > > Creating this routine and separating functionality is in preparation for > restructuring code to reduce lock hold times. This commit should not > introduce any changes to functionality. > > Signed-off-by: Mike Kravetz > Acked-by: Michal Hocko > Reviewed-by: Miaohe Lin > Reviewed-by: Muchun Song Reviewed-by: Oscar Salvador A "nit" below: > static void update_and_free_page(struct hstate *h, struct page *page) > { > int i; > @@ -1334,8 +1369,6 @@ static void update_and_free_page(struct hstate *h, > struct page *page) After this, update_and_free_page()'s job is to reset subpage's flags and free the page. Maybe we want to rename that function at some point, or maybe not as "update" might already imply that. Just speaking out loud. -- Oscar Salvador SUSE L3
Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
On Fri, Apr 09, 2021 at 08:59:21PM +0200, David Hildenbrand wrote: > The only way to add more System RAM is via add_memory() and friends like > add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG. Yeah, my point was more towards whether PMEM can come in a way that it does not have to be hotplugged, but come functional by default (as RAM). But after having read all papers out there, I do not think that it is possible. -- Oscar Salvador SUSE L3
Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
On Fri, Apr 09, 2021 at 12:14:04PM +0200, Oscar Salvador wrote: > On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote: > > I am not really into PMEM, and I ignore whether we need > > CONFIG_MEMORY_HOTPLUG in order to have such memory on the system. > > If so, the following can be partly ignored. > > Ok, I refreshed by memory with [1]. Bleh, being [1] https://lore.kernel.org/linux-mm/20190124231448.e102d...@viggo.jf.intel.com/ -- Oscar Salvador SUSE L3
Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote: > I am not really into PMEM, and I ignore whether we need > CONFIG_MEMORY_HOTPLUG in order to have such memory on the system. > If so, the following can be partly ignored. Ok, I refreshed by memory with [1]. >From that, it seems that in order to use PMEM as RAM we need >CONFIG_MEMORY_HOTPLUG. But is that always the case? Can happen that in some scenario PMEM comes ready to use and we do not need the hotplug trick? Anyway, I would still like to clarify the state of the HOTPLUG_CPU. On x86_64, HOTPLUG_CPU and MEMORY_HOTPLUG are tied by SPM means, but on arm64 one can have MEMORY_HOTPLUG while not having picked HOTPLUG_CPU. My point is that we might want to put the callback functions and the callback registration for cpu-hotplug guarded by its own HOTPLUG_CPU instead of guarding it in the same MEMORY_HOTPLUG block to make it more clear? -- Oscar Salvador SUSE L3
Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages
On Wed, Apr 07, 2021 at 11:40:13AM -0700, Wei Xu wrote: > anon_should_be_aged() doesn't really need "lruvec". It essentially > answers whether the pages of the given node can be swapped or demoted. > So it would be clearer and less confusing if anon_should_be_aged() > takes "pgdat" instead of "lruvec" as the argument. The call to > mem_cgroup_lruvec(NULL, pgdat) in age_active_anon() can then be removed > as well. I tend to agree with this, and I would go one step further with the naming. For me, taking into account the nature of the function that tells us whether we have any means to age those pages, a better fit would be something like anon_can_be_aged(). IIUC, the "should age" part would be inactive_is_low(). -- Oscar Salvador SUSE L3
Re: [PATCH 02/10] mm/numa: automatically generate node migration order
On Thu, Apr 08, 2021 at 02:51:20PM -0700, Dave Hansen wrote: > I've fleshed out the description a bit. I hope this helps? Yes, thanks Dave, both additions look fine to me. -- Oscar Salvador SUSE L3
Re: [PATCH v6 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
On Thu, Apr 08, 2021 at 10:05:18PM -0700, Andrew Morton wrote: > Yes please. I just sent v7 with that yesterday [1] Hope David/Michal finds some time to review patch#4 as that is the only missing piece atm. [1] https://lkml.org/lkml/2021/4/8/546 -- Oscar Salvador SUSE L3
Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
On Thu, Apr 08, 2021 at 01:40:33PM -0700, Yang Shi wrote: > Thanks a lot for the example code. You didn't miss anything. At first > glance, I thought your suggestion seemed neater. Actually I > misunderstood what Dave said about "That could really have caused some > interesting problems." with multiple calls to migrate_pages(). I was > thinking about: > > unsigned long foo() > { > unsigned long *ret_succeeded; > > migrate_pages(..., ret_succeeded); > > migrate_pages(..., ret_succeeded); > > return *ret_succeeded; > } But that would not be a problem as well. I mean I am not sure what is foo() supposed to do. I assume is supposed to return the *total* number of pages that were migrated? Then could do something like: unsigned long foo() { unsigned long ret_succeeded; unsigned long total_succeeded = 0; migrate_pages(..., _succeeded); total_succeeded += ret_succeeded; migrate_pages(..., _succeeded); total_succeeded += ret_succeeded; return *total_succeeded; } But AFAICS, you would have to do that with Wei Xu's version and with mine, no difference there. IIUC, Dave's concern was that nr_succeeded was only set to 0 at the beginning of the function, and never reset back, which means, we would carry the sum of previous nr_succeeded instead of the nr_succeeded in that round. That would be misleading for e.g: reclaim in case we were to call migrate_pages() several times, as instead of a delta value, nr_succeeded would accumulate. But that won't happen neither with Wei Xu's version nor with mine. -- Oscar Salvador SUSE L3
Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
On Thu, Apr 08, 2021 at 08:17:26PM +0200, Oscar Salvador wrote: > diff --git a/include/linux/migrate.h b/include/linux/migrate.h > index 3a389633b68f..fd661cb2ce13 100644 > --- a/include/linux/migrate.h > +++ b/include/linux/migrate.h > @@ -40,7 +40,8 @@ extern int migrate_page(struct address_space *mapping, > struct page *newpage, struct page *page, > enum migrate_mode mode); > extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t > free, > - unsigned long private, enum migrate_mode mode, int reason); > + unsigned long private, enum migrate_mode mode, int reason, > + unsigned int *ret_succeeded); > extern struct page *alloc_migration_target(struct page *page, unsigned long > private); > extern int isolate_movable_page(struct page *page, isolate_mode_t mode); > extern void putback_movable_page(struct page *page); > @@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(struct address_space > *mapping, > static inline void putback_movable_pages(struct list_head *l) {} > static inline int migrate_pages(struct list_head *l, new_page_t new, > free_page_t free, unsigned long private, enum migrate_mode mode, > - int reason) > + int reason, unsigned int *ret_succeeded) > { return -ENOSYS; } > static inline struct page *alloc_migration_target(struct page *page, > unsigned long private) > diff --git a/mm/compaction.c b/mm/compaction.c > index e04f4476e68e..7238e8faff04 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -2364,7 +2364,7 @@ compact_zone(struct compact_control *cc, struct > capture_control *capc) > > err = migrate_pages(>migratepages, compaction_alloc, > compaction_free, (unsigned long)cc, cc->mode, > - MR_COMPACTION); > + MR_COMPACTION, NULL); > > trace_mm_compaction_migratepages(cc->nr_migratepages, err, > >migratepages); > diff --git a/mm/gup.c b/mm/gup.c > index e40579624f10..b70d463aa1fc 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -1606,7 +1606,7 @@ static long check_and_migrate_cma_pages(struct > mm_struct *mm, > put_page(pages[i]); > > if (migrate_pages(_page_list, alloc_migration_target, NULL, > - (unsigned long), MIGRATE_SYNC, MR_CONTIG_RANGE)) { > + (unsigned long), MIGRATE_SYNC, MR_CONTIG_RANGE, > NULL)) { > /* >* some of the pages failed migration. Do get_user_pages >* without migration. > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 24210c9bd843..a17e0f039076 100644 > --- a/mm/memory-failure.c > +++ b/mm/memory-failure.c > @@ -1852,7 +1852,8 @@ static int __soft_offline_page(struct page *page) > > if (isolate_page(hpage, )) { > ret = migrate_pages(, alloc_migration_target, NULL, > - (unsigned long), MIGRATE_SYNC, MR_MEMORY_FAILURE); > + (unsigned long), MIGRATE_SYNC, MR_MEMORY_FAILURE, > + NULL); > if (!ret) { > bool release = !huge; > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 0cdbbfbc5757..28496376de94 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1466,7 +1466,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long > end_pfn) > if (nodes_empty(nmask)) > node_set(mtc.nid, nmask); > ret = migrate_pages(, alloc_migration_target, NULL, > - (unsigned long), MIGRATE_SYNC, MR_MEMORY_HOTPLUG); > + (unsigned long), MIGRATE_SYNC, MR_MEMORY_HOTPLUG, > + NULL); > if (ret) { > list_for_each_entry(page, , lru) { > pr_warn("migrating pfn %lx failed ret:%d ", > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index ab51132547b8..df260ed12102 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1103,7 +1103,8 @@ static int migrate_to_node(struct mm_struct *mm, int > source, int dest, > > if (!list_empty()) { > err = migrate_pages(, alloc_migration_target, NULL, > - (unsigned long), MIGRATE_SYNC, MR_SYSCALL); > + (unsigned long), MIGRATE_SYNC, MR_SYSCALL, > + NULL); > if (err) > putback_m
Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
if (!list_empty()) { err = migrate_pages(, alloc_migration_target, NULL, - (unsigned long), MIGRATE_SYNC, MR_SYSCALL); + (unsigned long), MIGRATE_SYNC, MR_SYSCALL, + NULL); if (err) putback_movable_pages(); } @@ -1355,7 +1356,8 @@ static long do_mbind(unsigned long start, unsigned long len, if (!list_empty()) { WARN_ON_ONCE(flags & MPOL_MF_LAZY); nr_failed = migrate_pages(, new_page, NULL, - start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND); + start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND, + NULL); if (nr_failed) putback_movable_pages(); } diff --git a/mm/migrate.c b/mm/migrate.c index 695a594e5860..087ed407b3ce 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1493,6 +1493,9 @@ static inline int try_split_thp(struct page *page, struct page **page2, * @mode: The migration mode that specifies the constraints for * page migration, if any. * @reason:The reason for page migration. + * @ret_succeeded: A pointer to place the value of the number of pages + * migrated successfully. The caller must pass a valid pointer + * if they care about it. * * The function returns after 10 attempts or if no pages are movable any more * because the list has become empty or no retryable pages exist any more. @@ -1503,7 +1506,7 @@ static inline int try_split_thp(struct page *page, struct page **page2, */ int migrate_pages(struct list_head *from, new_page_t get_new_page, free_page_t put_new_page, unsigned long private, - enum migrate_mode mode, int reason) + enum migrate_mode mode, int reason, unsigned int *ret_succeeded) { int retry = 1; int thp_retry = 1; @@ -1654,6 +1657,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, if (!swapwrite) current->flags &= ~PF_SWAPWRITE; + if (ret_succeeded) + *ret_succeeded = nr_succeeded; + return rc; } @@ -1723,7 +1729,8 @@ static int do_move_pages_to_node(struct mm_struct *mm, }; err = migrate_pages(pagelist, alloc_migration_target, NULL, - (unsigned long), MIGRATE_SYNC, MR_SYSCALL); + (unsigned long), MIGRATE_SYNC, MR_SYSCALL, + NULL); if (err) putback_movable_pages(pagelist); return err; @@ -2230,7 +2237,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, list_add(>lru, ); nr_remaining = migrate_pages(, alloc_misplaced_dst_page, NULL, node, MIGRATE_ASYNC, -MR_NUMA_MISPLACED); +MR_NUMA_MISPLACED, NULL); if (nr_remaining) { if (!list_empty()) { list_del(>lru); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 46f3d594369d..0c1bbadd5ca3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, cc->nr_migratepages -= nr_reclaimed; ret = migrate_pages(>migratepages, alloc_migration_target, - NULL, (unsigned long), cc->mode, MR_CONTIG_RANGE); + NULL, (unsigned long), cc->mode, MR_CONTIG_RANGE, + NULL); } if (ret < 0) { putback_movable_pages(>migratepages); As I said I might be missing a point here, but I cannot see the problem you describe here. -- Oscar Salvador SUSE L3
[PATCH v7 8/8] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- arch/arm64/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 5656e7aacd69..0e23acd173f0 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + def_bool y + config SMP def_bool y -- 2.16.3
[PATCH v7 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- arch/x86/Kconfig | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2792879d398e..9f0211df1746 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE def_bool y depends on MEMORY_HOTPLUG +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + def_bool y + config USE_PERCPU_NUMA_NODE_ID def_bool y depends on NUMA -- 2.16.3
[PATCH v7 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by checking mhp_supports_memmap_on_memory(). MHP_MEMMAP_ON_MEMORY can only be set in case ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports altmap, and the range to be added spans a single memory block. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- drivers/acpi/acpi_memhotplug.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index b02fd51e5589..8cc195c4c861 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) acpi_handle handle = mem_device->device->handle; int result, num_enabled = 0; struct acpi_memory_info *info; + mhp_t mhp_flags = MHP_NONE; int node; node = acpi_get_node(handle); @@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (node < 0) node = memory_add_physaddr_to_nid(info->start_addr); + if (mhp_supports_memmap_on_memory(info->length)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; result = __add_memory(node, info->start_addr, info->length, - MHP_NONE); + mhp_flags); /* * If the memory block has been used by the kernel, add_memory() -- 2.16.3
[PATCH v7 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
Self stored memmap leads to a sparse memory situation which is unsuitable for workloads that requires large contiguous memory chunks, so make this an opt-in which needs to be explicitly enabled. To control this, let memory_hotplug have its own memory space, as suggested by David, so we can add memory_hotplug.memmap_on_memory parameter. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand Acked-by: Michal Hocko --- Documentation/admin-guide/kernel-parameters.txt | 17 + mm/Makefile | 5 - mm/memory_hotplug.c | 10 +- 3 files changed, 30 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..af32c17cd4eb 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2794,6 +2794,23 @@ seconds. Use this parameter to check at some other rate. 0 disables periodic checking. + memory_hotplug.memmap_on_memory + [KNL,X86,ARM] Boolean flag to enable this feature. + Format: {on | off (default)} + When enabled, runtime hotplugged memory will + allocate its internal metadata (struct pages) + from the hotadded memory which will allow to + hotadd a lot of memory without requiring + additional memory to do so. + This feature is disabled by default because it + has some implication on large (e.g. GB) + allocations in some configurations (e.g. small + memory blocks). + The state of the flag can be read in + /sys/module/memory_hotplug/parameters/memmap_on_memory. + Note that even when enabled, there are a few cases where + the feature is not effective. + memtest=[KNL,X86,ARM,PPC] Enable memtest Format: default : 0 diff --git a/mm/Makefile b/mm/Makefile index 72227b24a616..82ae9482f5e3 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o oom_kill.o fadvise.o \ page-alloc-y := page_alloc.o page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o +# Give 'memory_hotplug' its own module-parameter namespace +memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o + obj-y += page-alloc.o obj-y += init-mm.o obj-y += memblock.o +obj-y += $(memory-hotplug-y) ifdef CONFIG_MMU obj-$(CONFIG_ADVISE_SYSCALLS) += madvise.o @@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o obj-$(CONFIG_KASAN)+= kasan/ obj-$(CONFIG_KFENCE) += kfence/ obj-$(CONFIG_FAILSLAB) += failslab.o -obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o obj-$(CONFIG_MEMTEST) += memtest.o obj-$(CONFIG_MIGRATION) += migrate.o obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b48067839f90..ccbdde94c488 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -42,7 +42,15 @@ #include "internal.h" #include "shuffle.h" -static bool memmap_on_memory; + +/* + * memory_hotplug.memmap_on_memory parameter + */ +static bool memmap_on_memory __ro_after_init; +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +module_param(memmap_on_memory, bool, 0444); +MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); +#endif /* * online_page_callback contains pointer to current page onlining function. -- 2.16.3
[PATCH v7 4/8] mm,memory_hotplug: Allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for the newly added memory section. Currently, alloc_pages_node() is used for those allocations. This has some disadvantages: a) an existing memory is consumed for that purpose (eg: ~2MB per 128MB memory section on x86_64) b) if the whole node is movable then we have off-node struct pages which has performance drawbacks. c) It might be there are no PMD_ALIGNED chunks so memmap array gets populated with base pages. This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled. Vmemap page tables can map arbitrary memory. That means that we can simply use the beginning of each memory section and map struct pages there. struct pages which back the allocated space then just need to be treated carefully. Implementation wise we will reuse vmem_altmap infrastructure to override the default allocator used by __populate_section_memmap. Part of the implementation also relies on memory_block structure gaining a new field which specifies the number of vmemmap_pages at the beginning. This patch also introduces the following functions: - vmemmap_init_space: Initializes vmemmap pages by calling move_pfn_range_to_zone(), calls kasan_add_zero_shadow() or the vmemmap range and marks online as many sections as vmemmap pages fully span. - vmemmap_adjust_pages: Accounts/substract vmemmap_pages to node and zone present_pages - vmemmap_deinit_space: Undoes what vmemmap_init_space does. The new function memory_block_online() calls vmemmap_init_space() before doing the actual online_pages(). Should online_pages() fail, we clean up by calling vmemmap_adjust_pages() and vmemmap_deinit_space(). On offline, memory_block_offline() calls vmemmap_adjust_pages() prior to calling offline_pages(), because offline_pages() performs the tearing-down of kthreads and the rebuilding of the zonelists if the node/zone become empty. If offline_pages() fails, we account back vmemmap pages by vmemmap_adjust_pages(). If it succeeds, we call vmemmap_deinit_space(). Hot-remove: We need to be careful when removing memory, as adding and removing memory needs to be done with the same granularity. To check that this assumption is not violated, we check the memory range we want to remove and if a) any memory block has vmemmap pages and b) the range spans more than a single memory block, we scream out loud and refuse to proceed. If all is good and the range was using memmap on memory (aka vmemmap pages), we construct an altmap structure so free_hugepage_table does the right thing and calls vmem_altmap_free instead of free_pagetable. Signed-off-by: Oscar Salvador --- drivers/base/memory.c | 64 ++-- include/linux/memory.h | 8 +- include/linux/memory_hotplug.h | 13 include/linux/memremap.h | 2 +- include/linux/mmzone.h | 7 +- mm/Kconfig | 5 ++ mm/memory_hotplug.c| 162 - mm/sparse.c| 2 - 8 files changed, 247 insertions(+), 16 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f209925a5d4e..a5e536a3e9a4 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -173,16 +173,65 @@ static int memory_block_online(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + int ret; + + /* +* Although vmemmap pages have a different lifecycle than the pages +* they describe (they remain until the memory is unplugged), doing +* its initialization and accounting at hot-{online,offline} stage +* simplifies things a lot +*/ + if (nr_vmemmap_pages) { + ret = vmemmap_init_space(start_pfn, nr_vmemmap_pages, mem->nid, +mem->online_type); + if (ret) + return ret; + } - return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); + ret = online_pages(start_pfn + nr_vmemmap_pages, + nr_pages - nr_vmemmap_pages, mem->online_type, + mem->nid); + + /* +* Undo the work if online_pages() fails. +*/ + if (ret && nr_vmemmap_pages) { + vmemmap_adjust_pages(start_pfn, -nr_vmemmap_pages); + vmemmap_deinit_space(start_pfn, nr_vmemmap_pages); + } + + return ret; } static int memory_block_offline(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + unsigned long nr_vmemmap_pages = mem
[PATCH v7 2/8] mm,memory_hotplug: Relax fully spanned sections check
When using self-hosted vmemmap pages, the number of pages passed to {online,offline}_pages might not fully span sections, but they always fully span pageblocks. Relax the check account for that case. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- mm/memory_hotplug.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0cdbbfbc5757..25e59d5dc13c 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int ret; struct memory_notify arg; - /* We can only online full sections (e.g., SECTION_IS_ONLINE) */ + /* We can only offline full sections (e.g., SECTION_IS_ONLINE). +* However, when using e.g: memmap_on_memory, some pages are initialized +* prior to calling in here. The remaining amount of pages must be +* pageblock aligned. +*/ if (WARN_ON_ONCE(!nr_pages || -!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION))) +!IS_ALIGNED(pfn, pageblock_nr_pages) || +!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); @@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) int ret, node; char *reason; - /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */ + /* We can only offline full sections (e.g., SECTION_IS_ONLINE). +* However, when using e.g: memmap_on_memory, some pages are initialized +* prior to calling in here. The remaining amount of pages must be +* pageblock aligned. +*/ if (WARN_ON_ONCE(!nr_pages || -!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION))) +!IS_ALIGNED(start_pfn, pageblock_nr_pages) || +!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION))) return -EINVAL; mem_hotplug_begin(); -- 2.16.3
[PATCH v7 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()
From: David Hildenbrand Let's have a single place (inspired by adjust_managed_page_count()) where we adjust present pages. In contrast to adjust_managed_page_count(), only memory onlining/offlining is allowed to modify the number of present pages. Signed-off-by: David Hildenbrand Signed-off-by: Oscar Salvador Reviewed-by: Oscar Salvador --- mm/memory_hotplug.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 25e59d5dc13c..d05056b3c173 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, return default_zone_for_pfn(nid, start_pfn, nr_pages); } +static void adjust_present_page_count(struct zone *zone, long nr_pages) +{ + unsigned long flags; + + zone->present_pages += nr_pages; + pgdat_resize_lock(zone->zone_pgdat, ); + zone->zone_pgdat->node_present_pages += nr_pages; + pgdat_resize_unlock(zone->zone_pgdat, ); +} + int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type, int nid) { @@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, } online_pages_range(pfn, nr_pages); - zone->present_pages += nr_pages; - - pgdat_resize_lock(zone->zone_pgdat, ); - zone->zone_pgdat->node_present_pages += nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, ); + adjust_present_page_count(zone, nr_pages); node_states_set_node(nid, ); if (need_zonelists_rebuild) @@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) /* removal success */ adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages); - zone->present_pages -= nr_pages; - - pgdat_resize_lock(zone->zone_pgdat, ); - zone->zone_pgdat->node_present_pages -= nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, ); + adjust_present_page_count(zone, -nr_pages); init_per_zone_wmark_min(); -- 2.16.3
[PATCH v7 1/8] drivers/base/memory: Introduce memory_block_{online,offline}
This is a preparatory patch that introduces two new functions: memory_block_online() and memory_block_offline(). For now, these functions will only call online_pages() and offline_pages() respectively, but they will be later in charge of preparing the vmemmap pages, carrying out the initialization and proper accounting of such pages. Since memory_block struct contains all the information, pass this struct down the chain till the end functions. Signed-off-by: Oscar Salvador Reviewed-by: David Hildenbrand --- drivers/base/memory.c | 33 + 1 file changed, 21 insertions(+), 12 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f35298425575..f209925a5d4e 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v) return blocking_notifier_call_chain(_chain, val, v); } +static int memory_block_online(struct memory_block *mem) +{ + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + + return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid); +} + +static int memory_block_offline(struct memory_block *mem) +{ + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); + unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; + + return offline_pages(start_pfn, nr_pages); +} + /* * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is * OK to have direct references to sparsemem variables in here. */ static int -memory_block_action(unsigned long start_section_nr, unsigned long action, - int online_type, int nid) +memory_block_action(struct memory_block *mem, unsigned long action) { - unsigned long start_pfn; - unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; int ret; - start_pfn = section_nr_to_pfn(start_section_nr); - switch (action) { case MEM_ONLINE: - ret = online_pages(start_pfn, nr_pages, online_type, nid); + ret = memory_block_online(mem); break; case MEM_OFFLINE: - ret = offline_pages(start_pfn, nr_pages); + ret = memory_block_offline(mem); break; default: WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: " -"%ld\n", __func__, start_section_nr, action, action); +"%ld\n", __func__, mem->start_section_nr, action, action); ret = -EINVAL; } @@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block *mem, if (to_state == MEM_OFFLINE) mem->state = MEM_GOING_OFFLINE; - ret = memory_block_action(mem->start_section_nr, to_state, - mem->online_type, mem->nid); - + ret = memory_block_action(mem, to_state); mem->state = ret ? from_state_req : to_state; return ret; -- 2.16.3
[PATCH v7 0/8] Allocate memmap from hotadded memory (per device)
Hi, I decided to send another version with the fixups included as it seemed a bit awkward otherwise. It should ease the review. Sorry for the spam. Changes from v6 -> v7: - Fix check from "mm,memory_hotplug: Relax fully spanned sections check" - Add fixup from "mm,memory_hotplug: Allocate memmap from the added memory range" - Add Reviewed-by from David for patch#2 - Fix changelog in "mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()" Changes from v5 -> v6: - Create memory_block_{online,offline} functions - Create vmemmap_* functions to deal with vmemmap stuff, so {online,offline}_pages remain untouched - Add adjust_present_page_count's patch from David - Relax check in {offline,online}_pages - Rework changelogs Changes from v4 -> v5: - Addressed feedback from David (patch#1) - Tested on x86_64 with different struct page sizes and on large/small memory blocks - Tested on arm64 with 4K, 64K (with and without THP) and with different struct page sizes NOTE: We might need to make this feature and hugetlb-vmemmap feature [1] mutually exclusive. I raised an issue I see in [2]. Hugetlb-vmemmap feature has been withdrawn for the time being due to the need in further changes wrt. locking/freeing context. I will keep an eye, and when the time comes again I will see how the two features play together and how one another can be disabled when needed. Changes from v3 -> v4: - Addressed feedback from David - Wrap memmap_on_memory module thingy with #ifdef on MHP_MEMMAP_ON_MEMORY - Move "depend on MEMORY_HOTPLUG" to MHP_MEMMAP_ON_MEMORY in generic mm/Kconfig - Collect David's Reviewed-bys Changes from v2 -> v3: - Addressed feedback from David - Squash former patch#4 and and patch#5 into patch#1 - Fix config dependency CONFIR_SPARSE_VMEMMAP vs CONFIG_SPARSE_VMEMMAP_ENABLE - Simplify module parameter functions Changes from v1 -> v2 - Addressed feedback from David - Fence off the feature in case struct page size is not multiple of PMD size or pageblock alignment cannot be guaranted - Tested on x86_64 small and large memory_blocks - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot my VM with 16KB page size). Arm64 with 4KB page size behaves like x86_64 after [1], which made section size smaller. With 64KB, the feature gets fenced off due to pageblock alignment. Changes from RFCv3 -> v1: - Addressed feedback from David - Re-order patches Changes from v2 -> v3 (RFC): - Re-order patches (Michal) - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1 - Add kernel boot option to enable this feature (Michal) Changes from v1 -> v2 (RFC): - Addressed feedback provided by David - Add a arch_support_memmap_on_memory to be called from mhp_supports_memmap_on_memory, as atm, only ARM, powerpc and x86_64 have altmat support. [1] https://lore.kernel.org/lkml/cover.1611206601.git.sudar...@codeaurora.org Original cover letter: The primary goal of this patchset is to reduce memory overhead of the hot-added memory (at least for SPARSEMEM_VMEMMAP memory model). The current way we use to populate memmap (struct page array) has two main drawbacks: a) it consumes an additional memory until the hotadded memory itself is onlined and b) memmap might end up on a different numa node which is especially true for movable_node configuration. c) due to fragmentation we might end up populating memmap with base pages One way to mitigate all these issues is to simply allocate memmap array (which is the largest memory footprint of the physical memory hotplug) from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows us to map any pfn range so the memory doesn't need to be online to be usable for the array. See patch 4 for more details. This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set. [Overall design]: Implementation wise we reuse vmem_altmap infrastructure to override the default allocator used by vmemap_populate. memory_block structure gains a new field called nr_vmemmap_pages, which accounts for the number of vmemmap pages used by that memory_block. E.g: On x86_64, that is 512 vmemmap pages on small memory bloks and 4096 on large memory blocks (1GB) We also introduce new two functions: memory_block_{online,offline}. These functions take care of initializing/unitializing vmemmap pages prior to calling {online,offline}_pages, so the latter functions can remain totally untouched. More details can be found in the respective changelogs. David Hildenbrand (1): mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count() Oscar Salvador (7): drivers/base/memory: Introduce memory_block_{online,offline} mm,memory_hotplug: Relax fully spanned sections check mm,memory_h
Re: [PATCH 05/10] mm/migrate: demote pages during reclaim
On Thu, Apr 01, 2021 at 11:32:25AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > This is mostly derived from a patch from Yang Shi: > > > https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/ > > Add code to the reclaim path (shrink_page_list()) to "demote" data > to another NUMA node instead of discarding the data. This always > avoids the cost of I/O needed to read the page back in and sometimes > avoids the writeout cost when the pagee is dirty. > > A second pass through shrink_page_list() will be made if any demotions > fail. This essentally falls back to normal reclaim behavior in the > case that demotions fail. Previous versions of this patch may have > simply failed to reclaim pages which were eligible for demotion but > were unable to be demoted in practice. > > Note: This just adds the start of infratructure for migration. It is > actually disabled next to the FIXME in migrate_demote_page_ok(). > > Signed-off-by: Dave Hansen > Cc: Wei Xu > Cc: Yang Shi > Cc: David Rientjes > Cc: Huang Ying > Cc: Dan Williams > Cc: osalvador Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded
On Thu, Apr 01, 2021 at 11:32:23AM -0700, Dave Hansen wrote: > > From: Yang Shi > > The migrate_pages() returns the number of pages that were not migrated, > or an error code. When returning an error code, there is no way to know > how many pages were migrated or not migrated. > > In the following patch, migrate_pages() is used to demote pages to PMEM > node, we need account how many pages are reclaimed (demoted) since page > reclaim behavior depends on this. Add *nr_succeeded parameter to make > migrate_pages() return how many pages are demoted successfully for all > cases. > > Signed-off-by: Yang Shi > Signed-off-by: Dave Hansen > Reviewed-by: Yang Shi > Cc: Wei Xu > Cc: Huang Ying > Cc: Dan Williams > Cc: David Hildenbrand > Cc: osalvador > ... > int migrate_pages(struct list_head *from, new_page_t get_new_page, > free_page_t put_new_page, unsigned long private, > - enum migrate_mode mode, int reason) > + enum migrate_mode mode, int reason, unsigned int *nr_succeeded) > { > int retry = 1; > int thp_retry = 1; > int nr_failed = 0; > - int nr_succeeded = 0; > int nr_thp_succeeded = 0; > int nr_thp_failed = 0; > int nr_thp_split = 0; > @@ -1611,10 +1611,10 @@ retry: > case MIGRATEPAGE_SUCCESS: > if (is_thp) { > nr_thp_succeeded++; > - nr_succeeded += nr_subpages; > + *nr_succeeded += nr_subpages; > break; > } > - nr_succeeded++; > + (*nr_succeeded)++; > break; > default: > /* > @@ -1643,12 +1643,12 @@ out: >*/ > list_splice(_pages, from); > > - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); > + count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded); > count_vm_events(PGMIGRATE_FAIL, nr_failed); > count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded); > count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed); > count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split); > - trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded, > + trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded, > nr_thp_failed, nr_thp_split, mode, reason); It seems that reclaiming is the only user who cared about how many pages could we migrated, could not do the following instead: diff --git a/mm/migrate.c b/mm/migrate.c index 695a594e5860..d4170b7ea2fe 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1503,7 +1503,7 @@ static inline int try_split_thp(struct page *page, struct page **page2, */ int migrate_pages(struct list_head *from, new_page_t get_new_page, free_page_t put_new_page, unsigned long private, - enum migrate_mode mode, int reason) + enum migrate_mode mode, int reason, unsigned int *ret_succeeded) { int retry = 1; int thp_retry = 1; @@ -1654,6 +1654,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, if (!swapwrite) current->flags &= ~PF_SWAPWRITE; + if (ret_succedded) + *ret_succedded = nr_succedded; + return rc; } And pass only a valid pointer from demote_page_list() and NULL from all the others? I was just wondered after all those "unsigned int nr_succedded" in all other functions. This would also solve the "be careful to initialize nr_succedded" problem? -- Oscar Salvador SUSE L3
Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events
On Thu, Apr 01, 2021 at 11:32:21AM -0700, Dave Hansen wrote: > > From: Dave Hansen > > Reclaim-based migration is attempting to optimize data placement in > memory based on the system topology. If the system changes, so must > the migration ordering. > > The implementation is conceptually simple and entirely unoptimized. > On any memory or CPU hotplug events, assume that a node was added or > removed and recalculate all migration targets. This ensures that the > node_demotion[] array is always ready to be used in case the new > reclaim mode is enabled. > > This recalculation is far from optimal, most glaringly that it does > not even attempt to figure out the hotplug event would have some > *actual* effect on the demotion order. But, given the expected > paucity of hotplug events, this should be fine. > > === What does RCU provide? === > > Imaginge a simple loop which walks down the demotion path looking > for the last node: > > terminal_node = start_node; > while (node_demotion[terminal_node] != NUMA_NO_NODE) { > terminal_node = node_demotion[terminal_node]; > } > > The initial values are: > > node_demotion[0] = 1; > node_demotion[1] = NUMA_NO_NODE; > > and are updated to: > > node_demotion[0] = NUMA_NO_NODE; > node_demotion[1] = 0; > > What guarantees that the loop did not observe: > > node_demotion[0] = 1; > node_demotion[1] = 0; > > and would loop forever? > > With RCU, a rcu_read_lock/unlock() can be placed around the > loop. Since the write side does a synchronize_rcu(), the loop > that observed the old contents is known to be complete after the > synchronize_rcu() has completed. > > RCU, combined with disable_all_migrate_targets(), ensures that > the old migration state is not visible by the time > __set_migration_target_nodes() is called. > > === What does READ_ONCE() provide? === > > READ_ONCE() forbids the compiler from merging or reordering > successive reads of node_demotion[]. This ensures that any > updates are *eventually* observed. > > Consider the above loop again. The compiler could theoretically > read the entirety of node_demotion[] into local storage > (registers) and never go back to memory, and *permanently* > observe bad values for node_demotion[]. > > Note: RCU does not provide any universal compiler-ordering > guarantees: > > https://lore.kernel.org/lkml/20150921204327.gh4...@linux.vnet.ibm.com/ > > Signed-off-by: Dave Hansen > Reviewed-by: Yang Shi > Cc: Wei Xu > Cc: David Rientjes > Cc: Huang Ying > Cc: Dan Williams > Cc: David Hildenbrand > Cc: osalvador > ... > +#if defined(CONFIG_MEMORY_HOTPLUG) I am not really into PMEM, and I ignore whether we need CONFIG_MEMORY_HOTPLUG in order to have such memory on the system. If so, the following can be partly ignored. I think that you either want to check CONFIG_MEMORY_HOTPLUG + CONFIG_CPU_HOTPLUG, or just do not put it under any conf dependency. The thing is that migrate_on_reclaim_init() will only be called if we have CONFIG_MEMORY_HOTPLUG, and when we do not have that (but we do have CONFIG_CPU_HOTPLUG) the calls to set_migration_target_nodes() wont't be made when the system brings up the CPUs during the boot phase, which means node_demotion[] list won't be initialized. But this brings me to the next point. >From a conceptual point of view, I think you want to build the node_demotion[] list, being orthogonal to it whether we support CPU Or MEMORY hotplug. Now, in case we support CPU or MEMORY hotplug, we do want to be able to re-build the list for .e.g: in case NUMA nodes become cpu-less or memory-less. On x86_64, CPU_HOTPLUG is enabled by default if SMP, the same for MEMORY_HOTPLUG, but I am not sure about other archs. Can we have !CPU_HOTPLUG && MEMORY_HOTPLUG, !MEMORY_HOTPLUG && CPU_HOTPLUG? I do now really know, but I think you should be careful about that. If this was my call, I would: - Do not place the burden to initialize node_demotion[] list in CPU hotplug boot phase (or if so, be carefull because if I disable MEMORY_HOTPLUG, I end up with no demotion_list[]) - Diferentiate between migration_{online,offline}_cpu and migrate_on_reclaim_callback() and place them under their respective configs-dependency. But I might be missing some details so I might be off somewhere. Another thing that caught my eye is that we are calling set_migration_target_nodes() for every CPU the system brings up at boot phase. I know systems with *lots* of CPUs. I am not sure whether we have a mechanism to delay that until all CPUs that are meant to be online are online? (after boot?) That's probably happening in wonderland, but was just speaking out loud. (Of course the same happen with memory_hotplug acpi operations. All it takes is some qemu-handling) -- Oscar Salvador SUSE L3