Re: [linus:master] [mm] d99e3140a4: BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report

2024-05-28 Thread Oscar Salvador
On Tue, May 28, 2024 at 09:43:39AM +0200, David Hildenbrand wrote:
> Likely that's just a page_type check racing against concurrent
> mapcount changes.
> 
> In __folio_rmap_sanity_checks() we check
>   VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);

Yeah, and that "collides" with

last = atomic_add_negative(-1, &page->_mapcount)

from __folio_remove_rmap.
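
For anyone trying to picture the collision: page_type and _mapcount overlay
the same word in struct page, so a plain read of one can tear against a
concurrent atomic update of the other. A toy, userspace-compilable sketch of
the shape of the problem (simplified; not the real struct page layout nor the
real predicate, and PAGE_TYPE_BASE below is illustrative only):

#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's READ_ONCE(). */
#define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

/* Illustrative value; see include/linux/page-flags.h for the real encoding. */
#define PAGE_TYPE_BASE 0xf0000000u

/* Toy overlay: page_type and _mapcount share one word, as in struct page. */
struct page_stub {
	union {
		atomic_int _mapcount;	/* rmap side: atomic_add_negative(-1, ...) */
		unsigned int page_type;	/* page_has_type()/folio_test_type() readers */
	};
};

/* Reader side: a single, tearing-free load is all that is needed, which is
 * what reading page_type with READ_ONCE() gives us, and what keeps KCSAN quiet. */
bool page_stub_has_type(const struct page_stub *page)
{
	unsigned int type = READ_ONCE(page->page_type);

	return (type & PAGE_TYPE_BASE) == PAGE_TYPE_BASE;
}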

> To make sure we don't get hugetlb folios in the wrong rmap code path. That
> can easily race with concurrent mapcount changes, just like any other
> page_type checks that end up in folio_test_type/page_has_type e.g., from
> PFN walkers.
> 
> Load tearing in these functions shouldn't really result in false positives
> (what we care about), but READ_ONCE shouldn't hurt or make a difference.
> 
> 
> From b03dc9bf27571442d886d8da624a4e4f737433f2 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand 
> Date: Tue, 28 May 2024 09:37:20 +0200
> Subject: [PATCH] mm: read page_type using READ_ONCE
> 
> KCSAN complains about possible data races: while we check for a
> page_type -- for example for sanity checks -- we might concurrently
> modify the mapcount that overlays page_type.
> 
> Let's use READ_ONCE to avoid load tearing (shouldn't make a difference)
> and to make KCSAN happy.
> 
> Note: nothing should really be broken besides wrong KCSAN complaints.
> 
> Reported-by: kernel test robot 
> Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-...@intel.com
> Signed-off-by: David Hildenbrand 

Reviewed-by: Oscar Salvador 

Thanks!

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v3 5/9] riscv: mm: Add memory hotplugging support

2024-05-21 Thread Oscar Salvador
On Tue, May 21, 2024 at 03:19:37PM +0200, Alexandre Ghiti wrote:
> On Tue, May 21, 2024 at 1:49 PM Björn Töpel  wrote:
> > +   if (PageReserved(page)) {
> > +   __ClearPageReserved(page);
> 
> What's the difference between __ClearPageReserved() and
> ClearPageReserved()? Because it seems like free_reserved_page() calls
> the latter already, so why would you need to call
> __ClearPageReserved() on the first page?

__{Set,Clear}Page* are the non-atomic versions.
They are usually used when you know that no one else can fiddle with the page,
which should be the case here since we are removing the memory.
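
Roughly, the difference boils down to the following (simplified sketch; the
real accessors are generated by the PAGEFLAGS machinery in
include/linux/page-flags.h and operate on page->flags):

/* Non-atomic flavour (__ClearPageReserved): a plain read-modify-write on the
 * flags word. Only safe while nobody else can touch the page concurrently. */
void __clear_flag(unsigned long *flags, int bit)
{
	*flags &= ~(1UL << bit);
}

/* Atomic flavour (ClearPageReserved): an atomic RMW, safe against concurrent
 * updates of other bits in the same word, but more expensive. */
void clear_flag(unsigned long *flags, int bit)
{
	__atomic_fetch_and(flags, ~(1UL << bit), __ATOMIC_RELAXED);
}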

As to why we have __ClearPageReserved and then have
free_reserved_page() call ClearPageReserved, I do not really know.
Looking at the history, it has always been like this.

I remember I looked at this a few years ago but I cannot remember the outcome
of that.

Maybe David remembers better, but I think we could remove that
__ClearPageReserved.
Looking at the powerpc implementation, it does not do the
__ClearPageReserved and relies only on free_reserved_page().

I will have a look.

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:44PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
> 
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
>   select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>   select ARCH_DMA_DEFAULT_COHERENT
>   select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

Hopefully this should be SPARSEMEM_VMEMMAP.
We are trying to deprecate memory-hotplug on !SPARSEMEM_VMEMMAP.

And it is always easier to do it now than when the code is already in,
so please consider whether you really need plain SPARSEMEM and why (I do not
think you do).

 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:43PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse), by holding the
> memory hotplug read-lock while traversing the page table.
> 
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
> 
> Signed-off-by: Björn Töpel 

Reviewed-by: Oscar Salvador 

Funnily enough, it seems arm64 and riscv are the only ones taking the
hotplug lock here.
I think we have the same problem on the other architectures as well (at least
on x86_64, as far as I can see).

If we do end up needing the lock on those, I would rather have a
central function in the generic mm code that takes the lock and then
calls an arch-specific ptdump_show function, so the locking is not
scattered around. But that is another story.
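
Something along these lines (rough sketch only; arch_ptdump_show() is a
made-up hook name for illustration):

static int ptdump_show_locked(struct seq_file *m, void *v)
{
	get_online_mems();		/* memory hotplug read-lock */
	arch_ptdump_show(m, v);		/* arch-specific page-table walk */
	put_online_mems();

	return 0;
}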

 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +struct vmem_altmap *altmap)
> +{
> + if (altmap)
> + vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> + else
> + free_pages((unsigned long)page_address(page), get_order(size));

David already pointed this out, but you can check
arch/x86/mm/init_64.c:free_pagetable().

You will see that we have to do some magic for bootmem memory (DIMMs
which were not hotplugged but were already present at boot).

> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct 
> vmem_altmap *altmap)
> +{
> + remove_pgd_mapping(start, end, true, altmap);
> +}
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +#endif /* CONFIG_MEMORY_HOTPLUG */

I will comment on the patch where you add support for hotplug and the
dependency, but in a track at LSFMM today we decided that most likely
we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
environments.
So, since you are adding this fresh, please consider tying the
hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
As a bonus, you will only have to maintain one flavour of functions.


-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:41PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
> 
> Signed-off-by: Björn Töpel 

Reviewed-by: Oscar Salvador 

> ---
>  arch/riscv/mm/init.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>  
>  static void __meminit create_linear_mapping_range(phys_addr_t start, 
> phys_addr_t end,
> -   uintptr_t fixed_map_size)
> +   uintptr_t fixed_map_size, 
> const pgprot_t *pgprot)
>  {
>   phys_addr_t pa;
>   uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit 
> create_linear_mapping_range(phys_addr_t start, phys_addr_t
>   best_map_size(pa, va, end - pa);
>  
>   create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -pgprot_from_va(va));
> +pgprot ? *pgprot : pgprot_from_va(va));
>   }
>  }
>  
> @@ -1282,22 +1282,19 @@ static void __init 
> create_linear_mapping_page_table(void)
>   if (end >= __pa(PAGE_OFFSET) + memory_limit)
>   end = __pa(PAGE_OFFSET) + memory_limit;
>  
> - create_linear_mapping_range(start, end, 0);
> + create_linear_mapping_range(start, end, 0, NULL);
>   }
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> - create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> - create_linear_mapping_range(krodata_start,
> - krodata_start + krodata_size, 0);
> + create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, 
> NULL);
> + create_linear_mapping_range(krodata_start, krodata_start + 
> krodata_size, 0, NULL);
>  
>   memblock_clear_nomap(ktext_start,  ktext_size);
>   memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>  
>  #ifdef CONFIG_KFENCE
> - create_linear_mapping_range(kfence_pool,
> - kfence_pool + KFENCE_POOL_SIZE,
> - PAGE_SIZE);
> + create_linear_mapping_range(kfence_pool, kfence_pool + 
> KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>  
>   memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> -- 
> 2.40.1
> 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
> 
> Changing the __init attribute to __meminit, avoids that the functions
> are removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
> 
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
> 
> Signed-off-by: Björn Töpel 

Reviewed-by: Oscar Salvador 

> +static void __meminit create_linear_mapping_range(phys_addr_t start, 
> phys_addr_t end,
> +   uintptr_t fixed_map_size)
>  {
>   phys_addr_t pa;
>   uintptr_t va, map_size;
> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, 
> unsigned long end, int node,
>* memory hotplug, we are not able to update all the page tables with
>* the new PMDs.
>*/
> - return vmemmap_populate_hugepages(start, end, node, NULL);
> + return vmemmap_populate_hugepages(start, end, node, altmap);

I would have put this into a separate patch.
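
On the __init vs __meminit part of the changelog above: for reference, the
attribute boils down to roughly this in include/linux/init.h (paraphrasing
from memory, so double-check the exact #ifdefs):

#if defined(CONFIG_MEMORY_HOTPLUG)
#define __meminit			/* kept in .text after boot */
#define __meminitdata
#else
#define __meminit	__init		/* discarded after boot */
#define __meminitdata	__initdata
#endif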

 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v20 8/9] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled

2021-04-20 Thread Oscar Salvador
On Thu, Apr 15, 2021 at 04:40:04PM +0800, Muchun Song wrote:
>  bool mhp_supports_memmap_on_memory(unsigned long size)
>  {
> + bool supported;
>   unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>   unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>   unsigned long remaining_size = size - vmemmap_size;
> @@ -1011,11 +1012,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>*   altmap as an alternative source of memory, and we do not 
> exactly
>*   populate a single PMD.
>*/
> - return memmap_on_memory &&
> -IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> -size == memory_block_size_bytes() &&
> -IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> -IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);
> + supported = memmap_on_memory &&
> + IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
> + size == memory_block_size_bytes() &&
> + IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> + IS_ALIGNED(remaining_size, pageblock_nr_pages << 
> PAGE_SHIFT);
> +
> + if (supported && is_hugetlb_free_vmemmap_enabled()) {
> + pr_info("Cannot enable memory_hotplug.memmap_on_memory, it is 
> not compatible with hugetlb_free_vmemmap\n");
> + supported = false;
> + }

I would not print anything and rather have

return memmap_on_memory &&
       !is_hugetlb_free_vmemmap_enabled() &&
       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
       size == memory_block_size_bytes() &&
       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
       IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);

Documentation/admin-guide/kernel-parameters.txt already provides an
explanation on memory_hotplug.memmap_on_memory parameter that states
that the feature cannot be enabled when using hugetlb-vmemmap
optimization.

Users can always check whether the feature is enabled via
/sys/modules/memory_hotplug/parameters/memmap_on_memory.

Also, I did not check whether it already is, but if not, the incompatibility
between hugetlb-vmemmap and hotplug-vmemmap should also be called out in the
hugetlb-vmemmap kernel parameter documentation.

Thanks

-- 
Oscar Salvador
SUSE L3


[PATCH v10 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

2021-04-19 Thread Oscar Salvador
pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.
We can drop the in-use page check since __alloc_contig_pages can migrate
away those pages, and the hugetlb page check can go too since
isolate_migratepages_range is now capable of dealing with hugetlb pages.
Either way, those checks are racy so let the end function handle it
when the time comes.

Signed-off-by: Oscar Salvador 
Suggested-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Acked-by: Mike Kravetz 
Acked-by: Michal Hocko 
---
 mm/page_alloc.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5a94de3cdde..c5338e912ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, 
unsigned long start_pfn,
 
if (PageReserved(page))
return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
}
return true;
 }
-- 
2.16.3



[PATCH v10 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page

2021-04-19 Thread Oscar Salvador
Pages allocated via the page allocator or CMA get their private field cleared
by means of post_alloc_hook().
Pages allocated during boot, that is directly from the memblock allocator,
get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page()
before any memblock allocation.

On these grounds, let us remove the clearing of the flag from
prep_new_huge_page(), as it is not needed.
This was a leftover from 6c0371490140 ("hugetlb: convert PageHugeFreed
to HPageFreed flag"). Previously the explicit clearing was necessary
because compound allocations do not get this initialization
(see prep_compound_page).

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Reviewed-by: Mike Kravetz 
---
 mm/hugetlb.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54d81d5947ed..2cb9fa79cbaa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
spin_lock_irq(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearHPageFreed(page);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v10 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-19 Thread Oscar Salvador
alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle it. Since HugeTLB pages can be migrated like any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have much better chances of succeeding the call.

Signed-off-by: Oscar Salvador 
Reviewed-by: Mike Kravetz 
Acked-by: Michal Hocko 
Acked-by: David Hildenbrand 
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c | 12 +++-
 mm/hugetlb.c| 22 +-
 mm/vmscan.c |  5 +++--
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b2d2118bfd1a..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,7 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+   struct list_head *list)
 {
return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index af2e8e194e50..84fde270ae74 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -907,7 +907,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
}
 
if (PageHuge(page) && cc->alloc_contig) {
-   ret = isolate_or_dissolve_huge_page(page);
+   ret = isolate_or_dissolve_huge_page(page, 
&cc->migratepages);
 
/*
 * Fail isolation in case 
isolate_or_dissolve_huge_page()
@@ -921,6 +921,15 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
goto isolate_fail;
}
 
+   if (PageHuge(page)) {
+   /*
+* Hugepage was successfully isolated and placed
+* on the cc->migratepages list.
+*/
+   low_pfn += compound_nr(page) - 1;
+   goto isolate_success_no_list;
+   }
+
/*
 * Ok, the hugepage was dissolved. Now these pages are
 * Buddy and cannot be re-allocated because they are
@@ -1062,6 +1071,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
cc->nr_migratepages += compound_nr(page);
nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cc2463953fab..3db405dea3dc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2270,9 +2270,11 @@ static void restore_reserve_on_error(struct hstate *h,
  * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
  * @h: struct hstate old page belongs to
  * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page,
+   struct list_head *list)
 {
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nid = page_to_nid(old_page);
@@ -2299,9 +2301,13 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
goto free_new;
} else if (page_count(old_page)) {
/*
-* Someone has grabbed the page, fail for now.
+* Someone has grabbed the page, try to isolate it here.
+* Fail with -EBUSY if not possible.
 */
-   ret = -EBUSY;
+   spin_unlock_irq(&hugetlb_lock);
+   if (!isolate_huge_page(old_page, list))
+   ret = -EBUSY;
+   spin_lock_irq(&hugetlb_lock);
goto free_new;
} else if (!HPageFreed(old_page)) {
/*
@@ -2351,10 +2357,11 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
return ret;
 }
 
-int iso

[PATCH v10 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-19 Thread Oscar Salvador
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g: CMA and virtio-mem, which would fail the call even when those
pages lie in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle: so that no userspace application
notices disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external
users of hugetlb will notice the change.
To allocate the new huge page, we use alloc_buddy_huge_page(), so we
do not have to deal with any counters, and prep_new_huge_page() is not
called. This is valuable because, in case we need to free the new page,
we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted
huge page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
for the new huge page to properly initialize it and increment the
hstate->nr_huge_pages counter (previously decremented by
remove_hugetlb_page()).
Once done, the page is enqueued by enqueue_huge_page() and it is ready
to be used.

There is one tricky case: when the
page's refcount is 0 because it is in the process of being released.
A missing PageHugeFreed bit will tell us that freeing is in flight, so
we retry after dropping the hugetlb_lock. The race window should be
small and the next retry should make forward progress.

E.g:

CPU0CPU1
free_huge_page()isolate_or_dissolve_huge_page
  PageHuge() == T
  alloc_and_dissolve_huge_page
alloc_buddy_huge_page()
spin_lock_irq(hugetlb_lock)
// PageHuge() && !PageHugeFreed &&
// !PageCount()
spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
   PageHuge() == F
   __free_pages()
  2) enqueue_huge_page
   SetPageHugeFreed()
  spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
   1) PageHuge() == F (freed by case#1 from 
CPU0)
   2) PageHuge() == T
   PageHugeFreed() == T
   - proceed with replacing the page

In the case above we retry, as the race window is quite small and we have a
high chance of succeeding next time.
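
In code, the retry described above boils down to roughly the following
(heavily condensed sketch; the allocation of new_page, the error paths and
the in-use handling added later in this series are left out):

retry:
	spin_lock_irq(&hugetlb_lock);
	if (!PageHuge(old_page)) {
		/* Freed from under us (case 1 above): nothing left to do. */
		ret = 0;
	} else if (!HPageFreed(old_page)) {
		/* Freeing is in flight: drop the lock and retry. */
		spin_unlock_irq(&hugetlb_lock);
		cond_resched();
		goto retry;
	} else {
		/* A genuine free hugepage: swap it for the freshly allocated one. */
		remove_hugetlb_page(h, old_page, false);
		__prep_new_huge_page(new_page);
		__prep_account_new_huge_page(h, nid);
		enqueue_huge_page(h, new_page);
		ret = 0;
	}
	spin_unlock_irq(&hugetlb_lock);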

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Acked-by: David Hildenbrand 
Reviewed-by: Mike Kravetz 
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c |  33 --
 mm/hugetlb.c| 116 
 3 files changed, 152 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 09f1fd12a6fa..b2d2118bfd1a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,6 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
+int isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int isolate_or_dissolve_huge_page(struct page *page)
+{
+   return -ENOMEM;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
   unsigned long addr,
   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 110b90d5fff5..af2e8e194e50 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
  * Returns errno, like -EAGAIN or -EINTR

[PATCH v10 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-19 Thread Oscar Salvador
Currently, prep_new_huge_page() performs two functions.
It sets the right state for a new hugetlb, and increases the hstate's
counters to account for the new page.

Let us split its functionality into two separate functions, decoupling
the handling of the counters from initializing a hugepage.
The outcome is having __prep_new_huge_page(), which only
initializes the page, and __prep_account_new_huge_page(), which adds
the new page to the hstate's counters.

This allows us to set up a hugetlb page without having to worry
about the counters/locking. It will prove useful in the next patch.
prep_new_huge_page() still calls both functions.

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Reviewed-by: Mike Kravetz 
Reviewed-by: David Hildenbrand 
---
 mm/hugetlb.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2cb9fa79cbaa..6f39ec79face 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1483,16 +1483,30 @@ void free_huge_page(struct page *page)
}
 }
 
-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+/*
+ * Must be called with the hugetlb lock held
+ */
+static void __prep_account_new_huge_page(struct hstate *h, int nid)
+{
+   lockdep_assert_held(&hugetlb_lock);
+   h->nr_huge_pages++;
+   h->nr_huge_pages_node[nid]++;
+}
+
+static void __prep_new_huge_page(struct page *page)
 {
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
hugetlb_set_page_subpool(page, NULL);
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup_rsvd(page, NULL);
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+{
+   __prep_new_huge_page(page);
spin_lock_irq(&hugetlb_lock);
-   h->nr_huge_pages++;
-   h->nr_huge_pages_node[nid]++;
+   __prep_account_new_huge_page(h, nid);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v10 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes

2021-04-19 Thread Oscar Salvador
Currently, isolate_migratepages_{range,block} and their callers use
a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
any error during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.
Since isolate_migratepages_block will stop returning the next pfn to be
scanned, we reuse the cc->migrate_pfn field to keep track of that.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Acked-by: Mike Kravetz 
Reviewed-by: David Hildenbrand 
---
 mm/compaction.c | 52 +---
 mm/internal.h   | 10 --
 mm/page_alloc.c |  7 +++
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8c5028bfbd56..110b90d5fff5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,15 +787,14 @@ static bool too_many_isolated(pg_data_t *pgdat)
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
- * Returns zero if there is a fatal signal pending, otherwise PFN of the
- * first page that was not scanned (which may be both less, equal to or more
- * than end_pfn).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
+ * or 0.
+ * cc->migrate_pfn will contain the next pfn to scan.
  *
  * The pages are isolated on cc->migratepages list (not required to be empty),
- * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
- * is neither read nor updated.
+ * and cc->nr_migratepages is updated accordingly.
  */
-static unsigned long
+static int
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
@@ -809,6 +808,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
bool skip_updated = false;
+   int ret = 0;
+
+   cc->migrate_pfn = low_pfn;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -818,16 +820,16 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
while (unlikely(too_many_isolated(pgdat))) {
/* stop isolation if there are still pages not migrated */
if (cc->nr_migratepages)
-   return 0;
+   return -EAGAIN;
 
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
-   return 0;
+   return -EAGAIN;
 
congestion_wait(BLK_RW_ASYNC, HZ/10);
 
if (fatal_signal_pending(current))
-   return 0;
+   return -EINTR;
}
 
cond_resched();
@@ -875,8 +877,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
if (fatal_signal_pending(current)) {
cc->contended = true;
+   ret = -EINTR;
 
-   low_pfn = 0;
goto fatal_pending;
}
 
@@ -1130,7 +1132,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (nr_isolated)
count_compact_events(COMPACTISOLATED, nr_isolated);
 
-   return low_pfn;
+   cc->migrate_pfn = low_pfn;
+
+   return ret;
 }
 
 /**
@@ -1139,15 +1143,14 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
  * @start_pfn: The first PFN to start isolating.
  * @end_pfn:   The one-past-last PFN.
  *
- * Returns zero if isolation fails fatally due to e.g. pending signal.
- * Otherwise, function returns one-past-the-last PFN of isolated page
- * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ * Returns -EAGAIN when contented, -EINTR in case of a signal pending or 0.
  */
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
unsigned long end_pfn)
 {
unsigned long pfn, block_start_pfn, block_end_pfn;
+   int ret = 0;
 
/* Scan block by block. First and last block may be incomplete */
pfn = start_pfn;
@@ -1166,17 +1169,17 @@ isolate_migratepages_range(struct compact_control *cc, 
unsigned long start_pfn,
block_end_pfn, cc->zone))
continue;
 
-   pfn = isolate_migra

[PATCH v10 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range

2021-04-19 Thread Oscar Salvador
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
and report them down the chain.
The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
exhaust all the try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
error. Do the same here instead of keep going and retrying.
Note that this is not fixing a real issue, just a cosmetic change, although we
can save some cycles by backing off earlier.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Acked-by: Mike Kravetz 
---
 mm/page_alloc.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c67c99603a3..689454692de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
+   ret = -EBUSY;
break;
}
 
@@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
 
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
NULL, (unsigned long)&cc, cc->mode, 
MR_CONTIG_RANGE);
+
+   /*
+* On -ENOMEM, migrate_pages() bails out right away. It is 
pointless
+* to retry again over this error, so do the same here.
+*/
+   if (ret == -ENOMEM)
+   break;
}
 
lru_cache_enable();
-- 
2.16.3



[PATCH v10 0/7] Make alloc_contig_range handle Hugetlb pages

2021-04-19 Thread Oscar Salvador
loc_contig_range: [1b8000, 1c) PFNs busy
  [  180.061972] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.063413] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.064838] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy"

 And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
 again.

  c) Allocated/populated but all 512 huge pages are read/written in a
 loop: all memory can get unplugged again, but I get a single

  [  121.192345] alloc_contig_range: [18, 188000) PFNs busy

  Most probably because it happened to try migrating a huge page while it
  was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it
  can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:5061568 kB
  MemFree:  186560 kB
  MemAvailable: 354524 kB
  ...
  HugePages_Total:2048
  HugePages_Free: 2048
  HugePages_Rsvd:0
  HugePages_Surp:0

  Triggering unplug would require to dissolve+alloc - which now fails when
  trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding. + I
  get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:2048
  HugePages_Free: 2048
  HugePages_Rsvd:0
  HugePages_Surp:0"

Oscar Salvador (7):
  mm,page_alloc: Bail out earlier on -ENOMEM in
alloc_contig_migrate_range
  mm,compaction: Let isolate_migratepages_{range,block} return error
codes
  mm,hugetlb: Drop clearing of flag from prep_new_huge_page
  mm,hugetlb: Split prep_new_huge_page functionality
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c |  91 -
 mm/hugetlb.c| 149 ++--
 mm/internal.h   |  10 +++-
 mm/page_alloc.c |  22 +++
 mm/vmscan.c |   5 +-
 6 files changed, 237 insertions(+), 47 deletions(-)

-- 
2.16.3



Re: [PATCH v9 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-16 Thread Oscar Salvador
On Fri, Apr 16, 2021 at 05:49:20PM +0800, Baoquan He wrote:
> On 04/16/21 at 09:00am, Oscar Salvador wrote:
> ...  
> > +/*
> > + * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old 
> > one
> > + * @h: struct hstate old page belongs to
> > + * @old_page: Old page to dissolve
> > + * Returns 0 on success, otherwise negated error.
> > + */
> > +static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
> > *old_page)
> > +{
> > +   gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
> > +   int nid = page_to_nid(old_page);
> > +   struct page *new_page;
> > +   int ret = 0;
> > +
> > +   /*
> > +* Before dissolving the page, we need to allocate a new one for the
> > +* pool to remain stable. Using alloc_buddy_huge_page() allows us to
> > +* not having to deal with prep_new_page() and avoids dealing of any
>  ~ prep_new_huge_page() ?

Yep, clumsy.
@Andrew: could you please amend that or should I send a new version?

Thanks

-- 
Oscar Salvador
SUSE L3


[PATCH v9 8/8] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-16 Thread Oscar Salvador
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 arch/arm64/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..68735831b236 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   def_bool y
+
 config SMP
def_bool y
 
-- 
2.16.3



[PATCH v9 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-16 Thread Oscar Salvador
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 arch/x86/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..9f0211df1746 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
depends on MEMORY_HOTPLUG
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   def_bool y
+
 config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
-- 
2.16.3



[PATCH v9 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory

2021-04-16 Thread Oscar Salvador
Self-stored memmap leads to a sparse memory situation which is unsuitable
for workloads that require large contiguous memory chunks, so make this
an opt-in which needs to be explicitly enabled.

To control this, let memory_hotplug have its own memory space, as suggested
by David, so we can add memory_hotplug.memmap_on_memory parameter.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 Documentation/admin-guide/kernel-parameters.txt | 17 +
 mm/Makefile |  5 -
 mm/memory_hotplug.c | 10 +-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..af32c17cd4eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2794,6 +2794,23 @@
seconds.  Use this parameter to check at some
other rate.  0 disables periodic checking.
 
+   memory_hotplug.memmap_on_memory
+   [KNL,X86,ARM] Boolean flag to enable this feature.
+   Format: {on | off (default)}
+   When enabled, runtime hotplugged memory will
+   allocate its internal metadata (struct pages)
+   from the hotadded memory which will allow to
+   hotadd a lot of memory without requiring
+   additional memory to do so.
+   This feature is disabled by default because it
+   has some implication on large (e.g. GB)
+   allocations in some configurations (e.g. small
+   memory blocks).
+   The state of the flag can be read in
+   /sys/module/memory_hotplug/parameters/memmap_on_memory.
+   Note that even when enabled, there are a few cases where
+   the feature is not effective.
+
memtest=[KNL,X86,ARM,PPC] Enable memtest
Format: 
default : 0 
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..82ae9482f5e3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o 
oom_kill.o fadvise.o \
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
 
+# Give 'memory_hotplug' its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+
 obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
 
 ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
@@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)+= kasan/
 obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)  += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 5ef626926449..b93949a84d4a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,7 +42,15 @@
 #include "internal.h"
 #include "shuffle.h"
 
-static bool memmap_on_memory;
+
+/*
+ * memory_hotplug.memmap_on_memory parameter
+ */
+static bool memmap_on_memory __ro_after_init;
+#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
+module_param(memmap_on_memory, bool, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory 
hotplug");
+#endif
 
 /*
  * online_page_callback contains pointer to current page onlining function.
-- 
2.16.3



[PATCH v9 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported

2021-04-16 Thread Oscar Salvador
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 drivers/acpi/acpi_memhotplug.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..8cc195c4c861 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
acpi_handle handle = mem_device->device->handle;
int result, num_enabled = 0;
struct acpi_memory_info *info;
+   mhp_t mhp_flags = MHP_NONE;
int node;
 
node = acpi_get_node(handle);
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
+   if (mhp_supports_memmap_on_memory(info->length))
+   mhp_flags |= MHP_MEMMAP_ON_MEMORY;
result = __add_memory(node, info->start_addr, info->length,
- MHP_NONE);
+ mhp_flags);
 
/*
 * If the memory block has been used by the kernel, add_memory()
-- 
2.16.3



[PATCH v9 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-16 Thread Oscar Salvador
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) an existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
 c) It might be there are no PMD_ALIGNED chunks so memmap array gets
populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This patch also introduces the following functions:

 - mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(),
   calls kasan_add_zero_shadow(), and onlines as many sections
   as vmemmap pages fully span.
 - mhp_deinit_memmap_on_memory:
   Undoes what mhp_init_memmap_on_memory does.

The new function memory_block_online() calls mhp_init_memmap_on_memory() before
doing the actual online_pages(). Should online_pages() fail, we clean up
by calling mhp_deinit_memmap_on_memory().
Adjusting of present_pages is done at the end, once we know that online_pages()
succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages().
This is necessary because offline_pages() tears down some structures based
on whether the node or the zone becomes empty.
If offline_pages() fails, we account back vmemmap pages.
If it succeeds, we call mhp_deinit_memmap_on_memory().

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.
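
On the hot-add side, the altmap construction is roughly the following
(condensed sketch of the add_memory_resource() change in this patch):

	struct vmem_altmap mhp_altmap = {};

	/*
	 * Self hosted memmap array: describe the head of the range with an
	 * altmap so the memmap gets allocated from the hot-added range itself.
	 */
	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		if (!mhp_supports_memmap_on_memory(size)) {
			ret = -EINVAL;
			goto error;
		}
		mhp_altmap.free = PHYS_PFN(size);
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;
	}

	/* call arch's memory hotadd */
	ret = arch_add_memory(nid, start, size, &params);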

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 drivers/base/memory.c  |  71 --
 include/linux/memory.h |   8 ++-
 include/linux/memory_hotplug.h |  15 +++-
 include/linux/memremap.h   |   2 +-
 include/linux/mmzone.h |   7 +-
 mm/Kconfig |   5 ++
 mm/memory_hotplug.c| 159 ++---
 mm/sparse.c|   2 -
 8 files changed, 247 insertions(+), 22 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f209925a5d4e..2e2b2f654f0a 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -173,16 +173,72 @@ static int memory_block_online(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+   struct zone *zone;
+   int ret;
+
+   zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, 
nr_pages);
+
+   /*
+* Although vmemmap pages have a different lifecycle than the pages
+* they describe (they remain until the memory is unplugged), doing
+* their initialization and accounting at memory onlining/offlining
+* stage simplifies things a lot.
+*/
+   if (nr_vmemmap_pages) {
+   ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, 
zone);
+   if (ret)
+   return ret;
+   }
+
+   ret = online_pages(start_pfn + nr_vmemmap_pages,
+  nr_pages - nr_vmemmap_pages, zone);
+   if (ret) {
+   if (nr_vmemmap_pages)
+   mhp_deinit_memmap_on_memory(start_pfn, 
nr_vmemmap_pages);
+   return ret;
+   }
+
+   /*
+* Account once onlining succeeded. If the zone was unpopulated, it is
+* now already properly populated.
+*/
+   if (nr_vmemmap_pages)
+   adjust_present_page_count(zone, nr_vmemmap_pages);
 
-   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+   return ret;
 }
 
 static int memory_block_offline(struct

[PATCH v9 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()

2021-04-16 Thread Oscar Salvador
From: David Hildenbrand 

Let's have a single place (inspired by adjust_managed_page_count()) where
we adjust present pages.
In contrast to adjust_managed_page_count(), only memory onlining/offlining
is allowed to modify the number of present pages.

Signed-off-by: David Hildenbrand 
Signed-off-by: Oscar Salvador 
Reviewed-by: Oscar Salvador 
---
 mm/memory_hotplug.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 25e59d5dc13c..d05056b3c173 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, 
unsigned start_pfn,
return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
+static void adjust_present_page_count(struct zone *zone, long nr_pages)
+{
+   unsigned long flags;
+
+   zone->present_pages += nr_pages;
+   pgdat_resize_lock(zone->zone_pgdat, &flags);
+   zone->zone_pgdat->node_present_pages += nr_pages;
+   pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
   int online_type, int nid)
 {
@@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
}
 
online_pages_range(pfn, nr_pages);
-   zone->present_pages += nr_pages;
-
-   pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages += nr_pages;
-   pgdat_resize_unlock(zone->zone_pgdat, &flags);
+   adjust_present_page_count(zone, nr_pages);
 
node_states_set_node(nid, );
if (need_zonelists_rebuild)
@@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-   zone->present_pages -= nr_pages;
-
-   pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages -= nr_pages;
-   pgdat_resize_unlock(zone->zone_pgdat, &flags);
+   adjust_present_page_count(zone, -nr_pages);
 
init_per_zone_wmark_min();
 
-- 
2.16.3



[PATCH v9 0/8] Allocate memmap from hotadded memory (per device)

2021-04-16 Thread Oscar Salvador
mory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()

Oscar Salvador (7):
  drivers/base/memory: Introduce memory_block_{online,offline}
  mm,memory_hotplug: Relax fully spanned sections check
  mm,memory_hotplug: Allocate memmap from the added memory range
  acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported
  mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory
  x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
  arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

 Documentation/admin-guide/kernel-parameters.txt |  17 ++
 arch/arm64/Kconfig  |   3 +
 arch/x86/Kconfig|   3 +
 drivers/acpi/acpi_memhotplug.c  |   5 +-
 drivers/base/memory.c   | 100 ++--
 include/linux/memory.h  |   8 +-
 include/linux/memory_hotplug.h  |  15 +-
 include/linux/memremap.h|   2 +-
 include/linux/mmzone.h  |   7 +-
 mm/Kconfig  |   5 +
 mm/Makefile |   5 +-
 mm/memory_hotplug.c | 205 +---
 mm/sparse.c |   2 -
 13 files changed, 330 insertions(+), 47 deletions(-)

-- 
2.16.3



[PATCH v9 2/8] mm,memory_hotplug: Relax fully spanned sections check

2021-04-16 Thread Oscar Salvador
When using self-hosted vmemmap pages, the number of pages passed to
{online,offline}_pages might not fully span sections, but they always
fully span pageblocks.
Relax the check to account for that case.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..25e59d5dc13c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
int ret;
struct memory_notify arg;
 
-   /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
+* However, when using e.g: memmap_on_memory, some pages are initialized
+* prior to calling in here. The remaining amount of pages must be
+* pageblock aligned.
+*/
if (WARN_ON_ONCE(!nr_pages ||
-!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
+!IS_ALIGNED(pfn, pageblock_nr_pages) ||
+!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
 
mem_hotplug_begin();
@@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
int ret, node;
char *reason;
 
-   /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
+* However, when using e.g: memmap_on_memory, some pages are initialized
+* prior to calling in here. The remaining amount of pages must be
+* pageblock aligned.
+*/
if (WARN_ON_ONCE(!nr_pages ||
-!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
+!IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
+!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
 
mem_hotplug_begin();
-- 
2.16.3



[PATCH v9 1/8] drivers/base/memory: Introduce memory_block_{online,offline}

2021-04-16 Thread Oscar Salvador
This is a preparatory patch that introduces two new functions:
memory_block_online() and memory_block_offline().

For now, these functions will only call online_pages() and offline_pages()
respectively, but they will be later in charge of preparing the vmemmap
pages, carrying out the initialization and proper accounting of such
pages.

Since memory_block struct contains all the information, pass this struct
down the chain till the end functions.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 drivers/base/memory.c | 33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..f209925a5d4e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v)
return blocking_notifier_call_chain(&memory_chain, val, v);
 }
 
+static int memory_block_online(struct memory_block *mem)
+{
+   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+}
+
+static int memory_block_offline(struct memory_block *mem)
+{
+   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+   return offline_pages(start_pfn, nr_pages);
+}
+
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(unsigned long start_section_nr, unsigned long action,
-   int online_type, int nid)
+memory_block_action(struct memory_block *mem, unsigned long action)
 {
-   unsigned long start_pfn;
-   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
int ret;
 
-   start_pfn = section_nr_to_pfn(start_section_nr);
-
switch (action) {
case MEM_ONLINE:
-   ret = online_pages(start_pfn, nr_pages, online_type, nid);
+   ret = memory_block_online(mem);
break;
case MEM_OFFLINE:
-   ret = offline_pages(start_pfn, nr_pages);
+   ret = memory_block_offline(mem);
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
-"%ld\n", __func__, start_section_nr, action, action);
+"%ld\n", __func__, mem->start_section_nr, action, action);
ret = -EINVAL;
}
 
@@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block 
*mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
 
-   ret = memory_block_action(mem->start_section_nr, to_state,
- mem->online_type, mem->nid);
-
+   ret = memory_block_action(mem, to_state);
mem->state = ret ? from_state_req : to_state;
 
return ret;
-- 
2.16.3



Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-16 Thread Oscar Salvador
es;

BUG_ON(check_hotplug_memory_range(start, size));

@@ -1848,6 +1966,31 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)
if (rc)
return rc;

+   /*
+* We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
+* the same granularity it was added - a single memory block.
+*/
+   if (memmap_on_memory) {
+   nr_vmemmap_pages = walk_memory_blocks(start, size, NULL,
+ get_nr_vmemmap_pages_cb);
+   if (nr_vmemmap_pages) {
+   if (size != memory_block_size_bytes()) {
+   pr_warn("Refuse to remove %#llx - %#llx,"
+   "wrong granularity\n",
+   start, start + size);
+   return -EINVAL;
+   }
+
+   /*
+* Let remove_pmd_table->free_hugepage_table do the
+* right thing if we used vmem_altmap when hot-adding
+* the range.
+*/
+   mhp_altmap.alloc = nr_vmemmap_pages;
+   altmap = &mhp_altmap;
+   }
+   }
+
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");

@@ -1859,7 +2002,7 @@ static int __ref try_remove_memory(int nid, u64 start, 
u64 size)

mem_hotplug_begin();

-   arch_remove_memory(nid, start, size, NULL);
+   arch_remove_memory(nid, start, size, altmap);

if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
memblock_free(start, size);
diff --git a/mm/sparse.c b/mm/sparse.c
index 7bd23f9d6cef..8e96cf00536b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -623,7 +623,6 @@ void online_mem_sections(unsigned long start_pfn, unsigned 
long end_pfn)
}
 }

-#ifdef CONFIG_MEMORY_HOTREMOVE
 /* Mark all memory sections within the pfn range as offline */
 void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 {
@@ -644,7 +643,6 @@ void offline_mem_sections(unsigned long start_pfn, unsigned 
long end_pfn)
    ms->section_mem_map &= ~SECTION_IS_ONLINE;
}
 }
-#endif

 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 static struct page * __meminit populate_section_memmap(unsigned long pfn,

 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-16 Thread Oscar Salvador
On Fri, Apr 16, 2021 at 12:33:34PM +0200, David Hildenbrand wrote:
> IIRC, we have to add the zero shadow first, before touching the memory. This
> is also what mm/memremap.c does.
> 
> In mhp_deinit_memmap_on_memory(), you already remove in the proper
> (reversed) order :)

But looking at online_pages(), we do it after the
move_pfn_range_to_zone(), right? AFAIK (I might well be wrong here),
memory_notify() will eventually call kasan_add_zero_shadow? So that
comes after the move_pfn_range_to_zone? Or is my understanding
incorrect?


-- 
Oscar Salvador
SUSE L3


[PATCH v8 8/8] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-16 Thread Oscar Salvador
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 arch/arm64/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e4e1b6550115..68735831b236 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   def_bool y
+
 config SMP
def_bool y
 
-- 
2.16.3



[PATCH v8 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-16 Thread Oscar Salvador
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 arch/x86/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..9f0211df1746 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
depends on MEMORY_HOTPLUG
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   def_bool y
+
 config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
-- 
2.16.3



[PATCH v8 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory

2021-04-16 Thread Oscar Salvador
Self-stored memmap leads to a sparse memory situation which is unsuitable
for workloads that require large contiguous memory chunks, so make this
an opt-in which needs to be explicitly enabled.

To control this, let memory_hotplug have its own memory space, as suggested
by David, so we can add memory_hotplug.memmap_on_memory parameter.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 Documentation/admin-guide/kernel-parameters.txt | 17 +
 mm/Makefile |  5 -
 mm/memory_hotplug.c | 10 +-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..af32c17cd4eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2794,6 +2794,23 @@
seconds.  Use this parameter to check at some
other rate.  0 disables periodic checking.
 
+   memory_hotplug.memmap_on_memory
+   [KNL,X86,ARM] Boolean flag to enable this feature.
+   Format: {on | off (default)}
+   When enabled, runtime hotplugged memory will
+   allocate its internal metadata (struct pages)
+   from the hotadded memory which will allow to
+   hotadd a lot of memory without requiring
+   additional memory to do so.
+   This feature is disabled by default because it
+   has some implication on large (e.g. GB)
+   allocations in some configurations (e.g. small
+   memory blocks).
+   The state of the flag can be read in
+   /sys/module/memory_hotplug/parameters/memmap_on_memory.
+   Note that even when enabled, there are a few cases where
+   the feature is not effective.
+
memtest=[KNL,X86,ARM,PPC] Enable memtest
Format: 
default : 0 
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..82ae9482f5e3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o 
oom_kill.o fadvise.o \
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
 
+# Give 'memory_hotplug' its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+
 obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
 
 ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
@@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)+= kasan/
 obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)  += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7efee41268b6..54c78c486f1f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,7 +42,15 @@
 #include "internal.h"
 #include "shuffle.h"
 
-static bool memmap_on_memory;
+
+/*
+ * memory_hotplug.memmap_on_memory parameter
+ */
+static bool memmap_on_memory __ro_after_init;
+#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
+module_param(memmap_on_memory, bool, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory 
hotplug");
+#endif
 
 /*
  * online_page_callback contains pointer to current page onlining function.
-- 
2.16.3



[PATCH v8 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported

2021-04-16 Thread Oscar Salvador
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 drivers/acpi/acpi_memhotplug.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..8cc195c4c861 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
acpi_handle handle = mem_device->device->handle;
int result, num_enabled = 0;
struct acpi_memory_info *info;
+   mhp_t mhp_flags = MHP_NONE;
int node;
 
node = acpi_get_node(handle);
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
+   if (mhp_supports_memmap_on_memory(info->length))
+   mhp_flags |= MHP_MEMMAP_ON_MEMORY;
result = __add_memory(node, info->start_addr, info->length,
- MHP_NONE);
+ mhp_flags);
 
/*
 * If the memory block has been used by the kernel, add_memory()
-- 
2.16.3



[PATCH v8 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-16 Thread Oscar Salvador
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
which have performance drawbacks.
 c) it might be that there are no PMD_ALIGNED chunks, so the memmap array
gets populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This patch also introduces the following functions:

 - mhp_init_memmap_on_memory:
   Initializes vmemmap pages by calling move_pfn_range_to_zone(),
   calls kasan_add_zero_shadow(), and onlines as many sections
   as the vmemmap pages fully span.
 - mhp_deinit_memmap_on_memory:
   Undoes what mhp_init_memmap_on_memory does.

The new function memory_block_online() calls mhp_init_memmap_on_memory() before
doing the actual online_pages(). Should online_pages() fail, we clean up
by calling mhp_deinit_memmap_on_memory().
Adjusting present_pages is done at the end, once we know that online_pages()
succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from
present_pages() before calling offline_pages().
This is necessary because offline_pages() tears down some structures based
on whether the node or the zone becomes empty.
If offline_pages() fails, we account back vmemmap pages.
If it succeeds, we call mhp_deinit_memmap_on_memory().

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.

Signed-off-by: Oscar Salvador 
---
 drivers/base/memory.c  |  75 --
 include/linux/memory.h |   8 +-
 include/linux/memory_hotplug.h |  17 +++-
 include/linux/memremap.h   |   2 +-
 include/linux/mmzone.h |   7 +-
 mm/Kconfig |   5 ++
 mm/memory_hotplug.c| 171 ++---
 mm/sparse.c|   2 -
 8 files changed, 265 insertions(+), 22 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f209925a5d4e..179857d53982 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -173,16 +173,76 @@ static int memory_block_online(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+   struct zone *zone;
+   int ret;
+
+   zone = mhp_get_target_zone(start_pfn, nr_pages, mem->nid,
+  mem->online_type);
+
+   /*
+* Although vmemmap pages have a different lifecycle than the pages
+* they describe (they remain until the memory is unplugged), doing
+* its initialization and accounting at hot-{online,offline} stage
+* simplifies things a lot
+*/
+   if (nr_vmemmap_pages) {
+   ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, 
zone);
+   if (ret)
+   return ret;
+   }
+
+   ret = online_pages(start_pfn + nr_vmemmap_pages,
+  nr_pages - nr_vmemmap_pages, zone);
+   if (ret) {
+   if (nr_vmemmap_pages)
+   mhp_deinit_memmap_on_memory(start_pfn, 
nr_vmemmap_pages);
+   return ret;
+   }
+
+   /*
+* Account once onlining succeeded. If the page was unpopulated, it is
+* now already properly populated.
+*/
+   if (nr_vmemmap_pages)
+   adjust_present_page_count(zone, nr_vmemmap_pages);
 
-   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+   return ret;
 }
 
 static int memory_block_offline(struct
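
The quoted hunk is truncated at this point. For completeness, the offline
counterpart follows the same shape as the proposal quoted later in this thread;
a sketch, not necessarily the exact posted code:

static int memory_block_offline(struct memory_block *mem)
{
        unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
        unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
        unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
        struct zone *zone = page_zone(pfn_to_page(start_pfn));
        int ret;

        /*
         * Unaccount the vmemmap pages before offlining, so that a zone/node
         * left with nothing but vmemmap can be properly torn down in
         * offline_pages().
         */
        if (nr_vmemmap_pages)
                adjust_present_page_count(zone, -nr_vmemmap_pages);

        ret = offline_pages(start_pfn + nr_vmemmap_pages,
                            nr_pages - nr_vmemmap_pages);
        if (ret) {
                /* offline_pages() failed: account the vmemmap pages back. */
                if (nr_vmemmap_pages)
                        adjust_present_page_count(zone, nr_vmemmap_pages);
                return ret;
        }

        if (nr_vmemmap_pages)
                mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);

        return 0;
}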

[PATCH v8 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()

2021-04-16 Thread Oscar Salvador
From: David Hildenbrand 

Let's have a single place (inspired by adjust_managed_page_count()) where
we adjust present pages.
In contrast to adjust_managed_page_count(), only memory onlining/offlining
is allowed to modify the number of present pages.

Signed-off-by: David Hildenbrand 
Signed-off-by: Oscar Salvador 
Reviewed-by: Oscar Salvador 
---
 mm/memory_hotplug.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 25e59d5dc13c..d05056b3c173 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, 
unsigned start_pfn,
return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
+static void adjust_present_page_count(struct zone *zone, long nr_pages)
+{
+   unsigned long flags;
+
+   zone->present_pages += nr_pages;
+   pgdat_resize_lock(zone->zone_pgdat, &flags);
+   zone->zone_pgdat->node_present_pages += nr_pages;
+   pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
   int online_type, int nid)
 {
@@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
}
 
online_pages_range(pfn, nr_pages);
-   zone->present_pages += nr_pages;
-
-   pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages += nr_pages;
-   pgdat_resize_unlock(zone->zone_pgdat, &flags);
+   adjust_present_page_count(zone, nr_pages);
 
node_states_set_node(nid, );
if (need_zonelists_rebuild)
@@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-   zone->present_pages -= nr_pages;
-
-   pgdat_resize_lock(zone->zone_pgdat, &flags);
-   zone->zone_pgdat->node_present_pages -= nr_pages;
-   pgdat_resize_unlock(zone->zone_pgdat, &flags);
+   adjust_present_page_count(zone, -nr_pages);
 
init_per_zone_wmark_min();
 
-- 
2.16.3



[PATCH v8 1/8] drivers/base/memory: Introduce memory_block_{online,offline}

2021-04-16 Thread Oscar Salvador
This is a preparatory patch that introduces two new functions:
memory_block_online() and memory_block_offline().

For now, these functions will only call online_pages() and offline_pages()
respectively, but they will be later in charge of preparing the vmemmap
pages, carrying out the initialization and proper accounting of such
pages.

Since memory_block struct contains all the information, pass this struct
down the chain till the end functions.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 drivers/base/memory.c | 33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..f209925a5d4e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v)
return blocking_notifier_call_chain(&memory_chain, val, v);
 }
 
+static int memory_block_online(struct memory_block *mem)
+{
+   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+}
+
+static int memory_block_offline(struct memory_block *mem)
+{
+   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+   return offline_pages(start_pfn, nr_pages);
+}
+
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(unsigned long start_section_nr, unsigned long action,
-   int online_type, int nid)
+memory_block_action(struct memory_block *mem, unsigned long action)
 {
-   unsigned long start_pfn;
-   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
int ret;
 
-   start_pfn = section_nr_to_pfn(start_section_nr);
-
switch (action) {
case MEM_ONLINE:
-   ret = online_pages(start_pfn, nr_pages, online_type, nid);
+   ret = memory_block_online(mem);
break;
case MEM_OFFLINE:
-   ret = offline_pages(start_pfn, nr_pages);
+   ret = memory_block_offline(mem);
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
-"%ld\n", __func__, start_section_nr, action, action);
+"%ld\n", __func__, mem->start_section_nr, action, action);
ret = -EINVAL;
}
 
@@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block 
*mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
 
-   ret = memory_block_action(mem->start_section_nr, to_state,
- mem->online_type, mem->nid);
-
+   ret = memory_block_action(mem, to_state);
mem->state = ret ? from_state_req : to_state;
 
return ret;
-- 
2.16.3



[PATCH v8 2/8] mm,memory_hotplug: Relax fully spanned sections check

2021-04-16 Thread Oscar Salvador
When using self-hosted vmemmap pages, the number of pages passed to
{online,offline}_pages might not fully span sections, but they always
fully span pageblocks.
Relax the check to account for that case.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..25e59d5dc13c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
int ret;
struct memory_notify arg;
 
-   /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
+* However, when using e.g: memmap_on_memory, some pages are initialized
+* prior to calling in here. The remaining amount of pages must be
+* pageblock aligned.
+*/
if (WARN_ON_ONCE(!nr_pages ||
-!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
+!IS_ALIGNED(pfn, pageblock_nr_pages) ||
+!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
 
mem_hotplug_begin();
@@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
int ret, node;
char *reason;
 
-   /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
+* However, when using e.g: memmap_on_memory, some pages are initialized
+* prior to calling in here. The remaining amount of pages must be
+* pageblock aligned.
+*/
if (WARN_ON_ONCE(!nr_pages ||
-!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
+!IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
+!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
 
mem_hotplug_begin();
-- 
2.16.3
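
As a worked example of the relaxed check, assume x86_64 defaults: 4K base
pages, 128MB sections matching the memory block size, 64-byte struct page and
2MB pageblocks. The vmemmap then occupies 512 pages at the start of the block,
so {online,offline}_pages() see a range that starts pageblock-aligned and ends
section-aligned. The EX_* constants and helper below are made up for the
illustration:

#define EX_PAGES_PER_SECTION    32768UL /* 128MB of 4K pages             */
#define EX_PAGEBLOCK_PAGES        512UL /* 2MB of 4K pages               */
#define EX_VMEMMAP_PAGES          512UL /* 32768 * 64B / 4KB = 512 pages */

/* The relaxed check from the hunks above, with illustrative constants. */
static bool ex_range_ok(unsigned long pfn, unsigned long nr_pages)
{
        return (pfn % EX_PAGEBLOCK_PAGES) == 0 &&
               ((pfn + nr_pages) % EX_PAGES_PER_SECTION) == 0;
}

/*
 * With memmap_on_memory, online_pages() gets pfn = block_start + 512 and
 * nr_pages = 32768 - 512 = 32256 (block_start itself is section aligned):
 * ex_range_ok() passes, while the old
 * IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION) check would have rejected
 * it, since neither 512 nor 32256 is a multiple of 32768.
 */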



[PATCH v8 0/8] Allocate memmap from hotadded memory (per device)

2021-04-16 Thread Oscar Salvador
Changes from v7 -> v8:
 - Addressed feedback from David for patch#4

Changes from v6 -> v7:
 - Fix check from "mm,memory_hotplug: Relax fully spanned sections check"
 - Add fixup from "mm,memory_hotplug: Allocate memmap from the added memory 
range"
 - Add Reviewed-by from David for patch#2
 - Fix changelog in "mm,memory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()"

Changes from v5 -> v6:
 - Create memory_block_{online,offline} functions
 - Create vmemmap_* functions to deal with vmemmap stuff, so 
   {online,offline}_pages remain untouched
 - Add adjust_present_page_count's patch from David
 - Relax check in {offline,online}_pages
 - Rework changelogs

Changes from v4 -> v5:
 - Addressed feedback from David (patch#1)
 - Tested on x86_64 with different struct page sizes and on large/small memory
   blocks
 - Tested on arm64 with 4K, 64K (with and without THP) and with different struct
   page sizes

 NOTE: We might need to make this feature and hugetlb-vmemmap feature [1] 
mutually
   exclusive. I raised an issue I see in [2].
   Hugetlb-vmemmap feature has been withdrawn for the time being due to the
   need for further changes wrt. locking/freeing context.
   I will keep an eye, and when the time comes again I will see how the two
   features play together and how one another can be disabled when needed.

Changes from v3 -> v4:
 - Addressed feedback from David
 - Wrap memmap_on_memory module thingy with #ifdef
   on MHP_MEMMAP_ON_MEMORY
 - Move "depend on MEMORY_HOTPLUG" to MHP_MEMMAP_ON_MEMORY
   in generic mm/Kconfig
 - Collect David's Reviewed-bys

Changes from v2 -> v3:
 - Addressed feedback from David
 - Squash former patch#4 and and patch#5 into patch#1
 - Fix config dependency CONFIG_SPARSE_VMEMMAP vs CONFIG_SPARSE_VMEMMAP_ENABLE
 - Simplify module parameter functions

Changes from v1 -> v2
 - Addressed feedback from David
 - Fence off the feature in case struct page size is not a
   multiple of PMD size or pageblock alignment cannot be guaranteed
 - Tested on x86_64 small and large memory_blocks
 - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
   my VM with 16KB page size).

 Arm64 with 4KB page size behaves like x86_64 after [1], which made section
 size smaller.
 With 64KB, the feature gets fenced off due to pageblock alignment.

Changes from RFCv3 -> v1:
 - Addressed feedback from David
 - Re-order patches

Changes from v2 -> v3 (RFC):
 - Re-order patches (Michal)
 - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
 - Add kernel boot option to enable this feature (Michal)

Changes from v1 -> v2 (RFC):
 - Addressed feedback provided by David
 - Add an arch_support_memmap_on_memory to be called
   from mhp_supports_memmap_on_memory, as atm,
   only ARM, powerpc and x86_64 have altmap support.

[1] https://lore.kernel.org/lkml/cover.1611206601.git.sudaraj...

Original cover letter:

 The primary goal of this patchset is to reduce memory overhead of the
 hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
 The current way we use to populate memmap (struct page array) has two main 
drawbacks:

 a) it consumes additional memory until the hotadded memory itself is
onlined and
 b) memmap might end up on a different numa node which is especially true
for movable_node configuration.
 c) due to fragmentation we might end up populating memmap with base
pages

 One way to mitigate all these issues is to simply allocate memmap array
 (which is the largest memory footprint of the physical memory hotplug)
 from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
 us to map any pfn range so the memory doesn't need to be online to be
 usable for the array. See patch 4 for more details.
 This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

 [Overall design]:

 Implementation wise we reuse vmem_altmap infrastructure to override
 the default allocator used by vmemap_populate.
 memory_block structure gains a new field called nr_vmemmap_pages,
 which accounts for the number of vmemmap pages used by that memory_block.
 E.g: On x86_64, that is 512 vmemmap pages on small memory blocks and 4096
  on large memory blocks (1GB)

 We also introduce new two functions: memory_block_{online,offline}.
 These functions take care of initializing/uninitializing vmemmap pages
 prior to calling {online,offline}_pages, so the latter functions can
 remain totally untouched.

 More details can be found in the respective changelogs.

David Hildenbrand (1):
  mm,memory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()

Oscar Salvador (7):
  drivers/base/memory: Introduce memory_block_{online,offline}
  mm,memory_hotplug: Relax fully spanned sections check
  mm,memory_hotplug: Allocate memmap from the added memory range
  acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when 
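
To illustrate the [Overall design] section above: on the hot-add side, the head
of the new range is described to the vmemmap-population code through a
vmem_altmap. The sketch below is an approximation; the field usage is inferred
from the hot-remove hunk quoted earlier in this thread
(mhp_altmap.alloc = nr_vmemmap_pages), so treat it as an assumption rather than
the final code:

        struct mhp_params params = { .pgprot = PAGE_KERNEL };
        struct vmem_altmap mhp_altmap = {
                .base_pfn = PHYS_PFN(start),    /* first pfn of the new range      */
                .free     = PHYS_PFN(size),     /* pages usable for its own memmap */
        };

        if (mhp_flags & MHP_MEMMAP_ON_MEMORY)
                params.altmap = &mhp_altmap;

        /*
         * arch_add_memory(nid, start, size, &params) then carves the struct
         * pages out of the beginning of [start, start + size) instead of
         * allocating them; the number of pages actually consumed ends up in
         * mhp_altmap.alloc and is stored in the memory_block so hot-remove
         * can rebuild the altmap later.
         */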

Re: [PATCH v7 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-16 Thread Oscar Salvador
On Thu, Apr 15, 2021 at 01:19:59PM +0200, David Hildenbrand wrote:
> > Implementation wise we will reuse vmem_altmap infrastructure to override
> > the default allocator used by __populate_section_memmap.
> > Part of the implementation also relies on memory_block structure gaining
> > a new field which specifies the number of vmemmap_pages at the beginning.
> > This patch also introduces the following functions:
> > 
> >   - vmemmap_init_space: Initializes vmemmap pages by calling 
> > move_pfn_range_to_zone(),
> >calls kasan_add_zero_shadow() or the vmemmap range and 
> > marks
> >online as many sections as vmemmap pages fully span.
> >   - vmemmap_adjust_pages: Accounts/substract vmemmap_pages to node and zone
> >  present_pages
> >   - vmemmap_deinit_space: Undoes what vmemmap_init_space does.
> > 
> 
> This is a bit asynchronous; and the function names are not really expressing 
> what is being done :) I'll try to come up with better names below.

Yeah, was not happy either with the names but at that time I could not
come up with anything better.

> It is worth mentioning that the real "mess" is that we want offline_pages() 
> to properly handle zone->present_pages going to 0. Therefore, we want to 
> manually mess with the present page count.

This should be explained by this:

"On offline, memory_block_offline() calls vmemmap_adjust_pages() prior to 
calling
offline_pages(), because offline_pages() performs the tearing-down of kthreads
and the rebuilding of the zonelists if the node/zone become empty."

Is not that clear?

> I suggest detecting the zone in here and just passing it down to 
> online_pages().

Ok, makes sense.

> My take would be doing the present page adjustment after onlining succeeded:
> 
> static int memory_block_online(struct memory_block *mem)
> {
>   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
>   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
>   struct zone *zone;
>   int ret;
> 
>   zone = mhp_get_target_zone(mem->nid, mem->online_type);
> 
>   if (nr_vmemmap_pages) {
>   ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, 
> zone);
>   if (ret)
>   return ret;
>   }
> 
>   ret = online_pages(start_pfn + nr_vmemmap_pages, nr_pages - 
> nr_vmemmap_pages, zone);
>   if (ret) {
>   if (nr_vmemmap_pages)
>   mhp_deinit_memmap_on_memory(start_pfn, 
> nr_vmemmap_pages);
>   return ret;
>   }
> 
>   /*
>* Account once onlining succeeded. If the page was unpopulated,
>* it is now already properly populated.
>*/
>   if (nr_vmemmap_pages)
>   adjust_present_page_count(zone, nr_vmemmap_pages);
>   return 0;   
> }
> 
> And the opposite:
> 
> static int memory_block_offline(struct memory_block *mem)
> {
>   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
>   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
>   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
>   struct zone *zone;
>   int ret;
> 
>   zone = page_zone(pfn_to_page(start_pfn));
> 
> 
>   /*
>* Unaccount before offlining, such that unpopulated zones can
>* properly be torn down in offline_pages().
>*/
>   if (nr_vmemmap_pages)
>   adjust_present_page_count(zone, -nr_vmemmap_pages);
> 
>   ret = offline_pages(start_pfn + nr_vmemmap_pages, nr_pages - 
> nr_vmemmap_pages);
>   if (ret) {
>   if (nr_vmemmap_pages)
>   adjust_present_page_count(zone, +nr_vmemmap_pages);
>   return ret;
>   }
> 
>   if (nr_vmemmap_pages)
>   mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
>   return 0;   
> }
> 
> Having to do the present page adjustment manually is not completely nice,
> but it's easier than manually having to mess with zones becomming 
> populated/unpopulated
> outside of online_pages()/offline_pages().

Ok, I like this, and the naming is much better.
I will work on this shortly.

thanks David!

-- 
Oscar Salvador
SUSE L3


[PATCH v9 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

2021-04-16 Thread Oscar Salvador
pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.
We can drop the in-use page check since __alloc_contig_pages can migrate
away those pages, and the hugetlb page check can go too since
isolate_migratepages_range is now capable of dealing with hugetlb pages.
Either way, those checks are racy so let the end function handle it
when the time comes.

Signed-off-by: Oscar Salvador 
Suggested-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Acked-by: Mike Kravetz 
Acked-by: Michal Hocko 
---
 mm/page_alloc.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5a94de3cdde..c5338e912ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, 
unsigned long start_pfn,
 
if (PageReserved(page))
return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
}
return true;
 }
-- 
2.16.3



[PATCH v9 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-16 Thread Oscar Salvador
alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle them. Since HugeTLB pages can be migrated as any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have much better chances to succeed the call.

Signed-off-by: Oscar Salvador 
Reviewed-by: Mike Kravetz 
Acked-by: Michal Hocko 
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c | 12 +++-
 mm/hugetlb.c| 24 ++--
 mm/vmscan.c |  5 +++--
 4 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b2d2118bfd1a..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,7 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+   struct list_head *list)
 {
return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index af2e8e194e50..84fde270ae74 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -907,7 +907,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
}
 
if (PageHuge(page) && cc->alloc_contig) {
-   ret = isolate_or_dissolve_huge_page(page);
+   ret = isolate_or_dissolve_huge_page(page, 
&cc->migratepages);
 
/*
 * Fail isolation in case 
isolate_or_dissolve_huge_page()
@@ -921,6 +921,15 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
goto isolate_fail;
}
 
+   if (PageHuge(page)) {
+   /*
+* Hugepage was successfully isolated and placed
+* on the cc->migratepages list.
+*/
+   low_pfn += compound_nr(page) - 1;
+   goto isolate_success_no_list;
+   }
+
/*
 * Ok, the hugepage was dissolved. Now these pages are
 * Buddy and cannot be re-allocated because they are
@@ -1062,6 +1071,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
cc->nr_migratepages += compound_nr(page);
nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c588eed97c5a..6eb3b6ca4264 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2270,9 +2270,11 @@ static void restore_reserve_on_error(struct hstate *h,
  * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
  * @h: struct hstate old page belongs to
  * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page,
+   struct list_head *list)
 {
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nid = page_to_nid(old_page);
@@ -2299,9 +2301,13 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
goto free_new;
} else if (page_count(old_page)) {
/*
-* Someone has grabbed the page, fail for now.
+* Someone has grabbed the page, try to isolate it here.
+* Fail with -EBUSY if not possible.
 */
-   ret = -EBUSY;
+   spin_unlock_irq(&hugetlb_lock);
+   if (!isolate_huge_page(old_page, list))
+   ret = -EBUSY;
+   spin_lock_irq(&hugetlb_lock);
goto free_new;
} else if (!HPageFreed(old_page)) {
/*
@@ -2351,10 +2357,11 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
return ret;
 }
 
-int isolate_or_dissolve_huge_pa
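
The hunk rewriting isolate_or_dissolve_huge_page() is cut off above. Based on
the prototype change and the changelog, the updated function is expected to
look roughly like the sketch below (not necessarily the exact code that was
posted):

int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
{
        struct hstate *h;
        struct page *head;

        /*
         * The page might have been dissolved from under our feet, so make
         * sure to carefully check the state under the lock.
         */
        spin_lock_irq(&hugetlb_lock);
        if (PageHuge(page)) {
                head = compound_head(page);
                h = page_hstate(head);
        } else {
                spin_unlock_irq(&hugetlb_lock);
                return 0;
        }
        spin_unlock_irq(&hugetlb_lock);

        /*
         * Fence off gigantic pages: there is a cyclic dependency between
         * them and alloc_contig_range() (see the changelog above).
         */
        if (hstate_is_gigantic(h))
                return -ENOMEM;

        return alloc_and_dissolve_huge_page(h, head, list);
}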

[PATCH v9 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page

2021-04-16 Thread Oscar Salvador
Pages allocated via the page allocator or CMA get their private field cleared
by means of post_alloc_hook().
Pages allocated during boot, that is directly from the memblock allocator,
get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page()
before any memblock allocation.

Based on this ground, let us remove the clearing of the flag from
prep_new_huge_page() as it is not needed.
This was a leftover from 6c0371490140 ("hugetlb: convert PageHugeFreed
to HPageFreed flag"). Previously the explicit clearing was necessary
because compound allocations do not get this initialization
(see prep_compound_page).

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Reviewed-by: David Hildenbrand 
Reviewed-by: Mike Kravetz 
---
 mm/hugetlb.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54d81d5947ed..2cb9fa79cbaa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
spin_lock_irq(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearHPageFreed(page);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v9 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-16 Thread Oscar Salvador
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g: CMA and virtio-mem, which would fail the call even when those
pages lie in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle: so that no userspace application
notices any disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external
users of hugetlb will notice the change.
To allocate the new huge page, we use alloc_buddy_huge_page(), so we
do not have to deal with any counters, and prep_new_huge_page() is not
called. This is valuable because, in case we need to free the new page,
we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted
huge page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
for the new huge page to properly initialize it and increment the
hstate->nr_huge_pages counter (previously decremented by
remove_hugetlb_page()).
Once done, the page is enqueued by enqueue_huge_page() and it is ready
to be used.

There is one tricky case: when the page's refcount is 0 because it is in
the process of being released.
A missing PageHugeFreed bit will tell us that freeing is in flight, so
we retry after dropping the hugetlb_lock. The race window should be
small and the next retry should make forward progress.

E.g:

CPU0                            CPU1
free_huge_page()                isolate_or_dissolve_huge_page
                                  PageHuge() == T
                                  alloc_and_dissolve_huge_page
                                    alloc_buddy_huge_page()
                                    spin_lock_irq(hugetlb_lock)
                                    // PageHuge() && !PageHugeFreed &&
                                    // !PageCount()
                                    spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock_irq(&hugetlb_lock)
                                  spin_lock_irq(hugetlb_lock)
                                  1) PageHuge() == F (freed by case#1 from CPU0)
                                  2) PageHuge() == T
                                     PageHugeFreed() == T
                                     - proceed with replacing the page

In the case above we retry, as the race window is quite small and we have
high chances of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Acked-by: David Hildenbrand 
Reviewed-by: Mike Kravetz 
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c |  33 --
 mm/hugetlb.c| 116 
 3 files changed, 152 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 09f1fd12a6fa..b2d2118bfd1a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,6 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
+int isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int isolate_or_dissolve_huge_page(struct page *page)
+{
+   return -ENOMEM;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
   unsigned long addr,
   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 110b90d5fff5..af2e8e194e50 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
  * Returns errno, like -EAGAIN or -EINTR
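
The alloc_and_dissolve_huge_page() hunk is truncated above. The "retry after
dropping the hugetlb_lock" described in the changelog boils down to a small
loop of roughly this shape; this is a sketch, and replace_free_hugepage() is a
made-up placeholder for the freelist swap the changelog describes:

retry:
        spin_lock_irq(&hugetlb_lock);
        if (!PageHuge(old_page)) {
                ret = 0;        /* freed and dissolved from under us */
        } else if (page_count(old_page)) {
                ret = -EBUSY;   /* someone grabbed it; patch 6/7 isolates it */
        } else if (!HPageFreed(old_page)) {
                /*
                 * Refcount is 0 but the page is not on the freelist yet, so
                 * freeing is still in flight: drop the lock, let the freeing
                 * path finish, and retry.  The race window is small.
                 */
                spin_unlock_irq(&hugetlb_lock);
                cond_resched();
                goto retry;
        } else {
                /* A genuine free hugepage: swap it for the fresh new_page. */
                replace_free_hugepage(h, old_page, new_page);
                ret = 0;
        }
        spin_unlock_irq(&hugetlb_lock);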

[PATCH v9 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-16 Thread Oscar Salvador
Currently, prep_new_huge_page() performs two functions.
It sets the right state for a new hugetlb, and increases the hstate's
counters to account for the new page.

Let us split its functionality into two separate functions, decoupling
the handling of the counters from initializing a hugepage.
The outcome is having __prep_new_huge_page(), which only
initializes the page, and __prep_account_new_huge_page(), which adds
the new page to the hstate's counters.

This allows us to set up a hugetlb page without having to worry
about the counters/locking. It will prove useful in the next patch.
prep_new_huge_page() still calls both functions.

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Reviewed-by: Mike Kravetz 
Reviewed-by: David Hildenbrand 
---
 mm/hugetlb.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2cb9fa79cbaa..6f39ec79face 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1483,16 +1483,30 @@ void free_huge_page(struct page *page)
}
 }
 
-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+/*
+ * Must be called with the hugetlb lock held
+ */
+static void __prep_account_new_huge_page(struct hstate *h, int nid)
+{
+   lockdep_assert_held(&hugetlb_lock);
+   h->nr_huge_pages++;
+   h->nr_huge_pages_node[nid]++;
+}
+
+static void __prep_new_huge_page(struct page *page)
 {
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
hugetlb_set_page_subpool(page, NULL);
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup_rsvd(page, NULL);
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+{
+   __prep_new_huge_page(page);
spin_lock_irq(&hugetlb_lock);
-   h->nr_huge_pages++;
-   h->nr_huge_pages_node[nid]++;
+   __prep_account_new_huge_page(h, nid);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3
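
As a usage note, patch 5/7 (quoted earlier in this thread) is the consumer of
this split: the new page is initialized outside the lock and only accounted
once hugetlb_lock is held. A sketch assembled from the 5/7 changelog, so the
details may differ from the posted code:

        /* Outside hugetlb_lock: allocate and initialize a fresh hugepage. */
        new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
        if (!new_page)
                return -ENOMEM;
        __prep_new_huge_page(new_page);

        spin_lock_irq(&hugetlb_lock);
        /* ... checks on old_page elided, see patch 5/7 ... */
        remove_hugetlb_page(h, old_page, false);
        __prep_account_new_huge_page(h, nid);   /* re-increment the counters */
        enqueue_huge_page(h, new_page);
        spin_unlock_irq(&hugetlb_lock);

        /* old_page is no longer a hugepage as far as hugetlb is concerned. */
        update_and_free_page(h, old_page);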



[PATCH v9 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes

2021-04-16 Thread Oscar Salvador
Currently, isolate_migratepages_{range,block} and their callers use
a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
any error during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.
Since isolate_migratepages_block will stop returning the next pfn to be
scanned, we reuse the cc->migrate_pfn field to keep track of that.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Acked-by: Mike Kravetz 
Reviewed-by: David Hildenbrand 
---
 mm/compaction.c | 52 +---
 mm/internal.h   | 10 --
 mm/page_alloc.c |  7 +++
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8c5028bfbd56..110b90d5fff5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,15 +787,14 @@ static bool too_many_isolated(pg_data_t *pgdat)
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
- * Returns zero if there is a fatal signal pending, otherwise PFN of the
- * first page that was not scanned (which may be both less, equal to or more
- * than end_pfn).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
+ * or 0.
+ * cc->migrate_pfn will contain the next pfn to scan.
  *
  * The pages are isolated on cc->migratepages list (not required to be empty),
- * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
- * is neither read nor updated.
+ * and cc->nr_migratepages is updated accordingly.
  */
-static unsigned long
+static int
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
@@ -809,6 +808,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
bool skip_updated = false;
+   int ret = 0;
+
+   cc->migrate_pfn = low_pfn;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -818,16 +820,16 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
while (unlikely(too_many_isolated(pgdat))) {
/* stop isolation if there are still pages not migrated */
if (cc->nr_migratepages)
-   return 0;
+   return -EAGAIN;
 
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
-   return 0;
+   return -EAGAIN;
 
congestion_wait(BLK_RW_ASYNC, HZ/10);
 
if (fatal_signal_pending(current))
-   return 0;
+   return -EINTR;
}
 
cond_resched();
@@ -875,8 +877,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
if (fatal_signal_pending(current)) {
cc->contended = true;
+   ret = -EINTR;
 
-   low_pfn = 0;
goto fatal_pending;
}
 
@@ -1130,7 +1132,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (nr_isolated)
count_compact_events(COMPACTISOLATED, nr_isolated);
 
-   return low_pfn;
+   cc->migrate_pfn = low_pfn;
+
+   return ret;
 }
 
 /**
@@ -1139,15 +1143,14 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
  * @start_pfn: The first PFN to start isolating.
  * @end_pfn:   The one-past-last PFN.
  *
- * Returns zero if isolation fails fatally due to e.g. pending signal.
- * Otherwise, function returns one-past-the-last PFN of isolated page
- * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ * Returns -EAGAIN when contented, -EINTR in case of a signal pending or 0.
  */
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
unsigned long end_pfn)
 {
unsigned long pfn, block_start_pfn, block_end_pfn;
+   int ret = 0;
 
/* Scan block by block. First and last block may be incomplete */
pfn = start_pfn;
@@ -1166,17 +1169,17 @@ isolate_migratepages_range(struct compact_control *cc, 
unsigned long start_pfn,
block_end_pfn, cc->zone))
continue;
 
-   pfn = isolate_migra
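
Since the hunk updating the callers is truncated above, this is roughly what
the new calling convention looks like for a user such as
__alloc_contig_migrate_range(); a sketch, the exact error handling in the
posted patch may differ:

        /*
         * Old scheme: pfn == 0 meant failure, any other value was the next
         * pfn to scan.  New scheme: an errno (or 0) is returned and
         * cc->migrate_pfn carries the scan position.
         */
        ret = isolate_migratepages_range(cc, pfn, block_end_pfn);
        if (ret && ret != -EAGAIN)
                break;                  /* fatal, e.g. -EINTR on a pending signal */
        pfn = cc->migrate_pfn;          /* continue where isolation stopped */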

[PATCH v9 0/7] Make alloc_contig_range handle Hugetlb pages

2021-04-16 Thread Oscar Salvador
  180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy"

 And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
 again.

  c) Allocated/populated but all 512 huge pages are read/written in a
 loop: all memory can get unplugged again, but I get a single

  [  121.192345] alloc_contig_range: [18, 188000) PFNs busy

  Most probably because it happened to try migrating a huge page while it
  was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it
  can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:        5061568 kB
  MemFree:          186560 kB
  MemAvailable:     354524 kB
  ...
  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0

  Triggering unplug would require to dissolve+alloc - which now fails when
  trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding. + I
  get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:    2048
  HugePages_Free:     2048
  HugePages_Rsvd:        0
  HugePages_Surp:        0

Oscar Salvador (7):
  mm,page_alloc: Bail out earlier on -ENOMEM in
alloc_contig_migrate_range
  mm,compaction: Let isolate_migratepages_{range,block} return error
codes
  mm,hugetlb: Drop clearing of flag from prep_new_huge_page
  mm,hugetlb: Split prep_new_huge_page functionality
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c |  91 -
 mm/hugetlb.c| 149 ++--
 mm/internal.h   |  10 +++-
 mm/page_alloc.c |  22 +++
 mm/vmscan.c |   5 +-
 6 files changed, 237 insertions(+), 47 deletions(-)

-- 
2.16.3



[PATCH v9 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range

2021-04-16 Thread Oscar Salvador
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
and report them down the chain.
The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
exhaust all the try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
error. Do the same here instead of keep going and retrying.
Note that this is not fixing a real issue, just a cosmetic change, although we
can save some cycles by backing off earlier.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Acked-by: Mike Kravetz 
---
 mm/page_alloc.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c67c99603a3..689454692de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
+   ret = -EBUSY;
break;
}
 
@@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
 
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
NULL, (unsigned long)&cc, cc->mode, 
MR_CONTIG_RANGE);
+
+   /*
+* On -ENOMEM, migrate_pages() bails out right away. It is 
pointless
+* to retry again over this error, so do the same here.
+*/
+   if (ret == -ENOMEM)
+   break;
}
 
lru_cache_enable();
-- 
2.16.3



Re: [PATCH v7 0/8] Allocate memmap from hotadded memory (per device)

2021-04-15 Thread Oscar Salvador
On Thu, Apr 08, 2021 at 02:17:56PM +0200, Oscar Salvador wrote:
> Hi,
> 
> I decided to send another version with the fixups included as it seemed a bit
> awkward otherwise. It should ease the review.
> Sorry for the spam.

Gentle ping :-) 

hint: patch#4 is the one that needs some looking

-- 
Oscar Salvador
SUSE L3


[PATCH v8 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-15 Thread Oscar Salvador
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g: CMA and virtio-mem, which would fail the call even when those
pages lie in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle: so that no userspace application
notices any disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external
users of hugetlb will notice the change.
To allocate the new huge page, we use alloc_buddy_huge_page(), so we
do not have to deal with any counters, and prep_new_huge_page() is not
called. This is valuable because, in case we need to free the new page,
we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted
huge page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
for the new huge page to properly initialize it and increment the
hstate->nr_huge_pages counter (previously decremented by
remove_hugetlb_page()).
Once done, the page is enqueued by enqueue_huge_page() and it is ready
to be used.

There is one tricky case: when the page's refcount is 0 because it is in
the process of being released.
A missing PageHugeFreed bit will tell us that freeing is in flight, so
we retry after dropping the hugetlb_lock. The race window should be
small and the next retry should make forward progress.

E.g:

CPU0                            CPU1
free_huge_page()                isolate_or_dissolve_huge_page
                                  PageHuge() == T
                                  alloc_and_dissolve_huge_page
                                    alloc_buddy_huge_page()
                                    spin_lock_irq(hugetlb_lock)
                                    // PageHuge() && !PageHugeFreed &&
                                    // !PageCount()
                                    spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock_irq(&hugetlb_lock)
                                  spin_lock_irq(hugetlb_lock)
                                  1) PageHuge() == F (freed by case#1 from CPU0)
                                  2) PageHuge() == T
                                     PageHugeFreed() == T
                                     - proceed with replacing the page

In the case above we retry, as the race window is quite small and we have
high chances of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c |  35 +--
 mm/hugetlb.c| 117 
 3 files changed, 155 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 09f1fd12a6fa..b2d2118bfd1a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,6 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
+int isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int isolate_or_dissolve_huge_page(struct page *page)
+{
+   return -ENOMEM;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
   unsigned long addr,
   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index 110b90d5fff5..73f00f36b6e7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
  * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
- * or 0.
+ *

[PATCH v8 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-15 Thread Oscar Salvador
alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle them. Since HugeTLB pages can be migrated as any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have much better chances to succeed the call.

Signed-off-by: Oscar Salvador 
Reviewed-by: Mike Kravetz 
Acked-by: Michal Hocko 
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c | 12 +++-
 mm/hugetlb.c| 24 ++--
 mm/vmscan.c |  5 +++--
 4 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b2d2118bfd1a..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,7 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+   struct list_head *list)
 {
return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 73f00f36b6e7..8149ffc6bccd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -907,7 +907,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
}
 
if (PageHuge(page) && cc->alloc_contig) {
-   ret = isolate_or_dissolve_huge_page(page);
+   ret = isolate_or_dissolve_huge_page(page, 
&cc->migratepages);
 
/*
 * Fail isolation in case 
isolate_or_dissolve_huge_page()
@@ -923,6 +923,15 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
goto isolate_fail;
}
 
+   if (PageHuge(page)) {
+   /*
+* Hugepage was successfully isolated and placed
+* on the cc->migratepages list.
+*/
+   low_pfn += compound_nr(page) - 1;
+   goto isolate_success_no_list;
+   }
+
/*
 * Ok, the hugepage was dissolved. Now these pages are
 * Buddy and cannot be re-allocated because they are
@@ -1064,6 +1073,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
cc->nr_migratepages += compound_nr(page);
nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4abd4c4565dd..efe2f1fed74e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h,
  * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
  * @h: struct hstate old page belongs to
  * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
 
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page,
+   struct list_head *list)
 {
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nid = page_to_nid(old_page);
@@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
goto free_new;
} else if (page_count(old_page)) {
/*
-* Someone has grabbed the page, fail for now.
+* Someone has grabbed the page, try to isolate it here.
+* Fail with -EBUSY if not possible.
 */
-   ret = -EBUSY;
+   spin_unlock_irq(&hugetlb_lock);
+   if (!isolate_huge_page(old_page, list))
+   ret = -EBUSY;
+   spin_lock_irq(&hugetlb_lock);
goto free_new;
} else if (!HPageFreed(old_page)) {
/*
@@ -2352,10 +2358,11 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
return ret;
 }
 
-int isolate_or_dissolve_huge_pa

[PATCH v8 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

2021-04-15 Thread Oscar Salvador
pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.
We can drop the in-use page check since __alloc_contig_pages can migrate
away those pages, and the hugetlb page check can go too since
isolate_migratepages_range is now capable of dealing with hugetlb pages.
Either way, those checks are racy so let the end function handle it
when the time comes.

Signed-off-by: Oscar Salvador 
Suggested-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
Acked-by: Mike Kravetz 
---
 mm/page_alloc.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5a94de3cdde..c5338e912ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, 
unsigned long start_pfn,
 
if (PageReserved(page))
return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
}
return true;
 }
-- 
2.16.3



[PATCH v8 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-15 Thread Oscar Salvador
Currently, prep_new_huge_page() performs two functions.
It sets the right state for a new hugetlb, and increases the hstate's
counters to account for the new page.

Let us split its functionality into two separate functions, decoupling
the handling of the counters from initializing a hugepage.
The outcome is having __prep_new_huge_page(), which only
initializes the page, and __prep_account_new_huge_page(), which adds
the new page to the hstate's counters.

This allows us to set up a hugetlb page without having to worry
about the counter/locking. It will prove useful in the next patch.
prep_new_huge_page() still calls both functions.

Signed-off-by: Oscar Salvador 
Acked-by: Michal Hocko 
Reviewed-by: Mike Kravetz 
---
 mm/hugetlb.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2cb9fa79cbaa..6f39ec79face 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1483,16 +1483,30 @@ void free_huge_page(struct page *page)
}
 }
 
-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+/*
+ * Must be called with the hugetlb lock held
+ */
+static void __prep_account_new_huge_page(struct hstate *h, int nid)
+{
+   lockdep_assert_held(&hugetlb_lock);
+   h->nr_huge_pages++;
+   h->nr_huge_pages_node[nid]++;
+}
+
+static void __prep_new_huge_page(struct page *page)
 {
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
hugetlb_set_page_subpool(page, NULL);
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup_rsvd(page, NULL);
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+{
+   __prep_new_huge_page(page);
spin_lock_irq(&hugetlb_lock);
-   h->nr_huge_pages++;
-   h->nr_huge_pages_node[nid]++;
+   __prep_account_new_huge_page(h, nid);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v8 3/7] mm,hugetlb: Drop clearing of flag from prep_new_huge_page

2021-04-15 Thread Oscar Salvador
Pages allocated after boot get their private field cleared by means
of post_alloc_hook().
Pages allocated during boot, that is directly from the memblock allocator,
get cleared by paging_init()->..->memmap_init_zone->..->__init_single_page()
before any memblock allocation.

On these grounds, let us remove the clearing of the flag from
prep_new_huge_page() as it is not needed.

Signed-off-by: Oscar Salvador 
---
 mm/hugetlb.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54d81d5947ed..2cb9fa79cbaa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1493,7 +1493,6 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
spin_lock_irq(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearHPageFreed(page);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v8 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes

2021-04-15 Thread Oscar Salvador
Currently, isolate_migratepages_{range,block} and their callers use
a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
any error during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.
Since isolate_migratepages_block will stop returning the next pfn to be
scanned, we reuse the cc->migrate_pfn field to keep track of that.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Acked-by: Mike Kravetz 
---
 mm/compaction.c | 52 +---
 mm/internal.h   | 10 --
 mm/page_alloc.c |  7 +++
 3 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8c5028bfbd56..110b90d5fff5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,15 +787,14 @@ static bool too_many_isolated(pg_data_t *pgdat)
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
- * Returns zero if there is a fatal signal pending, otherwise PFN of the
- * first page that was not scanned (which may be both less, equal to or more
- * than end_pfn).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
+ * or 0.
+ * cc->migrate_pfn will contain the next pfn to scan.
  *
  * The pages are isolated on cc->migratepages list (not required to be empty),
- * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
- * is neither read nor updated.
+ * and cc->nr_migratepages is updated accordingly.
  */
-static unsigned long
+static int
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
@@ -809,6 +808,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
bool skip_updated = false;
+   int ret = 0;
+
+   cc->migrate_pfn = low_pfn;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -818,16 +820,16 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
while (unlikely(too_many_isolated(pgdat))) {
/* stop isolation if there are still pages not migrated */
if (cc->nr_migratepages)
-   return 0;
+   return -EAGAIN;
 
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
-   return 0;
+   return -EAGAIN;
 
congestion_wait(BLK_RW_ASYNC, HZ/10);
 
if (fatal_signal_pending(current))
-   return 0;
+   return -EINTR;
}
 
cond_resched();
@@ -875,8 +877,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
if (fatal_signal_pending(current)) {
cc->contended = true;
+   ret = -EINTR;
 
-   low_pfn = 0;
goto fatal_pending;
}
 
@@ -1130,7 +1132,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (nr_isolated)
count_compact_events(COMPACTISOLATED, nr_isolated);
 
-   return low_pfn;
+   cc->migrate_pfn = low_pfn;
+
+   return ret;
 }
 
 /**
@@ -1139,15 +1143,14 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
  * @start_pfn: The first PFN to start isolating.
  * @end_pfn:   The one-past-last PFN.
  *
- * Returns zero if isolation fails fatally due to e.g. pending signal.
- * Otherwise, function returns one-past-the-last PFN of isolated page
- * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ * Returns -EAGAIN when contented, -EINTR in case of a signal pending or 0.
  */
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
unsigned long end_pfn)
 {
unsigned long pfn, block_start_pfn, block_end_pfn;
+   int ret = 0;
 
/* Scan block by block. First and last block may be incomplete */
pfn = start_pfn;
@@ -1166,17 +1169,17 @@ isolate_migratepages_range(struct compact_control *cc, 
unsigned long start_pfn,
block_end_pfn, cc->zone))
continue;
 
-   pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
-   ISOLA

[PATCH v8 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range

2021-04-15 Thread Oscar Salvador
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
and report them down the chain.
The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
exhaust all the try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
error. Do the same here instead of keep going and retrying.
Note that this is not fixing a real issue, just a cosmetic change, although we
can save some cycles by backing off earlier.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
Acked-by: Mike Kravetz 
---
 mm/page_alloc.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c67c99603a3..689454692de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
+   ret = -EBUSY;
break;
}
 
@@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
 
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+
+   /*
+* On -ENOMEM, migrate_pages() bails out right away. It is 
pointless
+* to retry again over this error, so do the same here.
+*/
+   if (ret == -ENOMEM)
+   break;
}
 
lru_cache_enable();
-- 
2.16.3



[PATCH v8 0/7] Make alloc_contig_range handle Hugetlb pages

2021-04-15 Thread Oscar Salvador
80.060531] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.061972] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.063413] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.064838] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy"

 And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
 again.

  c) Allocated/populated but all 512 huge pages are read/written in a
 loop: all memory can get unplugged again, but I get a single

  [  121.192345] alloc_contig_range: [18, 188000) PFNs busy

  Most probably because it happened to try migrating a huge page while it
  was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it
  can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:5061568 kB
  MemFree:  186560 kB
  MemAvailable: 354524 kB
  ...
  HugePages_Total:2048
  HugePages_Free: 2048
  HugePages_Rsvd:0
  HugePages_Surp:0

  Triggering unplug would require to dissolve+alloc - which now fails when
  trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding. + I
  get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:2048
  HugePages_Free: 2048
  HugePages_Rsvd:    0
  HugePages_Surp:0"

Oscar Salvador (7):
  mm,page_alloc: Bail out earlier on -ENOMEM in
alloc_contig_migrate_range
  mm,compaction: Let isolate_migratepages_{range,block} return error
codes
  mm,hugetlb: Drop clearing of flag from prep_new_huge_page
  mm,hugetlb: Split prep_new_huge_page functionality
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c |  93 +-
 mm/hugetlb.c| 150 ++--
 mm/internal.h   |  10 +++-
 mm/page_alloc.c |  22 +++
 mm/vmscan.c |   5 +-
 6 files changed, 240 insertions(+), 47 deletions(-)

-- 
2.16.3



Re: [PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes

2021-04-15 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 01:54:17PM +0200, David Hildenbrand wrote:
> >* Isolate all pages that can be migrated from the range specified by
> >* [low_pfn, end_pfn). The range is expected to be within same pageblock.
> > - * Returns zero if there is a fatal signal pending, otherwise PFN of the
> > - * first page that was not scanned (which may be both less, equal to or 
> > more
> > - * than end_pfn).
> > + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
> > congestion,
> > + * or 0.
> > + * cc->migrate_pfn will contain the next pfn to scan (which may be both 
> > less,
> > + * equal to or more that end_pfn).
> 
> I failed to parse the last sentence -- e.g., using "both" and then listing
> three options. Also, s/that/than/? Can we simplify to
> 
> "cc->migrate_pfn will contain the next pfn to scan"

Ok, will simplify

> >   /**
> > @@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control 
> > *cc, unsigned long low_pfn,
> >* @start_pfn: The first PFN to start isolating.
> >* @end_pfn:   The one-past-last PFN.
> >*
> > - * Returns zero if isolation fails fatally due to e.g. pending signal.
> > - * Otherwise, function returns one-past-the-last PFN of isolated page
> > - * (which may be greater than end_pfn if end fell in a middle of a THP 
> > page).
> > + * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
> > congestion,
> 
> errno is usually positive.
> 
> > + * or 0.
> 
> I'd be more specific here. Something like
> 
> "
> Returns 0 if isolation succeeded. Returns -EINTR if a fatal signal is
> pending. Returns -EAGAIN when contended and the caller should retry.
> 
> In any case, cc->migrate_pfn is set to one-past-the-last PFN that has been
> isolated.
> "

I will try to reword it. Although note that 0 does not mean the
isolation succeeded.
Compaction tracks whether we could isolate pages by means of
cc->nr_migratepages.

The thing is that alloc_contig_range and compaction code need different
things.
Compaction wants to isolate and migrate as much as it needs to in order to
form a higher-order page (or, in case it is trying to compact the whole zone, it goes through the zone).
alloc_contig_range() needs to isolate the whole range we specified in
order to be able to migrate it and make it free for whoever wants to use
it.

Let us say that the interface between alloc_contig_range() and
compaction still needs some more love, which I plan to work on once this goes
in.
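
To illustrate the difference, this is roughly what a caller like
__alloc_contig_migrate_range() ends up doing with the new interface: a
condensed sketch based on the page_alloc.c hunk in this series, taken from
inside its retry loop with the error handling trimmed, not the exact code.

 	if (list_empty(&cc->migratepages)) {
 		cc->nr_migratepages = 0;
 		ret = isolate_migratepages_range(cc, pfn, end);
 		if (ret && ret != -EAGAIN)
 			break;			/* fatal, e.g. -EINTR */
 		pfn = cc->migrate_pfn;		/* next pfn to scan */
 	}
 	/*
 	 * A 0 return does not mean pages were isolated; that is what
 	 * cc->nr_migratepages and the cc->migratepages list are for.
 	 */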


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-15 Thread Oscar Salvador
On Tue, Apr 13, 2021 at 03:40:18PM +0200, Michal Hocko wrote:
> > +   /*
> > +* Call __prep_new_huge_page() to construct the hugetlb page, 
> > and
> > +* enqueue it then to place it in the freelists. After this,
> > +* counters are back on track. Free hugepages have a refcount 
> > of 0,
> > +* so we need to decrease new_page's count as well.
> > +*/
> > +   __prep_new_huge_page(new_page);
> > +   __prep_account_new_huge_page(h, nid);
> 
> I think it would help to put something like the following into the
> comment above this really strange construct.
> 
>   /*
>* new_page needs to be initialized with the standard
>* hugetlb state. This is normally done by
>* prep_new_huge_page but that takes hugetlb_lock which
>* is already held so we need to open code it here.
>* Reference count trick is needed because allocator
>* gives us referenced page but the pool requires pages
>    * with 0 refcount.
>*/

Ok, I will try to add more info, thanks Michal!
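
For reference, this is roughly the construct that comment would sit on top of,
condensed from the hugetlb.c hunk of this patch (error/retry paths trimmed, and
the refcount handling only described in a comment rather than spelled out):

 	/* old_page is a genuine free, 0-refcount hugetlb page; lock is held */
 	remove_hugetlb_page(h, old_page, false);

 	/*
 	 * new_page came straight from alloc_buddy_huge_page(), so it still
 	 * needs the standard hugetlb initialization; open code it here
 	 * because hugetlb_lock is already held.
 	 */
 	__prep_new_huge_page(new_page);
 	__prep_account_new_huge_page(h, nid);
 	/*
 	 * (in the patch, the allocator's reference on new_page is dropped
 	 *  around here, since free hugepages have a refcount of 0)
 	 */
 	enqueue_huge_page(h, new_page);

 	/* old_page is no longer a hugetlb page; hand it back to the buddy */
 	update_and_free_page(h, old_page);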

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-15 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 02:26:21PM +0200, David Hildenbrand wrote:
> > +static inline int isolate_or_dissolve_huge_page(struct page *page)
> > +{
> > +   return -ENOMEM;
> 
> Without CONFIG_HUGETLB_PAGE, there is no way someone could possible pass in
> something valid. Although it doesn't matter too much, -EINVAL or similar
> sounds a bit better.

I guess we could; it was just to make it consistent with the kind of error
we return when we have it enabled.

> > @@ -809,6 +809,7 @@ isolate_migratepages_block(struct compact_control *cc, 
> > unsigned long low_pfn,
> > bool skip_on_failure = false;
> > unsigned long next_skip_pfn = 0;
> > bool skip_updated = false;
> > +   bool fatal_error = false;
> 
> Can't we use "ret == -ENOMEM" instead of fatal_error?

Yes, we can, I will see how it looks.

[...]

> > +   /*
> > +* Fence off gigantic pages as there is a cyclic dependency between
> > +* alloc_contig_range and them. Return -ENOME as this has the effect
> 
> s/-ENOME/-ENOMEM/

thanks, I missed that one.

> 
> > +* of bailing out right away without further retrying.
> > +*/
> > +   if (hstate_is_gigantic(h))
> > +   return -ENOMEM;
> > +
> > +   return alloc_and_dissolve_huge_page(h, head);
> > +}
> > +
> >   struct page *alloc_huge_page(struct vm_area_struct *vma,
> > unsigned long addr, int avoid_reserve)
> >   {
> > 
> 
> Complicated stuff, but looks good to me.

Thanks for having a look!

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 12:32:58PM +0200, Michal Hocko wrote:
> Well, to be precise it does the very same thing with memamp struct pages
> but that is before the initialization code you have pointed out above.
> In this context it just poisons the allocated content which is the GB
> page storage.

Right.

> > I checked, and when we get there in __alloc_bootmem_huge_page, 
> > page->private is
> > still zeroed, so I guess it should be safe to assume that we do not really 
> > need
> > to clear the flag in __prep_new_huge_page() routine?
> 
> It would be quite nasty if the struct pages content would be undefined.
> Maybe that is possible but then I would rather stick the initialization
> into __alloc_bootmem_huge_page.

Yes, but I do not think that is really possible unless I missed something.
Let us see what Mike thinks of it; if there are no objections, we can
get rid of the flag clearing right there.
 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 12:01:47PM +0200, Oscar Salvador wrote:
>  but it seems that does not the memmap struct page.
that should read as "but it seems that that does not affect the memmap struct 
page"
 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 10:28:33AM +0200, Michal Hocko wrote:
> You are right it doesn't do it there. But all struct pages, even those
> that are allocated by the bootmem allocator should initialize its struct
> pages. They would be poisoned otherwise, right? I would have to look at
> the exact code path but IIRC this should be around the time bootmem
> allocator state transitions to the page allocator.

Ok, you are right.
struct pages are initialized a bit earlier through:

start_kernel
 setup_arch
  paging_init
   zone_sizes_init
free_area_init
 free_area_init_node
  free_area_init_core
   memmap_init_zone
memmap_init_range
 __init_single_page

While the allocation of bootmem hugetlb happens

start_kernel
 parse_args
  ...
   hugepages_setup
...
 hugetlb_hstate_alloc_pages
  __alloc_bootmem_huge_page

which is after the setup_arch() call.

So by the time we get the page from __alloc_bootmem_huge_page(), fields are
zeroed.
I thought we might get in trouble because memblock_alloc_try_nid_raw() calls
page_init_poison() which poisons the chunk with 0xff,e.g:

[1.955471] boot:    

[1.955476] boot:    


 but it seems that does not the memmap struct page.

I checked, and when we get there in __alloc_bootmem_huge_page, page->private is
still zeroed, so I guess it should be safe to assume that we do not really need
to clear the flag in __prep_new_huge_page() routine?

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-14 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 10:12:29AM +0200, David Hildenbrand wrote:
> My guest best is that fast class is something like HBM (High Bandwidth
> Memory), medium class is ordinary RAM, slow class is PMEM.

I see, thanks for the hint David ;-)

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-14 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 10:08:54AM +0200, Oscar Salvador wrote:
> In Dave's example, list is created in a way that stays local to the socket,
> and we go from the fast one to the slow one.

Or maybe it is just because find_next_best_node() does not know any better
and creates the list that way?

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-14 Thread Oscar Salvador
On Fri, Apr 09, 2021 at 08:07:08PM -0700, Wei Xu wrote:
> On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  
> wrote:
> > + * When Node 0 fills up, its memory should be migrated to
> > + * Node 1.  When Node 1 fills up, it should be migrated to
> > + * Node 2.  The migration path start on the nodes with the
> > + * processors (since allocations default to this node) and
> > + * fast memory, progress through medium and end with the
> > + * slow memory:
> > + *
> > + * 0 -> 1 -> 2 -> stop
> > + * 3 -> 4 -> 5 -> stop
> > + *
> > + * This is represented in the node_demotion[] like this:
> > + *
> > + * {  1, // Node 0 migrates to 1
> > + *2, // Node 1 migrates to 2
> > + *   -1, // Node 2 does not migrate
> > + *4, // Node 3 migrates to 4
> > + *5, // Node 4 migrates to 5
> > + *   -1} // Node 5 does not migrate
> > + */
> 
> In this example, if we want to support multiple nodes as the demotion
> target of a source node, we can group these nodes into three tiers
> (classes):
> 
> fast class:
> 0 -> {1, 4}  // 1 is the preferred
> 3 -> {4, 1}  // 4 is the preferred
> 
> medium class:
> 1 -> {2, 5}  // 2 is the preferred
> 4 -> {5, 2}  // 5 is the preferred
> 
> slow class:
> 2 -> stop
> 5 -> stop

Hi Wei Xu,

I have some questions about it.

Fast class/memory are pictured as those nodes with CPUs, while slow class/memory
is PMEM, right?
Then, what does medium class/memory stand for?

In Dave's example, list is created in a way that stays local to the socket,
and we go from the fast one to the slow one.
In yours, lists are created taking the fastest nodes from all sockets and
we work our way down, which means we have cross-socket nodes in the list.
How much of a penalty is that?

And while I get your point, I am not sure if that is what we intend here.
This patchset aims to place cold pages that are about to be reclaimed in slower
nodes to give them a second chance, while your design seems more about having
different memory classes and being able to place applications in one of those
tiers depending on their demands or on sysadmin demand.

Could you expand some more?

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-14 Thread Oscar Salvador
On Wed, Apr 14, 2021 at 08:04:21AM +0200, Michal Hocko wrote:
> On Tue 13-04-21 14:19:03, Mike Kravetz wrote:
> > On 4/13/21 6:23 AM, Michal Hocko wrote:
> > The only place where page->private may not be initialized is when we do
> > allocations at boot time from memblock.  In this case, we will add the
> > pages to the free list via put_page/free_huge_page so the appropriate
> > flags will be cleared before anyone notices.
> 
> Pages allocated by the bootmem should be pre initialized from the boot,
> no?

I guess Mike means:

hugetlb_hstate_alloc_pages
 alloc_bootmem_huge_page
  __alloc_bootmem_huge_page
   memblock_alloc_try_nid_raw

and AFAICS, memblock_alloc_try_nid_raw() does not zero the memory.

Then these pages are initialized in:

gather_bootmem_prealloc
 prep_compound_huge_page
 prep_new_huge_page

But as can be noticed, no one touches page->private when coming from that
path.
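
For reference, a simplified sketch of that boot path (the WARN_ONs, the final
put_page() and the gigantic page accounting are elided here):

 static void __init gather_bootmem_prealloc(void)
 {
 	struct huge_bootmem_page *m;

 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
 		struct hstate *h = m->hstate;

 		prep_compound_huge_page(page, huge_page_order(h));
 		prep_new_huge_page(h, page, page_to_nid(page));
 		/* note: page->private is never written on this path */
 	}
 }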

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v2 resend] mm/memory_hotplug: Make unpopulated zones PCP structures unreachable during hot remove

2021-04-14 Thread Oscar Salvador
On Mon, Apr 12, 2021 at 01:08:42PM +0100, Mel Gorman wrote:
> zone_pcp_reset allegedly protects against a race with drain_pages
> using local_irq_save but this is bogus. local_irq_save only operates
> on the local CPU. If memory hotplug is running on CPU A and drain_pages
> is running on CPU B, disabling IRQs on CPU A does not affect CPU B and
> offers no protection.
> 
> This patch deletes IRQ disable/enable on the grounds that IRQs protect
> nothing and assumes the existing hotplug paths guarantees the PCP cannot be
> used after zone_pcp_enable(). That should be the case already because all
> the pages have been freed and there is no page to put on the PCP lists.
> 
> Signed-off-by: Mel Gorman 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-13 Thread Oscar Salvador
On Tue, Apr 13, 2021 at 02:33:41PM -0700, Mike Kravetz wrote:
> > -static void prep_new_huge_page(struct hstate *h, struct page *page, int 
> > nid)
> > +/*
> > + * Must be called with the hugetlb lock held
> > + */
> > +static void __prep_account_new_huge_page(struct hstate *h, int nid)
> > +{
> > +   h->nr_huge_pages++;
> > +   h->nr_huge_pages_node[nid]++;
> 
> I would prefer if we also move setting the destructor to this routine.
>   set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);

Uhm, but that is the routine that does the accounting, it feels wrong
here, plus...

> 
> That way, PageHuge() will be false until it 'really' is a huge page.
> If not, we could potentially go into that retry loop in
> dissolve_free_huge_page or alloc_and_dissolve_huge_page in patch 5.

...I do not follow here, could you please elaborate some more?
Unless I am missing something, behaviour should not be any different with this
patch.

Thanks


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-13 Thread Oscar Salvador
On Tue, Apr 13, 2021 at 03:29:02PM -0700, Mike Kravetz wrote:
> >   spin_lock_irq(hugetlb_lock)
> >   1) update_and_free_page
> >PageHuge() == F
> >__free_pages()
> >   2) enqueue_huge_page
> >SetPageHugeFreed()
> >   spin_unlock(&hugetlb_lock)
> 
> Very small nit, the above should be spin_unlock_irq(&hugetlb_lock)

Right, I missed it somehow.

> > +   /*
> > +* The page might have been dissolved from under our feet, so make sure
> > +* to carefully check the state under the lock.
> > +* Return success when racing as if we dissolved the page ourselves.
> > +*/
> > +   spin_lock_irq(&hugetlb_lock);
> > +   if (PageHuge(page)) {
> > +   head = compound_head(page);
> > +   h = page_hstate(head);
> > +   } else {
> > +   spin_unlock(&hugetlb_lock);
> 
> Should be spin_unlock_irq(&hugetlb_lock);
> 
> Other than that, it looks good.

Yeah, I will amend it in the next version.

Thanks Mike!


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-13 Thread Oscar Salvador
On Tue, Apr 13, 2021 at 03:48:53PM -0700, Mike Kravetz wrote:
> The label free_new is:
> 
> free_new:
> spin_unlock_irq(&hugetlb_lock);
> __free_pages(new_page, huge_page_order(h));
> 
> return ret;
> 
> So, we are locking and immediately unlocking without any code in
> between.  Usually, I don't like like multiple labels before return.
> However, perhaps we should add another to avoid this unnecessary
> cycle.  On the other hand, this is an uncommon race condition so the
> simple code may be acceptable.

I guess we could have something like:

 free_new:
 spin_unlock_irq(&hugetlb_lock);
 free_new_nolock:
 __free_pages(new_page, huge_page_order(h));
 
 return ret;

And let the retry go there without locking. But as you said, the race
condition is rare enough, so I am not sure this buys us much.
But I can certainly add it if you feel strongly about it.


-- 
Oscar Salvador
SUSE L3


[PATCH v7 5/7] mm: Make alloc_contig_range handle free hugetlb pages

2021-04-13 Thread Oscar Salvador
alloc_contig_range will fail if it ever sees a HugeTLB page within the
range we are trying to allocate, even when that page is free and can be
easily reallocated.
This has proved to be problematic for some users of alloc_contig_range,
e.g: CMA and virtio-mem, where those would fail the call even when those
pages lay in ZONE_MOVABLE and are free.

We can do better by trying to replace such page.

Free hugepages are tricky to handle: so that no userspace application
notices disruption, we need to replace the current free hugepage with
a new one.

In order to do that, a new function called alloc_and_dissolve_huge_page
is introduced.
This function will first try to get a new fresh hugepage, and if it
succeeds, it will replace the old one in the free hugepage pool.

The free page replacement is done under hugetlb_lock, so no external
users of hugetlb will notice the change.
To allocate the new huge page, we use alloc_buddy_huge_page(), so we
do not have to deal with any counters, and prep_new_huge_page() is not
called. This is valuable because, in case we need to free the new page,
we only need to call __free_pages().

Once we know that the page to be replaced is a genuine 0-refcounted
huge page, we remove the old page from the freelist by remove_hugetlb_page().
Then, we can call __prep_new_huge_page() and __prep_account_new_huge_page()
for the new huge page to properly initialize it and increment the
hstate->nr_huge_pages counter (previously decremented by
remove_hugetlb_page()).
Once done, the page is enqueued by enqueue_huge_page() and it is ready
to be used.

There is one tricky case, when the
page's refcount is 0 because it is in the process of being released.
A missing PageHugeFreed bit will tell us that freeing is in flight so
we retry after dropping the hugetlb_lock. The race window should be
small and the next retry should make a forward progress.

E.g:

CPU0                                    CPU1
free_huge_page()                        isolate_or_dissolve_huge_page
                                          PageHuge() == T
                                          alloc_and_dissolve_huge_page
                                            alloc_buddy_huge_page()
                                            spin_lock_irq(hugetlb_lock)
                                            // PageHuge() && !PageHugeFreed &&
                                            // !PageCount()
                                            spin_unlock_irq(hugetlb_lock)
  spin_lock_irq(hugetlb_lock)
  1) update_and_free_page
       PageHuge() == F
       __free_pages()
  2) enqueue_huge_page
       SetPageHugeFreed()
  spin_unlock(&hugetlb_lock)
                                            spin_lock_irq(hugetlb_lock)
                                           1) PageHuge() == F (freed by case#1 from CPU0)
                                           2) PageHuge() == T
                                              PageHugeFreed() == T
                                              - proceed with replacing the page

In the case above we retry, as the race window is quite small and we have a
high chance of succeeding next time.

With regard to the allocation, we restrict it to the node the page belongs
to with __GFP_THISNODE, meaning we do not fallback on other node's zones.

Note that gigantic hugetlb pages are fenced off since there is a cyclic
dependency between them and alloc_contig_range.

Signed-off-by: Oscar Salvador 
---
 include/linux/hugetlb.h |   6 +++
 mm/compaction.c |  37 ++--
 mm/hugetlb.c| 115 
 3 files changed, 155 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 09f1fd12a6fa..b2d2118bfd1a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,6 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
+int isolate_or_dissolve_huge_page(struct page *page);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -877,6 +878,11 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
+static inline int isolate_or_dissolve_huge_page(struct page *page)
+{
+   return -ENOMEM;
+}
+
 static inline struct page *alloc_huge_page(struct vm_area_struct *vma,
   unsigned long addr,
   int avoid_reserve)
diff --git a/mm/compaction.c b/mm/compaction.c
index eeba4668c22c..89426b6d1ea3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -788,7 +788,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
  * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
- * or 0.
+ * -ENOMEM in case we could not al

[PATCH v7 6/7] mm: Make alloc_contig_range handle in-use hugetlb pages

2021-04-13 Thread Oscar Salvador
alloc_contig_range() will fail if it finds a HugeTLB page within the range,
without a chance to handle it. Since HugeTLB pages can be migrated like any
LRU or Movable page, it does not make sense to bail out without trying.
Enable the interface to recognize in-use HugeTLB pages so we can migrate
them, and have a much better chance of the call succeeding.

Signed-off-by: Oscar Salvador 
Reviewed-by: Mike Kravetz 
Acked-by: Michal Hocko 
---
 include/linux/hugetlb.h |  5 +++--
 mm/compaction.c | 12 +++-
 mm/hugetlb.c| 22 +-
 mm/vmscan.c |  5 +++--
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b2d2118bfd1a..b92f25ccef58 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -595,7 +595,7 @@ struct huge_bootmem_page {
struct hstate *hstate;
 };
 
-int isolate_or_dissolve_huge_page(struct page *page);
+int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 struct page *alloc_huge_page(struct vm_area_struct *vma,
unsigned long addr, int avoid_reserve);
 struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
@@ -878,7 +878,8 @@ static inline void huge_ptep_modify_prot_commit(struct 
vm_area_struct *vma,
 #else  /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
-static inline int isolate_or_dissolve_huge_page(struct page *page)
+static inline int isolate_or_dissolve_huge_page(struct page *page,
+   struct list_head *list)
 {
return -ENOMEM;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 89426b6d1ea3..bb8ff3543972 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -909,7 +909,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
}
 
if (PageHuge(page) && cc->alloc_contig) {
-   ret = isolate_or_dissolve_huge_page(page);
+   ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);
 
/*
 * Fail isolation in case isolate_or_dissolve_huge_page
@@ -927,6 +927,15 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
goto isolate_fail;
}
 
+   if (PageHuge(page)) {
+   /*
+* Hugepage was successfully isolated and placed
+* on the cc->migratepages list.
+*/
+   low_pfn += compound_nr(page) - 1;
+   goto isolate_success_no_list;
+   }
+
/*
 * Ok, the hugepage was dissolved. Now these pages are
 * Buddy and cannot be re-allocated because they are
@@ -1068,6 +1077,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
+isolate_success_no_list:
cc->nr_migratepages += compound_nr(page);
nr_isolated += compound_nr(page);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a664d6e82c1..24a453ff47f2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2270,10 +2270,12 @@ static void restore_reserve_on_error(struct hstate *h,
  * alloc_and_dissolve_huge_page - Allocate a new page and dissolve the old one
  * @h: struct hstate old page belongs to
  * @old_page: Old page to dissolve
+ * @list: List to isolate the page in case we need to
  * Returns 0 on success, otherwise negated error.
  */
 
-static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page)
+static int alloc_and_dissolve_huge_page(struct hstate *h, struct page 
*old_page,
+   struct list_head *list)
 {
gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
int nid = page_to_nid(old_page);
@@ -2300,9 +2302,13 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
goto free_new;
} else if (page_count(old_page)) {
/*
-* Someone has grabbed the page, fail for now.
+* Someone has grabbed the page, try to isolate it here.
+* Fail with -EBUSY if not possible.
 */
-   ret = -EBUSY;
+   spin_unlock_irq(&hugetlb_lock);
+   if (!isolate_huge_page(old_page, list))
+   ret = -EBUSY;
+   spin_lock_irq(&hugetlb_lock);
goto free_new;
} else if (!HPageFreed(old_page)) {
/*
@@ -2350,10 +2356,11 @@ static int alloc_and_dissolve_huge_page(struct hstate 
*h, struct page *old_page)
return ret;
 }
 
-int isolate_or_dissolve_huge_pa

[PATCH v7 7/7] mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

2021-04-13 Thread Oscar Salvador
pfn_range_valid_contig() bails out when it finds an in-use page or a
hugetlb page, among other things.
We can drop the in-use page check since __alloc_contig_pages can migrate
away those pages, and the hugetlb page check can go too since
isolate_migratepages_range is now capable of dealing with hugetlb pages.
Either way, those checks are racy so let the end function handle it
when the time comes.

Signed-off-by: Oscar Salvador 
Suggested-by: David Hildenbrand 
Reviewed-by: David Hildenbrand 
---
 mm/page_alloc.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5a94de3cdde..c5338e912ace 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8901,12 +8901,6 @@ static bool pfn_range_valid_contig(struct zone *z, 
unsigned long start_pfn,
 
if (PageReserved(page))
return false;
-
-   if (page_count(page) > 0)
-   return false;
-
-   if (PageHuge(page))
-   return false;
}
return true;
 }
-- 
2.16.3



[PATCH v7 4/7] mm,hugetlb: Split prep_new_huge_page functionality

2021-04-13 Thread Oscar Salvador
Currently, prep_new_huge_page() performs two functions.
It sets the right state for a new hugetlb, and increases the hstate's
counters to account for the new page.

Let us split its functionality into two separate functions, decoupling
the handling of the counters from initializing a hugepage.
The outcome is having __prep_new_huge_page(), which only
initializes the page, and __prep_account_new_huge_page(), which adds
the new page to the hstate's counters.

This allows us to set up a hugetlb page without having to worry
about the counter/locking. It will prove useful in the next patch.
prep_new_huge_page() still calls both functions.

Signed-off-by: Oscar Salvador 
---
 mm/hugetlb.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e40d5fe5c63c..0607b2b71ac6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1483,7 +1483,16 @@ void free_huge_page(struct page *page)
}
 }
 
-static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+/*
+ * Must be called with the hugetlb lock held
+ */
+static void __prep_account_new_huge_page(struct hstate *h, int nid)
+{
+   h->nr_huge_pages++;
+   h->nr_huge_pages_node[nid]++;
+}
+
+static void __prep_new_huge_page(struct page *page)
 {
INIT_LIST_HEAD(&page->lru);
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
@@ -1491,9 +1500,13 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup_rsvd(page, NULL);
ClearHPageFreed(page);
+}
+
+static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
+{
+   __prep_new_huge_page(page);
spin_lock_irq(&hugetlb_lock);
-   h->nr_huge_pages++;
-   h->nr_huge_pages_node[nid]++;
+   __prep_account_new_huge_page(h, nid);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v7 3/7] mm,hugetlb: Clear HPageFreed outside of the lock

2021-04-13 Thread Oscar Salvador
Currently, the clearing of the flag is done under the lock, but this
is unnecessary as we just allocated the page and we did not give it
away yet, so no one should be messing with it.

Also, this helps make it clear that here the lock is only protecting the
counter.

Signed-off-by: Oscar Salvador 
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54d81d5947ed..e40d5fe5c63c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1490,10 +1490,10 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
hugetlb_set_page_subpool(page, NULL);
set_hugetlb_cgroup(page, NULL);
set_hugetlb_cgroup_rsvd(page, NULL);
+   ClearHPageFreed(page);
spin_lock_irq(&hugetlb_lock);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
-   ClearHPageFreed(page);
spin_unlock_irq(&hugetlb_lock);
 }
 
-- 
2.16.3



[PATCH v7 2/7] mm,compaction: Let isolate_migratepages_{range,block} return error codes

2021-04-13 Thread Oscar Salvador
Currently, isolate_migratepages_{range,block} and their callers use
a pfn == 0 vs pfn != 0 scheme to let the caller know whether there was
any error during isolation.
This does not work as soon as we need to start reporting different error
codes and make sure we pass them down the chain, so they are properly
interpreted by functions like e.g: alloc_contig_range.

Let us rework isolate_migratepages_{range,block} so we can report error
codes.
Since isolate_migratepages_block will stop returning the next pfn to be
scanned, we reuse the cc->migrate_pfn field to keep track of that.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
---
 mm/compaction.c | 54 +++---
 mm/internal.h   | 10 --
 mm/page_alloc.c |  7 +++
 3 files changed, 38 insertions(+), 33 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8c5028bfbd56..eeba4668c22c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,15 +787,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
- * Returns zero if there is a fatal signal pending, otherwise PFN of the
- * first page that was not scanned (which may be both less, equal to or more
- * than end_pfn).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
+ * or 0.
+ * cc->migrate_pfn will contain the next pfn to scan (which may be both less,
+ * equal to or more that end_pfn).
  *
  * The pages are isolated on cc->migratepages list (not required to be empty),
- * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
- * is neither read nor updated.
+ * and cc->nr_migratepages is updated accordingly.
  */
-static unsigned long
+static int
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
@@ -809,6 +809,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
bool skip_updated = false;
+   int ret = 0;
+
+   cc->migrate_pfn = low_pfn;
 
/*
 * Ensure that there are not too many pages isolated from the LRU
@@ -818,16 +821,16 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
while (unlikely(too_many_isolated(pgdat))) {
/* stop isolation if there are still pages not migrated */
if (cc->nr_migratepages)
-   return 0;
+   return -EAGAIN;
 
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
-   return 0;
+   return -EAGAIN;
 
congestion_wait(BLK_RW_ASYNC, HZ/10);
 
if (fatal_signal_pending(current))
-   return 0;
+   return -EINTR;
}
 
cond_resched();
@@ -875,8 +878,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
if (fatal_signal_pending(current)) {
cc->contended = true;
+   ret = -EINTR;
 
-   low_pfn = 0;
goto fatal_pending;
}
 
@@ -1130,7 +1133,9 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (nr_isolated)
count_compact_events(COMPACTISOLATED, nr_isolated);
 
-   return low_pfn;
+   cc->migrate_pfn = low_pfn;
+
+   return ret;
 }
 
 /**
@@ -1139,15 +1144,15 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
  * @start_pfn: The first PFN to start isolating.
  * @end_pfn:   The one-past-last PFN.
  *
- * Returns zero if isolation fails fatally due to e.g. pending signal.
- * Otherwise, function returns one-past-the-last PFN of isolated page
- * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ * Returns errno, like -EAGAIN or -EINTR in case e.g signal pending or 
congestion,
+ * or 0.
  */
-unsigned long
+int
 isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
unsigned long end_pfn)
 {
unsigned long pfn, block_start_pfn, block_end_pfn;
+   int ret = 0;
 
/* Scan block by block. First and last block may be incomplete */
pfn = start_pfn;
@@ -1166,17 +1171,17 @@ isolate_migratepages_range(struct compact_control *cc, 
unsigned long start_pfn,
block_end_pfn, cc->zone))
continue;
 
-   pfn = isolate_migra

[PATCH v7 1/7] mm,page_alloc: Bail out earlier on -ENOMEM in alloc_contig_migrate_range

2021-04-13 Thread Oscar Salvador
Currently, __alloc_contig_migrate_range can generate -EINTR, -ENOMEM or -EBUSY,
and report them down the chain.
The problem is that when migrate_pages() reports -ENOMEM, we keep going till we
exhaust all the try-attempts (5 at the moment) instead of bailing out.

migrate_pages() bails out right away on -ENOMEM because it is considered a fatal
error. Do the same here instead of keep going and retrying.
Note that this is not fixing a real issue, just a cosmetic change, although we
can save some cycles by backing off earlier.

Signed-off-by: Oscar Salvador 
Acked-by: Vlastimil Babka 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 mm/page_alloc.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c67c99603a3..689454692de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8697,7 +8697,7 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
+   ret = -EBUSY;
break;
}
 
@@ -8707,6 +8707,13 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
 
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
+
+   /*
+* On -ENOMEM, migrate_pages() bails out right away. It is 
pointless
+* to retry again over this error, so do the same here.
+*/
+   if (ret == -ENOMEM)
+   break;
}
 
lru_cache_enable();
-- 
2.16.3



[PATCH v7 0/7] Make alloc_contig_range handle Hugetlb pages

2021-04-13 Thread Oscar Salvador
064838] alloc_contig_range: [1b8000, 1c) PFNs busy
  [  180.065848] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.066794] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.067738] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.068669] alloc_contig_range: [1bfc00, 1c) PFNs busy
  [  180.069598] alloc_contig_range: [1bfc00, 1c) PFNs busy"

 And then with this patchset running:

 "Same experiment with ZONE_MOVABLE:

  a) Free huge pages: all memory can get unplugged again.

  b) Allocated/populated but idle huge pages: all memory can get unplugged
 again.

  c) Allocated/populated but all 512 huge pages are read/written in a
 loop: all memory can get unplugged again, but I get a single

  [  121.192345] alloc_contig_range: [18, 188000) PFNs busy

  Most probably because it happened to try migrating a huge page while it
  was busy. As virtio-mem retries on ZONE_MOVABLE a couple of times, it
  can deal with this temporary failure.

  Last but not least, I did something extreme:

  # cat /proc/meminfo
  MemTotal:5061568 kB
  MemFree:  186560 kB
  MemAvailable: 354524 kB
  ...
  HugePages_Total:2048
  HugePages_Free: 2048
  HugePages_Rsvd:0
  HugePages_Surp:0

  Triggering unplug would require to dissolve+alloc - which now fails when
  trying to allocate an additional ~512 huge pages (1G).

  As expected, I can properly see memory unplug not fully succeeding. + I
  get a fairly continuous stream of

  [  226.611584] alloc_contig_range: [19f400, 19f800) PFNs busy
  ...

  But more importantly, the hugepage count remains stable, as configured
  by the admin (me):

  HugePages_Total:2048
  HugePages_Free: 2048
  HugePages_Rsvd:0
  HugePages_Surp:0"

Oscar Salvador (7):
  mm,page_alloc: Bail out earlier on -ENOMEM in
alloc_contig_migrate_range
  mm,compaction: Let isolate_migratepages_{range,block} return error
codes
  mm,hugetlb: Clear HPageFreed outside of the lock
  mm,hugetlb: Split prep_new_huge_page functionality
  mm: Make alloc_contig_range handle free hugetlb pages
  mm: Make alloc_contig_range handle in-use hugetlb pages
  mm,page_alloc: Drop unnecessary checks from pfn_range_valid_contig

 include/linux/hugetlb.h |   7 +++
 mm/compaction.c |  97 ++-
 mm/hugetlb.c| 148 ++--
 mm/internal.h   |  10 +++-
 mm/page_alloc.c |  22 +++
 mm/vmscan.c |   5 +-
 6 files changed, 242 insertions(+), 47 deletions(-)

-- 
2.16.3



Re: [PATCH V2 4/6] mm: Drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION

2021-04-12 Thread Oscar Salvador
On Thu, Apr 01, 2021 at 12:14:06PM +0530, Anshuman Khandual wrote:
> ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION configs have duplicate definitions on
> platforms that subscribe them. Drop these reduntant definitions and instead
> just select them appropriately.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: Andrew Morton 
> Cc: x...@kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: Catalin Marinas  (arm64)
> Signed-off-by: Anshuman Khandual 

Hi Anshuman, 

X86 needs fixing, see below:

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 503d8b2e8676..10702ef1eb57 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -60,8 +60,10 @@ config X86
>   select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI
>   select ARCH_32BIT_OFF_T if X86_32
>   select ARCH_CLOCKSOURCE_INIT
> + select ARCH_ENABLE_HUGEPAGE_MIGRATION if x86_64 && HUGETLB_PAGE && 
> MIGRATION
>   select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
>   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
> + select ARCH_ENABLE_THP_MIGRATION if x86_64 && TRANSPARENT_HUGEPAGE

you need s/x86_64/X86_64/, otherwise we are left with no migration :-)

-- 
Oscar Salvador
SUSE L3


Re: [PATCH v5 4/8] hugetlb: create remove_hugetlb_page() to separate functionality

2021-04-12 Thread Oscar Salvador
On Fri, Apr 09, 2021 at 01:52:50PM -0700, Mike Kravetz wrote:
> The new remove_hugetlb_page() routine is designed to remove a hugetlb
> page from hugetlbfs processing.  It will remove the page from the active
> or free list, update global counters and set the compound page
> destructor to NULL so that PageHuge() will return false for the 'page'.
> After this call, the 'page' can be treated as a normal compound page or
> a collection of base size pages.
> 
> update_and_free_page no longer decrements h->nr_huge_pages{_node} as
> this is performed in remove_hugetlb_page.  The only functionality
> performed by update_and_free_page is to free the base pages to the lower
> level allocators.
> 
> update_and_free_page is typically called after remove_hugetlb_page.
> 
> remove_hugetlb_page is to be called with the hugetlb_lock held.
> 
> Creating this routine and separating functionality is in preparation for
> restructuring code to reduce lock hold times.  This commit should not
> introduce any changes to functionality.
> 
> Signed-off-by: Mike Kravetz 
> Acked-by: Michal Hocko 
> Reviewed-by: Miaohe Lin 
> Reviewed-by: Muchun Song 

Reviewed-by: Oscar Salvador 

A "nit" below:

>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>   int i;
> @@ -1334,8 +1369,6 @@ static void update_and_free_page(struct hstate *h, 
> struct page *page)

After this, update_and_free_page()'s job is to reset the subpages' flags and free
the page.
Maybe we want to rename that function at some point, or maybe not, as "update"
might already imply that. Just thinking out loud.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-04-12 Thread Oscar Salvador
On Fri, Apr 09, 2021 at 08:59:21PM +0200, David Hildenbrand wrote:
 
> The only way to add more System RAM is via add_memory() and friends like
> add_memory_driver_managed(). These all require CONFIG_MEMORY_HOTPLUG.

Yeah, my point was more towards whether PMEM can come in a way where it does
not have to be hotplugged, but rather comes up functional by default (as RAM).
But after having read all the papers out there, I do not think that is possible.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-04-09 Thread Oscar Salvador
On Fri, Apr 09, 2021 at 12:14:04PM +0200, Oscar Salvador wrote:
> On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
> > I am not really into PMEM, and I do not know whether we need
> > CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
> > If so, the following can be partly ignored.
> 
> Ok, I refreshed my memory with [1].

Bleh, being [1] 
https://lore.kernel.org/linux-mm/20190124231448.e102d...@viggo.jf.intel.com/

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-04-09 Thread Oscar Salvador
On Thu, Apr 08, 2021 at 11:52:51AM +0200, Oscar Salvador wrote:
> I am not really into PMEM, and I do not know whether we need
> CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
> If so, the following can be partly ignored.

Ok, I refreshed my memory with [1].
From that, it seems that in order to use PMEM as RAM we need
CONFIG_MEMORY_HOTPLUG.
But is that always the case? Can it happen that in some scenario PMEM comes ready
to use and we do not need the hotplug trick?

Anyway, I would still like to clarify the state of the HOTPLUG_CPU.
On x86_64, HOTPLUG_CPU and MEMORY_HOTPLUG are tied by SPM means, but on arm64
one can have MEMORY_HOTPLUG while not having picked HOTPLUG_CPU.

My point is that we might want to guard the callback functions and the callback
registration for cpu-hotplug with their own HOTPLUG_CPU block, instead of guarding
them in the same MEMORY_HOTPLUG block, to make it clearer.
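
Something like the following is what I have in mind, just to illustrate where
the guard would go (the demotion_cpu_online/offline names stand in for whatever
the series calls its cpu-hotplug callbacks, and the dynamic state is only an
example):

 #ifdef CONFIG_HOTPLUG_CPU
 	/* cpu-hotplug callbacks registered under their own config option */
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/demotion:online",
 				demotion_cpu_online, demotion_cpu_offline);
 #endif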

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 07/10] mm/vmscan: add helper for querying ability to age anonymous pages

2021-04-09 Thread Oscar Salvador
On Wed, Apr 07, 2021 at 11:40:13AM -0700, Wei Xu wrote:
> anon_should_be_aged() doesn't really need "lruvec".  It essentially
> answers whether the pages of the given node can be swapped or demoted.
> So it would be clearer and less confusing if anon_should_be_aged()
> takes "pgdat" instead of "lruvec" as the argument.  The call to
> mem_cgroup_lruvec(NULL, pgdat) in age_active_anon() can then be removed
> as well.

I tend to agree with this, and I would go one step further with the naming.
For me, taking into account the nature of the function that tells us whether
we have any means to age those pages, a better fit would be something like
anon_can_be_aged(). IIUC, the "should age" part would be inactive_is_low().
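
In other words, something of this shape (only a sketch of the suggested
signature; checking demotion via next_demotion_node() is my assumption about
how the series exposes it):

 static bool anon_can_be_aged(struct pglist_data *pgdat)
 {
 	/* Aging anon pages is useful if they can be swapped out... */
 	if (total_swap_pages)
 		return true;

 	/* ...or demoted to a slower memory tier. */
 	return next_demotion_node(pgdat->node_id) != NUMA_NO_NODE;
 }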


-- 
Oscar Salvador
SUSE L3


Re: [PATCH 02/10] mm/numa: automatically generate node migration order

2021-04-09 Thread Oscar Salvador
On Thu, Apr 08, 2021 at 02:51:20PM -0700, Dave Hansen wrote:
> I've fleshed out the description a bit.  I hope this helps?

Yes, thanks Dave, both additions look fine to me.


-- 
Oscar Salvador
SUSE L3


Re: [PATCH v6 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-08 Thread Oscar Salvador
On Thu, Apr 08, 2021 at 10:05:18PM -0700, Andrew Morton wrote:
> Yes please.

I just sent v7 with that yesterday [1]

Hope David/Michal finds some time to review patch#4 as that is the only
missing piece atm.

[1] https://lkml.org/lkml/2021/4/8/546


-- 
Oscar Salvador
SUSE L3


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-08 Thread Oscar Salvador
On Thu, Apr 08, 2021 at 01:40:33PM -0700, Yang Shi wrote:
> Thanks a lot for the example code. You didn't miss anything. At first
> glance, I thought your suggestion seemed neater. Actually I
> misunderstood what Dave said about "That could really have caused some
> interesting problems." with multiple calls to migrate_pages(). I was
> thinking about:
> 
> unsigned long foo()
> {
> unsigned long *ret_succeeded;
> 
> migrate_pages(..., ret_succeeded);
> 
> migrate_pages(..., ret_succeeded);
> 
> return *ret_succeeded;
> }

But that would not be a problem either. I mean, I am not sure what foo() is
supposed to do.
I assume it is supposed to return the *total* number of pages that were
migrated?

Then could do something like:

 unsigned long foo()
 {
 unsigned long ret_succeeded;
 unsigned long total_succeeded = 0;

 migrate_pages(..., &ret_succeeded);
 total_succeeded += ret_succeeded;

 migrate_pages(..., &ret_succeeded);
 total_succeeded += ret_succeeded;

 return total_succeeded;
 }

But AFAICS, you would have to do that with Wei Xu's version and with mine,
no difference there.

IIUC, Dave's concern was that nr_succeeded was only set to 0 at the beginning
of the function, and never reset back, which means we would carry the sum of
previous nr_succeeded values instead of the nr_succeeded of that round.
That would be misleading for e.g. reclaim in case we were to call
migrate_pages() several times, as instead of a delta value, nr_succeeded
would accumulate.

But that won't happen with either Wei Xu's version or mine.

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-08 Thread Oscar Salvador
On Thu, Apr 08, 2021 at 08:17:26PM +0200, Oscar Salvador wrote:
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 3a389633b68f..fd661cb2ce13 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -40,7 +40,8 @@ extern int migrate_page(struct address_space *mapping,
>   struct page *newpage, struct page *page,
>   enum migrate_mode mode);
>  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t 
> free,
> - unsigned long private, enum migrate_mode mode, int reason);
> + unsigned long private, enum migrate_mode mode, int reason,
> + unsigned int *ret_succeeded);
>  extern struct page *alloc_migration_target(struct page *page, unsigned long 
> private);
>  extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
>  extern void putback_movable_page(struct page *page);
> @@ -58,7 +59,7 @@ extern int migrate_page_move_mapping(struct address_space 
> *mapping,
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t new,
>   free_page_t free, unsigned long private, enum migrate_mode mode,
> - int reason)
> + int reason, unsigned int *ret_succeeded)
>   { return -ENOSYS; }
>  static inline struct page *alloc_migration_target(struct page *page,
>   unsigned long private)
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f4476e68e..7238e8faff04 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2364,7 +2364,7 @@ compact_zone(struct compact_control *cc, struct 
> capture_control *capc)
> 
>   err = migrate_pages(>migratepages, compaction_alloc,
>   compaction_free, (unsigned long)cc, cc->mode,
> - MR_COMPACTION);
> + MR_COMPACTION, NULL);
> 
>   trace_mm_compaction_migratepages(cc->nr_migratepages, err,
>   >migratepages);
> diff --git a/mm/gup.c b/mm/gup.c
> index e40579624f10..b70d463aa1fc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1606,7 +1606,7 @@ static long check_and_migrate_cma_pages(struct 
> mm_struct *mm,
>   put_page(pages[i]);
> 
>   if (migrate_pages(_page_list, alloc_migration_target, NULL,
> - (unsigned long), MIGRATE_SYNC, MR_CONTIG_RANGE)) {
> + (unsigned long), MIGRATE_SYNC, MR_CONTIG_RANGE, 
> NULL)) {
>   /*
>* some of the pages failed migration. Do get_user_pages
>* without migration.
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 24210c9bd843..a17e0f039076 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1852,7 +1852,8 @@ static int __soft_offline_page(struct page *page)
> 
>   if (isolate_page(hpage, )) {
>   ret = migrate_pages(, alloc_migration_target, NULL,
> - (unsigned long), MIGRATE_SYNC, MR_MEMORY_FAILURE);
> + (unsigned long), MIGRATE_SYNC, MR_MEMORY_FAILURE,
> + NULL);
>   if (!ret) {
>   bool release = !huge;
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0cdbbfbc5757..28496376de94 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1466,7 +1466,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
> end_pfn)
>   if (nodes_empty(nmask))
>   node_set(mtc.nid, nmask);
>   ret = migrate_pages(, alloc_migration_target, NULL,
> - (unsigned long), MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> + (unsigned long), MIGRATE_SYNC, MR_MEMORY_HOTPLUG,
> + NULL);
>   if (ret) {
>   list_for_each_entry(page, , lru) {
>   pr_warn("migrating pfn %lx failed ret:%d ",
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ab51132547b8..df260ed12102 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1103,7 +1103,8 @@ static int migrate_to_node(struct mm_struct *mm, int 
> source, int dest,
> 
>   if (!list_empty()) {
>   err = migrate_pages(, alloc_migration_target, NULL,
> - (unsigned long), MIGRATE_SYNC, MR_SYSCALL);
> + (unsigned long), MIGRATE_SYNC, MR_SYSCALL,
> + NULL);
>   if (err)
>   putback_m

Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-08 Thread Oscar Salvador

if (!list_empty()) {
err = migrate_pages(, alloc_migration_target, NULL,
-   (unsigned long), MIGRATE_SYNC, MR_SYSCALL);
+   (unsigned long), MIGRATE_SYNC, MR_SYSCALL,
+   NULL);
if (err)
putback_movable_pages();
}
@@ -1355,7 +1356,8 @@ static long do_mbind(unsigned long start, unsigned long 
len,
if (!list_empty()) {
WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(, new_page, NULL,
-   start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+   start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND,
+   NULL);
if (nr_failed)
putback_movable_pages();
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 695a594e5860..087ed407b3ce 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1493,6 +1493,9 @@ static inline int try_split_thp(struct page *page, struct 
page **page2,
  * @mode:  The migration mode that specifies the constraints for
  * page migration, if any.
  * @reason:The reason for page migration.
+ * @ret_succeeded: A pointer to place the value of the number of pages
+ *  migrated successfully. The caller must pass a valid 
pointer
+ *  if they care about it.
  *
  * The function returns after 10 attempts or if no pages are movable any more
  * because the list has become empty or no retryable pages exist any more.
@@ -1503,7 +1506,7 @@ static inline int try_split_thp(struct page *page, struct 
page **page2,
  */
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
-   enum migrate_mode mode, int reason)
+   enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
int retry = 1;
int thp_retry = 1;
@@ -1654,6 +1657,9 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;

+   if (ret_succeeded)
+   *ret_succeeded = nr_succeeded;
+
return rc;
 }

@@ -1723,7 +1729,8 @@ static int do_move_pages_to_node(struct mm_struct *mm,
};

err = migrate_pages(pagelist, alloc_migration_target, NULL,
-   (unsigned long), MIGRATE_SYNC, MR_SYSCALL);
+   (unsigned long), MIGRATE_SYNC, MR_SYSCALL,
+   NULL);
if (err)
putback_movable_pages(pagelist);
return err;
@@ -2230,7 +2237,7 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
list_add(>lru, );
nr_remaining = migrate_pages(, alloc_misplaced_dst_page,
 NULL, node, MIGRATE_ASYNC,
-MR_NUMA_MISPLACED);
+MR_NUMA_MISPLACED, NULL);
if (nr_remaining) {
if (!list_empty()) {
list_del(>lru);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46f3d594369d..0c1bbadd5ca3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8490,7 +8490,8 @@ static int __alloc_contig_migrate_range(struct 
compact_control *cc,
cc->nr_migratepages -= nr_reclaimed;

ret = migrate_pages(>migratepages, alloc_migration_target,
-   NULL, (unsigned long), cc->mode, 
MR_CONTIG_RANGE);
+   NULL, (unsigned long), cc->mode, 
MR_CONTIG_RANGE,
+   NULL);
}
if (ret < 0) {
putback_movable_pages(>migratepages);


As I said I might be missing a point here, but I cannot see the problem
you describe here.


-- 
Oscar Salvador
SUSE L3


[PATCH v7 8/8] arm64/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-08 Thread Oscar Salvador
Enable arm64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 arch/arm64/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 5656e7aacd69..0e23acd173f0 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -309,6 +309,9 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   def_bool y
+
 config SMP
def_bool y
 
-- 
2.16.3



[PATCH v7 7/8] x86/Kconfig: Introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE

2021-04-08 Thread Oscar Salvador
Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 arch/x86/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..9f0211df1746 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2433,6 +2433,9 @@ config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
depends on MEMORY_HOTPLUG
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   def_bool y
+
 config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
-- 
2.16.3



[PATCH v7 5/8] acpi,memhotplug: Enable MHP_MEMMAP_ON_MEMORY when supported

2021-04-08 Thread Oscar Salvador
Let the caller check whether it can pass MHP_MEMMAP_ON_MEMORY by
checking mhp_supports_memmap_on_memory().
MHP_MEMMAP_ON_MEMORY can only be set in case
ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE is enabled, the architecture supports
altmap, and the range to be added spans a single memory block.
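
For reference, the check boils down to something along these lines (a rough
sketch of what mhp_supports_memmap_on_memory() tests in patch#4 of this series;
the exact helper lives there):

/* Sketch of the conditions described above, not the verbatim helper. */
bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
	unsigned long remaining_size = size - vmemmap_size;

	return memmap_on_memory &&                        /* opted in at boot */
	       IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && /* arch opted in */
	       size == memory_block_size_bytes() &&       /* single memory block */
	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&      /* vmemmap can be PMD mapped */
	       IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);
}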

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 drivers/acpi/acpi_memhotplug.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index b02fd51e5589..8cc195c4c861 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -171,6 +171,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
acpi_handle handle = mem_device->device->handle;
int result, num_enabled = 0;
struct acpi_memory_info *info;
+   mhp_t mhp_flags = MHP_NONE;
int node;
 
node = acpi_get_node(handle);
@@ -194,8 +195,10 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (node < 0)
node = memory_add_physaddr_to_nid(info->start_addr);
 
+   if (mhp_supports_memmap_on_memory(info->length))
+   mhp_flags |= MHP_MEMMAP_ON_MEMORY;
result = __add_memory(node, info->start_addr, info->length,
- MHP_NONE);
+ mhp_flags);
 
/*
 * If the memory block has been used by the kernel, add_memory()
-- 
2.16.3



[PATCH v7 6/8] mm,memory_hotplug: Add kernel boot option to enable memmap_on_memory

2021-04-08 Thread Oscar Salvador
Self-stored memmap leads to a sparse memory situation which is unsuitable
for workloads that require large contiguous memory chunks, so make this
an opt-in which needs to be explicitly enabled.

To control this, let memory_hotplug have its own module-parameter namespace, as
suggested by David, so we can add the memory_hotplug.memmap_on_memory parameter.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
Acked-by: Michal Hocko 
---
 Documentation/admin-guide/kernel-parameters.txt | 17 +
 mm/Makefile |  5 -
 mm/memory_hotplug.c | 10 +-
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..af32c17cd4eb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2794,6 +2794,23 @@
seconds.  Use this parameter to check at some
other rate.  0 disables periodic checking.
 
+   memory_hotplug.memmap_on_memory
+   [KNL,X86,ARM] Boolean flag to enable this feature.
+   Format: {on | off (default)}
+   When enabled, runtime hotplugged memory will
+   allocate its internal metadata (struct pages)
+   from the hotadded memory which will allow to
+   hotadd a lot of memory without requiring
+   additional memory to do so.
+   This feature is disabled by default because it
+   has some implication on large (e.g. GB)
+   allocations in some configurations (e.g. small
+   memory blocks).
+   The state of the flag can be read in
+   /sys/module/memory_hotplug/parameters/memmap_on_memory.
+   Note that even when enabled, there are a few cases where
+   the feature is not effective.
+
memtest=[KNL,X86,ARM,PPC] Enable memtest
Format: 
default : 0 
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..82ae9482f5e3 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,9 +58,13 @@ obj-y:= filemap.o mempool.o 
oom_kill.o fadvise.o \
 page-alloc-y := page_alloc.o
 page-alloc-$(CONFIG_SHUFFLE_PAGE_ALLOCATOR) += shuffle.o
 
+# Give 'memory_hotplug' its own module-parameter namespace
+memory-hotplug-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
+
 obj-y += page-alloc.o
 obj-y += init-mm.o
 obj-y += memblock.o
+obj-y += $(memory-hotplug-y)
 
 ifdef CONFIG_MMU
obj-$(CONFIG_ADVISE_SYSCALLS)   += madvise.o
@@ -83,7 +87,6 @@ obj-$(CONFIG_SLUB) += slub.o
 obj-$(CONFIG_KASAN)+= kasan/
 obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
-obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_MEMTEST)  += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b48067839f90..ccbdde94c488 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,7 +42,15 @@
 #include "internal.h"
 #include "shuffle.h"
 
-static bool memmap_on_memory;
+
+/*
+ * memory_hotplug.memmap_on_memory parameter
+ */
+static bool memmap_on_memory __ro_after_init;
+#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
+module_param(memmap_on_memory, bool, 0444);
+MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory 
hotplug");
+#endif
 
 /*
  * online_page_callback contains pointer to current page onlining function.
-- 
2.16.3



[PATCH v7 4/8] mm,memory_hotplug: Allocate memmap from the added memory range

2021-04-08 Thread Oscar Salvador
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) an existing memory is consumed for that purpose
(eg: ~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
which has performance drawbacks.
 c) It might be that there are no PMD_ALIGNED chunks, so the memmap array gets
    populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
struct pages which back the allocated space then just need to be treated
carefully.

Implementation wise we will reuse vmem_altmap infrastructure to override
the default allocator used by __populate_section_memmap.
Part of the implementation also relies on memory_block structure gaining
a new field which specifies the number of vmemmap_pages at the beginning.
This patch also introduces the following functions:

 - vmemmap_init_space: Initializes vmemmap pages by calling move_pfn_range_to_zone(),
                       calls kasan_add_zero_shadow() for the vmemmap range and
                       marks online as many sections as vmemmap pages fully span.
 - vmemmap_adjust_pages: Accounts/subtracts vmemmap_pages to/from node and zone
                         present_pages.
 - vmemmap_deinit_space: Undoes what vmemmap_init_space does.

The new function memory_block_online() calls vmemmap_init_space() before
doing the actual online_pages(). Should online_pages() fail, we clean up
by calling vmemmap_adjust_pages() and vmemmap_deinit_space().

On offline, memory_block_offline() calls vmemmap_adjust_pages() prior to calling
offline_pages(), because offline_pages() performs the tearing-down of kthreads
and the rebuilding of the zonelists if the node/zone become empty.
If offline_pages() fails, we account back vmemmap pages by 
vmemmap_adjust_pages().
If it succeeds, we call vmemmap_deinit_space().

Hot-remove:

 We need to be careful when removing memory, as adding and
 removing memory needs to be done with the same granularity.
 To check that this assumption is not violated, we check the
 memory range we want to remove and if a) any memory block has
 vmemmap pages and b) the range spans more than a single memory
 block, we scream out loud and refuse to proceed.

 If all is good and the range was using memmap on memory (aka vmemmap pages),
 we construct an altmap structure so free_hugepage_table does the right
 thing and calls vmem_altmap_free instead of free_pagetable.
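
 The hot-remove side amounts to roughly the following (a sketch, not the
 verbatim hunk; the helper name here is only illustrative):

/*
 * Sketch: if the block was added with memmap_on_memory, rebuild a
 * vmem_altmap covering the vmemmap pages at the start of the range so
 * the vmemmap teardown hands them to vmem_altmap_free() instead of
 * free_pagetable().
 */
static void remove_memory_range_sketch(int nid, u64 start, u64 size,
				       unsigned long nr_vmemmap_pages)
{
	struct vmem_altmap mhp_altmap = {
		.base_pfn = PHYS_PFN(start),
		.alloc = nr_vmemmap_pages,
	};
	struct vmem_altmap *altmap = NULL;

	if (nr_vmemmap_pages)
		altmap = &mhp_altmap;

	arch_remove_memory(nid, start, size, altmap);
}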

Signed-off-by: Oscar Salvador 
---
 drivers/base/memory.c  |  64 ++--
 include/linux/memory.h |   8 +-
 include/linux/memory_hotplug.h |  13 
 include/linux/memremap.h   |   2 +-
 include/linux/mmzone.h |   7 +-
 mm/Kconfig |   5 ++
 mm/memory_hotplug.c| 162 -
 mm/sparse.c|   2 -
 8 files changed, 247 insertions(+), 16 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f209925a5d4e..a5e536a3e9a4 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -173,16 +173,65 @@ static int memory_block_online(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+   int ret;
+
+   /*
+* Although vmemmap pages have a different lifecycle than the pages
+* they describe (they remain until the memory is unplugged), doing
+* its initialization and accounting at hot-{online,offline} stage
+* simplifies things a lot
+*/
+   if (nr_vmemmap_pages) {
+   ret = vmemmap_init_space(start_pfn, nr_vmemmap_pages, mem->nid,
+mem->online_type);
+   if (ret)
+   return ret;
+   }
 
-   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+   ret = online_pages(start_pfn + nr_vmemmap_pages,
+  nr_pages - nr_vmemmap_pages, mem->online_type,
+  mem->nid);
+
+   /*
+* Undo the work if online_pages() fails.
+*/
+   if (ret && nr_vmemmap_pages) {
+   vmemmap_adjust_pages(start_pfn, -nr_vmemmap_pages);
+   vmemmap_deinit_space(start_pfn, nr_vmemmap_pages);
+   }
+
+   return ret;
 }
 
 static int memory_block_offline(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
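
The offline counterpart follows the flow described above: unaccount the
vmemmap pages, offline the remaining range, and either account them back on
failure or tear down the vmemmap space on success. Roughly (a sketch, not the
verbatim hunk):

static int memory_block_offline(struct memory_block *mem)
{
	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
	int ret;

	/* Unaccount the vmemmap pages before offlining the rest of the block. */
	if (nr_vmemmap_pages)
		vmemmap_adjust_pages(start_pfn, -nr_vmemmap_pages);

	ret = offline_pages(start_pfn + nr_vmemmap_pages,
			    nr_pages - nr_vmemmap_pages);
	if (ret) {
		/* offline_pages() failed. Account back. */
		if (nr_vmemmap_pages)
			vmemmap_adjust_pages(start_pfn, nr_vmemmap_pages);
		return ret;
	}

	if (nr_vmemmap_pages)
		vmemmap_deinit_space(start_pfn, nr_vmemmap_pages);

	return 0;
}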

[PATCH v7 2/8] mm,memory_hotplug: Relax fully spanned sections check

2021-04-08 Thread Oscar Salvador
When using self-hosted vmemmap pages, the number of pages passed to
{online,offline}_pages might not fully span sections, but they always
fully span pageblocks.
Relax the check to account for that case.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 mm/memory_hotplug.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cdbbfbc5757..25e59d5dc13c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -838,9 +838,14 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
int ret;
struct memory_notify arg;
 
-   /* We can only online full sections (e.g., SECTION_IS_ONLINE) */
+   /* We can only online full sections (e.g., SECTION_IS_ONLINE).
+* However, when using e.g: memmap_on_memory, some pages are initialized
+* prior to calling in here. The remaining amount of pages must be
+* pageblock aligned.
+*/
if (WARN_ON_ONCE(!nr_pages ||
-!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION)))
+!IS_ALIGNED(pfn, pageblock_nr_pages) ||
+!IS_ALIGNED(pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
 
mem_hotplug_begin();
@@ -1573,9 +1578,14 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
int ret, node;
char *reason;
 
-   /* We can only offline full sections (e.g., SECTION_IS_ONLINE) */
+   /* We can only offline full sections (e.g., SECTION_IS_ONLINE).
+* However, when using e.g: memmap_on_memory, some pages are initialized
+* prior to calling in here. The remaining amount of pages must be
+* pageblock aligned.
+*/
if (WARN_ON_ONCE(!nr_pages ||
-!IS_ALIGNED(start_pfn | nr_pages, PAGES_PER_SECTION)))
+!IS_ALIGNED(start_pfn, pageblock_nr_pages) ||
+!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION)))
return -EINVAL;
 
mem_hotplug_begin();
-- 
2.16.3



[PATCH v7 3/8] mm,memory_hotplug: Factor out adjusting present pages into adjust_present_page_count()

2021-04-08 Thread Oscar Salvador
From: David Hildenbrand 

Let's have a single place (inspired by adjust_managed_page_count()) where
we adjust present pages.
In contrast to adjust_managed_page_count(), only memory onlining/offlining
is allowed to modify the number of present pages.

Signed-off-by: David Hildenbrand 
Signed-off-by: Oscar Salvador 
Reviewed-by: Oscar Salvador 
---
 mm/memory_hotplug.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 25e59d5dc13c..d05056b3c173 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -829,6 +829,16 @@ struct zone * zone_for_pfn_range(int online_type, int nid, 
unsigned start_pfn,
return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
+static void adjust_present_page_count(struct zone *zone, long nr_pages)
+{
+   unsigned long flags;
+
+   zone->present_pages += nr_pages;
+   pgdat_resize_lock(zone->zone_pgdat, );
+   zone->zone_pgdat->node_present_pages += nr_pages;
+   pgdat_resize_unlock(zone->zone_pgdat, );
+}
+
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
   int online_type, int nid)
 {
@@ -882,11 +892,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
}
 
online_pages_range(pfn, nr_pages);
-   zone->present_pages += nr_pages;
-
-   pgdat_resize_lock(zone->zone_pgdat, );
-   zone->zone_pgdat->node_present_pages += nr_pages;
-   pgdat_resize_unlock(zone->zone_pgdat, );
+   adjust_present_page_count(zone, nr_pages);
 
node_states_set_node(nid, );
if (need_zonelists_rebuild)
@@ -1701,11 +1707,7 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages)
 
/* removal success */
adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-   zone->present_pages -= nr_pages;
-
-   pgdat_resize_lock(zone->zone_pgdat, );
-   zone->zone_pgdat->node_present_pages -= nr_pages;
-   pgdat_resize_unlock(zone->zone_pgdat, );
+   adjust_present_page_count(zone, -nr_pages);
 
init_per_zone_wmark_min();
 
-- 
2.16.3



[PATCH v7 1/8] drivers/base/memory: Introduce memory_block_{online,offline}

2021-04-08 Thread Oscar Salvador
This is a preparatory patch that introduces two new functions:
memory_block_online() and memory_block_offline().

For now, these functions will only call online_pages() and offline_pages()
respectively, but they will later be in charge of preparing the vmemmap
pages, carrying out the initialization and proper accounting of such
pages.

Since the memory_block struct contains all the information, pass this struct
down the chain to the end functions.

Signed-off-by: Oscar Salvador 
Reviewed-by: David Hildenbrand 
---
 drivers/base/memory.c | 33 +
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index f35298425575..f209925a5d4e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -169,30 +169,41 @@ int memory_notify(unsigned long val, void *v)
return blocking_notifier_call_chain(_chain, val, v);
 }
 
+static int memory_block_online(struct memory_block *mem)
+{
+   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+   return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+}
+
+static int memory_block_offline(struct memory_block *mem)
+{
+   unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
+   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+
+   return offline_pages(start_pfn, nr_pages);
+}
+
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(unsigned long start_section_nr, unsigned long action,
-   int online_type, int nid)
+memory_block_action(struct memory_block *mem, unsigned long action)
 {
-   unsigned long start_pfn;
-   unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
int ret;
 
-   start_pfn = section_nr_to_pfn(start_section_nr);
-
switch (action) {
case MEM_ONLINE:
-   ret = online_pages(start_pfn, nr_pages, online_type, nid);
+   ret = memory_block_online(mem);
break;
case MEM_OFFLINE:
-   ret = offline_pages(start_pfn, nr_pages);
+   ret = memory_block_offline(mem);
break;
default:
WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
-"%ld\n", __func__, start_section_nr, action, action);
+"%ld\n", __func__, mem->start_section_nr, action, action);
ret = -EINVAL;
}
 
@@ -210,9 +221,7 @@ static int memory_block_change_state(struct memory_block 
*mem,
if (to_state == MEM_OFFLINE)
mem->state = MEM_GOING_OFFLINE;
 
-   ret = memory_block_action(mem->start_section_nr, to_state,
- mem->online_type, mem->nid);
-
+   ret = memory_block_action(mem, to_state);
mem->state = ret ? from_state_req : to_state;
 
return ret;
-- 
2.16.3



[PATCH v7 0/8] Allocate memmap from hotadded memory (per device)

2021-04-08 Thread Oscar Salvador
Hi,

I decided to send another version with the fixups included as it seemed a bit
awkward otherwise. It should ease the review.
Sorry for the spam.

Changes from v6 -> v7:
 - Fix check from "mm,memory_hotplug: Relax fully spanned sections check"
 - Add fixup from "mm,memory_hotplug: Allocate memmap from the added memory 
range"
 - Add Reviewed-by from David for patch#2
 - Fix changelog in "mm,memory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()"

Changes from v5 -> v6:
 - Create memory_block_{online,offline} functions
 - Create vmemmap_* functions to deal with vmemmap stuff, so 
   {online,offline}_pages remain untouched
 - Add adjust_present_page_count's patch from David
 - Relax check in {offline,online}_pages
 - Rework changelogs

Changes from v4 -> v5:
 - Addressed feedback from David (patch#1)
 - Tested on x86_64 with different struct page sizes and on large/small memory
   blocks
 - Tested on arm64 with 4K, 64K (with and without THP) and with different struct
   page sizes

 NOTE: We might need to make this feature and hugetlb-vmemmap feature [1] 
mutually
   exclusive. I raised an issue I see in [2].
   The hugetlb-vmemmap feature has been withdrawn for the time being due to the
   need for further changes wrt. locking/freeing context.
   I will keep an eye, and when the time comes again I will see how the two
   features play together and how one another can be disabled when needed.

Changes from v3 -> v4:
 - Addressed feedback from David
 - Wrap memmap_on_memory module thingy with #ifdef
   on MHP_MEMMAP_ON_MEMORY
 - Move "depend on MEMORY_HOTPLUG" to MHP_MEMMAP_ON_MEMORY
   in generic mm/Kconfig
 - Collect David's Reviewed-bys

Changes from v2 -> v3:
 - Addressed feedback from David
 - Squash former patch#4 and and patch#5 into patch#1
 - Fix config dependency CONFIG_SPARSE_VMEMMAP vs CONFIG_SPARSE_VMEMMAP_ENABLE
 - Simplify module parameter functions

Changes from v1 -> v2
 - Addressed feedback from David
 - Fence off the feature in case struct page size is not
   multiple of PMD size or pageblock alignment cannot be guaranteed
 - Tested on x86_64 small and large memory_blocks
 - Tested on arm64 4KB and 64KB page sizes (for some reason I cannot boot
   my VM with 16KB page size).

 Arm64 with 4KB page size behaves like x86_64 after [1], which made section
 size smaller.
 With 64KB, the feature gets fenced off due to pageblock alignment.

Changes from RFCv3 -> v1:
 - Addressed feedback from David
 - Re-order patches

Changes from v2 -> v3 (RFC):
 - Re-order patches (Michal)
 - Fold "mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY" in patch#1
 - Add kernel boot option to enable this feature (Michal)

Changes from v1 -> v2 (RFC):
 - Addressed feedback provided by David
 - Add a arch_support_memmap_on_memory to be called
   from mhp_supports_memmap_on_memory, as atm,
   only ARM, powerpc and x86_64 have altmat support.

[1] https://lore.kernel.org/lkml/cover.1611206601.git.sudar...@codeaurora.org

Original cover letter:

 The primary goal of this patchset is to reduce memory overhead of the
 hot-added memory (at least for SPARSEMEM_VMEMMAP memory model).
 The current way we populate the memmap (struct page array) has three main
 drawbacks:

 a) it consumes an additional memory until the hotadded memory itself is
onlined and
 b) memmap might end up on a different numa node which is especially true
for movable_node configuration.
 c) due to fragmentation we might end up populating memmap with base
pages

 One way to mitigate all these issues is to simply allocate memmap array
 (which is the largest memory footprint of the physical memory hotplug)
 from the hot-added memory itself. SPARSEMEM_VMEMMAP memory model allows
 us to map any pfn range so the memory doesn't need to be online to be
 usable for the array. See patch 4 for more details.
 This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.

 [Overall design]:

 Implementation wise we reuse vmem_altmap infrastructure to override
 the default allocator used by vmemap_populate.
 memory_block structure gains a new field called nr_vmemmap_pages,
 which accounts for the number of vmemmap pages used by that memory_block.
 E.g: On x86_64, that is 512 vmemmap pages on small memory blocks and 4096
  on large memory blocks (1GB)

 We also introduce two new functions: memory_block_{online,offline}.
 These functions take care of initializing/uninitializing vmemmap pages
 prior to calling {online,offline}_pages, so the latter functions can
 remain totally untouched.

 More details can be found in the respective changelogs.


David Hildenbrand (1):
  mm,memory_hotplug: Factor out adjusting present pages into
adjust_present_page_count()

Oscar Salvador (7):
  drivers/base/memory: Introduce memory_block_{online,offline}
  mm,memory_hotplug: Relax fully spanned sections check
  mm,memory_h

Re: [PATCH 05/10] mm/migrate: demote pages during reclaim

2021-04-08 Thread Oscar Salvador
On Thu, Apr 01, 2021 at 11:32:25AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen 
> 
> This is mostly derived from a patch from Yang Shi:
> 
>   
> https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang@linux.alibaba.com/
> 
> Add code to the reclaim path (shrink_page_list()) to "demote" data
> to another NUMA node instead of discarding the data.  This always
> avoids the cost of I/O needed to read the page back in and sometimes
> avoids the writeout cost when the page is dirty.
> 
> A second pass through shrink_page_list() will be made if any demotions
> fail.  This essentially falls back to normal reclaim behavior in the
> case that demotions fail.  Previous versions of this patch may have
> simply failed to reclaim pages which were eligible for demotion but
> were unable to be demoted in practice.
> 
> Note: This just adds the start of infrastructure for migration. It is
> actually disabled next to the FIXME in migrate_demote_page_ok().
> 
> Signed-off-by: Dave Hansen 
> Cc: Wei Xu 
> Cc: Yang Shi 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: osalvador 

Reviewed-by: Oscar Salvador 

-- 
Oscar Salvador
SUSE L3


Re: [PATCH 04/10] mm/migrate: make migrate_pages() return nr_succeeded

2021-04-08 Thread Oscar Salvador
On Thu, Apr 01, 2021 at 11:32:23AM -0700, Dave Hansen wrote:
> 
> From: Yang Shi 
> 
> The migrate_pages() returns the number of pages that were not migrated,
> or an error code.  When returning an error code, there is no way to know
> how many pages were migrated or not migrated.
> 
> In the following patch, migrate_pages() is used to demote pages to PMEM
> node; we need to account for how many pages are reclaimed (demoted) since page
> reclaim behavior depends on this.  Add *nr_succeeded parameter to make
> migrate_pages() return how many pages are demoted successfully for all
> cases.
> 
> Signed-off-by: Yang Shi 
> Signed-off-by: Dave Hansen 
> Reviewed-by: Yang Shi 
> Cc: Wei Xu 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> 

...
>  int migrate_pages(struct list_head *from, new_page_t get_new_page,
>   free_page_t put_new_page, unsigned long private,
> - enum migrate_mode mode, int reason)
> + enum migrate_mode mode, int reason, unsigned int *nr_succeeded)
>  {
>   int retry = 1;
>   int thp_retry = 1;
>   int nr_failed = 0;
> - int nr_succeeded = 0;
>   int nr_thp_succeeded = 0;
>   int nr_thp_failed = 0;
>   int nr_thp_split = 0;
> @@ -1611,10 +1611,10 @@ retry:
>   case MIGRATEPAGE_SUCCESS:
>   if (is_thp) {
>   nr_thp_succeeded++;
> - nr_succeeded += nr_subpages;
> + *nr_succeeded += nr_subpages;
>   break;
>   }
> - nr_succeeded++;
> + (*nr_succeeded)++;
>   break;
>   default:
>   /*
> @@ -1643,12 +1643,12 @@ out:
>*/
>   list_splice(_pages, from);
>  
> - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> + count_vm_events(PGMIGRATE_SUCCESS, *nr_succeeded);
>   count_vm_events(PGMIGRATE_FAIL, nr_failed);
>   count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
>   count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
>   count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
> - trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
> + trace_mm_migrate_pages(*nr_succeeded, nr_failed, nr_thp_succeeded,
>  nr_thp_failed, nr_thp_split, mode, reason);

It seems that reclaim is the only user that cares about how many pages
we managed to migrate, so could we not do the following instead:

diff --git a/mm/migrate.c b/mm/migrate.c
index 695a594e5860..d4170b7ea2fe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1503,7 +1503,7 @@ static inline int try_split_thp(struct page *page, struct 
page **page2,
  */
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
-   enum migrate_mode mode, int reason)
+   enum migrate_mode mode, int reason, unsigned int *ret_succeeded)
 {
int retry = 1;
int thp_retry = 1;
@@ -1654,6 +1654,9 @@ int migrate_pages(struct list_head *from, new_page_t 
get_new_page,
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;

+   if (ret_succeeded)
+   *ret_succeeded = nr_succeeded;
+
return rc;
 }

 And pass only a valid pointer from demote_page_list() and NULL from all
 the others?
 I was just wondering after all those "unsigned int nr_succeeded" in all
 other functions.
 This would also solve the "be careful to initialize nr_succeeded"
 problem?
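
 i.e. something along these lines (a rough sketch reusing the names from this
 series, not meant to be the exact hunks):

/* Reclaim's demotion path is the only caller that cares about the count: */
static unsigned int demote_page_list(struct list_head *demote_pages,
				     struct pglist_data *pgdat)
{
	int target_nid = next_demotion_node(pgdat->node_id);
	unsigned int nr_succeeded = 0;

	if (list_empty(demote_pages))
		return 0;

	migrate_pages(demote_pages, alloc_demote_page, NULL, target_nid,
		      MIGRATE_ASYNC, MR_DEMOTION, &nr_succeeded);

	return nr_succeeded;
}

 while every other caller just passes NULL for the last argument and does not
 need the extra local variable.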


-- 
Oscar Salvador
SUSE L3


Re: [PATCH 03/10] mm/migrate: update node demotion order during on hotplug events

2021-04-08 Thread Oscar Salvador
On Thu, Apr 01, 2021 at 11:32:21AM -0700, Dave Hansen wrote:
> 
> From: Dave Hansen 
> 
> Reclaim-based migration is attempting to optimize data placement in
> memory based on the system topology.  If the system changes, so must
> the migration ordering.
> 
> The implementation is conceptually simple and entirely unoptimized.
> On any memory or CPU hotplug events, assume that a node was added or
> removed and recalculate all migration targets.  This ensures that the
> node_demotion[] array is always ready to be used in case the new
> reclaim mode is enabled.
> 
> This recalculation is far from optimal, most glaringly that it does
> not even attempt to figure out the hotplug event would have some
> *actual* effect on the demotion order.  But, given the expected
> paucity of hotplug events, this should be fine.
> 
> === What does RCU provide? ===
> 
> Imagine a simple loop which walks down the demotion path looking
> for the last node:
> 
> terminal_node = start_node;
> while (node_demotion[terminal_node] != NUMA_NO_NODE) {
> terminal_node = node_demotion[terminal_node];
> }
> 
> The initial values are:
> 
> node_demotion[0] = 1;
> node_demotion[1] = NUMA_NO_NODE;
> 
> and are updated to:
> 
> node_demotion[0] = NUMA_NO_NODE;
> node_demotion[1] = 0;
> 
> What guarantees that the loop did not observe:
> 
> node_demotion[0] = 1;
> node_demotion[1] = 0;
> 
> and would loop forever?
> 
> With RCU, a rcu_read_lock/unlock() can be placed around the
> loop.  Since the write side does a synchronize_rcu(), the loop
> that observed the old contents is known to be complete after the
> synchronize_rcu() has completed.
> 
> RCU, combined with disable_all_migrate_targets(), ensures that
> the old migration state is not visible by the time
> __set_migration_target_nodes() is called.
> 
> === What does READ_ONCE() provide? ===
> 
> READ_ONCE() forbids the compiler from merging or reordering
> successive reads of node_demotion[].  This ensures that any
> updates are *eventually* observed.
> 
> Consider the above loop again.  The compiler could theoretically
> read the entirety of node_demotion[] into local storage
> (registers) and never go back to memory, and *permanently*
> observe bad values for node_demotion[].
> 
> Note: RCU does not provide any universal compiler-ordering
> guarantees:
> 
>   https://lore.kernel.org/lkml/20150921204327.gh4...@linux.vnet.ibm.com/
> 
> Signed-off-by: Dave Hansen 
> Reviewed-by: Yang Shi 
> Cc: Wei Xu 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
> 

...
  
> +#if defined(CONFIG_MEMORY_HOTPLUG)

I am not really into PMEM, and I do not know whether we need
CONFIG_MEMORY_HOTPLUG in order to have such memory on the system.
If so, the following can be partly ignored.

I think that you either want to check CONFIG_MEMORY_HOTPLUG +
CONFIG_CPU_HOTPLUG, or just not put it under any config dependency.

The thing is that migrate_on_reclaim_init() will only be called if
we have CONFIG_MEMORY_HOTPLUG, and when we do not have that (but we do have
CONFIG_CPU_HOTPLUG) the calls to set_migration_target_nodes() won't be
made when the system brings up the CPUs during the boot phase,
which means the node_demotion[] list won't be initialized.

But this brings me to the next point.

From a conceptual point of view, I think you want to build the
node_demotion[] list regardless of whether we support CPU or MEMORY
hotplug.

Now, in case we support CPU or MEMORY hotplug, we do want to be able to re-build
the list, e.g. in case NUMA nodes become cpu-less or memory-less.

On x86_64, CPU_HOTPLUG is enabled by default if SMP, the same for
MEMORY_HOTPLUG, but I am not sure about other archs.
Can we have !CPU_HOTPLUG && MEMORY_HOTPLUG, or !MEMORY_HOTPLUG &&
CPU_HOTPLUG? I do not really know, but I think you should be careful
about that.

If this were my call, I would:

- Not place the burden of initializing the node_demotion[] list in the CPU
  hotplug boot phase (or, if so, be careful, because if I disable
  MEMORY_HOTPLUG, I end up with no node_demotion[] list)
- Differentiate between migration_{online,offline}_cpu and
  migrate_on_reclaim_callback() and place them under their respective
  config dependencies (see the sketch below).

But I might be missing some details so I might be off somewhere.
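
Concretely, I am thinking of something like this (a rough sketch; the bodies
are simplified and the registration calls are left out):

#ifdef CONFIG_HOTPLUG_CPU
/* The CPU callbacks only need CPU hotplug... */
static int migration_online_cpu(unsigned int cpu)
{
	set_migration_target_nodes();
	return 0;
}

static int migration_offline_cpu(unsigned int cpu)
{
	set_migration_target_nodes();
	return 0;
}
#endif /* CONFIG_HOTPLUG_CPU */

#ifdef CONFIG_MEMORY_HOTPLUG
/* ...while the memory notifier only needs memory hotplug. */
static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
						 unsigned long action, void *arg)
{
	set_migration_target_nodes();
	return notifier_from_errno(0);
}
#endif /* CONFIG_MEMORY_HOTPLUG */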

Another thing that caught my eye is that we are calling
set_migration_target_nodes() for every CPU the system brings up during the
boot phase. I know of systems with *lots* of CPUs.
I am not sure whether we have a mechanism to delay that until all CPUs
that are meant to be online are online (i.e. after boot)?
That is probably happening in wonderland, but I was just thinking out loud.

(Of course the same happens with memory_hotplug ACPI operations.
All it takes is some qemu handling.)

-- 
Oscar Salvador
SUSE L3

