Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
Oscar Salvador  writes:

> On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +   struct vmem_altmap *altmap)
>> +{
>> +if (altmap)
>> +vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +else
>> +free_pages((unsigned long)page_address(page), get_order(size));
>
> David already pointed this out, but can check
> arch/x86/mm/init_64.c:free_pagetable().
>
> You will see that we have to do some magic for bootmem memory (DIMMs
> which were not hotplugged but already present)

Thank you!

>> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
>> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct 
>> vmem_altmap *altmap)
>> +{
>> +remove_pgd_mapping(start, end, true, altmap);
>> +}
>> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>
> I will comment on the patch where you add support for hotplug and the
> dependency, but on a track in LSFMM today, we decided that most likely
> we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
> environments.
> So, since you are adding this plain fresh, please consider to tight the
> hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
> As a bonus, you will only have to maintain one flavour of functions.

Ah, yeah, I saw it mentioned on the LSF/MM/BPF topics. Less is
definitely more -- I'll make the next version depend on
SPARSEMEM_VMEMMAP.


Björn



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +struct vmem_altmap *altmap)
> +{
> + if (altmap)
> + vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> + else
> + free_pages((unsigned long)page_address(page), get_order(size));

David already pointed this out, but can check
arch/x86/mm/init_64.c:free_pagetable().

You will see that we have to do some magic for bootmem memory (DIMMs
which were not hotplugged but already present)

> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct 
> vmem_altmap *altmap)
> +{
> + remove_pgd_mapping(start, end, true, altmap);
> +}
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +#endif /* CONFIG_MEMORY_HOTPLUG */

I will comment on the patch where you add support for hotplug and the
dependency, but on a track in LSFMM today, we decided that most likely
we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
environments.
So, since you are adding this plain fresh, please consider to tight the
hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
As a bonus, you will only have to maintain one flavour of functions.


-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
Alexandre Ghiti  writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:

>> +int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params 
>> *params)
>> +{
>> +   int ret;
>> +
>> +   create_linear_mapping_range(start, start + size, 0, >pgprot);
>> +   flush_tlb_all();
>> +   ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, 
>> params);
>> +   if (ret) {
>> +   remove_linear_mapping(start, size);
>> +   return ret;
>> +   }
>> +
>
> You need to flush the TLB here too since __add_pages() populates the
> page table with the new vmemmap mapping (only because riscv allows to
> cache invalid entries, I'll adapt this in my next version of Svvptc
> support).
>
>> +   max_pfn = PFN_UP(start + size);
>> +   max_low_pfn = max_pfn;
>> +   return 0;
>> +}
>> +
>> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap 
>> *altmap)
>> +{
>> +   __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
>> +   remove_linear_mapping(start, size);
>
> You need to flush the TLB here too.

I'll address all of the above in the next version. Thanks for reviewing
the series!


Björn



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> For an architecture to support memory hotplugging, a couple of
> callbacks needs to be implemented:
>
>  arch_add_memory()
>   This callback is responsible for adding the physical memory into the
>   direct map, and call into the memory hotplugging generic code via
>   __add_pages() that adds the corresponding struct page entries, and
>   updates the vmemmap mapping.
>
>  arch_remove_memory()
>   This is the inverse of the callback above.
>
>  vmemmap_free()
>   This function tears down the vmemmap mappings (if
>   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>   backing vmemmap pages. Note that for persistent memory, an
>   alternative allocator for the backing pages can be used; The
>   vmem_altmap. This means that when the backing pages are cleared,
>   extra care is needed so that the correct deallocation method is
>   used.
>
>  arch_get_mappable_range()
>   This functions returns the PA range that the direct map can map.
>   Used by the MHP internals for sanity checks.
>
> The page table unmap/teardown functions are heavily based on code from
> the x86 tree. The same remove_pgd_mapping() function is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 242 +++
>  1 file changed, 242 insertions(+)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 6f72b0b2b854..7f0b921a3d3a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
> }
>  }
>  #endif
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +   pte_t *pte;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PTE; i++) {
> +   pte = pte_start + i;
> +   if (!pte_none(*pte))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
> +   pmd_clear(pmd);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +   pmd_t *pmd;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PMD; i++) {
> +   pmd = pmd_start + i;
> +   if (!pmd_none(*pmd))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
> +   pud_clear(pud);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +   pud_t *pud;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PUD; i++) {
> +   pud = pud_start + i;
> +   if (!pud_none(*pud))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
> +   p4d_clear(p4d);
> +}
> +
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +  struct vmem_altmap *altmap)
> +{
> +   if (altmap)
> +   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +   else
> +   free_pages((unsigned long)page_address(page), 
> get_order(size));
> +}
> +
> +static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long 
> addr, unsigned long end,
> +bool is_vmemmap, struct vmem_altmap 
> *altmap)
> +{
> +   unsigned long next;
> +   pte_t *ptep, pte;
> +
> +   for (; addr < end; addr = next) {
> +   next = (addr + PAGE_SIZE) & PAGE_MASK;
> +   if (next > end)
> +   next = end;
> +
> +   ptep = pte_base + pte_index(addr);
> +   pte = READ_ONCE(*ptep);
> +
> +   if (!pte_present(*ptep))
> +   continue;
> +
> +   pte_clear(_mm, addr, ptep);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pte_page(pte), PAGE_SIZE, 
> altmap);
> +   }
> +}
> +
> +static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long 
> addr, unsigned long end,
> +bool is_vmemmap, struct vmem_altmap 
> *altmap)
> +{
> +   unsigned long next;
> +   pte_t *pte_base;
> +   pmd_t *pmdp, pmd;
> +
> +   for (; addr < end; addr = next) {
> +   next = pmd_addr_end(addr, end);
> +   pmdp = pmd_base + pmd_index(addr);
> +   pmd = READ_ONCE(*pmdp);
> +
> +   if (!pmd_present(pmd))
> +   continue;
> +
> +   if (pmd_leaf(pmd)) {
> +   pmd_clear(pmdp);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, 
> altmap);
> +   continue;
> +   }
> +
> +   pte_base = 

Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
David Hildenbrand  writes:

> On 14.05.24 16:04, Björn Töpel wrote:
>> From: Björn Töpel 
>> 
>> For an architecture to support memory hotplugging, a couple of
>> callbacks needs to be implemented:
>> 
>>   arch_add_memory()
>>This callback is responsible for adding the physical memory into the
>>direct map, and call into the memory hotplugging generic code via
>>__add_pages() that adds the corresponding struct page entries, and
>>updates the vmemmap mapping.
>> 
>>   arch_remove_memory()
>>This is the inverse of the callback above.
>> 
>>   vmemmap_free()
>>This function tears down the vmemmap mappings (if
>>CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>>backing vmemmap pages. Note that for persistent memory, an
>>alternative allocator for the backing pages can be used; The
>>vmem_altmap. This means that when the backing pages are cleared,
>>extra care is needed so that the correct deallocation method is
>>used.
>> 
>>   arch_get_mappable_range()
>>This functions returns the PA range that the direct map can map.
>>Used by the MHP internals for sanity checks.
>> 
>> The page table unmap/teardown functions are heavily based on code from
>> the x86 tree. The same remove_pgd_mapping() function is used in both
>> vmemmap_free() and arch_remove_memory(), but in the latter function
>> the backing pages are not removed.
>> 
>> Signed-off-by: Björn Töpel 
>> ---
>>   arch/riscv/mm/init.c | 242 +++
>>   1 file changed, 242 insertions(+)
>> 
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 6f72b0b2b854..7f0b921a3d3a 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
>>  }
>>   }
>>   #endif
>> +
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> +pte_t *pte;
>> +int i;
>> +
>> +for (i = 0; i < PTRS_PER_PTE; i++) {
>> +pte = pte_start + i;
>> +if (!pte_none(*pte))
>> +return;
>> +}
>> +
>> +free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
>> +pmd_clear(pmd);
>> +}
>> +
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> +pmd_t *pmd;
>> +int i;
>> +
>> +for (i = 0; i < PTRS_PER_PMD; i++) {
>> +pmd = pmd_start + i;
>> +if (!pmd_none(*pmd))
>> +return;
>> +}
>> +
>> +free_pages((unsigned long)page_address(pud_page(*pud)), 0);
>> +pud_clear(pud);
>> +}
>> +
>> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
>> +{
>> +pud_t *pud;
>> +int i;
>> +
>> +for (i = 0; i < PTRS_PER_PUD; i++) {
>> +pud = pud_start + i;
>> +if (!pud_none(*pud))
>> +return;
>> +}
>> +
>> +free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
>> +p4d_clear(p4d);
>> +}
>> +
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +   struct vmem_altmap *altmap)
>> +{
>> +if (altmap)
>> +vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +else
>> +free_pages((unsigned long)page_address(page), get_order(size));
>
> If you unplug a DIMM that was added during boot (can happen on x86-64, 
> can it happen on riscv?), free_pages() would not be sufficient. You'd be 
> freeing a PG_reserved page that has to be freed differently.

I'd say if it can happen on x86-64, it probably can on RISC-V. I'll look
into this for the next spin!

Thanks for spending time on the series!


Cheers,
Björn



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

For an architecture to support memory hotplugging, a couple of
callbacks needs to be implemented:

  arch_add_memory()
   This callback is responsible for adding the physical memory into the
   direct map, and call into the memory hotplugging generic code via
   __add_pages() that adds the corresponding struct page entries, and
   updates the vmemmap mapping.

  arch_remove_memory()
   This is the inverse of the callback above.

  vmemmap_free()
   This function tears down the vmemmap mappings (if
   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
   backing vmemmap pages. Note that for persistent memory, an
   alternative allocator for the backing pages can be used; The
   vmem_altmap. This means that when the backing pages are cleared,
   extra care is needed so that the correct deallocation method is
   used.

  arch_get_mappable_range()
   This functions returns the PA range that the direct map can map.
   Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel 
---
  arch/riscv/mm/init.c | 242 +++
  1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
}
  }
  #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (!pte_none(*pte))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+   pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (!pmd_none(*pmd))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+   pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+   p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+  struct vmem_altmap *altmap)
+{
+   if (altmap)
+   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+   else
+   free_pages((unsigned long)page_address(page), get_order(size));


If you unplug a DIMM that was added during boot (can happen on x86-64, 
can it happen on riscv?), free_pages() would not be sufficient. You'd be 
freeing a PG_reserved page that has to be freed differently.


--
Cheers,

David / dhildenb




[PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

For an architecture to support memory hotplugging, a couple of
callbacks needs to be implemented:

 arch_add_memory()
  This callback is responsible for adding the physical memory into the
  direct map, and call into the memory hotplugging generic code via
  __add_pages() that adds the corresponding struct page entries, and
  updates the vmemmap mapping.

 arch_remove_memory()
  This is the inverse of the callback above.

 vmemmap_free()
  This function tears down the vmemmap mappings (if
  CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
  backing vmemmap pages. Note that for persistent memory, an
  alternative allocator for the backing pages can be used; The
  vmem_altmap. This means that when the backing pages are cleared,
  extra care is needed so that the correct deallocation method is
  used.

 arch_get_mappable_range()
  This functions returns the PA range that the direct map can map.
  Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel 
---
 arch/riscv/mm/init.c | 242 +++
 1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
}
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (!pte_none(*pte))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+   pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (!pmd_none(*pmd))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+   pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+   p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+  struct vmem_altmap *altmap)
+{
+   if (altmap)
+   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+   else
+   free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, 
unsigned long end,
+bool is_vmemmap, struct vmem_altmap 
*altmap)
+{
+   unsigned long next;
+   pte_t *ptep, pte;
+
+   for (; addr < end; addr = next) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   if (next > end)
+   next = end;
+
+   ptep = pte_base + pte_index(addr);
+   pte = READ_ONCE(*ptep);
+
+   if (!pte_present(*ptep))
+   continue;
+
+   pte_clear(_mm, addr, ptep);
+   if (is_vmemmap)
+   free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
+   }
+}
+
+static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, 
unsigned long end,
+bool is_vmemmap, struct vmem_altmap 
*altmap)
+{
+   unsigned long next;
+   pte_t *pte_base;
+   pmd_t *pmdp, pmd;
+
+   for (; addr < end; addr = next) {
+   next = pmd_addr_end(addr, end);
+   pmdp = pmd_base + pmd_index(addr);
+   pmd = READ_ONCE(*pmdp);
+
+   if (!pmd_present(pmd))
+   continue;
+
+   if (pmd_leaf(pmd)) {
+   pmd_clear(pmdp);
+   if (is_vmemmap)
+   free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, 
altmap);
+   continue;
+   }
+
+   pte_base = (pte_t *)pmd_page_vaddr(*pmdp);
+   remove_pte_mapping(pte_base, addr, next, is_vmemmap, altmap);
+   free_pte_table(pte_base, pmdp);
+   }
+}
+
+static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, 
unsigned long end,
+bool is_vmemmap, struct vmem_altmap 
*altmap)