Re: [linus:master] [mm] d99e3140a4: BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report
On 28.05.24 09:11, kernel test robot wrote:

Hello,

kernel test robot noticed "BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report" on:

commit: d99e3140a4d33e26066183ff727d8f02f56bec64 ("mm: turn folio_test_hugetlb into a PageType")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master c760b3725e52403dc1b28644fb09c47a83cacea6]
[test failed on linux-next/master 3689b0ef08b70e4e03b82ebd37730a03a672853a]

in testcase: trinity
version: trinity-i386-abe9de86-1_20230429
with following parameters:
runtime: 300s
group: group-04
nr_groups: 5

compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)

We noticed this issue does not always happen. We also noticed different random KCSAN issues for both this commit and its parent, but the four reports below only happen on this commit, at a non-trivial rate, and stay clean on the parent.

Likely that's just a page_type check racing against concurrent mapcount changes. In __folio_rmap_sanity_checks() we check

	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);

to make sure we don't get hugetlb folios in the wrong rmap code path. That can easily race with concurrent mapcount changes, just like any other page_type check that ends up in folio_test_type()/page_has_type(), e.g., from PFN walkers.

Load tearing in these functions shouldn't really result in false positives (what we care about), but READ_ONCE() shouldn't hurt or make a difference.

From b03dc9bf27571442d886d8da624a4e4f737433f2 Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 28 May 2024 09:37:20 +0200
Subject: [PATCH] mm: read page_type using READ_ONCE

KCSAN complains about possible data races: while we check for a page_type -- for example, for sanity checks -- we might concurrently modify the mapcount that overlays page_type.

Let's use READ_ONCE to avoid load tearing (shouldn't make a difference) and to make KCSAN happy.
Note: nothing should really be broken besides wrong KCSAN complaints.

Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-...@intel.com
Signed-off-by: David Hildenbrand
---
 include/linux/page-flags.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 104078afe0b1..e46ccbb9aa58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -955,9 +955,9 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 #define PG_slab		0x1000

 #define PageType(page, flag)						\
-	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+	((READ_ONCE(page->page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 #define folio_test_type(folio, flag)					\
-	((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+	((READ_ONCE(folio->page.page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)

 static inline int page_type_has_type(unsigned int page_type)
 {
@@ -966,7 +966,7 @@ static inline int page_type_has_type(unsigned int page_type)

 static inline int page_has_type(const struct page *page)
 {
-	return page_type_has_type(page->page_type);
+	return page_type_has_type(READ_ONCE(page->page_type));
 }

 #define FOLIO_TYPE_OPS(lname, fname)					\
--
2.45.1

--
Thanks,

David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 16.05.24 19:44, Guillaume Morin wrote: On 02 May 5:59, Guillaume Morin wrote: On 30 Apr 21:25, David Hildenbrand wrote: I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT I'll have to have a closer look at some details (the hugepd writability check looks a bit odd), but it's mostly what I would have expected! Ok in the meantime, here is the uprobe change on your current uprobes_cow trying to address the comments you made in your previous message. Some of them were not 100% clear to me, so it's a best effort patch :-) Again lightly tested David, have you had a chance to take a look at both patches? Not in detail, last weeks were busy (currently traveling back home from LSF/MM). I'll try to find time within the next two weeks to polish my changes and send them out. It would be great if you could send your stuff based on top of that then. (the merge window just opened on Saturday, so we have plenty of time to make it to the next one :) ) -- Cheers, David / dhildenb
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On 14.05.24 20:17, Björn Töpel wrote: Alexandre Ghiti writes: On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: From: Björn Töpel Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for RISC-V. Signed-off-by: Björn Töpel --- arch/riscv/Kconfig | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 6bec1bce6586..b9398b64bb69 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -16,6 +16,8 @@ config RISCV select ACPI_REDUCED_HARDWARE_ONLY if ACPI select ARCH_DMA_DEFAULT_COHERENT select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU I think this should be SPARSEMEM_VMEMMAP here. Hmm, care to elaborate? I thought that was optional. There was a discussion at LSF/MM today to maybe require SPARSEMEM_VMEMMAP for hotplug. Would that work here as well? -- Cheers, David / dhildenb
Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel During memory hot remove, the ptdump functionality can end up touching stale data. Avoid any potential crashes (or worse), by holding the memory hotplug read-lock while traversing the page table. This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm: Hold memory hotplug lock while walking for kernel page table dump"). Signed-off-by: Björn Töpel --- arch/riscv/mm/ptdump.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c index 1289cc6d3700..9d5f657a251b 100644 --- a/arch/riscv/mm/ptdump.c +++ b/arch/riscv/mm/ptdump.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -370,7 +371,9 @@ bool ptdump_check_wx(void) static int ptdump_show(struct seq_file *m, void *v) { + get_online_mems(); ptdump_walk(m, m->private); + put_online_mems(); return 0; } Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel

For an architecture to support memory hotplugging, a couple of callbacks need to be implemented:

arch_add_memory()
This callback is responsible for adding the physical memory into the direct map, and calls into the generic memory hotplugging code via __add_pages(), which adds the corresponding struct page entries and updates the vmemmap mapping.

arch_remove_memory()
This is the inverse of the callback above.

vmemmap_free()
This function tears down the vmemmap mappings (if CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the backing vmemmap pages. Note that for persistent memory, an alternative allocator for the backing pages can be used: the vmem_altmap. This means that when the backing pages are cleared, extra care is needed so that the correct deallocation method is used.

arch_get_mappable_range()
This function returns the PA range that the direct map can map. Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from the x86 tree. The same remove_pgd_mapping() function is used in both vmemmap_free() and arch_remove_memory(), but in the latter the backing pages are not removed.
Signed-off-by: Björn Töpel
---
 arch/riscv/mm/init.c | 242 +++
 1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
 	}
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+	pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+	pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+	p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+					   struct vmem_altmap *altmap)
+{
+	if (altmap)
+		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+	else
+		free_pages((unsigned long)page_address(page), get_order(size));

If you unplug a DIMM that was added during boot (can happen on x86-64, can it happen on riscv?), free_pages() would not be sufficient. You'd be freeing a PG_reserved page that has to be freed differently.

--
Cheers,

David / dhildenb
Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel Add a parameter to the direct map setup function, so it can be used in arch_add_memory() later. Signed-off-by: Björn Töpel --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel Prepare for memory hotplugging support by changing from __init to __meminit for the page table functions that are used by the upcoming architecture specific callbacks. Changing the __init attribute to __meminit, avoids that the functions are removed after init. The __meminit attribute makes sure the functions are kept in the kernel text post init, but only if memory hotplugging is enabled for the build. Also, make sure that the altmap parameter is properly passed on to vmemmap_populate_hugepages(). Signed-off-by: Björn Töpel --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel Now that RISC-V has memory hotplugging support, virtio-mem can be used on the platform. Signed-off-by: Björn Töpel --- drivers/virtio/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index c17193544268..4e5cebf1b82a 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -122,7 +122,7 @@ config VIRTIO_BALLOON config VIRTIO_MEM tristate "Virtio mem driver" - depends on X86_64 || ARM64 + depends on X86_64 || ARM64 || RISCV depends on VIRTIO depends on MEMORY_HOTPLUG depends on MEMORY_HOTREMOVE Nice! Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v23 2/5] ring-buffer: Introducing ring-buffer mapping functions
+	cpu_buffer->subbuf_ids = subbuf_ids;
+
+	meta->meta_page_size = PAGE_SIZE;
+	meta->meta_struct_len = sizeof(*meta);
+	meta->nr_subbufs = nr_subbufs;
+	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+
+	rb_update_meta_page(cpu_buffer);
+}
+
+static struct ring_buffer_per_cpu *
+rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return ERR_PTR(-EINVAL);
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	mutex_lock(&cpu_buffer->mapping_lock);
+
+	if (!cpu_buffer->mapped) {
+		mutex_unlock(&cpu_buffer->mapping_lock);
+		return ERR_PTR(-ENODEV);
+	}
+
+	return cpu_buffer;
+}
+
+static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	mutex_unlock(&cpu_buffer->mapping_lock);
+}
+
+/*
+ * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
+ * to be set-up or torn-down.
+ */
+static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
+			       bool inc)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	if (inc && cpu_buffer->mapped == UINT_MAX)
+		return -EBUSY;
+
+	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+		return -EINVAL;
+
+	mutex_lock(&cpu_buffer->buffer->mutex);
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+
+	if (inc)
+		cpu_buffer->mapped++;
+	else
+		cpu_buffer->mapped--;
+
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	mutex_unlock(&cpu_buffer->buffer->mutex);
+
+	return 0;
+}
+
+/*
+ *   +--------------+  pgoff == 0
+ *   |  meta page   |
+ *   +--------------+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |              |
+ *   +--------------+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |              |
+ *         ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+			struct vm_area_struct *vma)
+{
+	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+	unsigned int subbuf_pages, subbuf_order;
+	struct page **pages;
+	int p = 0, s = 0;
+	int err;
+
+	/* Refuse MP_PRIVATE or writable mappings */
+	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+	    !(vma->vm_flags & VM_MAYSHARE))
+		return -EPERM;
+
+	/*
+	 * Make sure the mapping cannot become writable later. Also tell the VM
+	 * to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND).

Comment is a bit outdated.

Acked-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 08.05.24 04:34, Steven Rostedt wrote: On Tue, 30 Apr 2024 12:13:51 +0100 Vincent Donnefort wrote: +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + + /* Refuse MP_PRIVATE or writable mappings */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + + /* +* Make sure the mapping cannot become writable later. Also tell the VM +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally, +* prevent migration, GUP and dump (VM_IO). +*/ + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE); Do we really need the VM_IO? When testing this in gdb, I would get: (gdb) p tmap->map->subbuf_size Cannot access memory at address 0x77fc2008 It appears that you can't ptrace IO memory. When I removed that flag, gdb has no problem reading that memory. I think we should drop that flag. Can you send a v23 with that removed, Shuah's update, and also the change below: + + lockdep_assert_held(_buffer->mapping_lock); + + subbuf_order = cpu_buffer->buffer->subbuf_order; + subbuf_pages = 1 << subbuf_order; + + nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */ + nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */ + + vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; + if (!vma_pages || vma_pages > nr_pages) + return -EINVAL; + + nr_pages = vma_pages; + + pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + if (!pgoff) { + pages[p++] = virt_to_page(cpu_buffer->meta_page); + + /* +* TODO: Align sub-buffers on their size, once +* vm_insert_pages() supports the zero-page. 
+	 */
+	} else {
+		/* Skip the meta-page */
+		pgoff--;
+
+		if (pgoff % subbuf_pages) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		int off = 0;
+
+		for (; off < (1 << (subbuf_order)); off++, page++) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}

The above can be made to:

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs))
			break;

I'm not particularly happy about us calling vm_insert_pages with NULL pointers stored in pages.

Should we instead do

	if (WARN_ON_ONCE(s >= nr_subbufs)) {
		err = -EINVAL;
		goto out;
	}

?

--
Cheers,

David / dhildenb
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 09.05.24 13:05, Vincent Donnefort wrote: On Tue, May 07, 2024 at 10:34:02PM -0400, Steven Rostedt wrote: On Tue, 30 Apr 2024 12:13:51 +0100 Vincent Donnefort wrote: +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + + /* Refuse MP_PRIVATE or writable mappings */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + + /* +* Make sure the mapping cannot become writable later. Also tell the VM +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally, +* prevent migration, GUP and dump (VM_IO). +*/ + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE); Do we really need the VM_IO? When testing this in gdb, I would get: (gdb) p tmap->map->subbuf_size Cannot access memory at address 0x77fc2008 It appears that you can't ptrace IO memory. When I removed that flag, gdb has no problem reading that memory. Yeah, VM_IO indeed implies DONTDUMP. VM_IO was part of Linus recommendations. Yes, the VM should recognize that memory to some degree as being special already due to VM_MIXEDMAP and VM_DONTEXPAND. #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) So any of these flag achieve that (e.g., mlock_fixup() checks VM_SPECIAL). KSM similarly skips VM_DONTEXPAND and VM_MIXEDMAP (likely we should be using VM_SPECIAL in vma_ksm_compatible()). Not sure about page migration, likely its fine. Thinking about MADV_DONTNEED, I can spot in madvise_dontneed_free_valid_vma() only that we disallow primarily VM_PFNMAP. ... I assume if user space MADV_DONTNEED's some pages we'll simply get a page fault later on access that will SIGBUS, handling that gracefully (we should double-check!). 
But perhaps, VM_DONTEXPAND and MIXEDMAP (implicitly set by vm_insert_pages) are enough protection? Do we want to dump these pages? VM_DONTDUMP might be reasonable then. I don't see how anything could use GUP there, and as David pointed out on the previous version, it doesn't even prevent the GUP-fast path. Yes, GUP-fast would still have worked under some conditions. -- Cheers, David / dhildenb
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 30.04.24 13:13, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h [...] +/* + * +--+ pgoff == 0 + * | meta page | + * +--+ pgoff == 1 + * | subbuffer 0 | + * | | + * +--+ pgoff == (1 + (1 << subbuf_order)) + * | subbuffer 1 | + * | | + * ... + */ +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + + /* Refuse MP_PRIVATE or writable mappings */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + + /* +* Make sure the mapping cannot become writable later. Also tell the VM +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally, +* prevent migration, GUP and dump (VM_IO). 
+	 */
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);
+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	subbuf_order = cpu_buffer->buffer->subbuf_order;
+	subbuf_pages = 1 << subbuf_order;
+
+	nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+	nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	if (!vma_pages || vma_pages > nr_pages)
+		return -EINVAL;
+
+	nr_pages = vma_pages;
+
+	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	if (!pgoff) {
+		pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+		/*
+		 * TODO: Align sub-buffers on their size, once
+		 * vm_insert_pages() supports the zero-page.
+		 */
+	} else {
+		/* Skip the meta-page */
+		pgoff--;
+
+		if (pgoff % subbuf_pages) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		int off = 0;
+
+		for (; off < (1 << (subbuf_order)); off++, page++) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}
+
+	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

Nit: I did not immediately understand if we could end here with p < nr_pages (IOW, pages[] not completely filled).

One source of confusion is the "s < nr_subbufs" check in the while loop: why is "p < nr_pages" insufficient?

For the MM bits:

Acked-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP
On 30.04.24 13:13, Vincent Donnefort wrote: In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Signed-off-by: Vincent Donnefort Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 26.04.24 21:55, Guillaume Morin wrote: On 26 Apr 9:19, David Hildenbrand wrote: A couple of points: a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want to check PageAnonExclusive. b) If you're not following the can_follow_write_pte/_pmd model, you are doing something wrong :) c) The code was heavily changed in mm/mm-unstable. It was merged with the common code. Likely, in mm/mm-unstable, the existing can_follow_write_pte and can_follow_write_pmd checks will already cover what you want in most cases. We'd need a can_follow_write_pud() to cover follow_huge_pud() and (unfortunately) something to handle follow_hugepd() as well similarly. Copy-pasting what we do in can_follow_write_pte() and adjusting for different PTE types is the right thing to do. Maybe now it's time to factor out the common checks into a separate helper. I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT. I'll have to have a closer look at some details (the hugepd writability check looks a bit odd), but it's mostly what I would have expected! -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 30.04.24 17:22, Guillaume Morin wrote: On 26 Apr 21:55, Guillaume Morin wrote: On 26 Apr 9:19, David Hildenbrand wrote: A couple of points: a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want to check PageAnonExclusive. b) If you're not following the can_follow_write_pte/_pmd model, you are doing something wrong :) c) The code was heavily changed in mm/mm-unstable. It was merged with the common code. Likely, in mm/mm-unstable, the existing can_follow_write_pte and can_follow_write_pmd checks will already cover what you want in most cases. We'd need a can_follow_write_pud() to cover follow_huge_pud() and (unfortunately) something to handle follow_hugepd() as well similarly. Copy-pasting what we do in can_follow_write_pte() and adjusting for different PTE types is the right thing to do. Maybe now it's time to factor out the common checks into a separate helper. I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT. I got it working on top of your uprobes-cow branch with the foll force patch sent Friday. Still pretty lightly tested. Sorry for not replying earlier, was busy with other stuff. I'll try getting that stuff into shape and send it out soonish. I went with using one write uprobe function with some additional branches. I went back and forth between that and making them 2 different functions. All the folio_test_hugetlb() special casing is a bit suboptimal. Likely we want a separate variant, because we should be using hugetlb PTE functions consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), softdirty does not exist, etc.)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
 	{}
 };

+bool hugetlbfs_mapping(struct address_space *mapping) {
+	return mapping->a_ops == &hugetlbfs_aops;

is_vm_hugetlb_page() might be what you are looking for.

[...]

 }

-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len, unsigned long page_mask)
 {
 	void *kaddr = kmap_atomic(page);
-	memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+	memcpy(dst, kaddr + (vaddr & ~page_mask), len);
 	kunmap_atomic(kaddr);
 }

-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len, unsigned long page_mask)
 {
 	void *kaddr = kmap_atomic(page);
-	memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+	memcpy(kaddr + (vaddr & ~page_mask), src, len);
 	kunmap_atomic(kaddr);
 }

These two changes really are rather ugly ... and why are they even required? We get a PAGE_SIZE-based subpage of a hugetlb page. We only kmap that one and copy within that one.

In other words, I don't think the copy_from_page() and copy_to_page() changes are even required when we consistently work on subpages and not suddenly on head pages.

-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, unsigned long page_mask)
 {
 	uprobe_opcode_t old_opcode;
 	bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	 * is a trap variant; uprobes always wins over any other (gdb)
 	 * breakpoint.
*/ - copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE); + copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE, + page_mask); is_swbp = is_swbp_insn(_opcode); if (is_swbp_insn(new_opcode)) { @@ -376,8 +378,8 @@ struct uwo_data { uprobe_opcode_t opcode; }; -static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr, - unsigned long next, struct mm_walk *walk) +static int __write_opcode(pte_t *ptep, unsigned long vaddr, + unsigned long page_mask, struct mm_walk *walk) Unrelated alignment change. { struct uwo_data *data = walk->private;; const bool is_register = !!is_swbp_insn(>opcode); @@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr, /* Unmap + flush the TLB, such that we can write atomically .*/ flush_cache_page(vma, vaddr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, vaddr, ptep); + if (folio_test_hugetlb(folio)) + pte = huge_ptep_clear_flush(vma, vaddr, ptep); + else + pte = ptep_clear_flush(vma, vaddr, ptep); copy_to_page(page, da
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 26.04.24 02:09, Guillaume Morin wrote: On 25 Apr 21:56, David Hildenbrand wrote: On 25.04.24 17:19, Guillaume Morin wrote: On 24 Apr 23:00, David Hildenbrand wrote: One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it. Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW. Yes, my patch seems to be working. The hugetlb code is pretty simple. And it allows ptrace and the proc pid mem file to work on the executable private hugetlb mappings. There is one thing I am unclear about though. hugetlb enforces that huge_pte_write() is true on FOLL_WRITE in both the fault and follow_page_mask paths. I am not sure if we can simply assume in the hugetlb code that if the pte is not writable and this is a write fault then we're in the FOLL_FORCE|FOLL_WRITE case. Or do we want to keep the checks but simply not enforce it for FOLL_FORCE|FOLL_WRITE? The latter is more complicated in the fault path because there is no FAULT_FLAG_FORCE flag. I just pushed something to https://github.com/davidhildenbrand/linux/tree/uprobes_cow Only very lightly tested so far. Expect the worst :) I'll try it out and send you the hugetlb bits. I still detest having the zapping logic there, but to get it all right I don't see a clean way around that. For hugetlb, we'd primarily have to implement the mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE). For FOLL_FORCE, here is my draft. Let me know if this is what you had in mind.
diff --git a/mm/gup.c b/mm/gup.c index 1611e73b1121..ac60e0ae64e8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1056,9 +1056,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; - /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */ - if (is_vm_hugetlb_page(vma)) - return -EFAULT; /* * We used to let the write,force case do COW in a * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3548eae42cf9..73f86eddf888 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5941,7 +5941,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, struct folio *pagecache_folio, spinlock_t *ptl, struct vm_fault *vmf) { - const bool unshare = flags & FAULT_FLAG_UNSHARE; + const bool make_writable = !(flags & FAULT_FLAG_UNSHARE) && + (vma->vm_flags & VM_WRITE); pte_t pte = huge_ptep_get(ptep); struct hstate *h = hstate_vma(vma); struct folio *old_folio; @@ -5959,16 +5960,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, * can trigger this, because hugetlb_fault() will always resolve * uffd-wp bit first. */ - if (!unshare && huge_pte_uffd_wp(pte)) + if (make_writable && huge_pte_uffd_wp(pte)) return 0; - /* -* hugetlb does not support FOLL_FORCE-style write faults that keep the -* PTE mapped R/O such as maybe_mkwrite() would do. -*/ - if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE))) - return VM_FAULT_SIGSEGV; - /* Let's take out MAP_SHARED mappings first. 
 	 */
 	if (vma->vm_flags & VM_MAYSHARE) {
 		set_huge_ptep_writable(vma, haddr, ptep);
@@ -5989,7 +5983,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		folio_move_anon_rmap(old_folio, vma);
 		SetPageAnonExclusive(&old_folio->page);
 	}
-	if (likely(!unshare))
+	if (likely(make_writable))
 		set_huge_ptep_writable(vma, haddr, ptep);

Maybe we want to refactor that similarly into a set_huge_ptep_maybe_writable, and handle the VM_WRITE check internally. Then, here you'd do

	if (unshare)
		set_huge_ptep(vma, haddr, ptep);
	else
		set_huge_ptep_maybe_writable(vma, haddr, ptep);

Something like that.

 	/* Break COW or unshare */
 	huge_ptep_clear_flush(vma, haddr, ptep);
@@ -6883,6 +6878,17 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */

+static bool is_force_follow(struct vm_area_struct* vma, unsigned int flags,
+			    struct p
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 25.04.24 17:19, Guillaume Morin wrote: On 24 Apr 23:00, David Hildenbrand wrote: One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it.

Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW.

Yes, my patch seems to be working. The hugetlb code is pretty simple. And it allows ptrace and the proc pid mem file to work on the executable private hugetlb mappings. There is one thing I am unclear about though. hugetlb enforces that huge_pte_write() is true on FOLL_WRITE in both the fault and follow_page_mask paths. I am not sure if we can simply assume in the hugetlb code that if the pte is not writable and this is a write fault then we're in the FOLL_FORCE|FOLL_WRITE case. Or do we want to keep the checks but simply not enforce them for FOLL_FORCE|FOLL_WRITE? The latter is more complicated in the fault path because there is no FAULT_FLAG_FORCE flag.

I just pushed something to https://github.com/davidhildenbrand/linux/tree/uprobes_cow

Only very lightly tested so far. Expect the worst :) I still detest having the zapping logic there, but to get it all right I don't see a clean way around that.

For hugetlb, we'd primarily have to implement the mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE). Likely vaddr and PAGE_SIZE in uprobe_write_opcode() would have to be expanded to cover the full hugetlb page. -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 25.04.24 17:19, Guillaume Morin wrote: On 24 Apr 23:00, David Hildenbrand wrote: One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it.

Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW.

Yes, my patch seems to be working. The hugetlb code is pretty simple. And it allows ptrace and the proc pid mem file to work on the executable private hugetlb mappings. There is one thing I am unclear about though. hugetlb enforces that huge_pte_write() is true on FOLL_WRITE in both the fault and follow_page_mask paths. I am not sure if we can simply assume in the hugetlb code that if the pte is not writable and this is a write fault then we're in the FOLL_FORCE|FOLL_WRITE case. Or do we want to keep the checks but simply not enforce them for FOLL_FORCE|FOLL_WRITE? The latter is more complicated in the fault path because there is no FAULT_FLAG_FORCE flag.

handle_mm_fault()->sanitize_fault_flags() makes sure that we'll only proceed with a fault either if
* we have VM_WRITE set
* we are in a COW mapping (MAP_PRIVATE with at least VM_MAYWRITE)

Once you see FAULT_FLAG_WRITE and you do have VM_WRITE, you don't care about FOLL_FORCE, it's simply a write fault. Once you see FAULT_FLAG_WRITE and you *don't* have VM_WRITE, you must have VM_MAYWRITE and are essentially in FOLL_FORCE.

In a VMA without VM_WRITE, you must never map a PTE writable. In ordinary COW code, that's done in wp_page_copy(), where we *always* use maybe_mkwrite(), to do exactly what a write fault would do, but without mapping the PTE writable. That's what the whole can_follow_write_pmd()/can_follow_write_pte() is about: writing to PTEs that are not writable.
You'll have to follow the exact same model in hugetlb (can_follow_write_pmd(), hugetlb_maybe_mkwrite(), ...). -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 24.04.24 22:44, Guillaume Morin wrote: On 24 Apr 22:09, David Hildenbrand wrote: Let me try to see if we can get this done cleaner. One ugly part (in general here) is the custom page replacement in the registration part. We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages ourselves (which we likely shouldn't do ...) ... maybe we could use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio populated. (like KSM does nowadays) Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but using FOLL_WRITE would not work on many file systems. So maybe we have to trigger an unsharing fault ourselves. ^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to break COW. It was never clear to me why uprobes was not doing FOLL_WRITE in the first place, I must say. It's quite dated code ... The use of FOLL_FORCE really is ugly here. When registering, we require VM_WRITE but ... when unregistering, we don't ... One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it. Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW. -- Cheers, David / dhildenb
Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 24.04.24 22:31, Vincent Donnefort wrote: Hi David, Thanks for your quick response. On Wed, Apr 24, 2024 at 05:26:39PM +0200, David Hildenbrand wrote: I gave it some more thought, and I think we are still missing something (I wish PFNMAP/MIXEDMAP wouldn't be that hard). + +/* + * +--+ pgoff == 0 + * | meta page | + * +--+ pgoff == 1 + * | subbuffer 0 | + * | | + * +--+ pgoff == (1 + (1 << subbuf_order)) + * | subbuffer 1 | + * | | + * ... + */ +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + I'd add some comments here like /* Refuse any MAP_PRIVATE or writable mappings. */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + /* * Make sure the mapping cannot become writable later. Also, tell the VM * to not touch these pages pages (VM_DONTCOPY | VM_DONTDUMP) and tell * GUP to leave them alone as well (VM_IO). */ + vm_flags_mod(vma, +VM_MIXEDMAP | VM_PFNMAP | +VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO, +VM_MAYWRITE); I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all and, as stated, vm_insert_pages() even complains quite a lot when it would have to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a very good reason. Can't we limit ourselves to VM_IO? But then, I wonder if it really helps much regarding GUP: yes, it blocks ordinary GUP (see check_vma_flags()) but as insert_page_into_pte_locked() does *not* set pte_special(), GUP-fast (gup_fast_pte_range()) will not reject it. Really, if you want GUP-fast to reject it, remap_pfn_range() and friends are the way to go, that will set pte_special() such that also GUP-fast will leave it alone, just like vm_normal_page() would. So ... 
I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it alone won't stop all of GUP. We really have to mark the PTE as special, which vm_insert_page() must not do (because it is refcounted!). Hum, apologies, I am not sure to follow the connection here. Why do you think the recommendation was to prevent GUP? Ah, I'm hallucinating! :) "not let people play games with the mapping" to me implied "make sure nobody touches it". If GUP is acceptable that makes stuff a lot easier. VM_IO will block some GUP, but not all of it. Which means: do we really have to stop GUP from grabbing that page? Using vm_insert_page() only with VM_MIXEDMAP (and without VM_PFNMAP|VM_IO) would be better. Under the assumption we do not want to stop all GUP, why not using VM_IO over VM_MIXEDMAP which is I believe more restrictive? VM_MIXEDMAP will be implicitly set by vm_insert_page(). There is a lengthy comment for vm_normal_page() that explains all this madness. VM_MIXEDMAP is primarily relevant for COW mappings, which you just forbid completely. remap_pfn_range_notrack() documents the semantics of some of the other flags: * VM_IO tells people not to look at these pages * (accesses can have side effects). * VM_PFNMAP tells the core MM that the base pages are just * raw PFN mappings, and do not have a "struct page" associated * with them. * VM_DONTEXPAND * Disable vma merging and expanding with mremap(). * VM_DONTDUMP * Omit vma from core dump, even when VM_IO turned off. VM_PFNMAP is very likely really not what we want, unless we really perform raw PFN mappings ... VM_IO we can set without doing much harm. So I would suggest dropping VM_PFNMAP when using vm_insert_pages(), using only VM_IO and likely just letting vm_insert_pages() set VM_MIXEDMAP for you. [...] vm_insert_pages() documents: "In case of error, we may have mapped a subset of the provided pages. It is the caller's responsibility to account for this case." Which could for example happen, when allocating a page table fails. 
Would we be able to deal with that here? As we are in the mmap path, on an error, I would expect the vma to be destroyed and those pages whose insertion succeeded to be unmapped?

Ah, we simply fail ->mmap(). In mmap_region(), if call_mmap() failed, we "goto unmap_and_free_vma" where we have

/* Undo any partial mapping done by a device driver. */
unmap_region(mm, , vma, prev, next, vma->vm_start, vma->vm_end, vma->vm_end, true);

But perhaps we should proactively zap_page_range_single()?

No, mmap_region() should indeed be handling it correctly already!

-- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 22.04.24 22:53, Guillaume Morin wrote: On 22 Apr 20:59, David Hildenbrand wrote: The benefit - to me - is very clear. People do use hugetlb mappings to run code in production environments. The perf benefits are there for some workloads. Intel has published a whitepaper about it etc. Uprobes are a very good tool to do live tracing. If you can restart the process and reproduce, you should be able to disable hugetlb remapping but if you need to look at a live process, there are not many options. Not being able to use uprobes is crippling. Please add all that as motivation to the patch description or cover letter. Yes, libhugetlbfs exists. But why do we have to support uprobes with it? Nobody cared until now, why care now? I think you could ask the same question for every new feature patch :) I have to, because it usually indicates a lack of motivation in the cover-letter/patch description :P My cover letter was indeed lacking. I will make sure to add this kind of details next time. Since the removal a few releases ago of the __morecore() hook in glibc, the main feature of libhugetlbfs is ELF segments remapping. I think there are definitely a lot of users that simply deal with this unnecessary limitation. I am certainly not shoving this patch through anyone's throat if there is no interest. But we definitely find it a very useful feature ... Let me try to see if we can get this done cleaner. One ugly part (in general here) is the custom page replacement in the registration part. We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages ourselves (which we likely shouldn't do ...) ... maybe we could use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio populated. (like KSM does nowadays) Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but using FOLL_WRITE would not work on many file systems. So maybe we have to trigger an unsharing fault ourselves. 
^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to break COW. That would do the page replacement for us and we "should" be able to lookup an anonymous folio that we can then just modify, like ptrace would.

But then, there is also the unregistration part, with weird conditional page replacement. Zapping the anon page if the content matches the content of the original page is one thing. But why are we replacing an existing anonymous page with a new anonymous page when the content from the original page differs (but matches the one from the just-copied page)? I'll have to further think about that one. It's all a bit nasty.

Sounds good to me. I am willing to help with the code when you have a plan, or with testing, as you see fit. Let me know.

I'm hacking on a redesign that removes the manual COW breaking logic and *might* make it easier to integrate hugetlb. (very likely, but until I have the redesign running I cannot promise anything :) ) I'll let you know once I have something ready so you could integrate the hugetlb portion.

-- Cheers, David / dhildenb
Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions
I gave it some more thought, and I think we are still missing something (I wish PFNMAP/MIXEDMAP wouldn't be that hard). + +/* + * +--+ pgoff == 0 + * | meta page | + * +--+ pgoff == 1 + * | subbuffer 0 | + * | | + * +--+ pgoff == (1 + (1 << subbuf_order)) + * | subbuffer 1 | + * | | + * ... + */ +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + I'd add some comments here like /* Refuse any MAP_PRIVATE or writable mappings. */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + /* * Make sure the mapping cannot become writable later. Also, tell the VM * to not touch these pages pages (VM_DONTCOPY | VM_DONTDUMP) and tell * GUP to leave them alone as well (VM_IO). */ + vm_flags_mod(vma, +VM_MIXEDMAP | VM_PFNMAP | +VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO, +VM_MAYWRITE); I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all and, as stated, vm_insert_pages() even complains quite a lot when it would have to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a very good reason. Can't we limit ourselves to VM_IO? But then, I wonder if it really helps much regarding GUP: yes, it blocks ordinary GUP (see check_vma_flags()) but as insert_page_into_pte_locked() does *not* set pte_special(), GUP-fast (gup_fast_pte_range()) will not reject it. Really, if you want GUP-fast to reject it, remap_pfn_range() and friends are the way to go, that will set pte_special() such that also GUP-fast will leave it alone, just like vm_normal_page() would. So ... I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it alone won't stop all of GUP. We really have to mark the PTE as special, which vm_insert_page() must not do (because it is refcounted!). 
Which means: do we really have to stop GUP from grabbing that page? Using vm_insert_page() only with VM_MIXEDMAP (and without VM_PFNMAP|VM_IO) would be better. If we want to stop all of GUP, remap_pfn_range() currently seems unavoidable :(

+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	subbuf_order = cpu_buffer->buffer->subbuf_order;
+	subbuf_pages = 1 << subbuf_order;
+
+	nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+	nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	if (!vma_pages || vma_pages > nr_pages)
+		return -EINVAL;
+
+	nr_pages = vma_pages;
+
+	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	if (!pgoff) {
+		pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+		/*
+		 * TODO: Align sub-buffers on their size, once
+		 * vm_insert_pages() supports the zero-page.
+		 */
+	} else {
+		/* Skip the meta-page */
+		pgoff--;
+
+		if (pgoff % subbuf_pages) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		int off = 0;
+
+		for (; off < (1 << (subbuf_order)); off++, page++) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}
+
+	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

vm_insert_pages() documents: "In case of error, we may have mapped a subset of the provided pages. It is the caller's responsibility to account for this case." Which could, for example, happen when allocating a page table fails. Would we be able to deal with that here?

Again, I wish it would all be easier ...

-- Cheers, David / dhildenb
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 22.04.24 22:31, Vincent Donnefort wrote: On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote: On 22.04.24 20:20, Vincent Donnefort wrote: Hi David, Thanks for having a look, very much appreciated! On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote: On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. 
CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages
Re: [PATCH v3 1/4] virtio_balloon: separate vm events into a function
On 23.04.24 05:41, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'. Signed-off-by: zhenwei pi --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 22.04.24 20:11, Guillaume Morin wrote: (Dropping Mike Kravetz as CC since he has retired and his email is no longer valid, adding Muchun since he's the current hugetlb maintainer, as well as linux-trace-kernel) On 22 Apr 11:39, David Hildenbrand wrote: On 19.04.24 20:37, Guillaume Morin wrote: libhugetlbfs, the Intel iodlr code both allow to remap .text onto a hugetlb private mapping. It's also pretty easy to do it manually. One drawback of using this functionality is the lack of support for uprobes (NOTE uprobe ignores shareable vmas) This patch adds support for private hugetlb mappings. It does require exposing some hugetlbfs innards and relies on copy_user_large_folio which is only available when CONFIG_HUGETLBFS is used so I had to use an ugly #ifdef If there is some interest in applying this patch in some form or another, I am open to any refactoring suggestions (esp getting rid the #ifdef in uprobes.c) . I tried to limit the amount of branching. All that hugetlb special casing oh my. What's the benefit why we should be interested in making that code less clean -- to phrase it in a nice way ;) ? I do appreciate the nice phrasing. Believe me, I did try to limit the special casing to a minimum :-). Outside of __replace_page, I added only 3-ish branches so I do not think it's *too* bad. The uprobe code is using PAGE_{SHIFT,MASK} quite liberally so I had to add calls to retrieve these for the hugetlb vmas. __replace_page has a lot of special casing. I certainly agree (and unfortunately for me it's at the beginning of the patch :)). It's doing something pretty uncommon outside of the mm code so it has to make a bunch of specific hugetlb calls. I am not quite sure how to improve it but if you have suggestions, I'd be happy to refactor. See below. The benefit - to me - is very clear. People do use hugetlb mappings to run code in production environments. The perf benefits are there for some workloads. Intel has published a whitepaper about it etc. 
Uprobes are a very good tool to do live tracing. If you can restart the process and reproduce, you should be able to disable hugetlb remapping but if you need to look at a live process, there are not many options. Not being able to use uprobes is crippling. Please add all that as motivation to the patch description or cover letter. Yes, libhugetlbfs exists. But why do we have to support uprobes with it? Nobody cared until now, why care now? I think you could ask the same question for every new feature patch :) I have to, because it usually indicates a lack of motivation in the cover-letter/patch description :P People will have to maintain that code, and maintaining hugetlb code in odd places is no fun ... Since the removal a few releases ago of the __morecore() hook in glibc, the main feature of libhugetlbfs is ELF segments remapping. I think there are definitely a lot of users that simply deal with this unnecessary limitation. I am certainly not shoving this patch through anyone's throat if there is no interest. But we definitely find it a very useful feature ... Let me try to see if we can get this done cleaner. One ugly part (in general here) is the custom page replacement in the registration part. We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages ourselves (which we likely shouldn't do ...) ... maybe we could use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio populated. (like KSM does nowadays) Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but using FOLL_WRITE would not work on many file systems. So maybe we have to trigger an unsharing fault ourselves. That would do the page replacement for us and we "should" be able to lookup an anonymous folio that we can then just modify, like ptrace would. But then, there is also unregistration part, with weird conditional page replacement. Zapping the anon page if the content matches the content of the original page is one thing. 
But why are we replacing an existing anonymous page with a new anonymous page when the content from the original page differs (but matches the one from the just-copied page)? I'll have to further think about that one. It's all a bit nasty.

One thing to note is that hugetlb folios don't grow on trees. Likely, many setups *don't* reserve extra hugetlb folios and you might just easily be running out of free hugetlb folios that you can use to break COW here (replace a file hugetlb by a fresh anon hugetlb page). Likely it's easy to make register or unregister fail.

-- Cheers, David / dhildenb
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 22.04.24 20:20, Vincent Donnefort wrote: Hi David, Thanks for having a look, very much appreciated! On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote: On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. 
CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepareNULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct lis
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size 
of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ 
longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(_buffer->irq_
Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function
On 22.04.24 09:42, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any VM events in future, we would have codes like: #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; #endif ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif Separate vm events into a single function, also remove 'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 44 ++--- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1f5b3dd31fcf..59fe157e5722 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -316,34 +316,48 @@ static inline void update_stat(struct virtio_balloon *vb, int idx, #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT) -static unsigned int update_balloon_stats(struct virtio_balloon *vb) +/* Return the number of entries filled by vm events */ +static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb, + unsigned int start) { +#ifdef CONFIG_VM_EVENT_COUNTERS unsigned long events[NR_VM_EVENT_ITEMS]; - struct sysinfo i; - unsigned int idx = 0; - long available; - unsigned long caches; + unsigned int idx = start; all_vm_events(events); - si_meminfo(); - - available = si_mem_available(); - caches = global_node_page_state(NR_FILE_PAGES); - -#ifdef CONFIG_VM_EVENT_COUNTERS update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, - pages_to_bytes(events[PSWPIN])); + pages_to_bytes(events[PSWPIN])); update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT, - pages_to_bytes(events[PSWPOUT])); + pages_to_bytes(events[PSWPOUT])); update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); + #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, 
events[HTLB_BUDDY_PGALLOC]); update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL, events[HTLB_BUDDY_PGALLOC_FAIL]); -#endif -#endif +#endif /* CONFIG_HUGETLB_PAGE */ + + return idx - start; +#else /* CONFIG_VM_EVENT_COUNTERS */ + + return 0; +#endif /* CONFIG_VM_EVENT_COUNTERS */ +} + +static unsigned int update_balloon_stats(struct virtio_balloon *vb) +{ + struct sysinfo i; + unsigned int idx = 0; + long available; + unsigned long caches; + + idx += update_balloon_vm_stats(vb, idx); No need to handle idx that complicated now. Just do unsigned int idx; idx = update_balloon_vm_stats(vb); We can go down that path if we ever want to rearrange the code and not have the vm_stats first. + + si_meminfo(); + available = si_mem_available(); + caches = global_node_page_state(NR_FILE_PAGES); update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram)); update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT, -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function
On 22.04.24 10:04, zhenwei pi wrote: On 4/22/24 15:47, David Hildenbrand wrote: On 22.04.24 09:42, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any VM events in future, we would have codes like: #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; #endif ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif Separate vm events into a single function, also remove Why not simply use __maybe_unused for that variable? 1> static unsigned int update_balloon_stats() { unsigned __maybe_unused long foo; ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif } 2> static inline unsigned int update_balloon_vm_stats() { #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif } From my point of view, I don't need to compile code in my brain when reading case 2. :) But for #1? :) I mean, you didn't compile the code in your brain when you sent out v1 :P But I agree that moving that to a separate function is cleaner, staring at the resulting update_balloon_stats(). Let me comment on some nits as a fresh reply. -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] virtio_balloon: introduce memory allocation stall counter
On 22.04.24 09:42, zhenwei pi wrote: Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 8 include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 87a1d6fa77fb..ab039e83bc6f 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -323,6 +323,8 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb, #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long events[NR_VM_EVENT_ITEMS]; unsigned int idx = start; + unsigned int zid; + unsigned long stall = 0; all_vm_events(events); update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, @@ -333,6 +335,12 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb, update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); + /* sum all the stall events */ + for (zid = 0; zid < MAX_NR_ZONES; zid++) + stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid]; + + update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, events[HTLB_BUDDY_PGALLOC]); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index b17bbe033697..487b893a160e 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -72,7 +72,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ -#define VIRTIO_BALLOON_S_NR 11 +#define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory 
allocatoin */ +#define VIRTIO_BALLOON_S_NR 12 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -85,7 +86,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ - VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function
On 22.04.24 09:42, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any VM events in future, we would have codes like: #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; #endif ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif Separate vm events into a single function, also remove Why not simply use __maybe_unused for that variable? -- Cheers, David / dhildenb
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. 
+ * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct 
list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(_buffer->irq_work.waiters);
Re: [PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info
On 18.04.24 08:26, zhenwei pi wrote: Expose memory scan/reclaim information to the host side via virtio balloon device. Now we have a metric to analyze the memory performance: y: counter increases n: counter does not change h: the rate of counter change is high l: the rate of counter change is low OOM: VIRTIO_BALLOON_S_OOM_KILL STALL: VIRTIO_BALLOON_S_ALLOC_STALL ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT - OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critical memory pressure - OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not global memory pressure. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. The performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows quite effective memory reclaiming. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. The ratio between DRCLM/DSCAN gets low, and the guest OS is thrashing heavily; in serious cases this leads to poor performance and difficult troubleshooting. For example, sshd may block on memory allocation when accepting new connections, and a user can't log in to the VM via ssh. - OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ratio between ARCLM/ASCAN shows that the guest tries to reclaim more memory, but it can't. Once more memory is required in the future, it will struggle to reclaim memory.
Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 9 + include/uapi/linux/virtio_balloon.h | 12 ++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index e88e6573afa5..bc9332c1ae85 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) stall += events[ALLOCSTALL_MOVABLE]; update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, + pages_to_bytes(events[PGSCAN_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, + pages_to_bytes(events[PGSCAN_DIRECT])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, + pages_to_bytes(events[PGSTEAL_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, + pages_to_bytes(events[PGSTEAL_DIRECT])); + #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, events[HTLB_BUDDY_PGALLOC]); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 487b893a160e..ee35a372805d 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -73,7 +73,11 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ #define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory allocatoin */ -#define VIRTIO_BALLOON_S_NR 12 +#define VIRTIO_BALLOON_S_ASYNC_SCAN12 /* Amount of memory scanned asynchronously */ +#define VIRTIO_BALLOON_S_DIRECT_SCAN 13 /* Amount of memory scanned directly */ +#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */ +#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */ +#define VIRTIO_BALLOON_S_NR 16 #define 
VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -87,7 +91,11 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \ - VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \ + VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \ + VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \ + VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \ + VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") Not an expert on these counters/events, but LGTM Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 2/3] virtio_balloon: introduce memory allocation stall counter
On 18.04.24 08:26, zhenwei pi wrote: Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 20 +++- include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index fd19934a847f..e88e6573afa5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) unsigned long events[NR_VM_EVENT_ITEMS]; struct sysinfo i; unsigned int idx = 0; - long available; + long available, stall = 0; unsigned long caches; all_vm_events(events); @@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); + + /* sum all the stall events */ +#ifdef CONFIG_ZONE_DMA + stall += events[ALLOCSTALL_DMA]; +#endif +#ifdef CONFIG_ZONE_DMA32 + stall += events[ALLOCSTALL_DMA32]; +#endif +#ifdef CONFIG_HIGHMEM + stall += events[ALLOCSTALL_HIGH]; +#endif +#ifdef CONFIG_ZONE_DEVICE + stall += events[ALLOCSTALL_DEVICE]; +#endif Naive me would think that ALLOCSTALL_DEVICE is always 0. :) Likely we should just do: for (zid = 0; zid < MAX_NR_ZONES; zid++) stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid]; (see isolate_lru_folios() -> __count_zid_vm_events(), where we rely on the same ordering) Apart from that, LGTM. -- Cheers, David / dhildenb
Re: [PATCH 1/3] virtio_balloon: introduce oom-kill invocations
On 18.04.24 08:26, zhenwei pi wrote: When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat, and report the OOM KILL event. However, the agent may be killed and we will lose this critical event (and the later events). For now we can also grep for magic words in the guest kernel log from the host side. Rather than this unstable way, virtio balloon reports OOM-KILL invocations instead. Signed-off-by: zhenwei pi Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [RFC 3/3] virtio_balloon: introduce memory scan/reclaim info
On 15.04.24 10:41, zhenwei pi wrote: Expose memory scan/reclaim information to the host side via virtio balloon device. Now we have a metric to analyze the memory performance: y: counter increases n: counter does not change h: the rate of counter change is high l: the rate of counter change is low OOM: VIRTIO_BALLOON_S_OOM_KILL STALL: VIRTIO_BALLOON_S_ALLOC_STALL ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT - OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critical memory pressure - OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not global memory pressure. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. The performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows quite effective memory reclaiming. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. The ratio between DRCLM/DSCAN gets low, and the guest OS is thrashing heavily; in serious cases this leads to poor performance and difficult troubleshooting. For example, sshd may block on memory allocation when accepting new connections, and a user can't log in to the VM via ssh. - OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ratio between ARCLM/ASCAN shows that the guest tries to reclaim more memory, but it can't. Once more memory is required in the future, it will struggle to reclaim memory.
Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 9 + include/uapi/linux/virtio_balloon.h | 12 ++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4b9c9569f6e5..7b86514e99d4 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -372,6 +372,15 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) stall += events[ALLOCSTALL_MOVABLE]; update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_ASYNC, + pages_to_bytes(events[PGSCAN_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_DIRECT, + pages_to_bytes(events[PGSCAN_DIRECT])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_ASYNC, + pages_to_bytes(events[PGSTEAL_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_DIRECT, + pages_to_bytes(events[PGSTEAL_DIRECT])); + return idx; } diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 13d0c32ba27c..0875a9cccb01 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -73,7 +73,11 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ #define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory allocatoin */ -#define VIRTIO_BALLOON_S_NR 12 +#define VIRTIO_BALLOON_S_SCAN_ASYNC12 /* Amount of memory scanned asynchronously */ +#define VIRTIO_BALLOON_S_SCAN_DIRECT 13 /* Amount of memory scanned directly */ +#define VIRTIO_BALLOON_S_RECLAIM_ASYNC 14 /* Amount of memory reclaimed asynchronously */ +#define VIRTIO_BALLOON_S_RECLAIM_DIRECT 15 /* Amount of memory reclaimed directly */ Should these be the other way around: ASYN_SCAN ... ASYNC_RECLAIM so we can get ... 
+#define VIRTIO_BALLOON_S_NR 16 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -87,7 +91,11 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \ - VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall", \ + VIRTIO_BALLOON_S_NAMES_prefix "scan-async", \ + VIRTIO_BALLOON_S_NAMES_prefix "scan-direct", \ + VIRTIO_BALLOON_S_NAMES_prefix "reclaim-async", \ + VIRTIO_BALLOON_S_NAMES_prefix "reclaim-direct" \ ... "async-scans", "async-reclaims" ... -- Cheers, David / dhildenb
Re: [RFC 2/3] virtio_balloon: introduce memory allocation stall counter
On 15.04.24 10:41, zhenwei pi wrote: Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 19 ++- include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index fd8daa742734..4b9c9569f6e5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) unsigned long events[NR_VM_EVENT_ITEMS]; struct sysinfo i; unsigned int idx = 0; - long available; + long available, stall = 0; unsigned long caches; all_vm_events(events); @@ -355,6 +355,23 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); + /* sum all the stall event */ +#ifdef CONFIG_ZONE_DMA + stall += events[ALLOCSTALL_DMA]; +#endif +#ifdef CONFIG_ZONE_DMA32 + stall += events[ALLOCSTALL_DMA32]; +#endif +#ifdef CONFIG_HIGHMEM + stall += events[ALLOCSTALL_HIGH]; +#endif +#ifdef CONFIG_ZONE_DEVICE + stall += events[ALLOCSTALL_DEVICE]; +#endif + stall += events[ALLOCSTALL_NORMAL]; + stall += events[ALLOCSTALL_MOVABLE]; + update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + return idx; } diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index cde5547e64a7..13d0c32ba27c 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -72,7 +72,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ -#define VIRTIO_BALLOON_S_NR 11 +#define VIRTIO_BALLOON_S_ALLOC_STALL 
11 /* Stall count of memory allocatoin */ +#define VIRTIO_BALLOON_S_NR 12 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -85,7 +86,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ - VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \ "alloc-stalls" -- Cheers, David / dhildenb
Re: [RFC 1/3] virtio_balloon: introduce oom-kill invocations
On 15.04.24 10:41, zhenwei pi wrote: When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat, and reports the OOM KILL event. However, the agent may be killed and we will loss this critical event(and the later events). For now we can also grep for magic words in guest kernel log from host side. Rather than this unstable way, virtio balloon reports OOM-KILL invocations instead. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 2 ++ include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1f5b3dd31fcf..fd8daa742734 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -352,6 +352,8 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) pages_to_bytes(available)); update_stat(vb, idx++, VIRTIO_BALLOON_S_CACHES, pages_to_bytes(caches)); + update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, + events[OOM_KILL]); return idx; } diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index ddaa45e723c4..cde5547e64a7 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -71,7 +71,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ -#define VIRTIO_BALLOON_S_NR 10 +#define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ +#define VIRTIO_BALLOON_S_NR 11 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -83,7 +84,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix 
"hugetlb-allocations", \ - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \ "oom-kills" -- Cheers, David / dhildenb
Re: [RFC 0/3] Improve memory statistics for virtio balloon
On 15.04.24 10:41, zhenwei pi wrote: Hi, When the guest runs under critical memory pressure, the guest becomes too slow; even sshd turns to D state (uninterruptible) on memory allocation. We can't log in to this VM to do any troubleshooting work. The guest kernel log via virtual TTY (on the host side) only provides a few necessary log lines after OOM. More detailed memory statistics are required, then we can know explicit memory events and estimate the pressure. I'm going to introduce several VM counters for virtio balloon: - oom-kill - alloc-stall - scan-async - scan-direct - reclaim-async - reclaim-direct IIUC, we're only exposing events that are already getting provided via all_vm_events(), correct? In that case, I don't really see a major issue. Some considerations: (1) These new events are fairly Linux specific. PSWPIN and friends are fairly generic, but HGTLB is also already fairly Linux specific. OOM-kills don't really exist on Windows, for example. We'll have to be careful of properly describing what the semantics are. (2) How should we handle it if Linux ever stops supporting a certain event (e.g., major reclaim rework)? I assume, simply return nothing, like we currently would for VIRTIO_BALLOON_S_HTLB_PGALLOC without CONFIG_HUGETLB_PAGE. -- Cheers, David / dhildenb
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On 11.04.24 18:55, Paolo Bonzini wrote: On Mon, Apr 8, 2024 at 3:56 PM Peter Xu wrote: Paolo, I may miss a bunch of details here (as I still remember some change_pte patches previously on the list..), however not sure whether we considered enable it? Asked because I remember Andrea used to have a custom tree maintaining that part: https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968 The patch enables it only for KSM, so it would still require a bunch of cleanups, for example I also would still use set_pte_at() in all the places that are not KSM. This would at least fix the issue with the poor documentation of where to use set_pte_at_notify() vs set_pte_at(). With regard to the implementation, I like the idea of disabling the invalidation on the MMU notifier side, but I would rather have MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of overloading the event field. Maybe it can't be enabled for some reason that I overlooked in the current tree, or we just decided to not to? I have just learnt about the patch, nobody had ever mentioned it even though it's almost 2 years old... It's a lot of code though and no one I assume Andrea used it on his tree where he also has a version of "randprotect" (even included in that commit subject) to mitigate a KSM security issue that was reported by some security researchers [1] a while ago. From what I recall, the industry did not end up caring about that security issue that much. IIUC, with "randprotect" we get a lot more R/O protection even when not de-duplicating a page -- thus the name. Likely, the reporter mentioned in the commit is a researcher that played with Andreas fix for the security issue. But I'm just speculating at this point :) has ever reported an issue for over 10 years, so I think it's easiest to just rip the code out. Yes. Can always be readded in a possibly cleaner fashion (like you note above), when deemed necessary and we are willing to support it. 
[1] https://gruss.cc/files/remote_dedup.pdf -- Cheers, David / dhildenb
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
On 09.04.24 20:31, James Houghton wrote: Ah, I didn't see this in my inbox, sorry David! No worries :) On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand wrote: On 02.04.24 01:29, James Houghton wrote: diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..daaa9db625d3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,10 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_YOUNG (1 << 0) +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) Especially this one really deserves some documentation :) Yes, will do. Something like MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in bitmap either (1) does not accurately represent the age of the pages (in the case of test_young), or (2) was not able to be used to completely clear the age/access bit (in the case of clear_young). Make sense. I do wonder what the expected reaction from the caller is :) +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2) And that one as well. Something like Indicates that (1) passing a bitmap ({test,clear}_young_bitmap) would have been supported for this address range. The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM is able to harvest the access bit "fast" (so for x86, locklessly, and for arm64, with the KVM MMU read lock), "fast" enough that using a bitmap to do look-around is probably a good idea. Is that really the right way to communicate that ("would have been supported") -- wouldn't we want to sense support differently? Likely best to briefly document all of them, and how they are supposed to be used (return value for X). Right. Will do. + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -106,21 +110,36 @@ struct mmu_notifier_ops { * clear_young is a lightweight version of clear_flush_young. 
Like the * latter, it is supposed to test-and-clear the young/accessed bitflag * in the secondary pte, but it may omit flushing the secondary tlb. + * + * If @bitmap is given but is not supported, return + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. + * + * If the walk is done "quickly" and there were young PTEs, + * MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*clear_young)(struct mmu_notifier *subscription, struct mm_struct *mm, unsigned long start, -unsigned long end); +unsigned long end, +unsigned long *bitmap); /* * test_young is called to check the young/accessed bitflag in * the secondary pte. This is used to know if the page is * frequently used without actually clearing the flag or tearing * down the secondary mapping on the page. + * + * If @bitmap is given but is not supported, return + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. + * + * If the walk is done "quickly" and there were young PTEs, + * MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*test_young)(struct mmu_notifier *subscription, struct mm_struct *mm, - unsigned long address); + unsigned long start, + unsigned long end, + unsigned long *bitmap); What does "quickly" mean (why not use "fast")? What are the semantics, I don't find any existing usage of that in this file. "fast" means fast enough such that using a bitmap to scan adjacent pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in this comment. Perhaps I should just rename it to MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be beneficial" thing -- that's for MGLRU/etc. to decide really. Yes! Further, what is MMU_NOTIFIER_YOUNG you introduce used for? MMU_NOTIFIER_YOUNG is the return value when the page was young, but we (1) didn't use a bitmap, and (2) the "fast" access bit harvesting wasn't possible. In this case we simply return 1, which is MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young() properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that it will be 1. 
Yes, that will clarify it! -- Cheers, David / dhildenb
Re: [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()
On 05.04.24 13:58, Paolo Bonzini wrote: With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini --- A real joy seeing that gone Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 3/4] mmu_notifier: remove the .change_pte() callback
On 05.04.24 13:58, Paolo Bonzini wrote: The scope of set_pte_at_notify() has reduced more and more through the years. Initially, it was meant for when the change to the PTE was not bracketed by mmu_notifier_invalidate_range_{start,end}(). However, that has not been so for over ten years. During all this period the only implementation of .change_pte() was KVM and it had no actual functionality, because it was called after mmu_notifier_invalidate_range_start() zapped the secondary PTE. Now that this (nonfunctional) user of the .change_pte() callback is gone, the whole callback can be removed. For now, leave in place set_pte_at_notify() even though it is just a synonym for set_pte_at(). Signed-off-by: Paolo Bonzini --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
On 02.04.24 01:29, James Houghton wrote: The bitmap is provided for secondary MMUs to use if they support it. For test_young(), after it returns, the bitmap represents the pages that were young in the interval [start, end). For clear_young, it represents the pages that we wish the secondary MMU to clear the accessed/young bit for. If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API should be unchanged except that if young PTEs are found and the architecture supports passing in a bitmap, instead of returning 1, MMU_NOTIFIER_YOUNG_FAST is returned. This allows MGLRU's look-around logic to work faster, resulting in a 4% improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST to indicate to main mm that doing look-around is likely to be beneficial. If the secondary MMU doesn't support the bitmap, it must return an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/ Suggested-by: Yu Zhao Signed-off-by: James Houghton --- include/linux/mmu_notifier.h | 93 +--- include/trace/events/kvm.h | 13 +++-- mm/mmu_notifier.c| 20 +--- virt/kvm/kvm_main.c | 19 ++-- 4 files changed, 123 insertions(+), 22 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..daaa9db625d3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,10 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_YOUNG (1 << 0) +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) Especially this one really deserves some documentation :) +#define MMU_NOTIFIER_YOUNG_FAST(1 << 2) And that one as well. Likely best to briefly document all of them, and how they are supposed to be used (return value for X). 
+ struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -106,21 +110,36 @@ struct mmu_notifier_ops { * clear_young is a lightweight version of clear_flush_young. Like the * latter, it is supposed to test-and-clear the young/accessed bitflag * in the secondary pte, but it may omit flushing the secondary tlb. +* +* If @bitmap is given but is not supported, return +* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. +* +* If the walk is done "quickly" and there were young PTEs, +* MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*clear_young)(struct mmu_notifier *subscription, struct mm_struct *mm, unsigned long start, - unsigned long end); + unsigned long end, + unsigned long *bitmap); /* * test_young is called to check the young/accessed bitflag in * the secondary pte. This is used to know if the page is * frequently used without actually clearing the flag or tearing * down the secondary mapping on the page. +* +* If @bitmap is given but is not supported, return +* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. +* +* If the walk is done "quickly" and there were young PTEs, +* MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*test_young)(struct mmu_notifier *subscription, struct mm_struct *mm, - unsigned long address); + unsigned long start, + unsigned long end, + unsigned long *bitmap); What does "quickly" mean (why not use "fast")? What are the semantics, I don't find any existing usage of that in this file. Further, what is MMU_NOTIFIER_YOUNG you introduce used for? -- Cheers, David / dhildenb
Re: [PATCH 2/2] virtio_balloon: Treat stats requests as wakeup events
On 18.03.24 10:10, David Stevens wrote: From: David Stevens Treat stats requests as wakeup events to ensure that the driver responds to device requests in a timely manner. Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 75 - 1 file changed, 46 insertions(+), 29 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 7fe7ef5f1c77..402dec98e08c 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -121,11 +121,14 @@ struct virtio_balloon { struct page_reporting_dev_info pr_dev_info; /* State for keeping the wakeup_source active while adjusting the balloon */ - spinlock_t adjustment_lock; - bool adjustment_signal_pending; - bool adjustment_in_progress; + spinlock_t wakeup_lock; + bool processing_wakeup_event; + u32 wakeup_signal_mask; }; +#define ADJUSTMENT_WAKEUP_SIGNAL (1 << 0) +#define STATS_WAKEUP_SIGNAL (1 << 1) I'd suggest a different naming like: VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS Apart from that, nothing jumped at me. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 1/2] virtio_balloon: Give the balloon its own wakeup source
On 18.03.24 10:10, David Stevens wrote: From: David Stevens Wakeup sources don't support nesting multiple events, so sharing a single object between multiple drivers can result in one driver overriding the wakeup event processing period specified by another driver. Have the virtio balloon driver use the wakeup source of the device it is bound to rather than the wakeup source of the parent device, to avoid conflicts with the transport layer. Note that although the virtio balloon's virtio_device itself isn't what actually wakes up the device, it is responsible for processing wakeup events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source to prevent suspend when userspace is processing wakeup events, a dedicated wakeup_source is necessary when processing wakeup events in a higher layer in the kernel. Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon") Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1f5b3dd31fcf..7fe7ef5f1c77 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon *vb) vb->adjustment_signal_pending = true; if (!vb->adjustment_in_progress) { vb->adjustment_in_progress = true; - pm_stay_awake(vb->vdev->dev.parent); + pm_stay_awake(&vb->vdev->dev); } spin_unlock_irqrestore(&vb->adjustment_lock, flags); @@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon *vb) spin_lock_irq(&vb->adjustment_lock); if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { vb->adjustment_in_progress = false; - pm_relax(vb->vdev->dev.parent); + pm_relax(&vb->vdev->dev); } spin_unlock_irq(&vb->adjustment_lock); } @@ -1028,6 +1028,7 @@ static int virtballoon_probe(struct virtio_device *vdev) } spin_lock_init(&vb->adjustment_lock); Can we add a comment here why we have 
to do that? + device_set_wakeup_capable(&vb->vdev->dev, true); virtio_device_ready(vdev); Absolutely not an expert on the details, but I assume this is fine. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
[PATCH v1] virtio-mem: support suspend+resume
With virtio-mem, primarily hibernation is problematic: as the machine shuts down, the virtio-mem device loses its state. Powering the machine back up is like losing a bunch of DIMMs. While there would be ways to add limited support, suspend+resume is more commonly used for VMs and "easier" to support cleanly. s2idle can be supported without any device dependencies. Similarly, one would expect suspend-to-ram (i.e., S3) to work out of the box. However, QEMU currently unplugs all device memory when resuming the VM, using a cold reset on the "wakeup" path. In order to support S3, we need a feature flag for the device to tell us if memory remains plugged when waking up. In the future, QEMU will implement this feature. So let's always support s2idle and support S3 with plugged memory only if the device indicates support. Block hibernation early using the PM notifier. Trying to hibernate now fails early: # echo disk > /sys/power/state [ 26.455369] PM: hibernation: hibernation entry [ 26.458271] virtio_mem virtio0: hibernation is not supported. [ 26.462498] PM: hibernation: hibernation exit -bash: echo: write error: Operation not permitted s2idle works even without the new feature bit: # echo s2idle > /sys/power/mem_sleep # echo mem > /sys/power/state [ 52.083725] PM: suspend entry (s2idle) [ 52.095950] Filesystems sync: 0.010 seconds [ 52.101493] Freezing user space processes [ 52.104213] Freezing user space processes completed (elapsed 0.001 seconds) [ 52.106520] OOM killer disabled. 
[ 52.107655] Freezing remaining freezable tasks [ 52.110880] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 52.113296] printk: Suspending console(s) (use no_console_suspend to debug) S3 does not work without the feature bit when memory is plugged: # echo deep > /sys/power/mem_sleep # echo mem > /sys/power/state [ 32.788281] PM: suspend entry (deep) [ 32.816630] Filesystems sync: 0.027 seconds [ 32.820029] Freezing user space processes [ 32.823870] Freezing user space processes completed (elapsed 0.001 seconds) [ 32.827756] OOM killer disabled. [ 32.829608] Freezing remaining freezable tasks [ 32.833842] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 32.837953] printk: Suspending console(s) (use no_console_suspend to debug) [ 32.916172] virtio_mem virtio0: suspend+resume with plugged memory is not supported [ 32.916181] virtio-pci :00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1 [ 32.916197] virtio-pci :00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1 [ 32.916210] virtio-pci :00:02.0: PM: failed to suspend async: error -1 But S3 works with the new feature bit when memory is plugged (patched QEMU): # echo deep > /sys/power/mem_sleep # echo mem > /sys/power/state [ 33.983694] PM: suspend entry (deep) [ 34.009828] Filesystems sync: 0.024 seconds [ 34.013589] Freezing user space processes [ 34.016722] Freezing user space processes completed (elapsed 0.001 seconds) [ 34.019092] OOM killer disabled. [ 34.020291] Freezing remaining freezable tasks [ 34.023549] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 34.026090] printk: Suspending console(s) (use no_console_suspend to debug) Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Xuan Zhuo Signed-off-by: David Hildenbrand --- I had QEMU support ready [1] but reset-related things just changed upstream that will require a bit of a rework -- and it will end up looking cleaner. 
Will come back to upstreaming the QEMU part once I can properly sync the Linux headers to contain VIRTIO_MEM_F_PERSISTENT_SUSPEND. [1] https://github.com/davidhildenbrand/qemu/tree/virtio-mem-suspend --- drivers/virtio/virtio_mem.c | 68 ++--- include/uapi/linux/virtio_mem.h | 2 + 2 files changed, 64 insertions(+), 6 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 8e32232944423..51088d02de32f 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -21,6 +21,7 @@ #include #include #include +#include #include @@ -252,6 +253,9 @@ struct virtio_mem { /* Memory notifier (online/offline events). */ struct notifier_block memory_notifier; + /* Notifier to block hibernation image storing/reloading. */ + struct notifier_block pm_notifier; + #ifdef CONFIG_PROC_VMCORE /* vmcore callback for /proc/vmcore handling in kdump mode */ struct vmcore_cb vmcore_cb; @@ -,6 +1115,25 @@ static int virtio_mem_memory_notifi
[PATCH v1] virtio: reenable config if freezing device failed
Currently, we don't reenable the config if freezing the device failed. For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes. Let's fix this by re-enabling the config if freezing fails. Fixes: 22b7050a024d ("virtio: defer config changed notifications") Cc: Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Xuan Zhuo Signed-off-by: David Hildenbrand --- drivers/virtio/virtio.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index f4080692b351..f513ee21b1c1 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev) if (drv && drv->freeze) { ret = drv->freeze(dev); - if (ret) + if (ret) { + virtio_config_enable(dev); return ret; + } } if (dev->config->destroy_avq) base-commit: c664e16bb1ba1c8cf1d7ecf3df5fd83bbb8ac15a -- 2.43.0
Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES
On 10.01.24 07:22, Zheyun Shen wrote: For now, SEV pins guest's memory to avoid swapping or moving ciphertext, but leading to the inhibition of Memory Ballooning. In Memory Ballooning, only guest's free pages will be relocated in balloon inflation and deflation, so the difference of plaintext doesn't matter to guest. A Linux hypervisor will always give you a fresh, zeroed page. I don't recall what the spec says, could be that that is a guarantee. Memory Ballooning is a nice memory overcommitment technology that can be used in CVM based on SEV and SEV-ES, so userspace tools can provide an option to allow SEV not to pin memory and enable Memory Ballooning. Guest kernel may not inhibit Balloon and should set shared memory for Balloon decrypted. Two points: 1) Memory overcommit means that you promise to have more memory than you actually have. To be able to use that in a *safe* way in the hypervisor, to fulfill that promise, you need some backup strategy, which is usually swap space in the hypervisor. Further one might apply other techniques (ram compression, memory deduplication) in the hypervisor to make that swapping unlikely to ever happen when overcommitting (because nobody wants to swap). Assume you run a lot of VMs that mostly have private/encrypted memory (which is the default). Imagine you previously inflated the balloon on VM0, and that VM needs more memory (you promised it could have more!). You reach out to other VMs to inflate the balloon so you get memory back, but they cannot give up memory safely. In that scenario (a) you cannot swap something out because all pages are pinned (b) memory compression cannot be applied because pages are pinned and (c) memory deduplication cannot be applied because pages are pinned. Pinned memory is a resource that cannot be overcommitted. So I am not convinced the use case you are targeting can be considered any way of sane memory overcommit. You should better call it resizing VM memory instead. 
Then, it's clearer that the hypervisor cannot promise to ever give you that memory when you are in need. 2) What about other features? What if the hypervisor enabled free-page-reporting? Would that work (unlikely, I assume). Don't we have to block that? -- Cheers, David / dhildenb
Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES
For now, SEV pins guest's memory to avoid swapping or moving ciphertext, but leading to the inhibition of Memory Ballooning. In Memory Ballooning, only guest's free pages will be relocated in balloon inflation and deflation, so the difference of plaintext doesn't matter to guest. This seems only true if the page is zeroed, is this true here? Sorry, I cannot figure out why the pages should be zeroed. I think both host kernel and guest kernel assume that the pages are not zeroed and will use kzalloc or manually zero them in real applications, which is same as non-SEV environments. balloon_page_alloc() will not zero the memory (no __GFP_ZERO set). Only in some configurations (zero-on-alloc, zero-on-free), the kernel would do that implicitly. So we'd eventually be leaking secrets to the untrusted hypervisor? I have tested in SEV-ES, reclaiming memory by balloon inflation and reuse them after balloon deflation both works well with the patch. Hypervisor can normally give the reclaimed memory from one CVM to another, or give back to the origin CVM. I'll comment on your misconception of memory overcommit separately. -- Cheers, David / dhildenb
Re: REGRESSION: lockdep warning triggered by 15b9ce7ecd: virtio_balloon: stay awake while adjusting balloon
e able to squash it anymore. From a99a1efa6a2b470a98ea2c87e58bebe90ce329a1 Mon Sep 17 00:00:00 2001 From: David Stevens Date: Tue, 9 Jan 2024 14:41:21 +0900 Subject: [PATCH] virtio_balloon: Fix interrupt context deadlock Use _irq spinlock functions with the adjustment_lock, since start_update_balloon_size needs to acquire it in an interrupt context. Fixes: 5b9ce7ecd715 ("virtio_balloon: stay awake while adjusting balloon") Reported-by: Theodore Ts'o Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index aa6a1a649ad6..1f5b3dd31fcf 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -459,12 +459,12 @@ static void start_update_balloon_size(struct virtio_balloon *vb) static void end_update_balloon_size(struct virtio_balloon *vb) { - spin_lock(&vb->adjustment_lock); + spin_lock_irq(&vb->adjustment_lock); if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { vb->adjustment_in_progress = false; pm_relax(vb->vdev->dev.parent); } - spin_unlock(&vb->adjustment_lock); + spin_unlock_irq(&vb->adjustment_lock); } static void virtballoon_changed(struct virtio_device *vdev) @@ -506,9 +506,9 @@ static void update_balloon_size_func(struct work_struct *work) vb = container_of(work, struct virtio_balloon, update_balloon_size_work); - spin_lock(&vb->adjustment_lock); + spin_lock_irq(&vb->adjustment_lock); vb->adjustment_signal_pending = false; - spin_unlock(&vb->adjustment_lock); + spin_unlock_irq(&vb->adjustment_lock); diff = towards_target(vb); Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2] virtio_balloon: stay awake while adjusting balloon
On 19.12.23 15:37, David Stevens wrote: On Mon, Dec 18, 2023 at 12:33 PM David Hildenbrand wrote: On 18.12.23 16:18, David Stevens wrote: From: David Stevens A virtio_balloon's parent device may be configured so that a configuration change interrupt is a wakeup event. Extend the processing of such a wakeup event until the balloon finishes inflating or deflating by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note that these calls are no-ops if the parent device doesn't support wakeup events or if the wakeup events are not enabled. This change allows the guest to use system power states such as s2idle without running the risk of the virtio_balloon's cooperative memory management becoming unresponsive to the host's requests. Signed-off-by: David Stevens --- v1 -> v2: - Use adjustment_signal_pending flag instead of a sequence number - Call pm_stay_awake/pm_relax on parent device instead of adding a wake event to the virtio balloon device drivers/virtio/virtio_balloon.c | 57 +++-- 1 file changed, 47 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1fe93e93f5bc..a3c11159cbe0 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -119,6 +119,11 @@ struct virtio_balloon { /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; + + /* State for keeping the wakeup_source active while adjusting the balloon */ + spinlock_t adjustment_lock; + bool adjustment_signal_pending; + bool adjustment_in_progress; }; static const struct virtio_device_id id_table[] = { @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb) queue_work(vb->balloon_wq, &vb->report_free_page_work); } +static void start_update_balloon_size(struct virtio_balloon *vb) +{ + unsigned long flags; + + spin_lock_irqsave(&vb->adjustment_lock, flags); + vb->adjustment_signal_pending = true; + if (!vb->adjustment_in_progress) 
{ + vb->adjustment_in_progress = true; + pm_stay_awake(vb->vdev->dev.parent); + } + spin_unlock_irqrestore(&vb->adjustment_lock, flags); + + queue_work(system_freezable_wq, &vb->update_balloon_size_work); +} + +static void end_update_balloon_size(struct virtio_balloon *vb) +{ + spin_lock(&vb->adjustment_lock); + if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { How could vb->adjustment_in_progress ever not be set at this point? + vb->adjustment_in_progress = false; + pm_relax(vb->vdev->dev.parent); + } + spin_unlock(&vb->adjustment_lock); +} + LGTM, although I wonder what happens when calling pm_stay_awake() etc. on a parent device that is not wakeup-event-capable? If the parent device is not wakeup capable or if wakeup isn't enabled, then the vb->vdev->dev.parent->power.wakeup pointer will be NULL, so the NULL checks in __pm_relax/__pm_stay_awake will immediately return. Ah, I missed that NULL check, makes sense. Thanks! -- Cheers, David / dhildenb
Re: [PATCH v2] virtio_balloon: stay awake while adjusting balloon
On 18.12.23 16:18, David Stevens wrote: From: David Stevens A virtio_balloon's parent device may be configured so that a configuration change interrupt is a wakeup event. Extend the processing of such a wakeup event until the balloon finishes inflating or deflating by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note that these calls are no-ops if the parent device doesn't support wakeup events or if the wakeup events are not enabled. This change allows the guest to use system power states such as s2idle without running the risk of the virtio_balloon's cooperative memory management becoming unresponsive to the host's requests. Signed-off-by: David Stevens --- v1 -> v2: - Use adjustment_signal_pending flag instead of a sequence number - Call pm_stay_awake/pm_relax on parent device instead of adding a wake event to the virtio balloon device drivers/virtio/virtio_balloon.c | 57 +++-- 1 file changed, 47 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1fe93e93f5bc..a3c11159cbe0 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -119,6 +119,11 @@ struct virtio_balloon { /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; + + /* State for keeping the wakeup_source active while adjusting the balloon */ + spinlock_t adjustment_lock; + bool adjustment_signal_pending; + bool adjustment_in_progress; }; static const struct virtio_device_id id_table[] = { @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb) queue_work(vb->balloon_wq, &vb->report_free_page_work); } +static void start_update_balloon_size(struct virtio_balloon *vb) +{ + unsigned long flags; + + spin_lock_irqsave(&vb->adjustment_lock, flags); + vb->adjustment_signal_pending = true; + if (!vb->adjustment_in_progress) { + vb->adjustment_in_progress = true; + pm_stay_awake(vb->vdev->dev.parent); + } + 
spin_unlock_irqrestore(&vb->adjustment_lock, flags); + + queue_work(system_freezable_wq, &vb->update_balloon_size_work); +} + +static void end_update_balloon_size(struct virtio_balloon *vb) +{ + spin_lock(&vb->adjustment_lock); + if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { How could vb->adjustment_in_progress ever not be set at this point? + vb->adjustment_in_progress = false; + pm_relax(vb->vdev->dev.parent); + } + spin_unlock(&vb->adjustment_lock); +} + LGTM, although I wonder what happens when calling pm_stay_awake() etc. on a parent device that is not wakeup-event-capable? -- Cheers, David / dhildenb
Re: [PATCH] virtio_balloon: stay awake while adjusting balloon
On 18.12.23 16:16, David Stevens wrote: On Mon, Dec 18, 2023 at 6:37 AM David Hildenbrand wrote: On 14.12.23 05:13, David Stevens wrote: On Wed, Dec 13, 2023 at 5:44 PM David Hildenbrand wrote: On 11.12.23 12:43, David Stevens wrote: From: David Stevens Hi David, Add a wakeup event for when the balloon is inflating or deflating. Userspace can enable this wakeup event to prevent the system from suspending while the balloon is being adjusted. This allows /sys/power/wakeup_count to be used without breaking virtio_balloon's cooperative memory management. Can you add/share some more details I'm working on enabling support for Linux s2Idle in our Android virtual machine, to restrict apps from running in the background without holding an Android partial wakelock. With the patch I recently sent out [1], since crosvm advertises native PCI power management for virtio devices, the Android guest can properly enter s2idle, and it can be woken up by incoming IO. However, one of the remaining problems is that when the host needs to reclaim memory from the guest via the virtio-balloon, there is nothing preventing the guest from entering s2idle before the balloon driver finishes returning memory to the host. Thanks for the information. So you also want to wakeup the VM when wanting to get more memory from the VM? Using which mechanism would that wakeup happen? Likely not the device itself? The wakeup would happen via the parent device's interrupt. I've sent a new version of this patch that uses the parent device's wakeup event instead of adding a new one. One alternative to this approach would be to add a virtballoon_suspend callback to abort suspend if the balloon is inflating/adjusting. However, it seems cleaner to just prevent suspend in the first place. Also, the PM notifier could also be used with very high priority, so the device would respond early to PM_SUSPEND_PREPARE. 
One drawback of blocking suspend via a PM notifier is that the behavior isn't configurable by userspace, whereas wakeup events can be enabled/disabled by userspace. The question is if that behavior for the balloon is really worth it being configured by user space? -- Cheers, David / dhildenb
Re: [PATCH v2] virtio: Add support for no-reset virtio PCI PM
On 15.12.23 02:00, David Stevens wrote: On Thu, Dec 14, 2023 at 6:47 PM David Hildenbrand wrote: On 08.12.23 08:07, David Stevens wrote: If a virtio_pci_device supports native PCI power management and has the No_Soft_Reset bit set, then skip resetting and reinitializing the device when suspending and restoring the device. This allows system-wide low power states like s2idle to be used in systems with stateful virtio devices that can't simply be re-initialized (e.g. virtio-fs). Signed-off-by: David Stevens --- v1 -> v2: - Check the No_Soft_Reset bit drivers/virtio/virtio_pci_common.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c index c2524a7207cf..3a95ecaf12dc 100644 --- a/drivers/virtio/virtio_pci_common.c +++ b/drivers/virtio/virtio_pci_common.c @@ -492,8 +492,40 @@ static int virtio_pci_restore(struct device *dev) return virtio_device_restore(&vp_dev->vdev); } +static bool vp_supports_pm_no_reset(struct device *dev) +{ + struct pci_dev *pci_dev = to_pci_dev(dev); + u16 pmcsr; + + if (!pci_dev->pm_cap) + return false; + + pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (PCI_POSSIBLE_ERROR(pmcsr)) { + dev_err(dev, "Unable to query pmcsr"); + return false; + } + + return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET; +} + +static int virtio_pci_suspend(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev); +} + +static int virtio_pci_resume(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 
0 : virtio_pci_restore(dev); +} + static const struct dev_pm_ops virtio_pci_pm_ops = { - SET_SYSTEM_SLEEP_PM_OPS(virtio_pci_freeze, virtio_pci_restore) + .suspend = virtio_pci_suspend, + .resume = virtio_pci_resume, + .freeze = virtio_pci_freeze, + .thaw = virtio_pci_restore, + .poweroff = virtio_pci_freeze, + .restore = virtio_pci_restore, }; #endif Am I correct with my assumption that this will make s2idle work with virtio-mem-pci as well? Right now, all suspending is disabled, but really only s4/hibernate is problematic. [root@vm-0 ~]# echo "s2idle" > /sys/power/mem_sleep [root@vm-0 ~]# echo "mem" > /sys/power/state [ 101.822991] PM: suspend entry (s2idle) [ 101.828978] Filesystems sync: 0.004 seconds [ 101.831618] Freezing user space processes [ 101.834569] Freezing user space processes completed (elapsed 0.001 seconds) [ 101.836915] OOM killer disabled. [ 101.838072] Freezing remaining freezable tasks [ 101.841054] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 101.843538] printk: Suspending console(s) (use no_console_suspend to debug) [ 101.957676] virtio_mem virtio0: save/restore not supported. [ 101.957683] virtio-pci 0000:00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1 [ 101.957702] virtio-pci 0000:00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1 [ 101.957718] virtio-pci 0000:00:02.0: PM: failed to suspend async: error -1 QEMU's virtio-pci devices don't advertise no_soft_reset, so this patch won't affect vanilla QEMU. But if you add PCI_PM_CTRL_NO_SOFT_RESET to the capability, then it should work. I'm working with crosvm, which doesn't have virtio-mem implemented, but this patch makes virtio-fs work with no extra kernel changes. Thanks for the explanation, makes sense to me. -- Cheers, David / dhildenb
Re: [PATCH] virtio_balloon: stay awake while adjusting balloon
On 14.12.23 05:13, David Stevens wrote: On Wed, Dec 13, 2023 at 5:44 PM David Hildenbrand wrote: On 11.12.23 12:43, David Stevens wrote: From: David Stevens Hi David, Add a wakeup event for when the balloon is inflating or deflating. Userspace can enable this wakeup event to prevent the system from suspending while the balloon is being adjusted. This allows /sys/power/wakeup_count to be used without breaking virtio_balloon's cooperative memory management. Can you add/share some more details I'm working on enabling support for Linux s2Idle in our Android virtual machine, to restrict apps from running in the background without holding an Android partial wakelock. With the patch I recently sent out [1], since crosvm advertises native PCI power management for virtio devices, the Android guest can properly enter s2idle, and it can be woken up by incoming IO. However, one of the remaining problems is that when the host needs to reclaim memory from the guest via the virtio-balloon, there is nothing preventing the guest from entering s2idle before the balloon driver finishes returning memory to the host. Thanks for the information. So you also want to wakeup the VM when wanting to get more memory from the VM? Using which mechanism would that wakeup happen? Likely not the device itself? One alternative to this approach would be to add a virtballoon_suspend callback to abort suspend if the balloon is inflating/adjusting. However, it seems cleaner to just prevent suspend in the first place. Also, the PM notifier could also be used with very high priority, so the device would respond early to PM_SUSPEND_PREPARE. [...] queue_work(system_freezable_wq, work); + else + end_update_balloon_size(vb, seqno); What if we stop the workqueue and unload the driver -- see remove_common() -- won't you leave pm_stay_awake() wrongly set? When a device gets removed, its wakeup source is destroyed, which automatically calls __pm_relax. Ah, thanks. 
} static int init_vqs(struct virtio_balloon *vb) @@ -992,6 +1028,9 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_unregister_oom; } + spin_lock_init(&vb->adjustment_lock); + device_set_wakeup_capable(&vb->vdev->dev, true); I'm a bit confused: Documentation/driver-api/pm/devices.rst documents " The :c:member:`power.can_wakeup` flag just records whether the device (and its driver) can physically support wakeup events. The :c:func:`device_set_wakeup_capable()` routine affects this flag. " ... " Whether or not a device is capable of issuing wakeup events is a hardware matter, and the kernel is responsible for keeping track of it. " But how is the virtio-balloon device capable of waking up the machine? Your patch merely implies that the virtio-balloon device is capable of prohibiting going to sleep. What am I missing? The underlying virtio_pci_device is capable of waking up the machine, if it supports PCI power management. The core PCI code will keep the machine awake while processing the interrupt (i.e. during virtballoon_changed), but after processing is handed off to the virtio-balloon driver, the virtio-balloon driver needs to keep the machine awake until the processing is actually completed. An alternative to making vb->vdev->dev wakeup capable is to plumb the pm_stay_awake/pm_relax calls to the underlying virtio_pci_device. Would that be a preferable approach? The way you describe it, it rather belongs into the PCI code, because that's what actually makes the device PM-capable (i.e., would not apply to virtio-ccw or virtio-mmio). The virtio-balloon itself is not PM-capable. But how hard is it to move that handling into PCI specific code? -- Cheers, David / dhildenb
Re: [PATCH v2] virtio: Add support for no-reset virtio PCI PM
On 08.12.23 08:07, David Stevens wrote: If a virtio_pci_device supports native PCI power management and has the No_Soft_Reset bit set, then skip resetting and reinitializing the device when suspending and restoring the device. This allows system-wide low power states like s2idle to be used in systems with stateful virtio devices that can't simply be re-initialized (e.g. virtio-fs). Signed-off-by: David Stevens --- v1 -> v2: - Check the No_Soft_Reset bit drivers/virtio/virtio_pci_common.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c index c2524a7207cf..3a95ecaf12dc 100644 --- a/drivers/virtio/virtio_pci_common.c +++ b/drivers/virtio/virtio_pci_common.c @@ -492,8 +492,40 @@ static int virtio_pci_restore(struct device *dev) return virtio_device_restore(&vp_dev->vdev); } +static bool vp_supports_pm_no_reset(struct device *dev) +{ + struct pci_dev *pci_dev = to_pci_dev(dev); + u16 pmcsr; + + if (!pci_dev->pm_cap) + return false; + + pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (PCI_POSSIBLE_ERROR(pmcsr)) { + dev_err(dev, "Unable to query pmcsr"); + return false; + } + + return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET; +} + +static int virtio_pci_suspend(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev); +} + +static int virtio_pci_resume(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev); +} + static const struct dev_pm_ops virtio_pci_pm_ops = { - SET_SYSTEM_SLEEP_PM_OPS(virtio_pci_freeze, virtio_pci_restore) + .suspend = virtio_pci_suspend, + .resume = virtio_pci_resume, + .freeze = virtio_pci_freeze, + .thaw = virtio_pci_restore, + .poweroff = virtio_pci_freeze, + .restore = virtio_pci_restore, }; #endif Am I correct with my assumption that this will make s2idle work with virtio-mem-pci as well? Right now, all suspending is disabled, but really only s4/hibernate is problematic.
[root@vm-0 ~]# echo "s2idle" > /sys/power/mem_sleep [root@vm-0 ~]# echo "mem" > /sys/power/state [ 101.822991] PM: suspend entry (s2idle) [ 101.828978] Filesystems sync: 0.004 seconds [ 101.831618] Freezing user space processes [ 101.834569] Freezing user space processes completed (elapsed 0.001 seconds) [ 101.836915] OOM killer disabled. [ 101.838072] Freezing remaining freezable tasks [ 101.841054] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 101.843538] printk: Suspending console(s) (use no_console_suspend to debug) [ 101.957676] virtio_mem virtio0: save/restore not supported. [ 101.957683] virtio-pci 0000:00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1 [ 101.957702] virtio-pci 0000:00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1 [ 101.957718] virtio-pci 0000:00:02.0: PM: failed to suspend async: error -1 -- Cheers, David / dhildenb
Re: [PATCH v5 3/4] mm/memory_hotplug: export mhp_supports_memmap_on_memory()
On 14.12.23 08:37, Vishal Verma wrote: In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Signed-off-by: Vishal Verma --- include/linux/memory_hotplug.h | 6 ++ mm/memory_hotplug.c| 17 ++--- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7d2076583494..ebc9d528f00c 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -121,6 +121,7 @@ struct mhp_params { bool mhp_range_allowed(u64 start, u64 size, bool need_mapping); struct range mhp_get_pluggable_range(bool need_mapping); +bool mhp_supports_memmap_on_memory(void); /* * Zone resizing functions @@ -262,6 +263,11 @@ static inline bool movable_node_is_enabled(void) return false; } +static bool mhp_supports_memmap_on_memory(void) +{ + return false; +} + static inline void pgdat_kswapd_lock(pg_data_t *pgdat) {} static inline void pgdat_kswapd_unlock(pg_data_t *pgdat) {} static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat) {} diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 926e1cfb10e9..751664c519f7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1325,7 +1325,7 @@ static inline bool arch_supports_memmap_on_memory(unsigned long vmemmap_size) } #endif -static bool mhp_supports_memmap_on_memory(unsigned long size) +bool 
mhp_supports_memmap_on_memory(void) { unsigned long vmemmap_size = memory_block_memmap_size(); unsigned long memmap_pages = memory_block_memmap_on_memory_pages(); @@ -1334,17 +1334,11 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) * Besides having arch support and the feature enabled at runtime, we * need a few more assumptions to hold true: * -* a) We span a single memory block: memory onlining/offlining happens -*in memory block granularity. We don't want the vmemmap of online -*memory blocks to reside on offline memory blocks. In the future, -*we might want to support variable-sized memory blocks to make the -*feature more versatile. -* -* b) The vmemmap pages span complete PMDs: We don't want vmemmap code +* a) The vmemmap pages span complete PMDs: We don't want vmemmap code *to populate memory from the altmap for unrelated parts (i.e., *other memory blocks) * -* c) The vmemmap pages (and thereby the pages that will be exposed to +* b) The vmemmap pages (and thereby the pages that will be exposed to *the buddy) have to cover full pageblocks: memory onlining/offlining *code requires applicable ranges to be page-aligned, for example, to *set the migratetypes properly. @@ -1356,7 +1350,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD.
*/ - if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) + if (!mhp_memmap_on_memory()) return false; /* @@ -1379,6 +1373,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory); static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) { @@ -1512,7 +1507,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) * Self hosted memmap array */ if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) && - mhp_supports_memmap_on_memory(memory_block_size_bytes())) { + mhp_supports_memmap_on_memory()) { ret = create_altmaps_and_memory_blocks(nid, group, start, size); if (ret) goto error; Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH] virtio_balloon: stay awake while adjusting balloon
On 11.12.23 12:43, David Stevens wrote: From: David Stevens Hi David, Add a wakeup event for when the balloon is inflating or deflating. Userspace can enable this wakeup event to prevent the system from suspending while the balloon is being adjusted. This allows /sys/power/wakeup_count to be used without breaking virtio_balloon's cooperative memory management. Can you add/share some more details Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 59 +++-- 1 file changed, 49 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1fe93e93f5bc..811d8937246a 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -119,6 +119,11 @@ struct virtio_balloon { /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; + + /* State for keeping the wakeup_source active while adjusting the balloon */ + spinlock_t adjustment_lock; + u32 adjustment_seqno; Using a simple flag that gets set when updating the balloon size and test-and-clear when testing for changes should be easier to get. bool adjustment_balloon_size_changed; or sth like that. 
+ bool adjustment_in_progress; }; static const struct virtio_device_id id_table[] = { @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb) queue_work(vb->balloon_wq, &vb->report_free_page_work); } +static void start_update_balloon_size(struct virtio_balloon *vb) +{ + unsigned long flags; + + spin_lock_irqsave(&vb->adjustment_lock, flags); + vb->adjustment_seqno++; + if (!vb->adjustment_in_progress) { + vb->adjustment_in_progress = true; + pm_stay_awake(&vb->vdev->dev); + } + spin_unlock_irqrestore(&vb->adjustment_lock, flags); + + queue_work(system_freezable_wq, &vb->update_balloon_size_work); +} + +static void end_update_balloon_size(struct virtio_balloon *vb, u32 seqno) +{ + spin_lock(&vb->adjustment_lock); + if (vb->adjustment_seqno == seqno && vb->adjustment_in_progress) { + vb->adjustment_in_progress = false; + pm_relax(&vb->vdev->dev); + } + spin_unlock(&vb->adjustment_lock); +} + static void virtballoon_changed(struct virtio_device *vdev) { struct virtio_balloon *vb = vdev->priv; @@ -444,8 +474,7 @@ static void virtballoon_changed(struct virtio_device *vdev) spin_lock_irqsave(&vb->stop_update_lock, flags); if (!vb->stop_update) { - queue_work(system_freezable_wq, - &vb->update_balloon_size_work); + start_update_balloon_size(vb); virtio_balloon_queue_free_page_work(vb); } spin_unlock_irqrestore(&vb->stop_update_lock, flags); @@ -473,22 +502,29 @@ static void update_balloon_size_func(struct work_struct *work) { struct virtio_balloon *vb; s64 diff; + u32 seqno; vb = container_of(work, struct virtio_balloon, update_balloon_size_work); - diff = towards_target(vb); - if (!diff) - return; + spin_lock(&vb->adjustment_lock); + seqno = vb->adjustment_seqno; + spin_unlock(&vb->adjustment_lock); - if (diff > 0) - diff -= fill_balloon(vb, diff); - else - diff += leak_balloon(vb, -diff); - update_balloon_size(vb); + diff = towards_target(vb); + + if (diff) { + if (diff > 0) + diff -= fill_balloon(vb, diff); + else + diff += leak_balloon(vb, -diff); + update_balloon_size(vb); + }
if (diff) queue_work(system_freezable_wq, work); + else + end_update_balloon_size(vb, seqno); What if we stop the workqueue and unload the driver -- see remove_common() -- won't you leave pm_stay_awake() wrongly set? } static int init_vqs(struct virtio_balloon *vb) @@ -992,6 +1028,9 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_unregister_oom; } + spin_lock_init(&vb->adjustment_lock); + device_set_wakeup_capable(&vb->vdev->dev, true); I'm a bit confused: Documentation/driver-api/pm/devices.rst documents " The :c:member:`power.can_wakeup` flag just records whether the device (and its driver) can physically support wakeup events. The :c:func:`device_set_wakeup_capable()` routine affects this flag. " ... " Whether or not a device is capable of issuing wakeup events is a hardware matter, and the kernel is responsible for keeping track of it. " But how is the virtio-balloon device capable of waking up the machine? Your patch merely implies that the virtio-balloon device is capable of prohibiting going to sleep. What am I missing? -- Cheers, David / dhildenb
Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior
On 11.12.23 23:52, Vishal Verma wrote: Add a sysfs knob for dax devices to control the memmap_on_memory setting if the dax device were to be hotplugged as system memory. The default memmap_on_memory setting for dax devices originating via pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to preserve legacy behavior. For dax devices via CXL, the default is on. The sysfs control allows the administrator to override the above defaults if needed. Cc: David Hildenbrand Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Tested-by: Li Zhijian Reviewed-by: Jonathan Cameron Reviewed-by: David Hildenbrand Signed-off-by: Vishal Verma --- drivers/dax/bus.c | 47 + Documentation/ABI/testing/sysfs-bus-dax | 17 2 files changed, 64 insertions(+) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1ff1ab5fa105..2871e5188f0d 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev, } static DEVICE_ATTR_RO(numa_node); +static ssize_t memmap_on_memory_show(struct device *dev, +struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory); +} + +static ssize_t memmap_on_memory_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct device_driver *drv = dev->driver; + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + struct dax_device_driver *dax_drv = to_dax_drv(drv); + ssize_t rc; + bool val; + + rc = kstrtobool(buf, &val); + if (rc) + return rc; + + if (dev_dax->memmap_on_memory == val) + return len; + + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return -ENXIO; + } + + if (dax_drv->type == DAXDRV_KMEM_TYPE) { + device_unlock(dax_region->dev); + return -EBUSY; + } + + device_lock(dev); + dev_dax->memmap_on_memory = val; + device_unlock(dev); + +
device_unlock(dax_region->dev); + return len; +} +static DEVICE_ATTR_RW(memmap_on_memory); + static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); @@ -1296,6 +1342,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_align.attr, &dev_attr_resource.attr, &dev_attr_numa_node.attr, + &dev_attr_memmap_on_memory.attr, NULL, }; diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax index a61a7b186017..b1fd8bf8a7de 100644 --- a/Documentation/ABI/testing/sysfs-bus-dax +++ b/Documentation/ABI/testing/sysfs-bus-dax @@ -149,3 +149,20 @@ KernelVersion: v5.1 Contact: nvd...@lists.linux.dev Description: (RO) The id attribute indicates the region id of a dax region. + +What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory +Date: October, 2023 +KernelVersion: v6.8 +Contact: nvd...@lists.linux.dev +Description: + (RW) Control the memmap_on_memory setting if the dax device + were to be hotplugged as system memory. This determines whether + the 'altmap' for the hotplugged memory will be placed on the + device being hotplugged (memmap_on_memory=1) or if it will be + placed on regular memory (memmap_on_memory=0). This attribute + must be set before the device is handed over to the 'kmem' + driver (i.e. hotplugged into system-ram). Additionally, this + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled + memmap_on_memory parameter for memory_hotplug. This is + typically set on the kernel command line - + memory_hotplug.memmap_on_memory set to 'true' or 'force'. Thinking about it, I wonder if we could disallow setting that property to "true" if the current configuration does not allow it. That is: 1) Removing the "size" parameter from mhp_supports_memmap_on_memory(), it doesn't make any sense anymore. 2) Exporting mhp_supports_memmap_on_memory() to modules. 3) When setting memmap_on_memory, check whether mhp_supports_memmap_on_memory() == true.
Then, the user really gets an error when trying to set it to "true". -- Cheers, David / dhildenb
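Following the ABI document quoted above, the resulting knob would be driven from user space roughly like this (the device name `dax0.0` is an example; the write must happen before the device is bound to the kmem driver, and it only takes effect when the kernel-wide prerequisites mentioned in the document are met):

```shell
# Check, then set, the per-device memmap_on_memory policy before handing
# the device to kmem (i.e. before hotplugging it as system RAM).
cat /sys/bus/dax/devices/dax0.0/memmap_on_memory
echo 1 > /sys/bus/dax/devices/dax0.0/memmap_on_memory
```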
Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior
On 07.12.23 05:36, Vishal Verma wrote: Add a sysfs knob for dax devices to control the memmap_on_memory setting if the dax device were to be hotplugged as system memory. The default memmap_on_memory setting for dax devices originating via pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to preserve legacy behavior. For dax devices via CXL, the default is on. The sysfs control allows the administrator to override the above defaults if needed. Cc: David Hildenbrand Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Reviewed-by: Jonathan Cameron Reviewed-by: David Hildenbrand Signed-off-by: Vishal Verma --- drivers/dax/bus.c | 40 + Documentation/ABI/testing/sysfs-bus-dax | 13 +++ 2 files changed, 53 insertions(+) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1ff1ab5fa105..11abb57cc031 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1270,6 +1270,45 @@ static ssize_t numa_node_show(struct device *dev, } static DEVICE_ATTR_RO(numa_node); +static ssize_t memmap_on_memory_show(struct device *dev, +struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory); +} + +static ssize_t memmap_on_memory_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + ssize_t rc; + bool val; + + rc = kstrtobool(buf, &val); + if (rc) + return rc; + + if (dev_dax->memmap_on_memory == val) + return len; + + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return -ENXIO; + } + + device_lock(dev); + dev_dax->memmap_on_memory = val; + device_unlock(dev); + + device_unlock(dax_region->dev); + return rc == 0 ?
len : rc; +} +static DEVICE_ATTR_RW(memmap_on_memory); + static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); @@ -1296,6 +1335,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_align.attr, &dev_attr_resource.attr, &dev_attr_numa_node.attr, + &dev_attr_memmap_on_memory.attr, NULL, }; diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax index a61a7b186017..bb063a004e41 100644 --- a/Documentation/ABI/testing/sysfs-bus-dax +++ b/Documentation/ABI/testing/sysfs-bus-dax @@ -149,3 +149,16 @@ KernelVersion: v5.1 Contact: nvd...@lists.linux.dev Description: (RO) The id attribute indicates the region id of a dax region. + +What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory +Date: October, 2023 +KernelVersion: v6.8 +Contact: nvd...@lists.linux.dev +Description: + (RW) Control the memmap_on_memory setting if the dax device + were to be hotplugged as system memory. This determines whether + the 'altmap' for the hotplugged memory will be placed on the + device being hotplugged (memmap_on_memory=1) or if it will be + placed on regular memory (memmap_on_memory=0). This attribute + must be set before the device is handed over to the 'kmem' + driver (i.e. hotplugged into system-ram). Should we note the dependency on other factors as given in mhp_supports_memmap_on_memory(), especially, the system-wide setting and some weird kernel configurations? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 30.11.23 15:33, Alexandru Elisei wrote: On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote: On 30.11.23 14:32, Alexandru Elisei wrote: Hi, On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote: + +out_retry: + put_page(page); + if (vmf->flags & FAULT_FLAG_VMA_LOCK) + vma_end_read(vma); + if (fault_flag_allow_retry_first(vmf->flags)) { + err = VM_FAULT_RETRY; + } else { + /* Replay the fault. */ + err = 0; Hello! Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur. I guess it creates a system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!) How about migrating the page when the migration problem repeats? Yes, I had the same thought in the previous iteration of the series, the page was migrated out of the VMA if tag storage couldn't be reserved. Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the pin will be released before the fault is replayed. Because of this, and because it makes the code simpler, I chose not to migrate the page if tag storage couldn't be reserved. There are still some cases that are theoretically problematic: vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet. All these things also affect other users that rely on movability (e.g., CMA, memory hotunplug). I wasn't aware of that, thank you for the information. Then to ensure that the process doesn't hang by replaying the loop indefinitely, I'll migrate the page if tag storage cannot be reserved. Looking over the code again, I think I can reuse the same function that migrates tag storage pages out of the MTE VMA (added in patch #21), so no major changes needed. It's going to be interesting if migrating that page fails because it is pinned :/ I imagine that having both the page **and** its tag storage pinned longterm without FOLL_LONGTERM is going to be exceedingly rare. Yes.
I recall that the rule of thumb is that some O_DIRECT I/O can take up to 10 seconds, although extremely rare (and maybe not applicable on arm64). Am I mistaken in believing that the problematic vmsplice() behaviour is recognized as something that needs to be fixed? Yes, for a couple of years I'm hoping this will actually get fixed now that O_DIRECT mostly uses FOLL_PIN instead of FOLL_GET. -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 30.11.23 14:32, Alexandru Elisei wrote: Hi, On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote: + +out_retry: + put_page(page); + if (vmf->flags & FAULT_FLAG_VMA_LOCK) + vma_end_read(vma); + if (fault_flag_allow_retry_first(vmf->flags)) { + err = VM_FAULT_RETRY; + } else { + /* Replay the fault. */ + err = 0; Hello! Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur. I guess it creates a system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!) How about migrating the page when the migration problem repeats? Yes, I had the same thought in the previous iteration of the series, the page was migrated out of the VMA if tag storage couldn't be reserved. Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the pin will be released before the fault is replayed. Because of this, and because it makes the code simpler, I chose not to migrate the page if tag storage couldn't be reserved. There are still some cases that are theoretically problematic: vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet. All these things also affect other users that rely on movability (e.g., CMA, memory hotunplug). I wasn't aware of that, thank you for the information. Then to ensure that the process doesn't hang by replaying the loop indefinitely, I'll migrate the page if tag storage cannot be reserved. Looking over the code again, I think I can reuse the same function that migrates tag storage pages out of the MTE VMA (added in patch #21), so no major changes needed. It's going to be interesting if migrating that page fails because it is pinned :/ -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 28.11.23 18:55, David Hildenbrand wrote: On 19.11.23 17:57, Alexandru Elisei wrote: To enable tagging on a memory range, userspace can use mprotect() with the PROT_MTE access flag. Pages already mapped in the VMA don't have the associated tag storage block reserved, so mark the PTEs as PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and reserve the tag storage on the fault path. That sounds a lot like fake PROT_NONE. Would there be a way to unify that handling and simply reuse pte_protnone()? For example, could we special case on VMA flags? Like, don't do NUMA hinting in these special VMAs. Then, have something like: if (pte_protnone(vmf->orig_pte)) return handle_pte_protnone(vmf); Thinking out loud: maybe there isn't even the need to special-case on the VMA. Arch code should know if there is something to do. If not, it surely was triggered by NUMA hinting. So maybe that could be handled in handle_pte_protnone() quite nicely. -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 19.11.23 17:57, Alexandru Elisei wrote: To enable tagging on a memory range, userspace can use mprotect() with the PROT_MTE access flag. Pages already mapped in the VMA don't have the associated tag storage block reserved, so mark the PTEs as PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and reserve the tag storage on the fault path. That sounds a lot like fake PROT_NONE. Would there be a way to unify that handling and simply reuse pte_protnone()? For example, could we special case on VMA flags? Like, don't do NUMA hinting in these special VMAs. Then, have something like: if (pte_protnone(vmf->orig_pte)) return handle_pte_protnone(vmf); In there, special case on the VMA flags. I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some code from the prot_none handling path. At least the recovery path and writability handling looks like it better be located shared in handle_pte_protnone() as well. That might take some magic out of this patch. -- Cheers, David / dhildenb
Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
On 19.11.23 17:57, Alexandru Elisei wrote: On arm64, the zero page receives special treatment by having the tagged flag set on MTE initialization, not when the page is mapped in a process address space. Reserve the corresponding tag block when tag storage management is being activated. Out of curiosity: why does the shared zeropage require tagged storage? What about the huge zeropage? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
On 27.11.23 16:07, Alexandru Elisei wrote: Hi, On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote: On 19.11.23 17:57, Alexandru Elisei wrote: To be able to reserve the tag storage associated with a page requires that the tag storage page can be migrated. When HW KASAN is enabled, the kernel allocates pages, which are now tagged, in non-preemptible contexts, which can make reserving the associated tag storage impossible. I assume that it's the only in-kernel user that actually requires tagged memory (besides for user space), correct? Indeed, this is the case. I'll expand the commit message to be more clear about it. Great, thanks! -- Cheers, David / dhildenb
Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
On 27.11.23 16:01, Alexandru Elisei wrote: Hi David, On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote: On 19.11.23 17:57, Alexandru Elisei wrote: Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows the page allocator to manage them like regular pages. This migratetype lends the pages some very desirable properties: * They cannot be longterm pinned, meaning they will always be migratable. * The pages can be allocated explicitly by using their PFN (with alloc_contig_range()) when they are needed to store tags. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 1 + arch/arm64/kernel/mte_tag_storage.c | 68 + include/linux/mmzone.h | 5 +++ mm/internal.h | 3 -- 4 files changed, 74 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe8276fdc7a8..047487046e8f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2065,6 +2065,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool "Dynamic MTE tag storage management" + select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware for storing MTE tags. This memory, unlike normal memory, cannot be diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index fa6267ef8392..427f4f1909f3 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -5,10 +5,12 @@ * Copyright (C) 2023 ARM Ltd.
*/ +#include #include #include #include #include +#include #include #include #include @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname, return ret; } + /* Pages are managed in pageblock_nr_pages chunks */ + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) { + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), + PFN_PHYS(pageblock_nr_pages)); + return -EINVAL; + } + ret = tag_storage_get_memory_node(node, _node); if (ret) return ret; @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void) pr_info("MTE tag storage region management disabled"); } } + +static int __init mte_tag_storage_activate_regions(void) +{ + phys_addr_t dram_start, dram_end; + struct range *tag_range; + unsigned long pfn; + int i, ret; + + if (num_tag_regions == 0) + return 0; + + dram_start = memblock_start_of_DRAM(); + dram_end = memblock_end_of_DRAM(); + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + /* +* Tag storage region was clipped by arm64_bootmem_init() +* enforcing addressing limits. +*/ + if (PFN_PHYS(tag_range->start) < dram_start || + PFN_PHYS(tag_range->end) >= dram_end) { + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end)); + ret = -EINVAL; + goto out_disabled; + } + } + + /* +* MTE disabled, tag storage pages can be used like any other pages. The +* only restriction is that the pages cannot be used by kexec because +* the memory remains marked as reserved in the memblock allocator.
+*/ + if (!system_supports_mte()) { + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++) + free_reserved_page(pfn_to_page(pfn)); + } + ret = 0; + goto out_disabled; + } + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + totalcma_pages += range_len(tag_range); + } You shouldn't be doing that manually in arm code. Likely you want some cma.c helper for something like that. If you're referring to the last loop (the one that does init_cma_reserved_pageblock()), indeed, there's already a function which does that, cma_init_reserved_areas() -> cma_activate_area(). But, can you elaborate on why you took this
Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
On 19.11.23 17:57, Alexandru Elisei wrote: alloc_contig_range() requires that the requested pages are in the same zone. Check that this is indeed the case before initializing the tag storage blocks. Signed-off-by: Alexandru Elisei --- arch/arm64/kernel/mte_tag_storage.c | 33 + 1 file changed, 33 insertions(+) diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 8b9bedf7575d..fd63430d4dc0 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void) } } +/* alloc_contig_range() requires all pages to be in the same zone. */ +static int __init mte_tag_storage_check_zone(void) +{ + struct range *tag_range; + struct zone *zone; + unsigned long pfn; + u32 block_size; + int i, j; + + for (i = 0; i < num_tag_regions; i++) { + block_size = tag_regions[i].block_size; + if (block_size == 1) + continue; + + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) { + zone = page_zone(pfn_to_page(pfn)); + for (j = 1; j < block_size; j++) { + if (page_zone(pfn_to_page(pfn + j)) != zone) { + pr_err("Tag storage block pages in different zones"); + return -EINVAL; + } + } + } + } + + return 0; +} + Looks like something that ordinary CMA provides. See cma_activate_area(). Can't we find a way to let CMA do CMA thingies and only be a user of that? What would be required to make the performance issue you spelled out in the cover letter be gone and not have to open-code that in arch code? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
On 19.11.23 17:57, Alexandru Elisei wrote: To be able to reserve the tag storage associated with a page requires that the tag storage page can be migrated. When HW KASAN is enabled, the kernel allocates pages, which are now tagged, in non-preemptible contexts, which can make reserving the associated tag storage impossible. I assume that it's the only in-kernel user that actually requires tagged memory (besides for user space), correct? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
On 19.11.23 17:57, Alexandru Elisei wrote: Tag storage memory requires that the tag storage pages used for data are always migratable when they need to be repurposed to store tags. If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved memblocks to find a suitable location for copying the kernel image. The kernel image, once loaded, cannot be moved to another location in physical memory. The initialization code for the tag storage reserves the memblocks for the tag storage pages, which means kexec will not use them, and the tag storage pages can be migrated at any time, which is the desired behaviour. However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED flag, which isn't currently set by the tag storage initialization code. Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit that the Kconfig option is required for it to work correctly. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 047487046e8f..efa5b7958169 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2065,6 +2065,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool "Dynamic MTE tag storage management" + depends on ARCH_KEEP_MEMBLOCK select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware Doesn't arm64 select that unconditionally? Why is this required then? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
On 19.11.23 17:57, Alexandru Elisei wrote: Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows the page allocator to manage them like regular pages. This migratetype lends the pages some very desirable properties: * They cannot be longterm pinned, meaning they will always be migratable. * The pages can be allocated explicitly by using their PFN (with alloc_contig_range()) when they are needed to store tags. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 1 + arch/arm64/kernel/mte_tag_storage.c | 68 + include/linux/mmzone.h | 5 +++ mm/internal.h | 3 -- 4 files changed, 74 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe8276fdc7a8..047487046e8f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2065,6 +2065,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool "Dynamic MTE tag storage management" + select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware for storing MTE tags. This memory, unlike normal memory, cannot be diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index fa6267ef8392..427f4f1909f3 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -5,10 +5,12 @@ * Copyright (C) 2023 ARM Ltd.
*/ +#include #include #include #include #include +#include #include #include #include @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname, return ret; } + /* Pages are managed in pageblock_nr_pages chunks */ + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) { + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), + PFN_PHYS(pageblock_nr_pages)); + return -EINVAL; + } + ret = tag_storage_get_memory_node(node, _node); if (ret) return ret; @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void) pr_info("MTE tag storage region management disabled"); } } + +static int __init mte_tag_storage_activate_regions(void) +{ + phys_addr_t dram_start, dram_end; + struct range *tag_range; + unsigned long pfn; + int i, ret; + + if (num_tag_regions == 0) + return 0; + + dram_start = memblock_start_of_DRAM(); + dram_end = memblock_end_of_DRAM(); + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + /* +* Tag storage region was clipped by arm64_bootmem_init() +* enforcing addressing limits. +*/ + if (PFN_PHYS(tag_range->start) < dram_start || + PFN_PHYS(tag_range->end) >= dram_end) { + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end)); + ret = -EINVAL; + goto out_disabled; + } + } + + /* +* MTE disabled, tag storage pages can be used like any other pages. The +* only restriction is that the pages cannot be used by kexec because +* the memory remains marked as reserved in the memblock allocator.
+*/ + if (!system_supports_mte()) { + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++) + free_reserved_page(pfn_to_page(pfn)); + } + ret = 0; + goto out_disabled; + } + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + totalcma_pages += range_len(tag_range); + } You shouldn't be doing that manually in arm code. Likely you want some cma.c helper for something like that. But, can you elaborate on why you took this hacky (sorry) approach as documented in the cover letter: "The arm64 code manages this memory directly instead of using cma_declare_contiguous/cma_alloc for performance reasons." What is the exact problem? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
On 19.11.23 17:57, Alexandru Elisei wrote: Add an arch_free_pages_prepare() hook that is called before the page flags are cleared. This will be used by arm64 when explicit management of tag storage pages is enabled. Can you elaborate a bit on what exactly will be done by that code with that information? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
On 19.11.23 17:56, Alexandru Elisei wrote: Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag storage for an allocated page. Reserving tag storage can fail, for example, if the tag storage page has a short pin on it, so allow prep_new_page() -> arch_prep_new_page() to similarly fail. But what are the side-effects of this? How does the calling code recover? E.g., what if we need to populate a page into user space, but that particular page we allocated fails to be prepared? So we inject a signal into that poor process? -- Cheers, David / dhildenb
Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 01.11.23 23:51, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. 
[1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Reviewed-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 213 ++-- 1 file changed, 138 insertions(+), 75 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6be7de9efa55..d242e49d7f7b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) +{ + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + + /* +* For memmap_on_memory, the altmaps were added on a per-memblock +* basis; we have to process each individual memory block. +*/ + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start))); + WARN_ON_ONCE(!mem); + if (!mem) + continue; Nit: if (WARN_ON_ONCE(!mem)) continue; + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct mhp_params params = { .pgprot = +pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(cur_start), + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1), + }; + + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; As already spotted, we have to cleanup. 
+ + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, cur_start, memblock_size, &params); + if (ret < 0) { + kfree(params.altmap); + goto out; + } + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(cur_start, memblock_size, + params.altmap, group); + if (ret) { + arch_remove_memory(cur_start, memblock_size, NULL); + kfree(params.altmap); + goto out; + } + } + + return 0; +out: + if (ret && (cur_start != start)) Nit: I think you can drop the inner parentheses. @@ -2146,11 +2208,31 @@ void try_offline_node(int nid) } EXPORT_SYMBOL(try_offline_node); +static int memory_blocks_have_altmaps(u64 start, u64 size) +{ + u64 num_memblocks = size / memory_block_size_bytes(); + u64 num_altmaps = 0; + + if (!mhp_memmap_on_memory()) + return 0; + + walk_memory_blocks(start, size, &num_altmaps, + count_memory_range_altmaps_cb); + + if (num_altmaps == 0) + return 0; + + if (num_memblocks != num_altmaps) { + WARN_ONCE(1, "Not all memblocks in range have altmaps"); Nit: if (WARN_ON_ONCE(num_memblocks != num_altmaps)) return -EINVAL; Should be sufficient. [...] /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"
Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 31.10.23 03:14, Verma, Vishal L wrote: On Mon, 2023-10-30 at 11:20 +0100, David Hildenbrand wrote: On 26.10.23 00:44, Vishal Verma wrote: [..] @@ -2146,11 +2186,69 @@ void try_offline_node(int nid) } EXPORT_SYMBOL(try_offline_node); -static int __ref try_remove_memory(u64 start, u64 size) +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) { - struct memory_block *mem; - int rc = 0, nid = NUMA_NO_NODE; + unsigned long memblock_size = memory_block_size_bytes(); struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + u64 cur_start; + int rc; + + /* + * For memmap_on_memory, the altmaps could have been added on + * a per-memblock basis. Loop through the entire range if so, + * and remove each memblock and its altmap. + */ /* * altmaps where added on a per-memblock basis; we have to process * each individual memory block. */ + for (cur_start = start; cur_start < start + size; + cur_start += memblock_size) { + rc = walk_memory_blocks(cur_start, memblock_size, , + test_has_altmap_cb); + if (rc) { + altmap = mem->altmap; + /* + * Mark altmap NULL so that we can add a debug + * check on memblock free. + */ + mem->altmap = NULL; + } Simpler (especially, we know that there must be an altmap): mem = find_memory_block(pfn_to_section_nr(cur_start)); altmap = mem->altmap; mem->altmap = NULL; I think we might be able to remove test_has_altmap_cb() then. + + remove_memory_block_devices(cur_start, memblock_size); + + arch_remove_memory(cur_start, memblock_size, altmap); + + /* Verify that all vmemmap pages have actually been freed. */ + if (altmap) { There must be an altmap, so this can be done unconditionally. Hi David, Hi! All other comments make sense, making those changes now. However for this one, does the WARN() below go away then? I was wondering if maybe arch_remove_memory() is responsible for freeing the altmap here, and at this stage we're just checking if that happened. If it didn't WARN and then free it. 
I think that has to stay, to make sure arch_remove_memory() did the right thing and we don't -- by BUG -- still have some altmap pages in use after they should have been completely freed. I drilled down the path, and I don't see altmap actually getting freed in vmem_altmap_free(), but I wasn't sure if was meant to free it as altmap->alloc went down to 0. vmem_altmap_free() does the "altmap->alloc -= nr_pfns", which is called when arch_remove_memory() frees the vmemmap pages and detects that they actually come from the altmap reserve and not from the buddy/earlyboot allocator. Freeing an altmap is just unaccounting it in the altmap structure; and here we make sure that we are actually back down to 0 and don't have some weird altmap freeing BUG in arch_remove_memory(). -- Cheers, David / dhildenb
Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 26.10.23 00:44, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. Almost there, I think :) +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group, + u64 start, u64 size) +{ + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + int ret; + + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct mhp_params params = { .pgprot = +pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(cur_start), + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1), + }; + + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; Best to cleanup here instead of handling it in the caller [as noted by Vishal, we might not be doing that yet]. Using remove_memory_blocks_and_altmaps() on the fully processed range sounds reasonable. 
maybe something like ret = arch_add_memory(nid, cur_start, memblock_size, ); if (ret) { kfree(params.altmap); goto out; } ret = create_memory_block_devices(cur_start, memblock_size, params.altmap, group); if (ret) { arch_remove_memory(cur_start, memblock_size, NULL); kfree(params.altmap); goto out; } if (ret && cur_start != start) remove_memory_blocks_and_altmaps(start, cur_start - start); return ret; + + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, cur_start, memblock_size, ); + if (ret < 0) { + kfree(params.altmap); + return ret; + } + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(cur_start, memblock_size, + params.altmap, group); + if (ret) { + arch_remove_memory(cur_start, memblock_size, NULL); + kfree(params.altmap); + return ret; + } + } + + return 0; +} + [...] static int check_cpu_on_node(int nid) { int cpu; @@ -2146,11 +2186,69 @@ void try_offline_node(int nid) } EXPORT_SYMBOL(try_offline_node); -static int __ref try_remove_memory(u64 start, u64 size) +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) { - struct memory_block *mem; - int rc = 0, nid = NUMA_NO_NODE; + unsigned long memblock_size = memory_block_size_bytes(); struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + u64 cur_start; + int rc; + + /* +* For memmap_on_memory, the altmaps could have been added on +* a per-memblock basis. Loop through the entire range if so, +* and remove each memblock and its altmap. +*/ /* * altmaps where added on a per-memblock basis; we have to process * each individual memory block. */ + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + rc = walk_memory_blocks(cur_start, memblock_size, , + test_has_altmap_cb); + if (rc) { + altmap = mem->altmap; + /* +* Mark altmap NULL so that we can add a debug +* check on memblock free. +*/ + mem->altmap = NULL; + } Simpler (especially, we know that there must be an altmap): mem =
Re: [PATCH v6 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 17.10.23 07:44, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. 
[1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Reviewed-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 214 1 file changed, 148 insertions(+), 66 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6be7de9efa55..83e5ec377aad 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,43 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static int add_memory_create_devices(int nid, struct memory_group *group, +u64 start, u64 size, mhp_t mhp_flags) +{ + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(start), + .end_pfn = PHYS_PFN(start + size - 1), + }; + int ret; + + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) { + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; + } + + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, start, size, ); + if (ret < 0) + goto error; + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(start, size, params.altmap, group); + if (ret) + goto err_bdev; + + return 0; + +err_bdev: + arch_remove_memory(start, size, NULL); +error: + kfree(params.altmap); + return ret; +} + /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations (triggered e.g. by sysfs). 
@@ -1388,14 +1425,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) */ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { - struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + unsigned long memblock_size = memory_block_size_bytes(); enum memblock_flags memblock_flags = MEMBLOCK_NONE; - struct vmem_altmap mhp_altmap = { - .base_pfn = PHYS_PFN(res->start), - .end_pfn = PHYS_PFN(res->end), - }; struct memory_group *group = NULL; - u64 start, size; + u64 start, size, cur_start; bool new_node = false; int ret; @@ -1436,28 +1469,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) /* * Self hosted memmap array */ - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { - if (mhp_supports_memmap_on_memory(size)) { - mhp_altmap.free = memory_block_memmap_on_memory_pages(); - params.altmap = kmemdup(_altmap, - sizeof(struct vmem_altmap), - GFP_KERNEL); - if (!params.altmap) + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) && + mhp_supports_memmap_on_memory(memblock_size)) { + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + ret = add_memory_create_devices(nid, group, cur_start, + memblock_size, + mhp_flags); Just like on the removal part, that function does
Re: [PATCH v6 1/3] mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
On 17.10.23 07:44, Vishal Verma wrote: A review of the memmap_on_memory modifications to add_memory_resource() revealed an instance of an open-coded kmemdup(). Replace it with kmemdup(). Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Reported-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index f8d3e7427e32..6be7de9efa55 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1439,11 +1439,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (mhp_supports_memmap_on_memory(size)) { mhp_altmap.free = memory_block_memmap_on_memory_pages(); - params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); + params.altmap = kmemdup(&mhp_altmap, + sizeof(struct vmem_altmap), + GFP_KERNEL); if (!params.altmap) goto error; - - memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap)); } /* fallback to not using altmap */ } Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 12.10.23 07:53, Verma, Vishal L wrote: On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote: On 07.10.23 10:55, Huang, Ying wrote: Vishal Verma writes: @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size) if (rc) return rc; + mem_hotplug_begin(); + /* - * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in - * the same granularity it was added - a single memory block. + * For memmap_on_memory, the altmaps could have been added on + * a per-memblock basis. Loop through the entire range if so, + * and remove each memblock and its altmap. */ if (mhp_memmap_on_memory()) { IIUC, even if mhp_memmap_on_memory() returns true, it's still possible that the memmap is put in DRAM after [2/2]. So arch_remove_memory() is called for each memory block unnecessarily. Can we detect this (via altmap?) and call remove_memory_block_and_altmap() for the whole range? Good point. We should handle memblock-per-memblock only if we have to handle the altmap. Otherwise, just call a separate function that doesn't care about -- e.g., called remove_memory_blocks_no_altmap(). We could simply walk all memory blocks and make sure either all have an altmap or none has an altmap. If there is a mix, we should bail out with WARN_ON_ONCE().
Ok I think I follow - based on both of these threads, here's my understanding in an incremental diff from the original patches (may not apply directly as I've already committed changes from the other bits of feedback - but this should provide an idea of the direction):

---
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 507291e44c0b..30addcb063b4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 start, u64 size)
 	}
 }

+static bool memblocks_have_altmaps(u64 start, u64 size)
+{
+	unsigned long memblock_size = memory_block_size_bytes();
+	u64 num_altmaps = 0, num_no_altmaps = 0;
+	struct memory_block *mem;
+	u64 cur_start;
+	int rc = 0;
+
+	if (!mhp_memmap_on_memory())
+		return false;

Probably can remove that, checked by the caller. (or drop the one in the caller)

+
+	for (cur_start = start; cur_start < start + size;
+	     cur_start += memblock_size) {
+		if (walk_memory_blocks(cur_start, memblock_size, &mem,
+				       test_has_altmap_cb))
+			num_altmaps++;
+		else
+			num_no_altmaps++;
+	}

You should do that without the outer loop, by doing the counting in the callback function instead.

+
+	if (!num_altmaps && num_no_altmaps > 0)
+		return false;
+
+	if (!num_no_altmaps && num_altmaps > 0)
+		return true;
+
+	/*
+	 * If there is a mix of memblocks with and without altmaps,
+	 * something has gone very wrong. WARN and bail.
+	 */
+	WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");

It would be better if we could even make try_remove_memory() fail in this case.

+	return false;
+}
+
 static int __ref try_remove_memory(u64 start, u64 size)
 {
 	int rc, nid = NUMA_NO_NODE;
@@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	 * a per-memblock basis. Loop through the entire range if so,
 	 * and remove each memblock and its altmap.
 	 */
-	if (mhp_memmap_on_memory()) {
+	if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
 		unsigned long memblock_size = memory_block_size_bytes();
 		u64 cur_start;

@@ -2239,7 +2273,8 @@
 			remove_memory_block_and_altmap(cur_start,
 						       memblock_size);

^ probably cleaner to move the loop into remove_memory_block_and_altmap() and call it remove_memory_blocks_and_altmaps(start, size) instead.

 	} else {
-		remove_memory_block_and_altmap(start, size);
+		remove_memory_block_devices(start, size);
+		arch_remove_memory(start, size, NULL);
 	}

 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {

-- Cheers, David / dhildenb
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 07.10.23 00:01, Verma, Vishal L wrote: On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote: On 05.10.23 20:31, Vishal Verma wrote: <..>

@@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	if (rc)
 		return rc;

+	mem_hotplug_begin();
+
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
 	 */
 	if (mhp_memmap_on_memory()) {
-		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
-		if (rc) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
-			altmap = mem->altmap;
-			/*
-			 * Mark altmap NULL so that we can add a debug
-			 * check on memblock free.
-			 */
-			mem->altmap = NULL;
-		}
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			remove_memory_block_and_altmap(nid, cur_start,
+						       memblock_size);
+	} else {
+		remove_memory_block_and_altmap(nid, start, size);

Better call remove_memory_block_devices() and arch_remove_memory(start, size, altmap) here explicitly instead of using remove_memory_block_and_altmap() that really can only handle a single memory block with any inputs.

I'm not sure I follow. Even in the non memmap_on_memory case, we'd have to walk_memory_blocks() to get to the memory_block->altmap, right?

See my other reply: at least with mhp_memmap_on_memory()==false, we don't have to worry about the altmap.

Or is there a more direct way? If we have to walk_memory_blocks, what's the advantage of calling those directly instead of calling the helper created above?

I think we have two cases to handle

1) All have an altmap. Remove them block-by-block.
Probably we should call a function remove_memory_blocks(altmap=true) [or alternatively remove_memory_blocks_and_altmaps()] and just handle iterating internally.

2) All don't have an altmap. We can remove them in one go.

Probably we should call that remove_memory_blocks(altmap=false) [or alternatively remove_memory_blocks_no_altmaps()].

I guess it's best to do a walk upfront to make sure either all have an altmap or none has one. Then we can branch off to the right function knowing whether we have to process altmaps or not. The existing

if (mhp_memmap_on_memory()) {
	...
}

can be extended for that case.

Please let me know if I failed to express what I mean, then I can briefly prototype it on top of your changes.

-- Cheers, David / dhildenb
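The two-case dispatch described above (per-block removal when altmaps are present, one-shot removal otherwise) can be sketched as a userspace simulation; `remove_one_block()` and `remove_range()` are hypothetical stand-ins for the per-block altmap teardown and the plain remove_memory_block_devices() + arch_remove_memory() path:

```c
#include <stdbool.h>

#define MEMBLOCK_SZ (128UL << 20)	/* 128 MiB, the x86-64 default block size */

static unsigned long per_block_calls;
static unsigned long whole_range_calls;

/* Stand-in for removing one memory block together with its altmap. */
static void remove_one_block(unsigned long start)
{
	(void)start;
	per_block_calls++;
}

/* Stand-in for removing the whole range in one go (no altmap to tear down). */
static void remove_range(unsigned long start, unsigned long size)
{
	(void)start;
	(void)size;
	whole_range_calls++;
}

/* Dispatch as discussed: go per-block only when the blocks carry altmaps. */
static void try_remove(unsigned long start, unsigned long size, bool have_altmaps)
{
	if (have_altmaps) {
		unsigned long cur;

		for (cur = start; cur < start + size; cur += MEMBLOCK_SZ)
			remove_one_block(cur);
	} else {
		remove_range(start, size);
	}
}
```

The upfront all-or-none walk decides `have_altmaps`; mixed ranges never reach this dispatch.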
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 07.10.23 10:55, Huang, Ying wrote: Vishal Verma writes: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 162 1 file changed, 99 insertions(+), 63 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index f8d3e7427e32..77ec6f15f943 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static int add_memory_create_devices(int nid, struct memory_group *group, +u64 start, u64 size, mhp_t mhp_flags) +{ + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(start), + .end_pfn = PHYS_PFN(start + size - 1), + }; + int ret; + + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) { + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); + if (!params.altmap) + 
		return -ENOMEM;
+
+		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
+	}
+
+	/* call arch's memory hotadd */
+	ret = arch_add_memory(nid, start, size, &params);
+	if (ret < 0)
+		goto error;
+
+	/* create memory block devices after memory was added */
+	ret = create_memory_block_devices(start, size, params.altmap, group);
+	if (ret)
+		goto err_bdev;
+
+	return 0;
+
+err_bdev:
+	arch_remove_memory(start, size, NULL);
+error:
+	kfree(params.altmap);
+	return ret;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
  */
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
-	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	unsigned long memblock_size = memory_block_size_bytes();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {
-		.base_pfn = PHYS_PFN(res->start),
-		.end_pfn = PHYS_PFN(res->end),
-	};
 	struct memory_group *group = NULL;
-	u64 start, size;
+	u64 start, size, cur_start;
 	bool new_node = false;
 	int ret;
@@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	/*
 	 * Self hosted memmap array
 	 */
-	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = memory_block_memmap_on_memory_pages();
-			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
-			if (!params.altmap)
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    mhp_supports_memmap_on_memory(memblock_size)) {
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size) {
+			ret = add_memory_create_devices(nid, group, cur_start,
+							memblock_size,
+							mhp_flags);
+			if (ret)
 				goto error;
-
-			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
 		}
-		/* fallback to not using altmap */
-	}
-
-	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid,
start, size, &params);
-	if (ret < 0)
-		goto error_fre
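The chunked add path in the patch (loop over memblock-sized pieces, bail out on failure) can be simulated in plain C. `add_one_block()` is a hypothetical stand-in for add_memory_create_devices(), and the failure injection via `fail_at` is purely illustrative; the point is the unwind of already-added chunks on error:

```c
#define MEMBLOCK_SZ (128UL << 20)	/* 128 MiB chunk, illustrative */

static int fail_at = -1;	/* chunk index at which the add fails, -1 = never */
static int added;		/* number of chunks currently added */

/* Stand-in for add_memory_create_devices(); fails at the injected index. */
static int add_one_block(int idx)
{
	if (idx == fail_at)
		return -1;	/* simulates e.g. -ENOMEM from arch_add_memory() */
	added++;
	return 0;
}

/* Stand-in for the matching teardown of one already-added chunk. */
static void remove_one_block(void)
{
	added--;
}

/* Loop over memblock-sized chunks; unwind everything added so far on failure. */
static int add_chunked(unsigned long start, unsigned long size)
{
	unsigned long cur;
	int idx = 0, done = 0;

	for (cur = start; cur < start + size; cur += MEMBLOCK_SZ, idx++) {
		if (add_one_block(idx) < 0) {
			while (done--)
				remove_one_block();
			return -1;
		}
		done++;
	}
	return 0;
}
```

A partial failure must not leave earlier chunks behind, since the caller only sees one add_memory() request for the whole range.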
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 05.10.23 20:31, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed.

Maybe add that this implies that we're not making use of PUD mappings in the direct map yet, and link to the proposal on how we could optimize that eventually in the future.

[...]

-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)

You shouldn't need the nid, right?

 {
+	int rc = 0;
 	struct memory_block *mem;
-	int rc = 0, nid = NUMA_NO_NODE;
 	struct vmem_altmap *altmap = NULL;

+	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
+	if (rc) {
+		altmap = mem->altmap;
+		/*
+		 * Mark altmap NULL so that we can add a debug
+		 * check on memblock free.
+		 */
+		mem->altmap = NULL;
+	}
+
+	/*
+	 * Memory block device removal under the device_hotplug_lock is
+	 * a barrier against racing online attempts.
+	 */
+	remove_memory_block_devices(start, size);

We're now calling that under the memory hotplug lock. I assume this is fine, but I remember some ugly lockdep details ...should be alright I guess.

+
+	arch_remove_memory(start, size, altmap);
+
+	/* Verify that all vmemmap pages have actually been freed. */
+	if (altmap) {
+		WARN(altmap->alloc, "Altmap not fully unmapped");
+		kfree(altmap);
+	}
+}
+
+static int __ref try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
+
 	BUG_ON(check_hotplug_memory_range(start, size));

 	/*
@@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	if (rc)
 		return rc;

+	mem_hotplug_begin();
+
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
 	 */
 	if (mhp_memmap_on_memory()) {
-		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
-		if (rc) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
-			altmap = mem->altmap;
-			/*
-			 * Mark altmap NULL so that we can add a debug
-			 * check on memblock free.
-			 */
-			mem->altmap = NULL;
-		}
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			remove_memory_block_and_altmap(nid, cur_start,
+						       memblock_size);
+	} else {
+		remove_memory_block_and_altmap(nid, start, size);

Better call remove_memory_block_devices() and arch_remove_memory(start, size, altmap) here explicitly instead of using remove_memory_block_and_altmap() that really can only handle a single memory block with any inputs.

 	}

 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");

Can we continue doing that in the old order? (IOW before taking the lock?).

-	/*
-	 * Memory block device removal under the device_hotplug_lock is
-	 * a barrier against racing online attempts.
-	 */
-	remove_memory_block_devices(start, size);
-
-	mem_hotplug_begin();
-
-	arch_remove_memory(start, size, altmap);
-
-	/* Verify that all vmemmap pages have actually been
Re: [PATCH v4 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 03.10.23 22:03, Verma, Vishal L wrote: On Mon, 2023-10-02 at 11:28 +0200, David Hildenbrand wrote: + +static int __ref try_remove_memory(u64 start, u64 size) +{ + int rc, nid = NUMA_NO_NODE; + + BUG_ON(check_hotplug_memory_range(start, size)); + + /* + * All memory blocks must be offlined before removing memory. Check + * whether all memory blocks in question are offline and return error + * if this is not the case. + * + * While at it, determine the nid. Note that if we'd have mixed nodes, + * we'd only try to offline the last determined one -- which is good + * enough for the cases we care about. + */ + rc = walk_memory_blocks(start, size, , check_memblock_offlined_cb); + if (rc) + return rc; + + /* + * For memmap_on_memory, the altmaps could have been added on + * a per-memblock basis. Loop through the entire range if so, + * and remove each memblock and its altmap. + */ + if (mhp_memmap_on_memory()) { + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + + for (cur_start = start; cur_start < start + size; + cur_start += memblock_size) + __try_remove_memory(nid, cur_start, memblock_size); + } else { + __try_remove_memory(nid, start, size); + } + return 0; } Why is the firmware, memblock and nid handling not kept in this outer function? We really shouldn't be doing per memory block what needs to be done per memblock: remove_memory_block_devices() and arch_remove_memory(). Ah yes makes sense since we only do create_memory_block_devices() and arch_add_memory() in the per memory block inner loop during addition. How should the locking work in this case though? Sorry, I had to process a family NMI the last couple of days. The original code holds the mem_hotplug_begin() lock over arch_remove_memory() and all of the nid and memblock stuff. Should I just hold the lock and release it in the inner loop for each memory block, and then also acquire and release it separately for the memblock and nid stuff in the outer function? 
I think we have to hold it over the whole operation. I saw that you sent a v5, I'll comment there. -- Cheers, David / dhildenb
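Holding the hotplug lock over the whole chunked operation, as agreed above, can be sketched as a small simulation that asserts the invariant: the lock is taken exactly once and is held across every per-block removal. All names here are hypothetical stand-ins for mem_hotplug_begin()/mem_hotplug_done() and the per-block removal helper:

```c
#include <assert.h>
#include <stdbool.h>

#define MEMBLOCK_SZ (128UL << 20)

static bool locked;
static int lock_acquisitions;
static int removed_under_lock;

static void mem_hotplug_begin_sim(void)
{
	assert(!locked);	/* no recursive locking in this sketch */
	locked = true;
	lock_acquisitions++;
}

static void mem_hotplug_done_sim(void)
{
	assert(locked);
	locked = false;
}

static void remove_block_sim(unsigned long start)
{
	(void)start;
	assert(locked);		/* every per-block removal must hold the lock */
	removed_under_lock++;
}

/* Lock once around the whole loop, not once per block. */
static void try_remove_sim(unsigned long start, unsigned long size)
{
	unsigned long cur;

	mem_hotplug_begin_sim();
	for (cur = start; cur < start + size; cur += MEMBLOCK_SZ)
		remove_block_sim(cur);
	mem_hotplug_done_sim();
}
```

Acquiring and releasing per block would instead show `lock_acquisitions` equal to the chunk count and open a window between chunks, which is exactly what the discussion rules out.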
Re: [PATCH v4 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
+
+static int __ref try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
+
+	BUG_ON(check_hotplug_memory_range(start, size));
+
+	/*
+	 * All memory blocks must be offlined before removing memory. Check
+	 * whether all memory blocks in question are offline and return error
+	 * if this is not the case.
+	 *
+	 * While at it, determine the nid. Note that if we'd have mixed nodes,
+	 * we'd only try to offline the last determined one -- which is good
+	 * enough for the cases we care about.
+	 */
+	rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb);
+	if (rc)
+		return rc;
+
+	/*
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
+	 */
+	if (mhp_memmap_on_memory()) {
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			__try_remove_memory(nid, cur_start, memblock_size);
+	} else {
+		__try_remove_memory(nid, start, size);
+	}
+	return 0;
 }

Why is the firmware, memblock and nid handling not kept in this outer function? We really shouldn't be doing per memory block what needs to be done per memblock: remove_memory_block_devices() and arch_remove_memory().

-- Cheers, David / dhildenb
Re: [PATCH v4 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory
On 28.09.23 22:30, Vishal Verma wrote: Large amounts of memory managed by the kmem driver may come in via CXL, and it is often desirable to have the memmap for this memory on the new memory itself. Enroll kmem-managed memory for memmap_on_memory semantics if the dax region originates via CXL. For non-CXL dax regions, retain the existing default behavior of hot adding without memmap_on_memory semantics. Add a sysfs override under the dax device to control this behavior and override either default. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Reviewed-by: Jonathan Cameron Signed-off-by: Vishal Verma --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
On 11.09.23 13:52, Catalin Marinas wrote: On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: On 24.08.23 13:06, David Hildenbrand wrote: Regarding one complication: "The kernel needs to know where to allocate a PROT_MTE page from or migrate a current page if it becomes PROT_MTE (mprotect()) and the range it is in does not support tagging.", simplified handling would be if it's in a MIGRATE_CMA pageblock, it doesn't support tagging. You have to migrate to a !CMA page (for example, not specifying GFP_MOVABLE as a quick way to achieve that). Okay, I now realize that this patch set effectively duplicates some CMA behavior using a new migrate-type. [...] I considered mixing the tag storage memory memory with normal memory and adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, this means that it's not enough anymore to have a __GFP_MOVABLE allocation request to use MIGRATE_CMA. I considered two solutions to this problem: 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => this effectively means transforming all memory from MIGRATE_CMA into the MIGRATE_METADATA migratetype that the series introduces. Not very appealing, because that means treating normal memory that is also on the MIGRATE_CMA lists as tagged memory. That's indeed not ideal. We could try this if it makes the patches significantly simpler, though I'm not so sure. Allocating metadata is the easier part as we know the correspondence from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag storage page), so alloc_contig_range() does this for us. Just adding it to the CMA range is sufficient. However, making sure that we don't allocate PROT_MTE pages from the metadata range is what led us to another migrate type. 
I guess we could achieve something similar with a new zone or a CPU-less NUMA node, Ideally, no significant core-mm changes to optimize for an architecture oddity. That implies, no new zones and no new migratetypes -- unless it is unavoidable and you are confident that you can convince core-MM people that the use case (giving back 3% of system RAM at max in some setups) is worth the trouble. I also had CPU-less NUMA nodes in mind when thinking about that, but not sure how easy it would be to integrate it. If the tag memory has actually different performance characteristics as well, a NUMA node would be the right choice. If we could find some way to easily support this either via CMA or CPU-less NUMA nodes, that would be much preferable; even if we cannot cover each and every future use case right now. I expect some issues with CXL+MTE either way , but are happy to be taught otherwise :) Another thought I had was adding something like CMA memory characteristics. Like, asking if a given CMA area/page supports tagging (i.e., flag for the CMA area set?)? When you need memory that supports tagging and have a page that does not support tagging (CMA && taggable), simply migrate to !MOVABLE memory (eventually we could also try adding !CMA). Was that discussed and what would be the challenges with that? Page migration due to compaction comes to mind, but it might also be easy to handle if we can just avoid CMA memory for that. though the latter is not guaranteed not to allocate memory from the range, only make it less likely. Both these options are less flexible in terms of size/alignment/placement. Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and configure the metadata range in ZONE_MOVABLE but at some point I'd expect some CXL-attached memory to support MTE with additional carveout reserved. I have no idea how we could possibly cleanly support memory hotplug in virtual environments (virtual DIMMs, virtio-mem) with MTE. 
In contrast to s390x storage keys, the approach that arm64 with MTE took here (exposing tag memory to the VM) makes it rather hard and complicated. -- Cheers, David / dhildenb
Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 14.08.23 08:45, Huang, Ying wrote: "Verma, Vishal L" writes: On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote: Vishal Verma writes:

@@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);

-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref __try_remove_memory(int nid, u64 start, u64 size,
+				      struct vmem_altmap *altmap)
 {
-	struct vmem_altmap mhp_altmap = {};
-	struct vmem_altmap *altmap = NULL;
-	unsigned long nr_vmemmap_pages;
-	int rc = 0, nid = NUMA_NO_NODE;
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");

If mhp_supports_memmap_on_memory(), we will call firmware_map_add_hotplug() for whole range. But here we may call firmware_map_remove() for part of range. Is it OK?

Good point, this is a discrepancy in the add vs remove path. Can the firmware memmap entries be moved up a bit in the add path, and is it okay to create these for each memblock? Or should these be for the whole range? I'm not familiar with the implications. (I've left it as is for v3 for now, but depending on the direction I can update in a future rev).

Cced more firmware map developers and maintainers. Per my understanding, we should create one firmware memmap entry for each memblock.

Ideally we should create it for the whole range, to limit the ranges. But it really only matters for DIMMs; for dax/kmem, we'll not create any firmware entries.

-- Cheers, David / dhildenb
Re: [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 20.07.23 09:14, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself.

The main hurdle for accomplishing this for kmem is that memmap_on_memory can only be done if the memory being added is equal to the size of one memblock. To overcome this, allow the hotplug code to split an add_memory() request into memblock-sized chunks, and try_remove_memory() to also expect and handle such a scenario.

Patch 1 exports mhp_supports_memmap_on_memory() so it can be used by the kmem driver. Patch 2 teaches the memory_hotplug code to allow for splitting add_memory() and remove_memory() requests over memblock sized chunks. Patch 3 adds a sysfs control for the kmem driver that would allow an opt-out of using memmap_on_memory for the memory being added.

It might be reasonable to rebase this on Aneesh's work. For example, patch #1 might not be required anymore.

-- Cheers, David / dhildenb
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 13.07.23 21:12, Jeff Moyer wrote: David Hildenbrand writes: On 16.06.23 00:00, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. Arrange for this by first allowing for a module parameter override for the mhp_supports_memmap_on_memory() test using a flag, adjusting the only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c, exporting the symbol so it can be called by kmem.c, and finally changing the kmem driver to add_memory() in chunks of memory_block_size_bytes().

1) Why is the override a requirement here? Just let the admin configure it, then add conditional support for kmem.

2) I recall that there are cases where we don't want the memmap to land on slow memory (which online_movable would achieve). Just imagine the slow PMEM case. So this might need another configuration knob on the kmem side.

From my memory, the case where you don't want the memmap to land on *persistent memory* is when the device is small (such as NVDIMM-N), and you want to reserve as much space as possible for the application data. This has nothing to do with the speed of access.

Now that you mention it, I also do remember the origin of the altmap -- to achieve exactly that: place the memmap on the device.

commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
Author: Dan Williams
Date: Fri Jan 15 16:56:22 2016 -0800

    x86, mm: introduce vmem_altmap to augment vmemmap_populate()

    In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array.
The default vmemmap_populate() allocates page table storage area from the page allocator. Given persistent memory capacities relative to DRAM it may not be feasible to store the memmap in 'System Memory'. Instead vmem_altmap represents pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() requests. In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to configure it). BUT that case is completely different from the "System RAM" mode. The memmap of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy). In comparison, if the buddy and everybody else works on the memmap in "System RAM", it's much more significant if that resides on slow memory. Looking at commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375 Author: Michal Hocko Date: Tue Oct 3 16:16:19 2017 -0700 mm, page_alloc: add scheduling point to memmap_init_zone memmap_init_zone gets a pfn range to initialize and it can be really large resulting in a soft lockup on non-preemptible kernels NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720] [...] task: 88ecd7e902c0 ti: 88eca4e5 task.ti: 88eca4e5 RIP: move_pfn_range_to_zone+0x185/0x1d0 [...] Call Trace: devm_memremap_pages+0x2c7/0x430 pmem_attach_disk+0x2fd/0x3f0 [nd_pmem] nvdimm_bus_probe+0x64/0x110 [libnvdimm] It's hard to tell if that was only required due to the memmap for these devices being that large, or also partially because the access to the memmap is slower that it makes a real difference. I recall that we're also often using ZONE_MOVABLE on such slow memory to not end up placing other kernel data structures on there: especially, user space page tables as I've been told. @Dan, any insight on the performance aspects when placing the memmap on (slow) memory and having that memory be consumed by the buddy where we frequently operate on the memmap? -- Cheers, David / dhildenb
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 13.07.23 17:40, Verma, Vishal L wrote: On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote: On 13.07.23 17:15, Verma, Vishal L wrote: On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: On 13.07.23 08:45, Verma, Vishal L wrote: I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(), kmem_name, MHP_NID_IS_MGID); + range_len(), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", Why do we need the MHP_SPLIT_MEMBLOCKS? I thought we still wanted either an opt-in or opt-out for the kmem driver to be able to do memmap_on_memory, in case there were performance implications or the lack of 1GiB PUDs. I haven't implemented that yet, but I was thinking along the lines of a sysfs knob exposed by kmem, that controls setting of this new MHP_SPLIT_MEMBLOCKS flag. Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY, and memory_hotplug is free to split to memblocks if it needs to to satisfy that. 
And if you don't want memmap holes in a larger area you're adding (for example to runtime-allocate 1 GiB pages), simply check the size your adding, and if it's, say, less than 1 G, don't set the flag. But that's probably a corner case use case not worth considering for now. That sounds reasonable. Let me give this a try and see if I run into anything else. Thanks David! Sure! -- Cheers, David / dhildenb
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 13.07.23 17:15, Verma, Vishal L wrote: On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: On 13.07.23 08:45, Verma, Vishal L wrote: I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(), kmem_name, MHP_NID_IS_MGID); + range_len(), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", Why do we need the MHP_SPLIT_MEMBLOCKS? I thought we still wanted either an opt-in or opt-out for the kmem driver to be able to do memmap_on_memory, in case there were performance implications or the lack of 1GiB PUDs. I haven't implemented that yet, but I was thinking along the lines of a sysfs knob exposed by kmem, that controls setting of this new MHP_SPLIT_MEMBLOCKS flag. Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a single memory block, you can simply split up internally, no? 
Essentially in add_memory_resource() something like

if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
    mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
	for (cur_start = start; cur_start < start + size;
	     cur_start += memory_block_size_bytes()) {
		mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;

		ret = arch_add_memory(nid, start,
				      memory_block_size_bytes(), &params);
		if (ret < 0)
			...

		ret = create_memory_block_devices(start,
						  memory_block_size_bytes(),
						  mhp_altmap.alloc, group);
		if (ret)
			...
	}
} else {
	/* old boring stuff */
}

Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() ...

My current approach was looping a level higher, on the call to add_memory_resource, but this looks reasonable too, and I can switch to this. In fact it is better as in my case I had to loop twice, once for the regular add_memory() path and once for the _driver_managed() path. Yours should avoid that.

As we only care about the altmap here, handling it for arch_add_memory() + create_memory_block_devices() should be sufficient.

-- Cheers, David / dhildenb
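One detail worth pinning down in the per-block sketch above is the altmap setup: each memory block hosts its own vmemmap (the vmemmap of block #2 must not land on block #1), so the per-block base_pfn has to track the chunk currently being added, not the overall start. A minimal stand-alone sketch; PAGE_SHIFT and the reduced two-field struct are illustrative, not the kernel definitions:

```c
#define PAGE_SHIFT 12
#define MEMBLOCK_SZ (128UL << 20)
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

/* Reduced stand-in for the kernel's struct vmem_altmap. */
struct vmem_altmap {
	unsigned long base_pfn;	/* first PFN of the block hosting this vmemmap */
	unsigned long free;	/* pages of the block reservable for the vmemmap */
};

/*
 * Per-block altmap parameters: base_pfn follows cur_start, so each
 * chunk self-hosts its own memmap.
 */
static struct vmem_altmap altmap_for_block(unsigned long cur_start)
{
	struct vmem_altmap a = {
		.base_pfn = PHYS_PFN(cur_start),
		.free = PHYS_PFN(MEMBLOCK_SZ),
	};
	return a;
}
```

With 128 MiB blocks and 4 KiB pages, each chunk's base_pfn advances by 32768 PFNs, so consecutive blocks never overlap their vmemmap reservations.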
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 13.07.23 08:45, Verma, Vishal L wrote: On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote: On 11.07.23 16:30, Aneesh Kumar K.V wrote: David Hildenbrand writes: Maybe the better alternative is to teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altmap.free = PHYS_PFN(size); mhp_altmap.base_pfn = PHYS_PFN(start); params.altmap = &mhp_altmap; } /* call arch's memory hotadd */ ret = arch_add_memory(nid, start, size, &params); if (ret < 0) goto error; Note that we want to handle that on a per memory-block basis, because we don't want the vmemmap of memory block #2 to end up on memory block #1. It all gets messy with memory onlining/offlining etc otherwise ... I tried to implement this inside add_memory_driver_managed() and also within dax/kmem. IMHO doing the error handling inside dax/kmem is better. Here is how it looks: 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions 2. For each successful request_mem_regions if any blocks got added, we keep the resource. If none got added, we will kfree the resource Doing this unconditional splitting outside of add_memory_driver_managed() is undesirable for at least two reasons: 1) You end up always creating individual entries in the resource tree (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective. 2) As we call arch_add_memory() in memory block granularity (e.g., 128 MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the identity mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
While you could sense for support and do the split based on that, it will be beneficial for other users (especially DIMMs) if we do that internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be effective or not. I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + range_len(&range), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", Why do we need the MHP_SPLIT_MEMBLOCKS? In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a single memory block, you can simply split up internally, no? Essentially in add_memory_resource() something like if (mhp_flags & MHP_MEMMAP_ON_MEMORY && mhp_supports_memmap_on_memory(memory_block_size_bytes())) { for (cur_start = start; cur_start < start + size; cur_start += memory_block_size_bytes()) { mhp_altmap.free = PHYS_PFN(memory_block_size_bytes()); mhp_altmap.base_pfn = PHYS_PFN(cur_start); params.altmap = &mhp_altmap; ret = arch_add_memory(nid, cur_start, memory_block_size_bytes(), &params); if (ret < 0) ... ret = create_memory_block_devices(cur_start, memory_block_size_bytes(), mhp_altmap.alloc, group); if (ret) ...
} } else { /* old boring stuff */ } Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() ... If any adding of memory failed, we remove what we already added.
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 11.07.23 16:30, Aneesh Kumar K.V wrote: David Hildenbrand writes: On 16.06.23 00:00, Vishal Verma wrote: With DAX memory regions originating from CXL memory expanders or NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory on a system without enough 'regular' main memory to support the memmap for it. To avoid this, ensure that all kmem managed hotplugged memory is added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the new memory region being hot added. To do this, call add_memory() in chunks of memory_block_size_bytes() as that is a requirement for memmap_on_memory. Additionally, use the mhp_flag to force the memmap_on_memory checks regardless of the respective module parameter setting. Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- drivers/dax/kmem.c | 49 - 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 7b36db6f1cbd..0751346193ef 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + u64 cur_start, cur_len, remaining; struct resource *res; struct range range; @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) res->flags = IORESOURCE_SYSTEM_RAM; /* -* Ensure that future kexec'd kernels will not treat -* this as RAM automatically. +* Add memory in chunks of memory_block_size_bytes() so that +* it is considered for MHP_MEMMAP_ON_MEMORY +* @range has already been aligned to memory_block_size_bytes(), +* so the following loop will always break it down cleanly.
*/ - rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + cur_start = range.start; + cur_len = memory_block_size_bytes(); + remaining = range_len(&range); + while (remaining) { + mhp_t mhp_flags = MHP_NID_IS_MGID; - if (rc) { - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", - i, range.start, range.end); - remove_resource(res); - kfree(res); - data->res[i] = NULL; - if (mapped) - continue; - goto err_request_mem; + if (mhp_supports_memmap_on_memory(cur_len, + MHP_MEMMAP_ON_MEMORY)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; + /* +* Ensure that future kexec'd kernels will not treat +* this as RAM automatically. +*/ + rc = add_memory_driver_managed(data->mgid, cur_start, + cur_len, kmem_name, + mhp_flags); + + if (rc) { + dev_warn(dev, +"mapping%d: %#llx-%#llx memory add failed\n", +i, cur_start, cur_start + cur_len - 1); + remove_resource(res); + kfree(res); + data->res[i] = NULL; + if (mapped) + continue; + goto err_request_mem; + } + + cur_start += cur_len; + remaining -= cur_len; } mapped++; } Maybe the better alternative is to teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altm
Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
On 23.06.23 10:40, Aneesh Kumar K.V wrote: Vishal Verma writes: For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the 'memmap_on_memory' module parameter was a hard requirement. In preparation for the dax/kmem driver to use memmap_on_memory semantics, arrange for the module parameter check to be bypassed via the appropriate mhp_flag. Recall that the kmem driver could contribute huge amounts of hotplugged memory originating from special purpose devices such as CXL memory expanders. In some cases memmap_on_memory may be the /only/ way this new memory can be hotplugged. Hence it makes sense for kmem to have a way to force memmap_on_memory without depending on a module param, if all the other conditions for it are met. The only other user of this interface is acpi/acpi_memoryhotplug.c, which only enables the mhp_flag if an initial mhp_supports_memmap_on_memory() test passes. Maintain the existing behavior and semantics for this by performing the initial check from acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back to the module parameter. Cc: "Rafael J.
Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- include/linux/memory_hotplug.h | 2 +- drivers/acpi/acpi_memhotplug.c | 2 +- mm/memory_hotplug.c| 24 3 files changed, 18 insertions(+), 10 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 9fcbf5706595..c9ddcd3cad70 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, extern int arch_create_linear_mapping(int nid, u64 start, u64 size, struct mhp_params *params); void arch_remove_linear_mapping(u64 start, u64 size); -extern bool mhp_supports_memmap_on_memory(unsigned long size); +extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags); #endif /* CONFIG_MEMORY_HOTPLUG */ #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 24f662d8bd39..119d3bb49753 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (!info->length) continue; - if (mhp_supports_memmap_on_memory(info->length)) + if (mhp_supports_memmap_on_memory(info->length, 0)) mhp_flags |= MHP_MEMMAP_ON_MEMORY; result = __add_memory(mgid, info->start_addr, info->length, mhp_flags); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 8e0fa209d533..bb3845830922 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block *mem, void *arg) return device_online(&mem->dev); } -bool mhp_supports_memmap_on_memory(unsigned long size) +bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags) { unsigned long nr_vmemmap_pages = size / PAGE_SIZE; unsigned long vmemmap_size = nr_vmemmap_pages *
sizeof(struct page); unsigned long remaining_size = size - vmemmap_size; /* -* Besides having arch support and the feature enabled at runtime, we -* need a few more assumptions to hold true: +* The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force +* memmap_on_memory (if other conditions are met), regardless of the +* module parameter. drivers/dax/kmem.c is an example, where large +* amounts of hotplug memory may come from, and the only option to +* successfully online all of it is to place the memmap on this memory. +* +* Besides having arch support and the feature enabled at runtime or +* via the mhp_flag, we need a few more assumptions to hold true: * * a) We span a single memory block: memory onlining/offlining happens *in memory block granularity. We don't want the vmemmap of online @@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ - return mhp_memmap_on_memory() && - size == memory_block_size_bytes() && - IS_ALIGNED(vmemmap_size, PMD_SIZE) && - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); + + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) || mhp_memmap_on_memory()) + return size == memory_block_size_bytes() && + I
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 21.06.23 21:32, Verma, Vishal L wrote: On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote: On 16.06.23 00:00, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. Arrange for this by first allowing for a module parameter override for the mhp_supports_memmap_on_memory() test using a flag, adjusting the only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c, exporting the symbol so it can be called by kmem.c, and finally changing the kmem driver to add_memory() in chunks of memory_block_size_bytes(). 1) Why is the override a requirement here? Just let the admin configure it and then add conditional support for kmem. Configure it in the current way using the module parameter to memory_hotplug? The whole module param check feels a bit awkward, especially since memory_hotplug is builtin, the only way to supply the param is on the kernel command line as opposed to a modprobe config. Yes, and that's nothing special. Runtime toggling is not implemented. The goal with extending mhp_supports_memmap_on_memory() to check for support with or without consideration for the module param is that it makes it a bit more flexible for callers like kmem. Not convinced yet that the global parameter should be bypassed TBH. And if so, this should be a separate patch on top that is completely optional for the remainder of the series. In any case, there has to be some admin control over that, because 1) You usually don't want vmemmap on potentially slow memory 2) Using memmap-on-memory prohibits gigantic pages from forming on that memory (when runtime-allocating them).
So "just doing that" without any config knob is problematic. -- Cheers, David / dhildenb
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 16.06.23 00:00, Vishal Verma wrote: With DAX memory regions originating from CXL memory expanders or NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory on a system without enough 'regular' main memory to support the memmap for it. To avoid this, ensure that all kmem managed hotplugged memory is added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the new memory region being hot added. To do this, call add_memory() in chunks of memory_block_size_bytes() as that is a requirement for memmap_on_memory. Additionally, use the mhp_flag to force the memmap_on_memory checks regardless of the respective module parameter setting. Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- drivers/dax/kmem.c | 49 - 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 7b36db6f1cbd..0751346193ef 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + u64 cur_start, cur_len, remaining; struct resource *res; struct range range; @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) res->flags = IORESOURCE_SYSTEM_RAM; /* -* Ensure that future kexec'd kernels will not treat -* this as RAM automatically. +* Add memory in chunks of memory_block_size_bytes() so that +* it is considered for MHP_MEMMAP_ON_MEMORY +* @range has already been aligned to memory_block_size_bytes(), +* so the following loop will always break it down cleanly.
*/ - rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + cur_start = range.start; + cur_len = memory_block_size_bytes(); + remaining = range_len(&range); + while (remaining) { + mhp_t mhp_flags = MHP_NID_IS_MGID; - if (rc) { - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", - i, range.start, range.end); - remove_resource(res); - kfree(res); - data->res[i] = NULL; - if (mapped) - continue; - goto err_request_mem; + if (mhp_supports_memmap_on_memory(cur_len, + MHP_MEMMAP_ON_MEMORY)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; + /* +* Ensure that future kexec'd kernels will not treat +* this as RAM automatically. +*/ + rc = add_memory_driver_managed(data->mgid, cur_start, + cur_len, kmem_name, + mhp_flags); + + if (rc) { + dev_warn(dev, +"mapping%d: %#llx-%#llx memory add failed\n", +i, cur_start, cur_start + cur_len - 1); + remove_resource(res); + kfree(res); + data->res[i] = NULL; + if (mapped) + continue; + goto err_request_mem; + } + + cur_start += cur_len; + remaining -= cur_len; } mapped++; } Maybe the better alternative is to teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altmap.free = PHYS_PFN(size); mhp_altmap.base_pfn = PHYS_PFN(start); params.al
Re: [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
On 16.06.23 00:00, Vishal Verma wrote: In preparation for the dax/kmem driver, which can be built as a module, to use this interface, export it with EXPORT_SYMBOL_GPL(). Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index bb3845830922..92922080d3fa 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1328,6 +1328,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags) IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); return false; } +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory); /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
On 16.06.23 00:00, Vishal Verma wrote: For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the 'memmap_on_memory' module parameter was a hard requirement. In preparation for the dax/kmem driver to use memmap_on_memory semantics, arrange for the module parameter check to be bypassed via the appropriate mhp_flag. Recall that the kmem driver could contribute huge amounts of hotplugged memory originating from special purpose devices such as CXL memory expanders. In some cases memmap_on_memory may be the /only/ way this new memory can be hotplugged. Hence it makes sense for kmem to have a way to force memmap_on_memory without depending on a module param, if all the other conditions for it are met. Just let the admin configure it. After all, an admin is involved in configuring the dax/kmem device to begin with. If add_memory() fails you could give a useful hint to the admin. -- Cheers, David / dhildenb
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 16.06.23 00:00, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. Arrange for this by first allowing for a module parameter override for the mhp_supports_memmap_on_memory() test using a flag, adjusting the only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c, exporting the symbol so it can be called by kmem.c, and finally changing the kmem driver to add_memory() in chunks of memory_block_size_bytes(). 1) Why is the override a requirement here? Just let the admin configure it and then add conditional support for kmem. 2) I recall that there are cases where we don't want the memmap to land on slow memory (which online_movable would achieve). Just imagine the slow PMEM case. So this might need another configuration knob on the kmem side. I recall some discussions on doing that chunk handling internally (so kmem can just perform one add_memory() and we'd split that up internally). -- Cheers, David / dhildenb