Re: [linus:master] [mm] d99e3140a4: BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report
On 28.05.24 09:11, kernel test robot wrote:

Hello,

kernel test robot noticed "BUG:KCSAN:data-race_in_folio_remove_rmap_ptes/print_report" on:

commit: d99e3140a4d33e26066183ff727d8f02f56bec64 ("mm: turn folio_test_hugetlb into a PageType")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

[test failed on linus/master c760b3725e52403dc1b28644fb09c47a83cacea6]
[test failed on linux-next/master 3689b0ef08b70e4e03b82ebd37730a03a672853a]

in testcase: trinity
version: trinity-i386-abe9de86-1_20230429
with following parameters:
runtime: 300s
group: group-04
nr_groups: 5

compiler: gcc-13
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)

We noticed this issue does not always happen. We also noticed different random KCSAN issues for both this commit and its parent, but the four reports below only happen on this commit, at a non-trivial rate, and stay clean on the parent.

Likely that's just a page_type check racing against concurrent mapcount changes. In __folio_rmap_sanity_checks() we check

	VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);

to make sure we don't get hugetlb folios in the wrong rmap code path. That can easily race with concurrent mapcount changes, just like any other page_type check that ends up in folio_test_type()/page_has_type(), e.g., from PFN walkers.

Load tearing in these functions shouldn't really result in false positives (what we care about), but READ_ONCE() shouldn't hurt or make a difference.

From b03dc9bf27571442d886d8da624a4e4f737433f2 Mon Sep 17 00:00:00 2001
From: David Hildenbrand
Date: Tue, 28 May 2024 09:37:20 +0200
Subject: [PATCH] mm: read page_type using READ_ONCE

KCSAN complains about possible data races: while we check for a page_type -- for example, for sanity checks -- we might concurrently modify the mapcount that overlays page_type.

Let's use READ_ONCE to avoid load tearing (shouldn't make a difference) and to make KCSAN happy.
Note: nothing should really be broken besides wrong KCSAN complaints.

Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202405281431.c46a3be9-...@intel.com
Signed-off-by: David Hildenbrand
---
 include/linux/page-flags.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 104078afe0b1..e46ccbb9aa58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -955,9 +955,9 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 #define PG_slab		0x1000

 #define PageType(page, flag)						\
-	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+	((READ_ONCE(page->page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 #define folio_test_type(folio, flag)					\
-	((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+	((READ_ONCE(folio->page.page_type) & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)

 static inline int page_type_has_type(unsigned int page_type)
 {
@@ -966,7 +966,7 @@ static inline int page_type_has_type(unsigned int page_type)

 static inline int page_has_type(const struct page *page)
 {
-	return page_type_has_type(page->page_type);
+	return page_type_has_type(READ_ONCE(page->page_type));
 }

 #define FOLIO_TYPE_OPS(lname, fname)					\
--
2.45.1

--
Thanks,

David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 16.05.24 19:44, Guillaume Morin wrote: On 02 May 5:59, Guillaume Morin wrote: On 30 Apr 21:25, David Hildenbrand wrote: I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT I'll have to have a closer look at some details (the hugepd writability check looks a bit odd), but it's mostly what I would have expected! Ok in the meantime, here is the uprobe change on your current uprobes_cow trying to address the comments you made in your previous message. Some of them were not 100% clear to me, so it's a best effort patch :-) Again lightly tested David, have you had a chance to take a look at both patches? Not in detail, last weeks were busy (currently traveling back home from LSF/MM). I'll try to find time within the next two weeks to polish my changes and send them out. It would be great if you could send your stuff based on top of that then. (the merge window just opened on Saturday, so we have plenty of time to make it to the next one :) ) -- Cheers, David / dhildenb
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On 14.05.24 20:17, Björn Töpel wrote: Alexandre Ghiti writes: On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: From: Björn Töpel Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for RISC-V. Signed-off-by: Björn Töpel --- arch/riscv/Kconfig | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 6bec1bce6586..b9398b64bb69 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -16,6 +16,8 @@ config RISCV select ACPI_REDUCED_HARDWARE_ONLY if ACPI select ARCH_DMA_DEFAULT_COHERENT select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU I think this should be SPARSEMEM_VMEMMAP here. Hmm, care to elaborate? I thought that was optional. There was a discussion at LSF/MM today to maybe require SPARSEMEM_VMEMMAP for hotplug. Would that work here as well? -- Cheers, David / dhildenb
Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel During memory hot remove, the ptdump functionality can end up touching stale data. Avoid any potential crashes (or worse), by holding the memory hotplug read-lock while traversing the page table. This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm: Hold memory hotplug lock while walking for kernel page table dump"). Signed-off-by: Björn Töpel --- arch/riscv/mm/ptdump.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c index 1289cc6d3700..9d5f657a251b 100644 --- a/arch/riscv/mm/ptdump.c +++ b/arch/riscv/mm/ptdump.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include @@ -370,7 +371,9 @@ bool ptdump_check_wx(void) static int ptdump_show(struct seq_file *m, void *v) { + get_online_mems(); ptdump_walk(m, m->private); + put_online_mems(); return 0; } Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel

For an architecture to support memory hotplugging, a couple of callbacks need to be implemented:

arch_add_memory()
This callback is responsible for adding the physical memory into the direct map, and calls into the generic memory hotplugging code via __add_pages(), which adds the corresponding struct page entries and updates the vmemmap mapping.

arch_remove_memory()
This is the inverse of the callback above.

vmemmap_free()
This function tears down the vmemmap mappings (if CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the backing vmemmap pages. Note that for persistent memory, an alternative allocator for the backing pages can be used: the vmem_altmap. This means that when the backing pages are cleared, extra care is needed so that the correct deallocation method is used.

arch_get_mappable_range()
This function returns the PA range that the direct map can map. Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from the x86 tree. The same remove_pgd_mapping() function is used in both vmemmap_free() and arch_remove_memory(), but in the latter the backing pages are not removed.
Signed-off-by: Björn Töpel
---
 arch/riscv/mm/init.c | 242 +++
 1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
 	}
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+	pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+	pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+	p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+					   struct vmem_altmap *altmap)
+{
+	if (altmap)
+		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+	else
+		free_pages((unsigned long)page_address(page), get_order(size));

If you unplug a DIMM that was added during boot (can happen on x86-64, can it happen on riscv?), free_pages() would not be sufficient. You'd be freeing a PG_reserved page that has to be freed differently.

--
Cheers,

David / dhildenb
Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel Add a parameter to the direct map setup function, so it can be used in arch_add_memory() later. Signed-off-by: Björn Töpel --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel Prepare for memory hotplugging support by changing from __init to __meminit for the page table functions that are used by the upcoming architecture specific callbacks. Changing the __init attribute to __meminit, avoids that the functions are removed after init. The __meminit attribute makes sure the functions are kept in the kernel text post init, but only if memory hotplugging is enabled for the build. Also, make sure that the altmap parameter is properly passed on to vmemmap_populate_hugepages(). Signed-off-by: Björn Töpel --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel Now that RISC-V has memory hotplugging support, virtio-mem can be used on the platform. Signed-off-by: Björn Töpel --- drivers/virtio/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index c17193544268..4e5cebf1b82a 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -122,7 +122,7 @@ config VIRTIO_BALLOON config VIRTIO_MEM tristate "Virtio mem driver" - depends on X86_64 || ARM64 + depends on X86_64 || ARM64 || RISCV depends on VIRTIO depends on MEMORY_HOTPLUG depends on MEMORY_HOTREMOVE Nice! Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v23 2/5] ring-buffer: Introducing ring-buffer mapping functions
+	cpu_buffer->subbuf_ids = subbuf_ids;
+
+	meta->meta_page_size = PAGE_SIZE;
+	meta->meta_struct_len = sizeof(*meta);
+	meta->nr_subbufs = nr_subbufs;
+	meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+
+	rb_update_meta_page(cpu_buffer);
+}
+
+static struct ring_buffer_per_cpu *
+rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return ERR_PTR(-EINVAL);
+
+	cpu_buffer = buffer->buffers[cpu];
+
+	mutex_lock(&cpu_buffer->mapping_lock);
+
+	if (!cpu_buffer->mapped) {
+		mutex_unlock(&cpu_buffer->mapping_lock);
+		return ERR_PTR(-ENODEV);
+	}
+
+	return cpu_buffer;
+}
+
+static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+	mutex_unlock(&cpu_buffer->mapping_lock);
+}
+
+/*
+ * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't need
+ * to be set-up or torn-down.
+ */
+static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
+			       bool inc)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	if (inc && cpu_buffer->mapped == UINT_MAX)
+		return -EBUSY;
+
+	if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+		return -EINVAL;
+
+	mutex_lock(&cpu_buffer->buffer->mutex);
+	raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+
+	if (inc)
+		cpu_buffer->mapped++;
+	else
+		cpu_buffer->mapped--;
+
+	raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+	mutex_unlock(&cpu_buffer->buffer->mutex);
+
+	return 0;
+}
+
+/*
+ *   +--------------+  pgoff == 0
+ *   |  meta page   |
+ *   +--------------+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |              |
+ *   +--------------+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |              |
+ *         ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+			struct vm_area_struct *vma)
+{
+	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+	unsigned int subbuf_pages, subbuf_order;
+	struct page **pages;
+	int p = 0, s = 0;
+	int err;
+
+	/* Refuse MP_PRIVATE or writable mappings */
+	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+	    !(vma->vm_flags & VM_MAYSHARE))
+		return -EPERM;
+
+	/*
+	 * Make sure the mapping cannot become writable later. Also tell the VM
+	 * to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND).

Comment is a bit outdated.

Acked-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 08.05.24 04:34, Steven Rostedt wrote: On Tue, 30 Apr 2024 12:13:51 +0100 Vincent Donnefort wrote: +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + + /* Refuse MP_PRIVATE or writable mappings */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + + /* +* Make sure the mapping cannot become writable later. Also tell the VM +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally, +* prevent migration, GUP and dump (VM_IO). +*/ + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE); Do we really need the VM_IO? When testing this in gdb, I would get: (gdb) p tmap->map->subbuf_size Cannot access memory at address 0x77fc2008 It appears that you can't ptrace IO memory. When I removed that flag, gdb has no problem reading that memory. I think we should drop that flag. Can you send a v23 with that removed, Shuah's update, and also the change below: + + lockdep_assert_held(_buffer->mapping_lock); + + subbuf_order = cpu_buffer->buffer->subbuf_order; + subbuf_pages = 1 << subbuf_order; + + nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */ + nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */ + + vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; + if (!vma_pages || vma_pages > nr_pages) + return -EINVAL; + + nr_pages = vma_pages; + + pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL); + if (!pages) + return -ENOMEM; + + if (!pgoff) { + pages[p++] = virt_to_page(cpu_buffer->meta_page); + + /* +* TODO: Align sub-buffers on their size, once +* vm_insert_pages() supports the zero-page. 
+	 */
+	} else {
+		/* Skip the meta-page */
+		pgoff--;
+
+		if (pgoff % subbuf_pages) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		int off = 0;
+
+		for (; off < (1 << (subbuf_order)); off++, page++) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}

The above can be made to:

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs))
			break;

I'm not particularly happy about us calling vm_insert_pages with NULL pointers stored in pages.

Should we instead do

	if (WARN_ON_ONCE(s >= nr_subbufs)) {
		err = -EINVAL;
		goto out;
	}

?

--
Cheers,

David / dhildenb
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 09.05.24 13:05, Vincent Donnefort wrote: On Tue, May 07, 2024 at 10:34:02PM -0400, Steven Rostedt wrote: On Tue, 30 Apr 2024 12:13:51 +0100 Vincent Donnefort wrote: +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + + /* Refuse MP_PRIVATE or writable mappings */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + + /* +* Make sure the mapping cannot become writable later. Also tell the VM +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally, +* prevent migration, GUP and dump (VM_IO). +*/ + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE); Do we really need the VM_IO? When testing this in gdb, I would get: (gdb) p tmap->map->subbuf_size Cannot access memory at address 0x77fc2008 It appears that you can't ptrace IO memory. When I removed that flag, gdb has no problem reading that memory. Yeah, VM_IO indeed implies DONTDUMP. VM_IO was part of Linus recommendations. Yes, the VM should recognize that memory to some degree as being special already due to VM_MIXEDMAP and VM_DONTEXPAND. #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP) So any of these flag achieve that (e.g., mlock_fixup() checks VM_SPECIAL). KSM similarly skips VM_DONTEXPAND and VM_MIXEDMAP (likely we should be using VM_SPECIAL in vma_ksm_compatible()). Not sure about page migration, likely its fine. Thinking about MADV_DONTNEED, I can spot in madvise_dontneed_free_valid_vma() only that we disallow primarily VM_PFNMAP. ... I assume if user space MADV_DONTNEED's some pages we'll simply get a page fault later on access that will SIGBUS, handling that gracefully (we should double-check!). 
But perhaps, VM_DONTEXPAND and MIXEDMAP (implicitly set by vm_insert_pages) are enough protection? Do we want to dump these pages? VM_DONTDUMP might be reasonable then. I don't see how anything could use GUP there, and as David pointed out on the previous version, it doesn't even prevent the GUP-fast path. Yes, GUP-fast would still have worked under some conditions. -- Cheers, David / dhildenb
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 30.04.24 13:13, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h [...] +/* + * +--+ pgoff == 0 + * | meta page | + * +--+ pgoff == 1 + * | subbuffer 0 | + * | | + * +--+ pgoff == (1 + (1 << subbuf_order)) + * | subbuffer 1 | + * | | + * ... + */ +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + + /* Refuse MP_PRIVATE or writable mappings */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + + /* +* Make sure the mapping cannot become writable later. Also tell the VM +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally, +* prevent migration, GUP and dump (VM_IO). 
+	 */
+	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);
+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	subbuf_order = cpu_buffer->buffer->subbuf_order;
+	subbuf_pages = 1 << subbuf_order;
+
+	nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+	nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	if (!vma_pages || vma_pages > nr_pages)
+		return -EINVAL;
+
+	nr_pages = vma_pages;
+
+	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	if (!pgoff) {
+		pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+		/*
+		 * TODO: Align sub-buffers on their size, once
+		 * vm_insert_pages() supports the zero-page.
+		 */
+	} else {
+		/* Skip the meta-page */
+		pgoff--;
+
+		if (pgoff % subbuf_pages) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		int off = 0;
+
+		for (; off < (1 << (subbuf_order)); off++, page++) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}
+
+	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

Nit: I did not immediately understand if we could end here with p < nr_pages (IOW, pages[] not completely filled).

One source of confusion is the "s < nr_subbufs" check in the while loop: why is "p < nr_pages" insufficient?

For the MM bits:

Acked-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP
On 30.04.24 13:13, Vincent Donnefort wrote: In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Signed-off-by: Vincent Donnefort Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 26.04.24 21:55, Guillaume Morin wrote: On 26 Apr 9:19, David Hildenbrand wrote: A couple of points: a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want to check PageAnonExclusive. b) If you're not following the can_follow_write_pte/_pmd model, you are doing something wrong :) c) The code was heavily changed in mm/mm-unstable. It was merged with the common code. Likely, in mm/mm-unstable, the existing can_follow_write_pte and can_follow_write_pmd checks will already cover what you want in most cases. We'd need a can_follow_write_pud() to cover follow_huge_pud() and (unfortunately) something to handle follow_hugepd() as well similarly. Copy-pasting what we do in can_follow_write_pte() and adjusting for different PTE types is the right thing to do. Maybe now it's time to factor out the common checks into a separate helper. I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT. I'll have to have a closer look at some details (the hugepd writability check looks a bit odd), but it's mostly what I would have expected! -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 30.04.24 17:22, Guillaume Morin wrote: On 26 Apr 21:55, Guillaume Morin wrote: On 26 Apr 9:19, David Hildenbrand wrote: A couple of points: a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want to check PageAnonExclusive. b) If you're not following the can_follow_write_pte/_pmd model, you are doing something wrong :) c) The code was heavily changed in mm/mm-unstable. It was merged with the common code. Likely, in mm/mm-unstable, the existing can_follow_write_pte and can_follow_write_pmd checks will already cover what you want in most cases. We'd need a can_follow_write_pud() to cover follow_huge_pud() and (unfortunately) something to handle follow_hugepd() as well similarly. Copy-pasting what we do in can_follow_write_pte() and adjusting for different PTE types is the right thing to do. Maybe now it's time to factor out the common checks into a separate helper. I tried to get the hugepd stuff right but this was the first I heard about it :-) Afaict follow_huge_pmd and friends were already DTRT. I got it working on top of your uprobes-cow branch with the foll force patch sent Friday. Still pretty lightly tested. Sorry for not replying earlier, was busy with other stuff. I'll try getting that stuff into shape and send it out soonish. I went with using one write uprobe function with some additional branches. I went back and forth between that and making them 2 different functions. All the folio_test_hugetlb() special casing is a bit suboptimal. Likely we want a separate variant, because we should be using hugetlb PTE functions consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), softdirty does not exist, etc.)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
 	{}
 };

+bool hugetlbfs_mapping(struct address_space *mapping) {
+	return mapping->a_ops == &hugetlbfs_aops;

is_vm_hugetlb_page() might be what you are looking for.

[...]

 }

-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)
+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len, unsigned long page_mask)
 {
 	void *kaddr = kmap_atomic(page);
-	memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+	memcpy(dst, kaddr + (vaddr & ~page_mask), len);
 	kunmap_atomic(kaddr);
 }

-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)
+static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len, unsigned long page_mask)
 {
 	void *kaddr = kmap_atomic(page);
-	memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+	memcpy(kaddr + (vaddr & ~page_mask), src, len);
 	kunmap_atomic(kaddr);
 }

These two changes really are rather ugly ... and why are they even required? We get a PAGE_SIZE-based subpage of a hugetlb page. We only kmap that one and copy within that one.

In other words, I don't think the copy_from_page() and copy_to_page() changes are even required when we consistently work on subpages and not suddenly on head pages.

-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
+static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, unsigned long page_mask)
 {
 	uprobe_opcode_t old_opcode;
 	bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	 * is a trap variant; uprobes always wins over any other (gdb)
 	 * breakpoint.
*/ - copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE); + copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE, + page_mask); is_swbp = is_swbp_insn(_opcode); if (is_swbp_insn(new_opcode)) { @@ -376,8 +378,8 @@ struct uwo_data { uprobe_opcode_t opcode; }; -static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr, - unsigned long next, struct mm_walk *walk) +static int __write_opcode(pte_t *ptep, unsigned long vaddr, + unsigned long page_mask, struct mm_walk *walk) Unrelated alignment change. { struct uwo_data *data = walk->private;; const bool is_register = !!is_swbp_insn(>opcode); @@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr, /* Unmap + flush the TLB, such that we can write atomically .*/ flush_cache_page(vma, vaddr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, vaddr, ptep); + if (folio_test_hugetlb(folio)) + pte = huge_ptep_clear_flush(vma, vaddr, ptep); + else + pte = ptep_clear_flush(vma, vaddr, ptep); copy_to_page(page, da
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 26.04.24 02:09, Guillaume Morin wrote: On 25 Apr 21:56, David Hildenbrand wrote: On 25.04.24 17:19, Guillaume Morin wrote: On 24 Apr 23:00, David Hildenbrand wrote: One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it. Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW. Yes, my patch seems to be working. The hugetlb code is pretty simple. And it allows ptrace and the proc pid mem file to work on the executable private hugetlb mappings. There is one thing I am unclear about though. hugetlb enforces that huge_pte_write() is true on FOLL_WRITE in both the fault and follow_page_mask paths. I am not sure if we can simply assume in the hugetlb code that if the pte is not writable and this is a write fault then we're in the FOLL_FORCE|FOLL_WRITE case. Or do we want to keep the checks but simply not enforce it for FOLL_FORCE|FOLL_WRITE? The latter is more complicated in the fault path because there is no FAULT_FLAG_FORCE flag. I just pushed something to https://github.com/davidhildenbrand/linux/tree/uprobes_cow Only very lightly tested so far. Expect the worst :) I'll try it out and send you the hugetlb bits. I still detest having the zapping logic there, but to get it all right I don't see a clean way around that. For hugetlb, we'd primarily have to implement the mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE). For FOLL_FORCE, here is my draft. Let me know if this is what you had in mind.
diff --git a/mm/gup.c b/mm/gup.c index 1611e73b1121..ac60e0ae64e8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1056,9 +1056,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) { if (!(gup_flags & FOLL_FORCE)) return -EFAULT; - /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */ - if (is_vm_hugetlb_page(vma)) - return -EFAULT; /* * We used to let the write,force case do COW in a * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 3548eae42cf9..73f86eddf888 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5941,7 +5941,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, struct folio *pagecache_folio, spinlock_t *ptl, struct vm_fault *vmf) { - const bool unshare = flags & FAULT_FLAG_UNSHARE; + const bool make_writable = !(flags & FAULT_FLAG_UNSHARE) && + (vma->vm_flags & VM_WRITE); pte_t pte = huge_ptep_get(ptep); struct hstate *h = hstate_vma(vma); struct folio *old_folio; @@ -5959,16 +5960,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma, * can trigger this, because hugetlb_fault() will always resolve * uffd-wp bit first. */ - if (!unshare && huge_pte_uffd_wp(pte)) + if (make_writable && huge_pte_uffd_wp(pte)) return 0; - /* -* hugetlb does not support FOLL_FORCE-style write faults that keep the -* PTE mapped R/O such as maybe_mkwrite() would do. -*/ - if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE))) - return VM_FAULT_SIGSEGV; - /* Let's take out MAP_SHARED mappings first. 
 	 */
 	if (vma->vm_flags & VM_MAYSHARE) {
 		set_huge_ptep_writable(vma, haddr, ptep);
@@ -5989,7 +5983,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		folio_move_anon_rmap(old_folio, vma);
 		SetPageAnonExclusive(&old_folio->page);
 	}
-	if (likely(!unshare))
+	if (likely(make_writable))
 		set_huge_ptep_writable(vma, haddr, ptep);

Maybe we want to refactor that similarly into a set_huge_ptep_maybe_writable, and handle the VM_WRITE check internally. Then, here you'd do

	if (unshare)
		set_huge_ptep(vma, haddr, ptep);
	else
		set_huge_ptep_maybe_writable(vma, haddr, ptep);

Something like that.

 	/* Break COW or unshare */
 	huge_ptep_clear_flush(vma, haddr, ptep);
@@ -6883,6 +6878,17 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */

+static bool is_force_follow(struct vm_area_struct* vma, unsigned int flags,
+			    struct p
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 25.04.24 17:19, Guillaume Morin wrote: On 24 Apr 23:00, David Hildenbrand wrote: One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it.

Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW.

Yes, my patch seems to be working. The hugetlb code is pretty simple. And it allows ptrace and the proc pid mem file to work on the executable private hugetlb mappings. There is one thing I am unclear about though. hugetlb enforces that huge_pte_write() is true on FOLL_WRITE in both the fault and follow_page_mask paths. I am not sure if we can simply assume in the hugetlb code that if the pte is not writable and this is a write fault then we're in the FOLL_FORCE|FOLL_WRITE case. Or do we want to keep the checks but simply not enforce them for FOLL_FORCE|FOLL_WRITE? The latter is more complicated in the fault path because there is no FAULT_FLAG_FORCE flag.

I just pushed something to https://github.com/davidhildenbrand/linux/tree/uprobes_cow

Only very lightly tested so far. Expect the worst :) I still detest having the zapping logic there, but to get it all right I don't see a clean way around that.

For hugetlb, we'd primarily have to implement the mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE). Likely vaddr and PAGE_SIZE in uprobe_write_opcode() would have to be expanded to cover the full hugetlb page. -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 25.04.24 17:19, Guillaume Morin wrote: On 24 Apr 23:00, David Hildenbrand wrote: One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it.

Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW.

Yes, my patch seems to be working. The hugetlb code is pretty simple. And it allows ptrace and the proc pid mem file to work on the executable private hugetlb mappings. There is one thing I am unclear about though. hugetlb enforces that huge_pte_write() is true on FOLL_WRITE in both the fault and follow_page_mask paths. I am not sure if we can simply assume in the hugetlb code that if the pte is not writable and this is a write fault then we're in the FOLL_FORCE|FOLL_WRITE case. Or do we want to keep the checks but simply not enforce them for FOLL_FORCE|FOLL_WRITE? The latter is more complicated in the fault path because there is no FAULT_FLAG_FORCE flag.

handle_mm_fault()->sanitize_fault_flags() makes sure that we'll only proceed with a fault either if
* we have VM_WRITE set
* we are in a COW mapping (MAP_PRIVATE with at least VM_MAYWRITE)

Once you see FAULT_FLAG_WRITE and you do have VM_WRITE, you don't care about FOLL_FORCE, it's simply a write fault. Once you see FAULT_FLAG_WRITE and you *don't* have VM_WRITE, you must have VM_MAYWRITE and are essentially in FOLL_FORCE.

In a VMA without VM_WRITE, you must never map a PTE writable. In ordinary COW code, that's done in wp_page_copy(), where we *always* use maybe_mkwrite(), to do exactly what a write fault would do, but without mapping the PTE writable. That's what the whole can_follow_write_pmd()/can_follow_write_pte() is about: writing to PTEs that are not writable.
You'll have to follow the exact same model in hugetlb (can_follow_write_pmd(), hugetlb_maybe_mkwrite(), ...). -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 24.04.24 22:44, Guillaume Morin wrote: On 24 Apr 22:09, David Hildenbrand wrote: Let me try to see if we can get this done cleaner. One ugly part (in general here) is the custom page replacement in the registration part. We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages ourselves (which we likely shouldn't do ...) ... maybe we could use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio populated. (like KSM does nowadays) Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but using FOLL_WRITE would not work on many file systems. So maybe we have to trigger an unsharing fault ourselves. ^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to break COW. It was never clear to me why uprobes was not doing FOLL_WRITE in the first place, I must say. It's quite dated code ... The use of FOLL_FORCE really is ugly here. When registering, we require VM_WRITE but ... when unregistering, we don't ... One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for hugetlb mappings. However this was also on my TODO and I have a draft patch that implements it. Yes, I documented it back then and added sanity checks in GUP code to fence it off. Shouldn't be too hard to implement (famous last words) and would be the cleaner thing to use here once I manage to switch over to FOLL_WRITE|FOLL_FORCE to break COW. -- Cheers, David / dhildenb
Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 24.04.24 22:31, Vincent Donnefort wrote: Hi David, Thanks for your quick response. On Wed, Apr 24, 2024 at 05:26:39PM +0200, David Hildenbrand wrote: I gave it some more thought, and I think we are still missing something (I wish PFNMAP/MIXEDMAP wouldn't be that hard). + +/* + * +--+ pgoff == 0 + * | meta page | + * +--+ pgoff == 1 + * | subbuffer 0 | + * | | + * +--+ pgoff == (1 + (1 << subbuf_order)) + * | subbuffer 1 | + * | | + * ... + */ +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + I'd add some comments here like /* Refuse any MAP_PRIVATE or writable mappings. */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + /* * Make sure the mapping cannot become writable later. Also, tell the VM * to not touch these pages pages (VM_DONTCOPY | VM_DONTDUMP) and tell * GUP to leave them alone as well (VM_IO). */ + vm_flags_mod(vma, +VM_MIXEDMAP | VM_PFNMAP | +VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO, +VM_MAYWRITE); I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all and, as stated, vm_insert_pages() even complains quite a lot when it would have to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a very good reason. Can't we limit ourselves to VM_IO? But then, I wonder if it really helps much regarding GUP: yes, it blocks ordinary GUP (see check_vma_flags()) but as insert_page_into_pte_locked() does *not* set pte_special(), GUP-fast (gup_fast_pte_range()) will not reject it. Really, if you want GUP-fast to reject it, remap_pfn_range() and friends are the way to go, that will set pte_special() such that also GUP-fast will leave it alone, just like vm_normal_page() would. So ... 
I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it alone won't stop all of GUP. We really have to mark the PTE as special, which vm_insert_page() must not do (because it is refcounted!). Hum, apologies, I am not sure to follow the connection here. Why do you think the recommendation was to prevent GUP? Ah, I'm hallucinating! :) "not let people play games with the mapping" to me implied "make sure nobody touches it". If GUP is acceptable that makes stuff a lot easier. VM_IO will block some GUP, but not all of it. Which means: do we really have to stop GUP from grabbing that page? Using vm_insert_page() only with VM_MIXEDMAP (and without VM_PFNMAP|VM_IO) would be better. Under the assumption we do not want to stop all GUP, why not using VM_IO over VM_MIXEDMAP which is I believe more restrictive? VM_MIXEDMAP will be implicitly set by vm_insert_page(). There is a lengthy comment for vm_normal_page() that explains all this madness. VM_MIXEDMAP is primarily relevant for COW mappings, which you just forbid completely. remap_pfn_range_notrack() documents the semantics of some of the other flags: * VM_IO tells people not to look at these pages * (accesses can have side effects). * VM_PFNMAP tells the core MM that the base pages are just * raw PFN mappings, and do not have a "struct page" associated * with them. * VM_DONTEXPAND * Disable vma merging and expanding with mremap(). * VM_DONTDUMP * Omit vma from core dump, even when VM_IO turned off. VM_PFNMAP is very likely really not what we want, unless we really perform raw PFN mappings ... VM_IO we can set without doing much harm. So I would suggest dropping VM_PFNMAP when using vm_insert_pages(), using only VM_IO and likely just letting vm_insert_pages() set VM_MIXEDMAP for you. [...] vm_insert_pages() documents: "In case of error, we may have mapped a subset of the provided pages. It is the caller's responsibility to account for this case." Which could for example happen, when allocating a page table fails. 
Would we be able to deal with that here? As we are in the mmap path, on an error, I would expect the vma to be destroyed and those pages whose insertion succeeded to be unmapped?

Ah, we simply fail ->mmap(). In mmap_region(), if call_mmap() failed, we "goto unmap_and_free_vma" where we have

/* Undo any partial mapping done by a device driver. */
unmap_region(mm, , vma, prev, next, vma->vm_start, vma->vm_end, vma->vm_end, true);

But perhaps we should proactively zap_page_range_single()?

No, mmap_region() should indeed be handling it correctly already!

-- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 22.04.24 22:53, Guillaume Morin wrote: On 22 Apr 20:59, David Hildenbrand wrote: The benefit - to me - is very clear. People do use hugetlb mappings to run code in production environments. The perf benefits are there for some workloads. Intel has published a whitepaper about it etc. Uprobes are a very good tool to do live tracing. If you can restart the process and reproduce, you should be able to disable hugetlb remapping but if you need to look at a live process, there are not many options. Not being able to use uprobes is crippling. Please add all that as motivation to the patch description or cover letter. Yes, libhugetlbfs exists. But why do we have to support uprobes with it? Nobody cared until now, why care now? I think you could ask the same question for every new feature patch :) I have to, because it usually indicates a lack of motivation in the cover-letter/patch description :P My cover letter was indeed lacking. I will make sure to add this kind of details next time. Since the removal a few releases ago of the __morecore() hook in glibc, the main feature of libhugetlbfs is ELF segments remapping. I think there are definitely a lot of users that simply deal with this unnecessary limitation. I am certainly not shoving this patch through anyone's throat if there is no interest. But we definitely find it a very useful feature ... Let me try to see if we can get this done cleaner. One ugly part (in general here) is the custom page replacement in the registration part. We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages ourselves (which we likely shouldn't do ...) ... maybe we could use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio populated. (like KSM does nowadays) Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but using FOLL_WRITE would not work on many file systems. So maybe we have to trigger an unsharing fault ourselves. 
^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to break COW. That would do the page replacement for us and we "should" be able to lookup an anonymous folio that we can then just modify, like ptrace would.

But then, there is also the unregistration part, with weird conditional page replacement. Zapping the anon page if the content matches the content of the original page is one thing. But why are we replacing an existing anonymous page with a new anonymous page when the content from the original page differs (but matches the one from the just-copied page)? I'll have to further think about that one. It's all a bit nasty.

Sounds good to me. I am willing to help with the code when you have a plan, or with testing, as you see fit. Let me know.

I'm hacking on a redesign that removes the manual COW breaking logic and *might* make it easier to integrate hugetlb. (very likely, but until I have the redesign running I cannot promise anything :) ) I'll let you know once I have something ready so you could integrate the hugetlb portion.

-- Cheers, David / dhildenb
Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions
I gave it some more thought, and I think we are still missing something (I wish PFNMAP/MIXEDMAP wouldn't be that hard). + +/* + * +--+ pgoff == 0 + * | meta page | + * +--+ pgoff == 1 + * | subbuffer 0 | + * | | + * +--+ pgoff == (1 + (1 << subbuf_order)) + * | subbuffer 1 | + * | | + * ... + */ +#ifdef CONFIG_MMU +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, + struct vm_area_struct *vma) +{ + unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff; + unsigned int subbuf_pages, subbuf_order; + struct page **pages; + int p = 0, s = 0; + int err; + I'd add some comments here like /* Refuse any MAP_PRIVATE or writable mappings. */ + if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC || + !(vma->vm_flags & VM_MAYSHARE)) + return -EPERM; + /* * Make sure the mapping cannot become writable later. Also, tell the VM * to not touch these pages pages (VM_DONTCOPY | VM_DONTDUMP) and tell * GUP to leave them alone as well (VM_IO). */ + vm_flags_mod(vma, +VM_MIXEDMAP | VM_PFNMAP | +VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO, +VM_MAYWRITE); I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all and, as stated, vm_insert_pages() even complains quite a lot when it would have to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a very good reason. Can't we limit ourselves to VM_IO? But then, I wonder if it really helps much regarding GUP: yes, it blocks ordinary GUP (see check_vma_flags()) but as insert_page_into_pte_locked() does *not* set pte_special(), GUP-fast (gup_fast_pte_range()) will not reject it. Really, if you want GUP-fast to reject it, remap_pfn_range() and friends are the way to go, that will set pte_special() such that also GUP-fast will leave it alone, just like vm_normal_page() would. So ... I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it alone won't stop all of GUP. We really have to mark the PTE as special, which vm_insert_page() must not do (because it is refcounted!). 
Which means: do we really have to stop GUP from grabbing that page? Using vm_insert_page() only with VM_MIXEDMAP (and without VM_PFNMAP|VM_IO) would be better. If we want to stop all of GUP, remap_pfn_range() currently seems unavoidable :(

+
+	lockdep_assert_held(&cpu_buffer->mapping_lock);
+
+	subbuf_order = cpu_buffer->buffer->subbuf_order;
+	subbuf_pages = 1 << subbuf_order;
+
+	nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+	nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	if (!vma_pages || vma_pages > nr_pages)
+		return -EINVAL;
+
+	nr_pages = vma_pages;
+
+	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	if (!pgoff) {
+		pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+		/*
+		 * TODO: Align sub-buffers on their size, once
+		 * vm_insert_pages() supports the zero-page.
+		 */
+	} else {
+		/* Skip the meta-page */
+		pgoff--;
+
+		if (pgoff % subbuf_pages) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		s += pgoff / subbuf_pages;
+	}
+
+	while (s < nr_subbufs && p < nr_pages) {
+		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		int off = 0;
+
+		for (; off < (1 << (subbuf_order)); off++, page++) {
+			if (p >= nr_pages)
+				break;
+
+			pages[p++] = page;
+		}
+		s++;
+	}
+
+	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);

vm_insert_pages() documents: "In case of error, we may have mapped a subset of the provided pages. It is the caller's responsibility to account for this case." Which could, for example, happen when allocating a page table fails. Would we be able to deal with that here?

Again, I wish it would all be easier ...

-- Cheers, David / dhildenb
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 22.04.24 22:31, Vincent Donnefort wrote: On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote: On 22.04.24 20:20, Vincent Donnefort wrote: Hi David, Thanks for having a look, very much appreciated! On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote: On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. 
CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages
Re: [PATCH v3 1/4] virtio_balloon: separate vm events into a function
On 23.04.24 05:41, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'. Signed-off-by: zhenwei pi --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [RFC][PATCH] uprobe: support for private hugetlb mappings
On 22.04.24 20:11, Guillaume Morin wrote: (Dropping Mike Kravetz as CC since he has retired and his email is no longer valid, adding Muchun since he's the current hugetlb maintainer, as well as linux-trace-kernel) On 22 Apr 11:39, David Hildenbrand wrote: On 19.04.24 20:37, Guillaume Morin wrote: libhugetlbfs, the Intel iodlr code both allow to remap .text onto a hugetlb private mapping. It's also pretty easy to do it manually. One drawback of using this functionality is the lack of support for uprobes (NOTE uprobe ignores shareable vmas) This patch adds support for private hugetlb mappings. It does require exposing some hugetlbfs innards and relies on copy_user_large_folio which is only available when CONFIG_HUGETLBFS is used so I had to use an ugly #ifdef If there is some interest in applying this patch in some form or another, I am open to any refactoring suggestions (esp getting rid the #ifdef in uprobes.c) . I tried to limit the amount of branching. All that hugetlb special casing oh my. What's the benefit why we should be interested in making that code less clean -- to phrase it in a nice way ;) ? I do appreciate the nice phrasing. Believe me, I did try to limit the special casing to a minimum :-). Outside of __replace_page, I added only 3-ish branches so I do not think it's *too* bad. The uprobe code is using PAGE_{SHIFT,MASK} quite liberally so I had to add calls to retrieve these for the hugetlb vmas. __replace_page has a lot of special casing. I certainly agree (and unfortunately for me it's at the beginning of the patch :)). It's doing something pretty uncommon outside of the mm code so it has to make a bunch of specific hugetlb calls. I am not quite sure how to improve it but if you have suggestions, I'd be happy to refactor. See below. The benefit - to me - is very clear. People do use hugetlb mappings to run code in production environments. The perf benefits are there for some workloads. Intel has published a whitepaper about it etc. 
Uprobes are a very good tool to do live tracing. If you can restart the process and reproduce, you should be able to disable hugetlb remapping but if you need to look at a live process, there are not many options. Not being able to use uprobes is crippling. Please add all that as motivation to the patch description or cover letter. Yes, libhugetlbfs exists. But why do we have to support uprobes with it? Nobody cared until now, why care now? I think you could ask the same question for every new feature patch :) I have to, because it usually indicates a lack of motivation in the cover-letter/patch description :P People will have to maintain that code, and maintaining hugetlb code in odd places is no fun ... Since the removal a few releases ago of the __morecore() hook in glibc, the main feature of libhugetlbfs is ELF segments remapping. I think there are definitely a lot of users that simply deal with this unnecessary limitation. I am certainly not shoving this patch through anyone's throat if there is no interest. But we definitely find it a very useful feature ... Let me try to see if we can get this done cleaner. One ugly part (in general here) is the custom page replacement in the registration part. We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages ourselves (which we likely shouldn't do ...) ... maybe we could use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio populated. (like KSM does nowadays) Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but using FOLL_WRITE would not work on many file systems. So maybe we have to trigger an unsharing fault ourselves. That would do the page replacement for us and we "should" be able to lookup an anonymous folio that we can then just modify, like ptrace would. But then, there is also unregistration part, with weird conditional page replacement. Zapping the anon page if the content matches the content of the original page is one thing. 
But why are we replacing an existing anonymous page with a new anonymous page when the content from the original page differs (but matches the one from the just-copied page)? I'll have to further think about that one. It's all a bit nasty.

One thing to note is that hugetlb folios don't grow on trees. Likely, many setups *don't* reserve extra hugetlb folios and you might just easily be running out of free hugetlb folios that you can use to break COW here (replace a file hugetlb by a fresh anon hugetlb page). Likely it's easy to make register or unregister fail.

-- Cheers, David / dhildenb
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 22.04.24 20:20, Vincent Donnefort wrote: Hi David, Thanks for having a look, very much appreciated! On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote: On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. 
CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepareNULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. 
+ */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct lis
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 19.04.24 20:25, David Hildenbrand wrote: On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size 
of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ 
longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(_buffer->irq_
Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function
On 22.04.24 09:42, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any VM events in future, we would have codes like: #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; #endif ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif Separate vm events into a single function, also remove 'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 44 ++--- 1 file changed, 29 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1f5b3dd31fcf..59fe157e5722 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -316,34 +316,48 @@ static inline void update_stat(struct virtio_balloon *vb, int idx, #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT) -static unsigned int update_balloon_stats(struct virtio_balloon *vb) +/* Return the number of entries filled by vm events */ +static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb, + unsigned int start) { +#ifdef CONFIG_VM_EVENT_COUNTERS unsigned long events[NR_VM_EVENT_ITEMS]; - struct sysinfo i; - unsigned int idx = 0; - long available; - unsigned long caches; + unsigned int idx = start; all_vm_events(events); - si_meminfo(); - - available = si_mem_available(); - caches = global_node_page_state(NR_FILE_PAGES); - -#ifdef CONFIG_VM_EVENT_COUNTERS update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, - pages_to_bytes(events[PSWPIN])); + pages_to_bytes(events[PSWPIN])); update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT, - pages_to_bytes(events[PSWPOUT])); + pages_to_bytes(events[PSWPOUT])); update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); + #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, 
events[HTLB_BUDDY_PGALLOC]); update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL, events[HTLB_BUDDY_PGALLOC_FAIL]); -#endif -#endif +#endif /* CONFIG_HUGETLB_PAGE */ + + return idx - start; +#else /* CONFIG_VM_EVENT_COUNTERS */ + + return 0; +#endif /* CONFIG_VM_EVENT_COUNTERS */ +} + +static unsigned int update_balloon_stats(struct virtio_balloon *vb) +{ + struct sysinfo i; + unsigned int idx = 0; + long available; + unsigned long caches; + + idx += update_balloon_vm_stats(vb, idx); No need to handle idx that complicated now. Just do unsigned int idx; idx = update_balloon_vm_stats(vb); We can go down that path if we ever want to rearrange the code and not have the vm_stats first. + + si_meminfo(); + available = si_mem_available(); + caches = global_node_page_state(NR_FILE_PAGES); update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE, pages_to_bytes(i.freeram)); update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT, -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function
On 22.04.24 10:04, zhenwei pi wrote: On 4/22/24 15:47, David Hildenbrand wrote: On 22.04.24 09:42, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any VM events in future, we would have codes like: #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; #endif ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif Separate vm events into a single function, also remove Why not simply use __maybe_unused for that variable? 1> static unsigned int update_balloon_stats() { unsigned __maybe_unused long foo; ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif } 2> static inline unsigned int update_balloon_vm_stats() { #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif } From my point of view, I don't need to compile code in my brain when reading case 2. :) But for #1? :) I mean, you didn't compile the code in your brain when you sent out v1 :P But I agree that moving that to a separate function is cleaner, staring at the resulting update_balloon_stats(). Let me comment on some nits as a fresh reply. -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] virtio_balloon: introduce memory allocation stall counter
On 22.04.24 09:42, zhenwei pi wrote: Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 8 include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 87a1d6fa77fb..ab039e83bc6f 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -323,6 +323,8 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb, #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long events[NR_VM_EVENT_ITEMS]; unsigned int idx = start; + unsigned int zid; + unsigned long stall = 0; all_vm_events(events); update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN, @@ -333,6 +335,12 @@ static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb, update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); + /* sum all the stall events */ + for (zid = 0; zid < MAX_NR_ZONES; zid++) + stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid]; + + update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, events[HTLB_BUDDY_PGALLOC]); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index b17bbe033697..487b893a160e 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -72,7 +72,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ -#define VIRTIO_BALLOON_S_NR 11 +#define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory 
allocatoin */ +#define VIRTIO_BALLOON_S_NR 12 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -85,7 +86,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ - VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function
On 22.04.24 09:42, zhenwei pi wrote: All the VM events related statistics have dependence on 'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any VM events in future, we would have codes like: #ifdef CONFIG_VM_EVENT_COUNTERS unsigned long foo; #endif ... #ifdef CONFIG_VM_EVENT_COUNTERS foo = events[XXX] + events[YYY]; update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo); #endif Separate vm events into a single function, also remove Why not simply use __maybe_unused for that variable? -- Cheers, David / dhildenb
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On 06.04.24 19:36, Vincent Donnefort wrote: In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size of this meta-page. 
+ * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..793ecc454039 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include @@ -338,6 +340,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +487,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf VA */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct 
list_headnew_pages; /* new pages to add */ @@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(_buffer->irq_work.waiters);
Re: [PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info
On 18.04.24 08:26, zhenwei pi wrote: Expose memory scan/reclaim information to the host side via virtio balloon device. Now we have a metric to analyze the memory performance: y: counter increases n: counter does not change h: the rate of counter change is high l: the rate of counter change is low OOM: VIRTIO_BALLOON_S_OOM_KILL STALL: VIRTIO_BALLOON_S_ALLOC_STALL ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT - OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critical memory pressure - OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not global memory pressure. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. The performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows quite effective memory reclaiming. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. The ratio between DRCLM/DSCAN gets low, and the guest OS is thrashing heavily; in serious cases this leads to poor performance and difficult troubleshooting. For example, sshd may block on memory allocation when accepting new connections, and a user can't log in to the VM via ssh. - OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ratio between ARCLM/ASCAN shows that the guest tries to reclaim more memory, but it can't. Once more memory is required in the future, it will struggle to reclaim memory.
Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 9 + include/uapi/linux/virtio_balloon.h | 12 ++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index e88e6573afa5..bc9332c1ae85 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) stall += events[ALLOCSTALL_MOVABLE]; update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN, + pages_to_bytes(events[PGSCAN_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN, + pages_to_bytes(events[PGSCAN_DIRECT])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM, + pages_to_bytes(events[PGSTEAL_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM, + pages_to_bytes(events[PGSTEAL_DIRECT])); + #ifdef CONFIG_HUGETLB_PAGE update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, events[HTLB_BUDDY_PGALLOC]); diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 487b893a160e..ee35a372805d 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -73,7 +73,11 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ #define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory allocatoin */ -#define VIRTIO_BALLOON_S_NR 12 +#define VIRTIO_BALLOON_S_ASYNC_SCAN12 /* Amount of memory scanned asynchronously */ +#define VIRTIO_BALLOON_S_DIRECT_SCAN 13 /* Amount of memory scanned directly */ +#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed asynchronously */ +#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed directly */ +#define VIRTIO_BALLOON_S_NR 16 #define 
VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -87,7 +91,11 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \ - VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \ + VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \ + VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \ + VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \ + VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \ } #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") Not an expert on these counters/events, but LGTM Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 2/3] virtio_balloon: introduce memory allocation stall counter
On 18.04.24 08:26, zhenwei pi wrote: Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 20 +++- include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index fd19934a847f..e88e6573afa5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) unsigned long events[NR_VM_EVENT_ITEMS]; struct sysinfo i; unsigned int idx = 0; - long available; + long available, stall = 0; unsigned long caches; all_vm_events(events); @@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]); update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); + + /* sum all the stall events */ +#ifdef CONFIG_ZONE_DMA + stall += events[ALLOCSTALL_DMA]; +#endif +#ifdef CONFIG_ZONE_DMA32 + stall += events[ALLOCSTALL_DMA32]; +#endif +#ifdef CONFIG_HIGHMEM + stall += events[ALLOCSTALL_HIGH]; +#endif +#ifdef CONFIG_ZONE_DEVICE + stall += events[ALLOCSTALL_DEVICE]; +#endif Naive me would think that ALLOCSTALL_DEVICE is always 0. :) Likely we should just do: for (zid = 0; zid < MAX_NR_ZONES; zid++) stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid]; (see isolate_lru_folios() -> __count_zid_vm_events(), where we rely on the same ordering) Apart from that, LGTM. -- Cheers, David / dhildenb
Re: [PATCH 1/3] virtio_balloon: introduce oom-kill invocations
On 18.04.24 08:26, zhenwei pi wrote: When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat, and report the OOM KILL event. However, the agent may be killed and we will lose this critical event (and the later events). For now we can also grep for magic words in the guest kernel log from the host side. Rather than this unstable way, virtio balloon reports OOM-KILL invocations instead. Signed-off-by: zhenwei pi Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [RFC 3/3] virtio_balloon: introduce memory scan/reclaim info
On 15.04.24 10:41, zhenwei pi wrote: Expose memory scan/reclaim information to the host side via virtio balloon device. Now we have a metric to analyze the memory performance: y: counter increases n: counter does not change h: the rate of counter change is high l: the rate of counter change is low OOM: VIRTIO_BALLOON_S_OOM_KILL STALL: VIRTIO_BALLOON_S_ALLOC_STALL ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT - OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]: the guest runs under really critical memory pressure - OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]: the memory allocation stalls due to cgroup, not global memory pressure. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]: the memory allocation stalls due to global memory pressure. The performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows quite effective memory reclaiming. - OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]: the memory allocation stalls due to global memory pressure. The ratio between DRCLM/DSCAN gets low, and the guest OS is thrashing heavily; in serious cases this leads to poor performance and difficult troubleshooting. For example, sshd may block on memory allocation when accepting new connections, and a user can't log in to the VM via ssh. - OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]: the low ratio between ARCLM/ASCAN shows that the guest tries to reclaim more memory, but it can't. Once more memory is required in the future, it will struggle to reclaim memory.
Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 9 + include/uapi/linux/virtio_balloon.h | 12 ++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 4b9c9569f6e5..7b86514e99d4 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -372,6 +372,15 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) stall += events[ALLOCSTALL_MOVABLE]; update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_ASYNC, + pages_to_bytes(events[PGSCAN_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_DIRECT, + pages_to_bytes(events[PGSCAN_DIRECT])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_ASYNC, + pages_to_bytes(events[PGSTEAL_KSWAPD])); + update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_DIRECT, + pages_to_bytes(events[PGSTEAL_DIRECT])); + return idx; } diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 13d0c32ba27c..0875a9cccb01 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -73,7 +73,11 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ #define VIRTIO_BALLOON_S_ALLOC_STALL 11 /* Stall count of memory allocatoin */ -#define VIRTIO_BALLOON_S_NR 12 +#define VIRTIO_BALLOON_S_SCAN_ASYNC12 /* Amount of memory scanned asynchronously */ +#define VIRTIO_BALLOON_S_SCAN_DIRECT 13 /* Amount of memory scanned directly */ +#define VIRTIO_BALLOON_S_RECLAIM_ASYNC 14 /* Amount of memory reclaimed asynchronously */ +#define VIRTIO_BALLOON_S_RECLAIM_DIRECT 15 /* Amount of memory reclaimed directly */ Should these be the other way around: ASYN_SCAN ... ASYNC_RECLAIM so we can get ... 
+#define VIRTIO_BALLOON_S_NR 16 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -87,7 +91,11 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \ - VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall", \ + VIRTIO_BALLOON_S_NAMES_prefix "scan-async", \ + VIRTIO_BALLOON_S_NAMES_prefix "scan-direct", \ + VIRTIO_BALLOON_S_NAMES_prefix "reclaim-async", \ + VIRTIO_BALLOON_S_NAMES_prefix "reclaim-direct" \ ... "async-scans", "async-reclaims" ... -- Cheers, David / dhildenb
Re: [RFC 2/3] virtio_balloon: introduce memory allocation stall counter
On 15.04.24 10:41, zhenwei pi wrote: Memory allocation stall counter represents the performance/latency of memory allocation, expose this counter to the host side by virtio balloon device via out-of-bound way. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 19 ++- include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index fd8daa742734..4b9c9569f6e5 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) unsigned long events[NR_VM_EVENT_ITEMS]; struct sysinfo i; unsigned int idx = 0; - long available; + long available, stall = 0; unsigned long caches; all_vm_events(events); @@ -355,6 +355,23 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]); + /* sum all the stall event */ +#ifdef CONFIG_ZONE_DMA + stall += events[ALLOCSTALL_DMA]; +#endif +#ifdef CONFIG_ZONE_DMA32 + stall += events[ALLOCSTALL_DMA32]; +#endif +#ifdef CONFIG_HIGHMEM + stall += events[ALLOCSTALL_HIGH]; +#endif +#ifdef CONFIG_ZONE_DEVICE + stall += events[ALLOCSTALL_DEVICE]; +#endif + stall += events[ALLOCSTALL_NORMAL]; + stall += events[ALLOCSTALL_MOVABLE]; + update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall); + return idx; } diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index cde5547e64a7..13d0c32ba27c 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -72,7 +72,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ -#define VIRTIO_BALLOON_S_NR 11 +#define VIRTIO_BALLOON_S_ALLOC_STALL 
11 /* Stall count of memory allocatoin */ +#define VIRTIO_BALLOON_S_NR 12 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -85,7 +86,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ - VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \ + VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \ "alloc-stalls" -- Cheers, David / dhildenb
Re: [RFC 1/3] virtio_balloon: introduce oom-kill invocations
On 15.04.24 10:41, zhenwei pi wrote: When the guest OS runs under critical memory pressure, the guest starts to kill processes. A guest monitor agent may scan 'oom_kill' from /proc/vmstat, and reports the OOM KILL event. However, the agent may be killed and we will loss this critical event(and the later events). For now we can also grep for magic words in guest kernel log from host side. Rather than this unstable way, virtio balloon reports OOM-KILL invocations instead. Signed-off-by: zhenwei pi --- drivers/virtio/virtio_balloon.c | 2 ++ include/uapi/linux/virtio_balloon.h | 6 -- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1f5b3dd31fcf..fd8daa742734 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -352,6 +352,8 @@ static unsigned int update_balloon_stats(struct virtio_balloon *vb) pages_to_bytes(available)); update_stat(vb, idx++, VIRTIO_BALLOON_S_CACHES, pages_to_bytes(caches)); + update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, + events[OOM_KILL]); return idx; } diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index ddaa45e723c4..cde5547e64a7 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -71,7 +71,8 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ #define VIRTIO_BALLOON_S_HTLB_PGALLOC 8 /* Hugetlb page allocations */ #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ -#define VIRTIO_BALLOON_S_NR 10 +#define VIRTIO_BALLOON_S_OOM_KILL 10 /* OOM killer invocations */ +#define VIRTIO_BALLOON_S_NR 11 #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ @@ -83,7 +84,8 @@ struct virtio_balloon_config { VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ VIRTIO_BALLOON_S_NAMES_prefix 
"hugetlb-allocations", \ - VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \ + VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \ "oom-kills" -- Cheers, David / dhildenb
Re: [RFC 0/3] Improve memory statistics for virtio balloon
On 15.04.24 10:41, zhenwei pi wrote: Hi, When the guest runs under critical memory pressure, the guest becomes too slow; even sshd turns to D state (uninterruptible) on memory allocation. We can't log in to this VM to do any troubleshooting work. The guest kernel log via virtual TTY (on the host side) only provides a few necessary log lines after OOM. More detailed memory statistics are required, then we can know explicit memory events and estimate the pressure. I'm going to introduce several VM counters for virtio balloon: - oom-kill - alloc-stall - scan-async - scan-direct - reclaim-async - reclaim-direct IIUC, we're only exposing events that are already getting provided via all_vm_events(), correct? In that case, I don't really see a major issue. Some considerations: (1) These new events are fairly Linux specific. PSWPIN and friends are fairly generic, but HGTLB is also already fairly Linux specific. OOM-kills don't really exist on Windows, for example. We'll have to be careful of properly describing what the semantics are. (2) How should we handle it if Linux ever stops supporting a certain event (e.g., major reclaim rework)? I assume, simply return nothing, like we currently would for VIRTIO_BALLOON_S_HTLB_PGALLOC without CONFIG_HUGETLB_PAGE. -- Cheers, David / dhildenb
Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback
On 11.04.24 18:55, Paolo Bonzini wrote: On Mon, Apr 8, 2024 at 3:56 PM Peter Xu wrote: Paolo, I may miss a bunch of details here (as I still remember some change_pte patches previously on the list..), however not sure whether we considered enable it? Asked because I remember Andrea used to have a custom tree maintaining that part: https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968 The patch enables it only for KSM, so it would still require a bunch of cleanups, for example I also would still use set_pte_at() in all the places that are not KSM. This would at least fix the issue with the poor documentation of where to use set_pte_at_notify() vs set_pte_at(). With regard to the implementation, I like the idea of disabling the invalidation on the MMU notifier side, but I would rather have MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of overloading the event field. Maybe it can't be enabled for some reason that I overlooked in the current tree, or we just decided to not to? I have just learnt about the patch, nobody had ever mentioned it even though it's almost 2 years old... It's a lot of code though and no one I assume Andrea used it on his tree where he also has a version of "randprotect" (even included in that commit subject) to mitigate a KSM security issue that was reported by some security researchers [1] a while ago. From what I recall, the industry did not end up caring about that security issue that much. IIUC, with "randprotect" we get a lot more R/O protection even when not de-duplicating a page -- thus the name. Likely, the reporter mentioned in the commit is a researcher that played with Andreas fix for the security issue. But I'm just speculating at this point :) has ever reported an issue for over 10 years, so I think it's easiest to just rip the code out. Yes. Can always be readded in a possibly cleaner fashion (like you note above), when deemed necessary and we are willing to support it. 
[1] https://gruss.cc/files/remote_dedup.pdf -- Cheers, David / dhildenb
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
On 09.04.24 20:31, James Houghton wrote: Ah, I didn't see this in my inbox, sorry David! No worries :) On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand wrote: On 02.04.24 01:29, James Houghton wrote: diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..daaa9db625d3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,10 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_YOUNG (1 << 0) +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) Especially this one really deserves some documentation :) Yes, will do. Something like MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in bitmap either (1) does not accurately represent the age of the pages (in the case of test_young), or (2) was not able to be used to completely clear the age/access bit (in the case of clear_young). Make sense. I do wonder what the expected reaction from the caller is :) +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2) And that one as well. Something like Indicates that (1) passing a bitmap ({test,clear}_young_bitmap) would have been supported for this address range. The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM is able to harvest the access bit "fast" (so for x86, locklessly, and for arm64, with the KVM MMU read lock), "fast" enough that using a bitmap to do look-around is probably a good idea. Is that really the right way to communicate that ("would have been supported") -- wouldn't we want to sense support differently? Likely best to briefly document all of them, and how they are supposed to be used (return value for X). Right. Will do. + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -106,21 +110,36 @@ struct mmu_notifier_ops { * clear_young is a lightweight version of clear_flush_young. 
Like the * latter, it is supposed to test-and-clear the young/accessed bitflag * in the secondary pte, but it may omit flushing the secondary tlb. + * + * If @bitmap is given but is not supported, return + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. + * + * If the walk is done "quickly" and there were young PTEs, + * MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*clear_young)(struct mmu_notifier *subscription, struct mm_struct *mm, unsigned long start, -unsigned long end); +unsigned long end, +unsigned long *bitmap); /* * test_young is called to check the young/accessed bitflag in * the secondary pte. This is used to know if the page is * frequently used without actually clearing the flag or tearing * down the secondary mapping on the page. + * + * If @bitmap is given but is not supported, return + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. + * + * If the walk is done "quickly" and there were young PTEs, + * MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*test_young)(struct mmu_notifier *subscription, struct mm_struct *mm, - unsigned long address); + unsigned long start, + unsigned long end, + unsigned long *bitmap); What does "quickly" mean (why not use "fast")? What are the semantics, I don't find any existing usage of that in this file. "fast" means fast enough such that using a bitmap to scan adjacent pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in this comment. Perhaps I should just rename it to MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be beneficial" thing -- that's for MGLRU/etc. to decide really. Yes! Further, what is MMU_NOTIFIER_YOUNG you introduce used for? MMU_NOTIFIER_YOUNG is the return value when the page was young, but we (1) didn't use a bitmap, and (2) the "fast" access bit harvesting wasn't possible. In this case we simply return 1, which is MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young() properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that it will be 1. 
Yes, that will clarify it! -- Cheers, David / dhildenb
Re: [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()
On 05.04.24 13:58, Paolo Bonzini wrote: With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini --- A real joy seeing that gone Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 3/4] mmu_notifier: remove the .change_pte() callback
On 05.04.24 13:58, Paolo Bonzini wrote: The scope of set_pte_at_notify() has reduced more and more through the years. Initially, it was meant for when the change to the PTE was not bracketed by mmu_notifier_invalidate_range_{start,end}(). However, that has not been so for over ten years. During all this period the only implementation of .change_pte() was KVM and it had no actual functionality, because it was called after mmu_notifier_invalidate_range_start() zapped the secondary PTE. Now that this (nonfunctional) user of the .change_pte() callback is gone, the whole callback can be removed. For now, leave in place set_pte_at_notify() even though it is just a synonym for set_pte_at(). Signed-off-by: Paolo Bonzini --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
On 02.04.24 01:29, James Houghton wrote: The bitmap is provided for secondary MMUs to use if they support it. For test_young(), after it returns, the bitmap represents the pages that were young in the interval [start, end). For clear_young, it represents the pages that we wish the secondary MMU to clear the accessed/young bit for. If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API should be unchanged except that if young PTEs are found and the architecture supports passing in a bitmap, instead of returning 1, MMU_NOTIFIER_YOUNG_FAST is returned. This allows MGLRU's look-around logic to work faster, resulting in a 4% improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST to indicate to main mm that doing look-around is likely to be beneficial. If the secondary MMU doesn't support the bitmap, it must return an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/ Suggested-by: Yu Zhao Signed-off-by: James Houghton --- include/linux/mmu_notifier.h | 93 +--- include/trace/events/kvm.h | 13 +++-- mm/mmu_notifier.c| 20 +--- virt/kvm/kvm_main.c | 19 ++-- 4 files changed, 123 insertions(+), 22 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..daaa9db625d3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,10 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_YOUNG (1 << 0) +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) Especially this one really deserves some documentation :) +#define MMU_NOTIFIER_YOUNG_FAST(1 << 2) And that one as well. Likely best to briefly document all of them, and how they are supposed to be used (return value for X). 
+ struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -106,21 +110,36 @@ struct mmu_notifier_ops { * clear_young is a lightweight version of clear_flush_young. Like the * latter, it is supposed to test-and-clear the young/accessed bitflag * in the secondary pte, but it may omit flushing the secondary tlb. +* +* If @bitmap is given but is not supported, return +* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. +* +* If the walk is done "quickly" and there were young PTEs, +* MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*clear_young)(struct mmu_notifier *subscription, struct mm_struct *mm, unsigned long start, - unsigned long end); + unsigned long end, + unsigned long *bitmap); /* * test_young is called to check the young/accessed bitflag in * the secondary pte. This is used to know if the page is * frequently used without actually clearing the flag or tearing * down the secondary mapping on the page. +* +* If @bitmap is given but is not supported, return +* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. +* +* If the walk is done "quickly" and there were young PTEs, +* MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*test_young)(struct mmu_notifier *subscription, struct mm_struct *mm, - unsigned long address); + unsigned long start, + unsigned long end, + unsigned long *bitmap); What does "quickly" mean (why not use "fast")? What are the semantics, I don't find any existing usage of that in this file. Further, what is MMU_NOTIFIER_YOUNG you introduce used for? -- Cheers, David / dhildenb
Re: [PATCH 2/2] virtio_balloon: Treat stats requests as wakeup events
On 18.03.24 10:10, David Stevens wrote: From: David Stevens Treat stats requests as wakeup events to ensure that the driver responds to device requests in a timely manner. Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 75 - 1 file changed, 46 insertions(+), 29 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 7fe7ef5f1c77..402dec98e08c 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -121,11 +121,14 @@ struct virtio_balloon { struct page_reporting_dev_info pr_dev_info; /* State for keeping the wakeup_source active while adjusting the balloon */ - spinlock_t adjustment_lock; - bool adjustment_signal_pending; - bool adjustment_in_progress; + spinlock_t wakeup_lock; + bool processing_wakeup_event; + u32 wakeup_signal_mask; }; +#define ADJUSTMENT_WAKEUP_SIGNAL (1 << 0) +#define STATS_WAKEUP_SIGNAL (1 << 1) I'd suggest a different naming like: VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS Apart from that, nothing jumped at me. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 1/2] virtio_balloon: Give the balloon its own wakeup source
On 18.03.24 10:10, David Stevens wrote: From: David Stevens Wakeup sources don't support nesting multiple events, so sharing a single object between multiple drivers can result in one driver overriding the wakeup event processing period specified by another driver. Have the virtio balloon driver use the wakeup source of the device it is bound to rather than the wakeup source of the parent device, to avoid conflicts with the transport layer. Note that although the virtio balloon's virtio_device itself isn't what actually wakes up the device, it is responsible for processing wakeup events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source to prevent suspend when userspace is processing wakeup events, a dedicated wakeup_source is necessary when processing wakeup events in a higher layer in the kernel. Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon") Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1f5b3dd31fcf..7fe7ef5f1c77 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon *vb) vb->adjustment_signal_pending = true; if (!vb->adjustment_in_progress) { vb->adjustment_in_progress = true; - pm_stay_awake(vb->vdev->dev.parent); + pm_stay_awake(&vb->vdev->dev); } spin_unlock_irqrestore(&vb->adjustment_lock, flags); @@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon *vb) spin_lock_irq(&vb->adjustment_lock); if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { vb->adjustment_in_progress = false; - pm_relax(vb->vdev->dev.parent); + pm_relax(&vb->vdev->dev); } spin_unlock_irq(&vb->adjustment_lock); } @@ -1028,6 +1028,7 @@ static int virtballoon_probe(struct virtio_device *vdev) } spin_lock_init(&vb->adjustment_lock); Can we add a comment here why we have 
to do that? + device_set_wakeup_capable(&vb->vdev->dev, true); virtio_device_ready(vdev); Absolutely not an expert on the details, but I assume this is fine. Acked-by: David Hildenbrand -- Cheers, David / dhildenb
[PATCH v1] virtio-mem: support suspend+resume
With virtio-mem, primarily hibernation is problematic: as the machine shuts down, the virtio-mem device loses its state. Powering the machine back up is like losing a bunch of DIMMs. While there would be ways to add limited support, suspend+resume is more commonly used for VMs and "easier" to support cleanly. s2idle can be supported without any device dependencies. Similarly, one would expect suspend-to-ram (i.e., S3) to work out of the box. However, QEMU currently unplugs all device memory when resuming the VM, using a cold reset on the "wakeup" path. In order to support S3, we need a feature flag for the device to tell us if memory remains plugged when waking up. In the future, QEMU will implement this feature. So let's always support s2idle and support S3 with plugged memory only if the device indicates support. Block hibernation early using the PM notifier. Trying to hibernate now fails early: # echo disk > /sys/power/state [ 26.455369] PM: hibernation: hibernation entry [ 26.458271] virtio_mem virtio0: hibernation is not supported. [ 26.462498] PM: hibernation: hibernation exit -bash: echo: write error: Operation not permitted s2idle works even without the new feature bit: # echo s2idle > /sys/power/mem_sleep # echo mem > /sys/power/state [ 52.083725] PM: suspend entry (s2idle) [ 52.095950] Filesystems sync: 0.010 seconds [ 52.101493] Freezing user space processes [ 52.104213] Freezing user space processes completed (elapsed 0.001 seconds) [ 52.106520] OOM killer disabled. 
[ 52.107655] Freezing remaining freezable tasks [ 52.110880] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 52.113296] printk: Suspending console(s) (use no_console_suspend to debug) S3 does not work without the feature bit when memory is plugged: # echo deep > /sys/power/mem_sleep # echo mem > /sys/power/state [ 32.788281] PM: suspend entry (deep) [ 32.816630] Filesystems sync: 0.027 seconds [ 32.820029] Freezing user space processes [ 32.823870] Freezing user space processes completed (elapsed 0.001 seconds) [ 32.827756] OOM killer disabled. [ 32.829608] Freezing remaining freezable tasks [ 32.833842] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 32.837953] printk: Suspending console(s) (use no_console_suspend to debug) [ 32.916172] virtio_mem virtio0: suspend+resume with plugged memory is not supported [ 32.916181] virtio-pci :00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1 [ 32.916197] virtio-pci :00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1 [ 32.916210] virtio-pci :00:02.0: PM: failed to suspend async: error -1 But S3 works with the new feature bit when memory is plugged (patched QEMU): # echo deep > /sys/power/mem_sleep # echo mem > /sys/power/state [ 33.983694] PM: suspend entry (deep) [ 34.009828] Filesystems sync: 0.024 seconds [ 34.013589] Freezing user space processes [ 34.016722] Freezing user space processes completed (elapsed 0.001 seconds) [ 34.019092] OOM killer disabled. [ 34.020291] Freezing remaining freezable tasks [ 34.023549] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 34.026090] printk: Suspending console(s) (use no_console_suspend to debug) Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Xuan Zhuo Signed-off-by: David Hildenbrand --- I had QEMU support ready [1] but reset-related things just changed upstream that will require a bit of a rework -- and it will end up looking cleaner. 
Will come back to upstreaming the QEMU part once I can properly sync the Linux headers to contain VIRTIO_MEM_F_PERSISTENT_SUSPEND. [1] https://github.com/davidhildenbrand/qemu/tree/virtio-mem-suspend --- drivers/virtio/virtio_mem.c | 68 ++--- include/uapi/linux/virtio_mem.h | 2 + 2 files changed, 64 insertions(+), 6 deletions(-) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 8e32232944423..51088d02de32f 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -21,6 +21,7 @@ #include #include #include +#include #include @@ -252,6 +253,9 @@ struct virtio_mem { /* Memory notifier (online/offline events). */ struct notifier_block memory_notifier; + /* Notifier to block hibernation image storing/reloading. */ + struct notifier_block pm_notifier; + #ifdef CONFIG_PROC_VMCORE /* vmcore callback for /proc/vmcore handling in kdump mode */ struct vmcore_cb vmcore_cb; @@ -,6 +1115,25 @@ static int virtio_mem_memory_notifi
[PATCH v1] virtio: reenable config if freezing device failed
Currently, we don't reenable the config if freezing the device failed. For example, virtio-mem currently doesn't support suspend+resume, and trying to freeze the device will always fail. Afterwards, the device will no longer respond to resize requests, because it won't get notified about config changes. Let's fix this by re-enabling the config if freezing fails. Fixes: 22b7050a024d ("virtio: defer config changed notifications") Cc: Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Xuan Zhuo Signed-off-by: David Hildenbrand --- drivers/virtio/virtio.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index f4080692b351..f513ee21b1c1 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev) if (drv && drv->freeze) { ret = drv->freeze(dev); - if (ret) + if (ret) { + virtio_config_enable(dev); return ret; + } } if (dev->config->destroy_avq) base-commit: c664e16bb1ba1c8cf1d7ecf3df5fd83bbb8ac15a -- 2.43.0
Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES
On 10.01.24 07:22, Zheyun Shen wrote: For now, SEV pins guest's memory to avoid swapping or moving ciphertext, but leading to the inhibition of Memory Ballooning. In Memory Ballooning, only guest's free pages will be relocated in balloon inflation and deflation, so the difference of plaintext doesn't matter to guest. A Linux hypervisor will always give you a fresh, zeroed page. I don't recall what the spec says, could be that that is a guarantee. Memory Ballooning is a nice memory overcommitment technology that can be used in CVM based on SEV and SEV-ES, so userspace tools can provide an option to allow SEV not to pin memory and enable Memory Ballooning. Guest kernel may not inhibit Balloon and should set shared memory for Balloon decrypted. Two points: 1) Memory overcommit means that you promise to have more memory than you actually have. To be able to use that in a *safe* way in the hypervisor, to fulfill that promise, you need some backup strategy, which is usually swap space in the hypervisor. Further one might apply other techniques (ram compression, memory deduplication) in the hypervisor to make that swapping unlikely to ever happen when overcommitting (because nobody wants to swap). Assume you run a lot of VMs that mostly have private/encrypted memory (which is the default). Imagine you previously inflated the balloon on VM0, and that VM needs more memory (you promised it could have more!). You reach out to other VMs to inflate the balloon so you get memory back, but they cannot give up memory safely. In that scenario (a) you cannot swap something out because all pages are pinned (b) memory compression cannot be applied because pages are pinned and (c) memory deduplication cannot be applied because pages are pinned. Pinned memory is a resource that cannot be overcommitted. So I am not convinced the use case you are targeting can be considered any way of sane memory overcommit. You should better call it resizing VM memory instead. 
Then, it's clearer that the hypervisor cannot promise to ever give you that memory when you are in need. 2) What about other features? What if the hypervisor enabled free-page-reporting? Would that work (unlikely, I assume). Don't we have to block that? -- Cheers, David / dhildenb
Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES
For now, SEV pins guest's memory to avoid swapping or moving ciphertext, but leading to the inhibition of Memory Ballooning. In Memory Ballooning, only guest's free pages will be relocated in balloon inflation and deflation, so the difference of plaintext doesn't matter to guest. This seems only true if the page is zeroed, is this true here? Sorry, I cannot figure out why the pages should be zeroed. I think both host kernel and guest kernel assume that the pages are not zeroed and will use kzalloc or manually zero them in real applications, which is same as non-SEV environments. balloon_page_alloc() will not zero the memory (no __GFP_ZERO set). Only in some configurations (zero-on-alloc, zero-on-free), the kernel would do that implicitly. So we'd eventually be leaking secrets to the untrusted hypervisor? I have tested in SEV-ES, reclaiming memory by balloon inflation and reuse them after balloon deflation both works well with the patch. Hypervisor can normally give the reclaimed memory from one CVM to another, or give back to the origin CVM. I'll comment on your misconception of memory overcommit separately. -- Cheers, David / dhildenb
Re: REGRESSION: lockdep warning triggered by 15b9ce7ecd: virtio_balloon: stay awake while adjusting balloon
e able to squash it anymore. From a99a1efa6a2b470a98ea2c87e58bebe90ce329a1 Mon Sep 17 00:00:00 2001 From: David Stevens Date: Tue, 9 Jan 2024 14:41:21 +0900 Subject: [PATCH] virtio_balloon: Fix interrupt context deadlock Use _irq spinlock functions with the adjustment_lock, since start_update_balloon_size needs to acquire it in an interrupt context. Fixes: 5b9ce7ecd715 ("virtio_balloon: stay awake while adjusting balloon") Reported-by: Theodore Ts'o Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index aa6a1a649ad6..1f5b3dd31fcf 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -459,12 +459,12 @@ static void start_update_balloon_size(struct virtio_balloon *vb) static void end_update_balloon_size(struct virtio_balloon *vb) { - spin_lock(&vb->adjustment_lock); + spin_lock_irq(&vb->adjustment_lock); if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { vb->adjustment_in_progress = false; pm_relax(vb->vdev->dev.parent); } - spin_unlock(&vb->adjustment_lock); + spin_unlock_irq(&vb->adjustment_lock); } static void virtballoon_changed(struct virtio_device *vdev) @@ -506,9 +506,9 @@ static void update_balloon_size_func(struct work_struct *work) vb = container_of(work, struct virtio_balloon, update_balloon_size_work); - spin_lock(&vb->adjustment_lock); + spin_lock_irq(&vb->adjustment_lock); vb->adjustment_signal_pending = false; - spin_unlock(&vb->adjustment_lock); + spin_unlock_irq(&vb->adjustment_lock); diff = towards_target(vb); Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2] virtio_balloon: stay awake while adjusting balloon
On 19.12.23 15:37, David Stevens wrote: On Mon, Dec 18, 2023 at 12:33 PM David Hildenbrand wrote: On 18.12.23 16:18, David Stevens wrote: From: David Stevens A virtio_balloon's parent device may be configured so that a configuration change interrupt is a wakeup event. Extend the processing of such a wakeup event until the balloon finishes inflating or deflating by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note that these calls are no-ops if the parent device doesn't support wakeup events or if the wakeup events are not enabled. This change allows the guest to use system power states such as s2idle without running the risk of the virtio_balloon's cooperative memory management becoming unresponsive to the host's requests. Signed-off-by: David Stevens --- v1 -> v2: - Use adjustment_signal_pending flag instead of a sequence number - Call pm_stay_awake/pm_relax on parent device instead of adding a wake event to the virtio balloon device drivers/virtio/virtio_balloon.c | 57 +++-- 1 file changed, 47 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1fe93e93f5bc..a3c11159cbe0 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -119,6 +119,11 @@ struct virtio_balloon { /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; + + /* State for keeping the wakeup_source active while adjusting the balloon */ + spinlock_t adjustment_lock; + bool adjustment_signal_pending; + bool adjustment_in_progress; }; static const struct virtio_device_id id_table[] = { @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb) queue_work(vb->balloon_wq, &vb->report_free_page_work); } +static void start_update_balloon_size(struct virtio_balloon *vb) +{ + unsigned long flags; + + spin_lock_irqsave(&vb->adjustment_lock, flags); + vb->adjustment_signal_pending = true; + if (!vb->adjustment_in_progress) 
{ + vb->adjustment_in_progress = true; + pm_stay_awake(vb->vdev->dev.parent); + } + spin_unlock_irqrestore(&vb->adjustment_lock, flags); + + queue_work(system_freezable_wq, &vb->update_balloon_size_work); +} + +static void end_update_balloon_size(struct virtio_balloon *vb) +{ + spin_lock(&vb->adjustment_lock); + if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { How could vb->adjustment_in_progress ever not be set at this point? + vb->adjustment_in_progress = false; + pm_relax(vb->vdev->dev.parent); + } + spin_unlock(&vb->adjustment_lock); +} + LGTM, although I wonder what happens when calling pm_stay_awake() etc. on a parent device that is not wakeup-event-capable? If the parent device is not wakeup capable or if wakeup isn't enabled, then the vb->vdev->dev.parent->power.wakeup pointer will be NULL, so the NULL checks in __pm_relax/__pm_stay_awake will immediately return. Ah, I missed that NULL check, makes sense. Thanks! -- Cheers, David / dhildenb
Re: [PATCH v2] virtio_balloon: stay awake while adjusting balloon
On 18.12.23 16:18, David Stevens wrote: From: David Stevens A virtio_balloon's parent device may be configured so that a configuration change interrupt is a wakeup event. Extend the processing of such a wakeup event until the balloon finishes inflating or deflating by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note that these calls are no-ops if the parent device doesn't support wakeup events or if the wakeup events are not enabled. This change allows the guest to use system power states such as s2idle without running the risk of the virtio_balloon's cooperative memory management becoming unresponsive to the host's requests. Signed-off-by: David Stevens --- v1 -> v2: - Use adjustment_signal_pending flag instead of a sequence number - Call pm_stay_awake/pm_relax on parent device instead of adding a wake event to the virtio balloon device drivers/virtio/virtio_balloon.c | 57 +++-- 1 file changed, 47 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1fe93e93f5bc..a3c11159cbe0 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -119,6 +119,11 @@ struct virtio_balloon { /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; + + /* State for keeping the wakeup_source active while adjusting the balloon */ + spinlock_t adjustment_lock; + bool adjustment_signal_pending; + bool adjustment_in_progress; }; static const struct virtio_device_id id_table[] = { @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb) queue_work(vb->balloon_wq, &vb->report_free_page_work); } +static void start_update_balloon_size(struct virtio_balloon *vb) +{ + unsigned long flags; + + spin_lock_irqsave(&vb->adjustment_lock, flags); + vb->adjustment_signal_pending = true; + if (!vb->adjustment_in_progress) { + vb->adjustment_in_progress = true; + pm_stay_awake(vb->vdev->dev.parent); + } + 
spin_unlock_irqrestore(&vb->adjustment_lock, flags); + + queue_work(system_freezable_wq, &vb->update_balloon_size_work); +} + +static void end_update_balloon_size(struct virtio_balloon *vb) +{ + spin_lock(&vb->adjustment_lock); + if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) { How could vb->adjustment_in_progress ever not be set at this point? + vb->adjustment_in_progress = false; + pm_relax(vb->vdev->dev.parent); + } + spin_unlock(&vb->adjustment_lock); +} + LGTM, although I wonder what happens when calling pm_stay_awake() etc. on a parent device that is not wakeup-event-capable? -- Cheers, David / dhildenb
Re: [PATCH] virtio_balloon: stay awake while adjusting balloon
On 18.12.23 16:16, David Stevens wrote: On Mon, Dec 18, 2023 at 6:37 AM David Hildenbrand wrote: On 14.12.23 05:13, David Stevens wrote: On Wed, Dec 13, 2023 at 5:44 PM David Hildenbrand wrote: On 11.12.23 12:43, David Stevens wrote: From: David Stevens Hi David, Add a wakeup event for when the balloon is inflating or deflating. Userspace can enable this wakeup event to prevent the system from suspending while the balloon is being adjusted. This allows /sys/power/wakeup_count to be used without breaking virtio_balloon's cooperative memory management. Can you add/share some more details I'm working on enabling support for Linux s2Idle in our Android virtual machine, to restrict apps from running in the background without holding an Android partial wakelock. With the patch I recently sent out [1], since crosvm advertises native PCI power management for virtio devices, the Android guest can properly enter s2idle, and it can be woken up by incoming IO. However, one of the remaining problems is that when the host needs to reclaim memory from the guest via the virtio-balloon, there is nothing preventing the guest from entering s2idle before the balloon driver finishes returning memory to the host. Thanks for the information. So you also want to wakeup the VM when wanting to get more memory from the VM? Using which mechanism would that wakeup happen? Likely not the device itself? The wakeup would happen via the parent device's interrupt. I've sent a new version of this patch that uses the parent device's wakeup event instead of adding a new one. One alternative to this approach would be to add a virtballoon_suspend callback to abort suspend if the balloon is inflating/adjusting. However, it seems cleaner to just prevent suspend in the first place. Also, the PM notifier could also be used with very high priority, so the device would respond early to PM_SUSPEND_PREPARE. 
One drawback of blocking suspend via a PM notifier is that the behavior isn't configurable by userspace, whereas wakeup events can be enabled/disabled by userspace. The question is if that behavior for the balloon is really worth it being configured by user space? -- Cheers, David / dhildenb
Re: [PATCH v2] virtio: Add support for no-reset virtio PCI PM
On 15.12.23 02:00, David Stevens wrote: On Thu, Dec 14, 2023 at 6:47 PM David Hildenbrand wrote: On 08.12.23 08:07, David Stevens wrote: If a virtio_pci_device supports native PCI power management and has the No_Soft_Reset bit set, then skip resetting and reinitializing the device when suspending and restoring the device. This allows system-wide low power states like s2idle to be used in systems with stateful virtio devices that can't simply be re-initialized (e.g. virtio-fs). Signed-off-by: David Stevens --- v1 -> v2: - Check the No_Soft_Reset bit drivers/virtio/virtio_pci_common.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c index c2524a7207cf..3a95ecaf12dc 100644 --- a/drivers/virtio/virtio_pci_common.c +++ b/drivers/virtio/virtio_pci_common.c @@ -492,8 +492,40 @@ static int virtio_pci_restore(struct device *dev) return virtio_device_restore(&vp_dev->vdev); } +static bool vp_supports_pm_no_reset(struct device *dev) +{ + struct pci_dev *pci_dev = to_pci_dev(dev); + u16 pmcsr; + + if (!pci_dev->pm_cap) + return false; + + pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (PCI_POSSIBLE_ERROR(pmcsr)) { + dev_err(dev, "Unable to query pmcsr"); + return false; + } + + return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET; +} + +static int virtio_pci_suspend(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev); +} + +static int virtio_pci_resume(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 
0 : virtio_pci_restore(dev); +} + static const struct dev_pm_ops virtio_pci_pm_ops = { - SET_SYSTEM_SLEEP_PM_OPS(virtio_pci_freeze, virtio_pci_restore) + .suspend = virtio_pci_suspend, + .resume = virtio_pci_resume, + .freeze = virtio_pci_freeze, + .thaw = virtio_pci_restore, + .poweroff = virtio_pci_freeze, + .restore = virtio_pci_restore, }; #endif Am I correct with my assumption that this will make s2idle work with virtio-mem-pci as well? Right now, all suspending is disabled, but really only s4/hibernate is problematic. [root@vm-0 ~]# echo "s2idle" > /sys/power/mem_sleep [root@vm-0 ~]# echo "mem" > /sys/power/state [ 101.822991] PM: suspend entry (s2idle) [ 101.828978] Filesystems sync: 0.004 seconds [ 101.831618] Freezing user space processes [ 101.834569] Freezing user space processes completed (elapsed 0.001 seconds) [ 101.836915] OOM killer disabled. [ 101.838072] Freezing remaining freezable tasks [ 101.841054] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 101.843538] printk: Suspending console(s) (use no_console_suspend to debug) [ 101.957676] virtio_mem virtio0: save/restore not supported. [ 101.957683] virtio-pci 0000:00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1 [ 101.957702] virtio-pci 0000:00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1 [ 101.957718] virtio-pci 0000:00:02.0: PM: failed to suspend async: error -1 QEMU's virtio-pci devices don't advertise no_soft_reset, so this patch won't affect vanilla QEMU. But if you add PCI_PM_CTRL_NO_SOFT_RESET to the capability, then it should work. I'm working with crosvm, which doesn't have virtio-mem implemented, but this patch makes virtio-fs work with no extra kernel changes. Thanks for the explanation, makes sense to me. -- Cheers, David / dhildenb
Re: [PATCH] virtio_balloon: stay awake while adjusting balloon
On 14.12.23 05:13, David Stevens wrote: On Wed, Dec 13, 2023 at 5:44 PM David Hildenbrand wrote: On 11.12.23 12:43, David Stevens wrote: From: David Stevens Hi David, Add a wakeup event for when the balloon is inflating or deflating. Userspace can enable this wakeup event to prevent the system from suspending while the balloon is being adjusted. This allows /sys/power/wakeup_count to be used without breaking virtio_balloon's cooperative memory management. Can you add/share some more details I'm working on enabling support for Linux s2Idle in our Android virtual machine, to restrict apps from running in the background without holding an Android partial wakelock. With the patch I recently sent out [1], since crosvm advertises native PCI power management for virtio devices, the Android guest can properly enter s2idle, and it can be woken up by incoming IO. However, one of the remaining problems is that when the host needs to reclaim memory from the guest via the virtio-balloon, there is nothing preventing the guest from entering s2idle before the balloon driver finishes returning memory to the host. Thanks for the information. So you also want to wakeup the VM when wanting to get more memory from the VM? Using which mechanism would that wakeup happen? Likely not the device itself? One alternative to this approach would be to add a virtballoon_suspend callback to abort suspend if the balloon is inflating/adjusting. However, it seems cleaner to just prevent suspend in the first place. Also, the PM notifier could also be used with very high priority, so the device would respond early to PM_SUSPEND_PREPARE. [...] queue_work(system_freezable_wq, work); + else + end_update_balloon_size(vb, seqno); What if we stop the workqueue and unload the driver -- see remove_common() -- won't you leave pm_stay_awake() wrongly set? When a device gets removed, its wakeup source is destroyed, which automatically calls __pm_relax. Ah, thanks. 
} static int init_vqs(struct virtio_balloon *vb) @@ -992,6 +1028,9 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_unregister_oom; } + spin_lock_init(&vb->adjustment_lock); + device_set_wakeup_capable(&vb->vdev->dev, true); I'm a bit confused: Documentation/driver-api/pm/devices.rst documents " The :c:member:`power.can_wakeup` flag just records whether the device (and its driver) can physically support wakeup events. The :c:func:`device_set_wakeup_capable()` routine affects this flag. " ... " Whether or not a device is capable of issuing wakeup events is a hardware matter, and the kernel is responsible for keeping track of it. " But how is the virtio-balloon device capable of waking up the machine? Your patch merely implies that the virtio-balloon device is capable of prohibiting going to sleep. What am I missing? The underlying virtio_pci_device is capable of waking up the machine, if it supports PCI power management. The core PCI code will keep the machine awake while processing the interrupt (i.e. during virtballoon_changed), but after processing is handed off to the virtio-balloon driver, the virtio-balloon driver needs to keep the machine awake until the processing is actually completed. An alternative to making vb->vdev->dev wakeup capable is to plumb the pm_stay_awake/pm_relax calls to the underlying virtio_pci_device. Would that be a preferable approach? The way you describe it, it rather belongs into the PCI code, because that's what actually makes the device PM-capable (i.e., would not apply to virtio-ccw or virtio-mmio). The virtio-balloon itself is not PM-capable. But how hard is it to move that handling into PCI specific code? -- Cheers, David / dhildenb
Re: [PATCH v2] virtio: Add support for no-reset virtio PCI PM
On 08.12.23 08:07, David Stevens wrote: If a virtio_pci_device supports native PCI power management and has the No_Soft_Reset bit set, then skip resetting and reinitializing the device when suspending and restoring the device. This allows system-wide low power states like s2idle to be used in systems with stateful virtio devices that can't simply be re-initialized (e.g. virtio-fs). Signed-off-by: David Stevens --- v1 -> v2: - Check the No_Soft_Reset bit drivers/virtio/virtio_pci_common.c | 34 +- 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c index c2524a7207cf..3a95ecaf12dc 100644 --- a/drivers/virtio/virtio_pci_common.c +++ b/drivers/virtio/virtio_pci_common.c @@ -492,8 +492,40 @@ static int virtio_pci_restore(struct device *dev) return virtio_device_restore(&vp_dev->vdev); } +static bool vp_supports_pm_no_reset(struct device *dev) +{ + struct pci_dev *pci_dev = to_pci_dev(dev); + u16 pmcsr; + + if (!pci_dev->pm_cap) + return false; + + pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, &pmcsr); + if (PCI_POSSIBLE_ERROR(pmcsr)) { + dev_err(dev, "Unable to query pmcsr"); + return false; + } + + return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET; +} + +static int virtio_pci_suspend(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev); +} + +static int virtio_pci_resume(struct device *dev) +{ + return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev); +} + static const struct dev_pm_ops virtio_pci_pm_ops = { - SET_SYSTEM_SLEEP_PM_OPS(virtio_pci_freeze, virtio_pci_restore) + .suspend = virtio_pci_suspend, + .resume = virtio_pci_resume, + .freeze = virtio_pci_freeze, + .thaw = virtio_pci_restore, + .poweroff = virtio_pci_freeze, + .restore = virtio_pci_restore, }; #endif Am I correct with my assumption that this will make s2idle work with virtio-mem-pci as well? Right now, all suspending is disabled, but really only s4/hibernate is problematic.
[root@vm-0 ~]# echo "s2idle" > /sys/power/mem_sleep [root@vm-0 ~]# echo "mem" > /sys/power/state [ 101.822991] PM: suspend entry (s2idle) [ 101.828978] Filesystems sync: 0.004 seconds [ 101.831618] Freezing user space processes [ 101.834569] Freezing user space processes completed (elapsed 0.001 seconds) [ 101.836915] OOM killer disabled. [ 101.838072] Freezing remaining freezable tasks [ 101.841054] Freezing remaining freezable tasks completed (elapsed 0.001 seconds) [ 101.843538] printk: Suspending console(s) (use no_console_suspend to debug) [ 101.957676] virtio_mem virtio0: save/restore not supported. [ 101.957683] virtio-pci 0000:00:02.0: PM: pci_pm_suspend(): virtio_pci_freeze+0x0/0x50 returns -1 [ 101.957702] virtio-pci 0000:00:02.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -1 [ 101.957718] virtio-pci 0000:00:02.0: PM: failed to suspend async: error -1 -- Cheers, David / dhildenb
Re: [PATCH v5 3/4] mm/memory_hotplug: export mhp_supports_memmap_on_memory()
On 14.12.23 08:37, Vishal Verma wrote: In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Signed-off-by: Vishal Verma --- include/linux/memory_hotplug.h | 6 ++ mm/memory_hotplug.c| 17 ++--- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7d2076583494..ebc9d528f00c 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -121,6 +121,7 @@ struct mhp_params { bool mhp_range_allowed(u64 start, u64 size, bool need_mapping); struct range mhp_get_pluggable_range(bool need_mapping); +bool mhp_supports_memmap_on_memory(void); /* * Zone resizing functions @@ -262,6 +263,11 @@ static inline bool movable_node_is_enabled(void) return false; } +static bool mhp_supports_memmap_on_memory(void) +{ + return false; +} + static inline void pgdat_kswapd_lock(pg_data_t *pgdat) {} static inline void pgdat_kswapd_unlock(pg_data_t *pgdat) {} static inline void pgdat_kswapd_lock_init(pg_data_t *pgdat) {} diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 926e1cfb10e9..751664c519f7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1325,7 +1325,7 @@ static inline bool arch_supports_memmap_on_memory(unsigned long vmemmap_size) } #endif -static bool mhp_supports_memmap_on_memory(unsigned long size) +bool 
mhp_supports_memmap_on_memory(void) { unsigned long vmemmap_size = memory_block_memmap_size(); unsigned long memmap_pages = memory_block_memmap_on_memory_pages(); @@ -1334,17 +1334,11 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) * Besides having arch support and the feature enabled at runtime, we * need a few more assumptions to hold true: * -* a) We span a single memory block: memory onlining/offlining happens -*in memory block granularity. We don't want the vmemmap of online -*memory blocks to reside on offline memory blocks. In the future, -*we might want to support variable-sized memory blocks to make the -*feature more versatile. -* -* b) The vmemmap pages span complete PMDs: We don't want vmemmap code +* a) The vmemmap pages span complete PMDs: We don't want vmemmap code *to populate memory from the altmap for unrelated parts (i.e., *other memory blocks) * -* c) The vmemmap pages (and thereby the pages that will be exposed to +* b) The vmemmap pages (and thereby the pages that will be exposed to *the buddy) have to cover full pageblocks: memory onlining/offlining *code requires applicable ranges to be page-aligned, for example, to *set the migratetypes properly. @@ -1356,7 +1350,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD.
*/ - if (!mhp_memmap_on_memory() || size != memory_block_size_bytes()) + if (!mhp_memmap_on_memory()) return false; /* @@ -1379,6 +1373,7 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory); static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) { @@ -1512,7 +1507,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) * Self hosted memmap array */ if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) && - mhp_supports_memmap_on_memory(memory_block_size_bytes())) { + mhp_supports_memmap_on_memory()) { ret = create_altmaps_and_memory_blocks(nid, group, start, size); if (ret) goto error; Acked-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH] virtio_balloon: stay awake while adjusting balloon
On 11.12.23 12:43, David Stevens wrote: From: David Stevens Hi David, Add a wakeup event for when the balloon is inflating or deflating. Userspace can enable this wakeup event to prevent the system from suspending while the balloon is being adjusted. This allows /sys/power/wakeup_count to be used without breaking virtio_balloon's cooperative memory management. Can you add/share some more details Signed-off-by: David Stevens --- drivers/virtio/virtio_balloon.c | 59 +++-- 1 file changed, 49 insertions(+), 10 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 1fe93e93f5bc..811d8937246a 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -119,6 +119,11 @@ struct virtio_balloon { /* Free page reporting device */ struct virtqueue *reporting_vq; struct page_reporting_dev_info pr_dev_info; + + /* State for keeping the wakeup_source active while adjusting the balloon */ + spinlock_t adjustment_lock; + u32 adjustment_seqno; Using a simple flag that gets set when updating the balloon size and test-and-clear when testing for changes should be easier to get. bool adjustment_balloon_size_changed; or sth like that. 
+ bool adjustment_in_progress; }; static const struct virtio_device_id id_table[] = { @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct virtio_balloon *vb) queue_work(vb->balloon_wq, &vb->report_free_page_work); } +static void start_update_balloon_size(struct virtio_balloon *vb) +{ + unsigned long flags; + + spin_lock_irqsave(&vb->adjustment_lock, flags); + vb->adjustment_seqno++; + if (!vb->adjustment_in_progress) { + vb->adjustment_in_progress = true; + pm_stay_awake(&vb->vdev->dev); + } + spin_unlock_irqrestore(&vb->adjustment_lock, flags); + + queue_work(system_freezable_wq, &vb->update_balloon_size_work); +} + +static void end_update_balloon_size(struct virtio_balloon *vb, u32 seqno) +{ + spin_lock(&vb->adjustment_lock); + if (vb->adjustment_seqno == seqno && vb->adjustment_in_progress) { + vb->adjustment_in_progress = false; + pm_relax(&vb->vdev->dev); + } + spin_unlock(&vb->adjustment_lock); +} + static void virtballoon_changed(struct virtio_device *vdev) { struct virtio_balloon *vb = vdev->priv; @@ -444,8 +474,7 @@ static void virtballoon_changed(struct virtio_device *vdev) spin_lock_irqsave(&vb->stop_update_lock, flags); if (!vb->stop_update) { - queue_work(system_freezable_wq, - &vb->update_balloon_size_work); + start_update_balloon_size(vb); virtio_balloon_queue_free_page_work(vb); } spin_unlock_irqrestore(&vb->stop_update_lock, flags); @@ -473,22 +502,29 @@ static void update_balloon_size_func(struct work_struct *work) { struct virtio_balloon *vb; s64 diff; + u32 seqno; vb = container_of(work, struct virtio_balloon, update_balloon_size_work); - diff = towards_target(vb); - if (!diff) - return; + spin_lock(&vb->adjustment_lock); + seqno = vb->adjustment_seqno; + spin_unlock(&vb->adjustment_lock); - if (diff > 0) - diff -= fill_balloon(vb, diff); - else - diff += leak_balloon(vb, -diff); - update_balloon_size(vb); + diff = towards_target(vb); + + if (diff) { + if (diff > 0) + diff -= fill_balloon(vb, diff); + else + diff += leak_balloon(vb, -diff); + update_balloon_size(vb); + }
if (diff) queue_work(system_freezable_wq, work); + else + end_update_balloon_size(vb, seqno); What if we stop the workqueue and unload the driver -- see remove_common() -- won't you leave pm_stay_awake() wrongly set? } static int init_vqs(struct virtio_balloon *vb) @@ -992,6 +1028,9 @@ static int virtballoon_probe(struct virtio_device *vdev) goto out_unregister_oom; } + spin_lock_init(&vb->adjustment_lock); + device_set_wakeup_capable(&vb->vdev->dev, true); I'm a bit confused: Documentation/driver-api/pm/devices.rst documents " The :c:member:`power.can_wakeup` flag just records whether the device (and its driver) can physically support wakeup events. The :c:func:`device_set_wakeup_capable()` routine affects this flag. " ... " Whether or not a device is capable of issuing wakeup events is a hardware matter, and the kernel is responsible for keeping track of it. " But how is the virtio-balloon device capable of waking up the machine? Your patch merely implies that the virtio-balloon device is capable of prohibiting going to sleep. What am I missing? -- Cheers, David / dhildenb
Re: [PATCH v3 2/2] dax: add a sysfs knob to control memmap_on_memory behavior
On 11.12.23 23:52, Vishal Verma wrote: Add a sysfs knob for dax devices to control the memmap_on_memory setting if the dax device were to be hotplugged as system memory. The default memmap_on_memory setting for dax devices originating via pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to preserve legacy behavior. For dax devices via CXL, the default is on. The sysfs control allows the administrator to override the above defaults if needed. Cc: David Hildenbrand Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Tested-by: Li Zhijian Reviewed-by: Jonathan Cameron Reviewed-by: David Hildenbrand Signed-off-by: Vishal Verma --- drivers/dax/bus.c | 47 + Documentation/ABI/testing/sysfs-bus-dax | 17 2 files changed, 64 insertions(+) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1ff1ab5fa105..2871e5188f0d 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1270,6 +1270,52 @@ static ssize_t numa_node_show(struct device *dev, } static DEVICE_ATTR_RO(numa_node); +static ssize_t memmap_on_memory_show(struct device *dev, +struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory); +} + +static ssize_t memmap_on_memory_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct device_driver *drv = dev->driver; + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + struct dax_device_driver *dax_drv = to_dax_drv(drv); + ssize_t rc; + bool val; + + rc = kstrtobool(buf, &val); + if (rc) + return rc; + + if (dev_dax->memmap_on_memory == val) + return len; + + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return -ENXIO; + } + + if (dax_drv->type == DAXDRV_KMEM_TYPE) { + device_unlock(dax_region->dev); + return -EBUSY; + } + + device_lock(dev); + dev_dax->memmap_on_memory = val; + device_unlock(dev); + +
device_unlock(dax_region->dev); + return len; +} +static DEVICE_ATTR_RW(memmap_on_memory); + static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); @@ -1296,6 +1342,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_align.attr, &dev_attr_resource.attr, &dev_attr_numa_node.attr, + &dev_attr_memmap_on_memory.attr, NULL, }; diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax index a61a7b186017..b1fd8bf8a7de 100644 --- a/Documentation/ABI/testing/sysfs-bus-dax +++ b/Documentation/ABI/testing/sysfs-bus-dax @@ -149,3 +149,20 @@ KernelVersion: v5.1 Contact: nvd...@lists.linux.dev Description: (RO) The id attribute indicates the region id of a dax region. + +What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory +Date: October, 2023 +KernelVersion: v6.8 +Contact: nvd...@lists.linux.dev +Description: + (RW) Control the memmap_on_memory setting if the dax device + were to be hotplugged as system memory. This determines whether + the 'altmap' for the hotplugged memory will be placed on the + device being hotplugged (memmap_on_memory=1) or if it will be + placed on regular memory (memmap_on_memory=0). This attribute + must be set before the device is handed over to the 'kmem' + driver (i.e. hotplugged into system-ram). Additionally, this + depends on CONFIG_MHP_MEMMAP_ON_MEMORY, and a globally enabled + memmap_on_memory parameter for memory_hotplug. This is + typically set on the kernel command line - + memory_hotplug.memmap_on_memory set to 'true' or 'force'. Thinking about it, I wonder if we could disallow setting that property to "true" if the current configuration does not allow it. That is: 1) Removing the "size" parameter from mhp_supports_memmap_on_memory(), it doesn't make any sense anymore. 2) Exporting mhp_supports_memmap_on_memory() to modules. 3) When setting memmap_on_memory, check whether mhp_supports_memmap_on_memory() == true.
Then, the user really gets an error when trying to set it to "true". -- Cheers, David / dhildenb
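Following the ABI document quoted above, the resulting knob would be driven from user space roughly like this (the device name `dax0.0` is an example; the write must happen before the device is bound to the kmem driver, and it only takes effect when the kernel-wide prerequisites mentioned in the document are met):

```shell
# Check, then set, the per-device memmap_on_memory policy before handing
# the device to kmem (i.e. before hotplugging it as system RAM).
cat /sys/bus/dax/devices/dax0.0/memmap_on_memory
echo 1 > /sys/bus/dax/devices/dax0.0/memmap_on_memory
```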
Re: [PATCH v2 2/2] dax: add a sysfs knob to control memmap_on_memory behavior
On 07.12.23 05:36, Vishal Verma wrote: Add a sysfs knob for dax devices to control the memmap_on_memory setting if the dax device were to be hotplugged as system memory. The default memmap_on_memory setting for dax devices originating via pmem or hmem is set to 'false' - i.e. no memmap_on_memory semantics, to preserve legacy behavior. For dax devices via CXL, the default is on. The sysfs control allows the administrator to override the above defaults if needed. Cc: David Hildenbrand Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Reviewed-by: Jonathan Cameron Reviewed-by: David Hildenbrand Signed-off-by: Vishal Verma --- drivers/dax/bus.c | 40 + Documentation/ABI/testing/sysfs-bus-dax | 13 +++ 2 files changed, 53 insertions(+) diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 1ff1ab5fa105..11abb57cc031 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -1270,6 +1270,45 @@ static ssize_t numa_node_show(struct device *dev, } static DEVICE_ATTR_RO(numa_node); +static ssize_t memmap_on_memory_show(struct device *dev, +struct device_attribute *attr, char *buf) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + + return sprintf(buf, "%d\n", dev_dax->memmap_on_memory); +} + +static ssize_t memmap_on_memory_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t len) +{ + struct dev_dax *dev_dax = to_dev_dax(dev); + struct dax_region *dax_region = dev_dax->region; + ssize_t rc; + bool val; + + rc = kstrtobool(buf, &val); + if (rc) + return rc; + + if (dev_dax->memmap_on_memory == val) + return len; + + device_lock(dax_region->dev); + if (!dax_region->dev->driver) { + device_unlock(dax_region->dev); + return -ENXIO; + } + + device_lock(dev); + dev_dax->memmap_on_memory = val; + device_unlock(dev); + + device_unlock(dax_region->dev); + return rc == 0 ?
len : rc; +} +static DEVICE_ATTR_RW(memmap_on_memory); + static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n) { struct device *dev = container_of(kobj, struct device, kobj); @@ -1296,6 +1335,7 @@ static struct attribute *dev_dax_attributes[] = { &dev_attr_align.attr, &dev_attr_resource.attr, &dev_attr_numa_node.attr, + &dev_attr_memmap_on_memory.attr, NULL, }; diff --git a/Documentation/ABI/testing/sysfs-bus-dax b/Documentation/ABI/testing/sysfs-bus-dax index a61a7b186017..bb063a004e41 100644 --- a/Documentation/ABI/testing/sysfs-bus-dax +++ b/Documentation/ABI/testing/sysfs-bus-dax @@ -149,3 +149,16 @@ KernelVersion: v5.1 Contact: nvd...@lists.linux.dev Description: (RO) The id attribute indicates the region id of a dax region. + +What: /sys/bus/dax/devices/daxX.Y/memmap_on_memory +Date: October, 2023 +KernelVersion: v6.8 +Contact: nvd...@lists.linux.dev +Description: + (RW) Control the memmap_on_memory setting if the dax device + were to be hotplugged as system memory. This determines whether + the 'altmap' for the hotplugged memory will be placed on the + device being hotplugged (memmap_on_memory=1) or if it will be + placed on regular memory (memmap_on_memory=0). This attribute + must be set before the device is handed over to the 'kmem' + driver (i.e. hotplugged into system-ram). Should we note the dependency on other factors as given in mhp_supports_memmap_on_memory(), especially, the system-wide setting and some weird kernel configurations? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 30.11.23 15:33, Alexandru Elisei wrote: On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote: On 30.11.23 14:32, Alexandru Elisei wrote: Hi, On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote: + +out_retry: + put_page(page); + if (vmf->flags & FAULT_FLAG_VMA_LOCK) + vma_end_read(vma); + if (fault_flag_allow_retry_first(vmf->flags)) { + err = VM_FAULT_RETRY; + } else { + /* Replay the fault. */ + err = 0; Hello! Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur. I guess it creates a system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!) How about migrating the page when the migration problem repeats? Yes, I had the same thought in the previous iteration of the series, the page was migrated out of the VMA if tag storage couldn't be reserved. Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the pin will be released before the fault is replayed. Because of this, and because it makes the code simpler, I chose not to migrate the page if tag storage couldn't be reserved. There are still some cases that are theoretically problematic: vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet. All these things also affect other users that rely on movability (e.g., CMA, memory hotunplug). I wasn't aware of that, thank you for the information. Then to ensure that the process doesn't hang by replaying the loop indefinitely, I'll migrate the page if tag storage cannot be reserved. Looking over the code again, I think I can reuse the same function that migrates tag storage pages out of the MTE VMA (added in patch #21), so no major changes needed. It's going to be interesting if migrating that page fails because it is pinned :/ I imagine that having both the page **and** its tag storage pinned longterm without FOLL_LONGTERM is going to be exceedingly rare. Yes.
I recall that the rule of thumb is that some O_DIRECT I/O can take up to 10 seconds, although extremely rare (and maybe not applicable on arm64). Am I mistaken in believing that the problematic vmsplice() behaviour is recognized as something that needs to be fixed? Yes, for a couple of years I'm hoping this will actually get fixed now that O_DIRECT mostly uses FOLL_PIN instead of FOLL_GET. -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 30.11.23 14:32, Alexandru Elisei wrote: Hi, On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote: + +out_retry: + put_page(page); + if (vmf->flags & FAULT_FLAG_VMA_LOCK) + vma_end_read(vma); + if (fault_flag_allow_retry_first(vmf->flags)) { + err = VM_FAULT_RETRY; + } else { + /* Replay the fault. */ + err = 0; Hello! Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur. I guess it creates a system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!) How about migrating the page when the migration problem repeats? Yes, I had the same thought in the previous iteration of the series, the page was migrated out of the VMA if tag storage couldn't be reserved. Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the pin will be released before the fault is replayed. Because of this, and because it makes the code simpler, I chose not to migrate the page if tag storage couldn't be reserved. There are still some cases that are theoretically problematic: vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet. All these things also affect other users that rely on movability (e.g., CMA, memory hotunplug). I wasn't aware of that, thank you for the information. Then to ensure that the process doesn't hang by replaying the loop indefinitely, I'll migrate the page if tag storage cannot be reserved. Looking over the code again, I think I can reuse the same function that migrates tag storage pages out of the MTE VMA (added in patch #21), so no major changes needed. It's going to be interesting if migrating that page fails because it is pinned :/ -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 28.11.23 18:55, David Hildenbrand wrote: On 19.11.23 17:57, Alexandru Elisei wrote: To enable tagging on a memory range, userspace can use mprotect() with the PROT_MTE access flag. Pages already mapped in the VMA don't have the associated tag storage block reserved, so mark the PTEs as PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and reserve the tag storage on the fault path. That sounds a lot like fake PROT_NONE. Would there be a way to unify that handling and simply reuse pte_protnone()? For example, could we special case on VMA flags? Like, don't do NUMA hinting in these special VMAs. Then, have something like: if (pte_protnone(vmf->orig_pte)) return handle_pte_protnone(vmf); Thinking out loud: maybe there isn't even the need to special-case on the VMA. Arch code should know if there is something to do. If not, it surely was triggered by NUMA hinting. So maybe that could be handled in handle_pte_protnone() quite nicely. -- Cheers, David / dhildenb
Re: [PATCH RFC v2 19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)
On 19.11.23 17:57, Alexandru Elisei wrote: To enable tagging on a memory range, userspace can use mprotect() with the PROT_MTE access flag. Pages already mapped in the VMA don't have the associated tag storage block reserved, so mark the PTEs as PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and reserve the tag storage on the fault path. That sounds a lot like fake PROT_NONE. Would there be a way to unify that handling and simply reuse pte_protnone()? For example, could we special case on VMA flags? Like, don't do NUMA hinting in these special VMAs. Then, have something like: if (pte_protnone(vmf->orig_pte)) return handle_pte_protnone(vmf); In there, special case on the VMA flags. I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some code from the prot_none handling path. At least the recovery path and writability handling looks like it better be located shared in handle_pte_protnone() as well. That might take some magic out of this patch. -- Cheers, David / dhildenb
Re: [PATCH RFC v2 18/27] arm64: mte: Reserve tag block for the zero page
On 19.11.23 17:57, Alexandru Elisei wrote: On arm64, the zero page receives special treatment by having the tagged flag set on MTE initialization, not when the page is mapped in a process address space. Reserve the corresponding tag block when tag storage management is being activated. Out of curiosity: why does the shared zeropage require tagged storage? What about the huge zeropage? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
On 27.11.23 16:07, Alexandru Elisei wrote: Hi, On Fri, Nov 24, 2023 at 08:54:12PM +0100, David Hildenbrand wrote: On 19.11.23 17:57, Alexandru Elisei wrote: To be able to reserve the tag storage associated with a page requires that the tag storage page can be migrated. When HW KASAN is enabled, the kernel allocates pages, which are now tagged, in non-preemptible contexts, which can make reserving the associated tag storage impossible. I assume that it's the only in-kernel user that actually requires tagged memory (besides for user space), correct? Indeed, this is the case. I'll expand the commit message to be more clear about it. Great, thanks! -- Cheers, David / dhildenb
Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
On 27.11.23 16:01, Alexandru Elisei wrote: Hi David, On Fri, Nov 24, 2023 at 08:40:55PM +0100, David Hildenbrand wrote: On 19.11.23 17:57, Alexandru Elisei wrote: Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows the page allocator to manage them like regular pages. This migratetype lends the pages some very desirable properties: * They cannot be longterm pinned, meaning they will always be migratable. * The pages can be allocated explicitly by using their PFN (with alloc_contig_range()) when they are needed to store tags. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 1 + arch/arm64/kernel/mte_tag_storage.c | 68 + include/linux/mmzone.h | 5 +++ mm/internal.h | 3 -- 4 files changed, 74 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe8276fdc7a8..047487046e8f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2065,6 +2065,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool "Dynamic MTE tag storage management" + select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware for storing MTE tags. This memory, unlike normal memory, cannot be diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index fa6267ef8392..427f4f1909f3 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -5,10 +5,12 @@ * Copyright (C) 2023 ARM Ltd.
*/ +#include #include #include #include #include +#include #include #include #include @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname, return ret; } + /* Pages are managed in pageblock_nr_pages chunks */ + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) { + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), + PFN_PHYS(pageblock_nr_pages)); + return -EINVAL; + } + ret = tag_storage_get_memory_node(node, _node); if (ret) return ret; @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void) pr_info("MTE tag storage region management disabled"); } } + +static int __init mte_tag_storage_activate_regions(void) +{ + phys_addr_t dram_start, dram_end; + struct range *tag_range; + unsigned long pfn; + int i, ret; + + if (num_tag_regions == 0) + return 0; + + dram_start = memblock_start_of_DRAM(); + dram_end = memblock_end_of_DRAM(); + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + /* +* Tag storage region was clipped by arm64_bootmem_init() +* enforcing addressing limits. +*/ + if (PFN_PHYS(tag_range->start) < dram_start || + PFN_PHYS(tag_range->end) >= dram_end) { + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end)); + ret = -EINVAL; + goto out_disabled; + } + } + + /* +* MTE disabled, tag storage pages can be used like any other pages. The +* only restriction is that the pages cannot be used by kexec because +* the memory remains marked as reserved in the memblock allocator.
+*/ + if (!system_supports_mte()) { + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++) + free_reserved_page(pfn_to_page(pfn)); + } + ret = 0; + goto out_disabled; + } + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + totalcma_pages += range_len(tag_range); + } You shouldn't be doing that manually in arm code. Likely you want some cma.c helper for something like that. If you're referring to the last loop (the one that does init_cma_reserved_pageblock()), indeed, there's already a function which does that, cma_init_reserved_areas() -> cma_activate_area(). But, can you elaborate on why you took this
Re: [PATCH RFC v2 15/27] arm64: mte: Check that tag storage blocks are in the same zone
On 19.11.23 17:57, Alexandru Elisei wrote: alloc_contig_range() requires that the requested pages are in the same zone. Check that this is indeed the case before initializing the tag storage blocks. Signed-off-by: Alexandru Elisei --- arch/arm64/kernel/mte_tag_storage.c | 33 + 1 file changed, 33 insertions(+) diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index 8b9bedf7575d..fd63430d4dc0 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -265,6 +265,35 @@ void __init mte_tag_storage_init(void) } } +/* alloc_contig_range() requires all pages to be in the same zone. */ +static int __init mte_tag_storage_check_zone(void) +{ + struct range *tag_range; + struct zone *zone; + unsigned long pfn; + u32 block_size; + int i, j; + + for (i = 0; i < num_tag_regions; i++) { + block_size = tag_regions[i].block_size; + if (block_size == 1) + continue; + + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += block_size) { + zone = page_zone(pfn_to_page(pfn)); + for (j = 1; j < block_size; j++) { + if (page_zone(pfn_to_page(pfn + j)) != zone) { + pr_err("Tag storage block pages in different zones"); + return -EINVAL; + } + } + } + } + + return 0; +} + Looks like something that ordinary CMA provides. See cma_activate_area(). Can't we find a way to let CMA do CMA thingies and only be a user of that? What would be required to make the performance issue you spelled out in the cover letter be gone and not have to open-code that in arch code? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 14/27] arm64: mte: Disable dynamic tag storage management if HW KASAN is enabled
On 19.11.23 17:57, Alexandru Elisei wrote: To be able to reserve the tag storage associated with a page requires that the tag storage page can be migrated. When HW KASAN is enabled, the kernel allocates pages, which are now tagged, in non-preemptible contexts, which can make reserving the associated tag storage impossible. I assume that it's the only in-kernel user that actually requires tagged memory (besides for user space), correct? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 13/27] arm64: mte: Make tag storage depend on ARCH_KEEP_MEMBLOCK
On 19.11.23 17:57, Alexandru Elisei wrote: Tag storage memory requires that the tag storage pages used for data are always migratable when they need to be repurposed to store tags. If ARCH_KEEP_MEMBLOCK is enabled, kexec will scan all non-reserved memblocks to find a suitable location for copying the kernel image. The kernel image, once loaded, cannot be moved to another location in physical memory. The initialization code for the tag storage reserves the memblocks for the tag storage pages, which means kexec will not use them, and the tag storage pages can be migrated at any time, which is the desired behaviour. However, if ARCH_KEEP_MEMBLOCK is not selected, kexec will not skip a region unless the memory resource has the IORESOURCE_SYSRAM_DRIVER_MANAGED flag, which isn't currently set by the tag storage initialization code. Make ARM64_MTE_TAG_STORAGE depend on ARCH_KEEP_MEMBLOCK to make it explicit that the Kconfig option is required for it to work correctly. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 047487046e8f..efa5b7958169 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2065,6 +2065,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool "Dynamic MTE tag storage management" + depends on ARCH_KEEP_MEMBLOCK select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware Doesn't arm64 select that unconditionally? Why is this required then? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 12/27] arm64: mte: Add tag storage pages to the MIGRATE_CMA migratetype
On 19.11.23 17:57, Alexandru Elisei wrote: Add the MTE tag storage pages to the MIGRATE_CMA migratetype, which allows the page allocator to manage them like regular pages. This migratetype lends the pages some very desirable properties: * They cannot be longterm pinned, meaning they will always be migratable. * The pages can be allocated explicitly by using their PFN (with alloc_contig_range()) when they are needed to store tags. Signed-off-by: Alexandru Elisei --- arch/arm64/Kconfig | 1 + arch/arm64/kernel/mte_tag_storage.c | 68 + include/linux/mmzone.h | 5 +++ mm/internal.h | 3 -- 4 files changed, 74 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe8276fdc7a8..047487046e8f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2065,6 +2065,7 @@ config ARM64_MTE if ARM64_MTE config ARM64_MTE_TAG_STORAGE bool "Dynamic MTE tag storage management" + select CONFIG_CMA help Adds support for dynamic management of the memory used by the hardware for storing MTE tags. This memory, unlike normal memory, cannot be diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c index fa6267ef8392..427f4f1909f3 100644 --- a/arch/arm64/kernel/mte_tag_storage.c +++ b/arch/arm64/kernel/mte_tag_storage.c @@ -5,10 +5,12 @@ * Copyright (C) 2023 ARM Ltd.
*/ +#include #include #include #include #include +#include #include #include #include @@ -189,6 +191,14 @@ static int __init fdt_init_tag_storage(unsigned long node, const char *uname, return ret; } + /* Pages are managed in pageblock_nr_pages chunks */ + if (!IS_ALIGNED(tag_range->start | range_len(tag_range), pageblock_nr_pages)) { + pr_err("Tag storage region 0x%llx-0x%llx not aligned to pageblock size 0x%llx", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end), + PFN_PHYS(pageblock_nr_pages)); + return -EINVAL; + } + ret = tag_storage_get_memory_node(node, _node); if (ret) return ret; @@ -254,3 +264,61 @@ void __init mte_tag_storage_init(void) pr_info("MTE tag storage region management disabled"); } } + +static int __init mte_tag_storage_activate_regions(void) +{ + phys_addr_t dram_start, dram_end; + struct range *tag_range; + unsigned long pfn; + int i, ret; + + if (num_tag_regions == 0) + return 0; + + dram_start = memblock_start_of_DRAM(); + dram_end = memblock_end_of_DRAM(); + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + /* +* Tag storage region was clipped by arm64_bootmem_init() +* enforcing addressing limits. +*/ + if (PFN_PHYS(tag_range->start) < dram_start || + PFN_PHYS(tag_range->end) >= dram_end) { + pr_err("Tag storage region 0x%llx-0x%llx outside addressable memory", + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end)); + ret = -EINVAL; + goto out_disabled; + } + } + + /* +* MTE disabled, tag storage pages can be used like any other pages. The +* only restriction is that the pages cannot be used by kexec because +* the memory remains marked as reserved in the memblock allocator.
+*/ + if (!system_supports_mte()) { + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn++) + free_reserved_page(pfn_to_page(pfn)); + } + ret = 0; + goto out_disabled; + } + + for (i = 0; i < num_tag_regions; i++) { + tag_range = &tag_regions[i].tag_range; + for (pfn = tag_range->start; pfn <= tag_range->end; pfn += pageblock_nr_pages) + init_cma_reserved_pageblock(pfn_to_page(pfn)); + totalcma_pages += range_len(tag_range); + } You shouldn't be doing that manually in arm code. Likely you want some cma.c helper for something like that. But, can you elaborate on why you took this hacky (sorry) approach as documented in the cover letter: "The arm64 code manages this memory directly instead of using cma_declare_contiguous/cma_alloc for performance reasons." What is the exact problem? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 06/27] mm: page_alloc: Allow an arch to hook early into free_pages_prepare()
On 19.11.23 17:57, Alexandru Elisei wrote: Add an arch_free_pages_prepare() hook that is called before the page flags are cleared. This will be used by arm64 when explicit management of tag storage pages is enabled. Can you elaborate a bit on what exactly will be done by that code with that information? -- Cheers, David / dhildenb
Re: [PATCH RFC v2 05/27] mm: page_alloc: Add an arch hook to allow prep_new_page() to fail
On 19.11.23 17:56, Alexandru Elisei wrote: Introduce arch_prep_new_page(), which will be used by arm64 to reserve tag storage for an allocated page. Reserving tag storage can fail, for example, if the tag storage page has a short pin on it, so allow prep_new_page() -> arch_prep_new_page() to similarly fail. But what are the side-effects of this? How does the calling code recover? E.g., what if we need to populate a page into user space, but that particular page we allocated fails to be prepared? So we inject a signal into that poor process? -- Cheers, David / dhildenb
Re: [PATCH v8 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 01.11.23 23:51, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. 
[1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Reviewed-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 213 ++-- 1 file changed, 138 insertions(+), 75 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6be7de9efa55..d242e49d7f7b 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) +{ + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + + /* +* For memmap_on_memory, the altmaps were added on a per-memblock +* basis; we have to process each individual memory block. +*/ + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start))); + WARN_ON_ONCE(!mem); + if (!mem) + continue; Nit: if (WARN_ON_ONCE(!mem)) continue; + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct mhp_params params = { .pgprot = +pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(cur_start), + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1), + }; + + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; As already spotted, we have to cleanup. 
+ + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, cur_start, memblock_size, &params); + if (ret < 0) { + kfree(params.altmap); + goto out; + } + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(cur_start, memblock_size, + params.altmap, group); + if (ret) { + arch_remove_memory(cur_start, memblock_size, NULL); + kfree(params.altmap); + goto out; + } + } + + return 0; +out: + if (ret && (cur_start != start)) Nit: I think you can drop the inner parentheses. @@ -2146,11 +2208,31 @@ void try_offline_node(int nid) } EXPORT_SYMBOL(try_offline_node); +static int memory_blocks_have_altmaps(u64 start, u64 size) +{ + u64 num_memblocks = size / memory_block_size_bytes(); + u64 num_altmaps = 0; + + if (!mhp_memmap_on_memory()) + return 0; + + walk_memory_blocks(start, size, &num_altmaps, + count_memory_range_altmaps_cb); + + if (num_altmaps == 0) + return 0; + + if (num_memblocks != num_altmaps) { + WARN_ONCE(1, "Not all memblocks in range have altmaps"); Nit: if (WARN_ON_ONCE(num_memblocks != num_altmaps)) return -EINVAL; Should be sufficient. [...] /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"
Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 31.10.23 03:14, Verma, Vishal L wrote: On Mon, 2023-10-30 at 11:20 +0100, David Hildenbrand wrote: On 26.10.23 00:44, Vishal Verma wrote: [..] @@ -2146,11 +2186,69 @@ void try_offline_node(int nid) } EXPORT_SYMBOL(try_offline_node); -static int __ref try_remove_memory(u64 start, u64 size) +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) { - struct memory_block *mem; - int rc = 0, nid = NUMA_NO_NODE; + unsigned long memblock_size = memory_block_size_bytes(); struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + u64 cur_start; + int rc; + + /* + * For memmap_on_memory, the altmaps could have been added on + * a per-memblock basis. Loop through the entire range if so, + * and remove each memblock and its altmap. + */ /* * altmaps where added on a per-memblock basis; we have to process * each individual memory block. */ + for (cur_start = start; cur_start < start + size; + cur_start += memblock_size) { + rc = walk_memory_blocks(cur_start, memblock_size, , + test_has_altmap_cb); + if (rc) { + altmap = mem->altmap; + /* + * Mark altmap NULL so that we can add a debug + * check on memblock free. + */ + mem->altmap = NULL; + } Simpler (especially, we know that there must be an altmap): mem = find_memory_block(pfn_to_section_nr(cur_start)); altmap = mem->altmap; mem->altmap = NULL; I think we might be able to remove test_has_altmap_cb() then. + + remove_memory_block_devices(cur_start, memblock_size); + + arch_remove_memory(cur_start, memblock_size, altmap); + + /* Verify that all vmemmap pages have actually been freed. */ + if (altmap) { There must be an altmap, so this can be done unconditionally. Hi David, Hi! All other comments make sense, making those changes now. However for this one, does the WARN() below go away then? I was wondering if maybe arch_remove_memory() is responsible for freeing the altmap here, and at this stage we're just checking if that happened. If it didn't WARN and then free it. 
I think that has to stay, to make sure arch_remove_memory() did the right thing and we don't -- by BUG -- still have some altmap pages in use after they should have been completely freed. I drilled down the path, and I don't see altmap actually getting freed in vmem_altmap_free(), but I wasn't sure if was meant to free it as altmap->alloc went down to 0. vmem_altmap_free() does the "altmap->alloc -= nr_pfns", which is called when arch_remove_memory() frees the vmemmap pages and detects that they actually come from the altmap reserve and not from the buddy/earlyboot allocator. Freeing an altmap is just unaccounting it in the altmap structure; and here we make sure that we are actually back down to 0 and don't have some weird altmap freeing BUG in arch_remove_memory(). -- Cheers, David / dhildenb
Re: [PATCH v7 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 26.10.23 00:44, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. Almost there, I think :) +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group, + u64 start, u64 size) +{ + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + int ret; + + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + struct mhp_params params = { .pgprot = +pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(cur_start), + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1), + }; + + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; Best to cleanup here instead of handling it in the caller [as noted by Vishal, we might not be doing that yet]. Using remove_memory_blocks_and_altmaps() on the fully processed range sounds reasonable. 
maybe something like ret = arch_add_memory(nid, cur_start, memblock_size, ); if (ret) { kfree(params.altmap); goto out; } ret = create_memory_block_devices(cur_start, memblock_size, params.altmap, group); if (ret) { arch_remove_memory(cur_start, memblock_size, NULL); kfree(params.altmap); goto out; } if (ret && cur_start != start) remove_memory_blocks_and_altmaps(start, cur_start - start); return ret; + + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, cur_start, memblock_size, ); + if (ret < 0) { + kfree(params.altmap); + return ret; + } + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(cur_start, memblock_size, + params.altmap, group); + if (ret) { + arch_remove_memory(cur_start, memblock_size, NULL); + kfree(params.altmap); + return ret; + } + } + + return 0; +} + [...] static int check_cpu_on_node(int nid) { int cpu; @@ -2146,11 +2186,69 @@ void try_offline_node(int nid) } EXPORT_SYMBOL(try_offline_node); -static int __ref try_remove_memory(u64 start, u64 size) +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size) { - struct memory_block *mem; - int rc = 0, nid = NUMA_NO_NODE; + unsigned long memblock_size = memory_block_size_bytes(); struct vmem_altmap *altmap = NULL; + struct memory_block *mem; + u64 cur_start; + int rc; + + /* +* For memmap_on_memory, the altmaps could have been added on +* a per-memblock basis. Loop through the entire range if so, +* and remove each memblock and its altmap. +*/ /* * altmaps where added on a per-memblock basis; we have to process * each individual memory block. */ + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + rc = walk_memory_blocks(cur_start, memblock_size, , + test_has_altmap_cb); + if (rc) { + altmap = mem->altmap; + /* +* Mark altmap NULL so that we can add a debug +* check on memblock free. +*/ + mem->altmap = NULL; + } Simpler (especially, we know that there must be an altmap): mem =
Re: [PATCH v6 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 17.10.23 07:44, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. 
[1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752...@redhat.com/ Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Reviewed-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 214 1 file changed, 148 insertions(+), 66 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 6be7de9efa55..83e5ec377aad 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,43 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static int add_memory_create_devices(int nid, struct memory_group *group, +u64 start, u64 size, mhp_t mhp_flags) +{ + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(start), + .end_pfn = PHYS_PFN(start + size - 1), + }; + int ret; + + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) { + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmemdup(_altmap, sizeof(struct vmem_altmap), + GFP_KERNEL); + if (!params.altmap) + return -ENOMEM; + } + + /* call arch's memory hotadd */ + ret = arch_add_memory(nid, start, size, ); + if (ret < 0) + goto error; + + /* create memory block devices after memory was added */ + ret = create_memory_block_devices(start, size, params.altmap, group); + if (ret) + goto err_bdev; + + return 0; + +err_bdev: + arch_remove_memory(start, size, NULL); +error: + kfree(params.altmap); + return ret; +} + /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations (triggered e.g. by sysfs). 
@@ -1388,14 +1425,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) */ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { - struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + unsigned long memblock_size = memory_block_size_bytes(); enum memblock_flags memblock_flags = MEMBLOCK_NONE; - struct vmem_altmap mhp_altmap = { - .base_pfn = PHYS_PFN(res->start), - .end_pfn = PHYS_PFN(res->end), - }; struct memory_group *group = NULL; - u64 start, size; + u64 start, size, cur_start; bool new_node = false; int ret; @@ -1436,28 +1469,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) /* * Self hosted memmap array */ - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { - if (mhp_supports_memmap_on_memory(size)) { - mhp_altmap.free = memory_block_memmap_on_memory_pages(); - params.altmap = kmemdup(_altmap, - sizeof(struct vmem_altmap), - GFP_KERNEL); - if (!params.altmap) + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) && + mhp_supports_memmap_on_memory(memblock_size)) { + for (cur_start = start; cur_start < start + size; +cur_start += memblock_size) { + ret = add_memory_create_devices(nid, group, cur_start, + memblock_size, + mhp_flags); Just like on the removal part, that function does
Re: [PATCH v6 1/3] mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
On 17.10.23 07:44, Vishal Verma wrote: A review of the memmap_on_memory modifications to add_memory_resource() revealed an instance of an open-coded kmemdup(). Replace it with kmemdup(). Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Reported-by: Dan Williams Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index f8d3e7427e32..6be7de9efa55 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1439,11 +1439,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (mhp_supports_memmap_on_memory(size)) { mhp_altmap.free = memory_block_memmap_on_memory_pages(); - params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); + params.altmap = kmemdup(&mhp_altmap, + sizeof(struct vmem_altmap), + GFP_KERNEL); if (!params.altmap) goto error; - - memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap)); } /* fallback to not using altmap */ } Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 12.10.23 07:53, Verma, Vishal L wrote: On Mon, 2023-10-09 at 17:04 +0200, David Hildenbrand wrote: On 07.10.23 10:55, Huang, Ying wrote: Vishal Verma writes: @@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size) if (rc) return rc; + mem_hotplug_begin(); + /* - * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in - * the same granularity it was added - a single memory block. + * For memmap_on_memory, the altmaps could have been added on + * a per-memblock basis. Loop through the entire range if so, + * and remove each memblock and its altmap. */ if (mhp_memmap_on_memory()) { IIUC, even if mhp_memmap_on_memory() returns true, it's still possible that the memmap is put in DRAM after [2/2]. So arch_remove_memory() is called for each memory block unnecessarily. Can we detect this (via altmap?) and call remove_memory_block_and_altmap() for the whole range? Good point. We should handle memblock-per-memblock only if we have to handle the altmap. Otherwise, just call a separate function that doesn't care about -- e.g., called remove_memory_blocks_no_altmap(). We could simply walk all memory blocks and make sure either all have an altmap or none has an altmap. If there is a mix, we should bail out with WARN_ON_ONCE().
Ok I think I follow - based on both of these threads, here's my understanding in an incremental diff from the original patches (may not apply directly as I've already committed changes from the other bits of feedback - but this should provide an idea of the direction):

---
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 507291e44c0b..30addcb063b4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -2201,6 +2201,40 @@ static void __ref remove_memory_block_and_altmap(u64 start, u64 size)
 	}
 }

+static bool memblocks_have_altmaps(u64 start, u64 size)
+{
+	unsigned long memblock_size = memory_block_size_bytes();
+	u64 num_altmaps = 0, num_no_altmaps = 0;
+	struct memory_block *mem;
+	u64 cur_start;
+	int rc = 0;
+
+	if (!mhp_memmap_on_memory())
+		return false;

Probably can remove that, checked by the caller. (or drop the one in the caller)

+
+	for (cur_start = start; cur_start < start + size;
+	     cur_start += memblock_size) {
+		if (walk_memory_blocks(cur_start, memblock_size, &mem,
+				       test_has_altmap_cb))
+			num_altmaps++;
+		else
+			num_no_altmaps++;
+	}

You should do that without the outer loop, by doing the counting in the callback function instead.

+
+	if (!num_altmaps && num_no_altmaps > 0)
+		return false;
+
+	if (!num_no_altmaps && num_altmaps > 0)
+		return true;
+
+	/*
+	 * If there is a mix of memblocks with and without altmaps,
+	 * something has gone very wrong. WARN and bail.
+	 */
+	WARN_ONCE(1, "memblocks have a mix of missing and present altmaps");

It would be better if we could even make try_remove_memory() fail in this case.

+	return false;
+}
+
 static int __ref try_remove_memory(u64 start, u64 size)
 {
 	int rc, nid = NUMA_NO_NODE;
@@ -2230,7 +2264,7 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	 * a per-memblock basis. Loop through the entire range if so,
 	 * and remove each memblock and its altmap.
 	 */
-	if (mhp_memmap_on_memory()) {
+	if (mhp_memmap_on_memory() && memblocks_have_altmaps(start, size)) {
 		unsigned long memblock_size = memory_block_size_bytes();
 		u64 cur_start;

@@ -2239,7 +2273,8 @@
 			remove_memory_block_and_altmap(cur_start,
 						       memblock_size);

^ probably cleaner to move the loop into remove_memory_block_and_altmap() and call it remove_memory_blocks_and_altmaps(start, size) instead.

 	} else {
-		remove_memory_block_and_altmap(start, size);
+		remove_memory_block_devices(start, size);
+		arch_remove_memory(start, size, NULL);
 	}

 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {

-- Cheers, David / dhildenb
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 07.10.23 00:01, Verma, Vishal L wrote: On Fri, 2023-10-06 at 14:52 +0200, David Hildenbrand wrote: On 05.10.23 20:31, Vishal Verma wrote: <..>

@@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	if (rc)
 		return rc;

+	mem_hotplug_begin();
+
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
 	 */
 	if (mhp_memmap_on_memory()) {
-		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
-		if (rc) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
-			altmap = mem->altmap;
-			/*
-			 * Mark altmap NULL so that we can add a debug
-			 * check on memblock free.
-			 */
-			mem->altmap = NULL;
-		}
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			remove_memory_block_and_altmap(nid, cur_start,
+						       memblock_size);
+	} else {
+		remove_memory_block_and_altmap(nid, start, size);

Better call remove_memory_block_devices() and arch_remove_memory(start, size, altmap) here explicitly instead of using remove_memory_block_and_altmap() that really can only handle a single memory block with any inputs.

I'm not sure I follow. Even in the non memmap_on_memory case, we'd have to walk_memory_blocks() to get to the memory_block->altmap, right?

See my other reply: at least with mhp_memmap_on_memory()==false, we don't have to worry about the altmap.

Or is there a more direct way? If we have to walk_memory_blocks, what's the advantage of calling those directly instead of calling the helper created above?

I think we have two cases to handle

1) All have an altmap. Remove them block-by-block.
Probably we should call a function remove_memory_blocks(altmap=true) [or alternatively remove_memory_blocks_and_altmaps()] and just handle iterating internally.

2) All don't have an altmap. We can remove them in one go.

Probably we should call that remove_memory_blocks(altmap=false) [or alternatively remove_memory_blocks_no_altmaps()].

I guess it's best to do a walk upfront to make sure either all have an altmap or none has one. Then we can branch off to the right function knowing whether we have to process altmaps or not. The existing

if (mhp_memmap_on_memory()) {
	...
}

can be extended for that case.

Please let me know if I failed to express what I mean, then I can briefly prototype it on top of your changes.

-- Cheers, David / dhildenb
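The two-case dispatch described above (per-block removal when altmaps are present, one-shot removal otherwise) can be sketched as a userspace simulation; `remove_one_block()` and `remove_range()` are hypothetical stand-ins for the per-block altmap teardown and the plain remove_memory_block_devices() + arch_remove_memory() path:

```c
#include <stdbool.h>

#define MEMBLOCK_SZ (128UL << 20)	/* 128 MiB, the x86-64 default block size */

static unsigned long per_block_calls;
static unsigned long whole_range_calls;

/* Stand-in for removing one memory block together with its altmap. */
static void remove_one_block(unsigned long start)
{
	(void)start;
	per_block_calls++;
}

/* Stand-in for removing the whole range in one go (no altmap to tear down). */
static void remove_range(unsigned long start, unsigned long size)
{
	(void)start;
	(void)size;
	whole_range_calls++;
}

/* Dispatch as discussed: go per-block only when the blocks carry altmaps. */
static void try_remove(unsigned long start, unsigned long size, bool have_altmaps)
{
	if (have_altmaps) {
		unsigned long cur;

		for (cur = start; cur < start + size; cur += MEMBLOCK_SZ)
			remove_one_block(cur);
	} else {
		remove_range(start, size);
	}
}
```

The upfront all-or-none walk decides `have_altmaps`; mixed ranges never reach this dispatch.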
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 07.10.23 10:55, Huang, Ying wrote: Vishal Verma writes: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Suggested-by: David Hildenbrand Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 162 1 file changed, 99 insertions(+), 63 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index f8d3e7427e32..77ec6f15f943 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1380,6 +1380,44 @@ static bool mhp_supports_memmap_on_memory(unsigned long size) return arch_supports_memmap_on_memory(vmemmap_size); } +static int add_memory_create_devices(int nid, struct memory_group *group, +u64 start, u64 size, mhp_t mhp_flags) +{ + struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(start), + .end_pfn = PHYS_PFN(start + size - 1), + }; + int ret; + + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY)) { + mhp_altmap.free = memory_block_memmap_on_memory_pages(); + params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); + if (!params.altmap) + 
		return -ENOMEM;
+
+		memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
+	}
+
+	/* call arch's memory hotadd */
+	ret = arch_add_memory(nid, start, size, &params);
+	if (ret < 0)
+		goto error;
+
+	/* create memory block devices after memory was added */
+	ret = create_memory_block_devices(start, size, params.altmap, group);
+	if (ret)
+		goto err_bdev;
+
+	return 0;
+
+err_bdev:
+	arch_remove_memory(start, size, NULL);
+error:
+	kfree(params.altmap);
+	return ret;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1388,14 +1426,10 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
  */
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
-	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	unsigned long memblock_size = memory_block_size_bytes();
 	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-	struct vmem_altmap mhp_altmap = {
-		.base_pfn = PHYS_PFN(res->start),
-		.end_pfn = PHYS_PFN(res->end),
-	};
 	struct memory_group *group = NULL;
-	u64 start, size;
+	u64 start, size, cur_start;
 	bool new_node = false;
 	int ret;
@@ -1436,28 +1470,21 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	/*
 	 * Self hosted memmap array
 	 */
-	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (mhp_supports_memmap_on_memory(size)) {
-			mhp_altmap.free = memory_block_memmap_on_memory_pages();
-			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
-			if (!params.altmap)
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+	    mhp_supports_memmap_on_memory(memblock_size)) {
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size) {
+			ret = add_memory_create_devices(nid, group, cur_start,
+							memblock_size,
+							mhp_flags);
+			if (ret)
 				goto error;
-
-			memcpy(params.altmap, &mhp_altmap, sizeof(mhp_altmap));
 		}
-		/* fallback to not using altmap */
-	}
-
-	/* call arch's memory hotadd */
-	ret = arch_add_memory(nid,
start, size, &params);
-	if (ret < 0)
-		goto error_fre
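The chunked add path in the patch (loop over memblock-sized pieces, bail out on failure) can be simulated in plain C. `add_one_block()` is a hypothetical stand-in for add_memory_create_devices(), and the failure injection via `fail_at` is purely illustrative; the point is the unwind of already-added chunks on error:

```c
#define MEMBLOCK_SZ (128UL << 20)	/* 128 MiB chunk, illustrative */

static int fail_at = -1;	/* chunk index at which the add fails, -1 = never */
static int added;		/* number of chunks currently added */

/* Stand-in for add_memory_create_devices(); fails at the injected index. */
static int add_one_block(int idx)
{
	if (idx == fail_at)
		return -1;	/* simulates e.g. -ENOMEM from arch_add_memory() */
	added++;
	return 0;
}

/* Stand-in for the matching teardown of one already-added chunk. */
static void remove_one_block(void)
{
	added--;
}

/* Loop over memblock-sized chunks; unwind everything added so far on failure. */
static int add_chunked(unsigned long start, unsigned long size)
{
	unsigned long cur;
	int idx = 0, done = 0;

	for (cur = start; cur < start + size; cur += MEMBLOCK_SZ, idx++) {
		if (add_one_block(idx) < 0) {
			while (done--)
				remove_one_block();
			return -1;
		}
		done++;
	}
	return 0;
}
```

A partial failure must not leave earlier chunks behind, since the caller only sees one add_memory() request for the whole range.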
Re: [PATCH v5 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 05.10.23 20:31, Vishal Verma wrote: The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed.

Maybe add that this implies that we're not making use of PUD mappings in the direct map yet, and link to the proposal on how we could optimize that eventually in the future.

[...]

-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref remove_memory_block_and_altmap(int nid, u64 start, u64 size)

You shouldn't need the nid, right?

 {
+	int rc = 0;
 	struct memory_block *mem;
-	int rc = 0, nid = NUMA_NO_NODE;
 	struct vmem_altmap *altmap = NULL;

+	rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
+	if (rc) {
+		altmap = mem->altmap;
+		/*
+		 * Mark altmap NULL so that we can add a debug
+		 * check on memblock free.
+		 */
+		mem->altmap = NULL;
+	}
+
+	/*
+	 * Memory block device removal under the device_hotplug_lock is
+	 * a barrier against racing online attempts.
+	 */
+	remove_memory_block_devices(start, size);

We're now calling that under the memory hotplug lock. I assume this is fine, but I remember some ugly lockdep details ...should be alright I guess.

+
+	arch_remove_memory(start, size, altmap);
+
+	/* Verify that all vmemmap pages have actually been freed. */
+	if (altmap) {
+		WARN(altmap->alloc, "Altmap not fully unmapped");
+		kfree(altmap);
+	}
+}
+
+static int __ref try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
+
 	BUG_ON(check_hotplug_memory_range(start, size));

 	/*
@@ -2167,47 +2221,28 @@ static int __ref try_remove_memory(u64 start, u64 size)
 	if (rc)
 		return rc;

+	mem_hotplug_begin();
+
 	/*
-	 * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
-	 * the same granularity it was added - a single memory block.
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
 	 */
 	if (mhp_memmap_on_memory()) {
-		rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
-		if (rc) {
-			if (size != memory_block_size_bytes()) {
-				pr_warn("Refuse to remove %#llx - %#llx,"
-					"wrong granularity\n",
-					start, start + size);
-				return -EINVAL;
-			}
-			altmap = mem->altmap;
-			/*
-			 * Mark altmap NULL so that we can add a debug
-			 * check on memblock free.
-			 */
-			mem->altmap = NULL;
-		}
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			remove_memory_block_and_altmap(nid, cur_start,
+						       memblock_size);
+	} else {
+		remove_memory_block_and_altmap(nid, start, size);

Better call remove_memory_block_devices() and arch_remove_memory(start, size, altmap) here explicitly instead of using remove_memory_block_and_altmap() that really can only handle a single memory block with any inputs.

 	}

 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");

Can we continue doing that in the old order? (IOW before taking the lock?).

-	/*
-	 * Memory block device removal under the device_hotplug_lock is
-	 * a barrier against racing online attempts.
-	 */
-	remove_memory_block_devices(start, size);
-
-	mem_hotplug_begin();
-
-	arch_remove_memory(start, size, altmap);
-
-	/* Verify that all vmemmap pages have actually been
Re: [PATCH v4 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 03.10.23 22:03, Verma, Vishal L wrote: On Mon, 2023-10-02 at 11:28 +0200, David Hildenbrand wrote: + +static int __ref try_remove_memory(u64 start, u64 size) +{ + int rc, nid = NUMA_NO_NODE; + + BUG_ON(check_hotplug_memory_range(start, size)); + + /* + * All memory blocks must be offlined before removing memory. Check + * whether all memory blocks in question are offline and return error + * if this is not the case. + * + * While at it, determine the nid. Note that if we'd have mixed nodes, + * we'd only try to offline the last determined one -- which is good + * enough for the cases we care about. + */ + rc = walk_memory_blocks(start, size, , check_memblock_offlined_cb); + if (rc) + return rc; + + /* + * For memmap_on_memory, the altmaps could have been added on + * a per-memblock basis. Loop through the entire range if so, + * and remove each memblock and its altmap. + */ + if (mhp_memmap_on_memory()) { + unsigned long memblock_size = memory_block_size_bytes(); + u64 cur_start; + + for (cur_start = start; cur_start < start + size; + cur_start += memblock_size) + __try_remove_memory(nid, cur_start, memblock_size); + } else { + __try_remove_memory(nid, start, size); + } + return 0; } Why is the firmware, memblock and nid handling not kept in this outer function? We really shouldn't be doing per memory block what needs to be done per memblock: remove_memory_block_devices() and arch_remove_memory(). Ah yes makes sense since we only do create_memory_block_devices() and arch_add_memory() in the per memory block inner loop during addition. How should the locking work in this case though? Sorry, I had to process a family NMI the last couple of days. The original code holds the mem_hotplug_begin() lock over arch_remove_memory() and all of the nid and memblock stuff. Should I just hold the lock and release it in the inner loop for each memory block, and then also acquire and release it separately for the memblock and nid stuff in the outer function? 
I think we have to hold it over the whole operation. I saw that you sent a v5, I'll comment there. -- Cheers, David / dhildenb
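Holding the hotplug lock over the whole chunked operation, as agreed above, can be sketched as a small simulation that asserts the invariant: the lock is taken exactly once and is held across every per-block removal. All names here are hypothetical stand-ins for mem_hotplug_begin()/mem_hotplug_done() and the per-block removal helper:

```c
#include <assert.h>
#include <stdbool.h>

#define MEMBLOCK_SZ (128UL << 20)

static bool locked;
static int lock_acquisitions;
static int removed_under_lock;

static void mem_hotplug_begin_sim(void)
{
	assert(!locked);	/* no recursive locking in this sketch */
	locked = true;
	lock_acquisitions++;
}

static void mem_hotplug_done_sim(void)
{
	assert(locked);
	locked = false;
}

static void remove_block_sim(unsigned long start)
{
	(void)start;
	assert(locked);		/* every per-block removal must hold the lock */
	removed_under_lock++;
}

/* Lock once around the whole loop, not once per block. */
static void try_remove_sim(unsigned long start, unsigned long size)
{
	unsigned long cur;

	mem_hotplug_begin_sim();
	for (cur = start; cur < start + size; cur += MEMBLOCK_SZ)
		remove_block_sim(cur);
	mem_hotplug_done_sim();
}
```

Acquiring and releasing per block would instead show `lock_acquisitions` equal to the chunk count and open a window between chunks, which is exactly what the discussion rules out.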
Re: [PATCH v4 1/2] mm/memory_hotplug: split memmap_on_memory requests across memblocks
+
+static int __ref try_remove_memory(u64 start, u64 size)
+{
+	int rc, nid = NUMA_NO_NODE;
+
+	BUG_ON(check_hotplug_memory_range(start, size));
+
+	/*
+	 * All memory blocks must be offlined before removing memory. Check
+	 * whether all memory blocks in question are offline and return error
+	 * if this is not the case.
+	 *
+	 * While at it, determine the nid. Note that if we'd have mixed nodes,
+	 * we'd only try to offline the last determined one -- which is good
+	 * enough for the cases we care about.
+	 */
+	rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb);
+	if (rc)
+		return rc;
+
+	/*
+	 * For memmap_on_memory, the altmaps could have been added on
+	 * a per-memblock basis. Loop through the entire range if so,
+	 * and remove each memblock and its altmap.
+	 */
+	if (mhp_memmap_on_memory()) {
+		unsigned long memblock_size = memory_block_size_bytes();
+		u64 cur_start;
+
+		for (cur_start = start; cur_start < start + size;
+		     cur_start += memblock_size)
+			__try_remove_memory(nid, cur_start, memblock_size);
+	} else {
+		__try_remove_memory(nid, start, size);
+	}
+	return 0;
 }

Why is the firmware, memblock and nid handling not kept in this outer function? We really shouldn't be doing per memory block what needs to be done per memblock: remove_memory_block_devices() and arch_remove_memory().

-- Cheers, David / dhildenb
Re: [PATCH v4 2/2] dax/kmem: allow kmem to add memory with memmap_on_memory
On 28.09.23 22:30, Vishal Verma wrote: Large amounts of memory managed by the kmem driver may come in via CXL, and it is often desirable to have the memmap for this memory on the new memory itself. Enroll kmem-managed memory for memmap_on_memory semantics if the dax region originates via CXL. For non-CXL dax regions, retain the existing default behavior of hot adding without memmap_on_memory semantics. Add a sysfs override under the dax device to control this behavior and override either default. Cc: Andrew Morton Cc: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Reviewed-by: Jonathan Cameron Signed-off-by: Vishal Verma --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH RFC 00/37] Add support for arm64 MTE dynamic tag storage reuse
On 11.09.23 13:52, Catalin Marinas wrote: On Wed, Sep 06, 2023 at 12:23:21PM +0100, Alexandru Elisei wrote: On Thu, Aug 24, 2023 at 04:24:30PM +0100, Catalin Marinas wrote: On Thu, Aug 24, 2023 at 01:25:41PM +0200, David Hildenbrand wrote: On 24.08.23 13:06, David Hildenbrand wrote: Regarding one complication: "The kernel needs to know where to allocate a PROT_MTE page from or migrate a current page if it becomes PROT_MTE (mprotect()) and the range it is in does not support tagging.", simplified handling would be if it's in a MIGRATE_CMA pageblock, it doesn't support tagging. You have to migrate to a !CMA page (for example, not specifying GFP_MOVABLE as a quick way to achieve that). Okay, I now realize that this patch set effectively duplicates some CMA behavior using a new migrate-type. [...] I considered mixing the tag storage memory memory with normal memory and adding it to MIGRATE_CMA. But since tag storage memory cannot be tagged, this means that it's not enough anymore to have a __GFP_MOVABLE allocation request to use MIGRATE_CMA. I considered two solutions to this problem: 1. Only allocate from MIGRATE_CMA is the requested memory is not tagged => this effectively means transforming all memory from MIGRATE_CMA into the MIGRATE_METADATA migratetype that the series introduces. Not very appealing, because that means treating normal memory that is also on the MIGRATE_CMA lists as tagged memory. That's indeed not ideal. We could try this if it makes the patches significantly simpler, though I'm not so sure. Allocating metadata is the easier part as we know the correspondence from the tagged pages (32 PROT_MTE page) to the metadata page (1 tag storage page), so alloc_contig_range() does this for us. Just adding it to the CMA range is sufficient. However, making sure that we don't allocate PROT_MTE pages from the metadata range is what led us to another migrate type. 
I guess we could achieve something similar with a new zone or a CPU-less NUMA node, Ideally, no significant core-mm changes to optimize for an architecture oddity. That implies, no new zones and no new migratetypes -- unless it is unavoidable and you are confident that you can convince core-MM people that the use case (giving back 3% of system RAM at max in some setups) is worth the trouble. I also had CPU-less NUMA nodes in mind when thinking about that, but not sure how easy it would be to integrate it. If the tag memory has actually different performance characteristics as well, a NUMA node would be the right choice. If we could find some way to easily support this either via CMA or CPU-less NUMA nodes, that would be much preferable; even if we cannot cover each and every future use case right now. I expect some issues with CXL+MTE either way , but are happy to be taught otherwise :) Another thought I had was adding something like CMA memory characteristics. Like, asking if a given CMA area/page supports tagging (i.e., flag for the CMA area set?)? When you need memory that supports tagging and have a page that does not support tagging (CMA && taggable), simply migrate to !MOVABLE memory (eventually we could also try adding !CMA). Was that discussed and what would be the challenges with that? Page migration due to compaction comes to mind, but it might also be easy to handle if we can just avoid CMA memory for that. though the latter is not guaranteed not to allocate memory from the range, only make it less likely. Both these options are less flexible in terms of size/alignment/placement. Maybe as a quick hack - only allow PROT_MTE from ZONE_NORMAL and configure the metadata range in ZONE_MOVABLE but at some point I'd expect some CXL-attached memory to support MTE with additional carveout reserved. I have no idea how we could possibly cleanly support memory hotplug in virtual environments (virtual DIMMs, virtio-mem) with MTE. 
In contrast to s390x storage keys, the approach that arm64 with MTE took here (exposing tag memory to the VM) makes it rather hard and complicated. -- Cheers, David / dhildenb
Re: [PATCH v2 2/3] mm/memory_hotplug: split memmap_on_memory requests across memblocks
On 14.08.23 08:45, Huang, Ying wrote: "Verma, Vishal L" writes: On Mon, 2023-07-24 at 13:54 +0800, Huang, Ying wrote: Vishal Verma writes:

@@ -2035,12 +2056,38 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);

-static int __ref try_remove_memory(u64 start, u64 size)
+static void __ref __try_remove_memory(int nid, u64 start, u64 size,
+				      struct vmem_altmap *altmap)
 {
-	struct vmem_altmap mhp_altmap = {};
-	struct vmem_altmap *altmap = NULL;
-	unsigned long nr_vmemmap_pages;
-	int rc = 0, nid = NUMA_NO_NODE;
+	/* remove memmap entry */
+	firmware_map_remove(start, start + size, "System RAM");

If mhp_supports_memmap_on_memory(), we will call firmware_map_add_hotplug() for whole range. But here we may call firmware_map_remove() for part of range. Is it OK?

Good point, this is a discrepancy in the add vs remove path. Can the firmware memmap entries be moved up a bit in the add path, and is it okay to create these for each memblock? Or should these be for the whole range? I'm not familiar with the implications. (I've left it as is for v3 for now, but depending on the direction I can update in a future rev).

Cced more firmware map developers and maintainers. Per my understanding, we should create one firmware memmap entry for each memblock.

Ideally we should create it for the whole range, to limit the ranges. But it really only matters for DIMMs; for dax/kmem, we'll not create any firmware entries.

-- Cheers, David / dhildenb
Re: [PATCH v2 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 20.07.23 09:14, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself.

The main hurdle for accomplishing this for kmem is that memmap_on_memory can only be done if the memory being added is equal to the size of one memblock. To overcome this, allow the hotplug code to split an add_memory() request into memblock-sized chunks, and try_remove_memory() to also expect and handle such a scenario.

Patch 1 exports mhp_supports_memmap_on_memory() so it can be used by the kmem driver. Patch 2 teaches the memory_hotplug code to allow for splitting add_memory() and remove_memory() requests over memblock sized chunks. Patch 3 adds a sysfs control for the kmem driver that would allow an opt-out of using memmap_on_memory for the memory being added.

It might be reasonable to rebase this on Aneesh's work. For example, patch #1 might not be required anymore.

-- Cheers, David / dhildenb
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 13.07.23 21:12, Jeff Moyer wrote: David Hildenbrand writes: On 16.06.23 00:00, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. Arrange for this by first allowing for a module parameter override for the mhp_supports_memmap_on_memory() test using a flag, adjusting the only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c, exporting the symbol so it can be called by kmem.c, and finally changing the kmem driver to add_memory() in chunks of memory_block_size_bytes().

1) Why is the override a requirement here? Just let the admin configure it, then add conditional support for kmem.

2) I recall that there are cases where we don't want the memmap to land on slow memory (which online_movable would achieve). Just imagine the slow PMEM case. So this might need another configuration knob on the kmem side.

From my memory, the case where you don't want the memmap to land on *persistent memory* is when the device is small (such as NVDIMM-N), and you want to reserve as much space as possible for the application data. This has nothing to do with the speed of access.

Now that you mention it, I also do remember the origin of the altmap -- to achieve exactly that: place the memmap on the device.

commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
Author: Dan Williams
Date: Fri Jan 15 16:56:22 2016 -0800

    x86, mm: introduce vmem_altmap to augment vmemmap_populate()

    In support of providing struct page for large persistent memory
    capacities, use struct vmem_altmap to change the default policy for
    allocating memory for the memmap array.
The default vmemmap_populate() allocates page table storage area from the page allocator. Given persistent memory capacities relative to DRAM it may not be feasible to store the memmap in 'System Memory'. Instead vmem_altmap represents pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() requests. In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to configure it). BUT that case is completely different from the "System RAM" mode. The memmap of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy). In comparison, if the buddy and everybody else works on the memmap in "System RAM", it's much more significant if that resides on slow memory. Looking at commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375 Author: Michal Hocko Date: Tue Oct 3 16:16:19 2017 -0700 mm, page_alloc: add scheduling point to memmap_init_zone memmap_init_zone gets a pfn range to initialize and it can be really large resulting in a soft lockup on non-preemptible kernels NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720] [...] task: 88ecd7e902c0 ti: 88eca4e5 task.ti: 88eca4e5 RIP: move_pfn_range_to_zone+0x185/0x1d0 [...] Call Trace: devm_memremap_pages+0x2c7/0x430 pmem_attach_disk+0x2fd/0x3f0 [nd_pmem] nvdimm_bus_probe+0x64/0x110 [libnvdimm] It's hard to tell if that was only required due to the memmap for these devices being that large, or also partially because the access to the memmap is slower that it makes a real difference. I recall that we're also often using ZONE_MOVABLE on such slow memory to not end up placing other kernel data structures on there: especially, user space page tables as I've been told. @Dan, any insight on the performance aspects when placing the memmap on (slow) memory and having that memory be consumed by the buddy where we frequently operate on the memmap? -- Cheers, David / dhildenb
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 13.07.23 17:40, Verma, Vishal L wrote: On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote: On 13.07.23 17:15, Verma, Vishal L wrote: On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: On 13.07.23 08:45, Verma, Vishal L wrote: I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(), kmem_name, MHP_NID_IS_MGID); + range_len(), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", Why do we need the MHP_SPLIT_MEMBLOCKS? I thought we still wanted either an opt-in or opt-out for the kmem driver to be able to do memmap_on_memory, in case there were performance implications or the lack of 1GiB PUDs. I haven't implemented that yet, but I was thinking along the lines of a sysfs knob exposed by kmem, that controls setting of this new MHP_SPLIT_MEMBLOCKS flag. Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY, and memory_hotplug is free to split to memblocks if it needs to to satisfy that. 
And if you don't want memmap holes in a larger area you're adding (for example to runtime-allocate 1 GiB pages), simply check the size your adding, and if it's, say, less than 1 G, don't set the flag. But that's probably a corner case use case not worth considering for now. That sounds reasonable. Let me give this a try and see if I run into anything else. Thanks David! Sure! -- Cheers, David / dhildenb
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 13.07.23 17:15, Verma, Vishal L wrote: On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote: On 13.07.23 08:45, Verma, Vishal L wrote: I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(), kmem_name, MHP_NID_IS_MGID); + range_len(), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", Why do we need the MHP_SPLIT_MEMBLOCKS? I thought we still wanted either an opt-in or opt-out for the kmem driver to be able to do memmap_on_memory, in case there were performance implications or the lack of 1GiB PUDs. I haven't implemented that yet, but I was thinking along the lines of a sysfs knob exposed by kmem, that controls setting of this new MHP_SPLIT_MEMBLOCKS flag. Why is MHP_MEMMAP_ON_MEMORY not sufficient for that? In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a single memory block, you can simply split up internally, no? 
Essentially in add_memory_resource() something like

if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
    mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
	for (cur_start = start; cur_start < start + size;
	     cur_start += memory_block_size_bytes()) {
		mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;

		ret = arch_add_memory(nid, start,
				      memory_block_size_bytes(), &params);
		if (ret < 0)
			...

		ret = create_memory_block_devices(start,
						  memory_block_size_bytes(),
						  mhp_altmap.alloc, group);
		if (ret)
			...
	}
} else {
	/* old boring stuff */
}

Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() ...

My current approach was looping a level higher, on the call to add_memory_resource, but this looks reasonable too, and I can switch to this. In fact it is better as in my case I had to loop twice, once for the regular add_memory() path and once for the _driver_managed() path. Yours should avoid that.

As we only care about the altmap here, handling it for arch_add_memory() + create_memory_block_devices() should be sufficient.

-- Cheers, David / dhildenb
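One detail worth pinning down in the per-block sketch above is the altmap setup: each memory block hosts its own vmemmap (the vmemmap of block #2 must not land on block #1), so the per-block base_pfn has to track the chunk currently being added, not the overall start. A minimal stand-alone sketch; PAGE_SHIFT and the reduced two-field struct are illustrative, not the kernel definitions:

```c
#define PAGE_SHIFT 12
#define MEMBLOCK_SZ (128UL << 20)
#define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))

/* Reduced stand-in for the kernel's struct vmem_altmap. */
struct vmem_altmap {
	unsigned long base_pfn;	/* first PFN of the block hosting this vmemmap */
	unsigned long free;	/* pages of the block reservable for the vmemmap */
};

/*
 * Per-block altmap parameters: base_pfn follows cur_start, so each
 * chunk self-hosts its own memmap.
 */
static struct vmem_altmap altmap_for_block(unsigned long cur_start)
{
	struct vmem_altmap a = {
		.base_pfn = PHYS_PFN(cur_start),
		.free = PHYS_PFN(MEMBLOCK_SZ),
	};
	return a;
}
```

With 128 MiB blocks and 4 KiB pages, each chunk's base_pfn advances by 32768 PFNs, so consecutive blocks never overlap their vmemmap reservations.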
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 13.07.23 08:45, Verma, Vishal L wrote: On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote: On 11.07.23 16:30, Aneesh Kumar K.V wrote: David Hildenbrand writes: Maybe the better alternative is to teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altmap.free = PHYS_PFN(size); mhp_altmap.base_pfn = PHYS_PFN(start); params.altmap = &mhp_altmap; } /* call arch's memory hotadd */ ret = arch_add_memory(nid, start, size, &params); if (ret < 0) goto error; Note that we want to handle that on a per memory-block basis, because we don't want the vmemmap of memory block #2 to end up on memory block #1. It all gets messy with memory onlining/offlining etc otherwise ... I tried to implement this inside add_memory_driver_managed() and also within dax/kmem. IMHO doing the error handling inside dax/kmem is better. Here is how it looks: 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions 2. For each successful request_mem_regions if any blocks got added, we keep the resource. If none got added, we will kfree the resource Doing this unconditional splitting outside of add_memory_driver_managed() is undesirable for at least two reasons: 1) You end up always creating individual entries in the resource tree (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective. 2) As we call arch_add_memory() in memory block granularity (e.g., 128 MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the identity mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
While you could sense for support and do the split based on that, it will be beneficial for other users (especially DIMMs) if we do that internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be effective or not. I'm taking a shot at implementing the splitting internally in memory_hotplug.c. The caller (kmem) side does become trivial with this approach, but there's a slight complication if I don't have the module param override (patch 1 of this series). The kmem diff now looks like: diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 898ca9505754..8be932f63f90 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY | + MHP_SPLIT_MEMBLOCKS; struct resource *res; struct range range; @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) * this as RAM automatically. */ rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + range_len(&range), kmem_name, mhp_flags); if (rc) { dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", Why do we need the MHP_SPLIT_MEMBLOCKS? In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a single memory block, you can simply split up internally, no? Essentially in add_memory_resource() something like if (mhp_flags & MHP_MEMMAP_ON_MEMORY && mhp_supports_memmap_on_memory(memory_block_size_bytes())) { for (cur_start = start; cur_start < start + size; cur_start += memory_block_size_bytes()) { mhp_altmap.free = PHYS_PFN(memory_block_size_bytes()); mhp_altmap.base_pfn = PHYS_PFN(cur_start); params.altmap = &mhp_altmap; ret = arch_add_memory(nid, cur_start, memory_block_size_bytes(), &params); if (ret < 0) ... ret = create_memory_block_devices(cur_start, memory_block_size_bytes(), mhp_altmap.alloc, group); if (ret) ...
} } else { /* old boring stuff */ } Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes() ... If any adding of memory failed, we remove what we already added.
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 11.07.23 16:30, Aneesh Kumar K.V wrote: David Hildenbrand writes: On 16.06.23 00:00, Vishal Verma wrote: With DAX memory regions originating from CXL memory expanders or NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory on a system without enough 'regular' main memory to support the memmap for it. To avoid this, ensure that all kmem managed hotplugged memory is added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the new memory region being hot added. To do this, call add_memory() in chunks of memory_block_size_bytes() as that is a requirement for memmap_on_memory. Additionally, use the mhp_flag to force the memmap_on_memory checks regardless of the respective module parameter setting. Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- drivers/dax/kmem.c | 49 - 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 7b36db6f1cbd..0751346193ef 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + u64 cur_start, cur_len, remaining; struct resource *res; struct range range; @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) res->flags = IORESOURCE_SYSTEM_RAM; /* -* Ensure that future kexec'd kernels will not treat -* this as RAM automatically. +* Add memory in chunks of memory_block_size_bytes() so that +* it is considered for MHP_MEMMAP_ON_MEMORY +* @range has already been aligned to memory_block_size_bytes(), +* so the following loop will always break it down cleanly.
*/ - rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + cur_start = range.start; + cur_len = memory_block_size_bytes(); + remaining = range_len(&range); + while (remaining) { + mhp_t mhp_flags = MHP_NID_IS_MGID; - if (rc) { - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", - i, range.start, range.end); - remove_resource(res); - kfree(res); - data->res[i] = NULL; - if (mapped) - continue; - goto err_request_mem; + if (mhp_supports_memmap_on_memory(cur_len, + MHP_MEMMAP_ON_MEMORY)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; + /* +* Ensure that future kexec'd kernels will not treat +* this as RAM automatically. +*/ + rc = add_memory_driver_managed(data->mgid, cur_start, + cur_len, kmem_name, + mhp_flags); + + if (rc) { + dev_warn(dev, +"mapping%d: %#llx-%#llx memory add failed\n", +i, cur_start, cur_start + cur_len - 1); + remove_resource(res); + kfree(res); + data->res[i] = NULL; + if (mapped) + continue; + goto err_request_mem; + } + + cur_start += cur_len; + remaining -= cur_len; } mapped++; } Maybe the better alternative is to teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altm
Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
On 23.06.23 10:40, Aneesh Kumar K.V wrote: Vishal Verma writes: For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the 'memmap_on_memory' module parameter was a hard requirement. In preparation for the dax/kmem driver to use memmap_on_memory semantics, arrange for the module parameter check to be bypassed via the appropriate mhp_flag. Recall that the kmem driver could contribute huge amounts of hotplugged memory originating from special purpose devices such as CXL memory expanders. In some cases memmap_on_memory may be the /only/ way this new memory can be hotplugged. Hence it makes sense for kmem to have a way to force memmap_on_memory without depending on a module param, if all the other conditions for it are met. The only other user of this interface is acpi/acpi_memoryhotplug.c, which only enables the mhp_flag if an initial mhp_supports_memmap_on_memory() test passes. Maintain the existing behavior and semantics for this by performing the initial check from acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back to the module parameter. Cc: "Rafael J.
Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- include/linux/memory_hotplug.h | 2 +- drivers/acpi/acpi_memhotplug.c | 2 +- mm/memory_hotplug.c| 24 3 files changed, 18 insertions(+), 10 deletions(-) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 9fcbf5706595..c9ddcd3cad70 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, extern int arch_create_linear_mapping(int nid, u64 start, u64 size, struct mhp_params *params); void arch_remove_linear_mapping(u64 start, u64 size); -extern bool mhp_supports_memmap_on_memory(unsigned long size); +extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags); #endif /* CONFIG_MEMORY_HOTPLUG */ #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 24f662d8bd39..119d3bb49753 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (!info->length) continue; - if (mhp_supports_memmap_on_memory(info->length)) + if (mhp_supports_memmap_on_memory(info->length, 0)) mhp_flags |= MHP_MEMMAP_ON_MEMORY; result = __add_memory(mgid, info->start_addr, info->length, mhp_flags); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 8e0fa209d533..bb3845830922 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block *mem, void *arg) return device_online(&mem->dev); } -bool mhp_supports_memmap_on_memory(unsigned long size) +bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags) { unsigned long nr_vmemmap_pages = size / PAGE_SIZE; unsigned long vmemmap_size = nr_vmemmap_pages *
sizeof(struct page); unsigned long remaining_size = size - vmemmap_size; /* -* Besides having arch support and the feature enabled at runtime, we -* need a few more assumptions to hold true: +* The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force +* memmap_on_memory (if other conditions are met), regardless of the +* module parameter. drivers/dax/kmem.c is an example, where large +* amounts of hotplug memory may come from, and the only option to +* successfully online all of it is to place the memmap on this memory. +* +* Besides having arch support and the feature enabled at runtime or +* via the mhp_flag, we need a few more assumptions to hold true: * * a) We span a single memory block: memory onlining/offlining happens *in memory block granularity. We don't want the vmemmap of online @@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ - return mhp_memmap_on_memory() && - size == memory_block_size_bytes() && - IS_ALIGNED(vmemmap_size, PMD_SIZE) && - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); + + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) || mhp_memmap_on_memory()) + return size == memory_block_size_bytes() && + I
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 21.06.23 21:32, Verma, Vishal L wrote: On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote: On 16.06.23 00:00, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. Arrange for this by first allowing for a module parameter override for the mhp_supports_memmap_on_memory() test using a flag, adjusting the only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c, exporting the symbol so it can be called by kmem.c, and finally changing the kmem driver to add_memory() in chunks of memory_block_size_bytes(). 1) Why is the override a requirement here? Just let the admin configure it and then add conditional support for kmem. Configure it in the current way using the module parameter to memory_hotplug? The whole module param check feels a bit awkward, especially since memory_hotplug is builtin, the only way to supply the param is on the kernel command line as opposed to a modprobe config. Yes, and that's nothing special. Runtime toggling is not implemented. The goal with extending mhp_supports_memmap_on_memory() to check for support with or without consideration for the module param is that it makes it a bit more flexible for callers like kmem. Not convinced yet that the global parameter should be bypassed TBH. And if so, this should be a separate patch on top that is completely optional for the remainder of the series. In any case, there has to be some admin control over that, because 1) You usually don't want vmemmap on potentially slow memory 2) Using memmap-on-memory prohibits gigantic pages from forming on that memory (when runtime-allocating them).
So "just doing that" without any config knob is problematic. -- Cheers, David / dhildenb
Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
On 16.06.23 00:00, Vishal Verma wrote: With DAX memory regions originating from CXL memory expanders or NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory on a system without enough 'regular' main memory to support the memmap for it. To avoid this, ensure that all kmem managed hotplugged memory is added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the new memory region being hot added. To do this, call add_memory() in chunks of memory_block_size_bytes() as that is a requirement for memmap_on_memory. Additionally, use the mhp_flag to force the memmap_on_memory checks regardless of the respective module parameter setting. Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- drivers/dax/kmem.c | 49 - 1 file changed, 36 insertions(+), 13 deletions(-) diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c index 7b36db6f1cbd..0751346193ef 100644 --- a/drivers/dax/kmem.c +++ b/drivers/dax/kmem.c @@ -12,6 +12,7 @@ #include #include #include +#include #include "dax-private.h" #include "bus.h" @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) data->mgid = rc; for (i = 0; i < dev_dax->nr_range; i++) { + u64 cur_start, cur_len, remaining; struct resource *res; struct range range; @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) res->flags = IORESOURCE_SYSTEM_RAM; /* -* Ensure that future kexec'd kernels will not treat -* this as RAM automatically. +* Add memory in chunks of memory_block_size_bytes() so that +* it is considered for MHP_MEMMAP_ON_MEMORY +* @range has already been aligned to memory_block_size_bytes(), +* so the following loop will always break it down cleanly.
*/ - rc = add_memory_driver_managed(data->mgid, range.start, - range_len(&range), kmem_name, MHP_NID_IS_MGID); + cur_start = range.start; + cur_len = memory_block_size_bytes(); + remaining = range_len(&range); + while (remaining) { + mhp_t mhp_flags = MHP_NID_IS_MGID; - if (rc) { - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n", - i, range.start, range.end); - remove_resource(res); - kfree(res); - data->res[i] = NULL; - if (mapped) - continue; - goto err_request_mem; + if (mhp_supports_memmap_on_memory(cur_len, + MHP_MEMMAP_ON_MEMORY)) + mhp_flags |= MHP_MEMMAP_ON_MEMORY; + /* +* Ensure that future kexec'd kernels will not treat +* this as RAM automatically. +*/ + rc = add_memory_driver_managed(data->mgid, cur_start, + cur_len, kmem_name, + mhp_flags); + + if (rc) { + dev_warn(dev, +"mapping%d: %#llx-%#llx memory add failed\n", +i, cur_start, cur_start + cur_len - 1); + remove_resource(res); + kfree(res); + data->res[i] = NULL; + if (mapped) + continue; + goto err_request_mem; + } + + cur_start += cur_len; + remaining -= cur_len; } mapped++; } Maybe the better alternative is to teach add_memory_resource()/try_remove_memory() to do that internally. In the add_memory_resource() case, it might be a loop around that memmap_on_memory + arch_add_memory code path (well, and the error path also needs adjustment): /* * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (!mhp_supports_memmap_on_memory(size)) { ret = -EINVAL; goto error; } mhp_altmap.free = PHYS_PFN(size); mhp_altmap.base_pfn = PHYS_PFN(start); params.al
Re: [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
On 16.06.23 00:00, Vishal Verma wrote: In preparation for the dax/kmem driver, which can be built as a module, to use this interface, export it with EXPORT_SYMBOL_GPL(). Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: Andrew Morton Cc: David Hildenbrand Cc: Oscar Salvador Cc: Dan Williams Cc: Dave Jiang Cc: Dave Hansen Cc: Huang Ying Signed-off-by: Vishal Verma --- mm/memory_hotplug.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index bb3845830922..92922080d3fa 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1328,6 +1328,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags) IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); return false; } +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory); /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
On 16.06.23 00:00, Vishal Verma wrote: For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the 'memmap_on_memory' module parameter was a hard requirement. In preparation for the dax/kmem driver to use memmap_on_memory semantics, arrange for the module parameter check to be bypassed via the appropriate mhp_flag. Recall that the kmem driver could contribute huge amounts of hotplugged memory originating from special purpose devices such as CXL memory expanders. In some cases memmap_on_memory may be the /only/ way this new memory can be hotplugged. Hence it makes sense for kmem to have a way to force memmap_on_memory without depending on a module param, if all the other conditions for it are met. Just let the admin configure it. After all, an admin is involved in configuring the dax/kmem device to begin with. If add_memory() fails you could give a useful hint to the admin. -- Cheers, David / dhildenb
Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
On 16.06.23 00:00, Vishal Verma wrote: The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. Arrange for this by first allowing for a module parameter override for the mhp_supports_memmap_on_memory() test using a flag, adjusting the only other caller of this interface in drivers/acpi/acpi_memoryhotplug.c, exporting the symbol so it can be called by kmem.c, and finally changing the kmem driver to add_memory() in chunks of memory_block_size_bytes(). 1) Why is the override a requirement here? Just let the admin configure it and then add conditional support for kmem. 2) I recall that there are cases where we don't want the memmap to land on slow memory (which online_movable would achieve). Just imagine the slow PMEM case. So this might need another configuration knob on the kmem side. I recall some discussions on doing that chunk handling internally (so kmem can just perform one add_memory() and we'd split that up internally). -- Cheers, David / dhildenb