from:"david"

Re: [PATCH] virt: acrn: Remove unusted list 'acrn_irqfd_clients'

2024-05-17 Thread Dr. David Alan Gilbert

* li...@treblig.org (li...@treblig.org) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> It doesn't look like this was ever used.
> 
> Build tested only.
> 
> Signed-off-by: Dr. David Alan Gilbert 

Ping

> ---
>  drivers/virt/acrn/irqfd.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/virt/acrn/irqfd.c b/drivers/virt/acrn/irqfd.c
> index d4ad211dce7a3..346cf0be4aac7 100644
> --- a/drivers/virt/acrn/irqfd.c
> +++ b/drivers/virt/acrn/irqfd.c
> @@ -16,8 +16,6 @@
>  
>  #include "acrn_drv.h"
>  
> -static LIST_HEAD(acrn_irqfd_clients);
> -
>  /**
>   * struct hsm_irqfd - Properties of HSM irqfd
>   * @vm:  Associated VM pointer
> -- 
> 2.45.0
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-05-16 Thread David Hildenbrand


On 16.05.24 19:44, Guillaume Morin wrote:

On 02 May  5:59, Guillaume Morin wrote:


On 30 Apr 21:25, David Hildenbrand wrote:

I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I'll have to have a closer look at some details (the hugepd writability
check looks a bit odd), but it's mostly what I would have expected!


Ok in the meantime, here is the uprobe change on your current
uprobes_cow trying to address the comments you made in your previous
message. Some of them were not 100% clear to me, so it's a best effort
patch :-) Again lightly tested


David, have you had a chance to take a look at both patches?


Not in detail, last weeks were busy (currently traveling back home from 
LSF/MM). I'll try to find time within the next two weeks to polish my 
changes and send them out. It would be great if you could send your 
stuff based on top of that then.


(the merge window just opened on Saturday, so we have plenty of time to 
make it to the next one :) )


--
Cheers,

David / dhildenb

Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread David Hildenbrand


On 14.05.24 20:17, Björn Töpel wrote:

Alexandre Ghiti  writes:


On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:


From: Björn Töpel 

Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
RISC-V.

Signed-off-by: Björn Töpel 
---
  arch/riscv/Kconfig | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6bec1bce6586..b9398b64bb69 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,6 +16,8 @@ config RISCV
 select ACPI_REDUCED_HARDWARE_ONLY if ACPI
 select ARCH_DMA_DEFAULT_COHERENT
 select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
+   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU


I think this should be SPARSEMEM_VMEMMAP here.


Hmm, care to elaborate? I thought that was optional.


There was a discussion at LSF/MM today to maybe require 
SPARSEMEM_VMEMMAP for hotplug. Would that work here as well?


--
Cheers,

David / dhildenb

Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump

2024-05-14 Thread David Hildenbrand


On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

During memory hot remove, the ptdump functionality can end up touching
stale data. Avoid any potential crashes (or worse), by holding the
memory hotplug read-lock while traversing the page table.

This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
Hold memory hotplug lock while walking for kernel page table dump").

Signed-off-by: Björn Töpel 
---
  arch/riscv/mm/ptdump.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 1289cc6d3700..9d5f657a251b 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,6 +6,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
  
  static int ptdump_show(struct seq_file *m, void *v)

  {
+   get_online_mems();
ptdump_walk(m, m->private);
+   put_online_mems();
  
  	return 0;

  }


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread David Hildenbrand


On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

For an architecture to support memory hotplugging, a couple of
callbacks needs to be implemented:

  arch_add_memory()
   This callback is responsible for adding the physical memory into the
   direct map, and call into the memory hotplugging generic code via
   __add_pages() that adds the corresponding struct page entries, and
   updates the vmemmap mapping.

  arch_remove_memory()
   This is the inverse of the callback above.

  vmemmap_free()
   This function tears down the vmemmap mappings (if
   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
   backing vmemmap pages. Note that for persistent memory, an
   alternative allocator for the backing pages can be used; The
   vmem_altmap. This means that when the backing pages are cleared,
   extra care is needed so that the correct deallocation method is
   used.

  arch_get_mappable_range()
   This functions returns the PA range that the direct map can map.
   Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel 
---
  arch/riscv/mm/init.c | 242 +++
  1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
}
  }
  #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (!pte_none(*pte))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+   pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (!pmd_none(*pmd))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+   pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+   p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+  struct vmem_altmap *altmap)
+{
+   if (altmap)
+   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+   else
+   free_pages((unsigned long)page_address(page), get_order(size));


If you unplug a DIMM that was added during boot (can happen on x86-64, 
can it happen on riscv?), free_pages() would not be sufficient. You'd be 
freeing a PG_reserved page that has to be freed differently.


--
Cheers,

David / dhildenb

Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread David Hildenbrand


On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Add a parameter to the direct map setup function, so it can be used in
arch_add_memory() later.

Signed-off-by: Björn Töpel 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v2 2/8] riscv: mm: Change attribute from init to meminit for page functions

2024-05-14 Thread David Hildenbrand


On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Prepare for memory hotplugging support by changing from __init to
__meminit for the page table functions that are used by the upcoming
architecture specific callbacks.

Changing the __init attribute to __meminit, avoids that the functions
are removed after init. The __meminit attribute makes sure the
functions are kept in the kernel text post init, but only if memory
hotplugging is enabled for the build.

Also, make sure that the altmap parameter is properly passed on to
vmemmap_populate_hugepages().

Signed-off-by: Björn Töpel 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V

2024-05-14 Thread David Hildenbrand


On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Now that RISC-V has memory hotplugging support, virtio-mem can be used
on the platform.

Signed-off-by: Björn Töpel 
---
  drivers/virtio/Kconfig | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index c17193544268..4e5cebf1b82a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -122,7 +122,7 @@ config VIRTIO_BALLOON
  
  config VIRTIO_MEM

tristate "Virtio mem driver"
-   depends on X86_64 || ARM64
+   depends on X86_64 || ARM64 || RISCV
depends on VIRTIO
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE



Nice!

Acked-by: David Hildenbrand 
--
Cheers,

David / dhildenb

Re: [PATCH v23 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-10 Thread David Hildenbrand

+   cpu_buffer->subbuf_ids = subbuf_ids;
+
+   meta->meta_page_size = PAGE_SIZE;
+   meta->meta_struct_len = sizeof(*meta);
+   meta->nr_subbufs = nr_subbufs;
+   meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+
+   rb_update_meta_page(cpu_buffer);
+}
+
+static struct ring_buffer_per_cpu *
+rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu)
+{
+   struct ring_buffer_per_cpu *cpu_buffer;
+
+   if (!cpumask_test_cpu(cpu, buffer->cpumask))
+   return ERR_PTR(-EINVAL);
+
+   cpu_buffer = buffer->buffers[cpu];
+
+   mutex_lock(_buffer->mapping_lock);
+
+   if (!cpu_buffer->mapped) {
+   mutex_unlock(_buffer->mapping_lock);
+   return ERR_PTR(-ENODEV);
+   }
+
+   return cpu_buffer;
+}
+
+static void rb_put_mapped_buffer(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   mutex_unlock(_buffer->mapping_lock);
+}
+
+/*
+ * Fast-path for rb_buffer_(un)map(). Called whenever the meta-page doesn't 
need
+ * to be set-up or torn-down.
+ */
+static int __rb_inc_dec_mapped(struct ring_buffer_per_cpu *cpu_buffer,
+  bool inc)
+{
+   unsigned long flags;
+
+   lockdep_assert_held(_buffer->mapping_lock);
+
+   if (inc && cpu_buffer->mapped == UINT_MAX)
+   return -EBUSY;
+
+   if (WARN_ON(!inc && cpu_buffer->mapped == 0))
+   return -EINVAL;
+
+   mutex_lock(_buffer->buffer->mutex);
+   raw_spin_lock_irqsave(_buffer->reader_lock, flags);
+
+   if (inc)
+   cpu_buffer->mapped++;
+   else
+   cpu_buffer->mapped--;
+
+   raw_spin_unlock_irqrestore(_buffer->reader_lock, flags);
+   mutex_unlock(_buffer->buffer->mutex);
+
+   return 0;
+}
+
+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND).


Coment a bit outdated.

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-10 Thread David Hildenbrand


On 08.05.24 04:34, Steven Rostedt wrote:

On Tue, 30 Apr 2024 12:13:51 +0100
Vincent Donnefort  wrote:


+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);


Do we really need the VM_IO?

When testing this in gdb, I would get:

(gdb) p tmap->map->subbuf_size
Cannot access memory at address 0x77fc2008

It appears that you can't ptrace IO memory. When I removed that flag,
gdb has no problem reading that memory.

I think we should drop that flag.

Can you send a v23 with that removed, Shuah's update, and also the
change below:


+
+   lockdep_assert_held(_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }


The above can be made to:

while (p < nr_pages) {
struct page *page;
int off = 0;

if (WARN_ON_ONCE(s >= nr_subbufs))
break;


I'm not particularly happy about us calling vm_insert_pages with NULL 
pointers stored in pages.


Should we instead do

if (WARN_ON_ONCE(s >= nr_subbufs)) {
err = -EINVAL;
goto out;
}

?

--
Cheers,

David / dhildenb

Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-10 Thread David Hildenbrand


On 09.05.24 13:05, Vincent Donnefort wrote:

On Tue, May 07, 2024 at 10:34:02PM -0400, Steven Rostedt wrote:

On Tue, 30 Apr 2024 12:13:51 +0100
Vincent Donnefort  wrote:


+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);


Do we really need the VM_IO?

When testing this in gdb, I would get:

(gdb) p tmap->map->subbuf_size
Cannot access memory at address 0x77fc2008

It appears that you can't ptrace IO memory. When I removed that flag,
gdb has no problem reading that memory.


Yeah, VM_IO indeed implies DONTDUMP. VM_IO was part of Linus recommendations.


Yes, the VM should recognize that memory to some degree as being special 
already due to VM_MIXEDMAP and VM_DONTEXPAND.


#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

So any of these flag achieve that (e.g., mlock_fixup() checks 
VM_SPECIAL). KSM similarly skips VM_DONTEXPAND and VM_MIXEDMAP (likely 
we should be using VM_SPECIAL in vma_ksm_compatible()). Not sure about 
page migration, likely its fine.


Thinking about MADV_DONTNEED, I can spot in 
madvise_dontneed_free_valid_vma() only that we disallow primarily VM_PFNMAP.


... I assume if user space MADV_DONTNEED's some pages we'll simply get a 
page fault later on access that will SIGBUS, handling that gracefully 
(we should double-check!).




But perhaps, VM_DONTEXPAND and MIXEDMAP (implicitely set by vm_insert_pages) are
enough protection?


Do we want to dump these pages? VM_DONTDUMP might be reasonabe then.



I don't see how anything could use GUP there and as David pointed-out on the
previous version, it doesn't event prevent the GUP-fast path.


Yes, GUP-fast would still have worked under some conditions.

--
Cheers,

David / dhildenb

Re: [PATCH] ftrace: Remove unused global 'ftrace_direct_func_count'

2024-05-06 Thread Dr. David Alan Gilbert

* li...@treblig.org (li...@treblig.org) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> Commit 8788ca164eb4b ("ftrace: Remove the legacy _ftrace_direct API")
> stopped setting the 'ftrace_direct_func_count' variable, but left
> it around.  Clean it up.
> 
> Signed-off-by: Dr. David Alan Gilbert 

FYI this is on top of my earlier 'ftrace: Remove unused list 
'ftrace_direct_funcs'

Dave

> ---
>  include/linux/ftrace.h |  2 --
>  kernel/trace/fgraph.c  | 11 ---
>  kernel/trace/ftrace.c  |  1 -
>  3 files changed, 14 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index b01cca36147ff..e3a83ebd1b333 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -413,7 +413,6 @@ struct ftrace_func_entry {
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> -extern int ftrace_direct_func_count;
>  unsigned long ftrace_find_rec_direct(unsigned long ip);
>  int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr);
>  int unregister_ftrace_direct(struct ftrace_ops *ops, unsigned long addr,
> @@ -425,7 +424,6 @@ void ftrace_stub_direct_tramp(void);
>  
>  #else
>  struct ftrace_ops;
> -# define ftrace_direct_func_count 0
>  static inline unsigned long ftrace_find_rec_direct(unsigned long ip)
>  {
>   return 0;
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index c83c005e654e3..a130b2d898f7c 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -125,17 +125,6 @@ int function_graph_enter(unsigned long ret, unsigned 
> long func,
>  {
>   struct ftrace_graph_ent trace;
>  
> -#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> - /*
> -  * Skip graph tracing if the return location is served by direct 
> trampoline,
> -  * since call sequence and return addresses are unpredictable anyway.
> -  * Ex: BPF trampoline may call original function and may skip frame
> -  * depending on type of BPF programs attached.
> -  */
> - if (ftrace_direct_func_count &&
> - ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
> - return -EBUSY;
> -#endif
>   trace.func = func;
>   trace.depth = ++current->curr_ret_depth;
>  
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b18b4ece3d7c9..adf34167c3418 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2538,7 +2538,6 @@ ftrace_find_unique_ops(struct dyn_ftrace *rec)
>  /* Protected by rcu_tasks for reading, and direct_mutex for writing */
>  static struct ftrace_hash __rcu *direct_functions = EMPTY_HASH;
>  static DEFINE_MUTEX(direct_mutex);
> -int ftrace_direct_func_count;
>  
>  /*
>   * Search the direct_functions hash to see if the given instruction pointer
> -- 
> 2.45.0
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/

Re: ftrace_direct_func_count ?

2024-05-06 Thread Dr. David Alan Gilbert

* Steven Rostedt (rost...@goodmis.org) wrote:
> On Sat, 4 May 2024 13:35:26 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > Hi,
> >   I've just posted a patch 'ftrace: Remove unused list 
> > 'ftrace_direct_funcs''
> > that clears out some old code, but while at it I noticed the global
> > 'ftrace_direct_func_count'.
> > 
> > As far as I can tell, it's never assigned (or initialised) but it is tested:
> > 
> > kernel/trace/fgraph.c:
> > #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >   /*
> >* Skip graph tracing if the return location is served by direct 
> > trampoline,
> >* since call sequence and return addresses are unpredictable anyway.
> >* Ex: BPF trampoline may call original function and may skip frame
> >* depending on type of BPF programs attached.
> >*/
> >   if (ftrace_direct_func_count &&
> >   ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
> > return -EBUSY;
> > #endif
> > 
> > So I wasn't sure whether it was just safe to nuke that section
> > or whether it really needed fixing?
> 
> Yes, after commit 8788ca164eb4bad ("ftrace: Remove the legacy
> _ftrace_direct API") that variable is no longer used.

OK, thanks, I'll send a follow up patch to my other patch to nuke
this as well.

Dave

> -- Steve
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/

RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-04 Thread David Laight

From: Waiman Long
> Sent: 03 May 2024 23:14
> 
> 
> On 5/3/24 17:10, David Laight wrote:
> > From: Waiman Long
> >> Sent: 03 May 2024 17:00
> > ...
> >> David,
> >>
> >> Could you respin the series based on the latest upstream code?
> > I've just reapplied the patches to 'master' and they all apply
> > cleanly and diffing the new patches to the old ones gives no differences.
> > So I think they should still apply.
> >
> > Were you seeing a specific problem?
> >
> > I don't remember any suggested changed either.
> > (Apart from a very local variable I used to keep a patch isolated.)
> 
> No, I just want to make sure that your patches will still apply. Anyway,
> it will be easier for the maintainer to merge your remaining patches if
> you can send out a new version even if they are almost the same as the
> old ones.

I don't think any changes are needed.
So the existing versions are fine.
They applied (well my copy of what I think I sent applied) and built.
So there shouldn't be any issues.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

ftrace_direct_func_count ?

2024-05-04 Thread Dr. David Alan Gilbert

Hi,
  I've just posted a patch 'ftrace: Remove unused list 'ftrace_direct_funcs''
that clears out some old code, but while at it I noticed the global
'ftrace_direct_func_count'.

As far as I can tell, it's never assigned (or initialised) but it is tested:

kernel/trace/fgraph.c:
#ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
  /*
   * Skip graph tracing if the return location is served by direct trampoline,
   * since call sequence and return addresses are unpredictable anyway.
   * Ex: BPF trampoline may call original function and may skip frame
   * depending on type of BPF programs attached.
   */
  if (ftrace_direct_func_count &&
  ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
return -EBUSY;
#endif

So I wasn't sure whether it was just safe to nuke that section
or whether it really needed fixing?

Dave

-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/

RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-03 Thread David Laight

From: Waiman Long
> Sent: 03 May 2024 17:00
...
> David,
> 
> Could you respin the series based on the latest upstream code?

I've just reapplied the patches to 'master' and they all apply
cleanly and diffing the new patches to the old ones gives no differences.
So I think they should still apply.

Were you seeing a specific problem?

I don't remember any suggested changed either.
(Apart from a very local variable I used to keep a patch isolated.)

    David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-05-03 Thread David Laight

From: Waiman Long
> Sent: 03 May 2024 17:00
> To: David Laight ; 'linux-kernel@vger.kernel.org' 
>  ker...@vger.kernel.org>; 'pet...@infradead.org' 
> Cc: 'mi...@redhat.com' ; 'w...@kernel.org' 
> ; 'boqun.f...@gmail.com'
> ; 'Linus Torvalds' ; 
> 'virtualization@lists.linux-
> foundation.org' ; 'Zeng Heng' 
> 
> Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and 
> per_cpu_ptr().
> 
> 
> On 12/31/23 23:14, Waiman Long wrote:
> >
> > On 12/31/23 16:55, David Laight wrote:
> >> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> >> This requires the cpu number be 64bit.
> >> However the value is osq_lock() comes from a 32bit xchg() and there
> >> isn't a way of telling gcc the high bits are zero (they are) so
> >> there will always be an instruction to clear the high bits.
> >>
> >> The cpu number is also offset by one (to make the initialiser 0)
> >> It seems to be impossible to get gcc to convert
> >> __per_cpu_offset[cpu_p1 - 1]
> >> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
> >> address).
> >>
> >> Converting the cpu number to 32bit unsigned prior to the decrement means
> >> that gcc knows the decrement has set the high bits to zero and doesn't
> >> add a register-register move (or cltq) to zero/sign extend the value.
> >>
> >> Not massive but saves two instructions.
> >>
> >> Signed-off-by: David Laight 
> >> ---
> >>   kernel/locking/osq_lock.c | 6 ++
> >>   1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index 35bb99e96697..37a4fa872989 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> >>   return cpu_nr + 1;
> >>   }
> >>   -static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
> >> encoded_cpu_val)
> >>   {
> >> -    int cpu_nr = encoded_cpu_val - 1;
> >> -
> >> -    return per_cpu_ptr(_node, cpu_nr);
> >> +    return per_cpu_ptr(_node, encoded_cpu_val - 1);
> >>   }
> >>     /*
> >
> > You really like micro-optimization.
> >
> > Anyway,
> >
> > Reviewed-by: Waiman Long 
> >
> David,
> 
> Could you respin the series based on the latest upstream code?

Looks like a wet bank holiday weekend.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-02 Thread David Hildenbrand


On 30.04.24 13:13, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

   ring_buffer_{map,unmap}()

And controls on the ring-buffer:

   ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

   A unique ID for each subbuf of the ring-buffer, currently they are
   only identified through their in-kernel VA.

   A meta-page, where are stored ring-buffer statistics and a
   description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h


[...]


+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);
+
+   lockdep_assert_held(_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, _pages);


Nit: I did not immediately understand if we could end here with p < 
nr_pages (IOW, pages[] not completely filled).


One source of confusion is the "s < nr_subbufs" check in the while loop: 
why is "p < nr_pages" insufficient?



For the MM bits:

Acked-by: David Hildenbrand 



--
Cheers,

David / dhildenb

Re: [PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP

2024-05-02 Thread David Hildenbrand


On 30.04.24 13:13, Vincent Donnefort wrote:

In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread David Hildenbrand


On 26.04.24 21:55, Guillaume Morin wrote:

On 26 Apr  9:19, David Hildenbrand wrote:

A couple of points:

a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
to check PageAnonExclusive.

b) If you're not following the can_follow_write_pte/_pmd model, you are
doing something wrong :)

c) The code was heavily changed in mm/mm-unstable. It was merged with t
the common code.

Likely, in mm/mm-unstable, the existing can_follow_write_pte and
can_follow_write_pmd checks will already cover what you want in most cases.

We'd need a can_follow_write_pud() to cover follow_huge_pud() and
(unfortunately) something to handle follow_hugepd() as well similarly.

Copy-pasting what we do in can_follow_write_pte() and adjusting for
different PTE types is the right thing to do. Maybe now it's time to factor
out the common checks into a separate helper.


I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I'll have to have a closer look at some details (the hugepd writability 
check looks a bit odd), but it's mostly what I would have expected!


--
Cheers,

David / dhildenb

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-30 Thread David Hildenbrand


On 30.04.24 17:22, Guillaume Morin wrote:

On 26 Apr 21:55, Guillaume Morin wrote:


On 26 Apr  9:19, David Hildenbrand wrote:

A couple of points:

a) Don't use page_mapcount(). Either folio_mapcount(), but likely you want
to check PageAnonExclusive.

b) If you're not following the can_follow_write_pte/_pmd model, you are
doing something wrong :)

c) The code was heavily changed in mm/mm-unstable. It was merged with t
the common code.

Likely, in mm/mm-unstable, the existing can_follow_write_pte and
can_follow_write_pmd checks will already cover what you want in most cases.

We'd need a can_follow_write_pud() to cover follow_huge_pud() and
(unfortunately) something to handle follow_hugepd() as well similarly.

Copy-pasting what we do in can_follow_write_pte() and adjusting for
different PTE types is the right thing to do. Maybe now it's time to factor
out the common checks into a separate helper.


I tried to get the hugepd stuff right but this was the first I heard
about it :-) Afaict follow_huge_pmd and friends were already DTRT


I got it working on top of your uprobes-cow branch with the foll force
patch sent friday. Still pretty lightly tested


Sorry for not replying earlier, was busy with other stuff. I'll try 
getiing that stuff into shape and send it out soonish.




I went with using one write uprobe function with some additional
branches. I went back and forth between that and making them 2 different
functions.


All the folio_test_hugetlb() special casing is a bit suboptimal. Likely 
we want a separate variant, because we should be sing hugetlb PTE 
functions consistently (e.g., huge_pte_uffd_wp() vs pte_uffd_wp(), 
softdirty does not exist etc.)




diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f4e88552d3f..8a33e380f7ea 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -83,6 +83,10 @@ static const struct fs_parameter_spec 
hugetlb_fs_parameters[] = {
{}
  };
  
+bool hugetlbfs_mapping(struct address_space *mapping) {

+   return mapping->a_ops == _aops;


is_vm_hugetlb_page() might be what you are looking for.

[...]


  }
  
-static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, int len)

+static void copy_from_page(struct page *page, unsigned long vaddr, void *dst, 
int len, unsigned long page_mask)
  {
void *kaddr = kmap_atomic(page);
-   memcpy(dst, kaddr + (vaddr & ~PAGE_MASK), len);
+   memcpy(dst, kaddr + (vaddr & ~page_mask), len);
kunmap_atomic(kaddr);
  }


  
-static void copy_to_page(struct page *page, unsigned long vaddr, const void *src, int len)

+static void copy_to_page(struct page *page, unsigned long vaddr, const void 
*src, int len, unsigned long page_mask)
  {
void *kaddr = kmap_atomic(page);
-   memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len);
+   memcpy(kaddr + (vaddr & ~page_mask), src, len);
kunmap_atomic(kaddr);
  }


These two changes really are rather ugly ...

An why are they even required? We get a PAGE_SIZED-based subpage of a 
hugetlb page. We only kmap that one and copy within that one.


In other words, I don't think the copy_from_page() and copy_to_page() 
changes are even required when we consistently work on subpages and not 
suddenly on head pages.


  
-static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)

+static int verify_opcode(struct page *page, unsigned long vaddr, 
uprobe_opcode_t *new_opcode, unsigned long page_mask)
  {
uprobe_opcode_t old_opcode;
bool is_swbp;
@@ -191,7 +192,8 @@ static int verify_opcode(struct page *page, unsigned long 
vaddr, uprobe_opcode_t
 * is a trap variant; uprobes always wins over any other (gdb)
 * breakpoint.
 */
-   copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE);
+   copy_from_page(page, vaddr, _opcode, UPROBE_SWBP_INSN_SIZE,
+  page_mask);
is_swbp = is_swbp_insn(_opcode);
  
  	if (is_swbp_insn(new_opcode)) {

@@ -376,8 +378,8 @@ struct uwo_data {
uprobe_opcode_t opcode;
  };
  
-static int __write_opcode_pte(pte_t *ptep, unsigned long vaddr,

-   unsigned long next, struct mm_walk *walk)
+static int __write_opcode(pte_t *ptep, unsigned long vaddr,
+ unsigned long page_mask, struct mm_walk *walk)



Unrelated alignment change.


  {
struct uwo_data *data = walk->private;;
const bool is_register = !!is_swbp_insn(>opcode);
@@ -415,9 +417,12 @@ static int __write_opcode_pte(pte_t *ptep, unsigned long 
vaddr,
  
  	/* Unmap + flush the TLB, such that we can write atomically .*/

flush_cache_page(vma, vaddr, pte_pfn(pte));
-   pte = ptep_clear_flush(vma, vaddr, ptep);
+   if (folio_test_hugetlb(folio))
+   pte = huge_ptep_clear_flush(vma, vaddr, ptep);
+   else
+   pte = ptep_clear_flush(vma, vaddr, ptep);
copy_to_page(page, da

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-26 Thread David Hildenbrand


On 26.04.24 02:09, Guillaume Morin wrote:

On 25 Apr 21:56, David Hildenbrand wrote:


On 25.04.24 17:19, Guillaume Morin wrote:

On 24 Apr 23:00, David Hildenbrand wrote:

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to fence
it off. Shouldn't be too hard to implement (famous last words) and would be
the cleaner thing to use here once I manage to switch over to
FOLL_WRITE|FOLL_FORCE to break COW.


Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.



I just pushed something to
https://github.com/davidhildenbrand/linux/tree/uprobes_cow

Only very lightly tested so far. Expect the worst :)



I'll try it out and send you the hugetlb bits



I still detest having the zapping logic there, but to get it all right I
don't see a clean way around that.


For hugetlb, we'd primarily have to implement the
mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE).


For FOLL_FORCE, heer is my draft. Let me know if this is what you had in
mind.


diff --git a/mm/gup.c b/mm/gup.c
index 1611e73b1121..ac60e0ae64e8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1056,9 +1056,6 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
if (!(gup_flags & FOLL_FORCE))
return -EFAULT;
-   /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
-   if (is_vm_hugetlb_page(vma))
-   return -EFAULT;
/*
 * We used to let the write,force case do COW in a
 * VM_MAYWRITE VM_SHARED !VM_WRITE vma, so ptrace could
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3548eae42cf9..73f86eddf888 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5941,7 +5941,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
   struct folio *pagecache_folio, spinlock_t *ptl,
   struct vm_fault *vmf)
  {
-   const bool unshare = flags & FAULT_FLAG_UNSHARE;
+   const bool make_writable = !(flags & FAULT_FLAG_UNSHARE) &&
+   (vma->vm_flags & VM_WRITE);
pte_t pte = huge_ptep_get(ptep);
struct hstate *h = hstate_vma(vma);
struct folio *old_folio;
@@ -5959,16 +5960,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, 
struct vm_area_struct *vma,
 * can trigger this, because hugetlb_fault() will always resolve
 * uffd-wp bit first.
 */
-   if (!unshare && huge_pte_uffd_wp(pte))
+   if (make_writable && huge_pte_uffd_wp(pte))
return 0;
  
-	/*

-* hugetlb does not support FOLL_FORCE-style write faults that keep the
-* PTE mapped R/O such as maybe_mkwrite() would do.
-*/
-   if (WARN_ON_ONCE(!unshare && !(vma->vm_flags & VM_WRITE)))
-   return VM_FAULT_SIGSEGV;
-
/* Let's take out MAP_SHARED mappings first. */
if (vma->vm_flags & VM_MAYSHARE) {
set_huge_ptep_writable(vma, haddr, ptep);
@@ -5989,7 +5983,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct 
vm_area_struct *vma,
folio_move_anon_rmap(old_folio, vma);
SetPageAnonExclusive(_folio->page);
}
-   if (likely(!unshare))
+   if (likely(make_writable))
set_huge_ptep_writable(vma, haddr, ptep);


Maybe we want to refactor that similarly into a 
set_huge_ptep_maybe_writable, and handle the VM_WRITE check internally.


Then, here you'd do

if (unshare)
set_huge_ptep(vma, haddr, ptep);
else
set_huge_ptep_maybe_writable(vma, haddr, ptep);

Something like that.




/* Break COW or unshare */
huge_ptep_clear_flush(vma, haddr, ptep);
@@ -6883,6 +6878,17 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
  }
  #endif /* CONFIG_USERFAULTFD */
  
+static bool is_force_follow(struct vm_area_struct* vma, unsigned int flags,

+struct p

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-25 Thread David Hildenbrand


On 25.04.24 17:19, Guillaume Morin wrote:

On 24 Apr 23:00, David Hildenbrand wrote:

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to fence
it off. Shouldn't be too hard to implement (famous last words) and would be
the cleaner thing to use here once I manage to switch over to
FOLL_WRITE|FOLL_FORCE to break COW.


Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.



I just pushed something to
https://github.com/davidhildenbrand/linux/tree/uprobes_cow

Only very lightly tested so far. Expect the worst :)

I still detest having the zapping logic there, but to get it all right I 
don't see a clean way around that.



For hugetlb, we'd primarily have to implement the 
mm_walk_ops->hugetlb_entry() callback (well, and FOLL_FORCE).


Likely vaddr and PAGE_SIZE in uprobe_write_opcode() would have to be 
expanded to cover the full hugetlb page.


--
Cheers,

David / dhildenb

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-25 Thread David Hildenbrand


On 25.04.24 17:19, Guillaume Morin wrote:

On 24 Apr 23:00, David Hildenbrand wrote:

One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to fence
it off. Shouldn't be too hard to implement (famous last words) and would be
the cleaner thing to use here once I manage to switch over to
FOLL_WRITE|FOLL_FORCE to break COW.


Yes, my patch seems to be working. The hugetlb code is pretty simple.
And it allows ptrace and the proc pid mem file to work on the executable
private hugetlb mappings.

There is one thing I am unclear about though. hugetlb enforces that
huge_pte_write() is true on FOLL_WRITE in both the fault and
follow_page_mask paths. I am not sure if we can simply assume in the
hugetlb code that if the pte is not writable and this is a write fault
then we're in the FOLL_FORCE|FOLL_WRITE case.  Or do we want to keep the
checks simply not enforce it for FOLL_FORCE|FOLL_WRITE?

The latter is more complicated in the fault path because there is no
FAULT_FLAG_FORCE flag.



handle_mm_fault()->sanitize_fault_flags() makes sure that we'll only 
proceed with a fault either if

* we have VM_WRITE set
* we are in a COW mapping (MAP_PRIVATE with at least VM_MAYWRITE)

Once you see FAULT_FLAG_WRITE and you do have VM_WRITE, you don't care 
about FOLL_FORCE, it's simply a write fault.


Once you see FAULT_FLAG_WRITE and you *don't* have VM_WRITE, you must 
have VM_MAYWRITE and are essentially in FOLL_FORCE.


In a VMA without VM_WRITE, you must never map a PTE writable. In 
ordinary COW code, that's done in wp_page_copy(), where we *always* use 
maybe_mkwrite(), to do exactly what a write fault would do, but without 
mapping the PTE writable.


That's what the whole can_follow_write_pmd()/can_follow_write_pte() is 
about: writing to PTEs that are not writable.


You'll have to follow the exact same model in hugetlb 
(can_follow_write_pmd(), hugetlb_maybe_mkwrite(), ...).


--
Cheers,

David / dhildenb

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-24 Thread David Hildenbrand


On 24.04.24 22:44, Guillaume Morin wrote:

On 24 Apr 22:09, David Hildenbrand wrote:

Let me try to see if we can get this done cleaner.

One ugly part (in general here) is the custom page replacement in the
registration part.

We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages
ourselves (which we likely shouldn't do ...) ... maybe we could use
FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio
populated. (like KSM does nowadays)

Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but
using FOLL_WRITE would not work on many file systems. So maybe we have to
trigger an unsharing fault ourselves.


^ realizing that we already use FOLL_FORCE, so we can just use FOLL_WRITE to
break COW.


It was never clear to me why uprobes was not doing FOLL_WRITE in the
first place, I must say.


It's quite dated code ...

The use of FOLL_FORCE really is ugly here. When registering, we require 
VM_WRITE but ... when unregistering, we don't ...




One issue here is that FOLL_FORCE|FOLL_WRITE is not implemented for
hugetlb mappings. However this was also on my TODO and I have a draft
patch that implements it.


Yes, I documented it back then and added sanity checks in GUP code to 
fence it off. Shouldn't be too hard to implement (famous last words) and 
would be the cleaner thing to use here once I manage to switch over to 
FOLL_WRITE|FOLL_FORCE to break COW.


--
Cheers,

David / dhildenb

Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-24 Thread David Hildenbrand


On 24.04.24 22:31, Vincent Donnefort wrote:

Hi David,

Thanks for your quick response.

On Wed, Apr 24, 2024 at 05:26:39PM +0200, David Hildenbrand wrote:


I gave it some more thought, and I think we are still missing something (I
wish PFNMAP/MIXEDMAP wouldn't be that hard).


+
+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+


I'd add some comments here like

/* Refuse any MAP_PRIVATE or writable mappings. */

+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+


/*
  * Make sure the mapping cannot become writable later. Also, tell the VM
  * to not touch these pages pages (VM_DONTCOPY | VM_DONTDUMP) and tell
  * GUP to leave them alone as well (VM_IO).
  */

+   vm_flags_mod(vma,
+VM_MIXEDMAP | VM_PFNMAP |
+VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO,
+VM_MAYWRITE);


I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all and,
as stated, vm_insert_pages() even complains quite a lot when it would have
to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a very good
reason.

Can't we limit ourselves to VM_IO?

But then, I wonder if it really helps much regarding GUP: yes, it blocks
ordinary GUP (see check_vma_flags()) but as insert_page_into_pte_locked()
does *not* set pte_special(), GUP-fast (gup_fast_pte_range()) will not
reject it.

Really, if you want GUP-fast to reject it, remap_pfn_range() and friends are
the way to go, that will set pte_special() such that also GUP-fast will
leave it alone, just like vm_normal_page() would.

So ... I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it alone
won't stop all of GUP. We really have to mark the PTE as special, which
vm_insert_page() must not do (because it is refcounted!).


Hum, apologies, I am not sure to follow the connection here. Why do you think
the recommendation was to prevent GUP?


Ah, I'm hallucinating! :) "not let people play games with the mapping" to me
implied "make sure nobody touches it". If GUP is acceptable that makes stuff
a lot easier. VM_IO will block some GUP, but not all of it.





Which means: do we really have to stop GUP from grabbing that page?

Using vm_insert_page() only with VM_MIXEDMAP (and without VM_PFNMAP|VM_IO)
would be better.


Under the assumption we do not want to stop all GUP, why not using VM_IO over
VM_MIXEDMAP which is I believe more restrictive?


VM_MIXEDMAP will be implicitly set by vm_insert_page(). There is a lengthy 
comment
for vm_normal_page() that explains all this madness. VM_MIXEDMAP is primarily
relevant for COW mappings, which you just forbid completely.

remap_pfn_range_notrack() documents the semantics of some of the other flags:

 *   VM_IO tells people not to look at these pages
 *  (accesses can have side effects).
 *   VM_PFNMAP tells the core MM that the base pages are just
 *  raw PFN mappings, and do not have a "struct page" associated
 *  with them.
 *   VM_DONTEXPAND
 *  Disable vma merging and expanding with mremap().
 *   VM_DONTDUMP
 *  Omit vma from core dump, even when VM_IO turned off.

VM_PFNMAP is very likely really not what we want, unless we really perform raw
PFN mappings ... VM_IO we can set without doing much harm.

So I would suggest dropping VM_PFNMAP when using vm_insert_pages(), using only 
VM_IO
and likely just letting vm_insert_pages() set VM_MIXEDMAP for you.

[...]



vm_insert_pages() documents: "In case of error, we may have mapped a subset
of the provided pages. It is the caller's responsibility to account for this
case."

Which could for example happen, when allocating a page table fails.

Would we able to deal with that here?


As we are in the mmap path, on an error, I would expect the vma to be destroyed
and those pages whom insertion succeeded to be unmapped?



Ah, we simply fail ->mmap().

In mmap_region(), if call_mmap() failed, we "goto unmap_and_free_vma" where we 
have

/* Undo any partial mapping done by a device driver. */
unmap_region(mm, , vma, prev, next, vma->vm_start, vma->vm_end, 
vma->vm_end, true);



But perhaps shall we proactively zap_page_range_single()?


No mmap_region() should indeed be handling it correctly already!

--
Cheers,

David / dhildenb

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-24 Thread David Hildenbrand


On 22.04.24 22:53, Guillaume Morin wrote:

On 22 Apr 20:59, David Hildenbrand wrote:

The benefit - to me - is very clear. People do use hugetlb mappings to
run code in production environments. The perf benefits are there for some
workloads. Intel has published a whitepaper about it etc.
Uprobes are a very good tool to do live tracing. If you can restart the
process and reproduce, you should be able to disable hugetlb remapping
but if you need to look at a live process, there are not many options.
Not being able to use uprobes is crippling.


Please add all that as motivation to the patch description or cover letter.


Yes, libhugetlbfs exists. But why do we have to support uprobes with it?
Nobody cared until now, why care now?


I think you could ask the same question for every new feature patch :)


I have to, because it usually indicates a lack of motivation in the
cover-letter/patch description :P


My cover letter was indeed lacking. I will make sure to add this kind of
details next time.
  

Since the removal a few releases ago of the __morecore() hook in glibc,
the main feature of libhugetlbfs is ELF segments remapping. I think
there are definitely a lot of users that simply deal with this
unnecessary limitation.

I am certainly not shoving this patch through anyone's throat if there
is no interest. But we definitely find it a very useful feature ...


Let me try to see if we can get this done cleaner.

One ugly part (in general here) is the custom page replacement in the
registration part.

We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing pages
ourselves (which we likely shouldn't do ...) ... maybe we could use
FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio
populated. (like KSM does nowadays)

Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, but
using FOLL_WRITE would not work on many file systems. So maybe we have to
trigger an unsharing fault ourselves.


^ realizing that we already use FOLL_FORCE, so we can just use 
FOLL_WRITE to break COW.




That would do the page replacement for us and we "should" be able to lookup
an anonymous folio that we can then just modify, like ptrace would.

But then, there is also unregistration part, with weird conditional page
replacement. Zapping the anon page if the content matches the content of the
original page is one thing. But why are we placing an existing anonymous
page by a new anonymous page when the content from the original page differs
(but matches the one from the just copied page?)?

I'll have to further think about that one. It's all a bit nasty.


Sounds good to me. I am willing to help with the code when you have a
plan or testing as you see fit. Let me know.


I'm hacking on a redesign that removes the manual COW breaking logic and 
*might* make it easier to integrate hugetlb. (very likely, but until I 
have the redesign running I cannot promise anything :) )


I'll let you know once I have something ready so you could integrate the 
hugetlb portion.


--
Cheers,

David / dhildenb

Re: [PATCH v21 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-24 Thread David Hildenbrand




I gave it some more thought, and I think we are still missing something 
(I wish PFNMAP/MIXEDMAP wouldn't be that hard).



+
+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+


I'd add some comments here like

/* Refuse any MAP_PRIVATE or writable mappings. */

+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+


/*
 * Make sure the mapping cannot become writable later. Also, tell the VM
 * to not touch these pages pages (VM_DONTCOPY | VM_DONTDUMP) and tell
 * GUP to leave them alone as well (VM_IO).
 */

+   vm_flags_mod(vma,
+VM_MIXEDMAP | VM_PFNMAP |
+VM_DONTCOPY | VM_DONTDUMP | VM_DONTEXPAND | VM_IO,
+VM_MAYWRITE);


I am still really unsure about VM_PFNMAP ... it's not a PFNMAP at all 
and, as stated, vm_insert_pages() even complains quite a lot when it 
would have to set VM_MIXEDMAP and VM_PFNMAP is already set, likely for a 
very good reason.


Can't we limit ourselves to VM_IO?

But then, I wonder if it really helps much regarding GUP: yes, it blocks 
ordinary GUP (see check_vma_flags()) but as 
insert_page_into_pte_locked() does *not* set pte_special(), GUP-fast 
(gup_fast_pte_range()) will not reject it.


Really, if you want GUP-fast to reject it, remap_pfn_range() and friends 
are the way to go, that will set pte_special() such that also GUP-fast 
will leave it alone, just like vm_normal_page() would.


So ... I know Linus recommended VM_PFNMAP/VM_IO to stop GUP, but it 
alone won't stop all of GUP. We really have to mark the PTE as special, 
which vm_insert_page() must not do (because it is refcounted!).


Which means: do we really have to stop GUP from grabbing that page?

Using vm_insert_page() only with VM_MIXEDMAP (and without 
VM_PFNMAP|VM_IO) would be better.


If we want to stop all of GUP, remap_pfn_range() currently seems 
unavoidable :(




+
+   lockdep_assert_held(_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, _pages);


vm_insert_pages() documents: "In case of error, we may have mapped a 
subset of the provided pages. It is the caller's responsibility to 
account for this case."


Which could for example happen, when allocating a page table fails.

Would we able to deal with that here?


Again, I wish it would all be easier ...

--
Cheers,

David / dhildenb

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-23 Thread David Hildenbrand


On 22.04.24 22:31, Vincent Donnefort wrote:

On Mon, Apr 22, 2024 at 08:27:17PM +0200, David Hildenbrand wrote:

On 22.04.24 20:20, Vincent Donnefort wrote:

Hi David,

Thanks for having a look, very much appreciated!

On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:

On 19.04.24 20:25, David Hildenbrand wrote:

On 06.04.24 19:36, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbfs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -26,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf VA */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages

Re: [PATCH v3 1/4] virtio_balloon: separate vm events into a function

2024-04-23 Thread David Hildenbrand


On 23.04.24 05:41, zhenwei pi wrote:

All the VM events related statistics have dependence on
'CONFIG_VM_EVENT_COUNTERS', separate these events into a function to
make code clean. Then we can remove 'CONFIG_VM_EVENT_COUNTERS' from
'update_balloon_stats'.

Signed-off-by: zhenwei pi 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [RFC][PATCH] uprobe: support for private hugetlb mappings

2024-04-22 Thread David Hildenbrand


On 22.04.24 20:11, Guillaume Morin wrote:

(Dropping Mike Kravetz as CC since he has retired and his email is no
longer valid, adding Muchun since he's the current hugetlb maintainer,
as well as linux-trace-kernel)

On 22 Apr 11:39, David Hildenbrand wrote:


On 19.04.24 20:37, Guillaume Morin wrote:

libhugetlbfs, the Intel iodlr code both allow to remap .text onto a
hugetlb private mapping. It's also pretty easy to do it manually.
One drawback of using this functionality is the lack of support for
uprobes (NOTE uprobe ignores shareable vmas)

This patch adds support for private hugetlb mappings.  It does require exposing
some hugetlbfs innards and relies on copy_user_large_folio which is only
available when CONFIG_HUGETLBFS is used so I had to use an ugly #ifdef

If there is some interest in applying this patch in some form or
another, I am open to any refactoring suggestions (esp getting rid the
#ifdef in uprobes.c) . I tried to limit the
amount of branching.


All that hugetlb special casing  oh my. What's the benefit why we should
be interested in making that code less clean -- to phrase it in a nice way
;) ?


I do appreciate the nice phrasing. Believe me, I did try to limit the
special casing to a minimum :-).

Outside of __replace_page, I added only 3-ish branches so I do not think
it's *too* bad. The uprobe code is using PAGE_{SHIFT,MASK} quite liberally so I
had to add calls to retrieve these for the hugetlb vmas.

__replace_page has a lot of special casing. I certainly agree (and
unfortunately for me it's at the beginning of the patch :)).  It's doing
something pretty uncommon outside of the mm code so it has to make a
bunch of specific hugetlb calls. I am not quite sure how to improve it
but if you have suggestions, I'd be happy to refactor.


See below.



The benefit - to me - is very clear. People do use hugetlb mappings to
run code in production environments. The perf benefits are there for some
workloads. Intel has published a whitepaper about it etc.
Uprobes are a very good tool to do live tracing. If you can restart the
process and reproduce, you should be able to disable hugetlb remapping
but if you need to look at a live process, there are not many options.
Not being able to use uprobes is crippling.


Please add all that as motivation to the patch description or cover letter.




Yes, libhugetlbfs exists. But why do we have to support uprobes with it?
Nobody cared until now, why care now?


I think you could ask the same question for every new feature patch :)


I have to, because it usually indicates a lack of motivation in the 
cover-letter/patch description :P


People will have to maintain that code, and maintaining hugetlb code in 
odd places is no fun ...




Since the removal a few releases ago of the __morecore() hook in glibc,
the main feature of libhugetlbfs is ELF segments remapping. I think
there are definitely a lot of users that simply deal with this
unnecessary limitation.

I am certainly not shoving this patch through anyone's throat if there
is no interest. But we definitely find it a very useful feature ...


Let me try to see if we can get this done cleaner.

One ugly part (in general here) is the custom page replacement in the 
registration part.


We are guaranteed to have a MAP_PRIVATE mapping. Instead of replacing 
pages ourselves (which we likely shouldn't do ...) ... maybe we could 
use FAULT_FLAG_UNSHARE faults such that we will get an anonymous folio 
populated. (like KSM does nowadays)


Punching FOLL_PIN|FOLL_LONGTERM into GUP would achieve the same thing, 
but using FOLL_WRITE would not work on many file systems. So maybe we 
have to trigger an unsharing fault ourselves.


That would do the page replacement for us and we "should" be able to 
lookup an anonymous folio that we can then just modify, like ptrace would.


But then, there is also unregistration part, with weird conditional page 
replacement. Zapping the anon page if the content matches the content of 
the original page is one thing. But why are we placing an existing 
anonymous page by a new anonymous page when the content from the 
original page differs (but matches the one from the just copied page?)?


I'll have to further think about that one. It's all a bit nasty.


One thing to note is that hugetlb folios don't grow on trees. Likely, 
Many setups *don't* reserve extra hugetlb folios and you might just 
easily be running out of free hugetlb folios that you can use to break 
COW here (replace a file hugetlb by a fresh anon hugetlb page). Likely 
it's easy to make register or unregister fail.


--
Cheers,

David / dhildenb

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-22 Thread David Hildenbrand


On 22.04.24 20:20, Vincent Donnefort wrote:

Hi David,

Thanks for having a look, very much appreciated!

On Mon, Apr 22, 2024 at 11:27:11AM +0200, David Hildenbrand wrote:

On 19.04.24 20:25, David Hildenbrand wrote:

On 06.04.24 19:36, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

 ring_buffer_{map,unmap}()

And controls on the ring-buffer:

 ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

 A unique ID for each subbuf of the ring-buffer, currently they are
 only identified through their in-kernel VA.

 A meta-page, where are stored ring-buffer statistics and a
 description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
#include 
#include 
+#include 
+
struct trace_buffer;
struct ring_buffer_iter;
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
#define trace_rb_cpu_prepareNULL
#endif
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
#endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbfs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
#include 
#include 
#include 
+#include 
#include 
#include 
#include 
@@ -26,6 +27,7 @@
#include 
#include 
#include 
+#include 
#include 
#include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
};
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf VA */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct lis

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-22 Thread David Hildenbrand


On 19.04.24 20:25, David Hildenbrand wrote:

On 06.04.24 19:36, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

ring_buffer_{map,unmap}()

And controls on the ring-buffer:

ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

A unique ID for each subbuf of the ring-buffer, currently they are
only identified through their in-kernel VA.

A meta-page, where are stored ring-buffer statistics and a
description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
   #include 
   #include 
   
+#include 

+
   struct trace_buffer;
   struct ring_buffer_iter;
   
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node);

   #define trace_rb_cpu_prepare NULL
   #endif
   
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,

+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
   #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbfs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..793ecc454039 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
   #include 
@@ -26,6 +27,7 @@
   #include 
   #include 
   #include 
+#include 
   
   #include 

   #include 
@@ -338,6 +340,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
   };
   
@@ -484,6 +487,12 @@ struct ring_buffer_per_cpu {

u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf VA */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1599,6 +1608,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(_buffer->irq_

Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread David Hildenbrand


On 22.04.24 09:42, zhenwei pi wrote:

All the VM events related statistics have dependence on
'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any
VM events in future, we would have codes like:
  #ifdef CONFIG_VM_EVENT_COUNTERS
   unsigned long foo;
  #endif
   ...
  #ifdef CONFIG_VM_EVENT_COUNTERS
   foo = events[XXX] + events[YYY];
   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
  #endif

Separate vm events into a single function, also remove
'CONFIG_VM_EVENT_COUNTERS' from 'update_balloon_stats'.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 44 ++---
  1 file changed, 29 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..59fe157e5722 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -316,34 +316,48 @@ static inline void update_stat(struct virtio_balloon *vb, 
int idx,
  
  #define pages_to_bytes(x) ((u64)(x) << PAGE_SHIFT)
  
-static unsigned int update_balloon_stats(struct virtio_balloon *vb)

+/* Return the number of entries filled by vm events */
+static inline unsigned int update_balloon_vm_stats(struct virtio_balloon *vb,
+  unsigned int start)
  {
+#ifdef CONFIG_VM_EVENT_COUNTERS
unsigned long events[NR_VM_EVENT_ITEMS];
-   struct sysinfo i;
-   unsigned int idx = 0;
-   long available;
-   unsigned long caches;
+   unsigned int idx = start;
  
  	all_vm_events(events);

-   si_meminfo();
-
-   available = si_mem_available();
-   caches = global_node_page_state(NR_FILE_PAGES);
-
-#ifdef CONFIG_VM_EVENT_COUNTERS
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
-   pages_to_bytes(events[PSWPIN]));
+   pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
-   pages_to_bytes(events[PSWPOUT]));
+   pages_to_bytes(events[PSWPOUT]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
events[HTLB_BUDDY_PGALLOC_FAIL]);
-#endif
-#endif
+#endif /* CONFIG_HUGETLB_PAGE */
+
+   return idx - start;
+#else /* CONFIG_VM_EVENT_COUNTERS */
+
+   return 0;
+#endif /* CONFIG_VM_EVENT_COUNTERS */
+}
+
+static unsigned int update_balloon_stats(struct virtio_balloon *vb)
+{
+   struct sysinfo i;
+   unsigned int idx = 0;
+   long available;
+   unsigned long caches;
+
+   idx += update_balloon_vm_stats(vb, idx);


No need to handle idx that complicated now. Just do

unsigned int idx;

idx = update_balloon_vm_stats(vb);

We can go down that path if we ever want to rearrange the code and not 
have the vm_stats first.



+
+   si_meminfo();
+   available = si_mem_available();
+   caches = global_node_page_state(NR_FILE_PAGES);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
pages_to_bytes(i.freeram));
update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMTOT,


--
Cheers,

David / dhildenb

Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread David Hildenbrand


On 22.04.24 10:04, zhenwei pi wrote:



On 4/22/24 15:47, David Hildenbrand wrote:

On 22.04.24 09:42, zhenwei pi wrote:

All the VM events related statistics have dependence on
'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any
VM events in future, we would have codes like:
   #ifdef CONFIG_VM_EVENT_COUNTERS
    unsigned long foo;
   #endif
    ...
   #ifdef CONFIG_VM_EVENT_COUNTERS
    foo = events[XXX] + events[YYY];
    update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
   #endif

Separate vm events into a single function, also remove


Why not simply use __maybe_unused for that variable?



1>
static unsigned int update_balloon_stats()
{
  unsigned __maybe_unused long foo;

  ...
#ifdef CONFIG_VM_EVENT_COUNTERS
  foo = events[XXX] + events[YYY];
  update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
#endif
}

2>
static inline unsigned int update_balloon_vm_stats()
{
#ifdef CONFIG_VM_EVENT_COUNTERS
  unsigned long foo;

  foo = events[XXX] + events[YYY];
  update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
#endif
}

  From the point of my view, I don't need to compile code in my brain
when reading codes for case 2. :)


But for #1? :)

I mean, you didn't compile the code in your brain when you sent out v1 :P

But I agree that moving that to a separate function ins cleaner, staring 
at resulting update_balloon_stats().


Let me comment on some nits as a fresh reply.

--
Cheers,

David / dhildenb

Re: [PATCH v2 3/4] virtio_balloon: introduce memory allocation stall counter

2024-04-22 Thread David Hildenbrand


On 22.04.24 09:42, zhenwei pi wrote:

Memory allocation stall counter represents the performance/latency of
memory allocation, expose this counter to the host side by virtio
balloon device via out-of-bound way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 8 
  include/uapi/linux/virtio_balloon.h | 6 --
  2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 87a1d6fa77fb..ab039e83bc6f 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -323,6 +323,8 @@ static inline unsigned int update_balloon_vm_stats(struct 
virtio_balloon *vb,
  #ifdef CONFIG_VM_EVENT_COUNTERS
unsigned long events[NR_VM_EVENT_ITEMS];
unsigned int idx = start;
+   unsigned int zid;
+   unsigned long stall = 0;
  
  	all_vm_events(events);

update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_IN,
@@ -333,6 +335,12 @@ static inline unsigned int update_balloon_vm_stats(struct 
virtio_balloon *vb,
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
  
+	/* sum all the stall events */

+   for (zid = 0; zid < MAX_NR_ZONES; zid++)
+   stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];
+
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index b17bbe033697..487b893a160e 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -72,7 +72,8 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
-#define VIRTIO_BALLOON_S_NR   11
+#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocatoin 
*/
+#define VIRTIO_BALLOON_S_NR   12
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -85,7 +86,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
-   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills" \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v2 1/4] virtio_balloon: separate vm events into a function

2024-04-22 Thread David Hildenbrand


On 22.04.24 09:42, zhenwei pi wrote:

All the VM events related statistics have dependence on
'CONFIG_VM_EVENT_COUNTERS', once any stack variable is required by any
VM events in future, we would have codes like:
  #ifdef CONFIG_VM_EVENT_COUNTERS
   unsigned long foo;
  #endif
   ...
  #ifdef CONFIG_VM_EVENT_COUNTERS
   foo = events[XXX] + events[YYY];
   update_stat(vb, idx++, VIRTIO_BALLOON_S_XXX, foo);
  #endif

Separate vm events into a single function, also remove


Why not simply use __maybe_unused for that variable?

--
Cheers,

David / dhildenb

[PATCH v2 1/1] virtio: Add support for the virtio suspend feature

2024-04-22 Thread David Stevens

Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio.c| 60 ++
 drivers/virtio/virtio_pci_common.c | 34 -
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  8 
 include/uapi/linux/virtio_config.h | 10 -
 5 files changed, 112 insertions(+), 19 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..c7685a0dc995 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -498,6 +499,13 @@ void unregister_virtio_device(struct virtio_device *dev)
 }
 EXPORT_SYMBOL_GPL(unregister_virtio_device);
 
+void virtio_device_mark_removed(struct virtio_device *dev)
+{
+   /* Pairs with READ_ONCE() in virtio_device_set_suspend_bit(). */
+   WRITE_ONCE(dev->removed, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_mark_removed);
+
 #ifdef CONFIG_PM_SLEEP
 int virtio_device_freeze(struct virtio_device *dev)
 {
@@ -580,6 +588,58 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
 }
 EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+
+   if (target == status)
+   return 0;
+
+   dev->config->set_status(dev, target);
+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   /* Pairs with WRITE_ONCE() in virtio_device_mark_removed(). */
+   if (READ_ONCE(dev->removed))
+   return -EIO;
+   msleep(10);
+   }
+   return 0;
+}
+
+bool virtio_device_can_suspend(struct virtio_device *dev)
+{
+   return virtio_has_feature(dev, VIRTIO_F_SUSPEND) &&
+  (dev->config->get_status(dev) & VIRTIO_CONFIG_S_FEATURES_OK);
+}
+EXPORT_SYMBOL_GPL(virtio_device_can_suspend);
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+bool virtio_device_can_resume(struct virtio_device *dev)
+{
+   return virtio_has_feature(dev, VIRTIO_F_SUSPEND) &&
+  (dev->config->get_status(dev) & VIRTIO_CONFIG_S_SUSPEND);
+}
+EXPORT_SYMBOL_GPL(virtio_device_can_resume);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
 #endif
 
 static int virtio_init(void)
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..a6ca7718b5dc 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
return virtio_device_restore(_dev->vdev);
 }
 
-static bool vp_supports_pm_no_reset(struct device *dev)
+static int virtio_pci_suspend(struct device *dev)
 {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, );
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
 
-   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
-}
+   if (virtio_device_can_suspend(_dev->vdev))
+   return virtio_device_suspend(_dev->vdev);
 
-static int virtio_pci_suspend(struct device *dev)
-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
 }
 
 static int virtio_pci_resume(struct device *dev)
 {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_device_can_resume(_dev->vdev))
+   return virtio_device_resume(_dev->vdev);
+
+   return virtio_pci_restore(dev);
 }
 
 static const struct dev_pm_ops virtio_pci_pm_ops = {
@@ -623,9 +618,12 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
 * Device is marked broken on surprise removal so that virtio upper

[PATCH v2 0/1] virtio: Add suspend support

2024-04-22 Thread David Stevens

This series implements support for the virtio device suspend feature
that is under discussion. Unfortunately, the virtio mailing list is
currently being migrated, so recent discussion of the proposal is not
archived anywhere. There current version of the proposal is a
combination of [1] and [2].

[1] https://lore.kernel.org/all/20230906081637.32185-3-lingshan@intel.com/
[2] https://lists.oasis-open.org/archives/virtio-comment/202402/msg00088.html

v1 -> v2:
 - Check for device removal while waiting for suspend bit.
 - Don't try to suspend uninitialized deivces.
 - Use msleep instead of mdelay.

David Stevens (1):
  virtio: Add support for the virtio suspend feature

 drivers/virtio/virtio.c| 60 ++
 drivers/virtio/virtio_pci_common.c | 34 -
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  8 
 include/uapi/linux/virtio_config.h | 10 -
 5 files changed, 112 insertions(+), 19 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.769.g3c40516874-goog

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-19 Thread David Matlack

On 2024-04-19 01:47 PM, James Houghton wrote:
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_age_gfn_range(kvm, range);
> >
> > return young;
> > }
> >
> > bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > bool young = false;
> >
> > if (!range->arg.metadata->bitmap && kvm_memslots_have_rmaps(kvm))
> > young = kvm_handle_gfn_range(kvm, range, kvm_test_age_rmap);
> >
> > if (tdp_mmu_enabled)
> > young |= kvm_tdp_mmu_test_age_gfn(kvm, range);
> >
> > return young;
> 
> 
> Yeah I think this is the right thing to do. Given your other
> suggestions (on patch 3), I think this will look something like this
> -- let me know if I've misunderstood something:
> 
> bool check_rmap = !bitmap && kvm_memslot_have_rmaps(kvm);
> 
> if (check_rmap)
>   KVM_MMU_LOCK(kvm);
> 
> rcu_read_lock(); // perhaps only do this when we don't take the MMU lock?
> 
> if (check_rmap)
>   kvm_handle_gfn_range(/* ... */ kvm_test_age_rmap)
> 
> if (tdp_mmu_enabled)
>   kvm_tdp_mmu_test_age_gfn() // modified to be RCU-safe
> 
> rcu_read_unlock();
> if (check_rmap)
>   KVM_MMU_UNLOCK(kvm);

I was thinking a little different. If you follow my suggestion to first
make the TDP MMU aging lockless, you'll end up with something like this
prior to adding bitmap support (note: the comments are just for
demonstrative purposes):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
bool young = false;

/* Shadow MMU aging holds write-lock. */
if (kvm_memslots_have_rmaps(kvm)) {
write_lock(>mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(>mmu_lock);
}

/* TDM MMU aging is lockless. */
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

Then when you add bitmap support it would look something like this:

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
unsigned long *bitmap = range->arg.metadata->bitmap;
bool young = false;

/* SHadow MMU aging holds write-lock and does not support bitmap. */
if (kvm_memslots_have_rmaps(kvm) && !bitmap) {
write_lock(>mmu_lock);
young = kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(>mmu_lock);
}

/* TDM MMU aging is lockless and supports bitmap. */
if (tdp_mmu_enabled)
young |= kvm_tdp_mmu_age_gfn_range(kvm, range);

return young;
}

rcu_read_lock/unlock() would be called in kvm_tdp_mmu_age_gfn_range().

That brings up a question I've been wondering about. If KVM only
advertises support for the bitmap lookaround when shadow roots are not
allocated, does that mean MGLRU will be blind to accesses made by L2
when nested virtualization is enabled? And does that mean the Linux MM
will think all L2 memory is cold (i.e. good candidate for swapping)
because it isn't seeing accesses made by L2?

Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-04-19 Thread David Hildenbrand

  s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, _pages);


I think Linus suggested it ("avoid all the sub-page ref-counts entirely 
by using VM_PFNMAP, and use vm_insert_pages()"), but ... 
vm_insert_pages() will:

* Mess with mapcounts
* Mess with refcounts

See 
insert_pages()->insert_page_in_batch_locked()->insert_page_into_pte_locked().


So we'll mess with the mapcount and refcount of the shared zeropage ... 
h


If I am not wrong, vm_normal_page() would not return the shared zeropage 
in case we don't have CONFIG_ARCH_HAS_PTE_SPECIAL ... so 
unmap()->...->zap_present_ptes() would not decrement the refcount and we 
could overflow it. ... we also shouldn't ever mess with the mapcount of 
the shared zeropage in the first place.


vm_insert_page() is clearer on that: "This allows drivers to insert 
individual pages they've allocated into a user vma". You didn't allocate 
the shared zeropage.


I'm wondering if we even expect VM_MIXEDMAP and VM_PFNMAP to be set at 
the same time? vm_insert_pages() would BUG_ON in case VM_PFNMAP is 
already set and it would set VM_MIXEDMAP ... similarly 
vmf_insert_pfn_prot() would not be happy about that at all ...


BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == 
(VM_PFNMAP|VM_MIXEDMAP));


... remap_pfn_range() is used by io_uring_mmap for a similar purpose. 
But it only supports a single PFN range at a time and requires the 
caller to handle refcounting of pages.


It's getting late in Germany, so I might be missing something; but using 
the shared zeropage here might be a problem.


--
Cheers,

David / dhildenb

Re: [PATCH 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-18 Thread David Hildenbrand


On 18.04.24 08:26, zhenwei pi wrote:

Expose memory scan/reclaim information to the host side via virtio
balloon device.

Now we have a metric to analyze the memory performance:

y: counter increases
n: counter does not changes
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
   the guest runs under really critial memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to cgroup, not the global memory
   pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
   the memory allocation stalls due to global memory pressure. The
   performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
   quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to global memory pressure.
   the ratio between DRCLM/DSCAN gets low, the guest OS is thrashing
   heavily, the serious case leads poor performance and difficult
   trouble shooting. Ex, sshd may block on memory allocation when
   accepting new connections, a user can't login a VM by ssh command.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
   the low ratio between ARCLM/ASCAN shows that the guest tries to
   reclaim more memory, but it can't. Once more memory is required in
   future, it will struggle to reclaim memory.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c |  9 +
  include/uapi/linux/virtio_balloon.h | 12 ++--
  2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index e88e6573afa5..bc9332c1ae85 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -356,6 +356,15 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
  
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_SCAN,

+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_SCAN,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ASYNC_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_DIRECT_RECLAIM,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
  #ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
events[HTLB_BUDDY_PGALLOC]);
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 487b893a160e..ee35a372805d 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
  #define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocatoin 
*/
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_ASYNC_SCAN12 /* Amount of memory scanned 
asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_SCAN   13 /* Amount of memory scanned directly 
*/
+#define VIRTIO_BALLOON_S_ASYNC_RECLAIM 14 /* Amount of memory reclaimed 
asynchronously */
+#define VIRTIO_BALLOON_S_DIRECT_RECLAIM 15 /* Amount of memory reclaimed 
directly */
+#define VIRTIO_BALLOON_S_NR   16
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kills", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stalls", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-scans", \
+   VIRTIO_BALLOON_S_NAMES_prefix "async-reclaims", \
+   VIRTIO_BALLOON_S_NAMES_prefix "direct-reclaims" \
  }
  
  #define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("")


Not an expert on these counters/events, but LGTM

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-18 Thread David Hildenbrand


On 18.04.24 08:26, zhenwei pi wrote:

Memory allocation stall counter represents the performance/latency of
memory allocation, expose this counter to the host side by virtio
balloon device via out-of-bound way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 20 +++-
  include/uapi/linux/virtio_balloon.h |  6 --
  2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd19934a847f..e88e6573afa5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
  
  	all_vm_events(events);

@@ -338,6 +338,24 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL, events[OOM_KILL]);
+
+   /* sum all the stall events */
+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif


Naive me would think that ALLOCSTALL_DEVICE is always 0. :)

Likely we should just do:

for (zid = 0; zid < MAX_NR_ZONES; zid++)
stall += events[ALLOCSTALL_NORMAL - ZONE_NORMAL + zid];

(see isolate_lru_folios() -> __count_zid_vm_events(), where we realy on 
the same ordering)


Apart form that, LGTM.

--
Cheers,

David / dhildenb

Re: [PATCH 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-18 Thread David Hildenbrand


On 18.04.24 08:26, zhenwei pi wrote:

When the guest OS runs under critical memory pressure, the guest
starts to kill processes. A guest monitor agent may scan 'oom_kill'
from /proc/vmstat, and reports the OOM KILL event. However, the agent
may be killed and we will loss this critical event(and the later
events).

For now we can also grep for magic words in guest kernel log from host
side. Rather than this unstable way, virtio balloon reports OOM-KILL
invocations instead.

Signed-off-by: zhenwei pi 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

[PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-17 Thread David Stevens

Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio.c| 32 ++
 drivers/virtio/virtio_pci_common.c | 29 +++
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  2 ++
 include/uapi/linux/virtio_config.h | 10 +-
 5 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..cd11495a5098 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
 }
 EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+   dev->config->set_status(dev, target);
+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   mdelay(10);
+   }
+   return 0;
+}
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
 #endif
 
 static int virtio_init(void)
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..4d542de05970 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
return virtio_device_restore(_dev->vdev);
 }
 
-static bool vp_supports_pm_no_reset(struct device *dev)
+static int virtio_pci_suspend(struct device *dev)
 {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, );
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
 
-   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
-}
+   if (virtio_has_feature(_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_suspend(_dev->vdev);
 
-static int virtio_pci_suspend(struct device *dev)
-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
 }
 
 static int virtio_pci_resume(struct device *dev)
 {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_has_feature(_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_resume(_dev->vdev);
+
+   return virtio_pci_restore(dev);
 }
 
 static const struct dev_pm_ops virtio_pci_pm_ops = {
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index f62b530aa3b5..ac8734526b8d 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device 
*vdev)
__virtqueue_break(admin_vq->info.vq);
 }
 
+static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
+{
+   u16 pmcsr;
+
+   if (!pci_dev->pm_cap)
+   return false;
+
+   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, );
+   if (PCI_POSSIBLE_ERROR(pmcsr)) {
+   dev_err(_dev->dev, "Unable to query pmcsr");
+   return false;
+   }
+
+   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
+}
+
 static void vp_transport_features(struct virtio_device *vdev, u64 features)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
@@ -223,6 +239,9 @@ static void vp_transport_features(struct virtio_device 
*vdev, u64 features)
 
if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))
__virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
+
+   if (features & BIT_ULL(VIRTIO_F_SUSPEND) && 
vp_supports_pm_no_reset(pci_dev))
+   __virtio_set_bit(vdev, VIRTIO_F_SUSPEND);
 }
 
 static i

[PATCH 0/1] virtio: Add suspend support

2024-04-17 Thread David Stevens

This series implements support for the virtio device suspend feature
that is under discussion. Unfortunately, the virtio mailing list is
currently being migrated, so recent discussion of the proposal is not
archived anywhere. There current version of the proposal is a
combination of [1] and [2].

[1] https://lore.kernel.org/all/20230906081637.32185-3-lingshan@intel.com/
[2] https://lists.oasis-open.org/archives/virtio-comment/202402/msg00088.html

David Stevens (1):
  virtio: Add support for the virtio suspend feature

 drivers/virtio/virtio.c| 32 ++
 drivers/virtio/virtio_pci_common.c | 29 +++
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  2 ++
 include/uapi/linux/virtio_config.h | 10 +-
 5 files changed, 74 insertions(+), 18 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.683.g7961c838ac-goog

Re: [PATCH] module: ban '.', '..' as module names, ban '/' in module names

2024-04-15 Thread Dr. David Alan Gilbert

* Alexey Dobriyan (adobri...@gmail.com) wrote:
> On Sun, Apr 14, 2024 at 01:58:55PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 14, 2024 at 10:05:05PM +0300, Alexey Dobriyan wrote:
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3616,4 +3616,12 @@ extern int vfs_fadvise(struct file *file, loff_t 
> > > offset, loff_t len,
> > >  extern int generic_fadvise(struct file *file, loff_t offset, loff_t len,
> > >  int advice);
> > >  
> > > +/*
> > > + * Use this if data from userspace end up as directory/filename on
> > > + * some virtual filesystem.
> > > + */
> > > +static inline bool string_is_vfs_ready(const char *s)
> > > +{
> > > + return strcmp(s, ".") != 0 && strcmp(s, "..") != 0 && !strchr(s, '/');
> > > +}
> > >  #endif /* _LINUX_FS_H */
> > > --- a/kernel/module/main.c
> > > +++ b/kernel/module/main.c
> > > @@ -2893,6 +2893,11 @@ static int load_module(struct load_info *info, 
> > > const char __user *uargs,
> > >  
> > >   audit_log_kern_module(mod->name);
> > >  
> > > + if (!string_is_vfs_ready(mod->name)) {
> > > + err = -EINVAL;
> > > + goto free_module;
> > > + }
> > > +
> > 
> > Sensible change however to put string_is_vfs_ready() in include/linux/fs.h 
> > is a stretch if there really are no other users.
> 
> This is forward thinking patch :-)
> 
> Other subsystems may create files/directories in proc/sysfs, and should
> check for bad names as well:
> 
>   /proc/2821/net/dev_snmp6/eth0
> 
> This looks exactly like something coming from userspace and making it
> into /proc, so the filter function doesn't belong to kernel/module/internal.h

You mean like:

[24180.292204] tuxthe: renamed from tuxthe
root@dalek:/home/dg# ls /sys/class/net/
enp5s0  lo  tuxthe  tuxthe  tuxthe  virbr0  virbr1

?

Dave
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/

Re: [RFC 3/3] virtio_balloon: introduce memory scan/reclaim info

2024-04-15 Thread David Hildenbrand


On 15.04.24 10:41, zhenwei pi wrote:

Expose memory scan/reclaim information to the host side via virtio
balloon device.

Now we have a metric to analyze the memory performance:

y: counter increases
n: counter does not changes
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
   the guest runs under really critial memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to cgroup, not the global memory
   pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
   the memory allocation stalls due to global memory pressure. The
   performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
   quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
   the memory allocation stalls due to global memory pressure.
   the ratio between DRCLM/DSCAN gets low, the guest OS is thrashing
   heavily, the serious case leads poor performance and difficult
   trouble shooting. Ex, sshd may block on memory allocation when
   accepting new connections, a user can't login a VM by ssh command.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
   the low ratio between ARCLM/ASCAN shows that the guest tries to
   reclaim more memory, but it can't. Once more memory is required in
   future, it will struggle to reclaim memory.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c |  9 +
  include/uapi/linux/virtio_balloon.h | 12 ++--
  2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 4b9c9569f6e5..7b86514e99d4 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -372,6 +372,15 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
stall += events[ALLOCSTALL_MOVABLE];
update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
  
+	update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_ASYNC,

+   pages_to_bytes(events[PGSCAN_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_SCAN_DIRECT,
+   pages_to_bytes(events[PGSCAN_DIRECT]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_ASYNC,
+   pages_to_bytes(events[PGSTEAL_KSWAPD]));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_RECLAIM_DIRECT,
+   pages_to_bytes(events[PGSTEAL_DIRECT]));
+
return idx;
  }
  
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h

index 13d0c32ba27c..0875a9cccb01 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -73,7 +73,11 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
  #define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocatoin 
*/
-#define VIRTIO_BALLOON_S_NR   12
+#define VIRTIO_BALLOON_S_SCAN_ASYNC12 /* Amount of memory scanned 
asynchronously */
+#define VIRTIO_BALLOON_S_SCAN_DIRECT   13 /* Amount of memory scanned directly 
*/
+#define VIRTIO_BALLOON_S_RECLAIM_ASYNC 14 /* Amount of memory reclaimed 
asynchronously */
+#define VIRTIO_BALLOON_S_RECLAIM_DIRECT 15 /* Amount of memory reclaimed 
directly */


Should these be the other way around:

ASYN_SCAN
...
ASYNC_RECLAIM

so we can get ...


+#define VIRTIO_BALLOON_S_NR   16
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -87,7 +91,11 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \
-   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall", \
+   VIRTIO_BALLOON_S_NAMES_prefix "scan-async", \
+   VIRTIO_BALLOON_S_NAMES_prefix "scan-direct", \
+   VIRTIO_BALLOON_S_NAMES_prefix "reclaim-async", \
+   VIRTIO_BALLOON_S_NAMES_prefix "reclaim-direct" \


...

"async-scans", "async-reclaims" ...

--
Cheers,

David / dhildenb

Re: [RFC 2/3] virtio_balloon: introduce memory allocation stall counter

2024-04-15 Thread David Hildenbrand


On 15.04.24 10:41, zhenwei pi wrote:

Memory allocation stall counter represents the performance/latency of
memory allocation, expose this counter to the host side by virtio
balloon device via out-of-bound way.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 19 ++-
  include/uapi/linux/virtio_balloon.h |  6 --
  2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index fd8daa742734..4b9c9569f6e5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -321,7 +321,7 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
unsigned long events[NR_VM_EVENT_ITEMS];
struct sysinfo i;
unsigned int idx = 0;
-   long available;
+   long available, stall = 0;
unsigned long caches;
  
  	all_vm_events(events);

@@ -355,6 +355,23 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL,
events[OOM_KILL]);
  
+	/* sum all the stall event */

+#ifdef CONFIG_ZONE_DMA
+   stall += events[ALLOCSTALL_DMA];
+#endif
+#ifdef CONFIG_ZONE_DMA32
+   stall += events[ALLOCSTALL_DMA32];
+#endif
+#ifdef CONFIG_HIGHMEM
+   stall += events[ALLOCSTALL_HIGH];
+#endif
+#ifdef CONFIG_ZONE_DEVICE
+   stall += events[ALLOCSTALL_DEVICE];
+#endif
+   stall += events[ALLOCSTALL_NORMAL];
+   stall += events[ALLOCSTALL_MOVABLE];
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_ALLOC_STALL, stall);
+
return idx;
  }
  
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h

index cde5547e64a7..13d0c32ba27c 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -72,7 +72,8 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
  #define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
-#define VIRTIO_BALLOON_S_NR   11
+#define VIRTIO_BALLOON_S_ALLOC_STALL   11 /* Stall count of memory allocatoin 
*/
+#define VIRTIO_BALLOON_S_NR   12
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -85,7 +86,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
-   VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kill", \
+   VIRTIO_BALLOON_S_NAMES_prefix "alloc-stall" \


"alloc-stalls"

--
Cheers,

David / dhildenb

Re: [RFC 1/3] virtio_balloon: introduce oom-kill invocations

2024-04-15 Thread David Hildenbrand


On 15.04.24 10:41, zhenwei pi wrote:

When the guest OS runs under critical memory pressure, the guest
starts to kill processes. A guest monitor agent may scan 'oom_kill'
from /proc/vmstat, and reports the OOM KILL event. However, the agent
may be killed and we will loss this critical event(and the later
events).

For now we can also grep for magic words in guest kernel log from host
side. Rather than this unstable way, virtio balloon reports OOM-KILL
invocations instead.

Signed-off-by: zhenwei pi 
---
  drivers/virtio/virtio_balloon.c | 2 ++
  include/uapi/linux/virtio_balloon.h | 6 --
  2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..fd8daa742734 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -352,6 +352,8 @@ static unsigned int update_balloon_stats(struct 
virtio_balloon *vb)
pages_to_bytes(available));
update_stat(vb, idx++, VIRTIO_BALLOON_S_CACHES,
pages_to_bytes(caches));
+   update_stat(vb, idx++, VIRTIO_BALLOON_S_OOM_KILL,
+   events[OOM_KILL]);
  
  	return idx;

  }
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index ddaa45e723c4..cde5547e64a7 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -71,7 +71,8 @@ struct virtio_balloon_config {
  #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
  #define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page allocations */
  #define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page allocation failures 
*/
-#define VIRTIO_BALLOON_S_NR   10
+#define VIRTIO_BALLOON_S_OOM_KILL  10 /* OOM killer invocations */
+#define VIRTIO_BALLOON_S_NR   11
  
  #define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \

VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
@@ -83,7 +84,8 @@ struct virtio_balloon_config {
VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
-   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
+   VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures", \
+   VIRTIO_BALLOON_S_NAMES_prefix "oom-kill" \


"oom-kills"

--
Cheers,

David / dhildenb

Re: [RFC 0/3] Improve memory statistics for virtio balloon

2024-04-15 Thread David Hildenbrand


On 15.04.24 10:41, zhenwei pi wrote:

Hi,

When the guest runs under critial memory pressure, the guest becomss
too slow, even sshd turns D state(uninterruptible) on memory
allocation. We can't login this VM to do any work on trouble shooting.

Guest kernel log via virtual TTY(on host side) only provides a few
necessary log after OOM. More detail memory statistics are required,
then we can know explicit memory events and estimate the pressure.

I'm going to introduce several VM counters for virtio balloon:
- oom-kill
- alloc-stall
- scan-async
- scan-direct
- reclaim-async
- reclaim-direct


IIUC, we're only exposing events that are already getting provided via 
all_vm_events(), correct?


In that case, I don't really see a major issue. Some considerations:

(1) These new events are fairly Linux specific.

PSWPIN and friends are fairly generic, but HGTLB is also already fairly 
Linux specific already. OOM-kills don't really exist on Windows, for 
example. We'll have to be careful of properly describing what the 
semantics are.


(2) How should we handle if Linux ever stops supporting a certain event 
(e.g., major reclaim rework). I assume, simply return nothing like we 
currently would for VIRTIO_BALLOON_S_HTLB_PGALLOC without 
CONFIG_HUGETLB_PAGE.


--
Cheers,

David / dhildenb

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-12 Thread David Matlack

On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.

I think this patch will trigger a lockdep assert in

  kvm_tdp_mmu_age_gfn_range
kvm_tdp_mmu_handle_gfn
  for_each_tdp_mmu_root
__for_each_tdp_mmu_root
  kvm_lockdep_assert_mmu_lock_held

... because it walks tdp_mmu_roots without holding mmu_lock.

Yu's patch[1] added a lockless walk to the TDP MMU. We'd need something
similar here and also update the comment above tdp_mmu_roots describing
how tdp_mmu_roots can be read locklessly.

[1] https://lore.kernel.org/kvmarm/zitx64bbx5vdj...@google.com/

Re: [PATCH v3 3/7] KVM: Add basic bitmap support into kvm_mmu_notifier_test/clear_young

2024-04-12 Thread David Matlack

On 2024-04-01 11:29 PM, James Houghton wrote:
> Add kvm_arch_prepare_bitmap_age() for architectures to indiciate that
> they support bitmap-based aging in kvm_mmu_notifier_test_clear_young()
> and that they do not need KVM to grab the MMU lock for writing. This
> function allows architectures to do other locking or other preparatory
> work that it needs.

There's a lot going on here. I know it's extra work but I think the
series would be easier to understand and simplify if you introduced the
KVM support for lockless test/clear_young() first, and then introduce
support for the bitmap-based look-around.

Specifically:

 1. Make all test/clear_young() notifiers lockless. i.e. Move the
mmu_lock into the architecture-specific code (kvm_age_gfn() and
kvm_test_age_gfn()).

 2. Convert KVM/x86's kvm_{test,}_age_gfn() to be lockless for the TDP
MMU.

 4. Convert KVM/arm64's kvm_{test,}_age_gfn() to hold the mmu_lock in
read-mode.

 5. Add bitmap-based look-around support to KVM/x86 and KVM/arm64
(probably 2-3 patches).

> 
> If an architecture does not implement kvm_arch_prepare_bitmap_age() or
> is unable to do bitmap-based aging at runtime (and marks the bitmap as
> unreliable):
>  1. If a bitmap was provided, we inform the caller that the bitmap is
> unreliable (MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE).
>  2. If a bitmap was not provided, fall back to the old logic.
> 
> Also add logic for architectures to easily use the provided bitmap if
> they are able. The expectation is that the architecture's implementation
> of kvm_gfn_test_age() will use kvm_gfn_record_young(), and
> kvm_gfn_age() will use kvm_gfn_should_age().
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  include/linux/kvm_host.h | 60 ++
>  virt/kvm/kvm_main.c  | 92 +---
>  2 files changed, 127 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1800d03a06a9..5862fd7b5f9b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1992,6 +1992,26 @@ extern const struct _kvm_stats_desc 
> kvm_vm_stats_desc[];
>  extern const struct kvm_stats_header kvm_vcpu_stats_header;
>  extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[];
>  
> +/*
> + * Architectures that support using bitmaps for kvm_age_gfn() and
> + * kvm_test_age_gfn should return true for kvm_arch_prepare_bitmap_age()
> + * and do any work they need to prepare. The subsequent walk will not
> + * automatically grab the KVM MMU lock, so some architectures may opt
> + * to grab it.
> + *
> + * If true is returned, a subsequent call to kvm_arch_finish_bitmap_age() is
> + * guaranteed.
> + */
> +#ifndef kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)

I find the name of these architecture callbacks misleading/confusing.
The lockless path is used even when a bitmap is not provided. i.e.
bitmap can be NULL in between kvm_arch_prepare/finish_bitmap_age().

> +{
> + return false;
> +}
> +#endif
> +#ifndef kvm_arch_finish_bitmap_age
> +static inline void kvm_arch_finish_bitmap_age(struct mmu_notifier *mn) {}
> +#endif

kvm_arch_finish_bitmap_age() seems unnecessary. I think the KVM/arm64
code could acquire/release the mmu_lock in read-mode in
kvm_test_age_gfn() and kvm_age_gfn() right?

> +
>  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
>  static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  {
> @@ -2076,9 +2096,16 @@ static inline bool 
> mmu_invalidate_retry_gfn_unsafe(struct kvm *kvm,
>   return READ_ONCE(kvm->mmu_invalidate_seq) != mmu_seq;
>  }
>  
> +struct test_clear_young_metadata {
> + unsigned long *bitmap;
> + unsigned long bitmap_offset_end;

bitmap_offset_end is unused.

> + unsigned long end;
> + bool unreliable;
> +};
>  union kvm_mmu_notifier_arg {
>   pte_t pte;
>   unsigned long attributes;
> + struct test_clear_young_metadata *metadata;

nit: Maybe s/metadata/test_clear_young/ ?

>  };
>  
>  struct kvm_gfn_range {
> @@ -2087,11 +2114,44 @@ struct kvm_gfn_range {
>   gfn_t end;
>   union kvm_mmu_notifier_arg arg;
>   bool may_block;
> + bool lockless;

Please document this as it's somewhat subtle. A reader might think this
implies the entire operation runs without taking the mmu_lock.

>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
>  bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
> +
> +static inline void kvm_age_set_unreliable(struct kvm_gfn_range *range)
> +{
> + struct test_clear_young_metadata *args = range->arg.metadata;
> +
> + args->unreliable = true;
> +}
> +static inline unsigned long kvm_young_bitmap_offset(struct kvm_gfn_range 
> *range,
> +

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread David Hildenbrand


On 11.04.24 18:55, Paolo Bonzini wrote:

On Mon, Apr 8, 2024 at 3:56 PM Peter Xu  wrote:

Paolo,

I may miss a bunch of details here (as I still remember some change_pte
patches previously on the list..), however not sure whether we considered
enable it?  Asked because I remember Andrea used to have a custom tree
maintaining that part:

https://github.com/aagit/aa/commit/c761078df7a77d13ddfaeebe56a0f4bc128b1968


The patch enables it only for KSM, so it would still require a bunch
of cleanups, for example I also would still use set_pte_at() in all
the places that are not KSM. This would at least fix the issue with
the poor documentation of where to use set_pte_at_notify() vs
set_pte_at().

With regard to the implementation, I like the idea of disabling the
invalidation on the MMU notifier side, but I would rather have
MMU_NOTIFIER_CHANGE_PTE as a separate field in the range instead of
overloading the event field.


Maybe it can't be enabled for some reason that I overlooked in the current
tree, or we just decided to not to?


I have just learnt about the patch, nobody had ever mentioned it even
though it's almost 2 years old... It's a lot of code though and no one


I assume Andrea used it on his tree where he also has a version of 
"randprotect" (even included in that commit subject) to mitigate a KSM 
security issue that was reported by some security researchers [1] a 
while ago. From what I recall, the industry did not end up caring about 
that security issue that much.


IIUC, with "randprotect" we get a lot more R/O protection even when not 
de-duplicating a page -- thus the name. Likely, the reporter mentioned 
in the commit is a researcher that played with Andreas fix for the 
security issue. But I'm just speculating at this point :)



has ever reported an issue for over 10 years, so I think it's easiest
to just rip the code out.


Yes. Can always be readded in a possibly cleaner fashion (like you note 
above), when deemed necessary and we are willing to support it.


[1] https://gruss.cc/files/remote_dedup.pdf

--
Cheers,

David / dhildenb

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-12 Thread David Matlack

On 2024-04-01 11:29 PM, James Houghton wrote:
> The bitmap is provided for secondary MMUs to use if they support it. For
> test_young(), after it returns, the bitmap represents the pages that
> were young in the interval [start, end). For clear_young, it represents
> the pages that we wish the secondary MMU to clear the accessed/young bit
> for.
> 
> If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
> should be unchanged except that if young PTEs are found and the
> architecture supports passing in a bitmap, instead of returning 1,
> MMU_NOTIFIER_YOUNG_FAST is returned.
> 
> This allows MGLRU's look-around logic to work faster, resulting in a 4%
> improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
> to indicate to main mm that doing look-around is likely to be
> beneficial.
> 
> If the secondary MMU doesn't support the bitmap, it must return
> an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> [1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  include/linux/mmu_notifier.h | 93 +---
>  include/trace/events/kvm.h   | 13 +++--
>  mm/mmu_notifier.c| 20 +---
>  virt/kvm/kvm_main.c  | 19 ++--
>  4 files changed, 123 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index f349e08a9dfe..daaa9db625d3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -61,6 +61,10 @@ enum mmu_notifier_event {
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> +#define MMU_NOTIFIER_YOUNG   (1 << 0)
> +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE appears to be unused by all callers
of test/clear_young(). I would vote to remove it.

> +#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)

Instead of MMU_NOTIFIER_YOUNG_FAST, how about
MMU_NOTIFIER_YOUNG_LOOK_AROUND? i.e. The secondary MMU is returning
saying it recommends doing a look-around and passing in a bitmap?

That would avoid the whole "what does FAST really mean" confusion.

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fb49c2a60200..ca4b1ef9dfc2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -917,10 +917,15 @@ static int kvm_mmu_notifier_clear_flush_young(struct 
> mmu_notifier *mn,
>  static int kvm_mmu_notifier_clear_young(struct mmu_notifier *mn,
>   struct mm_struct *mm,
>   unsigned long start,
> - unsigned long end)
> + unsigned long end,
> + unsigned long *bitmap)
>  {
>   trace_kvm_age_hva(start, end);
>  
> + /* We don't support bitmaps. Don't test or clear anything. */
> + if (bitmap)
> + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Wouldn't it be a bug to get a bitmap here? The main MM is only suppost
to pass in a bitmap if the secondary MMU returns
MMU_NOTIFIER_YOUNG_FAST, which KVM does not do at this point.

Put another way, this check seems unneccessary.

> +
>   /*
>* Even though we do not flush TLB, this will still adversely
>* affect performance on pre-Haswell Intel EPT, where there is
> @@ -939,11 +944,17 @@ static int kvm_mmu_notifier_clear_young(struct 
> mmu_notifier *mn,
>  
>  static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
>  struct mm_struct *mm,
> -unsigned long address)
> +unsigned long start,
> +unsigned long end,
> +unsigned long *bitmap)
>  {
> - trace_kvm_test_age_hva(address);
> + trace_kvm_test_age_hva(start, end);
> +
> + /* We don't support bitmaps. Don't test or clear anything. */
> + if (bitmap)
> + return MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE;

Same thing here.

Re: [PATCH v3 0/7] mm/kvm: Improve parallelism for access bit harvesting

2024-04-12 Thread David Matlack

On 2024-04-01 11:29 PM, James Houghton wrote:
> This patchset adds a fast path in KVM to test and clear access bits on
> sptes without taking the mmu_lock. It also adds support for using a
> bitmap to (1) test the access bits for many sptes in a single call to
> mmu_notifier_test_young, and to (2) clear the access bits for many ptes
> in a single call to mmu_notifier_clear_young.

How much improvement would we get if we _just_ made test/clear_young
lockless on x86 and hold the read-lock on arm64? And then how much
benefit does the bitmap look-around add on top of that?

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack

On Thu, Apr 11, 2024 at 11:00 AM David Matlack  wrote:
>
> On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
> >
> > On 2024-04-11 10:08 AM, David Matlack wrote:
> > > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > > >
> > > > Suggested-by: Yu Zhao 
> > > > Signed-off-by: James Houghton 
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h | 14 ++
> > > >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> > > >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> > > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index 3b58e2306621..c30918d0887e 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot 
> > > > *slot, unsigned long npages);
> > > >   */
> > > >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > > >
> > > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > > +{
> > > > +   /*
> > > > +* Indicate that we support bitmap-based aging when using the TDP 
> > > > MMU
> > > > +* and the accessed bit is available in the TDP page tables.
> > > > +*
> > > > +* We have no other preparatory work to do here, so we do not need 
> > > > to
> > > > +* redefine kvm_arch_finish_bitmap_age().
> > > > +*/
> > > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > > +&& shadow_accessed_mask;
> > > > +}
> > > > +
> > > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index 992e651540e8..fae1a75750bb 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > > > kvm_gfn_range *range)
> > > >  {
> > > > bool young = false;
> > > >
> > > > -   if (kvm_memslots_have_rmaps(kvm))
> > > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > > +   if (range->lockless) {
> > > > +   kvm_age_set_unreliable(range);
> > > > +   return false;
> > > > +   }
> > >
> > > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > > virtualization, MGLRU will effectively be blind to all accesses made by
> > > the VM.
> > >
> > > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > > return false immediately and indicate the bitmap is unreliable because a
> > > shadow root is allocate. The notfier will then return
> > > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
>
> Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
> return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

Nope, wrong again. Just ignore me while I try to figure out how this
actually works :)

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack

On Thu, Apr 11, 2024 at 10:28 AM David Matlack  wrote:
>
> On 2024-04-11 10:08 AM, David Matlack wrote:
> > On 2024-04-01 11:29 PM, James Houghton wrote:
> > > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > > bitmap was provided, inform the caller that the bitmap is unreliable.
> > >
> > > Suggested-by: Yu Zhao 
> > > Signed-off-by: James Houghton 
> > > ---
> > >  arch/x86/include/asm/kvm_host.h | 14 ++
> > >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> > >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> > >  3 files changed, 37 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h 
> > > b/arch/x86/include/asm/kvm_host.h
> > > index 3b58e2306621..c30918d0887e 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot 
> > > *slot, unsigned long npages);
> > >   */
> > >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> > >
> > > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > > +{
> > > +   /*
> > > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > > +* and the accessed bit is available in the TDP page tables.
> > > +*
> > > +* We have no other preparatory work to do here, so we do not need to
> > > +* redefine kvm_arch_finish_bitmap_age().
> > > +*/
> > > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > > +&& shadow_accessed_mask;
> > > +}
> > > +
> > >  #endif /* _ASM_X86_KVM_HOST_H */
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 992e651540e8..fae1a75750bb 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > > kvm_gfn_range *range)
> > >  {
> > > bool young = false;
> > >
> > > -   if (kvm_memslots_have_rmaps(kvm))
> > > +   if (kvm_memslots_have_rmaps(kvm)) {
> > > +   if (range->lockless) {
> > > +   kvm_age_set_unreliable(range);
> > > +   return false;
> > > +   }
> >
> > If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> > virtualization, MGLRU will effectively be blind to all accesses made by
> > the VM.
> >
> > kvm_arch_prepare_bitmap_age() will return true indicating that the
> > bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> > return false immediately and indicate the bitmap is unreliable because a
> > shadow root is allocate. The notfier will then return
> > MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Ah no, I'm wrong here. Setting args.unreliable causes the notifier to
return 0 instead of MMU_NOTIFIER_YOUNG_FAST.
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is used for something else.

The control flow of all this and naming of functions and macros is
overall confusing. args.unreliable and
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE for one. Also I now realize
kvm_arch_prepare/finish_bitmap_age() are used even when the bitmap is
_not_ provided, so those names are also misleading.

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack

On 2024-04-11 10:08 AM, David Matlack wrote:
> On 2024-04-01 11:29 PM, James Houghton wrote:
> > Only handle the TDP MMU case for now. In other cases, if a bitmap was
> > not provided, fallback to the slowpath that takes mmu_lock, or, if a
> > bitmap was provided, inform the caller that the bitmap is unreliable.
> > 
> > Suggested-by: Yu Zhao 
> > Signed-off-by: James Houghton 
> > ---
> >  arch/x86/include/asm/kvm_host.h | 14 ++
> >  arch/x86/kvm/mmu/mmu.c  | 16 ++--
> >  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
> >  3 files changed, 37 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 3b58e2306621..c30918d0887e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> > unsigned long npages);
> >   */
> >  #define KVM_EXIT_HYPERCALL_MBZ GENMASK_ULL(31, 1)
> >  
> > +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> > +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> > +{
> > +   /*
> > +* Indicate that we support bitmap-based aging when using the TDP MMU
> > +* and the accessed bit is available in the TDP page tables.
> > +*
> > +* We have no other preparatory work to do here, so we do not need to
> > +* redefine kvm_arch_finish_bitmap_age().
> > +*/
> > +   return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> > +&& shadow_accessed_mask;
> > +}
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 992e651540e8..fae1a75750bb 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
> > kvm_gfn_range *range)
> >  {
> > bool young = false;
> >  
> > -   if (kvm_memslots_have_rmaps(kvm))
> > +   if (kvm_memslots_have_rmaps(kvm)) {
> > +   if (range->lockless) {
> > +   kvm_age_set_unreliable(range);
> > +   return false;
> > +   }
> 
> If a VM has TDP MMU enabled, supports A/D bits, and is using nested
> virtualization, MGLRU will effectively be blind to all accesses made by
> the VM.
> 
> kvm_arch_prepare_bitmap_age() will return true indicating that the
> bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
> return false immediately and indicate the bitmap is unreliable because a
> shadow root is allocate. The notfier will then return
> MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> 
> Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
> consumed or used. So I think MGLRU will assume all memory is
> unaccessed?
> 
> One way to improve the situation would be to re-order the TDP MMU
> function first and return young instead of false, so that way MGLRU at
> least has visibility into accesses made by L1 (and L2 if EPT is disable
> in L2). But that still means MGLRU is blind to accesses made by L2.
> 
> What about grabbing the mmu_lock if there's a shadow root allocated and
> get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?
> 
>   if (kvm_memslots_have_rmaps(kvm)) {
>   write_lock(>mmu_lock);
>   young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
>   write_unlock(>mmu_lock);
>   }
> 
> The TDP MMU walk would still be lockless. KVM only has to take the
> mmu_lock to collect accesses made by L2.
> 
> kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
> as well, but that seems relatively simple with the helper functions.

Wait, even simpler, just check kvm_memslots_have_rmaps() in
kvm_arch_prepare_bitmap_age() and skip the shadow MMU when processing a
bitmap request.

i.e.

static inline bool kvm_arch_prepare_bitmap_age(struct kvm *kvm, struct 
mmu_notifier *mn)
{
/*
 * Indicate that we support bitmap-based aging when using the TDP MMU
 * and the accessed bit is available in the TDP page tables.
 *
 * We have no other preparatory work to do here, so we do not need to
 * redefine kvm_arch_finish_bitmap_age().
 */
return IS_ENABLED(CONFIG_X86_64)
&& tdp_mmu_enabled
&& shadow_accessed_mask
&& !kvm_memslots_have_rmaps(kvm);
}

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{

Re: [PATCH v3 5/7] KVM: x86: Participate in bitmap-based PTE aging

2024-04-11 Thread David Matlack

On 2024-04-01 11:29 PM, James Houghton wrote:
> Only handle the TDP MMU case for now. In other cases, if a bitmap was
> not provided, fallback to the slowpath that takes mmu_lock, or, if a
> bitmap was provided, inform the caller that the bitmap is unreliable.
> 
> Suggested-by: Yu Zhao 
> Signed-off-by: James Houghton 
> ---
>  arch/x86/include/asm/kvm_host.h | 14 ++
>  arch/x86/kvm/mmu/mmu.c  | 16 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c  | 10 +-
>  3 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3b58e2306621..c30918d0887e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2324,4 +2324,18 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, 
> unsigned long npages);
>   */
>  #define KVM_EXIT_HYPERCALL_MBZ   GENMASK_ULL(31, 1)
>  
> +#define kvm_arch_prepare_bitmap_age kvm_arch_prepare_bitmap_age
> +static inline bool kvm_arch_prepare_bitmap_age(struct mmu_notifier *mn)
> +{
> + /*
> +  * Indicate that we support bitmap-based aging when using the TDP MMU
> +  * and the accessed bit is available in the TDP page tables.
> +  *
> +  * We have no other preparatory work to do here, so we do not need to
> +  * redefine kvm_arch_finish_bitmap_age().
> +  */
> + return IS_ENABLED(CONFIG_X86_64) && tdp_mmu_enabled
> +  && shadow_accessed_mask;
> +}
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 992e651540e8..fae1a75750bb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -1674,8 +1674,14 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range 
> *range)
>  {
>   bool young = false;
>  
> - if (kvm_memslots_have_rmaps(kvm))
> + if (kvm_memslots_have_rmaps(kvm)) {
> + if (range->lockless) {
> + kvm_age_set_unreliable(range);
> + return false;
> + }

If a VM has TDP MMU enabled, supports A/D bits, and is using nested
virtualization, MGLRU will effectively be blind to all accesses made by
the VM.

kvm_arch_prepare_bitmap_age() will return true indicating that the
bitmap is supported. But then kvm_age_gfn() and kvm_test_age_gfn() will
return false immediately and indicate the bitmap is unreliable because a
shadow root is allocate. The notfier will then return
MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

Looking at the callers, MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE is never
consumed or used. So I think MGLRU will assume all memory is
unaccessed?

One way to improve the situation would be to re-order the TDP MMU
function first and return young instead of false, so that way MGLRU at
least has visibility into accesses made by L1 (and L2 if EPT is disable
in L2). But that still means MGLRU is blind to accesses made by L2.

What about grabbing the mmu_lock if there's a shadow root allocated and
get rid of MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE altogether?

if (kvm_memslots_have_rmaps(kvm)) {
write_lock(>mmu_lock);
young |= kvm_handle_gfn_range(kvm, range, kvm_age_rmap);
write_unlock(>mmu_lock);
}

The TDP MMU walk would still be lockless. KVM only has to take the
mmu_lock to collect accesses made by L2.

kvm_age_rmap() and kvm_test_age_rmap() will need to become bitmap-aware
as well, but that seems relatively simple with the helper functions.

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-09 Thread David Hildenbrand


On 09.04.24 20:31, James Houghton wrote:

Ah, I didn't see this in my inbox, sorry David!


No worries :)



On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand  wrote:


On 02.04.24 01:29, James Houghton wrote:

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {

   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

+#define MMU_NOTIFIER_YOUNG   (1 << 0)
+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)


Especially this one really deserves some documentation :)


Yes, will do. Something like

 MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).


Make sense. I do wonder what the expected reaction from the caller is :)






+#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)


And that one as well.


Something like

Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.


Is that really the right way to communicate that ("would have been 
supported") -- wouldn't we want to sense support differently?






Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).


Right. Will do.




+
   struct mmu_notifier_ops {
   /*
* Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
* clear_young is a lightweight version of clear_flush_young. Like the
* latter, it is supposed to test-and-clear the young/accessed bitflag
* in the secondary pte, but it may omit flushing the secondary tlb.
+  *
+  * If @bitmap is given but is not supported, return
+  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+  *
+  * If the walk is done "quickly" and there were young PTEs,
+  * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
   int (*clear_young)(struct mmu_notifier *subscription,
  struct mm_struct *mm,
  unsigned long start,
-unsigned long end);
+unsigned long end,
+unsigned long *bitmap);

   /*
* test_young is called to check the young/accessed bitflag in
* the secondary pte. This is used to know if the page is
* frequently used without actually clearing the flag or tearing
* down the secondary mapping on the page.
+  *
+  * If @bitmap is given but is not supported, return
+  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+  *
+  * If the walk is done "quickly" and there were young PTEs,
+  * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
   int (*test_young)(struct mmu_notifier *subscription,
 struct mm_struct *mm,
-   unsigned long address);
+   unsigned long start,
+   unsigned long end,
+   unsigned long *bitmap);


What does "quickly" mean (why not use "fast")? What are the semantics, I
don't find any existing usage of that in this file.


"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.


Yes!





Further, what is MMU_NOTIFIER_YOUNG you introduce used for?


MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.


Yes, that will clarify it!

--
Cheers,

David / dhildenb

Re: [PATCH 4/4] mm: replace set_pte_at_notify() with just set_pte_at()

2024-04-08 Thread David Hildenbrand


On 05.04.24 13:58, Paolo Bonzini wrote:

With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify().  It is a synonym of
set_pte_at() and can be replaced with it.

Signed-off-by: Paolo Bonzini 
---


A real joy seeing that gone

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH 3/4] mmu_notifier: remove the .change_pte() callback

2024-04-08 Thread David Hildenbrand


On 05.04.24 13:58, Paolo Bonzini wrote:

The scope of set_pte_at_notify() has reduced more and more through the
years.  Initially, it was meant for when the change to the PTE was
not bracketed by mmu_notifier_invalidate_range_{start,end}().  However,
that has not been so for over ten years.  During all this period
the only implementation of .change_pte() was KVM and it
had no actual functionality, because it was called after
mmu_notifier_invalidate_range_start() zapped the secondary PTE.

Now that this (nonfunctional) user of the .change_pte() callback is
gone, the whole callback can be removed.  For now, leave in place
set_pte_at_notify() even though it is just a synonym for set_pte_at().

Signed-off-by: Paolo Bonzini 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-04 Thread David Hildenbrand


On 02.04.24 01:29, James Houghton wrote:

The bitmap is provided for secondary MMUs to use if they support it. For
test_young(), after it returns, the bitmap represents the pages that
were young in the interval [start, end). For clear_young, it represents
the pages that we wish the secondary MMU to clear the accessed/young bit
for.

If a bitmap is not provided, the mmu_notifier_{test,clear}_young() API
should be unchanged except that if young PTEs are found and the
architecture supports passing in a bitmap, instead of returning 1,
MMU_NOTIFIER_YOUNG_FAST is returned.

This allows MGLRU's look-around logic to work faster, resulting in a 4%
improvement in real workloads[1]. Also introduce MMU_NOTIFIER_YOUNG_FAST
to indicate to main mm that doing look-around is likely to be
beneficial.

If the secondary MMU doesn't support the bitmap, it must return
an int that contains MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.

[1]: https://lore.kernel.org/all/20230609005935.42390-1-yuz...@google.com/

Suggested-by: Yu Zhao 
Signed-off-by: James Houghton 
---
  include/linux/mmu_notifier.h | 93 +---
  include/trace/events/kvm.h   | 13 +++--
  mm/mmu_notifier.c| 20 +---
  virt/kvm/kvm_main.c  | 19 ++--
  4 files changed, 123 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {
  
  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
  
+#define MMU_NOTIFIER_YOUNG			(1 << 0)

+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE   (1 << 1)


Especially this one really deserves some documentation :)


+#define MMU_NOTIFIER_YOUNG_FAST(1 << 2)


And that one as well.

Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).


+
  struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
 * clear_young is a lightweight version of clear_flush_young. Like the
 * latter, it is supposed to test-and-clear the young/accessed bitflag
 * in the secondary pte, but it may omit flushing the secondary tlb.
+*
+* If @bitmap is given but is not supported, return
+* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+*
+* If the walk is done "quickly" and there were young PTEs,
+* MMU_NOTIFIER_YOUNG_FAST is returned.
 */
int (*clear_young)(struct mmu_notifier *subscription,
   struct mm_struct *mm,
   unsigned long start,
-  unsigned long end);
+  unsigned long end,
+  unsigned long *bitmap);
  
  	/*

 * test_young is called to check the young/accessed bitflag in
 * the secondary pte. This is used to know if the page is
 * frequently used without actually clearing the flag or tearing
 * down the secondary mapping on the page.
+*
+* If @bitmap is given but is not supported, return
+* MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+*
+* If the walk is done "quickly" and there were young PTEs,
+* MMU_NOTIFIER_YOUNG_FAST is returned.
 */
int (*test_young)(struct mmu_notifier *subscription,
  struct mm_struct *mm,
- unsigned long address);
+ unsigned long start,
+ unsigned long end,
+ unsigned long *bitmap);


What does "quickly" mean (why not use "fast")? What are the semantics, I 
don't find any existing usage of that in this file.


Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

--
Cheers,

David / dhildenb

[PATCH v2 0/2] Improvements to virtio_balloon pm

2024-03-20 Thread David Stevens

From: David Stevens 

The virtio_balloon driver uses wakeup sources to allow the guest to
enter system power management sleep states (e.g. s2idle) without running
the risk of becoming unresponsive to cooperative memory management
requests from the host. This series fixes an issue where wakeup sources
for inflate/deflate were improperly shared between drivers. It also
closes a race where stats requests that come in immediately before a
sleep state transition could fail to be handled in a timely manner.

v1: https://lore.kernel.org/lkml/20240318091034.535573-1-steve...@google.com/

v1 -> v2:
 - Add comment about virtio-balloon's wakeup source
 - Rename virtio-balloon wakeup event macros

David Stevens (2):
  virtio_balloon: Give the balloon its own wakeup source
  virtio_balloon: Treat stats requests as wakeup events

 drivers/virtio/virtio_balloon.c | 84 +
 1 file changed, 55 insertions(+), 29 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.291.gc1ea87d7ee-goog

[PATCH v2 2/2] virtio_balloon: Treat stats requests as wakeup events

2024-03-20 Thread David Stevens

From: David Stevens 

Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens 
Acked-by: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 75 -
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 89bc8da80519..b09e8e3c62e5 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -121,11 +121,14 @@ struct virtio_balloon {
struct page_reporting_dev_info pr_dev_info;
 
/* State for keeping the wakeup_source active while adjusting the 
balloon */
-   spinlock_t adjustment_lock;
-   bool adjustment_signal_pending;
-   bool adjustment_in_progress;
+   spinlock_t wakeup_lock;
+   bool processing_wakeup_event;
+   u32 wakeup_signal_mask;
 };
 
+#define VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST (1 << 0)
+#define VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS (1 << 1)
+
 static const struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BALLOON, VIRTIO_DEV_ANY_ID },
{ 0 },
@@ -140,6 +143,36 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
+static void start_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(>wakeup_lock, flags);
+   vb->wakeup_signal_mask |= mask;
+   if (!vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = true;
+   pm_stay_awake(>vdev->dev);
+   }
+   spin_unlock_irqrestore(>wakeup_lock, flags);
+}
+
+static void process_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   spin_lock_irq(>wakeup_lock);
+   vb->wakeup_signal_mask &= ~mask;
+   spin_unlock_irq(>wakeup_lock);
+}
+
+static void finish_wakeup_event(struct virtio_balloon *vb)
+{
+   spin_lock_irq(>wakeup_lock);
+   if (!vb->wakeup_signal_mask && vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = false;
+   pm_relax(>vdev->dev);
+   }
+   spin_unlock_irq(>wakeup_lock);
+}
+
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -370,8 +403,10 @@ static void stats_request(struct virtqueue *vq)
struct virtio_balloon *vb = vq->vdev->priv;
 
spin_lock(>stop_update_lock);
-   if (!vb->stop_update)
+   if (!vb->stop_update) {
+   start_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS);
queue_work(system_freezable_wq, >update_balloon_stats_work);
+   }
spin_unlock(>stop_update_lock);
 }
 
@@ -444,29 +479,10 @@ static void virtio_balloon_queue_free_page_work(struct 
virtio_balloon *vb)
 
 static void start_update_balloon_size(struct virtio_balloon *vb)
 {
-   unsigned long flags;
-
-   spin_lock_irqsave(>adjustment_lock, flags);
-   vb->adjustment_signal_pending = true;
-   if (!vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = true;
-   pm_stay_awake(>vdev->dev);
-   }
-   spin_unlock_irqrestore(>adjustment_lock, flags);
-
+   start_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST);
queue_work(system_freezable_wq, >update_balloon_size_work);
 }
 
-static void end_update_balloon_size(struct virtio_balloon *vb)
-{
-   spin_lock_irq(>adjustment_lock);
-   if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = false;
-   pm_relax(>vdev->dev);
-   }
-   spin_unlock_irq(>adjustment_lock);
-}
-
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
@@ -495,7 +511,10 @@ static void update_balloon_stats_func(struct work_struct 
*work)
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_stats_work);
+
+   process_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS);
stats_handle_request(vb);
+   finish_wakeup_event(vb);
 }
 
 static void update_balloon_size_func(struct work_struct *work)
@@ -506,9 +525,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
 
-   spin_lock_irq(>adjustment_lock);
-   vb->adjustment_signal_pending = false;
-   spin_unlock_irq(>adjustment_lock);
+   process_wakeup_event(vb, VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST);
 
diff = towards_target(vb);
 
@@ -523,7 +540,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
if (diff)
queue_work(system_freezable_wq, work);
else
-

[PATCH v2 1/2] virtio_balloon: Give the balloon its own wakeup source

2024-03-20 Thread David Stevens

From: David Stevens 

Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens 
Acked-by: David Hildenbrand 
---
 drivers/virtio/virtio_balloon.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..89bc8da80519 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon 
*vb)
vb->adjustment_signal_pending = true;
if (!vb->adjustment_in_progress) {
vb->adjustment_in_progress = true;
-   pm_stay_awake(vb->vdev->dev.parent);
+   pm_stay_awake(>vdev->dev);
}
spin_unlock_irqrestore(>adjustment_lock, flags);
 
@@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon 
*vb)
spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
-   pm_relax(vb->vdev->dev.parent);
+   pm_relax(>vdev->dev);
}
spin_unlock_irq(>adjustment_lock);
 }
@@ -1029,6 +1029,15 @@ static int virtballoon_probe(struct virtio_device *vdev)
 
spin_lock_init(>adjustment_lock);
 
+   /*
+* The virtio balloon itself can't wake up the device, but it is
+* responsible for processing wakeup events passed up from the transport
+* layer. Wakeup sources don't support nesting/chaining calls, so we use
+* our own wakeup source to ensure wakeup events are properly handled
+* without trampling on the transport layer's wakeup source.
+*/
+   device_set_wakeup_capable(>vdev->dev, true);
+
virtio_device_ready(vdev);
 
if (towards_target(vb))
-- 
2.44.0.291.gc1ea87d7ee-goog

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-20 Thread David Woodhouse

On Tue, 2024-03-19 at 14:47 +0100, Peter Hilber wrote:
> While the virtio-comment list is not available, now also CC'ing Parav,
> which may be interested in this virtio-rtc spec related discussion thread.
> 
> On 14.03.24 15:19, David Woodhouse wrote:
> > On 14 March 2024 11:13:37 CET, Peter Hilber  
> > wrote:
> > > > To a certain extent, as long as the virtio-rtc device is designed to 
> > > > expose time precisely and unambiguously, it's less important if the 
> > > > Linux kernel *today* can use that. Although of course we should strive 
> > > > for that. Let's be...well, *unambiguous*, I suppose... that we've 
> > > > changed topics to discuss that though.
> > > > 
> > > 
> > > As Virtio is extensible (unlike hardware), my approach is to mostly 
> > > specify
> > > only what also has a PoC user and a use case.
> > 
> > If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case 
> > on day one. Otherwise, as I said in my first response, I can go do that as 
> > a separate device and decide that virtio_rtc doesn't meet our needs 
> > (especially for maintaining accuracy over LM).
> 
> We plan to add 
> 
> - leap second indication,
> 
> - UTC-to-TAI offset,
> 
> - clock smearing indication (including the noon-to-noon linear smearing
>   variant which seems to be somewhat popular), and
>
> - clock accuracy indication
> 
> to the initial spec and to the PoC implementation.

Sounds good, thanks! I look forward to seeing the new revision. I'm
hoping Julien can give feedback on the clock accuracy parts.

> However, due to resource restrictions, we cannot ourselves add the
> memory-mapped clock to the initial spec.
>
> Everyone is very welcome to contribute the memory-mapped clock to the spec,
> and I think it might then still make it to the initial version.

Makes sense. That is my primary target, so I'm *hoping* we can converge
and get that into your initial spec, otherwise for expediency I'm going
to have to define an ACPI or DT or PCI device of our own and expose the
memory region through that instead.

(Even if I have to do that in the short term to stop the bleeding with
customers' clocks and live migration, I'd still aspire to migrate to a
virtio_rtc version of it in future)



smime.p7s
Description: S/MIME cryptographic signature

Re: [PATCH 2/2] virtio_balloon: Treat stats requests as wakeup events

2024-03-19 Thread David Hildenbrand


On 18.03.24 10:10, David Stevens wrote:

From: David Stevens 

Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens 
---
  drivers/virtio/virtio_balloon.c | 75 -
  1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7fe7ef5f1c77..402dec98e08c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -121,11 +121,14 @@ struct virtio_balloon {
struct page_reporting_dev_info pr_dev_info;
  
  	/* State for keeping the wakeup_source active while adjusting the balloon */

-   spinlock_t adjustment_lock;
-   bool adjustment_signal_pending;
-   bool adjustment_in_progress;
+   spinlock_t wakeup_lock;
+   bool processing_wakeup_event;
+   u32 wakeup_signal_mask;
  };
  
+#define ADJUSTMENT_WAKEUP_SIGNAL (1 << 0)

+#define STATS_WAKEUP_SIGNAL (1 << 1)


I'd suggest a different naming like:

VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST
VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS


Apart from that, nothing jumped at me.

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

Re: [PATCH 1/2] virtio_balloon: Give the balloon its own wakeup source

2024-03-19 Thread David Hildenbrand


On 18.03.24 10:10, David Stevens wrote:

From: David Stevens 

Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens 
---
  drivers/virtio/virtio_balloon.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..7fe7ef5f1c77 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon 
*vb)
vb->adjustment_signal_pending = true;
if (!vb->adjustment_in_progress) {
vb->adjustment_in_progress = true;
-   pm_stay_awake(vb->vdev->dev.parent);
+   pm_stay_awake(>vdev->dev);
}
spin_unlock_irqrestore(>adjustment_lock, flags);
  
@@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon *vb)

spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
-   pm_relax(vb->vdev->dev.parent);
+   pm_relax(>vdev->dev);
}
spin_unlock_irq(>adjustment_lock);
  }
@@ -1028,6 +1028,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
}
  
  	spin_lock_init(>adjustment_lock);


Can we add a comment here why we have to do that?


+   device_set_wakeup_capable(>vdev->dev, true);
  
  	virtio_device_ready(vdev);
  


Absolutely not an expert on the details, but I assume this is fine.

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb

[PATCH v1] virtio-mem: support suspend+resume

2024-03-18 Thread David Hildenbrand

With virtio-mem, primarily hibernation is problematic: as the machine shuts
down, the virtio-mem device loses its state. Powering the machine back up
is like losing a bunch of DIMMs. While there would be ways to add limited
support, suspend+resume is more commonly used for VMs and "easier" to
support cleanly.

s2idle can be supported without any device dependencies. Similarly, one
would expect suspend-to-ram (i.e., S3) to work out of the box. However,
QEMU currently unplugs all device memory when resuming the VM, using a
cold reset on the "wakeup" path. In order to support S3, we need a feature
flag for the device to tell us if memory remains plugged when waking up. In
the future, QEMU will implement this feature.

So let's always support s2idle and support S3 with plugged memory only if
the device indicates support. Block hibernation early using the PM
notifier.

Trying to hibernate now fails early:
# echo disk > /sys/power/state
[   26.455369] PM: hibernation: hibernation entry
[   26.458271] virtio_mem virtio0: hibernation is not supported.
[   26.462498] PM: hibernation: hibernation exit
-bash: echo: write error: Operation not permitted

s2idle works even without the new feature bit:
# echo s2idle > /sys/power/mem_sleep
# echo mem > /sys/power/state
[   52.083725] PM: suspend entry (s2idle)
[   52.095950] Filesystems sync: 0.010 seconds
[   52.101493] Freezing user space processes
[   52.104213] Freezing user space processes completed (elapsed 0.001 
seconds)
[   52.106520] OOM killer disabled.
[   52.107655] Freezing remaining freezable tasks
[   52.110880] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[   52.113296] printk: Suspending console(s) (use no_console_suspend to 
debug)

S3 does not work without the feature bit when memory is plugged:
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state
[   32.788281] PM: suspend entry (deep)
[   32.816630] Filesystems sync: 0.027 seconds
[   32.820029] Freezing user space processes
[   32.823870] Freezing user space processes completed (elapsed 0.001 
seconds)
[   32.827756] OOM killer disabled.
[   32.829608] Freezing remaining freezable tasks
[   32.833842] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[   32.837953] printk: Suspending console(s) (use no_console_suspend to 
debug)
[   32.916172] virtio_mem virtio0: suspend+resume with plugged memory 
is not supported
[   32.916181] virtio-pci :00:02.0: PM: pci_pm_suspend(): 
virtio_pci_freeze+0x0/0x50 returns -1
[   32.916197] virtio-pci :00:02.0: PM: dpm_run_callback(): 
pci_pm_suspend+0x0/0x170 returns -1
[   32.916210] virtio-pci :00:02.0: PM: failed to suspend async: 
error -1

But S3 works with the new feature bit when memory is plugged (patched
QEMU):
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state
[   33.983694] PM: suspend entry (deep)
[   34.009828] Filesystems sync: 0.024 seconds
[   34.013589] Freezing user space processes
[   34.016722] Freezing user space processes completed (elapsed 0.001 
seconds)
[   34.019092] OOM killer disabled.
[   34.020291] Freezing remaining freezable tasks
[   34.023549] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[   34.026090] printk: Suspending console(s) (use no_console_suspend to 
debug)

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
---

I had QEMU support ready [1] but reset-related things just changed upstream
that will require a bit of a rework -- and it will end up looking cleaner.

Will come back to upstreaming the QEMU part once I can properly sync
the Linux headers to contain VIRTIO_MEM_F_PERSISTENT_SUSPEND.

[1] https://github.com/davidhildenbrand/qemu/tree/virtio-mem-suspend

---
 drivers/virtio/virtio_mem.c | 68 ++---
 include/uapi/linux/virtio_mem.h |  2 +
 2 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 8e32232944423..51088d02de32f 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -252,6 +253,9 @@ struct virtio_mem {
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
 
+   /* Notifier to block hibernation image storing/reloading. */
+   struct notifier_block pm_notifier;
+
 #ifdef CONFIG_PROC_VMCORE
/* vmcore callback for /proc/vmcore handling in kdump mode */
struct vmcore_cb vmcore_cb;
@@ -,6 +1115,25 @@ static int virtio_mem_memory_notifi

[PATCH 2/2] virtio_balloon: Treat stats requests as wakeup events

2024-03-18 Thread David Stevens

From: David Stevens 

Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio_balloon.c | 75 -
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7fe7ef5f1c77..402dec98e08c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -121,11 +121,14 @@ struct virtio_balloon {
struct page_reporting_dev_info pr_dev_info;
 
/* State for keeping the wakeup_source active while adjusting the 
balloon */
-   spinlock_t adjustment_lock;
-   bool adjustment_signal_pending;
-   bool adjustment_in_progress;
+   spinlock_t wakeup_lock;
+   bool processing_wakeup_event;
+   u32 wakeup_signal_mask;
 };
 
+#define ADJUSTMENT_WAKEUP_SIGNAL (1 << 0)
+#define STATS_WAKEUP_SIGNAL (1 << 1)
+
 static const struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BALLOON, VIRTIO_DEV_ANY_ID },
{ 0 },
@@ -140,6 +143,36 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
+static void start_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(>wakeup_lock, flags);
+   vb->wakeup_signal_mask |= mask;
+   if (!vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = true;
+   pm_stay_awake(>vdev->dev);
+   }
+   spin_unlock_irqrestore(>wakeup_lock, flags);
+}
+
+static void process_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   spin_lock_irq(>wakeup_lock);
+   vb->wakeup_signal_mask &= ~mask;
+   spin_unlock_irq(>wakeup_lock);
+}
+
+static void finish_wakeup_event(struct virtio_balloon *vb)
+{
+   spin_lock_irq(>wakeup_lock);
+   if (!vb->wakeup_signal_mask && vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = false;
+   pm_relax(>vdev->dev);
+   }
+   spin_unlock_irq(>wakeup_lock);
+}
+
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -370,8 +403,10 @@ static void stats_request(struct virtqueue *vq)
struct virtio_balloon *vb = vq->vdev->priv;
 
spin_lock(>stop_update_lock);
-   if (!vb->stop_update)
+   if (!vb->stop_update) {
+   start_wakeup_event(vb, STATS_WAKEUP_SIGNAL);
queue_work(system_freezable_wq, >update_balloon_stats_work);
+   }
spin_unlock(>stop_update_lock);
 }
 
@@ -444,29 +479,10 @@ static void virtio_balloon_queue_free_page_work(struct 
virtio_balloon *vb)
 
 static void start_update_balloon_size(struct virtio_balloon *vb)
 {
-   unsigned long flags;
-
-   spin_lock_irqsave(>adjustment_lock, flags);
-   vb->adjustment_signal_pending = true;
-   if (!vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = true;
-   pm_stay_awake(>vdev->dev);
-   }
-   spin_unlock_irqrestore(>adjustment_lock, flags);
-
+   start_wakeup_event(vb, ADJUSTMENT_WAKEUP_SIGNAL);
queue_work(system_freezable_wq, >update_balloon_size_work);
 }
 
-static void end_update_balloon_size(struct virtio_balloon *vb)
-{
-   spin_lock_irq(>adjustment_lock);
-   if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = false;
-   pm_relax(>vdev->dev);
-   }
-   spin_unlock_irq(>adjustment_lock);
-}
-
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
@@ -495,7 +511,10 @@ static void update_balloon_stats_func(struct work_struct 
*work)
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_stats_work);
+
+   process_wakeup_event(vb, STATS_WAKEUP_SIGNAL);
stats_handle_request(vb);
+   finish_wakeup_event(vb);
 }
 
 static void update_balloon_size_func(struct work_struct *work)
@@ -506,9 +525,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
 
-   spin_lock_irq(>adjustment_lock);
-   vb->adjustment_signal_pending = false;
-   spin_unlock_irq(>adjustment_lock);
+   process_wakeup_event(vb, ADJUSTMENT_WAKEUP_SIGNAL);
 
diff = towards_target(vb);
 
@@ -523,7 +540,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
if (diff)
queue_work(system_freezable_wq, work);
else
-   end_update_balloon_size(vb);
+   finish_wakeup_event(vb);
 }
 
 static

[PATCH 1/2] virtio_balloon: Give the balloon its own wakeup source

2024-03-18 Thread David Stevens

From: David Stevens 

Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens 
---
 drivers/virtio/virtio_balloon.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..7fe7ef5f1c77 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon 
*vb)
vb->adjustment_signal_pending = true;
if (!vb->adjustment_in_progress) {
vb->adjustment_in_progress = true;
-   pm_stay_awake(vb->vdev->dev.parent);
+   pm_stay_awake(>vdev->dev);
}
spin_unlock_irqrestore(>adjustment_lock, flags);
 
@@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon 
*vb)
spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
-   pm_relax(vb->vdev->dev.parent);
+   pm_relax(>vdev->dev);
}
spin_unlock_irq(>adjustment_lock);
 }
@@ -1028,6 +1028,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
}
 
spin_lock_init(>adjustment_lock);
+   device_set_wakeup_capable(>vdev->dev, true);
 
virtio_device_ready(vdev);
 
-- 
2.44.0.291.gc1ea87d7ee-goog

[PATCH 0/2] Improvements to virtio_balloon pm

2024-03-18 Thread David Stevens

From: David Stevens 

The virtio_balloon driver uses wakeup sources to allow the guest to
enter system power management sleep states (e.g. s2idle) without running
the risk of becoming unresponsive to cooperative memory management
requests from the host. This series fixes an issue where wakeup sources
for inflate/deflate were improperly shared between drivers. It also
closes a race where stats requests that come in immediately before a
sleep state transition could fail to be handled in a timely manner.

David Stevens (2):
  virtio_balloon: Give the balloon its own wakeup source
  virtio_balloon: Treat stats requests as wakeup events

 drivers/virtio/virtio_balloon.c | 76 -
 1 file changed, 47 insertions(+), 29 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.291.gc1ea87d7ee-goog

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-14 Thread David Woodhouse

On 14 March 2024 11:13:37 CET, Peter Hilber  
wrote:
>> To a certain extent, as long as the virtio-rtc device is designed to expose 
>> time precisely and unambiguously, it's less important if the Linux kernel 
>> *today* can use that. Although of course we should strive for that. Let's 
>> be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
>> that though.
>> 
>
>As Virtio is extensible (unlike hardware), my approach is to mostly specify
>only what also has a PoC user and a use case.

If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on 
day one. Otherwise, as I said in my first response, I can go do that as a 
separate device and decide that virtio_rtc doesn't meet our needs (especially 
for maintaining accuracy over LM).

My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that 
it's extensible but we don't want a v1.0 of the spec, implemented by various 
hypervisors, which still leaves guests not knowing what the actual time is. 
That would not be good. And even UTC without a leap second indicator has that 
problem.

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On 13 March 2024 17:50:48 GMT, Peter Hilber  
wrote:
>On 13.03.24 13:45, David Woodhouse wrote:
>> Surely the whole point of this effort is to provide guests with precise
>> and *unambiguous* knowledge of what the time is? 
>
>I would say, a fundamental point of this effort is to enable such
>implementations, and to detect if a device is promising to support this.
>
>Where we might differ is as to whether the Virtio clock *for every
>implementation* has to be *continuously* accurate w.r.t. a time standard,
>or whether *for some implementations* it could be enough that all guests in
>the local system have the same, precise local notion of time, which might
>be off from the actual time standard.

That makes sense, but remember I don't just want {X, Y, Z} but *also* the error 
bounds of ±deltaY and ±deltaZ too.

So your example just boils down to "I'm calling it UTC, and it's really 
precise, but we make no promises about its *accuracy*". And that's fine.

>Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

KVM is not an exemplar of good time practices. 
Not in *any* respect :)

>With your described use case the UTC_SMEARED clock should of course not be
>used. The UTC_SMEARED clock would get a distinct name through udev, like
>/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>detected.

As long as it's clear to all concerned that this is fundamentally not usable as 
an accurate time source, and is only for the local-sync case you described, 
sure.

>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>> leap second the guest can't know know *which* occurrence of that leap
>> second it is, so it might be wrong by a second. To resolve that
>> ambiguity needs a leap indicator and/or tai_offset field.
>
>I agree that virtio-rtc should communicate this. The question is, what
>exactly, and for which clock read request?

Are we now conflating software architecture (and Linux in particular) with 
"hardware" design?

To a certain extent, as long as the virtio-rtc device is designed to expose 
time precisely and unambiguously, it's less important if the Linux kernel 
*today* can use that. Although of course we should strive for that. Let's 
be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
that though.

>As for PTP clocks:
>
>- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>
>- The clock_adjtime(2) tai_offset and return value could be set (if
>  upstream will accept this). Would this help? As discussed, user space
>  would need to interpret this (and currently no dynamic POSIX clock sets
>  this).

Hm, maybe?

>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>> so the device might not even know which vCPUs there are. E.g. there is even
>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>> might be implemented in hardware).
>> 
>> Sure, but those implementations aren't going to offer the TSC pairing
>> at all, are they?
>> 
>
>They could offer an Intel ART pairing (some physical PTP NICs are already
>doing this, look for the convert_art_to_tsc() users).

Right, but isn't that software's problem? The time pairing is defined against 
the ART in that case.

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
> The TSC or whatever CPU counter/clock that is used to keep the system
> time is not an RTC, I don't get why it has to be exposed as such to the
> guests. PTP is fine and precise, RTC is not.

Ah, I see. But the point of the virtio_rtc is not really to expose that
CPU counter. The point is to report the wallclock time, just like an
actual RTC. The real difference is the *precision*.

The virtio_rtc device has a facility to *also* expose the counter,
because that's what we actually need to gain that precision...

Applications don't read the RTC every time they want to know what the
time is. These days, they don't even make a system call; it's done
entirely in userspace mode. The kernel exposes some shared memory,
essentially saying "the counter was X at time Y, and runs at Z Hz".
Then applications just read the CPU counter and do some arithmetic.

As we require more and more precision in the calibration, it becomes
important to get *paired* readings of the CPU counter and the wallclock
time at precisely the same moment. If the guest has to read one and
then the other, potentially taking interrupts, getting preempted and
suffering steal/SMI time in the middle, that introduces an error which
is increasingly significant as we increasingly care about precision.

Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
kernels having to repeat readings over time and perform the calibration
as the underlying hardware oscillator frequency (Z) drifts with
temperature. I'm trying to get him to let the hypervisor expose the
calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
Which aside from reducing the duplication of effort, will *also* fix
the problem of live migration where *all* those things suffer a step
change and leave the guest with an inaccurate clock but not knowing it.

But that part isn't relevant to the RTC, as you say. RTC doesn't care
about that level of precision; it's just what the system uses to know
roughly what year it is, when it powers up. (And isn't even really
trusted for that, which is a large part of why I added the
X509_V_FLAG_NO_CHECK_TIME flag to OpenSSL, so Secure Boot doesn't break
when the RTC is catastrophically wrong :)

If you're asking why patch 7/7 in Peter's series exists to expose the
virtio clock through RTC, and you're not particularly interested in the
first six, I suppose that's a fair question. As is the question of "why
is it called virtio_rtc not virtio_ptp?". 

But let me turn it around: if the kernel has access to this virtio
device and *not* any other RTC, why *wouldn't* the kernel use the time
from it? The fact that it can optionally *also* provide paired readings
with the CPU counter doesn't actually *hurt* for the RTC use case, does
it?

smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
> On 12.03.24 18:15, David Woodhouse wrote:
> > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> > > On 08.03.24 13:33, David Woodhouse wrote:
> > > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > > > (sometimes) I still don't actually know what the time is, because 
> > > > > > some
> > > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > > > offset, surely? Should the virtio_rtc specification make it 
> > > > > > mandatory
> > > > > > to provide such?
> > > > > > 
> > > > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > > > expose incomplete information.
> > > > > > 
> > > > > 
> > > > > Hi David,
> > > > > 
> > > > > (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> > > > > 
> > > > > thank you for your insightful comments. I think I take a broadly 
> > > > > similar
> > > > > view. The reason why the current spec and driver is like this is that 
> > > > > I
> > > > > took a pragmatic approach at first and only included features which 
> > > > > work
> > > > > out-of-the-box for the current Linux ecosystem.
> > > > > 
> > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > > > can work out-of-the-box with time sync daemons such as chrony.
> > > > > 
> > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > > > as well, I am afraid that
> > > > > 
> > > > > - in some (embedded) scenarios, the TAI clock may not be available
> > > > > 
> > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > > > 
> > > > > For the same reasons, I am also not sure about adding a *mandatory* 
> > > > > TAI
> > > > > offset to each readout. I don't know user-space software which would
> > > > > leverage this already (at least not through the PTP clock interface).
> > > > > And why would such software not go straight for the TAI clock instead?
> > > > > 
> > > > > How about adding a requirement to the spec that the virtio-rtc device
> > > > > SHOULD expose the TAI clock whenever it is available - would this
> > > > > address your concerns?
> > > > 
> > > > I think that would be too easy for implementors to miss, or decide not
> > > > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > > > putting UTC in it.
> > > > 
> > > > I think I prefer to mandate the tai_offset field with the UTC clock.
> > > > Crappy implementations will just set it to zero, but at least that
> > > > gives a clear signal to the guests that it's *their* problem to
> > > > resolve.
> > > 
> > > To me there are some open questions regarding how this would work. Is 
> > > there
> > > a use case for this with the v3 clock reading methods, or would it be
> > > enough to address this with the Virtio timekeeper?
> > > 
> > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> > > best alongside some additional information about leap seconds. I am not
> > > aware about any user-space user. In addition, leap second smearing should
> > > also be addressed.
> > > 
> > 
> > Is there even a standard yet for leap-smearing? Will it be linear over
> > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> > think is what Google does? Meta does something different again, don't
> > they?
> > 
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
> 
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.

As long as the gues

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> 
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?

Well yes, we don't need to expose *anything* from the hypervisor and we
can leave it all to guest userspace. We can run NTP on every single one
of *hundreds* of guests, leaving them all to duplicate the work of
calibrating the *same* underlying oscillator.

I thought we were trying to avoid that, by having the hypervisor tell
them what the time was. If we're going to do that, we need it to be
sufficiently precise (and some clients want to *know* the precision),
and above all we need it to be *unambiguous*.

If the hypervisor says that the time is 3692217600.001, then the guest
doesn't actually know *which* 3692217600.001 it is, and thus it still
doesn't know the time to an accuracy better than 1 second.

And if we start allowing the hypervisor to smear clocks in some other
underspecified ways, then we end up with errors of up to 1 second in
the clock for long periods of time *around* the leap second.

We need to avoid that ambiguity.

> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.

If an RTC device is able to report '23:59:60' as the time of day, I
suppose that *could* resolve the ambiguity. But talking to a device is
slow; we want guests to be able to know the time — accurately — with a
simple counter/tsc read and some arithmetic. Which means *paired* reads
of 'RTC' and the counter, and a precise indication of the counter
frequency.

smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-12 Thread David Woodhouse

On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> On 08.03.24 13:33, David Woodhouse wrote:
> > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > (sometimes) I still don't actually know what the time is, because some
> > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > offset, surely? Should the virtio_rtc specification make it mandatory
> > > > to provide such?
> > > > 
> > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > expose incomplete information.
> > > > 
> > > 
> > > Hi David,
> > > 
> > > (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> > > 
> > > thank you for your insightful comments. I think I take a broadly similar
> > > view. The reason why the current spec and driver is like this is that I
> > > took a pragmatic approach at first and only included features which work
> > > out-of-the-box for the current Linux ecosystem.
> > > 
> > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > can work out-of-the-box with time sync daemons such as chrony.
> > > 
> > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > as well, I am afraid that
> > > 
> > > - in some (embedded) scenarios, the TAI clock may not be available
> > > 
> > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > 
> > > For the same reasons, I am also not sure about adding a *mandatory* TAI
> > > offset to each readout. I don't know user-space software which would
> > > leverage this already (at least not through the PTP clock interface).
> > > And why would such software not go straight for the TAI clock instead?
> > > 
> > > How about adding a requirement to the spec that the virtio-rtc device
> > > SHOULD expose the TAI clock whenever it is available - would this
> > > address your concerns?
> > 
> > I think that would be too easy for implementors to miss, or decide not
> > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > putting UTC in it.
> > 
> > I think I prefer to mandate the tai_offset field with the UTC clock.
> > Crappy implementations will just set it to zero, but at least that
> > gives a clear signal to the guests that it's *their* problem to
> > resolve.
> 
> To me there are some open questions regarding how this would work. Is there
> a use case for this with the v3 clock reading methods, or would it be
> enough to address this with the Virtio timekeeper?
> 
> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> best alongside some additional information about leap seconds. I am not
> aware about any user-space user. In addition, leap second smearing should
> also be addressed.
> 

Is there even a standard yet for leap-smearing? Will it be linear over
1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
think is what Google does? Meta does something different again, don't
they?

Exposing UTC as the only clock reference is bad enough; when leap
seconds happen there's a whole second during which you don't *know*
which second it is. It seems odd to me, for a precision clock to be
deliberately ambiguous about what the time is!

But if the virtio-rtc clock is defined as UTC and then expose something
*different* in it, that's even worse. You potentially end up providing
inaccurate time for a whole *day* leading up to the leap second.

I think you're right that leap second smearing should be addressed. At
the very least, by making it clear that the virtio-rtc clock which
advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
yet-to-be-defined variant.

Please make it explicit that any hypervisor which wants to advertise a
smeared clock shall define a new type which specifies the precise
smearing algorithm and cannot be conflated with the one you're defining
here.

> > One other thing to note is I think we're being very naïve about the TSC
> > on x86 hosts. Theoretically, the TSC for every vCPU might run at a
> > different frequency, and even if they run at the same frequency they
> > might be offset from each other. I'm happy to be naïve but I think we
> > should be *explicitly* so, and just say for example that it's defined
> > against vCPU0 so if other vCPUs are different then all bets are off.
> 
> ATM Virtio has no notion of vCPUs, or vCPU topology. So I wonder if you
> have an opinion on how to represent this in a platform-independent way.

Well, it doesn't have a notion of TSCs either; you include that by
implicit reference don't you?



smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-08 Thread David Woodhouse

On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> On 07.03.24 15:02, David Woodhouse wrote:
> > On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> > > RFC v3 updates
> > > --
> > > 
> > > This series implements a driver for a virtio-rtc device conforming to spec
> > > RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> > > the PTP clock driver already present before.
> > > 
> > > This patch series depends on the patch series "treewide: Use clocksource 
> > > id
> > > for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> > > series on top of mainline.
> > > 
> > > Overview
> > > 
> > > 
> > > This patch series adds the virtio_rtc module, and related bugfixes. The
> > > virtio_rtc module implements a driver compatible with the proposed Virtio
> > > RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> > > provides information about current time. The device can provide different
> > > clocks, e.g. for the UTC or TAI time standards, or for physical time
> > > elapsed since some past epoch. 
> > 
> > Hm, should we allow UTC? If you tell me the time in UTC, then
> > (sometimes) I still don't actually know what the time is, because some
> > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > offset, surely? Should the virtio_rtc specification make it mandatory
> > to provide such?
> > 
> > Otherwise you're just designing it to allow crappy hypervisors to
> > expose incomplete information.
> > 
> 
> Hi David,
> 
> (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> 
> thank you for your insightful comments. I think I take a broadly similar
> view. The reason why the current spec and driver is like this is that I
> took a pragmatic approach at first and only included features which work
> out-of-the-box for the current Linux ecosystem.
> 
> The current virtio_rtc features work similar to ptp_kvm, and therefore can
> work out-of-the-box with time sync daemons such as chrony.
> 
> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock as
> well, I am afraid that
> 
> - in some (embedded) scenarios, the TAI clock may not be available
> 
> - crappy hypervisors will pass off the UTC clock as the TAI clock.
> 
> For the same reasons, I am also not sure about adding a *mandatory* TAI
> offset to each readout. I don't know user-space software which would
> leverage this already (at least not through the PTP clock interface). And
> why would such software not go straight for the TAI clock instead?
> 
> How about adding a requirement to the spec that the virtio-rtc device
> SHOULD expose the TAI clock whenever it is available - would this address
> your concerns?

I think that would be too easy for implementors to miss, or decide not
to obey. Or to get *wrong*, by exposing a TAI clock but actually
putting UTC in it.

I think I prefer to mandate the tai_offset field with the UTC clock.
Crappy implementations will just set it to zero, but at least that
gives a clear signal to the guests that it's *their* problem to
resolve.




> > > PTP clock interface
> > > ---
> > > 
> > > virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> > > If both the Virtio RTC device and this driver have special support for the
> > > current clocksource, time synchronization programs can use
> > > cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> > > PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> > > with single-digit ns precision is possible with a quiescent reference 
> > > clock
> > > (from the Virtio RTC device). This works even when the Virtio device
> > > response is slow compared to ptp_kvm hypercalls.
> > 
> > Is PTP the right mechanism for this? As I understand it, PTP is a way
> > to precisely synchronize one clock with another. But in the case of
> > virt guests synchronizing against the host, it isn't really *another*
> > clock. It really is the *same* underlying clock. As the host clock
> > varies with temperature, for example, so does the guest clock. The only
> > difference is an offset and (on x86 perhaps) a mathematical scaling of
> > the frequency.
> > 
> > I was looking at this another way, when I came across this virtio-rtc
> > work.
> > 
> > My idea was just for the hypervisor to expose its own timekeeping
> > information — the counter/TSC value and TAI time at a given mome

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-07 Thread David Woodhouse

On Mon, 2023-12-18 at 08:38 +0100, Peter Hilber wrote:
> RFC v3 updates
> --
> 
> This series implements a driver for a virtio-rtc device conforming to spec
> RFC v3 [1]. It now includes an RTC class driver with alarm, in addition to
> the PTP clock driver already present before.
> 
> This patch series depends on the patch series "treewide: Use clocksource id
> for get_device_system_crosststamp()" [3]. Pull [4] to get the combined
> series on top of mainline.
> 
> Overview
> 
> 
> This patch series adds the virtio_rtc module, and related bugfixes. The
> virtio_rtc module implements a driver compatible with the proposed Virtio
> RTC device specification [1]. The Virtio RTC (Real Time Clock) device
> provides information about current time. The device can provide different
> clocks, e.g. for the UTC or TAI time standards, or for physical time
> elapsed since some past epoch. 

Hm, should we allow UTC? If you tell me the time in UTC, then
(sometimes) I still don't actually know what the time is, because some
UTC seconds occur twice. UTC only makes sense if you provide the TAI
offset, surely? Should the virtio_rtc specification make it mandatory
to provide such?

Otherwise you're just designing it to allow crappy hypervisors to
expose incomplete information.

> PTP clock interface
> ---
> 
> virtio_rtc exposes clocks as PTP clocks to userspace, similar to ptp_kvm.
> If both the Virtio RTC device and this driver have special support for the
> current clocksource, time synchronization programs can use
> cross-timestamping using ioctl PTP_SYS_OFFSET_PRECISE2 aka
> PTP_SYS_OFFSET_PRECISE. Similar to ptp_kvm, system time synchronization
> with single-digit ns precision is possible with a quiescent reference clock
> (from the Virtio RTC device). This works even when the Virtio device
> response is slow compared to ptp_kvm hypercalls.

Is PTP the right mechanism for this? As I understand it, PTP is a way
to precisely synchronize one clock with another. But in the case of
virt guests synchronizing against the host, it isn't really *another*
clock. It really is the *same* underlying clock. As the host clock
varies with temperature, for example, so does the guest clock. The only
difference is an offset and (on x86 perhaps) a mathematical scaling of
the frequency.

I was looking at this another way, when I came across this virtio-rtc
work.

My idea was just for the hypervisor to expose its own timekeeping
information — the counter/TSC value and TAI time at a given moment,
frequency of the counter, and the precision of both that frequency
(±PPM) and the TAI timestamp (±µs).

By putting that in a host/guest shared data structure with a seqcount
for lockless updates, we can update it as time synchronization on the
host is refined, and we can even cleanly handle live migration where
the guest ends up on a completely different host. It allows for use
cases which *really* care (e.g. timestamping financial transactions) to
ensure that there is never even a moment of getting *wrong* timestamps
if they haven't yet resynced after a migration.

Now I'm trying to work out if I should attempt to reconcile with your
existing virtio-rtc work, or just decide that virtio-rtc isn't trying
to solve the actual problem that we have, and go ahead with something
different... ?

smime.p7s
Description: S/MIME cryptographic signature

[PATCH v1] virtio: reenable config if freezing device failed

2024-02-13 Thread David Hildenbrand

Currently, we don't reenable the config if freezing the device failed.

For example, virtio-mem currently doesn't support suspend+resume, and
trying to freeze the device will always fail. Afterwards, the device
will no longer respond to resize requests, because it won't get notified
about config changes.

Let's fix this by re-enabling the config if freezing fails.

Fixes: 22b7050a024d ("virtio: defer config changed notifications")
Cc: 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
---
 drivers/virtio/virtio.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..f513ee21b1c1 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -510,8 +510,10 @@ int virtio_device_freeze(struct virtio_device *dev)
 
if (drv && drv->freeze) {
ret = drv->freeze(dev);
-   if (ret)
+   if (ret) {
+   virtio_config_enable(dev);
return ret;
+   }
}
 
if (dev->config->destroy_avq)

base-commit: c664e16bb1ba1c8cf1d7ecf3df5fd83bbb8ac15a
-- 
2.43.0

How to display a ktime value as trace timestamp in trace output?

2024-01-31 Thread David Howells

Hi Steven,

I have a tracepoint in AF_RXRPC that displays information about a timeout I'm
going to set.  I have the timeout in a ktime_t as an absolute time.  Is there
a way to display this in the trace output such that it looks like a trace
timestamp and can be (roughly) correlated with the displayed timestamps?

I tried subtracting ktime_get_read() - ktime_get_boottime() from it and
displaying the result, but it looked about one and a bit seconds out from the
trace timestamp.

Thanks,
David

Re: [PATCH 02/20] filelock: add coccinelle scripts to move fields to struct file_lock_core

2024-01-17 Thread David Howells

Do we need to keep these coccinelle scripts for posterity?  Or can they just
be included in the patch description of the patch that generates them?

David

Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES

2024-01-11 Thread David Hildenbrand


On 10.01.24 07:22, Zheyun Shen wrote:

For now, SEV pins guest's memory to avoid swapping or
moving ciphertext, but leading to the inhibition of
Memory Ballooning.

In Memory Ballooning, only guest's free pages will be relocated
in balloon inflation and deflation, so the difference of plaintext
doesn't matter to guest.


A Linux hypervisor will always give you a fresh, zeroed page. I don't 
recall what the spec says, could be that that is a guarantee.




Memory Ballooning is a nice memory overcommitment technology can
be used in CVM based on SEV and SEV-ES, so userspace tools can
provide an option to allow SEV not to pin memory and enable
Memory Ballooning. Guest kernel may not inhibit Balloon and
should set shared memory for Balloon decrypted.


Two points:

1) Memory overcommit means that you promise to have more memory than you 
actually have.


To be able to use that in a *safe* way in the hypervisor, to fulfill 
that promise, you need some backup strategy, which is usually swap space 
in the hypervisor. Further one might apply other techniques (ram 
compression, memory deduplication) in the hypervisor to make that 
swapping unlikely to ever happen when overcommitting (because nobody 
wants to swap).


Assume you run a lot of VMs that mostly have private/encrypted memory 
(which is the default). Imagine you previously inflated the balloon on 
VM0, and that VM needs more memory (you promised it could have more!). 
You reach out to other VMs to inflate the balloon so you get memory 
back, but they cannot give up memory safely.


In that scenario (a) you cannot swap something out because all pages are 
pinned (b) memory compression cannot be applied because pages are pinned 
and (c) memory deduplication cannot be applied because pages are pinned.


Pinned memory is a resource that cannot be overcomitted.

So I am not convinced the use case you are targeting can be considered 
any way of sane memory overcommit. You should better call it resizing VM 
memory instead. Then, it's clearer that the hypervisor cannot promise to 
ever give you that memory when you are in need.



2) What about other features?

What if the hypervisor enabled free-page-reporting? Would that work 
(unlikely, I assunme). Don't we have to block that?


--
Cheers,

David / dhildenb

Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES

2024-01-11 Thread David Hildenbrand


For now, SEV pins guest's memory to avoid swapping or
moving ciphertext, but leading to the inhibition of
Memory Ballooning.

In Memory Ballooning, only guest's free pages will be relocated
in balloon inflation and deflation, so the difference of plaintext
doesn't matter to guest.



This seems only true if the page is zeroed, is this true here?


Sorry, I cannot figure out why the pages should be zeroed. I think
both host kernel and guest kernel assume that the pages are not
zeroed and will use kzalloc or manually zero them in real applications,
which is same as non-SEV environments.


balloon_page_alloc() will not zero the memory (no __GFP_ZERO set). Only 
in some configurations (zero-on-alloc, zero-on-free), the kernel would 
do that implicitly.


So we'd eventually be leaking secrets to the untrusted hypervisor?



I have tested in SEV-ES, reclaiming memory by balloon inflation and reuse
them after balloon deflation both works well with the patch. Hypervisor
can normally give the reclaimed memory from one CVM to another, or give
back to the origin CVM.


I'll comment on your misconception of memory overcommit separately.

--
Cheers,

David / dhildenb

[PATCH v3] virtio_balloon: stay awake while adjusting balloon

2024-01-09 Thread David Stevens

From: David Stevens 

A virtio_balloon's parent device may be configured so that a
configuration change interrupt is a wakeup event. Extend the processing
of such a wakeup event until the balloon finishes inflating or deflating
by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note
that these calls are no-ops if the parent device doesn't support wakeup
events or if the wakeup events are not enabled.

This change allows the guest to use system power states such as s2idle
without running the risk the virtio_balloon's cooperative memory
management becoming unresponsive to the host's requests.

Signed-off-by: David Stevens 
---
v2->v3:
 - Use _irq spinlock functions with adjustment_lock in workqueue, since
   the lock is accessed in an interrupt context.
v1 -> v2:
 - Use adjustment_signal_pending flag instead of a sequence number
 - Call pm_stay_awake/pm_relax on parent device instead of adding a wake
   event to the virtio balloon device

 drivers/virtio/virtio_balloon.c | 57 +++--
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1fe93e93f5bc..fa710e6c505a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -119,6 +119,11 @@ struct virtio_balloon {
/* Free page reporting device */
struct virtqueue *reporting_vq;
struct page_reporting_dev_info pr_dev_info;
+
+   /* State for keeping the wakeup_source active while adjusting the 
balloon */
+   spinlock_t adjustment_lock;
+   bool adjustment_signal_pending;
+   bool adjustment_in_progress;
 };
 
 static const struct virtio_device_id id_table[] = {
@@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct 
virtio_balloon *vb)
queue_work(vb->balloon_wq, >report_free_page_work);
 }
 
+static void start_update_balloon_size(struct virtio_balloon *vb)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(>adjustment_lock, flags);
+   vb->adjustment_signal_pending = true;
+   if (!vb->adjustment_in_progress) {
+   vb->adjustment_in_progress = true;
+   pm_stay_awake(vb->vdev->dev.parent);
+   }
+   spin_unlock_irqrestore(>adjustment_lock, flags);
+
+   queue_work(system_freezable_wq, >update_balloon_size_work);
+}
+
+static void end_update_balloon_size(struct virtio_balloon *vb)
+{
+   spin_lock_irq(>adjustment_lock);
+   if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
+   vb->adjustment_in_progress = false;
+   pm_relax(vb->vdev->dev.parent);
+   }
+   spin_unlock_irq(>adjustment_lock);
+}
+
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
@@ -444,8 +474,7 @@ static void virtballoon_changed(struct virtio_device *vdev)
 
spin_lock_irqsave(>stop_update_lock, flags);
if (!vb->stop_update) {
-   queue_work(system_freezable_wq,
-  >update_balloon_size_work);
+   start_update_balloon_size(vb);
virtio_balloon_queue_free_page_work(vb);
}
spin_unlock_irqrestore(>stop_update_lock, flags);
@@ -476,19 +505,25 @@ static void update_balloon_size_func(struct work_struct 
*work)
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
-   diff = towards_target(vb);
 
-   if (!diff)
-   return;
+   spin_lock_irq(>adjustment_lock);
+   vb->adjustment_signal_pending = false;
+   spin_unlock_irq(>adjustment_lock);
 
-   if (diff > 0)
-   diff -= fill_balloon(vb, diff);
-   else
-   diff += leak_balloon(vb, -diff);
-   update_balloon_size(vb);
+   diff = towards_target(vb);
+
+   if (diff) {
+   if (diff > 0)
+   diff -= fill_balloon(vb, diff);
+   else
+   diff += leak_balloon(vb, -diff);
+   update_balloon_size(vb);
+   }
 
if (diff)
queue_work(system_freezable_wq, work);
+   else
+   end_update_balloon_size(vb);
 }
 
 static int init_vqs(struct virtio_balloon *vb)
@@ -992,6 +1027,8 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}
 
+   spin_lock_init(>adjustment_lock);
+
virtio_device_ready(vdev);
 
if (towards_target(vb))
-- 
2.43.0.472.g3155946c3a-goog

Re: REGRESSION: lockdep warning triggered by 15b9ce7ecd: virtio_balloon: stay awake while adjusting balloon

2024-01-09 Thread David Hildenbrand


On 09.01.24 06:50, David Stevens wrote:

On Tue, Jan 9, 2024 at 6:50 AM Theodore Ts'o  wrote:


Hi, while doing final testing before sending a pull request, I merged
in linux-next, and commit 5b9ce7ecd7: virtio_balloon: stay awake while
adjusting balloon seems to be causing a lockdep warning (see attached)
when running gce-xfstests on a Google Compute Engine e2 VM.  I was not
able to trigger it using kvm-xfstests, but the following command:
"gce-xfstests -C 10 ext4/4k generic/476) was sufficient to triger the
problem.   For more information please see [1] and [2].

[1] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
[2] https://thunk.org/gce-xfstests

I found it by looking at the git logs, and this commit aroused my
suspicions, and I further testing showed that the lockdep warning was
reproducible with this commit, but not when testing with the
immediately preceeding commit (15b9ce7ecd^).

Cheers,

 - Ted


root: ext4/4k run xfstest generic/476
systemd[1]: Started fstests-generic-476.scope - /usr/bin/bash -c test -w 
/proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec 
./tests/generic/476.
kernel: [  399.361181] EXT4-fs (dm-1): mounted filesystem 
840e25bd-f650-4819-8562-7eded85ef370 r/w with ordered data mode. Quota mode: 
none.
systemd[1]: fstests-generic-476.scope: Deactivated successfully.
systemd[1]: fstests-generic-476.scope: Consumed 3min 1.966s CPU time.
systemd[1]: xt\x2dvdb.mount: Deactivated successfully.
kernel: [  537.085404] EXT4-fs (dm-0): unmounting filesystem 
d3d7a675-f7b6-4384-abec-2e60d885b6da.
systemd[1]: xt\x2dvdc.mount: Deactivated successfully.
kernel: [  540.565870]
kernel: [  540.567523] 
kernel: [  540.572007] WARNING: inconsistent lock state
kernel: [  540.576407] 6.7.0-rc3-xfstests-lockdep-00012-g5b9ce7ecd715 #318 Not 
tainted
kernel: [  540.583532] 
kernel: [  540.587928] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
kernel: [  540.594326] kworker/0:3/329 [HC0[0]:SC0[0]:HE1:SE1] takes:
kernel: [  540.599955] 90b280a548c0 (>adjustment_lock){?...}-{2:2}, at: 
update_balloon_size_func+0x33/0x190
kernel: [  540.609926] {IN-HARDIRQ-W} state was registered at:
kernel: [  540.614935]   __lock_acquire+0x3f2/0xb30
kernel: [  540.618992]   lock_acquire+0xbf/0x2b0
kernel: [  540.622786]   _raw_spin_lock_irqsave+0x43/0x90
kernel: [  540.627366]   virtballoon_changed+0x51/0xd0
kernel: [  540.631947]   virtio_config_changed+0x5a/0x70
kernel: [  540.636437]   vp_config_changed+0x11/0x20
kernel: [  540.640576]   __handle_irq_event_percpu+0x88/0x230
kernel: [  540.645500]   handle_irq_event+0x38/0x80
kernel: [  540.649558]   handle_edge_irq+0x8f/0x1f0
kernel: [  540.653791]   __common_interrupt+0x47/0xf0
kernel: [  540.658106]   common_interrupt+0x79/0xa0
kernel: [  540.661672] EXT4-fs (dm-1): unmounting filesystem 
840e25bd-f650-4819-8562-7eded85ef370.
kernel: [  540.663183]   asm_common_interrupt+0x26/0x40
kernel: [  540.663190]   acpi_safe_halt+0x1b/0x30
kernel: [  540.663196]   acpi_idle_enter+0x7b/0xd0
kernel: [  540.663199]   cpuidle_enter_state+0x90/0x4f0
kernel: [  540.688723]   cpuidle_enter+0x2d/0x40
kernel: [  540.692516]   cpuidle_idle_call+0xe4/0x120
kernel: [  540.697036]   do_idle+0x84/0xd0
kernel: [  540.700393]   cpu_startup_entry+0x2a/0x30
kernel: [  540.704588]   rest_init+0xe9/0x180
kernel: [  540.708118]   arch_call_rest_init+0xe/0x30
kernel: [  540.712426]   start_kernel+0x41c/0x4b0
kernel: [  540.716310]   x86_64_start_reservations+0x18/0x30
kernel: [  540.721164]   x86_64_start_kernel+0x8c/0x90
kernel: [  540.725737]   secondary_startup_64_no_verify+0x178/0x17b
kernel: [  540.731432] irq event stamp: 22681
kernel: [  540.734956] hardirqs last  enabled at (22681): [] 
_raw_spin_unlock_irq+0x28/0x50
kernel: [  540.744564] hardirqs last disabled at (22680): [] 
_raw_spin_lock_irq+0x5d/0x90
kernel: [  540.753475] softirqs last  enabled at (22076): [] 
srcu_invoke_callbacks+0x101/0x1c0
kernel: [  540.762904] softirqs last disabled at (22072): [] 
srcu_invoke_callbacks+0x101/0x1c0
kernel: [  540.773298]
kernel: [  540.773298] other info that might help us debug this:
kernel: [  540.780207]  Possible unsafe locking scenario:
kernel: [  540.780207]
kernel: [  540.786438]CPU0
kernel: [  540.789007]
kernel: [  540.791766]   lock(>adjustment_lock);
kernel: [  540.796014]   
kernel: [  540.798778] lock(>adjustment_lock);
kernel: [  540.803605]


Oh, that's embarrassing, I completely whiffed on interactions with
interrupts. The following patch fixes it, and I've locally repro'ed
the issue and verified the fix. What's the process for getting this
fix merged? Does it get merged as a seperatch patch, or squashed into
the original commit?


Depends on who queued it. Likely MST, so it can be squashed.

If it would be sitting in Andrews stable trees, we wouldn't b

Re: REGRESSION: lockdep warning triggered by 15b9ce7ecd: virtio_balloon: stay awake while adjusting balloon

2024-01-08 Thread David Stevens

3605]

Oh, that's embarrassing, I completely whiffed on interactions with
interrupts. The following patch fixes it, and I've locally repro'ed
the issue and verified the fix. What's the process for getting this
fix merged? Does it get merged as a seperatch patch, or squashed into
the original commit?

>From a99a1efa6a2b470a98ea2c87e58bebe90ce329a1 Mon Sep 17 00:00:00 2001
From: David Stevens 
Date: Tue, 9 Jan 2024 14:41:21 +0900
Subject: [PATCH] virtio_balloon: Fix interrupt context deadlock

Use _irq spinlock functions with the adjustment_lock, since
start_update_balloon_size needs to acquire it in an interrupt context.

Fixes: 5b9ce7ecd715 ("virtio_balloon: stay awake while adjusting balloon")
Reported-by: Theodore Ts'o 
Signed-off-by: David Stevens 
---
 drivers/virtio/virtio_balloon.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index aa6a1a649ad6..1f5b3dd31fcf 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -459,12 +459,12 @@ static void start_update_balloon_size(struct
virtio_balloon *vb)

 static void end_update_balloon_size(struct virtio_balloon *vb)
 {
-   spin_lock(>adjustment_lock);
+   spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
pm_relax(vb->vdev->dev.parent);
}
-   spin_unlock(>adjustment_lock);
+   spin_unlock_irq(>adjustment_lock);
 }

 static void virtballoon_changed(struct virtio_device *vdev)
@@ -506,9 +506,9 @@ static void update_balloon_size_func(struct
work_struct *work)
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);

-   spin_lock(>adjustment_lock);
+   spin_lock_irq(>adjustment_lock);
vb->adjustment_signal_pending = false;
-   spin_unlock(>adjustment_lock);
+   spin_unlock_irq(>adjustment_lock);

diff = towards_target(vb);

-- 
2.43.0.472.g3155946c3a-goog

RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

2024-01-02 Thread David Laight

From: Boqun Feng
> Sent: 02 January 2024 18:54
> 
> On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
> [...]
> > But it looks odd that osq_unlock()'s fast path uses _release but the very
> > similar code in osq_wait_next() uses _acquire.
> >
> 
> The _release in osq_unlock() is needed since unlocks are needed to be
> RELEASE so that lock+unlock can be a critical section (i.e. no memory
> accesses can escape). When osq_wait_next() is used in non unlock cases,
> the RELEASE is not required. As for the case where osq_wait_next() is
> used in osq_unlock(), there is a xchg() preceding it, which provides a
> full barrier, so things are fine.

I know there have been issues with ACQUIRE/RELEASE/FULL xchg in this code,
but are FULL xchg always needed on node->next?

> /me wonders whether we can relax the _acquire in osq_wait_next() into
> a _relaxed.

I wouldn't have worried about relaxed v release.

> > Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> > could just be:
> > next = osq_wait_next(lock, this_cpu_ptr(_node), 0);
> > if (next)
> > next->locked = 1;
> >
> 
> If so we need to provide some sort of RELEASE semantics for the
> osq_unlock() in all the cases.

I wonder how often the unqueue code happens, and especially for
the last cpu in the list?
I'd only expect need_resched() to return true after spinning for
a while - in which case perhaps it is more likely that there are
a lot of cpu in the queue and the cpu being removed won't be last.
So osq_wait_next() exits on xchg(>next, NULL) != NULL
which is full barrier.

On a slightly different note I've also wondered if 'osq_node'
actually needs to be cache line aligned?
You definitely don't want it spanning 2 cache line, but I'd
expect that per-cpu data is mostly accessed by its own cpu?
So you really aren't going to get false sharing with some
other per-cpu data since the cpu is busy in this code.
So __aligned(16) would do?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-01-02 Thread David Laight

From: Ingo Molnar
> Sent: 02 January 2024 09:54
> 
> 
> * David Laight  wrote:
> 
> > per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> > This requires the cpu number be 64bit.
> > However the value is osq_lock() comes from a 32bit xchg() and there
> > isn't a way of telling gcc the high bits are zero (they are) so
> > there will always be an instruction to clear the high bits.
> >
> > The cpu number is also offset by one (to make the initialiser 0)
> > It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> > into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the 
> > address).
> >
> > Converting the cpu number to 32bit unsigned prior to the decrement means
> > that gcc knows the decrement has set the high bits to zero and doesn't
> > add a register-register move (or cltq) to zero/sign extend the value.
> >
> > Not massive but saves two instructions.
> >
> > Signed-off-by: David Laight 
> > ---
> >  kernel/locking/osq_lock.c | 6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 35bb99e96697..37a4fa872989 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> > return cpu_nr + 1;
> >  }
> >
> > -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> > +static inline struct optimistic_spin_node *decode_cpu(unsigned int 
> > encoded_cpu_val)
> >  {
> > -   int cpu_nr = encoded_cpu_val - 1;
> > -
> > -   return per_cpu_ptr(_node, cpu_nr);
> > +   return per_cpu_ptr(_node, encoded_cpu_val - 1);
> 
> So why do we 'encode' the CPU number to begin with?
> 
> Why not use -1 as the special value? Checks for negative values
> generates similarly fast machine code compared to checking for 0, if
> the value is also used (which it is in most cases here). What am I
> missing? We seem to be going through a lot of unnecessary hoops, and
> some of that is in the runtime path.

You'd really have to ask the person who did the original patch
that changed lock->tail from a pointer to an int (saving 8 bytes)
in every mutex and rwsem.

I suspect the reason is that it is so much safer to have the
initialiser being zero, rather than a non-zero value with zero
being a valid value.

It is also hard to avoid an extra instruction in the per_cpu_ptr()
code - something has to extend the 32bit result from xchg() to
a 64bit one for the array index.
The asm for an unsigned 32bit exchange could return a 64bit result
(which would have the desired effect), but that won't work for
a signed value.
The '-1' in the vcpu_is_preempted() path will be executed in parallel
with something else and is likely to have no measurable effect.

So it is a slightly risky change that has less benefit than the
other changes (which save cache line accesses).

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2024-01-01 Thread David Laight

From: Waiman Long
> Sent: 01 January 2024 04:14
...
> You really like micro-optimization.

They all add up :-)

    David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

[PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

2023-12-31 Thread David Laight

per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
This requires the cpu number be 64bit.
However the value is osq_lock() comes from a 32bit xchg() and there
isn't a way of telling gcc the high bits are zero (they are) so
there will always be an instruction to clear the high bits.

The cpu number is also offset by one (to make the initialiser 0)
It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).

Converting the cpu number to 32bit unsigned prior to the decrement means
that gcc knows the decrement has set the high bits to zero and doesn't
add a register-register move (or cltq) to zero/sign extend the value.

Not massive but saves two instructions.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 35bb99e96697..37a4fa872989 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
-static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+static inline struct optimistic_spin_node *decode_cpu(unsigned int 
encoded_cpu_val)
 {
-   int cpu_nr = encoded_cpu_val - 1;
-
-   return per_cpu_ptr(_node, cpu_nr);
+   return per_cpu_ptr(_node, encoded_cpu_val - 1);
 }
 
 /*
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

[PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.

2023-12-31 Thread David Laight

When osq_lock() returns false or osq_unlock() returns static
analysis shows that node->next should always be NULL.
This means that it isn't necessary to explicitly set it to NULL
prior to atomic_xchg(>tail, curr) on extry to osq_lock().

Just in case there a non-obvious race condition that can leave it
non-NULL check with WARN_ON_ONCE() and NULL if set.
Note that without this check the fast path (adding at the list head)
doesn't need to to access the per-cpu osq_node at all.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 27324b509f68..35bb99e96697 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-   struct optimistic_spin_node *node = this_cpu_ptr(_node);
-   struct optimistic_spin_node *prev, *next;
+   struct optimistic_spin_node *node, *prev, *next;
int curr = encode_cpu(smp_processor_id());
int prev_cpu;
 
-   node->next = NULL;
+   /*
+* node->next should be NULL on entry.
+* Check just in case there is a race somewhere.
+* Note that this is probably an unnecessary cache miss in the fast 
path.
+*/
+   if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
+   raw_cpu_write(osq_node.next, NULL);
 
/*
 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;
 
-   node->prev_cpu = prev_cpu;
+   node = this_cpu_ptr(_node);
prev = decode_cpu(prev_cpu);
+   node->prev_cpu = prev_cpu;
node->locked = 0;
 
/*
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

[PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.

2023-12-31 Thread David Laight

node->prev is only used to update 'prev' in the unlikely case
of concurrent unqueues.
This can be replaced by a check for node->prev_cpu changing
and then calling decode_cpu() to get the changed 'prev' pointer.

node->cpu (or more particularly) prev->cpu is only used for the
osq_wait_next() call in the unqueue path.
Normally this is exactly the value that the initial xchg() read
from lock->tail (used to obtain 'prev'), but can get updated
by concurrent unqueues.

Both the 'prev' and 'cpu' members of optimistic_spin_node are
now unused and can be deleted.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index eb8a6dfdb79d..27324b509f68 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,9 +13,8 @@
  */
 
 struct optimistic_spin_node {
-   struct optimistic_spin_node *next, *prev;
+   struct optimistic_spin_node *next;
int locked;/* 1 if lock acquired */
-   int cpu;   /* encoded CPU # + 1 value */
int prev_cpu;  /* encoded CPU # + 1 value */
 };
 
@@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
struct optimistic_spin_node *node = this_cpu_ptr(_node);
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
-   int old;
+   int prev_cpu;
 
node->next = NULL;
-   node->cpu = curr;
 
/*
 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * the node fields we just initialised) semantics when updating
 * the lock tail.
 */
-   old = atomic_xchg(>tail, curr);
-   if (old == OSQ_UNLOCKED_VAL)
+   prev_cpu = atomic_xchg(>tail, curr);
+   if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;
 
-   node->prev_cpu = old;
-   prev = decode_cpu(old);
-   node->prev = prev;
+   node->prev_cpu = prev_cpu;
+   prev = decode_cpu(prev_cpu);
node->locked = 0;
 
/*
@@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 
/*
 * Or we race against a concurrent unqueue()'s step-B, in which
-* case its step-C will write us a new @node->prev pointer.
+* case its step-C will write us a new @node->prev_cpu value.
 */
-   prev = READ_ONCE(node->prev);
+   {
+   int new_prev_cpu = READ_ONCE(node->prev_cpu);
+
+   if (new_prev_cpu == prev_cpu)
+   continue;
+   prev_cpu = new_prev_cpu;
+   prev = decode_cpu(prev_cpu);
+   }
}
 
/*
@@ -186,7 +190,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * back to @prev.
 */
 
-   next = osq_wait_next(lock, node, prev->cpu);
+   next = osq_wait_next(lock, node, prev_cpu);
if (!next)
return false;
 
@@ -198,8 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * it will wait in Step-A.
 */
 
-   WRITE_ONCE(next->prev_cpu, prev->cpu);
-   WRITE_ONCE(next->prev, prev);
+   WRITE_ONCE(next->prev_cpu, prev_cpu);
WRITE_ONCE(prev->next, next);
 
return false;
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

[PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.

2023-12-31 Thread David Laight

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.

Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is a pretty much guaranteed have a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index e0bc74d85a76..eb8a6dfdb79d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
 
 struct optimistic_spin_node {
struct optimistic_spin_node *next, *prev;
-   int locked; /* 1 if lock acquired */
-   int cpu; /* encoded CPU # + 1 value */
+   int locked;/* 1 if lock acquired */
+   int cpu;   /* encoded CPU # + 1 value */
+   int prev_cpu;  /* encoded CPU # + 1 value */
 };
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
-static inline int node_cpu(struct optimistic_spin_node *node)
-{
-   return node->cpu - 1;
-}
-
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
int cpu_nr = encoded_cpu_val - 1;
@@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
 
-   node->locked = 0;
+   node->prev_cpu = old;
prev = decode_cpu(old);
node->prev = prev;
+   node->locked = 0;
 
/*
 * osq_lock()   unqueue
@@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * polling, be careful.
 */
if (smp_cond_load_relaxed(>locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev
+ vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 
1)))
return true;
 
/* unqueue */
@@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * it will wait in Step-A.
 */
 
+   WRITE_ONCE(next->prev_cpu, prev->cpu);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);
 
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

[PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.

2023-12-31 Thread David Laight

Since node->locked cannot be set before the assignment to prev->next
it is save to clear it in the slow path.

Signed-off-by: David Laight 
---
 kernel/locking/osq_lock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..e0bc74d85a76 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
int curr = encode_cpu(smp_processor_id());
int old;
 
-   node->locked = 0;
node->next = NULL;
node->cpu = curr;
 
@@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
 
+   node->locked = 0;
prev = decode_cpu(old);
node->prev = prev;
 
-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

[PATCH next v2 0/5] locking/osq_lock: Optimisations to osq_lock code.

2023-12-31 Thread David Laight

This is an updated series of optimisations to osq_lock.c
Patches #1 and #3 from v1 have been applied by Linus.
Some of the generated code issues I was getting were caused by
CONFIG_DEBUG_PREEMPT being set. No idea why, it isn't any more.

Patch #1 is the node->locked part of the old #2.

Patch #2 removes the pretty much guaranteed cache line reload getting
the cpu number (from node->prev) for the vcpu_is_preempted() check.
It is (basically) the old #5 with the addition of a READ_ONCE()
and leaving the '+ 1' offset (for patch 3).

Patch #3 ends up removing both node->cpu and node->prev.
This saves issues initialising node->cpu.
Basically node->cpu was only ever read as node->prev->cpu in the unqueue code.
Most of the time it is the value read from lock->tail that was used to
obtain 'prev' in the first place.
The only time it is different is in the unlock race path where 'prev'
is re-read from node->prev - updated right at the bottom of osq_lock().
So the updated node->prev_cpu can used (and prev obtained from it) without
worrying about only one of node->prev and node->prev-cpu being updated.

Linus did suggest just saving the cpu numbers instead of pointers.
It actually works for 'prev' but not 'next'.

Patch #4 removes the 'should be unnecessary' node->next = NULL
assignment from the top of osq_lock().
Since longman was worried about race conditions, I've added a
WARN_ON_ONCE() check that ensures it is NULL.
This saves dirtying the 'node' cache line in the fast path, but the
check still requires the cache line be loaded.

Patch #5 just stops gcc using two separate instructions to decrement
the offset cpu number and then convert it to 64 bits.
Linus got annoyed with it, and I'd spotted it as well.
I don't seem to be able to get gcc to convert __per_cpu_offset[cpu - 1]
to (__per_cpu_offset - 1)[cpu] (cpu is offset by one) but, in any case,
it would still need zero extending in the common case.

David Laight (5):
  1) Defer clearing node->locked until the slow osq_lock() path.
  2) Optimise vcpu_is_preempted() check.
  3) Use node->prev_cpu instead of saving node->prev.
  4) Avoid writing to node->next in the osq_lock() fast path.
  5) Optimise decode_cpu() and per_cpu_ptr().

 kernel/locking/osq_lock.c | 59 +--
 1 file changed, 32 insertions(+), 27 deletions(-)

-- 
2.17.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-31 Thread David Laight

From: Linus Torvalds
> Sent: 30 December 2023 20:59
> 
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
>  wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
> 
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
> 
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required
the 'node' cache line be present.

> 
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
> 
>   -#define per_cpu_offset(x) (__per_cpu_offset[x])
>   +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
> 
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
__per_cpu_offset[(x) + 0u]

> And gcc still generates
> 
> subl$1, %eax#, cpu_nr
> addq__per_cpu_offset(,%rax,8), %rcx
> 
> instead of just doing
> 
> addq__per_cpu_offset-8(,%rax,8), %rcx
> 
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location is read were cpu_nr to be zero.
I've tried that - but it still failed.

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
> 
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 93551 matches

Mail list logo