Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option

2024-04-25 Thread David Hildenbrand

On 24.04.24 20:33, Collin Walling wrote:

On 4/24/24 03:24, David Hildenbrand wrote:

On 23.04.24 23:06, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. When query-cpu-model-expansion is invoked with the
"disable-deprecated-feats" option set, the resulting properties list
will include all deprecated features paired with false. Example:

{ ... "bpb": false, "csske": false, ...}

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatibility with future hardware.

Signed-off-by: Collin Walling 
---
   target/s390x/cpu_features.c  | 14 ++
   target/s390x/cpu_features.h  |  1 +
   target/s390x/cpu_models_sysemu.c | 20 
   3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c
index d28eb65845..efafc9711c 100644
--- a/target/s390x/cpu_features.c
+++ b/target/s390x/cpu_features.c
@@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap 
features, void *opaque,
   };
   }
   
+void s390_get_deprecated_features(S390FeatBitmap features)
+{
+    static const int feats[] = {
+        /* CSSKE is deprecated on newer generations */
+        S390_FEAT_CONDITIONAL_SSKE,
+        S390_FEAT_BPB,
+    };
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(feats); i++) {
+        set_bit(feats[i], features);
+    }
+}
+
   #define FEAT_GROUP_INIT(_name, _group, _desc)\
   {\
   .name = _name,   \
diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h
index a9bd68a2e1..661a8cd6db 100644
--- a/target/s390x/cpu_features.h
+++ b/target/s390x/cpu_features.h
@@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, 
S390FeatType type,
 uint8_t *data);
   void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque,
  void (*fn)(const char *name, void *opaque));
+void s390_get_deprecated_features(S390FeatBitmap features);
   
   /* Definition of a CPU feature group */

   typedef struct {
diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c
index ef9fa80efd..b002819188 100644
--- a/target/s390x/cpu_models_sysemu.c
+++ b/target/s390x/cpu_models_sysemu.c
@@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void 
*opaque)
   
   /* convert S390CPUDef into a static CpuModelInfo */

   static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel 
*model,
-                                bool delta_changes)
+                                bool delta_changes,
+                                bool disable_deprecated_feats)
   {
   QDict *qdict = qdict_new();
   S390FeatBitmap bitmap;
@@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const 
S390CPUModel *model,
   s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
   }
   
+    /* features flagged as deprecated */
+    if (disable_deprecated_feats) {
+        bitmap_zero(bitmap, S390_FEAT_MAX);
+        s390_get_deprecated_features(bitmap);
+        s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
+    }


Likely, we should remove these features from the actual bitmap, such
that they won't appear twice in the output? I'd expect the
cpu_info_from_model() caller to handle that.

Just adding them to the list as disabled is likely wrong.

For example, if someone were to expand a given model with "... bpb:
true" with disable-deprecated-feats=on, "bpb=true" should be removed,
and only replaced by "bpb=false" if it would be part of the CPU model
we would be expanding to.

Or am I missing something?



qdict_add_disabled_feat will handle updating the feature if it already
exists. I placed the code to process deprecated features as the last
step of cpu_info_from_model to override any features that have already
been added to the bitmap. Whether the deprecated feats should take
priority, or what the user requests, is up in the air, however...


Yes, that's one of my concerns. IIRC, if the user specifies the same
property multiple times, it's unclear which one will win.


"bpb=true,bpb=false" might mean that bpb=true wins.

I think this is something that this interface should sort out, so it
will actually be usable!
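
One way to sort that out (a minimal sketch against the helpers added by
this patch; this is not what the patch currently does) would be to clear
the deprecated bits from the expanded model's bitmap before conversion,
so each property is emitted exactly once:

/* sketch: emit every deprecated feature exactly once, as disabled */
S390FeatBitmap deprecated;

bitmap_zero(deprecated, S390_FEAT_MAX);
s390_get_deprecated_features(deprecated);
/* drop them from the model's property set ... */
bitmap_andnot(bitmap, bitmap, deprecated, S390_FEAT_MAX);
/* ... and re-add them explicitly as "feat": false */
s390_feat_bitmap_to_ascii(deprecated, qdict, qdict_add_disabled_feat);

With something like that, "bpb" could never show up as both true and
false in the same expansion result.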




... Daniel's suggestion to modify the QMP response to include a separate
list of "deprecated-props" seems like a much more efficient and readable
way that alleviates both your and Markus' concerns.


Would you only include properties that would apply to the current model 
and would be set to true in the current model?


How would libvirt make use of this interface, and could it run into the 
issue spelled out above?


--
Cheers,

David / dhildenb




Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option

2024-04-24 Thread David Hildenbrand

On 23.04.24 23:06, Collin Walling wrote:

Retain a list of deprecated features disjoint from any particular
CPU model. When query-cpu-model-expansion is invoked with the
"disable-deprecated-feats" option set, the resulting properties list
will include all deprecated features paired with false. Example:

{ ... "bpb": false, "csske": false, ...}

It is recommended that s390 guests operate with these features
explicitly disabled to ensure compatibility with future hardware.

Signed-off-by: Collin Walling 
---
  target/s390x/cpu_features.c  | 14 ++
  target/s390x/cpu_features.h  |  1 +
  target/s390x/cpu_models_sysemu.c | 20 
  3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c
index d28eb65845..efafc9711c 100644
--- a/target/s390x/cpu_features.c
+++ b/target/s390x/cpu_features.c
@@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap 
features, void *opaque,
  };
  }
  
+void s390_get_deprecated_features(S390FeatBitmap features)
+{
+    static const int feats[] = {
+        /* CSSKE is deprecated on newer generations */
+        S390_FEAT_CONDITIONAL_SSKE,
+        S390_FEAT_BPB,
+    };
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(feats); i++) {
+        set_bit(feats[i], features);
+    }
+}
+
  #define FEAT_GROUP_INIT(_name, _group, _desc)\
  {\
  .name = _name,   \
diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h
index a9bd68a2e1..661a8cd6db 100644
--- a/target/s390x/cpu_features.h
+++ b/target/s390x/cpu_features.h
@@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, 
S390FeatType type,
uint8_t *data);
  void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque,
 void (*fn)(const char *name, void *opaque));
+void s390_get_deprecated_features(S390FeatBitmap features);
  
  /* Definition of a CPU feature group */

  typedef struct {
diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c
index ef9fa80efd..b002819188 100644
--- a/target/s390x/cpu_models_sysemu.c
+++ b/target/s390x/cpu_models_sysemu.c
@@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void 
*opaque)
  
  /* convert S390CPUDef into a static CpuModelInfo */

  static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model,
-                                bool delta_changes)
+                                bool delta_changes,
+                                bool disable_deprecated_feats)
  {
  QDict *qdict = qdict_new();
  S390FeatBitmap bitmap;
@@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const 
S390CPUModel *model,
  s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
  }
  
+    /* features flagged as deprecated */
+    if (disable_deprecated_feats) {
+        bitmap_zero(bitmap, S390_FEAT_MAX);
+        s390_get_deprecated_features(bitmap);
+        s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat);
+    }


Likely, we should remove these features from the actual bitmap, such
that they won't appear twice in the output? I'd expect the
cpu_info_from_model() caller to handle that.


Just adding them to the list as disabled is likely wrong.

For example, if someone were to expand a given model with "... bpb:
true" with disable-deprecated-feats=on, "bpb=true" should be removed,
and only replaced by "bpb=false" if it would be part of the CPU model
we would be expanding to.


Or am I missing something?

--
Cheers,

David / dhildenb




Re: [PATCH 0/3] Remove useless architecture prefix from the CPU list

2024-04-20 Thread David Hildenbrand

On 20.04.24 07:46, Thomas Huth wrote:

Printing an architecture prefix in front of each CPU name is not helpful
at all: It is confusing for the users since they don't know whether they
have to specify these letters for the "-cpu" parameter, too, and it also
takes some precious space in the dense output of the CPU entries. Let's
simply remove those now.


Yes, I also never really understood the purpose.

--
Cheers,

David / dhildenb




[PATCH v1] virtio-mem: improve error message when unplug of device fails due to plugged memory

2024-04-16 Thread David Hildenbrand
The error message is actually expressive, considering QEMU only. But
when called from Libvirt, talking about "size" can be confusing, because
in Libvirt "size" translates to the memory backend size in QEMU (maximum
size) and "current" translates to the QEMU "size" property.

Let's simply avoid talking about the "size" property and spell out that
some device memory is still plugged.

Cc: Liang Cong 
Cc: Mario Casquero 
Cc: "Michael S. Tsirkin" 
Signed-off-by: David Hildenbrand 
---
 hw/virtio/virtio-mem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index ffd119ebac..ef64bf1b4a 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -1832,8 +1832,8 @@ static void virtio_mem_unplug_request_check(VirtIOMEM 
*vmem, Error **errp)
 }
 
     if (vmem->size) {
-        error_setg(errp, "virtio-mem device cannot get unplugged while"
-                   " '" VIRTIO_MEM_SIZE_PROP "' != '0'");
+        error_setg(errp, "virtio-mem device cannot get unplugged while some"
+                   " of its memory is still plugged");
         return;
     }
     if (vmem->requested_size) {
-- 
2.44.0




Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-04-08 Thread David Hildenbrand

On 08.04.24 09:58, Stefano Garzarella wrote:

On Thu, Apr 04, 2024 at 04:09:34PM +0200, David Hildenbrand wrote:

On 04.04.24 14:23, Stefano Garzarella wrote:

shm_open() creates and opens a new POSIX shared memory object.
A POSIX shared memory object allows creating a memory backend with an
associated file descriptor that can be shared with external processes
(e.g. vhost-user).

The new `memory-backend-shm` can be used as an alternative when
`memory-backend-memfd` is not available (Linux only), since shm_open()
should be provided by any POSIX-compliant operating system.

This backend mimics memfd, allocating memory that is practically
anonymous. In theory shm_open() requires a name, but the name only
exists for a short time interval and shm_unlink() is called right after
shm_open(). After that, only the fd is shared with external processes
(e.g., vhost-user) as if it were associated with anonymous memory.

In the future we may also allow the user to specify the name to be
passed to shm_open(), but for now we keep the backend simple, mimicking
anonymous memory such as memfd.

Signed-off-by: Stefano Garzarella 
---
v3
- enriched commit message and documentation to highlight that we
   want to mimic memfd (David)
---
  docs/system/devices/vhost-user.rst |   5 +-
  qapi/qom.json  |  17 +
  backends/hostmem-shm.c | 118 +
  backends/meson.build   |   1 +
  qemu-options.hx|  11 +++
  5 files changed, 150 insertions(+), 2 deletions(-)
  create mode 100644 backends/hostmem-shm.c

diff --git a/docs/system/devices/vhost-user.rst 
b/docs/system/devices/vhost-user.rst
index 9b2da106ce..35259d8ec7 100644
--- a/docs/system/devices/vhost-user.rst
+++ b/docs/system/devices/vhost-user.rst
@@ -98,8 +98,9 @@ Shared memory object
  In order for the daemon to access the VirtIO queues to process the
  requests it needs access to the guest's address space. This is
-achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
-objects. A reference to a file-descriptor which can access this object
+achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or
+``memory-backend-shm`` objects.
+A reference to a file-descriptor which can access this object
  will be passed via the socket as part of the protocol negotiation.
  Currently the shared memory object needs to match the size of the main
diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..5252ec69e3 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -721,6 +721,19 @@
  '*hugetlbsize': 'size',
  '*seal': 'bool' } }
+##
+# @MemoryBackendShmProperties:
+#
+# Properties for memory-backend-shm objects.
+#
+# The @share boolean option is true by default with shm.
+#
+# Since: 9.1
+##
+{ 'struct': 'MemoryBackendShmProperties',
+  'base': 'MemoryBackendProperties',
+  'data': { } }
+


Acked-by: David Hildenbrand 

One comment: we should maybe just forbid setting share=off. It doesn't
make any sense and it can even result in an unexpected double memory
consumption. We missed doing that for memfd, unfortunately.


Good point!

IIUC the `share` property is defined by the parent `hostmem`, so I
should find a way to override the property here and disable the setter,
or add an option to `hostmem` to make the property non-writable.


Right, or simply fail later when you would find "share=off" in 
shm_backend_memory_alloc().
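
A minimal sketch of that check (the function name is from this patch;
the bool-returning alloc hook and exact error wording are assumptions):

/* hypothetical: reject share=off early in shm_backend_memory_alloc(),
 * where it would only cause unexpected double memory consumption */
if (!backend->share) {
    error_setg(errp, "memory-backend-shm doesn't support share=off");
    return false;
}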


If we ever support named shm_open(), it could make sense for VM
snapshotting. Right now it doesn't really make any sense.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-04-04 Thread David Hildenbrand

On 04.04.24 14:23, Stefano Garzarella wrote:

shm_open() creates and opens a new POSIX shared memory object.
A POSIX shared memory object allows creating a memory backend with an
associated file descriptor that can be shared with external processes
(e.g. vhost-user).

The new `memory-backend-shm` can be used as an alternative when
`memory-backend-memfd` is not available (Linux only), since shm_open()
should be provided by any POSIX-compliant operating system.

This backend mimics memfd, allocating memory that is practically
anonymous. In theory shm_open() requires a name, but the name only
exists for a short time interval and shm_unlink() is called right after
shm_open(). After that, only the fd is shared with external processes
(e.g., vhost-user) as if it were associated with anonymous memory.

In the future we may also allow the user to specify the name to be
passed to shm_open(), but for now we keep the backend simple, mimicking
anonymous memory such as memfd.

Signed-off-by: Stefano Garzarella 
---
v3
- enriched commit message and documentation to highlight that we
   want to mimic memfd (David)
---
  docs/system/devices/vhost-user.rst |   5 +-
  qapi/qom.json  |  17 +
  backends/hostmem-shm.c | 118 +
  backends/meson.build   |   1 +
  qemu-options.hx|  11 +++
  5 files changed, 150 insertions(+), 2 deletions(-)
  create mode 100644 backends/hostmem-shm.c

diff --git a/docs/system/devices/vhost-user.rst 
b/docs/system/devices/vhost-user.rst
index 9b2da106ce..35259d8ec7 100644
--- a/docs/system/devices/vhost-user.rst
+++ b/docs/system/devices/vhost-user.rst
@@ -98,8 +98,9 @@ Shared memory object
  
  In order for the daemon to access the VirtIO queues to process the

  requests it needs access to the guest's address space. This is
-achieved via the ``memory-backend-file`` or ``memory-backend-memfd``
-objects. A reference to a file-descriptor which can access this object
+achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or
+``memory-backend-shm`` objects.
+A reference to a file-descriptor which can access this object
  will be passed via the socket as part of the protocol negotiation.
  
  Currently the shared memory object needs to match the size of the main

diff --git a/qapi/qom.json b/qapi/qom.json
index 85e6b4f84a..5252ec69e3 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -721,6 +721,19 @@
  '*hugetlbsize': 'size',
  '*seal': 'bool' } }
  
+##

+# @MemoryBackendShmProperties:
+#
+# Properties for memory-backend-shm objects.
+#
+# The @share boolean option is true by default with shm.
+#
+# Since: 9.1
+##
+{ 'struct': 'MemoryBackendShmProperties',
+  'base': 'MemoryBackendProperties',
+  'data': { } }
+


Acked-by: David Hildenbrand 

One comment: we should maybe just forbid setting share=off. It doesn't
make any sense and it can even result in an unexpected double memory 
consumption. We missed doing that for memfd, unfortunately.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread David Hildenbrand


So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like
a memfd, got it.

Likely we should somehow fail if the "POSIX shared memory object"
already exists. Hmm.


`O_CREAT | O_EXCL` should provide just this behavior.
  From shm_open(3) manpage:

  O_EXCL  If O_CREAT was also specified, and a shared memory object
  with the given name already exists, return an error.  The
  check for the existence of the object, and its creation if
  it does not exist, are performed atomically.



Cool!





That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using
shm_unlink(), then the name in fact shouldn't matter and this would be
fine. Correct?


Right. It's a name that will only "appear" in the system for a very
short time since we call shm_unlink() right after shm_open(). So I just
need the unique name to prevent several QEMUs launched at the same time
from colliding with the name.


Okay, essentially emulating tmpfile(), but for weird shm_open() :)
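
A rough sketch of that open/unlink dance (simplified from the patch
under discussion; the name format and error handling here are only
illustrative):

/* sketch: create a uniquely named POSIX shm object, then unlink it
 * immediately so only the fd remains -- tmpfile() for shm_open() */
g_autofree char *name = g_strdup_printf("/qemu-%d-shm", (int)getpid());
int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0);

if (fd < 0) {
    error_setg_errno(errp, errno, "failed to create POSIX shm object");
    return false;
}
shm_unlink(name);   /* the fd stays valid; the name is gone */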





So I assume if we ever want to have non-anonymous fds here, we'd pass
in a new property "name", and change the logic how we open/unlink.
Makes sense.


Exactly! I can clarify this in the commit description as well.

Thanks for the helpful comments!
If there is anything I need to change for v3, please tell me ;-)


Besides some patch description extension, makes sense to me :)

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-27 Thread David Hildenbrand

On 27.03.24 11:23, Stefano Garzarella wrote:

On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote:

+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+    backend_name = host_memory_backend_get_name(backend);
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+                    backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was
specified, generate one ourselves?


I thought about it and initially did it that way, but then some problems
came up so I tried to keep it as simple as possible for the user and for
our use case (having an fd associated with memory and sharing it with
other processes).

The problems I had were:

- what mode_t to use if the object does not exist and needs to be
created?
- exclude O_EXCL if the user passes the name since they may have already
created it?


I'd handle both like we handle files. But I understand now that you 
really just want to "simulate" memfd.




- call shm_unlink() only at object deallocation?


Right, we don't really do that for ordinary files, they keep existing. 
We only have the "discard-data" property to free up memory.


For memfd, it just happens "automatically". They cannot be looked up by 
name, and once the last user closes the fd, it gets destroyed.
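
For comparison, a rough sketch of the memfd behaviour being mimicked
(Linux-only; the name is just a debugging label, not a filesystem path):

/* memfd is anonymous from the start: there is nothing to unlink, and
 * the memory is freed once the last fd referring to it is closed */
int fd = memfd_create("qemu-backend", MFD_CLOEXEC);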




So I thought that for now we only support the "anonymous" mode, and if
in the future we have a use case where the user wants to specify the
name, we can add it later.


Okay, so for now you really only want an anonymous fd that behaves like 
a memfd, got it.


Likely we should somehow fail if the "POSIX shared memory object" 
already exists. Hmm.




That said, if you think it's already useful from the beginning, I can
add the name as an optional parameter.



I'm also not quite sure if "host_memory_backend_get_name()" should be
used for the purpose here.


What problem do you see? As an alternative I thought of a static
counter.


If it's really just an "internal UUID", as we'll remove the file using 
shm_unlink(), then the name in fact shouldn't matter and this would be 
fine. Correct?



So I assume if we ever want to have non-anonymous fds here, we'd pass in 
a new property "name", and change the logic how we open/unlink. Makes sense.


--
Cheers,

David / dhildenb




Re: [PATCH-for-9.1 v2 13/21] hw/mem/pc-dimm: Remove legacy_align argument from pc_dimm_pre_plug()

2024-03-27 Thread David Hildenbrand

On 27.03.24 10:51, Philippe Mathieu-Daudé wrote:

'legacy_align' is always NULL, remove it.

Signed-off-by: Philippe Mathieu-Daudé 
Reviewed-by: Thomas Huth 
Message-Id: <20240305134221.30924-11-phi...@linaro.org>
---


I was really confused for a second until I saw that this series is 
dependent on another one.


Please make sure to CC people you CC on patches to also CC on the cover 
letter.


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH-for-9.1 v2 14/21] hw/mem/memory-device: Remove legacy_align from memory_device_pre_plug()

2024-03-27 Thread David Hildenbrand

  void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms,
-                            const uint64_t *legacy_align, Error **errp)
+                            Error **errp)
  {
  const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md);
  Error *local_err = NULL;
@@ -388,14 +388,10 @@ void memory_device_pre_plug(MemoryDeviceState *md, 
MachineState *ms,
  return;
  }
  
-    if (legacy_align) {
-        align = *legacy_align;
-    } else {
-        if (mdc->get_min_alignment) {
-            align = mdc->get_min_alignment(md);
-        }
-        align = MAX(align, memory_region_get_alignment(mr));
+    if (mdc->get_min_alignment) {
+        align = mdc->get_min_alignment(md);
     }
+    align = MAX(align, memory_region_get_alignment(mr));
     addr = mdc->get_addr(md);
     addr = memory_device_get_free_addr(ms, !addr ? NULL : &addr, align,
                                        memory_region_size(mr), &local_err);


Very nice!

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-27 Thread David Hildenbrand


I was wondering if we could see some partial sendmsg()/write
succeeding. Meaning, we transferred some bytes but not all, and we'd
actually need to loop ...


Yep, true, but I would fix it in another patch/series if you agree.


Absolutely.

--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()

2024-03-26 Thread David Hildenbrand

+    mode = 0;
+    oflag = O_RDWR | O_CREAT | O_EXCL;
+    backend_name = host_memory_backend_get_name(backend);
+
+    /*
+     * Some operating systems allow creating anonymous POSIX shared memory
+     * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not
+     * defined by POSIX, so let's create a unique name.
+     *
+     * From Linux's shm_open(3) man-page:
+     *   For  portable  use,  a shared  memory  object should be identified
+     *   by a name of the form /somename;"
+     */
+    g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(),
+                    backend_name);


Hm, shouldn't we just let the user specify a name, and if no name was 
specified, generate one ourselves?


I'm also not quite sure if "host_memory_backend_get_name()" should be 
used for the purpose here.


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing

2024-03-26 Thread David Hildenbrand

On 26.03.24 15:34, Eric Blake wrote:

On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote:

In vu_message_write() we use sendmsg() to send the message header,
then a write() to send the payload.

If sendmsg() fails we should avoid sending the payload, since we
were unable to send the header.

Discovered before fixing the issue with the previous patch, where
sendmsg() failed on macOS due to wrong parameters, but the frontend
still sent the payload which the backend incorrectly interpreted
as a wrong header.

Signed-off-by: Stefano Garzarella 
---
  subprojects/libvhost-user/libvhost-user.c | 5 +
  1 file changed, 5 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 22bea0c775..a11afd1960 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
         rc = sendmsg(conn_fd, &msg, 0);
     } while (rc < 0 && (errno == EINTR || errno == EAGAIN));
 
+    if (rc <= 0) {


Is rejecting a 0 return value correct?  Technically, a 0 return means
a successful write of 0 bytes - but then again, it is unwise to
attempt to send an fd or other auxiliary data without at least one
regular byte, so we should not be attempting a write of 0 bytes.  So I
guess this one is okay, although I might have used < instead of <=.


I was wondering if we could see some partial sendmsg()/write succeeding. 
Meaning, we transferred some bytes but not all, and we'd actually need 
to loop ...
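
A rough sketch of such a loop for the payload write (field names taken
from libvhost-user's VhostUserMsg; this is not part of the patch):

/* hypothetical: keep writing until the whole payload went out */
size_t written = 0;

while (written < vmsg->size) {
    ssize_t rc = write(conn_fd, (const char *)&vmsg->payload + written,
                       vmsg->size - written);
    if (rc < 0 && (errno == EINTR || errno == EAGAIN)) {
        continue;
    }
    if (rc <= 0) {
        return false;   /* hard error or unexpected EOF */
    }
    written += rc;
}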


--
Cheers,

David / dhildenb




Re: [PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty

2024-03-26 Thread David Hildenbrand

On 26.03.24 14:39, Stefano Garzarella wrote:

On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if
the `struct msghdr` has the field `msg_controllen` set to 0, but
`msg_control` is not NULL.

Signed-off-by: Stefano Garzarella 
---
  subprojects/libvhost-user/libvhost-user.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a879149fef..22bea0c775 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
         memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize);
     } else {
         msg.msg_controllen = 0;
+        msg.msg_control = NULL;
     }
  
  do {


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands

2024-03-25 Thread David Hildenbrand

  #
  # Usually, a CPU model is compared against the maximum possible CPU
  # model of a certain configuration (e.g. the "host" model for KVM).
@@ -154,7 +155,14 @@
  # Some architectures may not support comparing CPU models.  s390x
  # supports comparing CPU models.
  #
-# Returns: a CpuModelBaselineInfo
+# @modela: description of the first CPU model to compare, referred to as
+# "model A" in CpuModelCompareResult
+#
+# @modelb: description of the second CPU model to compare, referred to as
+# "model B" in CpuModelCompareResult
+#
+# Returns: a CpuModelCompareInfo, describing how both CPU models


Scratch the comma?


Agreed to both. Do you want to fixup when applying or do you prefer a v2?

Thanks!



Reviewed-by: Markus Armbruster 



--
Cheers,

David / dhildenb




[PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands

2024-03-25 Thread David Hildenbrand
Let's document the parameters of these commands, so we can remove them
from the "documentation-exceptions" list.

While at it, extend the "Returns:" documentation as well, fixing a wrong
use of CpuModelBaselineInfo vs. CpuModelCompareInfo for
query-cpu-model-comparison.

Cc: Markus Armbruster 
Cc: Eric Blake 
Cc: Eduardo Habkost 
Cc: Marcel Apfelbaum 
Cc: "Philippe Mathieu-Daudé" 
Cc: Yanan Wang 
Signed-off-by: David Hildenbrand 
---
 qapi/machine-target.json | 46 +++-
 qapi/pragma.json |  3 ---
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/qapi/machine-target.json b/qapi/machine-target.json
index 519adf3220..7883616cce 100644
--- a/qapi/machine-target.json
+++ b/qapi/machine-target.json
@@ -124,11 +124,12 @@
 ##
 # @query-cpu-model-comparison:
 #
-# Compares two CPU models, returning how they compare in a specific
-# configuration.  The results indicates how both models compare
-# regarding runnability.  This result can be used by tooling to make
-# decisions if a certain CPU model will run in a certain configuration
-# or if a compatible CPU model has to be created by baselining.
+# Compares two CPU models, @modela and @modelb, returning how they
+# compare in a specific configuration.  The results indicates how
+# both models compare regarding runnability.  This result can be
+# used by tooling to make decisions if a certain CPU model will
+# run in a certain configuration or if a compatible CPU model has
+# to be created by baselining.
 #
 # Usually, a CPU model is compared against the maximum possible CPU
 # model of a certain configuration (e.g. the "host" model for KVM).
@@ -154,7 +155,14 @@
 # Some architectures may not support comparing CPU models.  s390x
 # supports comparing CPU models.
 #
-# Returns: a CpuModelBaselineInfo
+# @modela: description of the first CPU model to compare, referred to as
+# "model A" in CpuModelCompareResult
+#
+# @modelb: description of the second CPU model to compare, referred to as
+# "model B" in CpuModelCompareResult
+#
+# Returns: a CpuModelCompareInfo, describing how both CPU models
+# compare
 #
 # Errors:
 # - if comparing CPU models is not supported
@@ -175,9 +183,9 @@
 ##
 # @query-cpu-model-baseline:
 #
-# Baseline two CPU models, creating a compatible third model.  The
-# created model will always be a static, migration-safe CPU model (see
-# "static" CPU model expansion for details).
+# Baseline two CPU models, @modela and @modelb, creating a compatible
+# third model.  The created model will always be a static,
+# migration-safe CPU model (see "static" CPU model expansion for details).
 #
 # This interface can be used by tooling to create a compatible CPU
 # model out two CPU models.  The created CPU model will be identical
@@ -204,7 +212,11 @@
 # Some architectures may not support baselining CPU models.  s390x
 # supports baselining CPU models.
 #
-# Returns: a CpuModelBaselineInfo
+# @modela: description of the first CPU model to baseline
+#
+# @modelb: description of the second CPU model to baseline
+#
+# Returns: a CpuModelBaselineInfo, describing the baselined CPU model
 #
 # Errors:
 # - if baselining CPU models is not supported
@@ -243,10 +255,10 @@
 ##
 # @query-cpu-model-expansion:
 #
-# Expands a given CPU model (or a combination of CPU model +
-# additional options) to different granularities, allowing tooling to
-# get an understanding what a specific CPU model looks like in QEMU
-# under a certain configuration.
+# Expands a given CPU model, @model, (or a combination of CPU model +
+# additional options) to different granularities, specified by
+# @type, allowing tooling to get an understanding what a specific
+# CPU model looks like in QEMU under a certain configuration.
 #
 # This interface can be used to query the "host" CPU model.
 #
@@ -269,7 +281,11 @@
 # Some architectures may not support all expansion types.  s390x
 # supports "full" and "static". Arm only supports "full".
 #
-# Returns: a CpuModelExpansionInfo
+# @model: description of the CPU model to expand
+#
+# @type: expansion type, specifying how to expand the CPU model
+#
+# Returns: a CpuModelExpansionInfo, describing the expanded CPU model
 #
 # Errors:
 # - if expanding CPU models is not supported
diff --git a/qapi/pragma.json b/qapi/pragma.json
index 6929ab776e..0d82bc1a03 100644
--- a/qapi/pragma.json
+++ b/qapi/pragma.json
@@ -90,9 +90,6 @@
 'XDbgBlockGraph',
 'YankInstanceType',
 'blockdev-reopen',
-'query-cpu-model-baseline',
-'query-cpu-model-comparison',
-'query-cpu-model-expansion',
 'query-rocker',
 'query-rocker-ports',
 'query-stats-schemas',
-- 
2.43.2




Re: Let's close member documentation gaps

2024-03-25 Thread David Hildenbrand

On 25.03.24 15:13, David Hildenbrand wrote:

On 25.03.24 10:36, Markus Armbruster wrote:

If you're cc'ed, I have a bit of doc work for you.  Search for your
name to find it.

The QAPI generator forces you to document your stuff.  Except for
commands, events, enum and object types listed in pragma
documentation-exceptions, the generator silently defaults missing
documentation to "Not documented".  Right now, we're using this loophole
some 500 times.



What would be the right step to make sure I am resolving these "hidden"
issues, and that the QAPI generator would be happy with my changes?


Ah, I assume simply removing the offender from "qapi/pragma.json" and 
then compiling.


--
Cheers,

David / dhildenb




Re: Let's close member documentation gaps

2024-03-25 Thread David Hildenbrand

On 25.03.24 10:36, Markus Armbruster wrote:

If you're cc'ed, I have a bit of doc work for you.  Search for your
name to find it.

The QAPI generator forces you to document your stuff.  Except for
commands, events, enum and object types listed in pragma
documentation-exceptions, the generator silently defaults missing
documentation to "Not documented".  Right now, we're using this loophole
some 500 times.



What would be the right step to make sure I am resolving these "hidden" 
issues, and that the QAPI generator would be happy with my changes?


--
Cheers,

David / dhildenb




Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread David Hildenbrand

On 20.03.24 18:38, Michael Roth wrote:

On Wed, Mar 20, 2024 at 10:37:14AM +0100, David Hildenbrand wrote:

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When a memory page is converted from private to shared, the original
private memory is backed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused: changelog
talks about v5 and this is "PATCH v3"

Please make sure to send at least the cover letter along (I might not need
the other 46 patches :D ).


Sorry for the confusion, you got auto-Cc'd by git, which is good, but
not sure there's a good way to make sure everyone gets a copy of the
cover letter. I could see how it would be useful to potential
reviewers though. I'll try to come up with a script for it and take that
approach in the future.


A script shared with me in the past to achieve that in most cases:

$ cat cc-cmd.sh
#!/bin/bash

if [[ $1 == *gitsendemail.msg* || $1 == *cover-letter* ]]; then
    grep ': .* <.*@.*>' -h *.patch | sed 's/^.*: //' | sort | uniq
fi


And attach to "git send-email ... *.patch": --cc-cmd=./cc-cmd.sh

--
Cheers,

David / dhildenb




Re: [PATCH 2/4] hw/s390x/virtio-ccw: Always deliver NMI to first CPU

2024-03-20 Thread David Hildenbrand

On 20.02.24 16:08, Philippe Mathieu-Daudé wrote:

We can trigger NMI from HMP or QMP.

QEMU maps the NMI to the s390x per-CPU 'RESTART' interrupt.
Linux guests usually setup this interrupt to trigger kdump
or crash. Such crashdump can be triggered in QEMU by HMP
"nmi" or QMP "inject-nmi" commands.

Using QMP, since we cannot select a particular CPU, the first
CPU is used (CPU#0). See the documentation from commit 795dc6e4
("watchdog: Add new Virtual Watchdog action INJECT-NMI"):

   @inject-nmi: a non-maskable interrupt is injected into the
first VCPU (all VCPUS on x86) (since 2.4)

While we can select a particular CPU on HMP, the guest behavior
is expected to be the same if using CPU #N or CPU #0. Since
always using CPU#0 simplifies API maintenance, update s390_nmi()
to deliver NMI to the first CPU.

Signed-off-by: Philippe Mathieu-Daudé 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread David Hildenbrand

On 20.03.24 13:43, Xiaoyao Li wrote:

On 3/20/2024 5:37 PM, David Hildenbrand wrote:

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When a memory page is converted from private to shared, the original
private memory is backed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it.


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
    next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused:
changelog talks about v5 and this is "PATCH v3"


As above: the guest_memfd patches in my TDX-QEMU v5[1] were
directly picked for this series, so the change history says v5. They are
needed by SEV-SNP as well.


Thanks, I was missing the context without a cover letter. These comments
likely should be dropped here.


--
Cheers,

David / dhildenb




Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()

2024-03-20 Thread David Hildenbrand

On 20.03.24 09:39, Michael Roth wrote:

From: Xiaoyao Li 

When a memory page is converted from private to shared, the original
private memory is backed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.

Originally-from: Isaku Yamahata 
Codeveloped-by: Xiaoyao Li 


"Co-developed-by"


Signed-off-by: Xiaoyao Li 
Reviewed-by: David Hildenbrand 


Your SOB should go here.


---
Changes in v5:
- Collect Reviewed-by from David;

Changes in in v4:
- Drop ram_block_convert_range() and open code its implementation in the
   next Patch.

Signed-off-by: Michael Roth 


I only received 3 patches from this series, and now I am confused: 
changelog talks about v5 and this is "PATCH v3"


Please make sure to send at least the cover letter along (I might not 
need the other 46 patches :D ).


--
Cheers,

David / dhildenb




Re: Reply: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-19 Thread David Hildenbrand

On 19.03.24 15:23, Peter Maydell wrote:

On Tue, 19 Mar 2024 at 09:24, David Hildenbrand  wrote:

I spotted new pause_all_vcpus() / resume_all_vcpus() calls in 
hw/intc/arm_gicv3_kvm.c and
thought they would be the problematic bit.

Yeah, that's going to be problematic. Further note that a lot of code does not 
expect
that the BQL is suddenly dropped.


Agreed; we already have one nasty set of bugs in the framebuffer
devices because a function drops the BQL briefly:
https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/#u
so let's avoid introducing any more of a similar kind.

Side note, the pause_all_vcpus()/resume_all_vcpus() calls in
hw/i386/vapic.c are probably a bit suspect for similar reasons.


Exactly my thoughts. But there, it was less clear "why" it is even 
required. It's only performed for KVM.


Do we also just want to stop KVM threads from executing instructions?
If so, blocking KVM ioctls might be a reasonable "replacement". Really not sure.


--
Cheers,

David / dhildenb




Re: Reply: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-19 Thread David Hildenbrand

On 19.03.24 10:24, David Hildenbrand wrote:

On 19.03.24 06:06, zhukeqian wrote:

Hi David,

Thanks for reviewing.

On 17.03.24 09:37, Keqian Zhu via wrote:

Both the main loop thread and vCPU threads are allowed to call
pause_all_vcpus(), and in general resume_all_vcpus() is called after
it. Two issues live in pause_all_vcpus():


In general, calling pause_all_vcpus() from VCPU threads is quite dangerous.

Do we have reproducers for the cases below?



I reproduce the issues by testing the ARM vCPU hotplug feature:
QEMU changes for vCPU hotplug could be cloned from below site,
   https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2
Guest Kernel changes (by James Morse, ARM) are available here:
   https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git 
virtual_cpu_hotplug/rfc/v2



Thanks for this info (it would be reasonable to include that in the cover
letter).

Okay, so likely this is not actually a "fix" for upstream as it is. Understood.


The procedure to reproduce the problems:
1. Startup a Linux VM (e.g., called OS-vcpuhotplug) with 32 possible vCPUs and 
16 current vCPUs.
2. Log in guestOS and run script[1] to continuously online/offline CPU.
3. At host side, run script[2] to continuously hotplug/unhotplug vCPU.
After several minutes, we can hit these problems.

Script[1] to online/offline CPU:
for ((time=1;time<1000;time++));
do
  for ((cpu=16;cpu<32;cpu++));
  do
  echo 1 > /sys/devices/system/cpu/cpu$cpu/online
  done

  for ((cpu=16;cpu<32;cpu++));
  do
  echo 0 > /sys/devices/system/cpu/cpu$cpu/online
  done
done

Script[2] to hotplug/unhotplug vCPU:
for ((time=1;time<100;time++));
do
  echo $time
  for ((cpu=16;cpu<=32;cpu++));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done

  for ((cpu=32;cpu>=16;cpu--));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done

  for ((cpu=16;cpu<=32;cpu+=2));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done

  for ((cpu=32;cpu>=16;cpu-=2));
  do
  echo "virsh setvcpus OS-vcpuhotplug --count  $cpu --live"
  virsh setvcpus OS-vcpuhotplug --count  $cpu --live
  sleep 2
  done
done

Script[1] will call PSCI CPU_ON, which is emulated by QEMU, resulting in
cpu_reset() being called on the vCPU thread.


I spotted new pause_all_vcpus() / resume_all_vcpus() calls in 
hw/intc/arm_gicv3_kvm.c and
thought they would be the problematic bit.

Yeah, that's going to be problematic. Further note that a lot of code does not 
expect
that the BQL is suddenly dropped.

We had issues with that in a different context where we ended up wanting to use
pause/resume from VCPU context:

https://lore.kernel.org/all/294a987d-b0ef-1b58-98ac-0d4d43075...@redhat.com/

This sounds like a bad idea. Read below.


For the ARM architecture, it needs to reset GICC registers, which is only
possible when all vcpus are paused. So script[1]
will call pause_all_vcpus() in a vCPU thread.
Script[2] also calls cpu_reset() for the newly hotplugged vCPU, which is done
in the main loop thread.
So this scenario causes the problems I state in the commit message.



1. There is a possibility that while thread T1 waits on qemu_pause_cond
with the bql unlocked, another thread has called
pause_all_vcpus() and resume_all_vcpus(); then thread T1 will be stuck,
because the condition all_vcpus_paused() is always false.


How can this happen?

Two threads calling pause_all_vcpus() is borderline broken, as you note.

IIRC, we should call pause_all_vcpus() only if some other mechanism prevents 
these races. For example, based on runstate changes.



We already have the bql to prevent concurrent calling of pause_all_vcpus() and
resume_all_vcpus(). But pause_all_vcpus() will
unlock the bql halfway through, which gives other threads a chance to call pause
and resume. In the past, the code did not consider
this problem; now I add a retry mechanism to fix it.


Note that BQL did not prevent concurrent calling of pause_all_vcpus(). There 
had to be something else. Likely that was runstate transitions.





Just imagine one thread calling pause_all_vcpus() while another one
calls resume_all_vcpus(). It cannot possibly work.


With the bql, we can make sure all vcpus are paused after pause_all_vcpus()
finishes, and all vcpus are resumed after resume_all_vcpus() finishes.

For example, the following situation may occur:
Thread T1: lock bql -> pause_all_vcpus -> unlock bql -> wait to lock bql ->
lock bql && all_vcpu_paused -> success and do other work -> unlock bql
Thread T2: wait T1 unlock bql to lock bql -> lock bql -> resume_all_vcpus ->
success and do other work -> unlock bql

Re: Reply: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition

2024-03-19 Thread David Hildenbrand

On 19.03.24 06:11, zhukeqian wrote:

Hi David,

On 17.03.24 09:37, Keqian Zhu via wrote:

For a vCPU being hotplugged, qemu_init_vcpu() is called. In this
function, we set the vcpu state to stopped, and then wait for the vcpu
thread to be created.

As the vcpu state is stopped, it will inform us it has been created
and then wait on halt_cond. After we have realized the vcpu object, we
will resume the vcpu thread.

However, while we wait for the vcpu thread to be created, the bql is
unlocked, and another thread is allowed to call resume_all_vcpus(),
which will resume the unrealized vcpu.

This fixes the issue by filtering out unrealized vcpus during
resume_all_vcpus().


Similar question: is there a reproducer?

How could we currently hotplug a VCPU, and while it is being created, see
pause_all_vcpus()/resume_all_vcpus() getting called?


I described the reason for this in patch 1.


If I am not getting this wrong, there seems to be some other mechanism missing 
that makes sure that this cannot happen. Dropping the BQL half-way through 
creating a VCPU might be the problem.


When we add the retry mechanism in pause_all_vcpus(), we can solve this problem
with the semantics unchanged for the user, which means:
with the bql, we can make sure all vcpus are paused after pause_all_vcpus()
finishes, and all vcpus are resumed after resume_all_vcpus() finishes.


Okay, got it. As I just replied on patch #1, please see if you can avoid messing
with pause_all_vcpus() by inhibiting KVM ioctls like the KVM memslot code does.
That would be preferable.


--
Cheers,

David / dhildenb




Re: Reply: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-19 Thread David Hildenbrand
Thread T1: lock bql -> pause_all_vcpus -> unlock bql -> wait to lock bql ->
lock bql && all_vcpu_paused -> success and do other work -> unlock bql
Thread T2: wait T1 unlock bql to lock bql -> lock bql -> resume_all_vcpus ->
success and do other work -> unlock bql



Now throw in another thread and it all gets really complicated :)

Finding ways to avoid pause_all_vcpus() on the ARM reset code would be 
preferable.

I guess you simply want to do something similar to what KVM does to avoid 
messing
with pause_all_vcpus(): inhibiting certain IOCTLs.


commit f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39
Author: David Hildenbrand 
Date:   Fri Nov 11 10:47:58 2022 -0500

kvm: Atomic memslot updates

If we update an existing memslot (e.g., resize, split), we temporarily
remove the memslot to re-add it immediately afterwards. These updates
are not atomic, especially not for KVM VCPU threads, such that we can
get spurious faults.

Let's inhibit most KVM ioctls while performing relevant updates, such
that we can perform the update just as if it would happen atomically
without additional kernel support.

We capture the add/del changes and apply them in the notifier commit
stage instead. There, we can check for overlaps and perform the ioctl
inhibiting only if really required (-> overlap).

To keep things simple we don't perform additional checks that wouldn't
actually result in an overlap -- such as !RAM memory regions in some
cases (see kvm_set_phys_mem()).

To minimize cache-line bouncing, use a separate indicator
(in_ioctl_lock) per CPU.  Also, make sure to hold the kvm_slots_lock
while performing both actions (removing+re-adding).

We have to wait until all IOCTLs were exited and block new ones from
getting executed.

This approach cannot result in a deadlock as long as the inhibitor does
not hold any locks that might hinder an IOCTL from getting finished and
exited - something fairly unusual. The inhibitor will always hold the BQL.

AFAIKs, one possible candidate would be userfaultfd. If a page cannot be
placed (e.g., during postcopy), because we're waiting for a lock, or if the
userfaultfd thread cannot process a fault, because it is waiting for a
lock, there could be a deadlock. However, the BQL is not applicable here,
because any other guest memory access while holding the BQL would already
result in a deadlock.

Nothing else in the kernel should block forever and wait for userspace
intervention.

Note: pause_all_vcpus()/resume_all_vcpus() or
start_exclusive()/end_exclusive() cannot be used, as they either drop
the BQL or require to be called without the BQL - something inhibitors
cannot handle. We need a low-level locking mechanism that is
deadlock-free even when not releasing the BQL.
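
A minimal sketch of using that mechanism instead of pause/resume
(assuming the accel-blocker API that the commit above introduced):

/* hypothetical: keep the BQL held and hold off KVM ioctls instead of
 * pausing all VCPUs around the critical update */
accel_ioctl_inhibit_begin();
/* ... perform the reset/update that must not race with KVM ioctls ... */
accel_ioctl_inhibit_end();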

--
Cheers,

David / dhildenb




Re: [PATCH v2 1/2] target/s390x: Use mutable temporary value for op_ts

2024-03-18 Thread David Hildenbrand

On 18.03.24 21:27, Ilya Leoshkevich wrote:

From: Ido Plat 

Otherwise TCG would assume the register that holds t1 is constant
and reuse it whenever it needs the value within it.

Cc: qemu-sta...@nongnu.org
Fixes: f1ea739bd598 ("target/s390x: Use tcg_constant_* in local contexts")
Reviewed-by: Ilya Leoshkevich 
Reviewed-by: Richard Henderson 
[iii: Adjust a newline and capitalization, add tags]
Signed-off-by: Ido Plat 
Signed-off-by: Ilya Leoshkevich 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition

2024-03-18 Thread David Hildenbrand

On 17.03.24 09:37, Keqian Zhu via wrote:

For a vCPU being hotplugged, qemu_init_vcpu() is called. In this
function, we set the vcpu state to stopped, and then wait for the vcpu
thread to be created.

As the vcpu state is stopped, it will inform us it has been created
and then wait on halt_cond. After we have realized the vcpu object, we
will resume the vcpu thread.

However, while we wait for the vcpu thread to be created, the bql is
unlocked, and another thread is allowed to call resume_all_vcpus(),
which will resume the unrealized vcpu.

This fixes the issue by filtering out unrealized vcpus during
resume_all_vcpus().


Similar question: is there a reproducer?

How could we currently hotplug a VCPU, and while it is being created,
see pause_all_vcpus()/resume_all_vcpus() getting called?


If I am not getting this wrong, there seems to be some other mechanism 
missing that makes sure that this cannot happen. Dropping the BQL 
half-way through creating a VCPU might be the problem.


--
Cheers,

David / dhildenb




Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment

2024-03-18 Thread David Hildenbrand

On 17.03.24 09:37, Keqian Zhu via wrote:

Both the main loop thread and vCPU threads are allowed to call
pause_all_vcpus(), and in general resume_all_vcpus() is called
after it. Two issues live in pause_all_vcpus():


In general, calling pause_all_vcpus() from VCPU threads is quite dangerous.

Do we have reproducers for the cases below?



1. There is a possibility that while thread T1 waits on
qemu_pause_cond with the bql unlocked, another thread has called
pause_all_vcpus() and resume_all_vcpus(); then thread T1 will be
stuck, because the condition all_vcpus_paused() is always false.


How can this happen?

Two threads calling pause_all_vcpus() is borderline broken, as you note.

IIRC, we should call pause_all_vcpus() only if some other mechanism 
prevents these races. For example, based on runstate changes.



Just imagine one thread calling pause_all_vcpus() while another one 
calls resume_all_vcpus(). It cannot possibly work.




2. After all_vcpus_paused() has been checked as true, we will
unlock the bql to relock replay_mutex. While the bql was unlocked,
the vcpu's state may have been changed by another thread, so we
must retry.

Signed-off-by: Keqian Zhu 
---
  system/cpus.c | 29 -
  1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..4e41abe23e 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -571,12 +571,14 @@ static bool all_vcpus_paused(void)
     return true;
 }
 
-void pause_all_vcpus(void)
+static void request_pause_all_vcpus(void)
 {
     CPUState *cpu;
 
-    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
     CPU_FOREACH(cpu) {
+        if (cpu->stopped) {
+            continue;
+        }
         if (qemu_cpu_is_self(cpu)) {
             qemu_cpu_stop(cpu, true);
         } else {
@@ -584,6 +586,14 @@ void pause_all_vcpus(void)
             qemu_cpu_kick(cpu);
         }
     }
+}
+
+void pause_all_vcpus(void)
+{
+    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
+
+retry:
+    request_pause_all_vcpus();
 
     /* We need to drop the replay_lock so any vCPU threads woken up
      * can finish their replay tasks
@@ -592,14 +602,23 @@ void pause_all_vcpus(void)
 
     while (!all_vcpus_paused()) {
         qemu_cond_wait(&qemu_pause_cond, &bql);
-        CPU_FOREACH(cpu) {
-            qemu_cpu_kick(cpu);
-        }
+        /* While we waited on qemu_pause_cond the bql was unlocked,
+         * the vcpu's state may have been changed by another thread, so
+         * we must request the pause state on all vcpus again.
+         */
+        request_pause_all_vcpus();
     }
 
     bql_unlock();
     replay_mutex_lock();
     bql_lock();
+
+    /* While the bql was unlocked, the vcpu's state may have been
+     * changed by another thread, so we must retry.
+     */
+    if (!all_vcpus_paused()) {
+        goto retry;
+    }
 }
 
 void cpu_resume(CPUState *cpu)


--
Cheers,

David / dhildenb




Re: [PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code

2024-03-12 Thread David Hildenbrand

On 11.03.24 21:03, Mario Casquero wrote:

This series has been successfully tested by QE. Start the
qemu-storage-daemon in the background with a rhel 9.5 image and
vhost-user-blk. After that, boot up a VM with virtio-mem and
vhost-user-blk-pci. Check with the HMP command 'info mtree' that
virtio-mem is making use of multiple memslots.

Tested-by: Mario Casquero 


Thanks Mario!

--
Cheers,

David / dhildenb




Re: [PATCH v2 0/4] physmem: Fix MemoryRegion for second access to cached MMIO Address Space

2024-03-08 Thread David Hildenbrand

On 08.03.24 08:36, Peter Xu wrote:

On Thu, Mar 07, 2024 at 03:37:06PM +, Jonathan Cameron wrote:

v2: (Thanks to Peter Xu for reviewing!)
- New patch 1 to rename addr1 to mr_addr in the interests of meaningful naming.
- Take advantage of a cached address space only allowing a single MR to
  simplify the new code.
- Various cleanups of indentation etc.
- Cover letter and some patch descriptions updated to reflect changes.
- Changes all called out in specific patches.


All look good to me, thanks.  Having the new functions' first argument as
MemTxAttrs is slightly weird to me, but that's not a big deal and we can
clean it up later if wanted.  I guess it's good to fix this in 9.0 first as
it's a real bug even if not trivial to hit.

I queued it in my migration tree (with my "memory API" hat..).

I won't send a pull until next Monday.  Please shoot if there are any objections!



Works for me, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v2 4/4] physmem: Fix wrong address in large address_space_read/write_cached_slow()

2024-03-08 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

If the access is bigger than the MemoryRegion supports,
flatview_read/write_continue() will attempt to update the MemoryRegion,
but the address passed to flatview_translate() is relative to the cache, not
to the FlatView.

On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this
led to the first part of the descriptor being read from the CXL memory and the
second part from PA 0x8, which happens to be a blank region
of a flash chip and reads as all ffs in this particular configuration.
Note this test requires the out of tree ARM support for CXL, but
the problem is more general.

Avoid this by adding new address_space_read_continue_cached()
and address_space_write_continue_cached() which share all the logic
with the flatview versions except for the MemoryRegion lookup which
is unnecessary as the MemoryRegionCache only covers one MemoryRegion.

Signed-off-by: Jonathan Cameron 
---
v2: Review from Peter Xu
- Drop additional lookups of the MemoryRegion via
address_space_translate_cached() as it will always return the same
answer.
- Drop various parameters that are then unused.
- rename addr1 to mr_addr.
- Drop a fuzz_dma_read_cb(). Could put this back, but it means
   carrying the address into the inner call, and the only in-tree
   fuzzer checks if it is normal RAM and if not does nothing anyway.
   We don't hit this path for normal RAM.
---
  system/physmem.c | 63 +++-
  1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 1264eab24b..701bea27dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3381,6 +3381,59 @@ static inline MemoryRegion 
*address_space_translate_cached(
  return section.mr;
  }
  
+/* Called within RCU critical section.  */

+static MemTxResult address_space_write_continue_cached(MemTxAttrs attrs,
+   const void *ptr,
+   hwaddr len,
+   hwaddr mr_addr,
+   hwaddr l,
+   MemoryRegion *mr)


The only thing that is really confusing is

hwaddr len,
...
hwaddr l,

ehm, what?

... but it fits the style of flatview_read_continue(), so at least the 
level of confusion this creates is consistent with the other code.



Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 3/4] physmem: Factor out body of flatview_read/write_continue() loop

2024-03-07 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

This code will be reused for the address_space_cached accessors
shortly.

Also reduce the scope of the result variable now that we aren't
directly calling this in the loop.

Signed-off-by: Jonathan Cameron 
---
v2: Thanks to Peter Xu
- Fix alignment of code.
- Drop unused addr parameter.
- Carry through new mr_addr parameter name.
- RB not picked up as not sure what Peter will think wrt to
   resulting parameter ordering.
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 2/4] physmem: Reduce local variable scope in flatview_read/write_continue()

2024-03-07 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

Precursor to factoring out the inner loops for reuse.

Reviewed-by: Peter Xu 
Signed-off-by: Jonathan Cameron 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 1/4] physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar

2024-03-07 Thread David Hildenbrand

On 07.03.24 16:37, Jonathan Cameron wrote:

The calls to flatview_read/write[_continue]() have parameters addr and
addr1 but the names give no indication of what they are addresses of.
Rename addr1 to mr_addr to reflect that it is the translated address
offset within the MemoryRegion returned by flatview_translate().
Similarly rename the parameter in address_space_read/write_cached_slow()

Suggested-by: Peter Xu 
Signed-off-by: Jonathan Cameron 



Much cleaner indeed!

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH] qapi: Fix format of the memory-backend-file's @rom property doc comment

2024-02-29 Thread David Hildenbrand

On 29.02.24 11:58, Stefano Garzarella wrote:

Reflow paragraph following commit a937b6aa73 ("qapi: Reformat doc
comments to conform to current conventions"): use 4 spaces indentation,
70 columns width, and two spaces to separate sentences.

Suggested-by: Markus Armbruster 
Signed-off-by: Stefano Garzarella 
---
  qapi/qom.json | 27 ++-
  1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 2a6e49365a..db1b0fdea2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -668,19 +668,20 @@
  # @readonly: if true, the backing file is opened read-only; if false,
  # it is opened read-write.  (default: false)
  #
-# @rom: whether to create Read Only Memory (ROM) that cannot be modified
-#   by the VM.  Any write attempts to such ROM will be denied.  Most
-#   use cases want writable RAM instead of ROM.  However, selected use
-#   cases, like R/O NVDIMMs, can benefit from ROM.  If set to 'on',
-#   create ROM; if set to 'off', create writable RAM;  if set to
-#   'auto', the value of the @readonly property is used.  This
-#   property is primarily helpful when we want to have proper RAM in
-#   configurations that would traditionally create ROM before this
-#   property was introduced: VM templating, where we want to open a
-#   file readonly (@readonly set to true) and mark the memory to be
-#   private for QEMU (@share set to false).  For this use case, we need
-#   writable RAM instead of ROM, and want to set this property to 'off'.
-#   (default: auto, since 8.2)
+# @rom: whether to create Read Only Memory (ROM) that cannot be
+# modified by the VM.  Any write attempts to such ROM will be
+# denied.  Most use cases want writable RAM instead of ROM.
+# However, selected use cases, like R/O NVDIMMs, can benefit from
+# ROM.  If set to 'on', create ROM; if set to 'off', create
+# writable RAM; if set to 'auto', the value of the @readonly
+# property is used.  This property is primarily helpful when we
+# want to have proper RAM in configurations that would
+# traditionally create ROM before this property was introduced: VM
+# templating, where we want to open a file readonly (@readonly set
+# to true) and mark the memory to be private for QEMU (@share set
+# to false).  For this use case, we need writable RAM instead of
+# ROM, and want to set this property to 'off'.  (default: auto,
+# since 8.2)
  #
  # Since: 2.1
  ##


Ideally, we'd have a format checker that complains like checkpatch 
usually would.


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH 9/9] hostmem-file: support POSIX shm_open()

2024-02-28 Thread David Hildenbrand

On 28.02.24 12:47, Stefano Garzarella wrote:

Add a new `shm` bool option for `-object memory-backend-file`.

When this option is set to true, the POSIX shm_open(3) is used instead
of open(2).

So a file will not be created in the filesystem, but a "POSIX shared
memory object" will be instantiated. In Linux this turns into a file
in /dev/shm, but in other OSes this may not happen (for example in
macOS or FreeBSD nothing is shown in any filesystem).

This new feature is useful when we need to share guest memory with
another process (e.g. a vhost-user backend), but we don't have
memfd_create() or any special filesystems (e.g. /dev/shm) available,
as on macOS.

Signed-off-by: Stefano Garzarella 
---
I am not sure this is the best way to support shm_open() in QEMU.

Other solutions I had in mind were:

- create a new memory-backend-shm

- extend memory-backend-memfd to use shm_open() on systems where memfd is
not available (problem: shm_open() wants a name to assign to the object, but
we can work around that by using a random name and doing the unlink right away)

Any preference/suggestion?


Both sound like reasonable options, and IMHO better than hostmem-file 
with things that are not necessarily "real" files.


Regarding memory-backend-memfd, we similarly have to pass a name to 
memfd_create(), although for a different purpose: "The name supplied in 
name is used as a filename and will be displayed as the target of the 
corresponding symbolic link in the directory /proc/self/fd/".


So we simply pass TYPE_MEMORY_BACKEND_MEMFD.

Likely, memory-backend-shm that directly maps to shm_open() and only 
provides properties reasonable for shm_open() is the cleanest approach. 
So that would currently be my preference :)
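For reference, the "random name + immediate unlink" workaround mentioned above could look roughly like this. An illustrative sketch only, not the proposed patch, and the helper name is made up:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create an "anonymous" POSIX shared memory object by picking a
 * quasi-unique name and unlinking it right away, so that only the fd
 * keeps the object alive (similar in spirit to memfd_create()).
 */
static int shm_open_anonymous(size_t size)
{
    char name[64];
    int fd;

    snprintf(name, sizeof(name), "/qemu-shm-%d-%ld",
             (int)getpid(), (long)random());
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) {
        return -1;   /* an O_EXCL collision would want a retry loop */
    }
    shm_unlink(name);   /* name is gone; the object lives as long as the fd */
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

The fd can then be mmap()'ed with MAP_SHARED and passed to other processes, just like a memfd.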


--
Cheers,

David / dhildenb




Re: [PATCH v4 33/66] i386/tdx: Make memory type private by default

2024-02-20 Thread David Hildenbrand

On 29.01.24 03:18, Xiaoyao Li wrote:

On 1/26/2024 10:58 PM, David Hildenbrand wrote:

On 25.01.24 04:22, Xiaoyao Li wrote:

By default (due to the recent UPM change), the restricted memory attribute is
shared.  Convert the memory region from shared to private at memory
slot creation time.

Add a KVM region-registering function to check the flag
and convert the region, and add a memory listener to the TDX guest code to
set the flag on the possible memory regions.

Without this patch
- Secure-EPT violation on private area
- KVM_MEMORY_FAULT EXIT (kvm -> qemu)
- qemu converts the 4K page from shared to private
- Resume VCPU execution
- Secure-EPT violation again
- KVM resolves EPT Violation
This also prevents huge pages, because page conversion is done at 4K
granularity.  Although it's possible to merge 4K private mappings into a
2M large page, it slows guest boot.

With this patch
- After memory slot creation, convert the region from shared to private
- Secure-EPT violation on private area.
- KVM resolves EPT Violation

Originated-from: Isaku Yamahata 
Signed-off-by: Xiaoyao Li 
---
   include/exec/memory.h |  1 +
   target/i386/kvm/tdx.c | 20 
   2 files changed, 21 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7229fcc0415f..f25959f6d30f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -850,6 +850,7 @@ struct IOMMUMemoryRegion {
   #define MEMORY_LISTENER_PRIORITY_MIN    0
   #define MEMORY_LISTENER_PRIORITY_ACCEL  10
   #define MEMORY_LISTENER_PRIORITY_DEV_BACKEND    10
+#define MEMORY_LISTENER_PRIORITY_ACCEL_HIGH 20
   /**
    * struct MemoryListener: callbacks structure for updates to the
physical memory map
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7b250d80bc1d..f892551821ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -19,6 +19,7 @@
   #include "standard-headers/asm-x86/kvm_para.h"
   #include "sysemu/kvm.h"
   #include "sysemu/sysemu.h"
+#include "exec/address-spaces.h"
   #include "hw/i386/x86.h"
   #include "kvm_i386.h"
@@ -621,6 +622,19 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
   return 0;
   }
+static void tdx_guest_region_add(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+    memory_region_set_default_private(section->mr);
+}


That looks fishy. Why does TDX get to decide what happens to other memory
regions it doesn't own?

We should define that behavior when creating these memory region, and
TDX could sanity check that they have been setup properly.

Let me ask differently: for which memory region that has
RAM_GUEST_MEMFD set would we *not* want to set private as the default right
from the start?


All memory regions that have RAM_GUEST_MEMFD set will benefit from being
private by default, for a TDX guest.

I will update the implementation to set the RAM_DEFAULT_PRIVATE flag when
guest_memfd is created successfully, like

diff --git a/system/physmem.c b/system/physmem.c
index fc59470191ef..60676689c807 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1850,6 +1850,8 @@ static void ram_block_add(RAMBlock *new_block,
Error **errp)
   qemu_mutex_unlock_ramlist();
   return;
   }
+
+new_block->flags |= RAM_DEFAULT_PRIVATE;
   }

then this patch can be dropped, and the call to
memory_region_set_default_private(mr) in Patch 45 can be dropped too.

I think this is what you suggested, right?


Yes, if that works, great!

--
Cheers,

David / dhildenb




Re: [PATCH] system/physmem: remove redundant arg reassignment

2024-02-15 Thread David Hildenbrand

On 15.02.24 10:15, Manos Pitsidianakis wrote:

The `ram_block` arguments are reassigned to local declarations `block`
without further use. Remove the re-assignment to reduce noise.

Signed-off-by: Manos Pitsidianakis 
---
  system/physmem.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 5e66d9ae36..d4c3bfac65 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2154,10 +2154,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length)
   *
   * Called within RCU critical section.
   */
-void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr)
+void *qemu_map_ram_ptr(RAMBlock *block, ram_addr_t addr)
  {
-RAMBlock *block = ram_block;
-
  if (block == NULL) {
  block = qemu_get_ram_block(addr);
  addr -= block->offset;
@@ -2182,10 +2180,9 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t 
addr)
   *
   * Called within RCU critical section.
   */
-static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr,
+static void *qemu_ram_ptr_length(RAMBlock *block, ram_addr_t addr,
   hwaddr *size, bool lock)
  {
-RAMBlock *block = ram_block;
  if (*size == 0) {
  return NULL;
  }

base-commit: 5767815218efd3cbfd409505ed824d5f356044ae


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




[PATCH v2 06/14] libvhost-user: No need to check for NULL when unmapping

2024-02-14 Thread David Hildenbrand
We never add a memory region if mmap() failed. Therefore, no need to check
for NULL.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index f43b5096d0..225283f764 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -247,11 +247,8 @@ vu_remove_all_mem_regs(VuDev *dev)
 
 for (i = 0; i < dev->nregions; i++) {
  VuDevRegion *r = &dev->regions[i];
-void *ma = (void *)(uintptr_t)r->mmap_addr;
 
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 }
 dev->nregions = 0;
 }
@@ -888,11 +885,8 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 for (i = 0; i < dev->nregions; i++) {
  if (reg_equal(&dev->regions[i], msg_region)) {
  VuDevRegion *r = &dev->regions[i];
-void *ma = (void *) (uintptr_t) r->mmap_addr;
 
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 
 /*
  * Shift all affected entries by 1 to close the hole at index i and
-- 
2.43.0




[PATCH v2 04/14] libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()

2024-02-14 Thread David Hildenbrand
Let's reduce some code duplication and prepare for further changes.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 119 +++---
 1 file changed, 39 insertions(+), 80 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index e4907dfc26..a7bd7de3cd 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -937,95 +937,23 @@ vu_get_shared_object(VuDev *dev, VhostUserMsg *vmsg)
 }
 
 static bool
-vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
+vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
-unsigned int i;
  VhostUserMemory m = vmsg->payload.memory, *memory = &m;
-dev->nregions = memory->nregions;
-
-DPRINT("Nregions: %u\n", memory->nregions);
-for (i = 0; i < dev->nregions; i++) {
-void *mmap_addr;
-VhostUserMemoryRegion *msg_region = &memory->regions[i];
-VuDevRegion *dev_region = &dev->regions[i];
-
-DPRINT("Region %d\n", i);
-DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
-   msg_region->guest_phys_addr);
-DPRINT("memory_size: 0x%016"PRIx64"\n",
-   msg_region->memory_size);
-DPRINT("userspace_addr   0x%016"PRIx64"\n",
-   msg_region->userspace_addr);
-DPRINT("mmap_offset  0x%016"PRIx64"\n",
-   msg_region->mmap_offset);
-
-dev_region->gpa = msg_region->guest_phys_addr;
-dev_region->size = msg_region->memory_size;
-dev_region->qva = msg_region->userspace_addr;
-dev_region->mmap_offset = msg_region->mmap_offset;
+int prot = PROT_READ | PROT_WRITE;
+unsigned int i;
 
-/* We don't use offset argument of mmap() since the
- * mapped address has to be page aligned, and we use huge
- * pages.
+if (dev->postcopy_listening) {
+/*
  * In postcopy we're using PROT_NONE here to catch anyone
  * accessing it before we userfault
  */
-mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_NONE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);
-
-if (mmap_addr == MAP_FAILED) {
-vu_panic(dev, "region mmap error: %s", strerror(errno));
-} else {
-dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
-DPRINT("mmap_addr:   0x%016"PRIx64"\n",
-   dev_region->mmap_addr);
-}
-
-/* Return the address to QEMU so that it can translate the ufd
- * fault addresses back.
- */
-msg_region->userspace_addr = (uintptr_t)(mmap_addr +
- dev_region->mmap_offset);
-close(vmsg->fds[i]);
-}
-
-/* Send the message back to qemu with the addresses filled in */
-vmsg->fd_num = 0;
-if (!vu_send_reply(dev, dev->sock, vmsg)) {
-vu_panic(dev, "failed to respond to set-mem-table for postcopy");
-return false;
-}
-
-/* Wait for QEMU to confirm that it's registered the handler for the
- * faults.
- */
-if (!dev->read_msg(dev, dev->sock, vmsg) ||
-vmsg->size != sizeof(vmsg->payload.u64) ||
-vmsg->payload.u64 != 0) {
-vu_panic(dev, "failed to receive valid ack for postcopy 
set-mem-table");
-return false;
+prot = PROT_NONE;
 }
 
-/* OK, now we can go and register the memory and generate faults */
-(void)generate_faults(dev);
-
-return false;
-}
-
-static bool
-vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
-{
-unsigned int i;
-VhostUserMemory m = vmsg->payload.memory, *memory = &m;
-
 vu_remove_all_mem_regs(dev);
 dev->nregions = memory->nregions;
 
-if (dev->postcopy_listening) {
-return vu_set_mem_table_exec_postcopy(dev, vmsg);
-}
-
 DPRINT("Nregions: %u\n", memory->nregions);
 for (i = 0; i < dev->nregions; i++) {
 void *mmap_addr;
@@ -1051,8 +979,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
  * mapped address has to be page aligned, and we use huge
  * pages.  */
 mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);
+ prot, MAP_SHARED | MAP_NORESERVE, vmsg->fds[i], 0);
 
 if (mmap_addr == MAP_FAILED) {
 vu_panic(dev, "region mmap error: %s", strerror

[PATCH v2 09/14] libvhost-user: Factor out search for memory region by GPA and simplify

2024-02-14 Thread David Hildenbrand
Memory regions cannot overlap, and if we ever hit that case something
would be really flawed.

For example, when vhost code in QEMU decides to increase the size of memory
regions to cover full huge pages, it makes sure to never create overlaps,
and if there would be overlaps, it would bail out.

QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary
list"), c1ece84e7c93 ("vhost: Huge page align and merge") and
e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that
handling and how overlaps are impossible.

Consequently, each GPA can belong to at most one memory region, and
everything else doesn't make sense. Let's factor out our search to prepare
for further changes.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 79 +--
 1 file changed, 45 insertions(+), 34 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 7f29e01c30..d72f25396d 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -195,30 +195,47 @@ vu_panic(VuDev *dev, const char *msg, ...)
  */
 }
 
+/* Search for a memory region that covers this guest physical address. */
+static VuDevRegion *
+vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
+{
+unsigned int i;
+
+/*
+ * Memory regions cannot overlap in guest physical address space. Each
+ * GPA belongs to exactly one memory region, so there can only be one
+ * match.
+ */
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *cur = &dev->regions[i];
+
+if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
+return cur;
+}
+}
+return NULL;
+}
+
 /* Translate guest physical address to our virtual address.  */
 void *
 vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr)
 {
-unsigned int i;
+VuDevRegion *r;
 
 if (*plen == 0) {
 return NULL;
 }
 
-/* Find matching memory region.  */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-
-if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) {
-if ((guest_addr + *plen) > (r->gpa + r->size)) {
-*plen = r->gpa + r->size - guest_addr;
-}
-return (void *)(uintptr_t)
-guest_addr - r->gpa + r->mmap_addr + r->mmap_offset;
-}
+r = vu_gpa_to_mem_region(dev, guest_addr);
+if (!r) {
+return NULL;
 }
 
-return NULL;
+if ((guest_addr + *plen) > (r->gpa + r->size)) {
+*plen = r->gpa + r->size - guest_addr;
+}
+return (void *)(uintptr_t)guest_addr - r->gpa + r->mmap_addr +
+   r->mmap_offset;
 }
 
 /* Translate qemu virtual address to our virtual address.  */
@@ -854,8 +871,8 @@ static inline bool reg_equal(VuDevRegion *vudev_reg,
 static bool
 vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
-unsigned int i;
-bool found = false;
+unsigned int idx;
+VuDevRegion *r;
 
 if (vmsg->fd_num > 1) {
 vmsg_close_fds(vmsg);
@@ -882,28 +899,22 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 DPRINT("mmap_offset  0x%016"PRIx64"\n",
msg_region->mmap_offset);
 
-for (i = 0; i < dev->nregions; i++) {
-if (reg_equal(>regions[i], msg_region)) {
-VuDevRegion *r = &dev->regions[i];
-
-munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
-
-/* Shift all affected entries by 1 to close the hole at index. */
-memmove(dev->regions + i, dev->regions + i + 1,
-sizeof(VuDevRegion) * (dev->nregions - i - 1));
-DPRINT("Successfully removed a region\n");
-dev->nregions--;
-i--;
-
-found = true;
-break;
-}
-}
-
-if (!found) {
+r = vu_gpa_to_mem_region(dev, msg_region->guest_phys_addr);
+if (!r || !reg_equal(r, msg_region)) {
+vmsg_close_fds(vmsg);
 vu_panic(dev, "Specified region not found\n");
+return false;
 }
 
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
+
+idx = r - dev->regions;
+assert(idx < dev->nregions);
+/* Shift all affected entries by 1 to close the hole. */
+memmove(r, r + 1, sizeof(VuDevRegion) * (dev->nregions - idx - 1));
+DPRINT("Successfully removed a region\n");
+dev->nregions--;
+
 vmsg_close_fds(vmsg);
 
 return false;
-- 
2.43.0




[PATCH v2 07/14] libvhost-user: Don't zero out memory for memory regions

2024-02-14 Thread David Hildenbrand
dev->nregions always covers only valid entries. Stop zeroing out other
array elements that are unused.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 225283f764..2e8b611385 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -888,13 +888,9 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 
 munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 
-/*
- * Shift all affected entries by 1 to close the hole at index i and
- * zero out the last entry.
- */
+/* Shift all affected entries by 1 to close the hole at index. */
 memmove(dev->regions + i, dev->regions + i + 1,
 sizeof(VuDevRegion) * (dev->nregions - i - 1));
-memset(dev->regions + dev->nregions - 1, 0, sizeof(VuDevRegion));
 DPRINT("Successfully removed a region\n");
 dev->nregions--;
 i--;
@@ -2119,7 +2115,6 @@ vu_init(VuDev *dev,
 DPRINT("%s: failed to malloc mem regions\n", __func__);
 return false;
 }
-memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * 
sizeof(dev->regions[0]));
 
 dev->vq = malloc(max_queues * sizeof(dev->vq[0]));
 if (!dev->vq) {
-- 
2.43.0




[PATCH v2 01/14] libvhost-user: Dynamically allocate memory for memory slots

2024-02-14 Thread David Hildenbrand
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically
allocating dev->regions. We don't have any ABI guarantees (not
dynamically linked), so we can simply change the layout of VuDev.

Let's zero out the memory, just as we used to do.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 11 +++
 subprojects/libvhost-user/libvhost-user.h |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a3b158c671..360c5366d6 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -2171,6 +2171,8 @@ vu_deinit(VuDev *dev)
 
 free(dev->vq);
 dev->vq = NULL;
+free(dev->regions);
+dev->regions = NULL;
 }
 
 bool
@@ -2205,9 +2207,18 @@ vu_init(VuDev *dev,
 dev->backend_fd = -1;
 dev->max_queues = max_queues;
 
+dev->regions = malloc(VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0]));
+if (!dev->regions) {
+DPRINT("%s: failed to malloc mem regions\n", __func__);
+return false;
+}
+memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * 
sizeof(dev->regions[0]));
+
 dev->vq = malloc(max_queues * sizeof(dev->vq[0]));
 if (!dev->vq) {
 DPRINT("%s: failed to malloc virtqueues\n", __func__);
+free(dev->regions);
+dev->regions = NULL;
 return false;
 }
 
diff --git a/subprojects/libvhost-user/libvhost-user.h 
b/subprojects/libvhost-user/libvhost-user.h
index c2352904f0..c882b4e3a2 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -398,7 +398,7 @@ typedef struct VuDevInflightInfo {
 struct VuDev {
 int sock;
 uint32_t nregions;
-VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS];
+VuDevRegion *regions;
 VuVirtq *vq;
 VuDevInflightInfo inflight_info;
 int log_call_fd;
-- 
2.43.0




[PATCH v2 10/14] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()

2024-02-14 Thread David Hildenbrand
Let's speed up GPA to memory region / virtual address lookup. Store the
memory regions ordered by guest physical addresses, and use binary
search for address translation, as well as when adding/removing memory
regions.

Most importantly, this will speed up GPA->VA address translation when we
have many memslots.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 49 +--
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index d72f25396d..ef6353d847 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...)
 static VuDevRegion *
 vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
 {
-unsigned int i;
+int low = 0;
+int high = dev->nregions - 1;
 
 /*
  * Memory regions cannot overlap in guest physical address space. Each
  * GPA belongs to exactly one memory region, so there can only be one
  * match.
+ *
+ * We store our memory regions ordered by GPA and can simply perform a
+ * binary search.
  */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *cur = &dev->regions[i];
+while (low <= high) {
+unsigned int mid = low + (high - low) / 2;
+VuDevRegion *cur = &dev->regions[mid];
 
 if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
 return cur;
 }
+if (guest_addr >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (guest_addr < cur->gpa) {
+high = mid - 1;
+}
 }
 return NULL;
 }
@@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev)
 static void
 _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 {
+const uint64_t start_gpa = msg_region->guest_phys_addr;
+const uint64_t end_gpa = start_gpa + msg_region->memory_size;
 int prot = PROT_READ | PROT_WRITE;
 VuDevRegion *r;
 void *mmap_addr;
+int low = 0;
+int high = dev->nregions - 1;
+unsigned int idx;
 
 DPRINT("Adding region %d\n", dev->nregions);
 DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
@@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 prot = PROT_NONE;
 }
 
+/*
+ * We will add memory regions into the array sorted by GPA. Perform a
+ * binary search to locate the insertion point: it will be at the low
+ * index.
+ */
+while (low <= high) {
+unsigned int mid = low + (high - low)  / 2;
+VuDevRegion *cur = &dev->regions[mid];
+
+/* Overlap of GPA addresses. */
+if (start_gpa < cur->gpa + cur->size && cur->gpa < end_gpa) {
+vu_panic(dev, "regions with overlapping guest physical addresses");
+return;
+}
+if (start_gpa >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (start_gpa < cur->gpa) {
+high = mid - 1;
+}
+}
+idx = low;
+
 /*
  * We don't use offset argument of mmap() since the mapped address has
  * to be page aligned, and we use huge pages.
@@ -308,7 +347,9 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 DPRINT("mmap_addr:   0x%016"PRIx64"\n",
(uint64_t)(uintptr_t)mmap_addr);
 
-r = &dev->regions[dev->nregions];
+/* Shift all affected entries by 1 to open a hole at idx. */
+r = &dev->regions[idx];
+memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx));
 r->gpa = msg_region->guest_phys_addr;
 r->size = msg_region->memory_size;
 r->qva = msg_region->userspace_addr;
-- 
2.43.0




[PATCH v2 13/14] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions

2024-02-14 Thread David Hildenbrand
Currently, we try to remap all rings whenever we add a single new memory
region. That doesn't quite make sense, because we already map rings when
setting the ring address, and panic if that goes wrong. Likely, that
handling was simply copied from set_mem_table code, where we actually
have to remap all rings.

Remapping all rings might require us to walk quite a lot of memory
regions to perform the address translations. Ideally, we'd simply remove
that remapping.

However, let's be a bit careful. There might be some weird corner cases
where we temporarily remove a single memory region (e.g., to resize
it) that would have worked so far. Further, a ring might be located on
hotplugged memory, and as the VM reboots, we might unplug that memory and
hotplug new memory before resetting the ring addresses.

So let's unmap affected rings as we remove a memory region, and try
dynamically mapping the ring again when required.

Acked-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 107 --
 1 file changed, 78 insertions(+), 29 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index ed0a978d4f..61fb3050b3 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static bool
+map_ring(VuDev *dev, VuVirtq *vq)
+{
+vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
+vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
+vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
+
+DPRINT("Setting virtq addresses:\n");
+DPRINT("vring_desc  at %p\n", vq->vring.desc);
+DPRINT("vring_used  at %p\n", vq->vring.used);
+DPRINT("vring_avail at %p\n", vq->vring.avail);
+
+return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
 static bool
 vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
 {
-return likely(!dev->broken) && likely(vq->vring.avail);
+if (unlikely(dev->broken)) {
+return false;
+}
+
+if (likely(vq->vring.avail)) {
+return true;
+}
+
+/*
+ * In corner cases, we might temporarily remove a memory region that
+ * mapped a ring. When removing a memory region we make sure to
+ * unmap any rings that would be impacted. Let's try to remap if we
+ * already succeeded mapping this ring once.
+ */
+if (!vq->vra.desc_user_addr || !vq->vra.used_user_addr ||
+!vq->vra.avail_user_addr) {
+return false;
+}
+if (map_ring(dev, vq)) {
+vu_panic(dev, "remapping queue on access");
+return false;
+}
+return true;
+}
+
+static void
+unmap_rings(VuDev *dev, VuDevRegion *r)
+{
+int i;
+
+for (i = 0; i < dev->max_queues; i++) {
+VuVirtq *vq = &dev->vq[i];
+const uintptr_t desc = (uintptr_t)vq->vring.desc;
+const uintptr_t used = (uintptr_t)vq->vring.used;
+const uintptr_t avail = (uintptr_t)vq->vring.avail;
+
+if (desc < r->mmap_addr || desc >= r->mmap_addr + r->size) {
+continue;
+}
+if (used < r->mmap_addr || used >= r->mmap_addr + r->size) {
+continue;
+}
+if (avail < r->mmap_addr || avail >= r->mmap_addr + r->size) {
+continue;
+}
+
+DPRINT("Unmapping rings of queue %d\n", i);
+vq->vring.desc = NULL;
+vq->vring.used = NULL;
+vq->vring.avail = NULL;
+}
 }
 
 static size_t
@@ -786,21 +851,6 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg)
 return false;
 }
 
-static bool
-map_ring(VuDev *dev, VuVirtq *vq)
-{
-vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
-vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
-vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
-
-DPRINT("Setting virtq addresses:\n");
-DPRINT("vring_desc  at %p\n", vq->vring.desc);
-DPRINT("vring_used  at %p\n", vq->vring.used);
-DPRINT("vring_avail at %p\n", vq->vring.avail);
-
-return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
-}
-
 static bool
 generate_faults(VuDev *dev) {
 unsigned int i;
@@ -884,7 +934,6 @@ generate_faults(VuDev *dev) {
 
 static bool
 vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
-int i;
 VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
 
 if (vmsg->fd_num != 1) {
@@ -930,19 +979,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 vmsg->fd_num = 0;
 DPRINT("Successf

[PATCH v2 05/14] libvhost-user: Factor out adding a memory region

2024-02-14 Thread David Hildenbrand
Let's factor it out, reducing quite a bit of code duplication and preparing
for further changes.

If we fail to mmap a region and panic, we now simply don't add that
(broken) region.

Note that we now increment dev->nregions as we are successfully
adding memory regions, and don't increment dev->nregions if anything went
wrong.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 168 --
 1 file changed, 60 insertions(+), 108 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index a7bd7de3cd..f43b5096d0 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -256,6 +256,61 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static void
+_vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
+{
+int prot = PROT_READ | PROT_WRITE;
+VuDevRegion *r;
+void *mmap_addr;
+
+DPRINT("Adding region %d\n", dev->nregions);
+DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
+   msg_region->guest_phys_addr);
+DPRINT("memory_size: 0x%016"PRIx64"\n",
+   msg_region->memory_size);
+DPRINT("userspace_addr:  0x%016"PRIx64"\n",
+   msg_region->userspace_addr);
+DPRINT("mmap_offset: 0x%016"PRIx64"\n",
+   msg_region->mmap_offset);
+
+if (dev->postcopy_listening) {
+/*
+ * In postcopy we're using PROT_NONE here to catch anyone
+ * accessing it before we userfault
+ */
+prot = PROT_NONE;
+}
+
+/*
+ * We don't use offset argument of mmap() since the mapped address has
+ * to be page aligned, and we use huge pages.
+ */
+mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset,
+ prot, MAP_SHARED | MAP_NORESERVE, fd, 0);
+if (mmap_addr == MAP_FAILED) {
+vu_panic(dev, "region mmap error: %s", strerror(errno));
+return;
+}
+DPRINT("mmap_addr:   0x%016"PRIx64"\n",
+   (uint64_t)(uintptr_t)mmap_addr);
+
+r = &dev->regions[dev->nregions];
+r->gpa = msg_region->guest_phys_addr;
+r->size = msg_region->memory_size;
+r->qva = msg_region->userspace_addr;
+r->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+r->mmap_offset = msg_region->mmap_offset;
+dev->nregions++;
+
+if (dev->postcopy_listening) {
+/*
+ * Return the address to QEMU so that it can translate the ufd
+ * fault addresses back.
+ */
+msg_region->userspace_addr = r->mmap_addr + r->mmap_offset;
+}
+}
+
 static void
 vmsg_close_fds(VhostUserMsg *vmsg)
 {
@@ -727,10 +782,7 @@ generate_faults(VuDev *dev) {
 static bool
 vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 int i;
-bool track_ramblocks = dev->postcopy_listening;
 VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
-VuDevRegion *dev_region = &dev->regions[dev->nregions];
-void *mmap_addr;
 
 if (vmsg->fd_num != 1) {
 vmsg_close_fds(vmsg);
@@ -760,69 +812,20 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
  * we know all the postcopy client bases have been received, and we
  * should start generating faults.
  */
-if (track_ramblocks &&
+if (dev->postcopy_listening &&
 vmsg->size == sizeof(vmsg->payload.u64) &&
 vmsg->payload.u64 == 0) {
 (void)generate_faults(dev);
 return false;
 }
 
-DPRINT("Adding region: %u\n", dev->nregions);
-DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
-   msg_region->guest_phys_addr);
-DPRINT("memory_size: 0x%016"PRIx64"\n",
-   msg_region->memory_size);
-DPRINT("userspace_addr   0x%016"PRIx64"\n",
-   msg_region->userspace_addr);
-DPRINT("mmap_offset  0x%016"PRIx64"\n",
-   msg_region->mmap_offset);
-
-dev_region->gpa = msg_region->guest_phys_addr;
-dev_region->size = msg_region->memory_size;
-dev_region->qva = msg_region->userspace_addr;
-dev_region->mmap_offset = msg_region->mmap_offset;
-
-/*
- * We don't use offset argument of mmap() since the
- * mapped address has to be page aligned, and we use huge
- * pages.
- */
-if (track_ramblocks) {
-/*
- * In postcopy we're using PROT_NONE here to catch anyone
- * accessing it before we userfault.
- */
-mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
-  

[PATCH v2 08/14] libvhost-user: Don't search for duplicates when removing memory regions

2024-02-14 Thread David Hildenbrand
We cannot have duplicate memory regions; something would be deeply
flawed elsewhere. Let's just stop the search once we have found an entry.

We'll add more sanity checks when adding memory regions later.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 2e8b611385..7f29e01c30 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -896,8 +896,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 i--;
 
 found = true;
-
-/* Continue the search for eventual duplicates. */
+break;
 }
 }
 
-- 
2.43.0




[PATCH v2 02/14] libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509

2024-02-14 Thread David Hildenbrand
Let's support up to 509 mem slots, just like vhost in the kernel usually
does and the rust vhost-user implementation recently [1] started doing.
This is required to properly support memory hotplug, either using
multiple DIMMs (ACPI supports up to 256) or using virtio-mem.

The 509 used to be the KVM limit: it supported 512, but 3 were
used for internal purposes. Currently, KVM supports more than 512, but
it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot
memory), except when other memory devices like PCI devices with BARs are
used. So, 509 seems to work well for vhost in the kernel.

Details can be found in the QEMU change that made virtio-mem consume
up to 256 mem slots across all virtio-mem devices. [2]

509 mem slots implies 509 VMAs/mappings in the worst case (even though,
in practice with virtio-mem we won't be seeing more than ~260 in most
setups).

With max_map_count under Linux defaulting to 64k, 509 mem slots
still correspond to less than 1% of the maximum number of mappings.
There are plenty left for the application to consume.

[1] https://github.com/rust-vmm/vhost/pull/224
[2] https://lore.kernel.org/all/20230926185738.277351-1-da...@redhat.com/

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.h 
b/subprojects/libvhost-user/libvhost-user.h
index c882b4e3a2..deb40e77b3 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -31,10 +31,12 @@
 #define VHOST_MEMORY_BASELINE_NREGIONS 8
 
 /*
- * Set a reasonable maximum number of ram slots, which will be supported by
- * any architecture.
+ * vhost in the kernel usually supports 509 mem slots. 509 used to be the
+ * KVM limit, it supported 512, but 3 were used for internal purposes. This
+ * limit is sufficient to support many DIMMs and virtio-mem in
+ * "dynamic-memslots" mode.
  */
-#define VHOST_USER_MAX_RAM_SLOTS 32
+#define VHOST_USER_MAX_RAM_SLOTS 509
 
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
 
-- 
2.43.0




[PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-14 Thread David Hildenbrand
This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #2).

With that in place, this series optimizes and extends memory region
handling in libvhost-user:
* Heavily deduplicate and clean up the memory region handling code
* Speeds up GPA->VA translation with many memslots using binary search
* Optimize mmap_offset handling to use it as fd_offset for mmap()
* Avoid ring remapping when adding a single memory region
* Avoid dumping all guest memory, possibly allocating memory in sparse
  memory mappings when the process crashes

I'm being very careful not to break some weird corner case that modern
QEMU might no longer trigger, but an older one could have triggered or some
other frontend might trigger.

The only thing where I am not careful is to forbid memory regions that
overlap in GPA space: it doesn't make any sense.

With this series, virtio-mem (with dynamic-memslots=on) +
qemu-storage-daemon works flawlessly and as expected in my tests.

v1 -> v2:
* Drop "libvhost-user: Fix msg_region->userspace_addr computation"
 -> Not actually required
* "libvhost-user: Factor out adding a memory region"
 -> Make debug output more consistent (add missing ":")
* "libvhost-user: Use most of mmap_offset as fd_offset"
 -> get_fd_pagesize -> get_fd_hugepagesize; remove getpagesize()
 -> "mmap_offset:" to "old mmap_offset:" in debug message
 -> "adj mmap_offset:" to "new mmap_offset:" in debug message
 -> Use "(unsigned int)fs.f_type"; the man page of fstatfs() calls out
that the type of f_type can vary depending on the architecture.
"unsigned int" is sufficient here.
 -> Updated patch description
* Added RBs+ACKs
* Did a Gitlab CI run, seems to be happy regarding libvhost-user

Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Stefan Hajnoczi 
Cc: Stefano Garzarella 
Cc: Germano Veit Michel 
Cc: Raphael Norwitz 

David Hildenbrand (14):
  libvhost-user: Dynamically allocate memory for memory slots
  libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
  libvhost-user: Factor out removing all mem regions
  libvhost-user: Merge vu_set_mem_table_exec_postcopy() into
vu_set_mem_table_exec()
  libvhost-user: Factor out adding a memory region
  libvhost-user: No need to check for NULL when unmapping
  libvhost-user: Don't zero out memory for memory regions
  libvhost-user: Don't search for duplicates when removing memory
regions
  libvhost-user: Factor out search for memory region by GPA and simplify
  libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
  libvhost-user: Use most of mmap_offset as fd_offset
  libvhost-user: Factor out vq usability check
  libvhost-user: Dynamically remap rings after (temporarily?) removing
memory regions
  libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

 subprojects/libvhost-user/libvhost-user.c | 595 --
 subprojects/libvhost-user/libvhost-user.h |  10 +-
 2 files changed, 334 insertions(+), 271 deletions(-)

-- 
2.43.0




[PATCH v2 12/14] libvhost-user: Factor out vq usability check

2024-02-14 Thread David Hildenbrand
Let's factor it out to prepare for further changes.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 24 +++
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 55aef5fcc6..ed0a978d4f 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,6 +283,12 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static bool
+vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
+{
+return likely(!dev->broken) && likely(vq->vring.avail);
+}
+
 static size_t
 get_fd_hugepagesize(int fd)
 {
@@ -2380,8 +2386,7 @@ vu_queue_get_avail_bytes(VuDev *dev, VuVirtq *vq, 
unsigned int *in_bytes,
 idx = vq->last_avail_idx;
 
 total_bufs = in_total = out_total = 0;
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 goto done;
 }
 
@@ -2496,8 +2501,7 @@ vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned 
int in_bytes,
 bool
 vu_queue_empty(VuDev *dev, VuVirtq *vq)
 {
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return true;
 }
 
@@ -2536,8 +2540,7 @@ vring_notify(VuDev *dev, VuVirtq *vq)
 
 static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync)
 {
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
@@ -2862,8 +2865,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 unsigned int head;
 VuVirtqElement *elem;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return NULL;
 }
 
@@ -3020,8 +3022,7 @@ vu_queue_fill(VuDev *dev, VuVirtq *vq,
 {
 struct vring_used_elem uelem;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
@@ -3050,8 +3051,7 @@ vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int 
count)
 {
 uint16_t old, new;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
-- 
2.43.0




[PATCH v2 14/14] libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

2024-02-14 Thread David Hildenbrand
We already use MAP_NORESERVE to deal with sparse memory regions. Let's
also set madvise(MADV_DONTDUMP), otherwise a crash of the process can
result in us allocating all memory in the mmap'ed region for dumping
purposes.

This change implies that the mmap'ed rings won't be included in a
coredump. If ever required for debugging purposes, we could mark only
the mapped rings MADV_DODUMP.

Ignore errors during madvise() for now.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 61fb3050b3..a879149fef 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -460,6 +460,12 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 DPRINT("mmap_addr:   0x%016"PRIx64"\n",
(uint64_t)(uintptr_t)mmap_addr);
 
+#if defined(__linux__)
+/* Don't include all guest memory in a coredump. */
+madvise(mmap_addr, msg_region->memory_size + mmap_offset,
+MADV_DONTDUMP);
+#endif
+
 /* Shift all affected entries by 1 to open a hole at idx. */
 r = &dev->regions[idx];
 memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx));
-- 
2.43.0




[PATCH v2 11/14] libvhost-user: Use most of mmap_offset as fd_offset

2024-02-14 Thread David Hildenbrand
In the past, QEMU would create memory regions that could partially cover
hugetlb pages, making mmap() fail if we used the mmap_offset as an
fd_offset. For that reason, we never used the mmap_offset as an offset into
the fd and instead always mapped the fd from the very start.

However, that can easily result in us mmap'ing a lot of unnecessary
parts of an fd, possibly repeatedly.

QEMU nowadays does not create memory regions that partially cover huge
pages -- it never really worked with postcopy. QEMU handles merging of
regions that partially cover huge pages (due to holes in boot memory) since
2018 in c1ece84e7c93 ("vhost: Huge page align and merge").

Let's be a bit careful and not unconditionally convert the
mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb
size and pass as much as we can as fd_offset, making sure that we call
mmap() with a properly aligned offset.

With QEMU and a virtio-mem device that is fully plugged (50GiB using 50
memslots), the qemu-storage-daemon process consumes 1281GiB of VA space
before this change and 58GiB after this change.

================ Vhost user message ================
Request: VHOST_USER_ADD_MEM_REG (37)
Flags:   0x9
Size:40
Fds: 59
Adding region 4
guest_phys_addr: 0x0002
memory_size: 0x4000
userspace_addr:  0x7fb73bffe000
old mmap_offset: 0x8000
fd_offset:   0x8000
new mmap_offset: 0x
mmap_addr:   0x7f02f1bdc000
Successfully added new region
================ Vhost user message ================
Request: VHOST_USER_ADD_MEM_REG (37)
Flags:   0x9
Size:40
Fds: 59
Adding region 5
guest_phys_addr: 0x00024000
memory_size: 0x4000
userspace_addr:  0x7fb77bffe000
old mmap_offset: 0xc000
fd_offset:   0xc000
new mmap_offset: 0x
mmap_addr:   0x7f028400
Successfully added new region
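
The offset split shown in the log is just an alignment computation. As a toy illustration (the 1 GiB huge page size is an assumption here; ALIGN_DOWN mirrors QEMU's macro for power-of-two alignments):

#include <stdint.h>

#define ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* Values from the log above, assuming 1 GiB huge pages: */
uint64_t req       = 0x80000000ULL;                  /* old mmap_offset */
uint64_t fd_offset = ALIGN_DOWN(req, 0x40000000ULL); /* -> 0x80000000 */
uint64_t mmap_off  = req - fd_offset;                /* new mmap_offset -> 0 */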

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 54 ---
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index ef6353d847..55aef5fcc6 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -43,6 +43,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #ifdef __NR_userfaultfd
 #include 
@@ -281,12 +283,32 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static size_t
+get_fd_hugepagesize(int fd)
+{
+#if defined(__linux__)
+struct statfs fs;
+int ret;
+
+do {
+ret = fstatfs(fd, &fs);
+} while (ret != 0 && errno == EINTR);
+
+if (!ret && (unsigned int)fs.f_type == HUGETLBFS_MAGIC) {
+return fs.f_bsize;
+}
+#endif
+return 0;
+}
+
 static void
 _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 {
 const uint64_t start_gpa = msg_region->guest_phys_addr;
 const uint64_t end_gpa = start_gpa + msg_region->memory_size;
 int prot = PROT_READ | PROT_WRITE;
+uint64_t mmap_offset, fd_offset;
+size_t hugepagesize;
 VuDevRegion *r;
 void *mmap_addr;
 int low = 0;
@@ -300,7 +322,7 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
msg_region->memory_size);
 DPRINT("userspace_addr:  0x%016"PRIx64"\n",
msg_region->userspace_addr);
-DPRINT("mmap_offset: 0x%016"PRIx64"\n",
+DPRINT("old mmap_offset: 0x%016"PRIx64"\n",
msg_region->mmap_offset);
 
 if (dev->postcopy_listening) {
@@ -335,11 +357,31 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 idx = low;
 
 /*
- * We don't use offset argument of mmap() since the mapped address has
- * to be page aligned, and we use huge pages.
+ * Convert most of msg_region->mmap_offset to fd_offset. In almost all
+ * cases, this will leave us with mmap_offset == 0, mmap()'ing only
+ * what we really need. Only if a memory region would partially cover
+ * hugetlb pages, we'd get mmap_offset != 0, which usually doesn't happen
+ * anymore (i.e., modern QEMU).
+ *
+ * Note that mmap() with hugetlb would fail if the offset into the file
+ * is not aligned to the huge page size.
  */
-mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset,
- prot, MAP_SHARED | MAP_NORESERVE, fd, 0);
+hugepagesize = get_fd_hugepagesize(fd);
+if (hugepagesize) {
+fd_offset = ALIGN_DOWN(msg_region->mmap_offset, hugepagesize);
+mmap_offset = msg_region->mmap_offset - fd_offset;
+} else {
+

[PATCH v2 03/14] libvhost-user: Factor out removing all mem regions

2024-02-14 Thread David Hildenbrand
Let's factor it out. Note that the check for MAP_FAILED was wrong as
we never set mmap_addr if mmap() failed. We'll remove the NULL check
separately.

Reviewed-by: Raphael Norwitz 
Acked-by: Stefano Garzarella 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 34 ---
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 360c5366d6..e4907dfc26 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -240,6 +240,22 @@ qva_to_va(VuDev *dev, uint64_t qemu_addr)
 return NULL;
 }
 
+static void
+vu_remove_all_mem_regs(VuDev *dev)
+{
+unsigned int i;
+
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *r = &dev->regions[i];
+void *ma = (void *)(uintptr_t)r->mmap_addr;
+
+if (ma) {
+munmap(ma, r->size + r->mmap_offset);
+}
+}
+dev->nregions = 0;
+}
+
 static void
 vmsg_close_fds(VhostUserMsg *vmsg)
 {
@@ -1003,14 +1019,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 unsigned int i;
 VhostUserMemory m = vmsg->payload.memory, *memory = &m;
 
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-void *ma = (void *) (uintptr_t) r->mmap_addr;
-
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
-}
+vu_remove_all_mem_regs(dev);
 dev->nregions = memory->nregions;
 
 if (dev->postcopy_listening) {
@@ -2112,14 +2121,7 @@ vu_deinit(VuDev *dev)
 {
 unsigned int i;
 
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-void *m = (void *) (uintptr_t) r->mmap_addr;
-if (m != MAP_FAILED) {
-munmap(m, r->size + r->mmap_offset);
-}
-}
-dev->nregions = 0;
+vu_remove_all_mem_regs(dev);
 
 for (i = 0; i < dev->max_queues; i++) {
 VuVirtq *vq = &dev->vq[i];
-- 
2.43.0




Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-14 Thread David Hildenbrand

On 13.02.24 19:55, Michael S. Tsirkin wrote:

On Tue, Feb 13, 2024 at 07:27:44PM +0100, David Hildenbrand wrote:

On 13.02.24 18:33, Michael S. Tsirkin wrote:

On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote:

This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #3).



Breaks build on some systems. E.g.
https://gitlab.com/mstredhat/qemu/-/jobs/6163591599




./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of
integer expressions of different signedness: ‘long int’ and ‘unsigned int’
[-Werror=sign-compare]
   369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) {
   |   ^~

So easy to fix in v2, thanks!



I think there is another problem around plugins though.


There is a wrong checkpatch error:

https://gitlab.com/mstredhat/qemu/-/jobs/6162397277

d96f29518232719b0c444ab93913e8515a6cb5c6:100: ERROR: use 
qemu_real_host_page_size() instead of getpagesize()

total: 1 errors, 1 warnings, 81 lines checked

qemu_real_host_page_size() is not available in libvhost-user. But I 
could just change that code to not require getpagesize() at all.



Apart from that, I don't spot anything libvhost-user related (some qtest 
timeouts, a "error_setv: Assertion `*errp == NULL' failed."). Did I miss 
something?


--
Cheers,

David / dhildenb




Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-13 Thread David Hildenbrand

On 13.02.24 18:33, Michael S. Tsirkin wrote:

On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote:

This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #3).



Breaks build on some systems. E.g.
https://gitlab.com/mstredhat/qemu/-/jobs/6163591599




./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of 
integer expressions of different signedness: ‘long int’ and ‘unsigned 
int’ [-Werror=sign-compare]

  369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) {
  |   ^~

So easy to fix in v2, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation

2024-02-13 Thread David Hildenbrand

On 13.02.24 18:32, Michael S. Tsirkin wrote:

On Fri, Feb 02, 2024 at 10:53:18PM +0100, David Hildenbrand wrote:

We barely had mmap_offset set in the past. With virtio-mem and
dynamic-memslots that will change.

In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are
performing pointer arithmetic, which is wrong.


Wrong how? I suspect you mean arithmetic on void * pointers is not portable?


You are absolutely right, no idea how I concluded that we might have a 
different pointer size here.


I'll either convert this patch into a cleanup or drop it for v2, thanks!

--
Cheers,

David / dhildenb




Re: [PATCH V3 09/13] migration: notifier error checking

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:54, Steve Sistare wrote:

Check the status returned by migration notifiers and report errors.
If notifiers fail, call the notifiers again so they can clean up.


IIUC, if any of the notifiers actually starts to fail, say, during 
MIG_EVENT_PRECOPY_SETUP, you will call MIG_EVENT_PRECOPY_FAILED on all 
notifiers.


That will include notifiers that have never seen a 
MIG_EVENT_PRECOPY_SETUP call.


Is that what we expect notifiers to be able to deal with? Can we 
document that?
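
To make the concern concrete: a notifier could track whether it ever saw
SETUP and treat a FAILED event without a prior SETUP as a no-op. A minimal,
self-contained sketch (the enum and struct are stand-ins loosely modeled on
this series, not the real QEMU definitions):

    #include <stdbool.h>

    /* Stand-in types; the real ones live in the migration headers. */
    typedef enum {
        MIG_EVENT_PRECOPY_SETUP,
        MIG_EVENT_PRECOPY_DONE,
        MIG_EVENT_PRECOPY_FAILED,
    } MigrationEventType;

    typedef struct {
        MigrationEventType type;
    } MigrationEvent;

    static bool setup_seen;

    /* Returns 0 on success; nonzero would report a notifier failure. */
    static int defensive_precopy_notifier(const MigrationEvent *e)
    {
        switch (e->type) {
        case MIG_EVENT_PRECOPY_SETUP:
            setup_seen = true;   /* acquire per-migration resources here */
            break;
        case MIG_EVENT_PRECOPY_FAILED:
            if (!setup_seen) {
                break;           /* SETUP never ran for us: nothing to undo */
            }
            setup_seen = false;  /* release whatever SETUP acquired */
            break;
        default:
            break;
        }
        return 0;
    }

If that is the expected contract, documenting it next to the event
definitions would make it explicit.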


--
Cheers,

David / dhildenb




Re: [PATCH V3 08/13] migration: refactor migrate_fd_connect failures

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:54, Steve Sistare wrote:

Move common code for the error path in migrate_fd_connect to a shared
fail label.  No functional change.

Signed-off-by: Steve Sistare 
Reviewed-by: Peter Xu 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 07/13] migration: per-mode notifiers

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:54, Steve Sistare wrote:

Keep a separate list of migration notifiers for each migration mode.

Suggested-by: Peter Xu 
Signed-off-by: Steve Sistare 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 06/13] migration: MigrationNotifyFunc

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Define MigrationNotifyFunc to improve type safety and simplify migration
notifiers.

Signed-off-by: Steve Sistare 
---

Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 04/13] migration: MigrationEvent for notifiers

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Passing MigrationState to notifiers is unsound because they could access
unstable migration state internals or even modify the state.  Instead, pass
the minimal info needed in a new MigrationEvent struct, which could be
extended in the future if needed.

Suggested-by: Peter Xu 
Signed-off-by: Steve Sistare 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 03/13] migration: convert to NotifierWithReturn

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Change all migration notifiers to type NotifierWithReturn, so notifiers
can return an error status in a future patch.  For now, pass NULL for the
notifier error parameter, and do not check the return value.

Signed-off-by: Steve Sistare 
Reviewed-by: Peter Xu 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH V3 02/13] migration: remove error from notifier data

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Remove the error object from opaque data passed to notifiers.
Use the new error parameter passed to the notifier instead.

Signed-off-by: Steve Sistare 
Reviewed-by: Peter Xu 
---


Would have squashed #1 and #2.

Reviewed-by: David Hildenbrand 
--
Cheers,

David / dhildenb




Re: [PATCH V3 01/13] notify: pass error to notifier with return

2024-02-12 Thread David Hildenbrand

On 08.02.24 19:53, Steve Sistare wrote:

Pass an error object as the third parameter to "notifier with return"
notifiers, so clients no longer need to bundle an error object in the
opaque data.  The new parameter is used in a later patch.

Signed-off-by: Steve Sistare 
Reviewed-by: Peter Xu 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-09 Thread David Hildenbrand

On 07.02.24 12:40, Stefano Garzarella wrote:

On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote:

This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #3).

One simple fix upfront.

With that in place, this series optimizes and extens memory region
handling in libvhost-user:
* Heavily deduplicate and clean up the memory region handling code
* Speeds up GPA->VA translation with many memslots using binary search
* Optimize mmap_offset handling to use it as fd_offset for mmap()
* Avoid ring remapping when adding a single memory region
* Avoid dumping all guest memory, possibly allocating memory in sparse
  memory mappings when the process crashes

I'm being very careful to not break some weird corner case that modern
QEMU might no longer trigger, but older one could have triggered or some
other frontend might trigger.

The only thing where I am not careful is to forbid memory regions that
overlap in GPA space: it doesn't make any sense.

With this series, virtio-mem (with dynamic-memslots=on) +
qemu-storage-daemon works flawlessly and as expected in my tests.


I don't know much about this code, but I didn't find anything wrong with
it. Thank you also for the great cleanup!

Acked-by: Stefano Garzarella 


Thanks Stefano for the review, highly appreciated!

--
Cheers,

David / dhildenb




[PULL v3 3/3] oslib-posix: initialize backend memory objects in parallel

2024-02-05 Thread David Hildenbrand
From: Mark Kanda 

QEMU initializes preallocated backend memory as the objects are parsed from
the command line. This is not optimal in some cases (e.g. memory spanning
multiple NUMA nodes) because the memory objects are initialized in series.

Allow the initialization to occur in parallel (asynchronously). In order to
ensure optimal thread placement, asynchronous initialization requires prealloc
context threads to be in use.
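
For reference, a command line that exercises this path might look as
follows (the object types and properties exist in current QEMU; treat the
concrete sizes and node numbers as illustrative):

    qemu-system-x86_64 -m 32G -nodefaults -nographic \
      -object thread-context,id=tc1,node-affinity=0 \
      -object memory-backend-ram,id=mem1,size=32G,prealloc=on,prealloc-threads=16,prealloc-context=tc1 \
      -numa node,nodeid=0,memdev=mem1

With multiple such backends on different NUMA nodes, their preallocation
can then proceed concurrently instead of one after the other.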

Signed-off-by: Mark Kanda 
Message-ID: <20240131165327.3154970-2-mark.ka...@oracle.com>
Tested-by: Mario Casquero 
Signed-off-by: David Hildenbrand 
---
 backends/hostmem.c |   7 ++-
 hw/virtio/virtio-mem.c |   4 +-
 include/hw/qdev-core.h |   5 ++
 include/qemu/osdep.h   |  18 +-
 system/vl.c|   9 +++
 util/oslib-posix.c | 131 +++--
 util/oslib-win32.c |   8 ++-
 7 files changed, 145 insertions(+), 37 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 987f6f591e..81a72ce40b 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -20,6 +20,7 @@
 #include "qom/object_interfaces.h"
 #include "qemu/mmap-alloc.h"
 #include "qemu/madvise.h"
+#include "hw/qdev-core.h"
 
 #ifdef CONFIG_NUMA
 #include <numaif.h>
@@ -237,7 +238,7 @@ static void host_memory_backend_set_prealloc(Object *obj, 
bool value,
 uint64_t sz = memory_region_size(&backend->mr);
 
 if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,
-   backend->prealloc_context, errp)) {
+   backend->prealloc_context, false, errp)) {
 return;
 }
 backend->prealloc = true;
@@ -323,6 +324,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
Error **errp)
 HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc);
 void *ptr;
 uint64_t sz;
+bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
 
 if (!bc->alloc) {
 return;
@@ -402,7 +404,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
Error **errp)
 if (backend->prealloc && 
!qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
 ptr, sz,
 backend->prealloc_threads,
-backend->prealloc_context, 
errp)) {
+backend->prealloc_context,
+async, errp)) {
 return;
 }
 }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 99ab989852..ffd119ebac 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -605,7 +605,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, 
uint64_t start_gpa,
 int fd = memory_region_get_fd(&vmem->memdev->mr);
 Error *local_err = NULL;
 
-if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {
+if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, &local_err)) {
 static bool warned;
 
 /*
@@ -1248,7 +1248,7 @@ static int virtio_mem_prealloc_range_cb(VirtIOMEM *vmem, 
void *arg,
 int fd = memory_region_get_fd(&vmem->memdev->mr);
 Error *local_err = NULL;
 
-if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {
+if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, &local_err)) {
 error_report_err(local_err);
 return -ENOMEM;
 }
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index d47536eadb..9228e96c87 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -1083,6 +1083,11 @@ typedef enum MachineInitPhase {
  */
 PHASE_ACCEL_CREATED,
 
+/*
+ * Late backend objects have been created and initialized.
+ */
+PHASE_LATE_BACKENDS_CREATED,
+
 /*
  * machine_class->init has been called, thus creating any embedded
  * devices and validating machine properties.  Devices created at
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index c9692cc314..7d359dabc4 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -680,6 +680,8 @@ typedef struct ThreadContext ThreadContext;
  * @area: start address of the are to preallocate
  * @sz: the size of the area to preallocate
  * @max_threads: maximum number of threads to use
+ * @tc: prealloc context threads pointer, NULL if not in use
+ * @async: request asynchronous preallocation, requires @tc
  * @errp: returns an error if this function fails
  *
  * Preallocate memory (populate/prefault page tables writable) for the virtual
@@ -687,10 +689,24 @@ typedef struct ThreadContext ThreadContext;
  * each page in the area was faulted in writable at least once, for example,
  * after allocating file blocks for mapped files.
  *
+ * When setting @async, allocation might be performed asynchronously.
+ * qemu_finish_async_prealloc_mem() must be called to finish any asynchronous
+ * preallocation.
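
An illustrative fragment of the call pattern the comment describes (QEMU-internal
types, so not standalone code; qemu_finish_async_prealloc_mem() is assumed to
follow the same bool-and-errp convention as qemu_prealloc_mem() in this patch):

    Error *local_err = NULL;

    /* While backends are still being created: queue async preallocation. */
    if (!qemu_prealloc_mem(fd, ptr, sz, max_threads, tc, true, &local_err)) {
        error_report_err(local_err);
    }

    /* ... create the remaining backends ... */

    /* Once all backends exist: wait for the async workers to finish. */
    if (!qemu_finish_async_prealloc_mem(&local_err)) {
        error_report_err(local_err);
    }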

[PULL v3 1/3] hv-balloon: use get_min_alignment() to express 32 GiB alignment

2024-02-05 Thread David Hildenbrand
Let's implement the get_min_alignment() callback for memory devices, and
copy for the device memory region the alignment of the host memory
region. This mimics what virtio-mem does, and allows for re-introducing
proper alignment checks for the memory region size (where we don't care
about additional device requirements) in memory device core.

Message-ID: <20240117135554.787344-2-da...@redhat.com>
Reviewed-by: Maciej S. Szmigiero 
Signed-off-by: David Hildenbrand 
---
 hw/hyperv/hv-balloon.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 0238365712..ade283335a 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1477,22 +1477,7 @@ static void hv_balloon_ensure_mr(HvBalloon *balloon)
 balloon->mr = g_new0(MemoryRegion, 1);
 memory_region_init(balloon->mr, OBJECT(balloon), TYPE_HV_BALLOON,
memory_region_size(hostmem_mr));
-
-/*
- * The VM can indicate an alignment up to 32 GiB. Memory device core can
- * usually only handle/guarantee 1 GiB alignment. The user will have to
- * specify a larger maxmem eventually.
- *
- * The memory device core will warn the user in case maxmem might have to 
be
- * increased and will fail plugging the device if there is not sufficient
- * space after alignment.
- *
- * TODO: we could do the alignment ourselves in a slightly bigger region.
- * But this feels better, although the warning might be annoying. Maybe
- * we can optimize that in the future (e.g., with such a device on the
- * cmdline place/size the device memory region differently.
- */
-balloon->mr->align = MAX(32 * GiB, 
memory_region_get_alignment(hostmem_mr));
+balloon->mr->align = memory_region_get_alignment(hostmem_mr);
 }
 
 static void hv_balloon_free_mr(HvBalloon *balloon)
@@ -1654,6 +1639,25 @@ static MemoryRegion 
*hv_balloon_md_get_memory_region(MemoryDeviceState *md,
 return balloon->mr;
 }
 
+static uint64_t hv_balloon_md_get_min_alignment(const MemoryDeviceState *md)
+{
+/*
+ * The VM can indicate an alignment up to 32 GiB. Memory device core can
+ * usually only handle/guarantee 1 GiB alignment. The user will have to
+ * specify a larger maxmem eventually.
+ *
+ * The memory device core will warn the user in case maxmem might have to 
be
+ * increased and will fail plugging the device if there is not sufficient
+ * space after alignment.
+ *
+ * TODO: we could do the alignment ourselves in a slightly bigger region.
+ * But this feels better, although the warning might be annoying. Maybe
+ * we can optimize that in the future (e.g., with such a device on the
+ * cmdline place/size the device memory region differently.
+ */
+return 32 * GiB;
+}
+
 static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md,
MemoryDeviceInfo *info)
 {
@@ -1766,5 +1770,6 @@ static void hv_balloon_class_init(ObjectClass *klass, 
void *data)
 mdc->get_memory_region = hv_balloon_md_get_memory_region;
 mdc->decide_memslots = hv_balloon_decide_memslots;
 mdc->get_memslots = hv_balloon_get_memslots;
+mdc->get_min_alignment = hv_balloon_md_get_min_alignment;
 mdc->fill_device_info = hv_balloon_md_fill_device_info;
 }
-- 
2.43.0




[PULL v3 2/3] memory-device: reintroduce memory region size check

2024-02-05 Thread David Hildenbrand
We used to check that the memory region size is a multiple of the overall
requested address alignment for the device memory address.

We removed that check because there are cases (e.g., hv-balloon) where
devices unconditionally request a very large address alignment (e.g.,
32 GiB), but the actual memory device size might not be a multiple of
that alignment.

However, this change:

(a) allows for some practically impossible DIMM sizes, like "1GB+1 byte".
(b) allows for DIMMs that partially cover hugetlb pages, previously
reported in [1].

Both scenarios don't make any sense: we might even waste memory.

So let's reintroduce that check, but only verify that the memory region
size is a multiple of the memory region alignment (i.e., page size, huge
page size), not any additional memory device requirements communicated
using md->get_min_alignment().

The following examples now fail again as expected:

(a) 1M with 2M THP
 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
 -object memory-backend-ram,id=mem1,size=1M \
 -device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x200000

(b) 1G+1byte

 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
   -object memory-backend-ram,id=mem1,size=1073741825B \
   -device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x200000

(c) Unaligned hugetlb size (2M)

 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
   -object 
memory-backend-file,id=mem1,mem-path=/dev/hugepages/tmp,size=511M \
   -device pc-dimm,id=dimm1,memdev=mem1
 backend memory size must be multiple of 0x200000

(d) Unaligned hugetlb size (1G)

 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
-object 
memory-backend-file,id=mem1,mem-path=/dev/hugepages1G/tmp,size=2047M \
-device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x40000000

Note that this fix depends on a hv-balloon change to communicate its
additional alignment requirements using get_min_alignment() instead of
through the memory region.

[1] 
https://lkml.kernel.org/r/f77d641d500324525ac036fe1827b3070de75fc1.1701088320.git.mpriv...@redhat.com

Message-ID: <20240117135554.787344-3-da...@redhat.com>
Reported-by: Zhenyu Zhang 
Reported-by: Michal Privoznik 
Fixes: eb1b7c4bd413 ("memory-device: Drop size alignment check")
Tested-by: Zhenyu Zhang 
Tested-by: Mario Casquero 
Reviewed-by: Maciej S. Szmigiero 
Signed-off-by: David Hildenbrand 
---
 hw/mem/memory-device.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c
index a1b1af26bc..e098585cda 100644
--- a/hw/mem/memory-device.c
+++ b/hw/mem/memory-device.c
@@ -374,6 +374,20 @@ void memory_device_pre_plug(MemoryDeviceState *md, 
MachineState *ms,
 goto out;
 }
 
+/*
+ * We always want the memory region size to be multiples of the memory
+ * region alignment: for example, DIMMs with 1G+1byte size don't make
+ * any sense. Note that we don't check that the size is multiples
+ * of any additional alignment requirements the memory device might
+ * have when it comes to the address in physical address space.
+ */
+if (!QEMU_IS_ALIGNED(memory_region_size(mr),
+ memory_region_get_alignment(mr))) {
+error_setg(errp, "backend memory size must be multiple of 0x%"
+   PRIx64, memory_region_get_alignment(mr));
+return;
+}
+
 if (legacy_align) {
 align = *legacy_align;
 } else {
-- 
2.43.0




[PULL v3 0/3] Host Memory Backends and Memory devices queue 2024-02-06

2024-02-05 Thread David Hildenbrand
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:

  Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into 
staging (2024-02-03 13:31:58 +0000)

are available in the Git repository at:

  https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06-v3

for you to fetch changes up to 04accf43df83aa10f06f7dbda3ecf0db97f0c5a6:

  oslib-posix: initialize backend memory objects in parallel (2024-02-06 
08:15:22 +0100)


Hi,

"Host Memory Backends" and "Memory devices" queue ("mem"):
- Reintroduce memory region size checks for memory devices; the removal
  led to some undesired side effects
- Preallocate memory of memory backends in selected configurations
  asynchronously (so we preallocate concurrently), to speed up QEMU
  startup time.

--------
David Hildenbrand (2):
  hv-balloon: use get_min_alignment() to express 32 GiB alignment
  memory-device: reintroduce memory region size check

Mark Kanda (1):
  oslib-posix: initialize backend memory objects in parallel

 backends/hostmem.c |   7 ++-
 hw/hyperv/hv-balloon.c |  37 +++-
 hw/mem/memory-device.c |  14 +
 hw/virtio/virtio-mem.c |   4 +-
 include/hw/qdev-core.h |   5 ++
 include/qemu/osdep.h   |  18 +-
 system/vl.c|   9 +++
 util/oslib-posix.c | 131 +++--
 util/oslib-win32.c |   8 ++-
 9 files changed, 180 insertions(+), 53 deletions(-)

-- 
2.43.0




Re: [PULL v2 0/3] Host Memory Backends and Memory devices queue 2024-02-06

2024-02-05 Thread David Hildenbrand

On 06.02.24 08:14, David Hildenbrand wrote:

The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:

   Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into 
staging (2024-02-03 13:31:58 +0000)

are available in the Git repository at:

   https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06-v2

for you to fetch changes up to 8eb38ab662e74d618d473a9a52d71c644926c3c0:

   oslib-posix: initialize backend memory objects in parallel (2024-02-06 
08:09:55 +0100)


Hi,

"Host Memory Backends" and "Memory devices" queue ("mem"):
- Reintroduce memory region size checks for memory devices; the removal
   led to some undesired side effects
- Preallocate memory of memory backends in selected configurations
   asynchronously (so we preallocate concurrently), to speed up QEMU
   startup time.


... and yet another instance of the same mail address is wrong. Gah.

Maybe v3 will do ...

--
Cheers,

David / dhildenb




[PULL v2 1/3] hv-balloon: use get_min_alignment() to express 32 GiB alignment

2024-02-05 Thread David Hildenbrand
Let's implement the get_min_alignment() callback for memory devices, and
copy for the device memory region the alignment of the host memory
region. This mimics what virtio-mem does, and allows for re-introducing
proper alignment checks for the memory region size (where we don't care
about additional device requirements) in memory device core.

Message-ID: <20240117135554.787344-2-da...@redhat.com>
Reviewed-by: Maciej S. Szmigiero 
Signed-off-by: David Hildenbrand 
---
 hw/hyperv/hv-balloon.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 0238365712..ade283335a 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1477,22 +1477,7 @@ static void hv_balloon_ensure_mr(HvBalloon *balloon)
 balloon->mr = g_new0(MemoryRegion, 1);
 memory_region_init(balloon->mr, OBJECT(balloon), TYPE_HV_BALLOON,
memory_region_size(hostmem_mr));
-
-/*
- * The VM can indicate an alignment up to 32 GiB. Memory device core can
- * usually only handle/guarantee 1 GiB alignment. The user will have to
- * specify a larger maxmem eventually.
- *
- * The memory device core will warn the user in case maxmem might have to 
be
- * increased and will fail plugging the device if there is not sufficient
- * space after alignment.
- *
- * TODO: we could do the alignment ourselves in a slightly bigger region.
- * But this feels better, although the warning might be annoying. Maybe
- * we can optimize that in the future (e.g., with such a device on the
- * cmdline place/size the device memory region differently.
- */
-balloon->mr->align = MAX(32 * GiB, 
memory_region_get_alignment(hostmem_mr));
+balloon->mr->align = memory_region_get_alignment(hostmem_mr);
 }
 
 static void hv_balloon_free_mr(HvBalloon *balloon)
@@ -1654,6 +1639,25 @@ static MemoryRegion 
*hv_balloon_md_get_memory_region(MemoryDeviceState *md,
 return balloon->mr;
 }
 
+static uint64_t hv_balloon_md_get_min_alignment(const MemoryDeviceState *md)
+{
+/*
+ * The VM can indicate an alignment up to 32 GiB. Memory device core can
+ * usually only handle/guarantee 1 GiB alignment. The user will have to
+ * specify a larger maxmem eventually.
+ *
+ * The memory device core will warn the user in case maxmem might have to 
be
+ * increased and will fail plugging the device if there is not sufficient
+ * space after alignment.
+ *
+ * TODO: we could do the alignment ourselves in a slightly bigger region.
+ * But this feels better, although the warning might be annoying. Maybe
+ * we can optimize that in the future (e.g., with such a device on the
+ * cmdline place/size the device memory region differently.
+ */
+return 32 * GiB;
+}
+
 static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md,
MemoryDeviceInfo *info)
 {
@@ -1766,5 +1770,6 @@ static void hv_balloon_class_init(ObjectClass *klass, 
void *data)
 mdc->get_memory_region = hv_balloon_md_get_memory_region;
 mdc->decide_memslots = hv_balloon_decide_memslots;
 mdc->get_memslots = hv_balloon_get_memslots;
+mdc->get_min_alignment = hv_balloon_md_get_min_alignment;
 mdc->fill_device_info = hv_balloon_md_fill_device_info;
 }
-- 
2.43.0




[PULL v2 0/3] Host Memory Backends and Memory devices queue 2024-02-06

2024-02-05 Thread David Hildenbrand
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:

  Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into 
staging (2024-02-03 13:31:58 +0000)

are available in the Git repository at:

  https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06-v2

for you to fetch changes up to 8eb38ab662e74d618d473a9a52d71c644926c3c0:

  oslib-posix: initialize backend memory objects in parallel (2024-02-06 
08:09:55 +0100)


Hi,

"Host Memory Backends" and "Memory devices" queue ("mem"):
- Reintroduce memory region size checks for memory devices; the removal
  led to some undesired side effects
- Preallocate memory of memory backends in selected configurations
  asynchronously (so we preallocate concurrently), to speed up QEMU
  startup time.

--------
David Hildenbrand (2):
  hv-balloon: use get_min_alignment() to express 32 GiB alignment
  memory-device: reintroduce memory region size check

Mark Kanda (1):
  oslib-posix: initialize backend memory objects in parallel

 backends/hostmem.c |   7 ++-
 hw/hyperv/hv-balloon.c |  37 +++-
 hw/mem/memory-device.c |  14 +
 hw/virtio/virtio-mem.c |   4 +-
 include/hw/qdev-core.h |   5 ++
 include/qemu/osdep.h   |  18 +-
 system/vl.c|   9 +++
 util/oslib-posix.c | 131 +++--
 util/oslib-win32.c |   8 ++-
 9 files changed, 180 insertions(+), 53 deletions(-)

-- 
2.43.0




Re: [PULL 0/3] Host Memory Backends and Memory devices queue 2024-02-6

2024-02-05 Thread David Hildenbrand

On 06.02.24 08:02, David Hildenbrand wrote:

The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:

   Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into 
staging (2024-02-03 13:31:58 +0000)

are available in the Git repository at:

   https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06

for you to fetch changes up to a32c31979a852a07d590effa1586f6ab4fa4d784:

   oslib-posix: initialize backend memory objects in parallel (2024-02-04 
17:51:13 +0100)


Hi,

"Host Memory Backends" and "Memory devices" queue ("mem"):
- Reintroduce memory region size checks for memory devices; the removal
   led to some undesired side effects
- Preallocate memory of memory backends in selected configurations
   asynchronously (so we preallocate concurrently), to speed up QEMU
   startup time.


A mail address in the second patch is wrong. Will send a v2.

--
Cheers,

David / dhildenb




[PULL 0/3] Host Memory Backends and Memory devices queue 2024-02-6

2024-02-05 Thread David Hildenbrand
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440:

  Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into 
staging (2024-02-03 13:31:58 +0000)

are available in the Git repository at:

  https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06

for you to fetch changes up to a32c31979a852a07d590effa1586f6ab4fa4d784:

  oslib-posix: initialize backend memory objects in parallel (2024-02-04 
17:51:13 +0100)


Hi,

"Host Memory Backends" and "Memory devices" queue ("mem"):
- Reintroduce memory region size checks for memory devices; the removal
  led to some undesired side effects
- Preallocate memory of memory backends in selected configurations
  asynchronously (so we preallocate concurrently), to speed up QEMU
  startup time.

--------
David Hildenbrand (2):
  hv-balloon: use get_min_alignment() to express 32 GiB alignment
  memory-device: reintroduce memory region size check

Mark Kanda (1):
  oslib-posix: initialize backend memory objects in parallel

 backends/hostmem.c |   7 ++-
 hw/hyperv/hv-balloon.c |  37 +++-
 hw/mem/memory-device.c |  14 +
 hw/virtio/virtio-mem.c |   4 +-
 include/hw/qdev-core.h |   5 ++
 include/qemu/osdep.h   |  18 +-
 system/vl.c|   9 +++
 util/oslib-posix.c | 131 +++--
 util/oslib-win32.c |   8 ++-
 9 files changed, 180 insertions(+), 53 deletions(-)

-- 
2.43.0




[PULL 1/3] hv-balloon: use get_min_alignment() to express 32 GiB alignment

2024-02-05 Thread David Hildenbrand
Let's implement the get_min_alignment() callback for memory devices, and
copy for the device memory region the alignment of the host memory
region. This mimics what virtio-mem does, and allows for re-introducing
proper alignment checks for the memory region size (where we don't care
about additional device requirements) in memory device core.

Message-ID: <20240117135554.787344-2-da...@redhat.com>
Reviewed-by: Maciej S. Szmigiero 
Signed-off-by: David Hildenbrand 
---
 hw/hyperv/hv-balloon.c | 37 +
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
index 0238365712..ade283335a 100644
--- a/hw/hyperv/hv-balloon.c
+++ b/hw/hyperv/hv-balloon.c
@@ -1477,22 +1477,7 @@ static void hv_balloon_ensure_mr(HvBalloon *balloon)
 balloon->mr = g_new0(MemoryRegion, 1);
 memory_region_init(balloon->mr, OBJECT(balloon), TYPE_HV_BALLOON,
memory_region_size(hostmem_mr));
-
-/*
- * The VM can indicate an alignment up to 32 GiB. Memory device core can
- * usually only handle/guarantee 1 GiB alignment. The user will have to
- * specify a larger maxmem eventually.
- *
- * The memory device core will warn the user in case maxmem might have to 
be
- * increased and will fail plugging the device if there is not sufficient
- * space after alignment.
- *
- * TODO: we could do the alignment ourselves in a slightly bigger region.
- * But this feels better, although the warning might be annoying. Maybe
- * we can optimize that in the future (e.g., with such a device on the
- * cmdline place/size the device memory region differently.
- */
-balloon->mr->align = MAX(32 * GiB, 
memory_region_get_alignment(hostmem_mr));
+balloon->mr->align = memory_region_get_alignment(hostmem_mr);
 }
 
 static void hv_balloon_free_mr(HvBalloon *balloon)
@@ -1654,6 +1639,25 @@ static MemoryRegion 
*hv_balloon_md_get_memory_region(MemoryDeviceState *md,
 return balloon->mr;
 }
 
+static uint64_t hv_balloon_md_get_min_alignment(const MemoryDeviceState *md)
+{
+/*
+ * The VM can indicate an alignment up to 32 GiB. Memory device core can
+ * usually only handle/guarantee 1 GiB alignment. The user will have to
+ * specify a larger maxmem eventually.
+ *
+ * The memory device core will warn the user in case maxmem might have to 
be
+ * increased and will fail plugging the device if there is not sufficient
+ * space after alignment.
+ *
+ * TODO: we could do the alignment ourselves in a slightly bigger region.
+ * But this feels better, although the warning might be annoying. Maybe
+ * we can optimize that in the future (e.g., with such a device on the
+ * cmdline place/size the device memory region differently.
+ */
+return 32 * GiB;
+}
+
 static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md,
MemoryDeviceInfo *info)
 {
@@ -1766,5 +1770,6 @@ static void hv_balloon_class_init(ObjectClass *klass, 
void *data)
 mdc->get_memory_region = hv_balloon_md_get_memory_region;
 mdc->decide_memslots = hv_balloon_decide_memslots;
 mdc->get_memslots = hv_balloon_get_memslots;
+mdc->get_min_alignment = hv_balloon_md_get_min_alignment;
 mdc->fill_device_info = hv_balloon_md_fill_device_info;
 }
-- 
2.43.0




Re: [GIT PULL 5/8] util: Add write-only "node-affinity" property for ThreadContext

2024-02-05 Thread David Hildenbrand

On 05.02.24 17:13, Claudio Fontana wrote:

Hello David,



Hi,


It would seem to me that a lot of the calling code, like qemu_prealloc_mem
for example, should be sysemu-only, not used for tools or user mode either,
right?

And the thread_context.c itself should also be sysemu-only, correct?


Yes, both should currently only get used for sysemu. Memory 
backends and virtio-mem are sysemu-only.


--
Cheers,

David / dhildenb




Re: [GIT PULL 5/8] util: Add write-only "node-affinity" property for ThreadContext

2024-02-05 Thread David Hildenbrand

On 05.02.24 11:14, Claudio Fontana wrote:

Hi,


Hi Claudio,



turning pages back in time,

noticed that in recent qemu-img binaries we include an ELF dependency on 
libnuma.so that seems unused.

I think it stems from this commit:

commit 10218ae6d006f76410804cc4dc690085b3d008b5
Author: David Hildenbrand 
Date:   Fri Oct 14 15:47:17 2022 +0200

 util: Add write-only "node-affinity" property for ThreadContext


possibly this hunk?

diff --git a/util/meson.build b/util/meson.build
index e97cd2d779..c0a7bc54d4 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -1,5 +1,5 @@
  util_ss.add(files('osdep.c', 'cutils.c', 'unicode.c', 'qemu-timer-common.c'))
-util_ss.add(files('thread-context.c'))
+util_ss.add(files('thread-context.c'), numa)
  if not config_host_data.get('CONFIG_ATOMIC64')
util_ss.add(files('atomic64.c'))
  endif


I wonder if there is some conditional we could use to avoid the apparently 
useless dependency on libnuma in the qemu-img binary?


the simplest change is probably moving the thread-context stuff out of 
util (as you say, it's currently only used by QEMU itself).


--
Cheers,

David / dhildenb




Re: [PATCH v1 11/15] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()

2024-02-04 Thread David Hildenbrand

On 04.02.24 23:07, Raphael Norwitz wrote:

On Sun, Feb 4, 2024 at 9:51 AM David Hildenbrand  wrote:


On 04.02.24 03:10, Raphael Norwitz wrote:

One comment on this one.

On Fri, Feb 2, 2024 at 4:56 PM David Hildenbrand  wrote:


Let's speed up GPA to memory region / virtual address lookup. Store the
memory regions ordered by guest physical addresses, and use binary
search for address translation, as well as when adding/removing memory
regions.

Most importantly, this will speed up GPA->VA address translation when we
have many memslots.

Signed-off-by: David Hildenbrand 
---
   subprojects/libvhost-user/libvhost-user.c | 49 +--
   1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index d036b54ed0..75e47b7bb3 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...)
   static VuDevRegion *
   vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
   {
-unsigned int i;
+int low = 0;
+int high = dev->nregions - 1;

   /*
* Memory regions cannot overlap in guest physical address space. Each
* GPA belongs to exactly one memory region, so there can only be one
* match.
+ *
+ * We store our memory regions ordered by GPA and can simply perform a
+ * binary search.
*/
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *cur = &dev->regions[i];
+while (low <= high) {
+unsigned int mid = low + (high - low) / 2;
+VuDevRegion *cur = &dev->regions[mid];

   if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
   return cur;
   }
+if (guest_addr >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (guest_addr < cur->gpa) {
+high = mid - 1;
+}
   }
   return NULL;
   }
@@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev)
   static void
   _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
   {
+const uint64_t start_gpa = msg_region->guest_phys_addr;
+const uint64_t end_gpa = start_gpa + msg_region->memory_size;
   int prot = PROT_READ | PROT_WRITE;
   VuDevRegion *r;
   void *mmap_addr;
+int low = 0;
+int high = dev->nregions - 1;
+unsigned int idx;

   DPRINT("Adding region %d\n", dev->nregions);
   DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
@@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
   prot = PROT_NONE;
   }

+/*
+ * We will add memory regions into the array sorted by GPA. Perform a
+ * binary search to locate the insertion point: it will be at the low
+ * index.
+ */
+while (low <= high) {
+unsigned int mid = low + (high - low)  / 2;
+VuDevRegion *cur = &dev->regions[mid];
+
+/* Overlap of GPA addresses. */


Looks like this check will only catch if the new region is fully
contained within an existing region. I think we need to check whether
either start or end region are in the range, i.e.:


That check should cover all cases of overlaps, not just fully contained.

See the QEMU implementation of range_overlaps_range(), which contains
similar logic:

return !(range2->upb < range1->lob || range1->upb < range2->lob);

 !(range2->upb < range1->lob || range1->upb < range2->lob);
=  !(range2->upb < range1->lob) && !(range1->upb < range2->lob)
=   range2->upb >= range1->lob && range1->upb >= range2->lob
=   range1->lob <= range2->upb && range2->lob <= range1->upb

In QEMU, upb is inclusive, if it were exclusive (like we have here):

=   range1->lob < range2->upb && range2->lob < range1->upb

Which is what we have here with:

range1->lob = start_gpa
range1->upb = end_gpa
range2->lob = cur->gpa
range2->upb = cur->gpa + cur->size

Also if you are interested, see

https://stackoverflow.com/questions/3269434/whats-the-most-efficient-way-to-test-if-two-ranges-overlap
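
As a self-contained illustration of the half-open variant (plain C, not
QEMU code; the constants are arbitrary):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Half-open ranges [lob, upb), matching start_gpa/end_gpa above. */
    static bool ranges_overlap(uint64_t lob1, uint64_t upb1,
                               uint64_t lob2, uint64_t upb2)
    {
        return lob1 < upb2 && lob2 < upb1;
    }

    int main(void)
    {
        assert(ranges_overlap(0x1000, 0x2000, 0x1800, 0x3000));  /* partial */
        assert(ranges_overlap(0x1000, 0x4000, 0x2000, 0x3000));  /* contained */
        assert(!ranges_overlap(0x1000, 0x2000, 0x2000, 0x3000)); /* adjacent */
        return 0;
    }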

Thanks!


Got it, thanks for the full explanation. With that:

Reviewed-by: Raphael Norwitz 


Thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation

2024-02-04 Thread David Hildenbrand

On 04.02.24 23:01, Raphael Norwitz wrote:

On Sun, Feb 4, 2024 at 9:36 AM David Hildenbrand  wrote:


On 04.02.24 02:35, Raphael Norwitz wrote:

As a heads up, I've left Nutanix and updated it in MAINTAINERS. Will
be updating it again shortly so tagging these with my new work email.



Thanks for the fast review! The mail server already complained to me :)

Maybe consider adding yourself as a reviewer for vhost as well (which
covers libvhost-user)? I took your mail address from git history, not
get_maintainers.pl.


I don't expect I'll have much time to review code outside of
vhost-user-blk/vhost-user-scsi, but happy to add an entry if it helps
folks tag me on relevant patches.


If it helps, it might make sense to split out libvhost-user into a 
separate MAINTAINERS section.


--
Cheers,

David / dhildenb




Re: [PATCH v4 1/1] oslib-posix: initialize backend memory objects in parallel

2024-02-04 Thread David Hildenbrand

On 03.02.24 23:43, Dongli Zhang wrote:



On 1/31/24 08:53, Mark Kanda wrote:

QEMU initializes preallocated backend memory as the objects are parsed from
the command line. This is not optimal in some cases (e.g. memory spanning
multiple NUMA nodes) because the memory objects are initialized in series.

Allow the initialization to occur in parallel (asynchronously). In order to
ensure optimal thread placement, asynchronous initialization requires prealloc
context threads to be in use.

Signed-off-by: Mark Kanda 
Signed-off-by: David Hildenbrand 
---
  backends/hostmem.c |   7 ++-
  hw/virtio/virtio-mem.c |   4 +-
  include/hw/qdev-core.h |   5 ++
  include/qemu/osdep.h   |  18 +-
  system/vl.c|   9 +++
  util/oslib-posix.c | 131 +++--
  util/oslib-win32.c |   8 ++-
  7 files changed, 145 insertions(+), 37 deletions(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 30f69b2cb5..17221e422a 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -20,6 +20,7 @@
  #include "qom/object_interfaces.h"
  #include "qemu/mmap-alloc.h"
  #include "qemu/madvise.h"
+#include "hw/qdev-core.h"
  
  #ifdef CONFIG_NUMA

  #include <numaif.h>
@@ -237,7 +238,7 @@ static void host_memory_backend_set_prealloc(Object *obj, 
bool value,
  uint64_t sz = memory_region_size(&backend->mr);
  
  if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads,

-   backend->prealloc_context, errp)) {
+   backend->prealloc_context, false, errp)) {
  return;
  }
  backend->prealloc = true;
@@ -323,6 +324,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
Error **errp)
  HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc);
  void *ptr;
  uint64_t sz;
+bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED);
  
  if (!bc->alloc) {

  return;
@@ -398,7 +400,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, 
Error **errp)
  if (backend->prealloc && 
!qemu_prealloc_mem(memory_region_get_fd(&backend->mr),
  ptr, sz,
  backend->prealloc_threads,
-backend->prealloc_context, 
errp)) {
+backend->prealloc_context,
+async, errp)) {
  return;
  }
  }
diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 99ab989852..ffd119ebac 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -605,7 +605,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, 
uint64_t start_gpa,
  int fd = memory_region_get_fd(&vmem->memdev->mr);
  Error *local_err = NULL;
  
-if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {

+if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, &local_err)) {
  static bool warned;
  
  /*

@@ -1248,7 +1248,7 @@ static int virtio_mem_prealloc_range_cb(VirtIOMEM *vmem, 
void *arg,
  int fd = memory_region_get_fd(&vmem->memdev->mr);
  Error *local_err = NULL;
  
-if (!qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err)) {

+if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, &local_err)) {
  error_report_err(local_err);
  return -ENOMEM;
  }
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 151d968238..83dd9e2485 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -1071,6 +1071,11 @@ typedef enum MachineInitPhase {
   */
  PHASE_ACCEL_CREATED,
  
+/*

+ * Late backend objects have been created and initialized.
+ */
+PHASE_LATE_BACKENDS_CREATED,
+
  /*
   * machine_class->init has been called, thus creating any embedded
   * devices and validating machine properties.  Devices created at
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index c9692cc314..7d359dabc4 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -680,6 +680,8 @@ typedef struct ThreadContext ThreadContext;
   * @area: start address of the are to preallocate
   * @sz: the size of the area to preallocate
   * @max_threads: maximum number of threads to use
+ * @tc: prealloc context threads pointer, NULL if not in use
+ * @async: request asynchronous preallocation, requires @tc
   * @errp: returns an error if this function fails
   *
   * Preallocate memory (populate/prefault page tables writable) for the virtual
@@ -687,10 +689,24 @@ typedef struct ThreadContext ThreadContext;
   * each page in the area was faulted in writable at least once, for example,
   * after allocating file blocks for mapped files.
   *
+ * When setting @async, allocation might be performed asynchronously.
+ * qemu_finish_async_prealloc_mem() must be called to finish any asynchronous
+ * preallocation.

Re: [PATCH v1 14/15] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions

2024-02-04 Thread David Hildenbrand

On 04.02.24 03:15, Raphael Norwitz wrote:

Someone else with more knowledge of the VQ mapping code should also review.

On Fri, Feb 2, 2024 at 4:55 PM David Hildenbrand  wrote:


Currently, we try to remap all rings whenever we add a single new memory
region. That doesn't quite make sense, because we already map rings when
setting the ring address, and panic if that goes wrong. Likely, that
handling was simply copied from set_mem_table code, where we actually
have to remap all rings.

Remapping all rings might require us to walk quite a lot of memory
regions to perform the address translations. Ideally, we'd simply remove
that remapping.

However, let's be a bit careful. There might be some weird corner cases
where we might temporarily remove a single memory region (e.g., resize
it), that would have worked for now. Further, a ring might be located on
hotplugged memory, and as the VM reboots, we might unplug that memory, to
hotplug memory before resetting the ring addresses.

So let's unmap affected rings as we remove a memory region, and try
dynamically mapping the ring again when required.

Signed-off-by: David Hildenbrand 


Acked-by: Raphael Norwitz 


Thanks!




---
  subprojects/libvhost-user/libvhost-user.c | 107 --
  1 file changed, 78 insertions(+), 29 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index febeb2eb89..738e84ab63 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev)
  dev->nregions = 0;
  }

+static bool
+map_ring(VuDev *dev, VuVirtq *vq)
+{
+vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
+vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
+vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
+
+DPRINT("Setting virtq addresses:\n");
+DPRINT("vring_desc  at %p\n", vq->vring.desc);
+DPRINT("vring_used  at %p\n", vq->vring.used);
+DPRINT("vring_avail at %p\n", vq->vring.avail);
+
+return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
  static bool


Consider changing the function name to indicate that it may actually map a vq?

Maybe vu_maybe_map_vq()?


I don't think that would really be better. It's an implementation detail 
that we try to recover in these corner cases by remapping the rings.


In the majority of cases this function will simply check whether the 
device is broken and the vring was set up properly (which usually 
implies mapping the rings).


So I think in the caller:

if (!vu_is_vq_usable()) {
return;
}

is easier to get than:

if (!vu_maybe_map_vq()) {
return;
}


Thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v1 11/15] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()

2024-02-04 Thread David Hildenbrand

On 04.02.24 03:10, Raphael Norwitz wrote:

One comment on this one.

On Fri, Feb 2, 2024 at 4:56 PM David Hildenbrand  wrote:


Let's speed up GPA to memory region / virtual address lookup. Store the
memory regions ordered by guest physical addresses, and use binary
search for address translation, as well as when adding/removing memory
regions.

Most importantly, this will speed up GPA->VA address translation when we
have many memslots.

Signed-off-by: David Hildenbrand 
---
  subprojects/libvhost-user/libvhost-user.c | 49 +--
  1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index d036b54ed0..75e47b7bb3 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...)
  static VuDevRegion *
  vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
  {
-unsigned int i;
+int low = 0;
+int high = dev->nregions - 1;

  /*
   * Memory regions cannot overlap in guest physical address space. Each
   * GPA belongs to exactly one memory region, so there can only be one
   * match.
+ *
+ * We store our memory regions ordered by GPA and can simply perform a
+ * binary search.
   */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *cur = &dev->regions[i];
+while (low <= high) {
+unsigned int mid = low + (high - low) / 2;
+VuDevRegion *cur = &dev->regions[mid];

  if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
  return cur;
  }
+if (guest_addr >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (guest_addr < cur->gpa) {
+high = mid - 1;
+}
  }
  return NULL;
  }
@@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev)
  static void
  _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
  {
+const uint64_t start_gpa = msg_region->guest_phys_addr;
+const uint64_t end_gpa = start_gpa + msg_region->memory_size;
  int prot = PROT_READ | PROT_WRITE;
  VuDevRegion *r;
  void *mmap_addr;
+int low = 0;
+int high = dev->nregions - 1;
+unsigned int idx;

  DPRINT("Adding region %d\n", dev->nregions);
  DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
@@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
  prot = PROT_NONE;
  }

+/*
+ * We will add memory regions into the array sorted by GPA. Perform a
+ * binary search to locate the insertion point: it will be at the low
+ * index.
+ */
+while (low <= high) {
+unsigned int mid = low + (high - low)  / 2;
+VuDevRegion *cur = &dev->regions[mid];
+
+/* Overlap of GPA addresses. */


Looks like this check will only catch if the new region is fully
contained within an existing region. I think we need to check whether
either start or end region are in the range, i.e.:


That check should cover all cases of overlaps, not just fully contained.

See the QEMU implementation of range_overlaps_range(), which contains 
similar logic:


return !(range2->upb < range1->lob || range1->upb < range2->lob);

   !(range2->upb < range1->lob || range1->upb < range2->lob);
=  !(range2->upb < range1->lob) && !(range1->upb < range2->lob)
=   range2->upb >= range1->lob && range1->upb >= range2->lob
=   range1->lob <= range2->upb && range2->lob <= range1->upb

In QEMU, upb is inclusive, if it were exclusive (like we have here):

=   range1->lob < range2->upb && range2->lob < range1->upb

Which is what we have here with:

range1->lob = start_gpa
range1->upb = end_gpa
range2->lob = cur->gpa
range2->upb = cur->gpa + cur->size

Also if you are interested, see

https://stackoverflow.com/questions/3269434/whats-the-most-efficient-way-to-test-if-two-ranges-overlap

Thanks!

--
Cheers,

David / dhildenb




Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation

2024-02-04 Thread David Hildenbrand

On 04.02.24 02:35, Raphael Norwitz wrote:

As a heads up, I've left Nutanix and updated it in MAINTAINERS. Will
be updating it again shortly so tagging these with my new work email.



Thanks for the fast review! The mail server already complained to me :)

Maybe consider adding yourself as a reviewer for vhost as well (which 
covers libvhost-user)? I took your mail address from git history, not 
get_maintainers.pl.



On Fri, Feb 2, 2024 at 4:54 PM David Hildenbrand  wrote:


We barely had mmap_offset set in the past. With virtio-mem and
dynamic-memslots that will change.

In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are
performing pointer arithmetic, which is wrong. Let's simply
use dev_region->mmap_addr instead of "void *mmap_addr".

Fixes: ec94c8e621de ("Support adding individual regions in libvhost-user")
Fixes: 9bb38019942c ("vhost+postcopy: Send address back to qemu")
Cc: Raphael Norwitz 
Signed-off-by: David Hildenbrand 


Reviewed-by: Raphael Norwitz 



--
Cheers,

David / dhildenb




[PATCH v1 09/15] libvhost-user: Don't search for duplicates when removing memory regions

2024-02-02 Thread David Hildenbrand
We cannot have duplicate memory regions, something would be deeply
flawed elsewhere. Let's just stop the search once we found an entry.

We'll add more sanity checks when adding memory regions later.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index e1a1b9df88..22154b217f 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -896,8 +896,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 i--;
 
 found = true;
-
-/* Continue the search for eventual duplicates. */
+break;
 }
 }
 
-- 
2.43.0




[PATCH v1 12/15] libvhost-user: Use most of mmap_offset as fd_offset

2024-02-02 Thread David Hildenbrand
In the past, QEMU would create memory regions that could partially cover
hugetlb pages, making mmap() fail if we would use the mmap_offset as an
fd_offset. For that reason, we never used the mmap_offset as an offset into
the fd and instead always mapped the fd from the very start.

However, that can easily result in us mmap'ing a lot of unnecessary
parts of an fd, possibly repeatedly.

QEMU nowadays does not create memory regions that partially cover huge
pages -- it never really worked with postcopy. QEMU handles merging of
regions that partially cover huge pages (due to holes in boot memory) since
2018 in c1ece84e7c93 ("vhost: Huge page align and merge").

Let's be a bit careful and not unconditionally convert the
mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb
size and pass as much as we can as fd_offset, making sure that we call
mmap() with a properly aligned offset.
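
A small standalone illustration of that computation (2 MiB hugetlb pages
assumed; ALIGN_DOWN spelled out here rather than taken from QEMU headers,
and the offset value is just an example):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ALIGN_DOWN(x, a) ((x) & ~((uint64_t)(a) - 1))

    int main(void)
    {
        uint64_t page_size   = 2ULL << 20;     /* 2 MiB hugetlb page */
        uint64_t mmap_offset = 0xc0000000ULL;  /* offset requested by QEMU */
        uint64_t fd_offset   = ALIGN_DOWN(mmap_offset, page_size);

        /* An already-aligned offset leaves a residual mmap_offset of 0. */
        printf("fd_offset=0x%" PRIx64 " residual=0x%" PRIx64 "\n",
               fd_offset, mmap_offset - fd_offset);
        return 0;
    }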

With QEMU and a virtio-mem device that is fully plugged (50GiB using 50
memslots) the qemu-storage daemon process consumes in the VA space
1281GiB before this change and 58GiB after this change.

Example debug output:
   Vhost user message 
  Request: VHOST_USER_ADD_MEM_REG (37)
  Flags:   0x9
  Size:40
  Fds: 59
  Adding region 50
  guest_phys_addr: 0x000d8000
  memory_size: 0x4000
  userspace_addr   0x7f54ebffe000
  mmap_offset  0x000c
  fd_offset:   0x000c
  new mmap_offset: 0x
  mmap_addr:   0x7f7ecc00
  Successfully added new region
   Vhost user message 
  Request: VHOST_USER_ADD_MEM_REG (37)
  Flags:   0x9
  Size:40
  Fds: 59
  Adding region 51
  guest_phys_addr: 0x000dc000
  memory_size: 0x4000
  userspace_addr   0x7f552bffe000
  mmap_offset  0x000c4000
  fd_offset:   0x000c4000
  new mmap_offset: 0x
  mmap_addr:   0x7f7e8c00
  Successfully added new region

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 50 ---
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 75e47b7bb3..7d8293dc84 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -43,6 +43,8 @@
 #include 
 #include 
 #include 
+#include <sys/vfs.h>
+#include <linux/magic.h>
 
 #ifdef __NR_userfaultfd
 #include <linux/userfaultfd.h>
@@ -281,12 +283,36 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static size_t
+get_fd_pagesize(int fd)
+{
+static size_t pagesize;
+#if defined(__linux__)
+struct statfs fs;
+int ret;
+
+do {
+ret = fstatfs(fd, &fs);
+} while (ret != 0 && errno == EINTR);
+
+if (!ret && fs.f_type == HUGETLBFS_MAGIC) {
+return fs.f_bsize;
+}
+#endif
+
+if (!pagesize) {
+pagesize = getpagesize();
+}
+return pagesize;
+}
+
 static void
 _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 {
 const uint64_t start_gpa = msg_region->guest_phys_addr;
 const uint64_t end_gpa = start_gpa + msg_region->memory_size;
 int prot = PROT_READ | PROT_WRITE;
+uint64_t mmap_offset, fd_offset;
 VuDevRegion *r;
 void *mmap_addr;
 int low = 0;
@@ -335,11 +361,25 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion 
*msg_region, int fd)
 idx = low;
 
 /*
- * We don't use offset argument of mmap() since the mapped address has
- * to be page aligned, and we use huge pages.
+ * Convert most of msg_region->mmap_offset to fd_offset. In almost all
+ * cases, this will leave us with mmap_offset == 0, mmap()'ing only
+ * what we really need. Only if a memory region would partially cover
+ * hugetlb pages, we'd get mmap_offset != 0, which usually doesn't happen
+ * anymore (i.e., modern QEMU).
+ *
+ * Note that mmap() with hugetlb would fail if the offset into the file
+ * is not aligned to the huge page size.
  */
-mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset,
- prot, MAP_SHARED | MAP_NORESERVE, fd, 0);
+fd_offset = ALIGN_DOWN(msg_region->mmap_offset, get_fd_pagesize(fd));
+mmap_offset = msg_region->mmap_offset - fd_offset;
+
+DPRINT("fd_offset:   0x%016"PRIx64"\n",
+   fd_offset);
+DPRINT("adj mmap_offset: 0x%016"PRIx64"\n",
+   mmap_offset);
+
+mmap_addr = mmap(0, msg_region->memory_size + mmap_offset,
+ prot, MAP_SHARED | MAP_NORESERVE, fd, fd_offset);
 if (mmap_addr == MAP_FAILED) {
 vu_panic(dev, "region mmap error: %s", strerror(errno));
 return;
@@ -354,7 +394,7 @@ _vu_add_mem_reg(VuDe

[PATCH v1 13/15] libvhost-user: Factor out vq usability check

2024-02-02 Thread David Hildenbrand
Let's factor it out to prepare for further changes.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 24 +++
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index 7d8293dc84..febeb2eb89 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,6 +283,12 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static bool
+vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
+{
+return likely(!dev->broken) && likely(vq->vring.avail);
+}
+
 static size_t
 get_fd_pagesize(int fd)
 {
@@ -2378,8 +2384,7 @@ vu_queue_get_avail_bytes(VuDev *dev, VuVirtq *vq, 
unsigned int *in_bytes,
 idx = vq->last_avail_idx;
 
 total_bufs = in_total = out_total = 0;
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 goto done;
 }
 
@@ -2494,8 +2499,7 @@ vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned 
int in_bytes,
 bool
 vu_queue_empty(VuDev *dev, VuVirtq *vq)
 {
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return true;
 }
 
@@ -2534,8 +2538,7 @@ vring_notify(VuDev *dev, VuVirtq *vq)
 
 static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync)
 {
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
@@ -2860,8 +2863,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 unsigned int head;
 VuVirtqElement *elem;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return NULL;
 }
 
@@ -3018,8 +3020,7 @@ vu_queue_fill(VuDev *dev, VuVirtq *vq,
 {
 struct vring_used_elem uelem;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
@@ -3048,8 +3049,7 @@ vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int 
count)
 {
 uint16_t old, new;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
+if (!vu_is_vq_usable(dev, vq)) {
 return;
 }
 
-- 
2.43.0




[PATCH v1 05/15] libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()

2024-02-02 Thread David Hildenbrand
Let's reduce some code duplication and prepare for further changes.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 119 +++---
 1 file changed, 39 insertions(+), 80 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c 
b/subprojects/libvhost-user/libvhost-user.c
index d5b3468e43..d9e2214ad2 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -937,95 +937,23 @@ vu_get_shared_object(VuDev *dev, VhostUserMsg *vmsg)
 }
 
 static bool
-vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
+vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
-unsigned int i;
VhostUserMemory m = vmsg->payload.memory, *memory = &m;
-dev->nregions = memory->nregions;
-
-DPRINT("Nregions: %u\n", memory->nregions);
-for (i = 0; i < dev->nregions; i++) {
-void *mmap_addr;
-VhostUserMemoryRegion *msg_region = &memory->regions[i];
-VuDevRegion *dev_region = &dev->regions[i];
-
-DPRINT("Region %d\n", i);
-DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
-   msg_region->guest_phys_addr);
-DPRINT("memory_size: 0x%016"PRIx64"\n",
-   msg_region->memory_size);
-DPRINT("userspace_addr   0x%016"PRIx64"\n",
-   msg_region->userspace_addr);
-DPRINT("mmap_offset  0x%016"PRIx64"\n",
-   msg_region->mmap_offset);
-
-dev_region->gpa = msg_region->guest_phys_addr;
-dev_region->size = msg_region->memory_size;
-dev_region->qva = msg_region->userspace_addr;
-dev_region->mmap_offset = msg_region->mmap_offset;
+int prot = PROT_READ | PROT_WRITE;
+unsigned int i;
 
-/* We don't use offset argument of mmap() since the
- * mapped address has to be page aligned, and we use huge
- * pages.
+if (dev->postcopy_listening) {
+/*
  * In postcopy we're using PROT_NONE here to catch anyone
  * accessing it before we userfault
  */
-mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_NONE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);
-
-if (mmap_addr == MAP_FAILED) {
-vu_panic(dev, "region mmap error: %s", strerror(errno));
-} else {
-dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
-DPRINT("mmap_addr:   0x%016"PRIx64"\n",
-   dev_region->mmap_addr);
-}
-
-/* Return the address to QEMU so that it can translate the ufd
- * fault addresses back.
- */
-msg_region->userspace_addr = dev_region->mmap_addr +
- dev_region->mmap_offset;
-close(vmsg->fds[i]);
-}
-
-/* Send the message back to qemu with the addresses filled in */
-vmsg->fd_num = 0;
-if (!vu_send_reply(dev, dev->sock, vmsg)) {
-vu_panic(dev, "failed to respond to set-mem-table for postcopy");
-return false;
-}
-
-/* Wait for QEMU to confirm that it's registered the handler for the
- * faults.
- */
-if (!dev->read_msg(dev, dev->sock, vmsg) ||
-vmsg->size != sizeof(vmsg->payload.u64) ||
-vmsg->payload.u64 != 0) {
-vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table");
-return false;
+prot = PROT_NONE;
 }
 
-/* OK, now we can go and register the memory and generate faults */
-(void)generate_faults(dev);
-
-return false;
-}
-
-static bool
-vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
-{
-unsigned int i;
-VhostUserMemory m = vmsg->payload.memory, *memory = &m;
-
 vu_remove_all_mem_regs(dev);
 dev->nregions = memory->nregions;
 
-if (dev->postcopy_listening) {
-return vu_set_mem_table_exec_postcopy(dev, vmsg);
-}
-
 DPRINT("Nregions: %u\n", memory->nregions);
 for (i = 0; i < dev->nregions; i++) {
 void *mmap_addr;
@@ -1051,8 +979,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
  * mapped address has to be page aligned, and we use huge
  * pages.  */
 mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);
+ prot, MAP_SHARED | MAP_NORESERVE, vmsg->fds[i], 0);
 
 if (mmap_addr == MAP_FAILED) {
 vu_panic(dev, "region mmap error: %s", strerror(errno));
@@ -1062,9 +989,41 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)

[PATCH v1 02/15] libvhost-user: Dynamically allocate memory for memory slots

2024-02-02 Thread David Hildenbrand
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically
allocating dev->regions. We don't have any ABI guarantees (not
dynamically linked), so we can simply change the layout of VuDev.

Let's zero out the memory, just as we used to do.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 11 +++
 subprojects/libvhost-user/libvhost-user.h |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 7e515ed15d..8a5a7a2295 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -2171,6 +2171,8 @@ vu_deinit(VuDev *dev)
 
 free(dev->vq);
 dev->vq = NULL;
+free(dev->regions);
+dev->regions = NULL;
 }
 
 bool
@@ -2205,9 +2207,18 @@ vu_init(VuDev *dev,
 dev->backend_fd = -1;
 dev->max_queues = max_queues;
 
+dev->regions = malloc(VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0]));
+if (!dev->regions) {
+DPRINT("%s: failed to malloc mem regions\n", __func__);
+return false;
+}
+memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0]));
+
 dev->vq = malloc(max_queues * sizeof(dev->vq[0]));
 if (!dev->vq) {
 DPRINT("%s: failed to malloc virtqueues\n", __func__);
+free(dev->regions);
+dev->regions = NULL;
 return false;
 }
 
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index c2352904f0..c882b4e3a2 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -398,7 +398,7 @@ typedef struct VuDevInflightInfo {
 struct VuDev {
 int sock;
 uint32_t nregions;
-VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS];
+VuDevRegion *regions;
 VuVirtq *vq;
 VuDevInflightInfo inflight_info;
 int log_call_fd;
-- 
2.43.0
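
An aside on the allocation above: the malloc()+memset() pair could
equivalently be a single calloc(), which zeroes in one step (patch 08 later
drops the zeroing entirely). A minimal standalone sketch, with VuDevRegion as
a stand-in for the real struct:

    #include <stdio.h>
    #include <stdlib.h>

    #define VHOST_USER_MAX_RAM_SLOTS 509

    typedef struct { int dummy; } VuDevRegion; /* stand-in for the real struct */

    int main(void)
    {
        /* calloc() allocates and zeroes in one call. */
        VuDevRegion *regions = calloc(VHOST_USER_MAX_RAM_SLOTS,
                                      sizeof(regions[0]));
        if (!regions) {
            fprintf(stderr, "failed to allocate mem regions\n");
            return 1;
        }
        free(regions);
        return 0;
    }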




[PATCH v1 08/15] libvhost-user: Don't zero out memory for memory regions

2024-02-02 Thread David Hildenbrand
dev->nregions always covers only valid entries. Stop zeroing out other
array elements that are unused.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index f99c888b48..e1a1b9df88 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -888,13 +888,9 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 
 munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 
-/*
- * Shift all affected entries by 1 to close the hole at index i and
- * zero out the last entry.
- */
+/* Shift all affected entries by 1 to close the hole at index. */
 memmove(dev->regions + i, dev->regions + i + 1,
 sizeof(VuDevRegion) * (dev->nregions - i - 1));
-memset(dev->regions + dev->nregions - 1, 0, sizeof(VuDevRegion));
 DPRINT("Successfully removed a region\n");
 dev->nregions--;
 i--;
@@ -2119,7 +2115,6 @@ vu_init(VuDev *dev,
 DPRINT("%s: failed to malloc mem regions\n", __func__);
 return false;
 }
-memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0]));
 
 dev->vq = malloc(max_queues * sizeof(dev->vq[0]));
 if (!dev->vq) {
-- 
2.43.0




[PATCH v1 11/15] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()

2024-02-02 Thread David Hildenbrand
Let's speed up GPA to memory region / virtual address lookup. Store the
memory regions ordered by guest physical addresses, and use binary
search for address translation, as well as when adding/removing memory
regions.

Most importantly, this will speed up GPA->VA address translation when we
have many memslots.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 49 +--
 1 file changed, 45 insertions(+), 4 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index d036b54ed0..75e47b7bb3 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...)
 static VuDevRegion *
 vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
 {
-unsigned int i;
+int low = 0;
+int high = dev->nregions - 1;
 
 /*
  * Memory regions cannot overlap in guest physical address space. Each
  * GPA belongs to exactly one memory region, so there can only be one
  * match.
+ *
+ * We store our memory regions ordered by GPA and can simply perform a
+ * binary search.
  */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *cur = &dev->regions[i];
+while (low <= high) {
+unsigned int mid = low + (high - low) / 2;
+VuDevRegion *cur = &dev->regions[mid];
 
 if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
 return cur;
 }
+if (guest_addr >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (guest_addr < cur->gpa) {
+high = mid - 1;
+}
 }
 return NULL;
 }
@@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev)
 static void
 _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 {
+const uint64_t start_gpa = msg_region->guest_phys_addr;
+const uint64_t end_gpa = start_gpa + msg_region->memory_size;
 int prot = PROT_READ | PROT_WRITE;
 VuDevRegion *r;
 void *mmap_addr;
+int low = 0;
+int high = dev->nregions - 1;
+unsigned int idx;
 
 DPRINT("Adding region %d\n", dev->nregions);
 DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
@@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 prot = PROT_NONE;
 }
 
+/*
+ * We will add memory regions into the array sorted by GPA. Perform a
+ * binary search to locate the insertion point: it will be at the low
+ * index.
+ */
+while (low <= high) {
+unsigned int mid = low + (high - low) / 2;
+VuDevRegion *cur = &dev->regions[mid];
+
+/* Overlap of GPA addresses. */
+if (start_gpa < cur->gpa + cur->size && cur->gpa < end_gpa) {
+vu_panic(dev, "regions with overlapping guest physical addresses");
+return;
+}
+if (start_gpa >= cur->gpa + cur->size) {
+low = mid + 1;
+}
+if (start_gpa < cur->gpa) {
+high = mid - 1;
+}
+}
+idx = low;
+
 /*
  * We don't use offset argument of mmap() since the mapped address has
  * to be page aligned, and we use huge pages.
@@ -308,7 +347,9 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 DPRINT("mmap_addr:   0x%016"PRIx64"\n",
(uint64_t)(uintptr_t)mmap_addr);
 
-r = &dev->regions[dev->nregions];
+/* Shift all affected entries by 1 to open a hole at idx. */
+r = &dev->regions[idx];
+memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx));
 r->gpa = msg_region->guest_phys_addr;
 r->size = msg_region->memory_size;
 r->qva = msg_region->userspace_addr;
-- 
2.43.0
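
To poke at the lookup in isolation, here is a minimal standalone sketch of
the same binary search over a toy sorted array. Region and gpa_to_region()
are illustrative names, not the library's:

    #include <inttypes.h>
    #include <stdio.h>

    typedef struct { uint64_t gpa; uint64_t size; } Region; /* stand-in */

    /* Binary search over regions sorted by gpa; mirrors the logic of
     * vu_gpa_to_mem_region() above. */
    static Region *gpa_to_region(Region *regions, int nregions,
                                 uint64_t guest_addr)
    {
        int low = 0, high = nregions - 1;

        while (low <= high) {
            int mid = low + (high - low) / 2;
            Region *cur = &regions[mid];

            if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
                return cur;
            }
            if (guest_addr >= cur->gpa + cur->size) {
                low = mid + 1;   /* target lies above this region */
            } else {
                high = mid - 1;  /* target lies below this region */
            }
        }
        return NULL;
    }

    int main(void)
    {
        Region regions[] = {
            { 0x00000000, 0x100000 },
            { 0x40000000, 0x200000 },
            { 0x80000000, 0x100000 },
        };
        Region *r = gpa_to_region(regions,
                                  (int)(sizeof(regions) / sizeof(regions[0])),
                                  0x40001000);

        printf("hit region at gpa 0x%" PRIx64 "\n", r ? r->gpa : 0);
        return 0;
    }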




[PATCH v1 07/15] libvhost-user: No need to check for NULL when unmapping

2024-02-02 Thread David Hildenbrand
We never add a memory region if mmap() failed. Therefore, no need to check
for NULL.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index a2baefe84b..f99c888b48 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -247,11 +247,8 @@ vu_remove_all_mem_regs(VuDev *dev)
 
 for (i = 0; i < dev->nregions; i++) {
VuDevRegion *r = &dev->regions[i];
-void *ma = (void *)(uintptr_t)r->mmap_addr;
 
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 }
 dev->nregions = 0;
 }
@@ -888,11 +885,8 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 for (i = 0; i < dev->nregions; i++) {
if (reg_equal(&dev->regions[i], msg_region)) {
VuDevRegion *r = &dev->regions[i];
-void *ma = (void *) (uintptr_t) r->mmap_addr;
 
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
 
 /*
  * Shift all affected entries by 1 to close the hole at index i and
-- 
2.43.0




[PATCH v1 15/15] libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

2024-02-02 Thread David Hildenbrand
We already use MAP_NORESERVE to deal with sparse memory regions. Let's
also set MADV_DONTDUMP via madvise(); otherwise, a crash of the process
can result in us allocating all memory in the mmap'ed region for dumping
purposes.

This change implies that the mmap'ed rings won't be included in a
coredump. If ever required for debugging purposes, we could mark only
the mapped rings MADV_DODUMP.

Ignore errors during madvise() for now.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 738e84ab63..26c289518c 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -458,6 +458,12 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
 DPRINT("mmap_addr:   0x%016"PRIx64"\n",
(uint64_t)(uintptr_t)mmap_addr);
 
+#if defined(__linux__)
+/* Don't include all guest memory in a coredump. */
+madvise(mmap_addr, msg_region->memory_size + mmap_offset,
+MADV_DONTDUMP);
+#endif
+
 /* Shift all affected entries by 1 to open a hole at idx. */
r = &dev->regions[idx];
 memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx));
-- 
2.43.0
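
On the MADV_DODUMP idea mentioned above: a hedged, Linux-only sketch of what
re-including a single ring in coredumps could look like. ring_mark_dodump()
is a hypothetical helper, not part of the patch; madvise() needs a
page-aligned start address, hence the explicit alignment (error handling
elided):

    #define _DEFAULT_SOURCE
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical helper: opt one ring's pages back into coredumps. */
    static int ring_mark_dodump(void *addr, size_t len)
    {
        const uintptr_t ps = (uintptr_t)sysconf(_SC_PAGESIZE);
        const uintptr_t start = (uintptr_t)addr & ~(ps - 1);
        const uintptr_t end = ((uintptr_t)addr + len + ps - 1) & ~(ps - 1);

        /* MADV_DODUMP undoes MADV_DONTDUMP for the covered pages. */
        return madvise((void *)start, end - start, MADV_DODUMP);
    }

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        madvise(p, 4096, MADV_DONTDUMP);   /* as the patch does for regions */
        return ring_mark_dodump(p, 4096);  /* re-include just this "ring" */
    }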




[PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation

2024-02-02 Thread David Hildenbrand
We barely had mmap_offset set in the past. With virtio-mem and
dynamic-memslots that will change.

In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are
performing pointer arithmetic, which is wrong. Let's simply
use dev_region->mmap_addr instead of "void *mmap_addr".

Fixes: ec94c8e621de ("Support adding individual regions in libvhost-user")
Fixes: 9bb38019942c ("vhost+postcopy: Send address back to qemu")
Cc: Raphael Norwitz 
Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index a3b158c671..7e515ed15d 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -800,8 +800,8 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
  * Return the address to QEMU so that it can translate the ufd
  * fault addresses back.
  */
-msg_region->userspace_addr = (uintptr_t)(mmap_addr +
- dev_region->mmap_offset);
+msg_region->userspace_addr = dev_region->mmap_addr +
+ dev_region->mmap_offset;
 
 /* Send the message back to qemu with the addresses filled in. */
 vmsg->fd_num = 0;
@@ -969,8 +969,8 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg)
 /* Return the address to QEMU so that it can translate the ufd
  * fault addresses back.
  */
-msg_region->userspace_addr = (uintptr_t)(mmap_addr +
- dev_region->mmap_offset);
+msg_region->userspace_addr = dev_region->mmap_addr +
+ dev_region->mmap_offset;
 close(vmsg->fds[i]);
 }
 
-- 
2.43.0




[PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code

2024-02-02 Thread David Hildenbrand
This series adds support for more memslots (509) to libvhost-user, to
make it fully compatible with virtio-mem that uses up to 256 memslots
across all memory devices in "dynamic-memslot" mode (more details
in patch #3).

One simple fix upfront.

With that in place, this series optimizes and extends memory region
handling in libvhost-user:
* Heavily deduplicate and clean up the memory region handling code
* Speed up GPA->VA translation with many memslots using binary search
* Optimize mmap_offset handling to use it as fd_offset for mmap()
* Avoid ring remapping when adding a single memory region
* Avoid dumping all guest memory, possibly allocating memory in sparse
  memory mappings when the process crashes

I'm being very careful not to break some weird corner case that modern
QEMU might no longer trigger, but that an older one could have triggered
or some other frontend might still trigger.

The only case where I am not careful is forbidding memory regions that
overlap in GPA space: such a layout doesn't make any sense.

With this series, virtio-mem (with dynamic-memslots=on) +
qemu-storage-daemon works flawlessly and as expected in my tests.

Cc: Michael S. Tsirkin 
Cc: Jason Wang 
Cc: Stefan Hajnoczi 
Cc: Stefano Garzarella 
Cc: Germano Veit Michel 
Cc: Raphael Norwitz 

David Hildenbrand (15):
  libvhost-user: Fix msg_region->userspace_addr computation
  libvhost-user: Dynamically allocate memory for memory slots
  libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
  libvhost-user: Factor out removing all mem regions
  libvhost-user: Merge vu_set_mem_table_exec_postcopy() into
vu_set_mem_table_exec()
  libvhost-user: Factor out adding a memory region
  libvhost-user: No need to check for NULL when unmapping
  libvhost-user: Don't zero out memory for memory regions
  libvhost-user: Don't search for duplicates when removing memory
regions
  libvhost-user: Factor out search for memory region by GPA and simplify
  libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
  libvhost-user: Use most of mmap_offset as fd_offset
  libvhost-user: Factor out vq usability check
  libvhost-user: Dynamically remap rings after (temporarily?) removing
memory regions
  libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

 subprojects/libvhost-user/libvhost-user.c | 593 --
 subprojects/libvhost-user/libvhost-user.h |  10 +-
 2 files changed, 332 insertions(+), 271 deletions(-)

-- 
2.43.0




[PATCH v1 03/15] libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509

2024-02-02 Thread David Hildenbrand
Let's support up to 509 mem slots, just like vhost in the kernel usually
does and like the rust vhost-user implementation recently started doing [1].
This is required to properly support memory hotplug, either using
multiple DIMMs (ACPI supports up to 256) or using virtio-mem.

The 509 used to be the KVM limit: it supported 512, but 3 were
used for internal purposes. Currently, KVM supports more than 512, but
it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot
memory), except when other memory devices like PCI devices with BARs are
used. So, 509 seems to work well for vhost in the kernel.

Details can be found in the QEMU change that made virtio-mem consume
up to 256 mem slots across all virtio-mem devices. [2]

509 mem slots implies 509 VMAs/mappings in the worst case (even though,
in practice with virtio-mem we won't be seeing more than ~260 in most
setups).

With max_map_count under Linux defaulting to 64k (65530), 509 mem slots
still correspond to less than 1% of the maximum number of mappings
(509 / 65530 is roughly 0.8%).
There are plenty left for the application to consume.

[1] https://github.com/rust-vmm/vhost/pull/224
[2] https://lore.kernel.org/all/20230926185738.277351-1-da...@redhat.com/

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index c882b4e3a2..deb40e77b3 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -31,10 +31,12 @@
 #define VHOST_MEMORY_BASELINE_NREGIONS 8
 
 /*
- * Set a reasonable maximum number of ram slots, which will be supported by
- * any architecture.
+ * vhost in the kernel usually supports 509 mem slots. 509 used to be the
+ * KVM limit, it supported 512, but 3 were used for internal purposes. This
+ * limit is sufficient to support many DIMMs and virtio-mem in
+ * "dynamic-memslots" mode.
  */
-#define VHOST_USER_MAX_RAM_SLOTS 32
+#define VHOST_USER_MAX_RAM_SLOTS 509
 
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
 
-- 
2.43.0




[PATCH v1 06/15] libvhost-user: Factor out adding a memory region

2024-02-02 Thread David Hildenbrand
Let's factor it out, reducing quite a bit of code duplication and preparing
for further changes.

If we fail to mmap a region and panic, we now simply don't add that
(broken) region.

Note that we now increment dev->nregions only as we successfully add
memory regions, and don't increment it if anything goes wrong.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 168 --
 1 file changed, 60 insertions(+), 108 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index d9e2214ad2..a2baefe84b 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -256,6 +256,61 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static void
+_vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd)
+{
+int prot = PROT_READ | PROT_WRITE;
+VuDevRegion *r;
+void *mmap_addr;
+
+DPRINT("Adding region %d\n", dev->nregions);
+DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
+   msg_region->guest_phys_addr);
+DPRINT("memory_size: 0x%016"PRIx64"\n",
+   msg_region->memory_size);
+DPRINT("userspace_addr   0x%016"PRIx64"\n",
+   msg_region->userspace_addr);
+DPRINT("mmap_offset  0x%016"PRIx64"\n",
+   msg_region->mmap_offset);
+
+if (dev->postcopy_listening) {
+/*
+ * In postcopy we're using PROT_NONE here to catch anyone
+ * accessing it before we userfault
+ */
+prot = PROT_NONE;
+}
+
+/*
+ * We don't use offset argument of mmap() since the mapped address has
+ * to be page aligned, and we use huge pages.
+ */
+mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset,
+ prot, MAP_SHARED | MAP_NORESERVE, fd, 0);
+if (mmap_addr == MAP_FAILED) {
+vu_panic(dev, "region mmap error: %s", strerror(errno));
+return;
+}
+DPRINT("mmap_addr:   0x%016"PRIx64"\n",
+   (uint64_t)(uintptr_t)mmap_addr);
+
+r = &dev->regions[dev->nregions];
+r->gpa = msg_region->guest_phys_addr;
+r->size = msg_region->memory_size;
+r->qva = msg_region->userspace_addr;
+r->mmap_addr = (uint64_t)(uintptr_t)mmap_addr;
+r->mmap_offset = msg_region->mmap_offset;
+dev->nregions++;
+
+if (dev->postcopy_listening) {
+/*
+ * Return the address to QEMU so that it can translate the ufd
+ * fault addresses back.
+ */
+msg_region->userspace_addr = r->mmap_addr + r->mmap_offset;
+}
+}
+
 static void
 vmsg_close_fds(VhostUserMsg *vmsg)
 {
@@ -727,10 +782,7 @@ generate_faults(VuDev *dev) {
 static bool
 vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 int i;
-bool track_ramblocks = dev->postcopy_listening;
VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
-VuDevRegion *dev_region = &dev->regions[dev->nregions];
-void *mmap_addr;
 
 if (vmsg->fd_num != 1) {
 vmsg_close_fds(vmsg);
@@ -760,69 +812,20 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
  * we know all the postcopy client bases have been received, and we
  * should start generating faults.
  */
-if (track_ramblocks &&
+if (dev->postcopy_listening &&
 vmsg->size == sizeof(vmsg->payload.u64) &&
 vmsg->payload.u64 == 0) {
 (void)generate_faults(dev);
 return false;
 }
 
-DPRINT("Adding region: %u\n", dev->nregions);
-DPRINT("guest_phys_addr: 0x%016"PRIx64"\n",
-   msg_region->guest_phys_addr);
-DPRINT("memory_size: 0x%016"PRIx64"\n",
-   msg_region->memory_size);
-DPRINT("userspace_addr   0x%016"PRIx64"\n",
-   msg_region->userspace_addr);
-DPRINT("mmap_offset  0x%016"PRIx64"\n",
-   msg_region->mmap_offset);
-
-dev_region->gpa = msg_region->guest_phys_addr;
-dev_region->size = msg_region->memory_size;
-dev_region->qva = msg_region->userspace_addr;
-dev_region->mmap_offset = msg_region->mmap_offset;
-
-/*
- * We don't use offset argument of mmap() since the
- * mapped address has to be page aligned, and we use huge
- * pages.
- */
-if (track_ramblocks) {
-/*
- * In postcopy we're using PROT_NONE here to catch anyone
- * accessing it before we userfault.
- */
-mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset,
- PROT_NONE, MAP_SHARED | MAP_NORESERVE,
- vmsg->fds[i], 0);

[PATCH v1 14/15] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions

2024-02-02 Thread David Hildenbrand
Currently, we try to remap all rings whenever we add a single new memory
region. That doesn't quite make sense, because we already map rings when
setting the ring address, and panic if that goes wrong. Likely, that
handling was simply copied from set_mem_table code, where we actually
have to remap all rings.

Remapping all rings might require us to walk quite a lot of memory
regions to perform the address translations. Ideally, we'd simply remove
that remapping.

However, let's be a bit careful. There might be some weird corner cases
where we temporarily remove a single memory region (e.g., to resize it)
that would have kept working so far. Further, a ring might be located on
hotplugged memory; as the VM reboots, we might unplug that memory and
hotplug other memory before the ring addresses are reset.

So let's unmap affected rings as we remove a memory region, and try
dynamically mapping the ring again when required.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 107 --
 1 file changed, 78 insertions(+), 29 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index febeb2eb89..738e84ab63 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev)
 dev->nregions = 0;
 }
 
+static bool
+map_ring(VuDev *dev, VuVirtq *vq)
+{
+vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
+vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
+vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
+
+DPRINT("Setting virtq addresses:\n");
+DPRINT("vring_desc  at %p\n", vq->vring.desc);
+DPRINT("vring_used  at %p\n", vq->vring.used);
+DPRINT("vring_avail at %p\n", vq->vring.avail);
+
+return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
+}
+
 static bool
 vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
 {
-return likely(!dev->broken) && likely(vq->vring.avail);
+if (unlikely(dev->broken)) {
+return false;
+}
+
+if (likely(vq->vring.avail)) {
+return true;
+}
+
+/*
+ * In corner cases, we might temporarily remove a memory region that
+ * mapped a ring. When removing a memory region we make sure to
+ * unmap any rings that would be impacted. Let's try to remap if we
+ * already succeeded mapping this ring once.
+ */
+if (!vq->vra.desc_user_addr || !vq->vra.used_user_addr ||
+!vq->vra.avail_user_addr) {
+return false;
+}
+if (map_ring(dev, vq)) {
+vu_panic(dev, "remapping queue on access");
+return false;
+}
+return true;
+}
+
+static void
+unmap_rings(VuDev *dev, VuDevRegion *r)
+{
+int i;
+
+for (i = 0; i < dev->max_queues; i++) {
+VuVirtq *vq = &dev->vq[i];
+const uintptr_t desc = (uintptr_t)vq->vring.desc;
+const uintptr_t used = (uintptr_t)vq->vring.used;
+const uintptr_t avail = (uintptr_t)vq->vring.avail;
+
+if (desc < r->mmap_addr || desc >= r->mmap_addr + r->size) {
+continue;
+}
+if (used < r->mmap_addr || used >= r->mmap_addr + r->size) {
+continue;
+}
+if (avail < r->mmap_addr || avail >= r->mmap_addr + r->size) {
+continue;
+}
+
+DPRINT("Unmapping rings of queue %d\n", i);
+vq->vring.desc = NULL;
+vq->vring.used = NULL;
+vq->vring.avail = NULL;
+}
 }
 
 static size_t
@@ -784,21 +849,6 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg)
 return false;
 }
 
-static bool
-map_ring(VuDev *dev, VuVirtq *vq)
-{
-vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr);
-vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr);
-vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr);
-
-DPRINT("Setting virtq addresses:\n");
-DPRINT("vring_desc  at %p\n", vq->vring.desc);
-DPRINT("vring_used  at %p\n", vq->vring.used);
-DPRINT("vring_avail at %p\n", vq->vring.avail);
-
-return !(vq->vring.desc && vq->vring.used && vq->vring.avail);
-}
-
 static bool
 generate_faults(VuDev *dev) {
 unsigned int i;
@@ -882,7 +932,6 @@ generate_faults(VuDev *dev) {
 
 static bool
 vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
-int i;
VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
 
 if (vmsg->fd_num != 1) {
@@ -928,19 +977,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 vmsg->fd_num = 0;
 DPRINT("Successfully added new region in postcopy\n");
 return true;

[PATCH v1 10/15] libvhost-user: Factor out search for memory region by GPA and simplify

2024-02-02 Thread David Hildenbrand
Memory regions cannot overlap, and if we ever hit that case, something
would be really flawed.

For example, when vhost code in QEMU decides to increase the size of memory
regions to cover full huge pages, it makes sure to never create overlaps,
and if there would be overlaps, it would bail out.

QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary
list"), c1ece84e7c93 ("vhost: Huge page align and merge") and
e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that
handling and how overlaps are impossible.

Consequently, each GPA can belong to at most one memory region, and
everything else doesn't make sense. Let's factor out our search to prepare
for further changes.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 79 +--
 1 file changed, 45 insertions(+), 34 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 22154b217f..d036b54ed0 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -195,30 +195,47 @@ vu_panic(VuDev *dev, const char *msg, ...)
  */
 }
 
+/* Search for a memory region that covers this guest physical address. */
+static VuDevRegion *
+vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr)
+{
+unsigned int i;
+
+/*
+ * Memory regions cannot overlap in guest physical address space. Each
+ * GPA belongs to exactly one memory region, so there can only be one
+ * match.
+ */
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *cur = &dev->regions[i];
+
+if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
+return cur;
+}
+}
+return NULL;
+}
+
 /* Translate guest physical address to our virtual address.  */
 void *
 vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr)
 {
-unsigned int i;
+VuDevRegion *r;
 
 if (*plen == 0) {
 return NULL;
 }
 
-/* Find matching memory region.  */
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-
-if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) {
-if ((guest_addr + *plen) > (r->gpa + r->size)) {
-*plen = r->gpa + r->size - guest_addr;
-}
-return (void *)(uintptr_t)
-guest_addr - r->gpa + r->mmap_addr + r->mmap_offset;
-}
+r = vu_gpa_to_mem_region(dev, guest_addr);
+if (!r) {
+return NULL;
 }
 
-return NULL;
+if ((guest_addr + *plen) > (r->gpa + r->size)) {
+*plen = r->gpa + r->size - guest_addr;
+}
+return (void *)(uintptr_t)guest_addr - r->gpa + r->mmap_addr +
+   r->mmap_offset;
 }
 
 /* Translate qemu virtual address to our virtual address.  */
@@ -854,8 +871,8 @@ static inline bool reg_equal(VuDevRegion *vudev_reg,
 static bool
 vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m;
-unsigned int i;
-bool found = false;
+unsigned int idx;
+VuDevRegion *r;
 
 if (vmsg->fd_num > 1) {
 vmsg_close_fds(vmsg);
@@ -882,28 +899,22 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) {
 DPRINT("mmap_offset  0x%016"PRIx64"\n",
msg_region->mmap_offset);
 
-for (i = 0; i < dev->nregions; i++) {
-if (reg_equal(&dev->regions[i], msg_region)) {
-VuDevRegion *r = &dev->regions[i];
-
-munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
-
-/* Shift all affected entries by 1 to close the hole at index. */
-memmove(dev->regions + i, dev->regions + i + 1,
-sizeof(VuDevRegion) * (dev->nregions - i - 1));
-DPRINT("Successfully removed a region\n");
-dev->nregions--;
-i--;
-
-found = true;
-break;
-}
-}
-
-if (!found) {
+r = vu_gpa_to_mem_region(dev, msg_region->guest_phys_addr);
+if (!r || !reg_equal(r, msg_region)) {
+vmsg_close_fds(vmsg);
 vu_panic(dev, "Specified region not found\n");
+return false;
 }
 
+munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset);
+
+idx = r - dev->regions;
+assert(idx < dev->nregions);
+/* Shift all affected entries by 1 to close the hole. */
+memmove(r, r + 1, sizeof(VuDevRegion) * (dev->nregions - idx - 1));
+DPRINT("Successfully removed a region\n");
+dev->nregions--;
+
 vmsg_close_fds(vmsg);
 
 return false;
-- 
2.43.0
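
A worked example of the *plen clamping kept above, as a standalone toy with
made-up region values: a request that runs past the end of the containing
region is truncated at the region boundary:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        const uint64_t gpa = 0x100000, size = 0x1000;  /* one region (example) */
        uint64_t guest_addr = 0x100800, plen = 0x1000; /* crosses the end */

        /* Same clamping as vu_gpa_to_va(): never hand out a length that
         * extends beyond the region containing guest_addr. */
        if (guest_addr + plen > gpa + size) {
            plen = gpa + size - guest_addr;
        }
        printf("clamped plen: 0x%" PRIx64 "\n", plen);  /* prints 0x800 */
        return 0;
    }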




[PATCH v1 04/15] libvhost-user: Factor out removing all mem regions

2024-02-02 Thread David Hildenbrand
Let's factor it out. Note that the check for MAP_FAILED was wrong as
we never set mmap_addr if mmap() failed. We'll remove the NULL check
separately.

Signed-off-by: David Hildenbrand 
---
 subprojects/libvhost-user/libvhost-user.c | 34 ---
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 8a5a7a2295..d5b3468e43 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -240,6 +240,22 @@ qva_to_va(VuDev *dev, uint64_t qemu_addr)
 return NULL;
 }
 
+static void
+vu_remove_all_mem_regs(VuDev *dev)
+{
+unsigned int i;
+
+for (i = 0; i < dev->nregions; i++) {
+VuDevRegion *r = &dev->regions[i];
+void *ma = (void *)(uintptr_t)r->mmap_addr;
+
+if (ma) {
+munmap(ma, r->size + r->mmap_offset);
+}
+}
+dev->nregions = 0;
+}
+
 static void
 vmsg_close_fds(VhostUserMsg *vmsg)
 {
@@ -1003,14 +1019,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg)
 unsigned int i;
VhostUserMemory m = vmsg->payload.memory, *memory = &m;
 
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-void *ma = (void *) (uintptr_t) r->mmap_addr;
-
-if (ma) {
-munmap(ma, r->size + r->mmap_offset);
-}
-}
+vu_remove_all_mem_regs(dev);
 dev->nregions = memory->nregions;
 
 if (dev->postcopy_listening) {
@@ -2112,14 +2121,7 @@ vu_deinit(VuDev *dev)
 {
 unsigned int i;
 
-for (i = 0; i < dev->nregions; i++) {
-VuDevRegion *r = &dev->regions[i];
-void *m = (void *) (uintptr_t) r->mmap_addr;
-if (m != MAP_FAILED) {
-munmap(m, r->size + r->mmap_offset);
-}
-}
-dev->nregions = 0;
+vu_remove_all_mem_regs(dev);
 
 for (i = 0; i < dev->max_queues; i++) {
VuVirtq *vq = &dev->vq[i];
-- 
2.43.0



