Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option
On 24.04.24 20:33, Collin Walling wrote: On 4/24/24 03:24, David Hildenbrand wrote: On 23.04.24 23:06, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. When a query-cpu-model-expansion is provided with the "disable-deprecated-feats" option set, the resulting properties list will include all deprecated features paired with false. Example: { ... "bpb": false, "csske": false, ...} It is recommended that s390 guests operate with these features explicitly disabled to ensure compatibility with future hardware. Signed-off-by: Collin Walling --- target/s390x/cpu_features.c | 14 ++ target/s390x/cpu_features.h | 1 + target/s390x/cpu_models_sysemu.c | 20 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c index d28eb65845..efafc9711c 100644 --- a/target/s390x/cpu_features.c +++ b/target/s390x/cpu_features.c @@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, }; } +void s390_get_deprecated_features(S390FeatBitmap features) +{ +static const int feats[] = { + /* CSSKE is deprecated on newer generations */ + S390_FEAT_CONDITIONAL_SSKE, + S390_FEAT_BPB, +}; +int i; + +for (i = 0; i < ARRAY_SIZE(feats); i++) { +set_bit(feats[i], features); +} +} + #define FEAT_GROUP_INIT(_name, _group, _desc)\ {\ .name = _name, \ diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h index a9bd68a2e1..661a8cd6db 100644 --- a/target/s390x/cpu_features.h +++ b/target/s390x/cpu_features.h @@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, S390FeatType type, uint8_t *data); void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, void (*fn)(const char *name, void *opaque)); +void s390_get_deprecated_features(S390FeatBitmap features); /* Definition of a CPU feature group */ typedef struct { diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c index 
ef9fa80efd..b002819188 100644 --- a/target/s390x/cpu_models_sysemu.c +++ b/target/s390x/cpu_models_sysemu.c @@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void *opaque) /* convert S390CPUDef into a static CpuModelInfo */ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, -bool delta_changes) +bool delta_changes, +bool disable_deprecated_feats) { QDict *qdict = qdict_new(); S390FeatBitmap bitmap; @@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); } +/* features flagged as deprecated */ +if (disable_deprecated_feats) { +bitmap_zero(bitmap, S390_FEAT_MAX); +s390_get_deprecated_features(bitmap); +s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); +} Likely, we should remove these features from the actual bitmap, such that they won't appear twice in the output? I'd expect the cpu_info_from_model() caller to handle that. Just adding them to the list as disabled is likely wrong. For example, if someone were to expand a given model with "... bpb: true" with disable-deprecated-feats=on, the "bpb: true" should be removed, and only replaced by "bpb=false" if bpb would be part of the CPU model we would be expanding to. Or am I missing something? qdict_add_disabled_feat will handle updating the feature if it already exists. I placed the code to process deprecated features as the last step of cpu_info_from_model to override any features that have already been added to the bitmap. Whether it should be the deprecated feats that take priority, or what the user requests, is up in the air, however... Yes, that's one of my concerns. IIRC, if the user specifies the same property multiple times, it's unclear which one will win. "bpb=true,bpb=false" might mean that bpb=true might win. I think this is something that this interface should sort out, so it will be actually usable! ... 
Daniel's suggestion to modify the QMP response to include a separate list of "deprecated-props" seems like a much more efficient and readable way that alleviates both your and Markus' concerns. Would you only include properties that would apply to the current model and would be set to true in the current model? How would libvirt make use of this interface, and could it run into the issue spelled out above? -- Cheers, David / dhildenb
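The dedup David asks for above (drop deprecated features from the model's enabled set first, so each feature is emitted exactly once, as disabled) can be sketched with plain bit operations. This is a toy stand-in, not QEMU's S390FeatBitmap API: the helper names and feature numbers are hypothetical, using a single uint64_t as the bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for QEMU's bitmap helpers; feature numbers
 * are bit positions in a uint64_t, for illustration only. */
enum { FEAT_CSSKE = 0, FEAT_BPB = 1, FEAT_OTHER = 2 };

static void set_bit64(int nr, uint64_t *map)  { *map |= (UINT64_C(1) << nr); }
static bool test_bit64(int nr, uint64_t map)  { return map & (UINT64_C(1) << nr); }

/*
 * Sketch of the suggested dedup: remove deprecated features from the
 * enabled bitmap before reporting, so the output never contains both
 * "bpb": true and "bpb": false for the same expansion.
 */
static void apply_deprecated(uint64_t *enabled, uint64_t *disabled)
{
    uint64_t deprecated = 0;
    set_bit64(FEAT_CSSKE, &deprecated);
    set_bit64(FEAT_BPB, &deprecated);

    *enabled &= ~deprecated;   /* drop from the enabled set ... */
    *disabled |= deprecated;   /* ... and report as explicitly disabled */
}
```

The same effect in QEMU proper would presumably use bitmap_andnot() on the real S390FeatBitmap before calling s390_feat_bitmap_to_ascii() with qdict_add_disabled_feat.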
Re: [PATCH v2 2/3] target/s390x: add support for "disable-deprecated-feats" expansion option
On 23.04.24 23:06, Collin Walling wrote: Retain a list of deprecated features disjoint from any particular CPU model. When a query-cpu-model-expansion is provided with the "disable-deprecated-feats" option set, the resulting properties list will include all deprecated features paired with false. Example: { ... "bpb": false, "csske": false, ...} It is recommended that s390 guests operate with these features explicitly disabled to ensure compatibility with future hardware. Signed-off-by: Collin Walling --- target/s390x/cpu_features.c | 14 ++ target/s390x/cpu_features.h | 1 + target/s390x/cpu_models_sysemu.c | 20 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/target/s390x/cpu_features.c b/target/s390x/cpu_features.c index d28eb65845..efafc9711c 100644 --- a/target/s390x/cpu_features.c +++ b/target/s390x/cpu_features.c @@ -212,6 +212,20 @@ void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, }; } +void s390_get_deprecated_features(S390FeatBitmap features) +{ +static const int feats[] = { + /* CSSKE is deprecated on newer generations */ + S390_FEAT_CONDITIONAL_SSKE, + S390_FEAT_BPB, +}; +int i; + +for (i = 0; i < ARRAY_SIZE(feats); i++) { +set_bit(feats[i], features); +} +} + #define FEAT_GROUP_INIT(_name, _group, _desc)\ {\ .name = _name, \ diff --git a/target/s390x/cpu_features.h b/target/s390x/cpu_features.h index a9bd68a2e1..661a8cd6db 100644 --- a/target/s390x/cpu_features.h +++ b/target/s390x/cpu_features.h @@ -69,6 +69,7 @@ void s390_add_from_feat_block(S390FeatBitmap features, S390FeatType type, uint8_t *data); void s390_feat_bitmap_to_ascii(const S390FeatBitmap features, void *opaque, void (*fn)(const char *name, void *opaque)); +void s390_get_deprecated_features(S390FeatBitmap features); /* Definition of a CPU feature group */ typedef struct { diff --git a/target/s390x/cpu_models_sysemu.c b/target/s390x/cpu_models_sysemu.c index ef9fa80efd..b002819188 100644 --- a/target/s390x/cpu_models_sysemu.c +++ 
b/target/s390x/cpu_models_sysemu.c @@ -171,7 +171,8 @@ static void qdict_add_enabled_feat(const char *name, void *opaque) /* convert S390CPUDef into a static CpuModelInfo */ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, -bool delta_changes) +bool delta_changes, +bool disable_deprecated_feats) { QDict *qdict = qdict_new(); S390FeatBitmap bitmap; @@ -201,6 +202,13 @@ static void cpu_info_from_model(CpuModelInfo *info, const S390CPUModel *model, s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); } +/* features flagged as deprecated */ +if (disable_deprecated_feats) { +bitmap_zero(bitmap, S390_FEAT_MAX); +s390_get_deprecated_features(bitmap); +s390_feat_bitmap_to_ascii(bitmap, qdict, qdict_add_disabled_feat); +} Likely, we should remove these features from the actual bitmap, such that they won't appear twice in the output? I'd expect the cpu_info_from_model() caller to handle that. Just adding them to the list as disabled is likely wrong. For example, if someone were to expand a given model with "... bpb: true" with disable-deprecated-feats=on, the "bpb: true" should be removed, and only replaced by "bpb=false" if bpb would be part of the CPU model we would be expanding to. Or am I missing something? -- Cheers, David / dhildenb
Re: [PATCH 0/3] Remove useless architecture prefix from the CPU list
On 20.04.24 07:46, Thomas Huth wrote: Printing an architecture prefix in front of each CPU name is not helpful at all: It is confusing for the users since they don't know whether they have to specify these letters for the "-cpu" parameter, too, and it also takes some precious space in the dense output of the CPU entries. Let's simply remove those now. Yes, I also never really understood the purpose. -- Cheers, David / dhildenb
[PATCH v1] virtio-mem: improve error message when unplug of device fails due to plugged memory
The error message is actually expressive, considering QEMU only. But when called from Libvirt, talking about "size" can be confusing, because in Libvirt "size" translates to the memory backend size in QEMU (maximum size) and "current" translates to the QEMU "size" property. Let's simply avoid talking about the "size" property and spell out that some device memory is still plugged. Cc: Liang Cong Cc: Mario Casquero Cc: "Michael S. Tsirkin" Signed-off-by: David Hildenbrand --- hw/virtio/virtio-mem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index ffd119ebac..ef64bf1b4a 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -1832,8 +1832,8 @@ static void virtio_mem_unplug_request_check(VirtIOMEM *vmem, Error **errp) } if (vmem->size) { -error_setg(errp, "virtio-mem device cannot get unplugged while" - " '" VIRTIO_MEM_SIZE_PROP "' != '0'"); +error_setg(errp, "virtio-mem device cannot get unplugged while some" + " of its memory is still plugged"); return; } if (vmem->requested_size) { -- 2.44.0
Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 08.04.24 09:58, Stefano Garzarella wrote: On Thu, Apr 04, 2024 at 04:09:34PM +0200, David Hildenbrand wrote: On 04.04.24 14:23, Stefano Garzarella wrote: shm_open() creates and opens a new POSIX shared memory object. A POSIX shared memory object allows creating memory backend with an associated file descriptor that can be shared with external processes (e.g. vhost-user). The new `memory-backend-shm` can be used as an alternative when `memory-backend-memfd` is not available (Linux only), since shm_open() should be provided by any POSIX-compliant operating system. This backend mimics memfd, allocating memory that is practically anonymous. In theory shm_open() requires a name, but this is allocated for a short time interval and shm_unlink() is called right after shm_open(). After that, only fd is shared with external processes (e.g., vhost-user) as if it were associated with anonymous memory. In the future we may also allow the user to specify the name to be passed to shm_open(), but for now we keep the backend simple, mimicking anonymous memory such as memfd. Signed-off-by: Stefano Garzarella --- v3 - enriched commit message and documentation to highlight that we want to mimic memfd (David) --- docs/system/devices/vhost-user.rst | 5 +- qapi/qom.json | 17 + backends/hostmem-shm.c | 118 + backends/meson.build | 1 + qemu-options.hx| 11 +++ 5 files changed, 150 insertions(+), 2 deletions(-) create mode 100644 backends/hostmem-shm.c diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst index 9b2da106ce..35259d8ec7 100644 --- a/docs/system/devices/vhost-user.rst +++ b/docs/system/devices/vhost-user.rst @@ -98,8 +98,9 @@ Shared memory object In order for the daemon to access the VirtIO queues to process the requests it needs access to the guest's address space. This is -achieved via the ``memory-backend-file`` or ``memory-backend-memfd`` -objects. 
A reference to a file-descriptor which can access this object +achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or +``memory-backend-shm`` objects. +A reference to a file-descriptor which can access this object will be passed via the socket as part of the protocol negotiation. Currently the shared memory object needs to match the size of the main diff --git a/qapi/qom.json b/qapi/qom.json index 85e6b4f84a..5252ec69e3 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -721,6 +721,19 @@ '*hugetlbsize': 'size', '*seal': 'bool' } } +## +# @MemoryBackendShmProperties: +# +# Properties for memory-backend-shm objects. +# +# The @share boolean option is true by default with shm. +# +# Since: 9.1 +## +{ 'struct': 'MemoryBackendShmProperties', + 'base': 'MemoryBackendProperties', + 'data': { } } + Acked-by: David Hildenbrand One comment: we should maybe just forbid setting share=off. it doesn't make any sense and it can even result in an unexpected double memory consumption. We missed doing that for memfd, unfortunately. Good point! IIUC the `share` property is defined by the parent `hostmem`, so I should find a way to override the property here and disable the setter, or add an option to `hostmem` to make the property non-writable. Right, or simply fail later when you would find "share=off" in shm_backend_memory_alloc(). When ever supporting named shmem_open(), it could make sense for VM snapshotting. Right now it doesn't really make any sense. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v3 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 04.04.24 14:23, Stefano Garzarella wrote: shm_open() creates and opens a new POSIX shared memory object. A POSIX shared memory object allows creating memory backend with an associated file descriptor that can be shared with external processes (e.g. vhost-user). The new `memory-backend-shm` can be used as an alternative when `memory-backend-memfd` is not available (Linux only), since shm_open() should be provided by any POSIX-compliant operating system. This backend mimics memfd, allocating memory that is practically anonymous. In theory shm_open() requires a name, but this is allocated for a short time interval and shm_unlink() is called right after shm_open(). After that, only fd is shared with external processes (e.g., vhost-user) as if it were associated with anonymous memory. In the future we may also allow the user to specify the name to be passed to shm_open(), but for now we keep the backend simple, mimicking anonymous memory such as memfd. Signed-off-by: Stefano Garzarella --- v3 - enriched commit message and documentation to highlight that we want to mimic memfd (David) --- docs/system/devices/vhost-user.rst | 5 +- qapi/qom.json | 17 + backends/hostmem-shm.c | 118 + backends/meson.build | 1 + qemu-options.hx| 11 +++ 5 files changed, 150 insertions(+), 2 deletions(-) create mode 100644 backends/hostmem-shm.c diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst index 9b2da106ce..35259d8ec7 100644 --- a/docs/system/devices/vhost-user.rst +++ b/docs/system/devices/vhost-user.rst @@ -98,8 +98,9 @@ Shared memory object In order for the daemon to access the VirtIO queues to process the requests it needs access to the guest's address space. This is -achieved via the ``memory-backend-file`` or ``memory-backend-memfd`` -objects. A reference to a file-descriptor which can access this object +achieved via the ``memory-backend-file``, ``memory-backend-memfd``, or +``memory-backend-shm`` objects. 
+A reference to a file-descriptor which can access this object will be passed via the socket as part of the protocol negotiation. Currently the shared memory object needs to match the size of the main diff --git a/qapi/qom.json b/qapi/qom.json index 85e6b4f84a..5252ec69e3 100644 --- a/qapi/qom.json +++ b/qapi/qom.json @@ -721,6 +721,19 @@ '*hugetlbsize': 'size', '*seal': 'bool' } } +## +# @MemoryBackendShmProperties: +# +# Properties for memory-backend-shm objects. +# +# The @share boolean option is true by default with shm. +# +# Since: 9.1 +## +{ 'struct': 'MemoryBackendShmProperties', + 'base': 'MemoryBackendProperties', + 'data': { } } + Acked-by: David Hildenbrand One comment: we should maybe just forbid setting share=off. it doesn't make any sense and it can even result in an unexpected double memory consumption. We missed doing that for memfd, unfortunately. -- Cheers, David / dhildenb
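The share=off concern raised above can be demonstrated directly: a MAP_PRIVATE mapping of an fd-backed object goes copy-on-write on the first store, so the change never reaches the underlying object (which is what a vhost-user daemon would map via the shared fd), and the private CoW copy doubles memory consumption. A minimal sketch, using an ordinary temporary file rather than a shm object for portability:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a store through a MAP_PRIVATE mapping is visible in the
 * underlying file, 0 if not. POSIX guarantees the answer is 0: private
 * mappings are copy-on-write, which is exactly why share=off makes no
 * sense for a backend whose fd is handed to another process.
 */
static int private_write_reaches_file(void)
{
    FILE *f = tmpfile();
    int fd = fileno(f);
    if (ftruncate(fd, 4096) < 0) {
        return -1;
    }

    char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (priv == MAP_FAILED) {
        return -1;
    }
    priv[0] = 'X';             /* CoW: only the private copy changes */

    char back = 0;
    pread(fd, &back, 1, 0);    /* what another process would observe */

    munmap(priv, 4096);
    fclose(f);
    return back == 'X';
}
```

With MAP_SHARED instead, the store would be visible through the fd, which is the behavior vhost-user requires.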
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. `O_CREAT | O_EXCL` should provide just this behavior. From shm_open(3) manpage: O_EXCL If O_CREAT was also specified, and a shared memory object with the given name already exists, return an error. The check for the existence of the object, and its creation if it does not exist, are performed atomically. Cool! That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? Right. It's a name that will only "appear" in the system for a very short time since we call shm_unlink() right after shm_open(). So I just need the unique name to prevent several QEMUs launched at the same time from colliding with the name. Okay, essentially emulating tmpfile(), but for weird shmem_open() :) So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. Exactly! I can clarify this in the commit description as well. Thanks for the helpful comments! If there is anything I need to change for v3, please tell me ;-) Besides some patch description extension, makes sense to me :) -- Cheers, David / dhildenb
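The tmpfile()-like pattern agreed on above (unique name, O_CREAT | O_EXCL so concurrent QEMUs can never collide silently, then immediate shm_unlink() so only the fd survives) can be sketched as follows. The name scheme here is illustrative, not QEMU's actual one, and on older glibc the program needs -lrt for shm_open():

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create an "anonymous" POSIX shared memory object: O_EXCL guarantees we
 * never reuse an existing object, and shm_unlink() right after shm_open()
 * removes the name, leaving memory reachable only through the fd --
 * mimicking memfd_create() on systems that lack it.
 */
static int anon_shm_fd(off_t size)
{
    char name[64];
    snprintf(name, sizeof(name), "/example-shm-%d", (int)getpid());

    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) {
        return -1;
    }
    shm_unlink(name);          /* object lives on via the open fd only */
    if (ftruncate(fd, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned fd can then be mmap()ed with MAP_SHARED and passed to another process over a UNIX socket, as vhost-user does.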
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
On 27.03.24 11:23, Stefano Garzarella wrote: On Tue, Mar 26, 2024 at 03:45:52PM +0100, David Hildenbrand wrote: +mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I thought about it and initially did it that way, but then some problems came up so I tried to keep it as simple as possible for the user and for our use case (having an fd associated with memory and sharing it with other processes). The problems I had were: - what mode_t to use if the object does not exist and needs to be created? > - exclude O_EXCL if the user passes the name since they may have already created it? I'd handle both like we handle files. But I understand now that you really just want to "simulate" memfd. - call shm_unlink() only at object deallocation? Right, we don't really do that for ordinary files, they keep existing. We only have the "discard-data" property to free up memory. For memfd, it just happens "automatically". They cannot be looked up by name, and once the last user closes the fd, it gets destroyed. So I thought that for now we only support the "anonymous" mode, and if in the future we have a use case where the user wants to specify the name, we can add it later. Okay, so for now you really only want an anonymous fd that behaves like a memfd, got it. Likely we should somehow fail if the "POSIX shared memory object" already exists. Hmm. 
That said, if you think it's already useful from the beginning, I can add the name as an optional parameter. I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. What problem do you see? As an alternative I thought of a static counter. If it's really just an "internal UUID", as we'll remove the file using shm_unlink(), then the name in fact shouldn't matter and this would be fine. Correct? So I assume if we ever want to have non-anonymous fds here, we'd pass in a new property "name", and change the logic how we open/unlink. Makes sense. -- Cheers, David / dhildenb
Re: [PATCH-for-9.1 v2 13/21] hw/mem/pc-dimm: Remove legacy_align argument from pc_dimm_pre_plug()
On 27.03.24 10:51, Philippe Mathieu-Daudé wrote: 'legacy_align' is always NULL, remove it. Signed-off-by: Philippe Mathieu-Daudé Reviewed-by: Thomas Huth Message-Id: <20240305134221.30924-11-phi...@linaro.org> --- I was really confused for a second until I saw that this series is dependent on another one. Please make sure to CC people you CC on patches to also CC on the cover letter. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH-for-9.1 v2 14/21] hw/mem/memory-device: Remove legacy_align from memory_device_pre_plug()
void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms, -const uint64_t *legacy_align, Error **errp) +Error **errp) { const MemoryDeviceClass *mdc = MEMORY_DEVICE_GET_CLASS(md); Error *local_err = NULL; @@ -388,14 +388,10 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms, return; } -if (legacy_align) { -align = *legacy_align; -} else { -if (mdc->get_min_alignment) { -align = mdc->get_min_alignment(md); -} -align = MAX(align, memory_region_get_alignment(mr)); +if (mdc->get_min_alignment) { +align = mdc->get_min_alignment(md); } +align = MAX(align, memory_region_get_alignment(mr)); addr = mdc->get_addr(md); addr = memory_device_get_free_addr(ms, !addr ? NULL : &addr, align, memory_region_size(mr), &local_err); Very nice! Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing
I was wondering if we could see some partial sendmsg()/write succeeding. Meaning, we transferred some bytes but not all, and we'd actually need to loop ... Yep, true, but I would fix it in another patch/series if you agree. Absolutely. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 09/11] hostmem: add a new memory backend based on POSIX shm_open()
+mode = 0; +oflag = O_RDWR | O_CREAT | O_EXCL; +backend_name = host_memory_backend_get_name(backend); + +/* + * Some operating systems allow creating anonymous POSIX shared memory + * objects (e.g. FreeBSD provides the SHM_ANON constant), but this is not + * defined by POSIX, so let's create a unique name. + * + * From Linux's shm_open(3) man-page: + * For portable use, a shared memory object should be identified + * by a name of the form /somename;" + */ +g_string_printf(shm_name, "/qemu-" FMT_pid "-shm-%s", getpid(), +backend_name); Hm, shouldn't we just let the user specify a name, and if no name was specified, generate one ourselves? I'm also not quite sure if "host_memory_backend_get_name()" should be used for the purpose here. -- Cheers, David / dhildenb
Re: [PATCH for-9.1 v2 02/11] libvhost-user: fail vu_message_write() if sendmsg() is failing
On 26.03.24 15:34, Eric Blake wrote: On Tue, Mar 26, 2024 at 02:39:27PM +0100, Stefano Garzarella wrote: In vu_message_write() we use sendmsg() to send the message header, then a write() to send the payload. If sendmsg() fails we should avoid sending the payload, since we were unable to send the header. Discovered before fixing the issue with the previous patch, where sendmsg() failed on macOS due to wrong parameters, but the frontend still sent the payload which the backend incorrectly interpreted as a wrong header. Signed-off-by: Stefano Garzarella --- subprojects/libvhost-user/libvhost-user.c | 5 + 1 file changed, 5 insertions(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 22bea0c775..a11afd1960 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -639,6 +639,11 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg) rc = sendmsg(conn_fd, &msg, 0); } while (rc < 0 && (errno == EINTR || errno == EAGAIN)); +if (rc <= 0) { Is rejecting a 0 return value correct? Technically, a 0 return means a successful write of 0 bytes - but then again, it is unwise to attempt to send an fd or other auxiliary data without at least one regular byte, so we should not be attempting a write of 0 bytes. So I guess this one is okay, although I might have used < instead of <=. I was wondering if we could see some partial sendmsg()/write succeeding. Meaning, we transferred some bytes but not all, and we'd actually need to loop ... -- Cheers, David / dhildenb
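The partial-write loop David is asking about would look roughly like this. A short write() (or sendmsg()) is a success, not an error, so the caller has to retry with the remaining bytes; this sketch is simplified -- a real vu_message_write() would also have to attach the msg_control fd payload only to the first successful sendmsg():

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write all of buf, looping over partial writes. EINTR/EAGAIN restart
 * the same chunk (note: on a non-blocking fd this busy-waits; a real
 * implementation would poll() instead).
 */
static ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t rc = write(fd, p + done, len - done);
        if (rc < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                continue;      /* transient: retry the same chunk */
            }
            return -1;         /* hard error */
        }
        done += (size_t)rc;    /* partial write: advance and keep going */
    }
    return (ssize_t)done;
}
```

The same shape applies on the read side, where short reads are even more common.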
Re: [PATCH for-9.1 v2 01/11] libvhost-user: set msg.msg_control to NULL when it is empty
On 26.03.24 14:39, Stefano Garzarella wrote: On some OS (e.g. macOS) sendmsg() returns -1 (errno EINVAL) if the `struct msghdr` has the field `msg_controllen` set to 0, but `msg_control` is not NULL. Signed-off-by: Stefano Garzarella --- subprojects/libvhost-user/libvhost-user.c | 1 + 1 file changed, 1 insertion(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a879149fef..22bea0c775 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -632,6 +632,7 @@ vu_message_write(VuDev *dev, int conn_fd, VhostUserMsg *vmsg) memcpy(CMSG_DATA(cmsg), vmsg->fds, fdsize); } else { msg.msg_controllen = 0; +msg.msg_control = NULL; } do { Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands
# # Usually, a CPU model is compared against the maximum possible CPU # model of a certain configuration (e.g. the "host" model for KVM). @@ -154,7 +155,14 @@ # Some architectures may not support comparing CPU models. s390x # supports comparing CPU models. # -# Returns: a CpuModelBaselineInfo +# @modela: description of the first CPU model to compare, referred to as +# "model A" in CpuModelCompareResult +# +# @modelb: description of the second CPU model to compare, referred to as +# "model B" in CpuModelCompareResult +# +# Returns: a CpuModelCompareInfo, describing how both CPU models Scratch the comma? Agreed to both. Do you want to fixup when applying or do you prefer a v2? Thanks! Reviewed-by: Markus Armbruster -- Cheers, David / dhildenb
[PATCH v1] qapi: document parameters of query-cpu-model-* QAPI commands
Let's document the parameters of these commands, so we can remove them from the "documentation-exceptions" list. While at it, extend the "Returns:" documentation as well, fixing a wrong use of CpuModelBaselineInfo vs. CpuModelCompareInfo for query-cpu-model-comparison. Cc: Markus Armbruster Cc: Eric Blake Cc: Eduardo Habkost Cc: Marcel Apfelbaum Cc: "Philippe Mathieu-Daudé" Cc: Yanan Wang Signed-off-by: David Hildenbrand --- qapi/machine-target.json | 46 +++- qapi/pragma.json | 3 --- 2 files changed, 31 insertions(+), 18 deletions(-) diff --git a/qapi/machine-target.json b/qapi/machine-target.json index 519adf3220..7883616cce 100644 --- a/qapi/machine-target.json +++ b/qapi/machine-target.json @@ -124,11 +124,12 @@ ## # @query-cpu-model-comparison: # -# Compares two CPU models, returning how they compare in a specific -# configuration. The results indicates how both models compare -# regarding runnability. This result can be used by tooling to make -# decisions if a certain CPU model will run in a certain configuration -# or if a compatible CPU model has to be created by baselining. +# Compares two CPU models, @modela and @modelb, returning how they +# compare in a specific configuration. The results indicates how +# both models compare regarding runnability. This result can be +# used by tooling to make decisions if a certain CPU model will +# run in a certain configuration or if a compatible CPU model has +# to be created by baselining. # # Usually, a CPU model is compared against the maximum possible CPU # model of a certain configuration (e.g. the "host" model for KVM). @@ -154,7 +155,14 @@ # Some architectures may not support comparing CPU models. s390x # supports comparing CPU models. 
# -# Returns: a CpuModelBaselineInfo +# @modela: description of the first CPU model to compare, referred to as +# "model A" in CpuModelCompareResult +# +# @modelb: description of the second CPU model to compare, referred to as +# "model B" in CpuModelCompareResult +# +# Returns: a CpuModelCompareInfo, describing how both CPU models +# compare # # Errors: # - if comparing CPU models is not supported @@ -175,9 +183,9 @@ ## # @query-cpu-model-baseline: # -# Baseline two CPU models, creating a compatible third model. The -# created model will always be a static, migration-safe CPU model (see -# "static" CPU model expansion for details). +# Baseline two CPU models, @modela and @modelb, creating a compatible +# third model. The created model will always be a static, +# migration-safe CPU model (see "static" CPU model expansion for details). # # This interface can be used by tooling to create a compatible CPU # model out two CPU models. The created CPU model will be identical @@ -204,7 +212,11 @@ # Some architectures may not support baselining CPU models. s390x # supports baselining CPU models. # -# Returns: a CpuModelBaselineInfo +# @modela: description of the first CPU model to baseline +# +# @modelb: description of the second CPU model to baseline +# +# Returns: a CpuModelBaselineInfo, describing the baselined CPU model # # Errors: # - if baselining CPU models is not supported @@ -243,10 +255,10 @@ ## # @query-cpu-model-expansion: # -# Expands a given CPU model (or a combination of CPU model + -# additional options) to different granularities, allowing tooling to -# get an understanding what a specific CPU model looks like in QEMU -# under a certain configuration. +# Expands a given CPU model, @model, (or a combination of CPU model + +# additional options) to different granularities, specified by +# @type, allowing tooling to get an understanding what a specific +# CPU model looks like in QEMU under a certain configuration. 
# # This interface can be used to query the "host" CPU model. # @@ -269,7 +281,11 @@ # Some architectures may not support all expansion types. s390x # supports "full" and "static". Arm only supports "full". # -# Returns: a CpuModelExpansionInfo +# @model: description of the CPU model to expand +# +# @type: expansion type, specifying how to expand the CPU model +# +# Returns: a CpuModelExpansionInfo, describing the expanded CPU model # # Errors: # - if expanding CPU models is not supported diff --git a/qapi/pragma.json b/qapi/pragma.json index 6929ab776e..0d82bc1a03 100644 --- a/qapi/pragma.json +++ b/qapi/pragma.json @@ -90,9 +90,6 @@ 'XDbgBlockGraph', 'YankInstanceType', 'blockdev-reopen', -'query-cpu-model-baseline', -'query-cpu-model-comparison', -'query-cpu-model-expansion', 'query-rocker', 'query-rocker-ports', 'query-stats-schemas', -- 2.43.2
Re: Let's close member documentation gaps
On 25.03.24 15:13, David Hildenbrand wrote: On 25.03.24 10:36, Markus Armbruster wrote: If you're cc'ed, I have a bit of doc work for you. Search for your name to find it. The QAPI generator forces you to document your stuff. Except for commands, events, enum and object types listed in pragma documentation-exceptions, the generator silently defaults missing documentation to "Not documented". Right now, we're using this loophole some 500 times. What would be the right step to make sure I am resolving these "hidden" issues, and that the QAPI generator would be happy with my changes? Ah, I assume simply removing the offender from "qapi/pragma.json" and then compiling. -- Cheers, David / dhildenb
Re: Let's close member documentation gaps
On 25.03.24 10:36, Markus Armbruster wrote: If you're cc'ed, I have a bit of doc work for you. Search for your name to find it. The QAPI generator forces you to document your stuff. Except for commands, events, enum and object types listed in pragma documentation-exceptions, the generator silently defaults missing documentation to "Not documented". Right now, we're using this loophole some 500 times. What would be the right step to make sure I am resolving these "hidden" issues, and that the QAPI generator would be happy with my changes? -- Cheers, David / dhildenb
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 20.03.24 18:38, Michael Roth wrote: On Wed, Mar 20, 2024 at 10:37:14AM +0100, David Hildenbrand wrote: On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in v4: - Drop ram_block_convert_range() and open code its implementation in the next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" Please make sure to send at least the cover letter along (I might not need the other 46 patches :D ). Sorry for the confusion, you got auto-Cc'd by git, which is good, but not sure there's a good way to make sure everyone gets a copy of the cover letter. I could see how it would be useful to potential reviewers though. I'll try to come up with a script for it and take that approach in the future. A script shared with me in the past to achieve that in most cases:

$ cat cc-cmd.sh
#!/bin/bash

if [[ $1 == *gitsendemail.msg* || $1 == *cover-letter* ]]; then
    grep ': .* <.*@.*>' -h *.patch | sed 's/^.*: //' | sort | uniq
fi

And attach to "git send-email ... *.patch": --cc-cmd=./cc-cmd.sh -- Cheers, David / dhildenb
Re: [PATCH 2/4] hw/s390x/virtio-ccw: Always deliver NMI to first CPU
On 20.02.24 16:08, Philippe Mathieu-Daudé wrote: We can trigger NMI from HMP or QMP. QEMU maps the NMI to the s390x per-CPU 'RESTART' interrupt. Linux guests usually setup this interrupt to trigger kdump or crash. Such crashdump can be triggered in QEMU by HMP "nmi" or QMP "inject-nmi" commands. Using QMP, since we can not select a particular CPU, the first CPU is used (CPU#0). See the documentation from commit 795dc6e4 ("watchdog: Add new Virtual Watchdog action INJECT-NMI"): @inject-nmi: a non-maskable interrupt is injected into the first VCPU (all VCPUS on x86) (since 2.4) While we can select a particular CPU on HMP, the guest behavior is expected to be the same if using CPU #N or CPU #0. Since always using CPU#0 simplifies API maintainance, update s390_nmi() to deliver NMI to the first CPU. Signed-off-by: Philippe Mathieu-Daudé --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 20.03.24 13:43, Xiaoyao Li wrote: On 3/20/2024 5:37 PM, David Hildenbrand wrote: On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Michael is using the patch from my TDX-QEMU v5 series[1]. I need to fix it. Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in in v4: - Drop ram_block_convert_range() and open code its implementation in the next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" As above, because the guest_memfd patches in my TDX-QEMU v5[1] were directly picked for this series, so the change history says v5. They are needed by SEV-SNP as well. Thanks, I was missing the context without a cover letter. These comments here likely should be dropped here. -- Cheers, David / dhildenb
Re: [PATCH v3 11/49] physmem: Introduce ram_block_discard_guest_memfd_range()
On 20.03.24 09:39, Michael Roth wrote: From: Xiaoyao Li When memory page is converted from private to shared, the original private memory is back'ed by guest_memfd. Introduce ram_block_discard_guest_memfd_range() for discarding memory in guest_memfd. Originally-from: Isaku Yamahata Codeveloped-by: Xiaoyao Li "Co-developed-by" Signed-off-by: Xiaoyao Li Reviewed-by: David Hildenbrand Your SOB should go here. --- Changes in v5: - Collect Reviewed-by from David; Changes in in v4: - Drop ram_block_convert_range() and open code its implementation in the next Patch. Signed-off-by: Michael Roth I only received 3 patches from this series, and now I am confused: changelog talks about v5 and this is "PATCH v3" Please make sure to send at least the cover letter along (I might not need the other 46 patches :D ). -- Cheers, David / dhildenb
Re: Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
On 19.03.24 15:23, Peter Maydell wrote: On Tue, 19 Mar 2024 at 09:24, David Hildenbrand wrote: I spotted new pause_all_vcpus() / resume_all_vcpus() calls in hw/intc/arm_gicv3_kvm.c and thought they would be the problematic bit. Yeah, that's going to be problematic. Further note that a lot of code does not expect that the BQL is suddenly dropped. Agreed; we already have one nasty set of bugs in the framebuffer devices because a function drops the BQL briefly: https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/#u so let's avoid introducing any more of a similar kind. Side note, the pause_all_vcpus()/resume_all_vcpus() calls in hw/i386/vapic.c are probably a bit suspect for similar reasons. Exactly my thoughts. But there, it was less clear "why" it is even required. It's only performed for KVM. Do we just want to stop KVM threads from executing instructions? If so, blocking KVM ioctls might be a reasonable "replacement". Really not sure. -- Cheers, David / dhildenb
Re: Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
On 19.03.24 10:24, David Hildenbrand wrote: On 19.03.24 06:06, zhukeqian wrote: Hi David, Thanks for reviewing. On 17.03.24 09:37, Keqian Zhu via wrote: Both main loop thread and vCPU thread are allowed to call pause_all_vcpus(), and in general resume_all_vcpus() is called after it. Two issues live in pause_all_vcpus(): In general, calling pause_all_vcpus() from VCPU threads is quite dangerous. Do we have reproducers for the cases below? I produce the issues by testing ARM vCPU hotplug feature: QEMU changes for vCPU hotplug could be cloned from below site, https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2 Guest Kernel changes (by James Morse, ARM) are available here: https://git.kernel.org/pub/scm/linux/kernel/git/morse/linux.git virtual_cpu_hotplug/rfc/v2 Thanks for these infos (would be reasonable to include that in the cover letter). Okay, so likely this is not actually a "fix" for upstream as it is. Understood. The procedure to produce problems: 1. Startup a Linux VM (e.g., called OS-vcpuhotplug) with 32 possible vCPUs and 16 current vCPUs. 2. Log in guestOS and run script[1] to continuously online/offline CPU. 3. At host side, run script[2] to continuously hotplug/unhotplug vCPU. After several minutes, we can hit these problems. 
Script[1] to online/offline CPU:

for ((time=1;time<1000;time++)); do
    for ((cpu=16;cpu<32;cpu++)); do
        echo 1 > /sys/devices/system/cpu/cpu$cpu/online
    done
    for ((cpu=16;cpu<32;cpu++)); do
        echo 0 > /sys/devices/system/cpu/cpu$cpu/online
    done
done

Script[2] to hotplug/unhotplug vCPU:

for ((time=1;time<100;time++)); do
    echo $time
    for ((cpu=16;cpu<=32;cpu++)); do
        echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
        virsh setvcpus OS-vcpuhotplug --count $cpu --live
        sleep 2
    done
    for ((cpu=32;cpu>=16;cpu--)); do
        echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
        virsh setvcpus OS-vcpuhotplug --count $cpu --live
        sleep 2
    done
    for ((cpu=16;cpu<=32;cpu+=2)); do
        echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
        virsh setvcpus OS-vcpuhotplug --count $cpu --live
        sleep 2
    done
    for ((cpu=32;cpu>=16;cpu-=2)); do
        echo "virsh setvcpus OS-vcpuhotplug --count $cpu --live"
        virsh setvcpus OS-vcpuhotplug --count $cpu --live
        sleep 2
    done
done

The script[1] will call PSCI CPU_ON, which is emulated by QEMU, which results in calling cpu_reset() on the vCPU thread. I spotted new pause_all_vcpus() / resume_all_vcpus() calls in hw/intc/arm_gicv3_kvm.c and thought they would be the problematic bit. Yeah, that's going to be problematic. Further note that a lot of code does not expect that the BQL is suddenly dropped. We had issues with that in different context where we ended up wanting to use pause/resume from VCPU context: https://lore.kernel.org/all/294a987d-b0ef-1b58-98ac-0d4d43075...@redhat.com/ This sounds like a bad idea. Read below. For ARM architecture, it needs to reset GICC registers, which is only possible when all vcpus are paused. So script[1] will call pause_all_vcpus() in the vCPU thread. The script[2] also calls cpu_reset() for the newly hotplugged vCPU, which is done in the main loop thread. So this scenario causes problems as I state in the commit message. 1.
There is possibility that during thread T1 waits on qemu_pause_cond with bql unlocked, other thread has called pause_all_vcpus() and resume_all_vcpus(), then thread T1 will stuck, because the condition all_vcpus_paused() is always false. How can this happen? Two threads calling pause_all_vcpus() is borderline broken, as you note. IIRC, we should call pause_all_vcpus() only if some other mechanism prevents these races. For example, based on runstate changes. We already has bql to prevent concurrent calling of pause_all_vcpus() and resume_all_vcpus(). But pause_all_vcpus() will unlock bql in the half way, which gives change for other thread to call pause and resume. In the past, code does not consider this problem, now I add retry mechanism to fix it. Note that BQL did not prevent concurrent calling of pause_all_vcpus(). There had to be something else. Likely that was runstate transitions. Just imagine one thread calling pause_all_vcpus() while another one calls resume_all_vcpus(). It cannot possibly work. With bql, we can make sure all vcpus are paused after pause_all_vcpus() finish, and all vcpus are resumed after resume_all_vcpus() finish. For example, the following situation may occur: Thread T1: lock bql ->pau
Re: Re: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition
On 19.03.24 06:11, zhukeqian wrote: Hi David, On 17.03.24 09:37, Keqian Zhu via wrote: For vCPU being hotplugged, qemu_init_vcpu() is called. In this function, we set vcpu state as stopped, and then wait vcpu thread to be created. As the vcpu state is stopped, it will inform us it has been created and then wait on halt_cond. After we has realized vcpu object, we will resume the vcpu thread. However, during we wait vcpu thread to be created, the bql is unlocked, and other thread is allowed to call resume_all_vcpus(), which will resume the un-realized vcpu. This fixes the issue by filter out un-realized vcpu during resume_all_vcpus(). Similar question: is there a reproducer? How could we currently hotplug a VCPU, and while it is being created, see pause_all_vcpus()/resume_all_vcpus() getting called? I described the reason for this at patch 1. If I am not getting this wrong, there seems to be some other mechanism missing that makes sure that this cannot happen. Dropping the BQL half-way through creating a VCPU might be the problem. When we add retry mechanism in pause_all_vcpus(), we can solve this problem. With the semantic unchanged for user, which means: With bql, we can make sure all vcpus are paused after pause_all_vcpus() finish, and all vcpus are resumed after resume_all_vcpus() finish. Okay, got it. As just replied to #1, please see if you can avoid messing with pause_all_vcpus() by inhibiting KVM IOCTLs like KVM does. That would be preferable. -- Cheers, David / dhildenb
Re: Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
to lock bql -> lock bql && all_vcpu_paused -> success and do other work -> unlock bql

Thread T2: wait T1 unlock bql to lock bql -> lock bql -> resume_all_vcpus -> success and do other work -> unlock bql

Now throw in another thread and it all gets really complicated :)

Finding ways to avoid pause_all_vcpus() on the ARM reset code would be preferable. I guess you simply want to do something similar to what KVM does to avoid messing with pause_all_vcpus(): inhibiting certain IOCTLs.

commit f39b7d2b96e3e73c01bb678cd096f7baf0b9ab39
Author: David Hildenbrand
Date:   Fri Nov 11 10:47:58 2022 -0500

    kvm: Atomic memslot updates

    If we update an existing memslot (e.g., resize, split), we temporarily remove the memslot to re-add it immediately afterwards. These updates are not atomic, especially not for KVM VCPU threads, such that we can get spurious faults.

    Let's inhibit most KVM ioctls while performing relevant updates, such that we can perform the update just as if it would happen atomically without additional kernel support. We capture the add/del changes and apply them in the notifier commit stage instead. There, we can check for overlaps and perform the ioctl inhibiting only if really required (-> overlap).

    To keep things simple we don't perform additional checks that wouldn't actually result in an overlap -- such as !RAM memory regions in some cases (see kvm_set_phys_mem()). To minimize cache-line bouncing, use a separate indicator (in_ioctl_lock) per CPU. Also, make sure to hold the kvm_slots_lock while performing both actions (removing+re-adding).

    We have to wait until all IOCTLs were exited and block new ones from getting executed. This approach cannot result in a deadlock as long as the inhibitor does not hold any locks that might hinder an IOCTL from getting finished and exited - something fairly unusual. The inhibitor will always hold the BQL.

    AFAIKs, one possible candidate would be userfaultfd.
If a page cannot be placed (e.g., during postcopy), because we're waiting for a lock, or if the userfaultfd thread cannot process a fault, because it is waiting for a lock, there could be a deadlock. However, the BQL is not applicable here, because any other guest memory access while holding the BQL would already result in a deadlock. Nothing else in the kernel should block forever and wait for userspace intervention. Note: pause_all_vcpus()/resume_all_vcpus() or start_exclusive()/end_exclusive() cannot be used, as they either drop the BQL or require to be called without the BQL - something inhibitors cannot handle. We need a low-level locking mechanism that is deadlock-free even when not releasing the BQL. -- Cheers, David / dhildenb
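The begin/end protocol that the commit message describes can be pictured in a few lines of pthread code. This is only an illustrative sketch with made-up names — QEMU's actual implementation is more elaborate (per-CPU in-ioctl indicators to minimize cache-line bouncing, interplay with the BQL):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int in_ioctl;      /* number of threads currently inside an ioctl */
static bool inhibited;    /* true while a memslot update is in progress */

/* vCPU threads bracket every ioctl with ioctl_begin()/ioctl_end(). */
static void ioctl_begin(void)
{
    pthread_mutex_lock(&lock);
    while (inhibited) {   /* new ioctls wait until the update is done */
        pthread_cond_wait(&cond, &lock);
    }
    in_ioctl++;
    pthread_mutex_unlock(&lock);
}

static void ioctl_end(void)
{
    pthread_mutex_lock(&lock);
    in_ioctl--;
    pthread_cond_broadcast(&cond);    /* wake a waiting inhibitor */
    pthread_mutex_unlock(&lock);
}

/* The updater blocks new ioctls and drains the in-flight ones. */
static void inhibit_begin(void)
{
    pthread_mutex_lock(&lock);
    inhibited = true;
    while (in_ioctl > 0) {            /* wait for in-flight ioctls to exit */
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

static void inhibit_end(void)
{
    pthread_mutex_lock(&lock);
    inhibited = false;
    pthread_cond_broadcast(&cond);    /* let blocked ioctls proceed */
    pthread_mutex_unlock(&lock);
}
```

Note how neither side ever drops an outer lock half-way through, which is exactly the property that pause_all_vcpus()/start_exclusive() lack.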
Re: [PATCH v2 1/2] target/s390x: Use mutable temporary value for op_ts
On 18.03.24 21:27, Ilya Leoshkevich wrote: From: Ido Plat Otherwise TCG would assume the register that holds t1 would be constant and reuse whenever it needs the value within it. Cc: qemu-sta...@nongnu.org Fixes: f1ea739bd598 ("target/s390x: Use tcg_constant_* in local contexts") Reviewed-by: Ilya Leoshkevich Reviewed-by: Richard Henderson [iii: Adjust a newline and capitalization, add tags] Signed-off-by: Ido Plat Signed-off-by: Ilya Leoshkevich --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v1 2/2] system/cpus: Fix resume_all_vcpus() under vCPU hotplug condition
On 17.03.24 09:37, Keqian Zhu via wrote: For vCPU being hotplugged, qemu_init_vcpu() is called. In this function, we set vcpu state as stopped, and then wait vcpu thread to be created. As the vcpu state is stopped, it will inform us it has been created and then wait on halt_cond. After we has realized vcpu object, we will resume the vcpu thread. However, during we wait vcpu thread to be created, the bql is unlocked, and other thread is allowed to call resume_all_vcpus(), which will resume the un-realized vcpu. This fixes the issue by filter out un-realized vcpu during resume_all_vcpus(). Similar question: is there a reproducer? How could we currently hotplug a VCPU, and while it is being created, see pause_all_vcpus()/resume_all_vcpus() getting called? If I am not getting this wrong, there seems to be some other mechanism missing that makes sure that this cannot happen. Dropping the BQL half-way through creating a VCPU might be the problem. -- Cheers, David / dhildenb
Re: [PATCH v1 1/2] system/cpus: Fix pause_all_vcpus() under concurrent environment
On 17.03.24 09:37, Keqian Zhu via wrote: Both main loop thread and vCPU thread are allowed to call pause_all_vcpus(), and in general resume_all_vcpus() is called after it. Two issues live in pause_all_vcpus(): In general, calling pause_all_vcpus() from VCPU threads is quite dangerous. Do we have reproducers for the cases below? 1. There is possibility that during thread T1 waits on qemu_pause_cond with bql unlocked, other thread has called pause_all_vcpus() and resume_all_vcpus(), then thread T1 will stuck, because the condition all_vcpus_paused() is always false. How can this happen? Two threads calling pause_all_vcpus() is borderline broken, as you note. IIRC, we should call pause_all_vcpus() only if some other mechanism prevents these races. For example, based on runstate changes. Just imagine one thread calling pause_all_vcpus() while another one calls resume_all_vcpus(). It cannot possibly work. 2. After all_vcpus_paused() has been checked as true, we will unlock bql to relock replay_mutex. During the bql was unlocked, the vcpu's state may has been changed by other thread, so we must retry. 
Signed-off-by: Keqian Zhu
---
 system/cpus.c | 29 ++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/system/cpus.c b/system/cpus.c
index 68d161d96b..4e41abe23e 100644
--- a/system/cpus.c
+++ b/system/cpus.c
@@ -571,12 +571,14 @@ static bool all_vcpus_paused(void)
     return true;
 }
 
-void pause_all_vcpus(void)
+static void request_pause_all_vcpus(void)
 {
     CPUState *cpu;
 
-    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
     CPU_FOREACH(cpu) {
+        if (cpu->stopped) {
+            continue;
+        }
         if (qemu_cpu_is_self(cpu)) {
             qemu_cpu_stop(cpu, true);
         } else {
@@ -584,6 +586,14 @@ void pause_all_vcpus(void)
             qemu_cpu_kick(cpu);
         }
     }
+}
+
+void pause_all_vcpus(void)
+{
+    qemu_clock_enable(QEMU_CLOCK_VIRTUAL, false);
+
+retry:
+    request_pause_all_vcpus();
 
     /* We need to drop the replay_lock so any vCPU threads woken up
      * can finish their replay tasks
@@ -592,14 +602,23 @@ void pause_all_vcpus(void)
     while (!all_vcpus_paused()) {
         qemu_cond_wait(&qemu_pause_cond, &bql);
-        CPU_FOREACH(cpu) {
-            qemu_cpu_kick(cpu);
-        }
+        /* During we waited on qemu_pause_cond the bql was unlocked,
+         * the vcpu's state may has been changed by other thread, so
+         * we must request the pause state on all vcpus again.
+         */
+        request_pause_all_vcpus();
     }
 
     bql_unlock();
     replay_mutex_lock();
     bql_lock();
+
+    /* During the bql was unlocked, the vcpu's state may has been
+     * changed by other thread, so we must retry.
+     */
+    if (!all_vcpus_paused()) {
+        goto retry;
+    }
 }
 
 void cpu_resume(CPUState *cpu)

-- Cheers, David / dhildenb
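At bottom, the retry logic in this patch restates the classic condition-variable rule: the predicate ("all vCPUs are paused") can be invalidated by another thread whenever the lock is dropped, so it must be re-checked — and, here, the pause re-requested — after every wakeup. A stripped-down sketch with illustrative names, not QEMU's actual API:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t paused_cond = PTHREAD_COND_INITIALIZER;
static int num_cpus = 4;
static int paused_cpus;

/* A "vCPU thread" honoring a pause request. */
static void cpu_mark_paused(void)
{
    pthread_mutex_lock(&lock);
    paused_cpus++;
    pthread_cond_broadcast(&paused_cond);
    pthread_mutex_unlock(&lock);
}

/* A racing resume undoing a pause. */
static void cpu_mark_resumed(void)
{
    pthread_mutex_lock(&lock);
    paused_cpus--;
    pthread_mutex_unlock(&lock);
}

static void wait_all_paused(void)
{
    pthread_mutex_lock(&lock);
    while (paused_cpus != num_cpus) {
        /*
         * pthread_cond_wait() drops the lock, so paused_cpus can change
         * underneath us until we reacquire it.  Replacing this "while"
         * with an "if" would be exactly the bug the patch guards
         * against -- waking up once does not mean the predicate holds.
         */
        pthread_cond_wait(&paused_cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

/* Helper thread that pauses every "vCPU", used in the example below. */
static void *pauser_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < num_cpus; i++) {
        cpu_mark_paused();
    }
    return NULL;
}
```

The extra twist in QEMU is that pause_all_vcpus() also drops the BQL around replay_mutex_lock(), which is why the patch needs a second check (the goto retry) after the predicate was already seen true once.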
Re: [PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code
On 11.03.24 21:03, Mario Casquero wrote: This series has been successfully tested by QE. Start the qemu-storage-daemon in the background with a rhel 9.5 image and vhost-user-blk. After that, boot up a VM with virtio-mem and vhost-user-blk-pci. Check with the HMP command 'info mtree' that virtio-mem is making use of multiple memslots. Tested-by: Mario Casquero Thanks Mario! -- Cheers, David / dhildenb
Re: [PATCH v2 0/4] physmem: Fix MemoryRegion for second access to cached MMIO Address Space
On 08.03.24 08:36, Peter Xu wrote: On Thu, Mar 07, 2024 at 03:37:06PM +, Jonathan Cameron wrote: v2: (Thanks to Peter Xu for reviewing!) - New patch 1 to rename addr1 to mr_addr in the interests of meaningful naming. - Take advantage of a cached address space only allow for a single MR to simplify the new code. - Various cleanups of indentation etc. - Cover letter and some patch descriptions updated to reflect changes. - Changes all called out in specific patches. All look good to me, thanks. Having the new functions' first argument as MemTxAttrs is slightly weird to me, but that's not a big deal and we can clean it up later if wanted. I guess it's good to fix this in 9.0 first as it's a real bug even if not trivial to hit. I queued it in my migration tree (with my "memory API" hat..). I won't send a pull until next Monday. Please shoot if there's any objections! Works for me, thanks! -- Cheers, David / dhildenb
Re: [PATCH v2 4/4] physmem: Fix wrong address in large address_space_read/write_cached_slow()
On 07.03.24 16:37, Jonathan Cameron wrote: If the access is bigger than the MemoryRegion supports, flatview_read/write_continue() will attempt to update the Memory Region, but the address passed to flatview_translate() is relative to the cache, not to the FlatView. On arm/virt with interleaved CXL memory emulation and virtio-blk-pci this led to the first part of the descriptor being read from the CXL memory and the second part from PA 0x8 which happens to be a blank region of a flash chip and all ffs on this particular configuration. Note this test requires the out of tree ARM support for CXL, but the problem is more general. Avoid this by adding new address_space_read_continue_cached() and address_space_write_continue_cached() which share all the logic with the flatview versions except for the MemoryRegion lookup, which is unnecessary as the MemoryRegionCache only covers one MemoryRegion. Signed-off-by: Jonathan Cameron --- v2: Review from Peter Xu - Drop additional lookups of the MemoryRegion via address_space_translate_cached() as it will always return the same answer. - Drop various parameters that are then unused. - rename addr1 to mr_addr. - Drop a fuzz_dma_read_cb(). Could put this back but it means carrying the address into the inner call and the only in tree fuzzer checks if it is normal RAM and if not does nothing anyway. We don't hit this path for normal RAM. ---
 system/physmem.c | 63 +++-
 1 file changed, 57 insertions(+), 6 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 1264eab24b..701bea27dd 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3381,6 +3381,59 @@ static inline MemoryRegion *address_space_translate_cached(
     return section.mr;
 }
 
+/* Called within RCU critical section. */
+static MemTxResult address_space_write_continue_cached(MemTxAttrs attrs,
+                                                       const void *ptr,
+                                                       hwaddr len,
+                                                       hwaddr mr_addr,
+                                                       hwaddr l,
+                                                       MemoryRegion *mr)

The only thing that is really confusing is hwaddr len, ... hwaddr l, ehm, what? ...
but it fits the style of flatview_read_continue(), so at least the level of confusion this creates is consistent with the other code. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 3/4] physmem: Factor out body of flatview_read/write_continue() loop
On 07.03.24 16:37, Jonathan Cameron wrote: This code will be reused for the address_space_cached accessors shortly. Also reduce scope of result variable now we aren't directly calling this in the loop. Signed-off-by: Jonathan Cameron --- v2: Thanks to Peter Xu - Fix alignment of code. - Drop unused addr parameter. - Carry through new mr_addr parameter name. - RB not picked up as not sure what Peter will think wrt to resulting parameter ordering. --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 2/4] physmem: Reduce local variable scope in flatview_read/write_continue()
On 07.03.24 16:37, Jonathan Cameron wrote: Precursor to factoring out the inner loops for reuse. Reviewed-by: Peter Xu Signed-off-by: Jonathan Cameron --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v2 1/4] physmem: Rename addr1 to more informative mr_addr in flatview_read/write() and similar
On 07.03.24 16:37, Jonathan Cameron wrote: The calls to flatview_read/write[_continue]() have parameters addr and addr1 but the names give no indication of what they are addresses of. Rename addr1 to mr_addr to reflect that it is the translated address offset within the MemoryRegion returned by flatview_translate(). Similarly rename the parameter in address_space_read/write_cached_slow() Suggested-by: Peter Xu Signed-off-by: Jonathan Cameron Much cleaner indeed! Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH] qapi: Fix format of the memory-backend-file's @rom property doc comment
On 29.02.24 11:58, Stefano Garzarella wrote: Reflow paragraph following commit a937b6aa73 ("qapi: Reformat doc comments to conform to current conventions"): use 4 spaces indentation, 70 columns width, and two spaces to separate sentences. Suggested-by: Markus Armbruster Signed-off-by: Stefano Garzarella ---
 qapi/qom.json | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 2a6e49365a..db1b0fdea2 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -668,19 +668,20 @@
 # @readonly: if true, the backing file is opened read-only; if false,
 #     it is opened read-write. (default: false)
 #
-# @rom: whether to create Read Only Memory (ROM) that cannot be modified
-#     by the VM. Any write attempts to such ROM will be denied. Most
-#     use cases want writable RAM instead of ROM. However, selected use
-#     cases, like R/O NVDIMMs, can benefit from ROM. If set to 'on',
-#     create ROM; if set to 'off', create writable RAM; if set to
-#     'auto', the value of the @readonly property is used. This
-#     property is primarily helpful when we want to have proper RAM in
-#     configurations that would traditionally create ROM before this
-#     property was introduced: VM templating, where we want to open a
-#     file readonly (@readonly set to true) and mark the memory to be
-#     private for QEMU (@share set to false). For this use case, we need
-#     writable RAM instead of ROM, and want to set this property to 'off'.
-#     (default: auto, since 8.2)
+# @rom: whether to create Read Only Memory (ROM) that cannot be
+#     modified by the VM.  Any write attempts to such ROM will be
+#     denied.  Most use cases want writable RAM instead of ROM.
+#     However, selected use cases, like R/O NVDIMMs, can benefit from
+#     ROM.  If set to 'on', create ROM; if set to 'off', create
+#     writable RAM; if set to 'auto', the value of the @readonly
+#     property is used.  This property is primarily helpful when we
+#     want to have proper RAM in configurations that would
+#     traditionally create ROM before this property was introduced: VM
+#     templating, where we want to open a file readonly (@readonly set
+#     to true) and mark the memory to be private for QEMU (@share set
+#     to false).  For this use case, we need writable RAM instead of
+#     ROM, and want to set this property to 'off'.  (default: auto,
+#     since 8.2)
 #
 # Since: 2.1
 ##

Ideally, we'd have a format checker that complains like checkpatch usually would. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH 9/9] hostmem-file: support POSIX shm_open()
On 28.02.24 12:47, Stefano Garzarella wrote: Add a new `shm` bool option for `-object memory-backend-file`. When this option is set to true, the POSIX shm_open(3) is used instead of open(2). So a file will not be created in the filesystem, but a "POSIX shared memory object" will be instantiated. In Linux this turns into a file in /dev/shm, but in other OSes this may not happen (for example in macOS or FreeBSD nothing is shown in any filesystem). This new feature is useful when we need to share guest memory with another process (e.g. vhost-user backend), but we don't have memfd_create() or any special filesystems (e.g. /dev/shm) available as in macOS. Signed-off-by: Stefano Garzarella --- I am not sure this is the best way to support shm_open() in QEMU. Other solutions I had in mind were: - create a new memory-backend-shm - extend memory-backend-memfd to use shm_open() on systems where memfd is not available (problem: shm_open wants a name to assign to the object, but we can do a workaround by using a random name and do the unlink right away) Any preference/suggestion? Both sound like reasonable options, and IMHO better than hostmem-file with things that are not necessarily "real" files. Regarding memory-backend-memfd, we similarly have to pass a name to memfd_create(), although for different purpose: "The name supplied in name is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/". So we simply pass TYPE_MEMORY_BACKEND_MEMFD. Likely, memory-backend-shm that directly maps to shm_open() and only provides properties reasonable for shm_open() is the cleanest approach. So that would currently be my preference :) -- Cheers, David / dhildenb
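The workaround mentioned above — give shm_open() a throwaway name and shm_unlink() it right away, so only the file descriptor keeps the object alive — can be sketched roughly like this. The helper name and the naming scheme are made up for illustration; on older glibc you may additionally need to link with -lrt:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a POSIX shared memory object of the given size and immediately
 * unlink its name, so nothing filesystem-visible remains and the object
 * lives only as long as the returned file descriptor (and any mappings).
 */
static int shm_open_anonymous(size_t size)
{
    char name[64];
    int fd;

    /* A throwaway name; O_EXCL guards against accidental collisions. */
    snprintf(name, sizeof(name), "/example-shm-%ld", (long)getpid());
    fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0) {
        return -1;
    }
    shm_unlink(name);               /* drop the name right away */
    if (ftruncate(fd, size) < 0) {  /* size the object */
        close(fd);
        return -1;
    }
    return fd;
}
```

The resulting fd can then be mmap()ed with MAP_SHARED and passed to another process (e.g. a vhost-user backend) over a unix socket, much like a memfd would be on Linux.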
Re: [PATCH v4 33/66] i386/tdx: Make memory type private by default
On 29.01.24 03:18, Xiaoyao Li wrote: On 1/26/2024 10:58 PM, David Hildenbrand wrote: On 25.01.24 04:22, Xiaoyao Li wrote: By default (due to the recent UPM change), restricted memory attribute is shared. Convert the memory region from shared to private at the memory slot creation time. Add kvm region registering function to check the flag and convert the region, and add memory listener to TDX guest code to set the flag to the possible memory region.

Without this patch
- Secure-EPT violation on private area
- KVM_MEMORY_FAULT EXIT (kvm -> qemu)
- qemu converts the 4K page from shared to private
- Resume VCPU execution
- Secure-EPT violation again
- KVM resolves EPT Violation

This also prevents huge page because page conversion is done at 4K granularity. Although it's possible to merge 4K private mapping into 2M large page, it slows guest boot.

With this patch
- After memory slot creation, convert the region from private to shared
- Secure-EPT violation on private area.
- KVM resolves EPT Violation

Originated-from: Isaku Yamahata Signed-off-by: Xiaoyao Li ---
 include/exec/memory.h |  1 +
 target/i386/kvm/tdx.c | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 7229fcc0415f..f25959f6d30f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -850,6 +850,7 @@ struct IOMMUMemoryRegion {
 #define MEMORY_LISTENER_PRIORITY_MIN            0
 #define MEMORY_LISTENER_PRIORITY_ACCEL          10
 #define MEMORY_LISTENER_PRIORITY_DEV_BACKEND    10
+#define MEMORY_LISTENER_PRIORITY_ACCEL_HIGH     20
 
 /**
  * struct MemoryListener: callbacks structure for updates to the physical memory map
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7b250d80bc1d..f892551821ce 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -19,6 +19,7 @@
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+#include "exec/address-spaces.h"
 #include "hw/i386/x86.h"
 #include "kvm_i386.h"
@@ -621,6 +622,19 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
     return 0;
 }
 
+static void tdx_guest_region_add(MemoryListener *listener,
+                                 MemoryRegionSection *section)
+{
+    memory_region_set_default_private(section->mr);
+}

That looks fishy. Why is TDX to decide what happens to other memory regions it doesn't own? We should define that behavior when creating these memory regions, and TDX could sanity-check that they have been set up properly. Let me ask differently: For which memory region where we have RAM_GUEST_MEMFD set would we *not* want to set private as default right from the start? All memory regions that have RAM_GUEST_MEMFD set will benefit from being private as default, for a TDX guest. I will update the implementation to set the RAM_DEFAULT_PRIVATE flag when guest_memfd is created successfully, like

diff --git a/system/physmem.c b/system/physmem.c
index fc59470191ef..60676689c807 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1850,6 +1850,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         qemu_mutex_unlock_ramlist();
         return;
     }
+
+    new_block->flags |= RAM_DEFAULT_PRIVATE;
 }

then this patch can be dropped, and the calling of memory_region_set_default_private(mr) in Patch 45 can be dropped too. I think this is what you suggested, right? Yes, if that works, great! -- Cheers, David / dhildenb
Re: [PATCH] system/physmem: remove redundant arg reassignment
On 15.02.24 10:15, Manos Pitsidianakis wrote: Arguments `ram_block` are reassigned to local declarations `block` without further use. Remove re-assignment to reduce noise. Signed-off-by: Manos Pitsidianakis --- system/physmem.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/system/physmem.c b/system/physmem.c index 5e66d9ae36..d4c3bfac65 100644 --- a/system/physmem.c +++ b/system/physmem.c @@ -2154,10 +2154,8 @@ void qemu_ram_remap(ram_addr_t addr, ram_addr_t length) * * Called within RCU critical section. */ -void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr) +void *qemu_map_ram_ptr(RAMBlock *block, ram_addr_t addr) { -RAMBlock *block = ram_block; - if (block == NULL) { block = qemu_get_ram_block(addr); addr -= block->offset; @@ -2182,10 +2180,9 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr) * * Called within RCU critical section. */ -static void *qemu_ram_ptr_length(RAMBlock *ram_block, ram_addr_t addr, +static void *qemu_ram_ptr_length(RAMBlock *block, ram_addr_t addr, hwaddr *size, bool lock) { -RAMBlock *block = ram_block; if (*size == 0) { return NULL; } base-commit: 5767815218efd3cbfd409505ed824d5f356044ae Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
[PATCH v2 06/14] libvhost-user: No need to check for NULL when unmapping
We never add a memory region if mmap() failed. Therefore, no need to check for NULL. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index f43b5096d0..225283f764 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -247,11 +247,8 @@ vu_remove_all_mem_regs(VuDev *dev) for (i = 0; i < dev->nregions; i++) { VuDevRegion *r = &dev->regions[i]; -void *ma = (void *)(uintptr_t)r->mmap_addr; -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); } dev->nregions = 0; } @@ -888,11 +885,8 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { for (i = 0; i < dev->nregions; i++) { if (reg_equal(&dev->regions[i], msg_region)) { VuDevRegion *r = &dev->regions[i]; -void *ma = (void *) (uintptr_t) r->mmap_addr; -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); /* * Shift all affected entries by 1 to close the hole at index i and -- 2.43.0
[PATCH v2 04/14] libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()
Let's reduce some code duplication and prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 119 +++--- 1 file changed, 39 insertions(+), 80 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index e4907dfc26..a7bd7de3cd 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -937,95 +937,23 @@ vu_get_shared_object(VuDev *dev, VhostUserMsg *vmsg) } static bool -vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg) +vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) { -unsigned int i; VhostUserMemory m = vmsg->payload.memory, *memory = -dev->nregions = memory->nregions; - -DPRINT("Nregions: %u\n", memory->nregions); -for (i = 0; i < dev->nregions; i++) { -void *mmap_addr; -VhostUserMemoryRegion *msg_region = >regions[i]; -VuDevRegion *dev_region = >regions[i]; - -DPRINT("Region %d\n", i); -DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", - msg_region->guest_phys_addr); -DPRINT("memory_size: 0x%016"PRIx64"\n", - msg_region->memory_size); -DPRINT("userspace_addr 0x%016"PRIx64"\n", - msg_region->userspace_addr); -DPRINT("mmap_offset 0x%016"PRIx64"\n", - msg_region->mmap_offset); - -dev_region->gpa = msg_region->guest_phys_addr; -dev_region->size = msg_region->memory_size; -dev_region->qva = msg_region->userspace_addr; -dev_region->mmap_offset = msg_region->mmap_offset; +int prot = PROT_READ | PROT_WRITE; +unsigned int i; -/* We don't use offset argument of mmap() since the - * mapped address has to be page aligned, and we use huge - * pages. 
+if (dev->postcopy_listening) { +/* * In postcopy we're using PROT_NONE here to catch anyone * accessing it before we userfault */ -mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_NONE, MAP_SHARED | MAP_NORESERVE, - vmsg->fds[i], 0); - -if (mmap_addr == MAP_FAILED) { -vu_panic(dev, "region mmap error: %s", strerror(errno)); -} else { -dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr; -DPRINT("mmap_addr: 0x%016"PRIx64"\n", - dev_region->mmap_addr); -} - -/* Return the address to QEMU so that it can translate the ufd - * fault addresses back. - */ -msg_region->userspace_addr = (uintptr_t)(mmap_addr + - dev_region->mmap_offset); -close(vmsg->fds[i]); -} - -/* Send the message back to qemu with the addresses filled in */ -vmsg->fd_num = 0; -if (!vu_send_reply(dev, dev->sock, vmsg)) { -vu_panic(dev, "failed to respond to set-mem-table for postcopy"); -return false; -} - -/* Wait for QEMU to confirm that it's registered the handler for the - * faults. - */ -if (!dev->read_msg(dev, dev->sock, vmsg) || -vmsg->size != sizeof(vmsg->payload.u64) || -vmsg->payload.u64 != 0) { -vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table"); -return false; +prot = PROT_NONE; } -/* OK, now we can go and register the memory and generate faults */ -(void)generate_faults(dev); - -return false; -} - -static bool -vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) -{ -unsigned int i; -VhostUserMemory m = vmsg->payload.memory, *memory = - vu_remove_all_mem_regs(dev); dev->nregions = memory->nregions; -if (dev->postcopy_listening) { -return vu_set_mem_table_exec_postcopy(dev, vmsg); -} - DPRINT("Nregions: %u\n", memory->nregions); for (i = 0; i < dev->nregions; i++) { void *mmap_addr; @@ -1051,8 +979,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) * mapped address has to be page aligned, and we use huge * pages. 
*/ mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, - vmsg->fds[i], 0); + prot, MAP_SHARED | MAP_NORESERVE, vmsg->fds[i], 0); if (mmap_addr == MAP_FAILED) { vu_panic(dev, "region mmap error: %s", strerror
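The key simplification in the merged function is that only the mmap() protection differs between the postcopy and non-postcopy paths. A tiny sketch of that selection (the helper name is illustrative, not libvhost-user API):

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Postcopy maps regions PROT_NONE so that any access before userfaultfd
 * registration faults visibly; otherwise regions are mapped read/write.
 */
static int region_mmap_prot(bool postcopy_listening)
{
    return postcopy_listening ? PROT_NONE : (PROT_READ | PROT_WRITE);
}
```

Everything else — the mmap() flags, the fd, the size computation — is identical in both paths, which is why the two functions could be merged around a single `prot` variable.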
[PATCH v2 09/14] libvhost-user: Factor out search for memory region by GPA and simplify
Memory regions cannot overlap, and if we ever hit that case something would be really flawed. For example, when vhost code in QEMU decides to increase the size of memory regions to cover full huge pages, it makes sure to never create overlaps, and if there would be overlaps, it would bail out. QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary list"), c1ece84e7c93 ("vhost: Huge page align and merge") and e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that handling and how overlaps are impossible. Consequently, each GPA can belong to at most one memory region, and everything else doesn't make sense. Let's factor out our search to prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 79 +-- 1 file changed, 45 insertions(+), 34 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 7f29e01c30..d72f25396d 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -195,30 +195,47 @@ vu_panic(VuDev *dev, const char *msg, ...) */ } +/* Search for a memory region that covers this guest physical address. */ +static VuDevRegion * +vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) +{ +unsigned int i; + +/* + * Memory regions cannot overlap in guest physical address space. Each + * GPA belongs to exactly one memory region, so there can only be one + * match. + */ +for (i = 0; i < dev->nregions; i++) { +VuDevRegion *cur = >regions[i]; + +if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { +return cur; +} +} +return NULL; +} + /* Translate guest physical address to our virtual address. */ void * vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr) { -unsigned int i; +VuDevRegion *r; if (*plen == 0) { return NULL; } -/* Find matching memory region. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; - -if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) { -if ((guest_addr + *plen) > (r->gpa + r->size)) { -*plen = r->gpa + r->size - guest_addr; -} -return (void *)(uintptr_t) -guest_addr - r->gpa + r->mmap_addr + r->mmap_offset; -} +r = vu_gpa_to_mem_region(dev, guest_addr); +if (!r) { +return NULL; } -return NULL; +if ((guest_addr + *plen) > (r->gpa + r->size)) { +*plen = r->gpa + r->size - guest_addr; +} +return (void *)(uintptr_t)guest_addr - r->gpa + r->mmap_addr + + r->mmap_offset; } /* Translate qemu virtual address to our virtual address. */ @@ -854,8 +871,8 @@ static inline bool reg_equal(VuDevRegion *vudev_reg, static bool vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = -unsigned int i; -bool found = false; +unsigned int idx; +VuDevRegion *r; if (vmsg->fd_num > 1) { vmsg_close_fds(vmsg); @@ -882,28 +899,22 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { DPRINT("mmap_offset 0x%016"PRIx64"\n", msg_region->mmap_offset); -for (i = 0; i < dev->nregions; i++) { -if (reg_equal(>regions[i], msg_region)) { -VuDevRegion *r = >regions[i]; - -munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); - -/* Shift all affected entries by 1 to close the hole at index. */ -memmove(dev->regions + i, dev->regions + i + 1, -sizeof(VuDevRegion) * (dev->nregions - i - 1)); -DPRINT("Successfully removed a region\n"); -dev->nregions--; -i--; - -found = true; -break; -} -} - -if (!found) { +r = vu_gpa_to_mem_region(dev, msg_region->guest_phys_addr); +if (!r || !reg_equal(r, msg_region)) { +vmsg_close_fds(vmsg); vu_panic(dev, "Specified region not found\n"); +return false; } +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); + +idx = r - dev->regions; +assert(idx < dev->nregions); +/* Shift all affected entries by 1 to close the hole. 
*/ +memmove(r, r + 1, sizeof(VuDevRegion) * (dev->nregions - idx - 1)); +DPRINT("Successfully removed a region\n"); +dev->nregions--; + vmsg_close_fds(vmsg); return false; -- 2.43.0
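The non-overlap invariant the commit message relies on — each GPA belongs to at most one region — is what makes returning on the first hit correct. A standalone sketch of the factored-out lookup, using a simplified stand-in for VuDevRegion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for libvhost-user's VuDevRegion (illustrative only). */
typedef struct Region {
    uint64_t gpa;  /* guest physical start address */
    uint64_t size; /* length in bytes */
} Region;

/*
 * Linear search for the region covering guest_addr. Because regions never
 * overlap in guest physical address space, at most one region can match,
 * so the first hit is the only hit and we can return immediately.
 */
static Region *gpa_to_region(Region *regions, size_t n, uint64_t guest_addr)
{
    for (size_t i = 0; i < n; i++) {
        Region *cur = &regions[i];

        if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
            return cur;
        }
    }
    return NULL;
}
```

Both vu_gpa_to_va() and vu_rem_mem_reg() reduce to this single lookup plus a small amount of post-processing, which is the deduplication the patch is after.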
[PATCH v2 07/14] libvhost-user: Don't zero out memory for memory regions
dev->nregions always covers only valid entries. Stop zeroing out other array elements that are unused. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 225283f764..2e8b611385 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -888,13 +888,9 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); -/* - * Shift all affected entries by 1 to close the hole at index i and - * zero out the last entry. - */ +/* Shift all affected entries by 1 to close the hole at index. */ memmove(dev->regions + i, dev->regions + i + 1, sizeof(VuDevRegion) * (dev->nregions - i - 1)); -memset(dev->regions + dev->nregions - 1, 0, sizeof(VuDevRegion)); DPRINT("Successfully removed a region\n"); dev->nregions--; i--; @@ -2119,7 +2115,6 @@ vu_init(VuDev *dev, DPRINT("%s: failed to malloc mem regions\n", __func__); return false; } -memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); dev->vq = malloc(max_queues * sizeof(dev->vq[0])); if (!dev->vq) { -- 2.43.0
[PATCH v2 01/14] libvhost-user: Dynamically allocate memory for memory slots
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically allocating dev->regions. We don't have any ABI guarantees (not dynamically linked), so we can simply change the layout of VuDev. Let's zero out the memory, just as we used to do. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 11 +++ subprojects/libvhost-user/libvhost-user.h | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a3b158c671..360c5366d6 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -2171,6 +2171,8 @@ vu_deinit(VuDev *dev) free(dev->vq); dev->vq = NULL; +free(dev->regions); +dev->regions = NULL; } bool @@ -2205,9 +2207,18 @@ vu_init(VuDev *dev, dev->backend_fd = -1; dev->max_queues = max_queues; +dev->regions = malloc(VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); +if (!dev->regions) { +DPRINT("%s: failed to malloc mem regions\n", __func__); +return false; +} +memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); + dev->vq = malloc(max_queues * sizeof(dev->vq[0])); if (!dev->vq) { DPRINT("%s: failed to malloc virtqueues\n", __func__); +free(dev->regions); +dev->regions = NULL; return false; } diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h index c2352904f0..c882b4e3a2 100644 --- a/subprojects/libvhost-user/libvhost-user.h +++ b/subprojects/libvhost-user/libvhost-user.h @@ -398,7 +398,7 @@ typedef struct VuDevInflightInfo { struct VuDev { int sock; uint32_t nregions; -VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS]; +VuDevRegion *regions; VuVirtq *vq; VuDevInflightInfo inflight_info; int log_call_fd; -- 2.43.0
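The error handling added here follows the usual C pattern: on failure of a later allocation, free everything allocated so far before bailing out. A minimal standalone sketch of that pattern (hypothetical `Dev` type, not libvhost-user's VuDev):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical container standing in for VuDev's two dynamic arrays. */
typedef struct Dev {
    void *regions;
    void *vq;
} Dev;

/*
 * Allocate both arrays; if the second allocation fails, release the first
 * so the caller never sees a half-initialized device.
 */
static bool dev_alloc(Dev *dev, size_t regions_bytes, size_t vq_bytes)
{
    dev->regions = calloc(1, regions_bytes); /* zeroed, as the patch does */
    if (!dev->regions) {
        return false;
    }
    dev->vq = malloc(vq_bytes);
    if (!dev->vq) {
        free(dev->regions);
        dev->regions = NULL;
        return false;
    }
    return true;
}
```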
[PATCH v2 10/14] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 49 +-- 1 file changed, 45 insertions(+), 4 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d72f25396d..ef6353d847 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...) static VuDevRegion * vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) { -unsigned int i; +int low = 0; +int high = dev->nregions - 1; /* * Memory regions cannot overlap in guest physical address space. Each * GPA belongs to exactly one memory region, so there can only be one * match. + * + * We store our memory regions ordered by GPA and can simply perform a + * binary search. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *cur = >regions[i]; +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = >regions[mid]; if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { return cur; } +if (guest_addr >= cur->gpa + cur->size) { +low = mid + 1; +} +if (guest_addr < cur->gpa) { +high = mid - 1; +} } return NULL; } @@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev) static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { +const uint64_t start_gpa = msg_region->guest_phys_addr; +const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; VuDevRegion *r; void *mmap_addr; +int low = 0; +int high = dev->nregions - 1; +unsigned int idx; DPRINT("Adding region %d\n", dev->nregions); DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", @@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) prot = PROT_NONE; } +/* + * We will add memory regions into the array sorted by GPA. Perform a + * binary search to locate the insertion point: it will be at the low + * index. + */ +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = >regions[mid]; + +/* Overlap of GPA addresses. */ +if (start_gpa < cur->gpa + cur->size && cur->gpa < end_gpa) { +vu_panic(dev, "regions with overlapping guest physical addresses"); +return; +} +if (start_gpa >= cur->gpa + cur->size) { +low = mid + 1; +} +if (start_gpa < cur->gpa) { +high = mid - 1; +} +} +idx = low; + /* * We don't use offset argument of mmap() since the mapped address has * to be page aligned, and we use huge pages. @@ -308,7 +347,9 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) DPRINT("mmap_addr: 0x%016"PRIx64"\n", (uint64_t)(uintptr_t)mmap_addr); -r = >regions[dev->nregions]; +/* Shift all affected entries by 1 to open a hole at idx. 
*/ +r = >regions[idx]; +memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx)); r->gpa = msg_region->guest_phys_addr; r->size = msg_region->memory_size; r->qva = msg_region->userspace_addr; -- 2.43.0
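Both loops in this patch are classic binary searches over an array kept sorted by GPA: one finds the covering region, the other finds the insertion point that keeps the array sorted. A self-contained sketch of the two searches (simplified `Region` type; the real code additionally panics on overlapping regions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Region {
    uint64_t gpa;
    uint64_t size;
} Region;

/* Binary search over regions sorted by gpa; regions must not overlap. */
static Region *gpa_to_region(Region *regions, size_t n, uint64_t guest_addr)
{
    int low = 0;
    int high = (int)n - 1;

    while (low <= high) {
        int mid = low + (high - low) / 2;
        Region *cur = &regions[mid];

        if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) {
            return cur;
        }
        if (guest_addr >= cur->gpa + cur->size) {
            low = mid + 1;  /* target lies above this region */
        } else {
            high = mid - 1; /* target lies below this region */
        }
    }
    return NULL;
}

/*
 * Insertion point keeping the array sorted: index of the first region with
 * gpa >= start_gpa (assumes no overlap with existing regions).
 */
static size_t insertion_index(Region *regions, size_t n, uint64_t start_gpa)
{
    size_t low = 0, high = n;

    while (low < high) {
        size_t mid = low + (high - low) / 2;

        if (regions[mid].gpa < start_gpa) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    return low;
}
```

Keeping the array sorted turns the common GPA->VA translation from O(n) into O(log n), which is what matters with up to 509 memslots.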
[PATCH v2 13/14] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions
Currently, we try to remap all rings whenever we add a single new memory region. That doesn't quite make sense, because we already map rings when setting the ring address, and panic if that goes wrong. Likely, that handling was simply copied from set_mem_table code, where we actually have to remap all rings. Remapping all rings might require us to walk quite a lot of memory regions to perform the address translations. Ideally, we'd simply remove that remapping. However, let's be a bit careful. There might be some weird corner cases where we might temporarily remove a single memory region (e.g., resize it), that would have worked for now. Further, a ring might be located on hotplugged memory, and as the VM reboots, we might unplug that memory, to hotplug memory before resetting the ring addresses. So let's unmap affected rings as we remove a memory region, and try dynamically mapping the ring again when required. Acked-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 107 -- 1 file changed, 78 insertions(+), 29 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index ed0a978d4f..61fb3050b3 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +map_ring(VuDev *dev, VuVirtq *vq) +{ +vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr); +vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr); +vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr); + +DPRINT("Setting virtq addresses:\n"); +DPRINT("vring_desc at %p\n", vq->vring.desc); +DPRINT("vring_used at %p\n", vq->vring.used); +DPRINT("vring_avail at %p\n", vq->vring.avail); + +return !(vq->vring.desc && vq->vring.used && vq->vring.avail); +} + static bool vu_is_vq_usable(VuDev *dev, VuVirtq *vq) { -return likely(!dev->broken) && 
likely(vq->vring.avail); +if (unlikely(dev->broken)) { +return false; +} + +if (likely(vq->vring.avail)) { +return true; +} + +/* + * In corner cases, we might temporarily remove a memory region that + * mapped a ring. When removing a memory region we make sure to + * unmap any rings that would be impacted. Let's try to remap if we + * already succeeded mapping this ring once. + */ +if (!vq->vra.desc_user_addr || !vq->vra.used_user_addr || +!vq->vra.avail_user_addr) { +return false; +} +if (map_ring(dev, vq)) { +vu_panic(dev, "remapping queue on access"); +return false; +} +return true; +} + +static void +unmap_rings(VuDev *dev, VuDevRegion *r) +{ +int i; + +for (i = 0; i < dev->max_queues; i++) { +VuVirtq *vq = >vq[i]; +const uintptr_t desc = (uintptr_t)vq->vring.desc; +const uintptr_t used = (uintptr_t)vq->vring.used; +const uintptr_t avail = (uintptr_t)vq->vring.avail; + +if (desc < r->mmap_addr || desc >= r->mmap_addr + r->size) { +continue; +} +if (used < r->mmap_addr || used >= r->mmap_addr + r->size) { +continue; +} +if (avail < r->mmap_addr || avail >= r->mmap_addr + r->size) { +continue; +} + +DPRINT("Unmapping rings of queue %d\n", i); +vq->vring.desc = NULL; +vq->vring.used = NULL; +vq->vring.avail = NULL; +} } static size_t @@ -786,21 +851,6 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg) return false; } -static bool -map_ring(VuDev *dev, VuVirtq *vq) -{ -vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr); -vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr); -vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr); - -DPRINT("Setting virtq addresses:\n"); -DPRINT("vring_desc at %p\n", vq->vring.desc); -DPRINT("vring_used at %p\n", vq->vring.used); -DPRINT("vring_avail at %p\n", vq->vring.avail); - -return !(vq->vring.desc && vq->vring.used && vq->vring.avail); -} - static bool generate_faults(VuDev *dev) { unsigned int i; @@ -884,7 +934,6 @@ generate_faults(VuDev *dev) { static bool vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { 
-int i; VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = if (vmsg->fd_num != 1) { @@ -930,19 +979,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { vmsg->fd_num = 0; DPRINT("Successf
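The core of the unmap_rings() logic above is an interval-containment test: a queue's rings are torn down only when desc, used and avail all point into the mapping being removed. That test can be sketched and exercised on its own (helper names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Does addr fall inside the mapping [mmap_addr, mmap_addr + size)? */
static bool in_region(uintptr_t addr, uintptr_t mmap_addr, uintptr_t size)
{
    return addr >= mmap_addr && addr < mmap_addr + size;
}

/*
 * Mirror of the loop body in unmap_rings(): the vring pointers are cleared
 * only if all three of desc, used and avail lie within the removed region;
 * otherwise the queue is left untouched (the loop "continue"s).
 */
static bool must_unmap(uintptr_t desc, uintptr_t used, uintptr_t avail,
                       uintptr_t mmap_addr, uintptr_t size)
{
    return in_region(desc, mmap_addr, size) &&
           in_region(used, mmap_addr, size) &&
           in_region(avail, mmap_addr, size);
}
```

After clearing, vu_is_vq_usable() sees vring.avail == NULL and attempts the lazy remap via map_ring(), which is the "dynamically remap on access" behavior the commit message describes.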
[PATCH v2 05/14] libvhost-user: Factor out adding a memory region
Let's factor it out, reducing quite some code duplication and preparing for further changes. If we fail to mmap a region and panic, we now simply don't add that (broken) region. Note that we now increment dev->nregions as we are successfully adding memory regions, and don't increment dev->nregions if anything went wrong. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 168 -- 1 file changed, 60 insertions(+), 108 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a7bd7de3cd..f43b5096d0 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -256,6 +256,61 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static void +_vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) +{ +int prot = PROT_READ | PROT_WRITE; +VuDevRegion *r; +void *mmap_addr; + +DPRINT("Adding region %d\n", dev->nregions); +DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", + msg_region->guest_phys_addr); +DPRINT("memory_size: 0x%016"PRIx64"\n", + msg_region->memory_size); +DPRINT("userspace_addr: 0x%016"PRIx64"\n", + msg_region->userspace_addr); +DPRINT("mmap_offset: 0x%016"PRIx64"\n", + msg_region->mmap_offset); + +if (dev->postcopy_listening) { +/* + * In postcopy we're using PROT_NONE here to catch anyone + * accessing it before we userfault + */ +prot = PROT_NONE; +} + +/* + * We don't use offset argument of mmap() since the mapped address has + * to be page aligned, and we use huge pages. 
+ */ +mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset, + prot, MAP_SHARED | MAP_NORESERVE, fd, 0); +if (mmap_addr == MAP_FAILED) { +vu_panic(dev, "region mmap error: %s", strerror(errno)); +return; +} +DPRINT("mmap_addr: 0x%016"PRIx64"\n", + (uint64_t)(uintptr_t)mmap_addr); + +r = &dev->regions[dev->nregions]; +r->gpa = msg_region->guest_phys_addr; +r->size = msg_region->memory_size; +r->qva = msg_region->userspace_addr; +r->mmap_addr = (uint64_t)(uintptr_t)mmap_addr; +r->mmap_offset = msg_region->mmap_offset; +dev->nregions++; + +if (dev->postcopy_listening) { +/* + * Return the address to QEMU so that it can translate the ufd + * fault addresses back. + */ +msg_region->userspace_addr = r->mmap_addr + r->mmap_offset; +} +} + static void vmsg_close_fds(VhostUserMsg *vmsg) { @@ -727,10 +782,7 @@ generate_faults(VuDev *dev) { static bool vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { int i; -bool track_ramblocks = dev->postcopy_listening; VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = &m; -VuDevRegion *dev_region = &dev->regions[dev->nregions]; -void *mmap_addr; if (vmsg->fd_num != 1) { vmsg_close_fds(vmsg); @@ -760,69 +812,20 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { * we know all the postcopy client bases have been received, and we * should start generating faults. 
*/ -if (track_ramblocks && +if (dev->postcopy_listening && vmsg->size == sizeof(vmsg->payload.u64) && vmsg->payload.u64 == 0) { (void)generate_faults(dev); return false; } -DPRINT("Adding region: %u\n", dev->nregions); -DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", - msg_region->guest_phys_addr); -DPRINT("memory_size: 0x%016"PRIx64"\n", - msg_region->memory_size); -DPRINT("userspace_addr 0x%016"PRIx64"\n", - msg_region->userspace_addr); -DPRINT("mmap_offset 0x%016"PRIx64"\n", - msg_region->mmap_offset); - -dev_region->gpa = msg_region->guest_phys_addr; -dev_region->size = msg_region->memory_size; -dev_region->qva = msg_region->userspace_addr; -dev_region->mmap_offset = msg_region->mmap_offset; - -/* - * We don't use offset argument of mmap() since the - * mapped address has to be page aligned, and we use huge - * pages. - */ -if (track_ramblocks) { -/* - * In postcopy we're using PROT_NONE here to catch anyone - * accessing it before we userfault. - */ -mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, -
[PATCH v2 08/14] libvhost-user: Don't search for duplicates when removing memory regions
We cannot have duplicate memory regions, something would be deeply flawed elsewhere. Let's just stop the search once we found an entry. We'll add more sanity checks when adding memory regions later. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 2e8b611385..7f29e01c30 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -896,8 +896,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { i--; found = true; - -/* Continue the search for eventual duplicates. */ +break; } } -- 2.43.0
[PATCH v2 02/14] libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
Let's support up to 509 mem slots, just like vhost in the kernel usually does and the rust vhost-user implementation recently [1] started doing. This is required to properly support memory hotplug, either using multiple DIMMs (ACPI supports up to 256) or using virtio-mem. The 509 used to be the KVM limit, it supported 512, but 3 were used for internal purposes. Currently, KVM supports more than 512, but it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot memory), except when other memory devices like PCI devices with BARs are used. So, 509 seems to work well for vhost in the kernel. Details can be found in the QEMU change that made virtio-mem consume up to 256 mem slots across all virtio-mem devices. [2] 509 mem slots implies 509 VMAs/mappings in the worst case (even though, in practice with virtio-mem we won't be seeing more than ~260 in most setups). With max_map_count under Linux defaulting to 64k, 509 mem slots still correspond to less than 1% of the maximum number of mappings. There are plenty left for the application to consume. [1] https://github.com/rust-vmm/vhost/pull/224 [2] https://lore.kernel.org/all/20230926185738.277351-1-da...@redhat.com/ Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h index c882b4e3a2..deb40e77b3 100644 --- a/subprojects/libvhost-user/libvhost-user.h +++ b/subprojects/libvhost-user/libvhost-user.h @@ -31,10 +31,12 @@ #define VHOST_MEMORY_BASELINE_NREGIONS 8 /* - * Set a reasonable maximum number of ram slots, which will be supported by - * any architecture. + * vhost in the kernel usually supports 509 mem slots. 509 used to be the + * KVM limit, it supported 512, but 3 were used for internal purposes. 
This + * limit is sufficient to support many DIMMs and virtio-mem in + * "dynamic-memslots" mode. */ -#define VHOST_USER_MAX_RAM_SLOTS 32 +#define VHOST_USER_MAX_RAM_SLOTS 509 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64) -- 2.43.0
[PATCH v2 00/14] libvhost-user: support more memslots and cleanup memslot handling code
This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots across all memory devices in "dynamic-memslot" mode (more details in patch #2). With that in place, this series optimizes and extends memory region handling in libvhost-user: * Heavily deduplicate and clean up the memory region handling code * Speeds up GPA->VA translation with many memslots using binary search * Optimize mmap_offset handling to use it as fd_offset for mmap() * Avoid ring remapping when adding a single memory region * Avoid dumping all guest memory, possibly allocating memory in sparse memory mappings when the process crashes I'm being very careful to not break some weird corner case that modern QEMU might no longer trigger, but an older one could have triggered or some other frontend might trigger. The only thing where I am not careful is to forbid memory regions that overlap in GPA space: it doesn't make any sense. With this series, virtio-mem (with dynamic-memslots=on) + qemu-storage-daemon works flawlessly and as expected in my tests. v1 -> v2: * Drop "libvhost-user: Fix msg_region->userspace_addr computation" -> Not actually required * "libvhost-user: Factor out adding a memory region" -> Make debug output more consistent (add missing ":") * "libvhost-user: Use most of mmap_offset as fd_offset" -> get_fd_pagesize -> get_fd_hugepagesize; remove getpagesize() -> "mmap_offset:" to "old mmap_offset:" in debug message -> "adj mmap_offset:" to "new mmap_offset:" in debug message -> Use "(unsigned int)fs.f_type"; the man page of fstatfs() calls out that the type of f_type can vary depending on the architecture. "unsigned int" is sufficient here. -> Updated patch description * Added RBs+ACKs * Did a Gitlab CI run, seems to be happy regarding libvhost-user Cc: Michael S. 
Tsirkin Cc: Jason Wang Cc: Stefan Hajnoczi Cc: Stefano Garzarella Cc: Germano Veit Michel Cc: Raphael Norwitz David Hildenbrand (14): libvhost-user: Dynamically allocate memory for memory slots libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509 libvhost-user: Factor out removing all mem regions libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec() libvhost-user: Factor out adding a memory region libvhost-user: No need to check for NULL when unmapping libvhost-user: Don't zero out memory for memory regions libvhost-user: Don't search for duplicates when removing memory regions libvhost-user: Factor out search for memory region by GPA and simplify libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va() libvhost-user: Use most of mmap_offset as fd_offset libvhost-user: Factor out vq usability check libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP subprojects/libvhost-user/libvhost-user.c | 595 -- subprojects/libvhost-user/libvhost-user.h | 10 +- 2 files changed, 334 insertions(+), 271 deletions(-) -- 2.43.0
[PATCH v2 12/14] libvhost-user: Factor out vq usability check
Let's factor it out to prepare for further changes. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 24 +++ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 55aef5fcc6..ed0a978d4f 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,6 +283,12 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +vu_is_vq_usable(VuDev *dev, VuVirtq *vq) +{ +return likely(!dev->broken) && likely(vq->vring.avail); +} + static size_t get_fd_hugepagesize(int fd) { @@ -2380,8 +2386,7 @@ vu_queue_get_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int *in_bytes, idx = vq->last_avail_idx; total_bufs = in_total = out_total = 0; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { goto done; } @@ -2496,8 +2501,7 @@ vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int in_bytes, bool vu_queue_empty(VuDev *dev, VuVirtq *vq) { -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return true; } @@ -2536,8 +2540,7 @@ vring_notify(VuDev *dev, VuVirtq *vq) static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync) { -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } @@ -2862,8 +2865,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz) unsigned int head; VuVirtqElement *elem; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return NULL; } @@ -3020,8 +3022,7 @@ vu_queue_fill(VuDev *dev, VuVirtq *vq, { struct vring_used_elem uelem; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } @@ -3050,8 +3051,7 @@ vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count) { uint16_t old, new; -if 
(unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } -- 2.43.0
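The factored-out predicate, shown in isolation with minimal stand-in structs (the real code additionally wraps both conditions in likely() branch hints):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins: only the fields the check touches are modeled. */
typedef struct {
    bool broken;
} VuDev;

typedef struct {
    struct { void *avail; } vring;
} VuVirtq;

/*
 * A virtqueue is usable only if the device is not broken and the
 * queue's avail ring has been set up (mapped). This replaces the
 * open-coded pair of checks repeated at seven call sites in the patch.
 */
static bool vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
{
    return !dev->broken && vq->vring.avail != NULL;
}
```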
[PATCH v2 14/14] libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP
We already use MADV_NORESERVE to deal with sparse memory regions. Let's also set madvise(MADV_DONTDUMP), otherwise a crash of the process can result in us allocating all memory in the mmap'ed region for dumping purposes. This change implies that the mmap'ed rings won't be included in a coredump. If ever required for debugging purposes, we could mark only the mapped rings MADV_DODUMP. Ignore errors during madvise() for now. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 61fb3050b3..a879149fef 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -460,6 +460,12 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) DPRINT("mmap_addr: 0x%016"PRIx64"\n", (uint64_t)(uintptr_t)mmap_addr); +#if defined(__linux__) +/* Don't include all guest memory in a coredump. */ +madvise(mmap_addr, msg_region->memory_size + mmap_offset, +MADV_DONTDUMP); +#endif + /* Shift all affected entries by 1 to open a hole at idx. */ r = >regions[idx]; memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx)); -- 2.43.0
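The pattern the patch applies can be sketched as a standalone snippet; map_region_nodump() is an illustrative helper name, not part of libvhost-user:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map an anonymous sparse region and mark it MADV_DONTDUMP, mirroring
 * what the patch does for vhost-user memory regions: without it, a
 * coredump would fault in and dump the whole (mostly unused) mapping.
 * madvise() failures are deliberately ignored, as in the patch.
 */
static void *map_region_nodump(size_t size)
{
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (addr == MAP_FAILED) {
        return NULL;
    }
#if defined(__linux__) && defined(MADV_DONTDUMP)
    /* Best effort: exclude this memory from coredumps. */
    madvise(addr, size, MADV_DONTDUMP);
#endif
    return addr;
}
```

If the rings ever need to be inspected in a dump, the complementary MADV_DODUMP could be applied to just the mapped ring pages, as the patch description notes.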
[PATCH v2 11/14] libvhost-user: Use most of mmap_offset as fd_offset
In the past, QEMU would create memory regions that could partially cover hugetlb pages, making mmap() fail if we would use the mmap_offset as an fd_offset. For that reason, we never used the mmap_offset as an offset into the fd and instead always mapped the fd from the very start. However, that can easily result in us mmap'ing a lot of unnecessary parts of an fd, possibly repeatedly.

QEMU nowadays does not create memory regions that partially cover huge pages -- it never really worked with postcopy. QEMU has handled merging of regions that partially cover huge pages (due to holes in boot memory) since 2018, in c1ece84e7c93 ("vhost: Huge page align and merge").

Let's be a bit careful and not unconditionally convert the mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb size and pass as much as we can as fd_offset, making sure that we call mmap() with a properly aligned offset.

With QEMU and a virtio-mem device that is fully plugged (50GiB using 50 memslots), the qemu-storage-daemon process consumes 1281GiB of VA space before this change and 58GiB after it.
Vhost user message Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size:40 Fds: 59 Adding region 4 guest_phys_addr: 0x0002 memory_size: 0x4000 userspace_addr: 0x7fb73bffe000 old mmap_offset: 0x8000 fd_offset: 0x8000 new mmap_offset: 0x mmap_addr: 0x7f02f1bdc000 Successfully added new region Vhost user message Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size:40 Fds: 59 Adding region 5 guest_phys_addr: 0x00024000 memory_size: 0x4000 userspace_addr: 0x7fb77bffe000 old mmap_offset: 0xc000 fd_offset: 0xc000 new mmap_offset: 0x mmap_addr: 0x7f028400 Successfully added new region Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 54 --- 1 file changed, 48 insertions(+), 6 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index ef6353d847..55aef5fcc6 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -43,6 +43,8 @@ #include #include #include +#include +#include #ifdef __NR_userfaultfd #include @@ -281,12 +283,32 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static size_t +get_fd_hugepagesize(int fd) +{ +#if defined(__linux__) +struct statfs fs; +int ret; + +do { +ret = fstatfs(fd, ); +} while (ret != 0 && errno == EINTR); + +if (!ret && (unsigned int)fs.f_type == HUGETLBFS_MAGIC) { +return fs.f_bsize; +} +#endif +return 0; +} + static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { const uint64_t start_gpa = msg_region->guest_phys_addr; const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; +uint64_t mmap_offset, fd_offset; +size_t hugepagesize; VuDevRegion *r; void *mmap_addr; int low = 0; @@ -300,7 +322,7 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) msg_region->memory_size); DPRINT("userspace_addr: 0x%016"PRIx64"\n", msg_region->userspace_addr); 
-DPRINT("mmap_offset: 0x%016"PRIx64"\n", +DPRINT("old mmap_offset: 0x%016"PRIx64"\n", msg_region->mmap_offset); if (dev->postcopy_listening) { @@ -335,11 +357,31 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) idx = low; /* - * We don't use offset argument of mmap() since the mapped address has - * to be page aligned, and we use huge pages. + * Convert most of msg_region->mmap_offset to fd_offset. In almost all + * cases, this will leave us with mmap_offset == 0, mmap()'ing only + * what we really need. Only if a memory region would partially cover + * hugetlb pages, we'd get mmap_offset != 0, which usually doesn't happen + * anymore (i.e., modern QEMU). + * + * Note that mmap() with hugetlb would fail if the offset into the file + * is not aligned to the huge page size. */ -mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset, - prot, MAP_SHARED | MAP_NORESERVE, fd, 0); +hugepagesize = get_fd_hugepagesize(fd); +if (hugepagesize) { +fd_offset = ALIGN_DOWN(msg_region->mmap_offset, hugepagesize); +mmap_offset = msg_region->mmap_offset - fd_offset; +} else { +
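The offset-splitting logic this patch introduces can be sketched as follows; split_mmap_offset() is an illustrative helper, and the non-hugetlb branch reflects my reading of the (truncated) diff above:

```c
#include <assert.h>
#include <stdint.h>

/* QEMU's ALIGN_DOWN (works for any positive alignment). */
#define ALIGN_DOWN(n, m) ((n) / (m) * (m))

/*
 * Split a region's mmap_offset into a huge-page-aligned fd_offset for
 * mmap() plus a small residual offset into the mapping. mmap() on
 * hugetlb fails unless the file offset is aligned to the huge page
 * size; with modern QEMU the residual ends up being 0, so only what is
 * really needed gets mapped.
 */
static void split_mmap_offset(uint64_t mmap_offset, uint64_t hugepagesize,
                              uint64_t *fd_offset, uint64_t *residual)
{
    if (hugepagesize) {
        *fd_offset = ALIGN_DOWN(mmap_offset, hugepagesize);
        *residual = mmap_offset - *fd_offset;
    } else {
        /* Regular pages: any page-aligned offset is fine as fd_offset. */
        *fd_offset = mmap_offset;
        *residual = 0;
    }
}
```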
[PATCH v2 03/14] libvhost-user: Factor out removing all mem regions
Let's factor it out. Note that the check for MAP_FAILED was wrong as we never set mmap_addr if mmap() failed. We'll remove the NULL check separately. Reviewed-by: Raphael Norwitz Acked-by: Stefano Garzarella Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 34 --- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 360c5366d6..e4907dfc26 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -240,6 +240,22 @@ qva_to_va(VuDev *dev, uint64_t qemu_addr) return NULL; } +static void +vu_remove_all_mem_regs(VuDev *dev) +{ +unsigned int i; + +for (i = 0; i < dev->nregions; i++) { +VuDevRegion *r = >regions[i]; +void *ma = (void *)(uintptr_t)r->mmap_addr; + +if (ma) { +munmap(ma, r->size + r->mmap_offset); +} +} +dev->nregions = 0; +} + static void vmsg_close_fds(VhostUserMsg *vmsg) { @@ -1003,14 +1019,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) unsigned int i; VhostUserMemory m = vmsg->payload.memory, *memory = -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; -void *ma = (void *) (uintptr_t) r->mmap_addr; - -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} -} +vu_remove_all_mem_regs(dev); dev->nregions = memory->nregions; if (dev->postcopy_listening) { @@ -2112,14 +2121,7 @@ vu_deinit(VuDev *dev) { unsigned int i; -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; -void *m = (void *) (uintptr_t) r->mmap_addr; -if (m != MAP_FAILED) { -munmap(m, r->size + r->mmap_offset); -} -} -dev->nregions = 0; +vu_remove_all_mem_regs(dev); for (i = 0; i < dev->max_queues; i++) { VuVirtq *vq = >vq[i]; -- 2.43.0
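The archived diff above lost its '&' characters (e.g. ">regions[i]" stands for "&dev->regions[i]"). Restored, and with minimal stand-in struct definitions, the factored-out helper reads:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>

/* Simplified stand-ins for the library's structs. */
typedef struct {
    uint64_t mmap_addr;
    uint64_t size;
    uint64_t mmap_offset;
} VuDevRegion;

typedef struct {
    VuDevRegion regions[16];
    unsigned int nregions;
} VuDev;

/* Unmap every region and reset the count (the factored-out helper). */
static void vu_remove_all_mem_regs(VuDev *dev)
{
    unsigned int i;

    for (i = 0; i < dev->nregions; i++) {
        VuDevRegion *r = &dev->regions[i];
        void *ma = (void *)(uintptr_t)r->mmap_addr;

        if (ma) {
            munmap(ma, r->size + r->mmap_offset);
        }
    }
    dev->nregions = 0;
}
```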
Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code
On 13.02.24 19:55, Michael S. Tsirkin wrote: On Tue, Feb 13, 2024 at 07:27:44PM +0100, David Hildenbrand wrote: On 13.02.24 18:33, Michael S. Tsirkin wrote: On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote: This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots across all memory devices in "dynamic-memslot" mode (more details in patch #3). Breaks build on some systems. E.g. https://gitlab.com/mstredhat/qemu/-/jobs/6163591599 ./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of integer expressions of different signedness: ‘long int’ and ‘unsigned int’ [-Werror=sign-compare] 369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) { | ^~ So easy to fix in v2, thanks! I think there is another problem around plugins though. There is a wrong checkpatch error: https://gitlab.com/mstredhat/qemu/-/jobs/6162397277 d96f29518232719b0c444ab93913e8515a6cb5c6:100: ERROR: use qemu_real_host_page_size() instead of getpagesize() total: 1 errors, 1 warnings, 81 lines checked qemu_real_host_page_size() is not available in libvhost-user. But I could just change that code to not require getpagesize() at all. Apart from that, I don't spot anything libvhost-user related (some qtest timeouts, a "error_setv: Assertion `*errp == NULL' failed."). Did I miss something? -- Cheers, David / dhildenb
Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code
On 13.02.24 18:33, Michael S. Tsirkin wrote: On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote: This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots across all memory devices in "dynamic-memslot" mode (more details in patch #3). Breaks build on some systems. E.g. https://gitlab.com/mstredhat/qemu/-/jobs/6163591599 ./subprojects/libvhost-user/libvhost-user.c:369:27: error: comparison of integer expressions of different signedness: ‘long int’ and ‘unsigned int’ [-Werror=sign-compare] 369 | if (!ret && fs.f_type == HUGETLBFS_MAGIC) { | ^~ So easy to fix in v2, thanks! -- Cheers, David / dhildenb
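The v2 fix discussed here boils down to the cast shown below. The fallback HUGETLBFS_MAGIC definition (normally from <linux/magic.h>) is added only so the sketch is self-contained:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#ifdef __linux__
#include <sys/vfs.h>
#endif

#ifndef HUGETLBFS_MAGIC
#define HUGETLBFS_MAGIC 0x958458f6 /* from <linux/magic.h> */
#endif

/*
 * Detect whether an fd is backed by hugetlbfs and return the huge page
 * size, or 0. The type of statfs.f_type varies by architecture (it is
 * 'long int' on some), so comparing it directly against the unsigned
 * magic number trips -Werror=sign-compare; casting to 'unsigned int'
 * is sufficient for the magic-number comparison, as the v2 changelog
 * notes.
 */
static size_t get_fd_hugepagesize(int fd)
{
#ifdef __linux__
    struct statfs fs;
    int ret;

    do {
        ret = fstatfs(fd, &fs);
    } while (ret != 0 && errno == EINTR);

    if (!ret && (unsigned int)fs.f_type == HUGETLBFS_MAGIC) {
        return fs.f_bsize;
    }
#endif
    return 0;
}
```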
Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation
On 13.02.24 18:32, Michael S. Tsirkin wrote: On Fri, Feb 02, 2024 at 10:53:18PM +0100, David Hildenbrand wrote: We barely had mmap_offset set in the past. With virtio-mem and dynamic-memslots that will change. In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are performing pointer arithmetics, which is wrong. Wrong how? I suspect you mean arithmetic on void * pointers is not portable? You are absolutely right, no idea how I concluded that we might have a different pointer size here. I'll either convert this patch into a cleanup or drop it for v2, thanks! -- Cheers, David / dhildenb
Re: [PATCH V3 09/13] migration: notifier error checking
On 08.02.24 19:54, Steve Sistare wrote: Check the status returned by migration notifiers and report errors. If notifiers fail, call the notifiers again so they can clean up. IIUC, if any of the notifiers will actually start to fail, say, during MIG_EVENT_PRECOPY_SETUP, you will call MIG_EVENT_PRECOPY_FAILED on all notifiers. That will include notifiers that have never seen a MIG_EVENT_PRECOPY_SETUP call. Is that what we expect notifiers to be able to deal with? Can we document that? -- Cheers, David / dhildenb
Re: [PATCH V3 08/13] migration: refactor migrate_fd_connect failures
On 08.02.24 19:54, Steve Sistare wrote: Move common code for the error path in migrate_fd_connect to a shared fail label. No functional change. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 07/13] migration: per-mode notifiers
On 08.02.24 19:54, Steve Sistare wrote: Keep a separate list of migration notifiers for each migration mode. Suggested-by: Peter Xu Signed-off-by: Steve Sistare --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 06/13] migration: MigrationNotifyFunc
On 08.02.24 19:53, Steve Sistare wrote: Define MigrationNotifyFunc to improve type safety and simplify migration notifiers. Signed-off-by: Steve Sistare --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 04/13] migration: MigrationEvent for notifiers
On 08.02.24 19:53, Steve Sistare wrote: Passing MigrationState to notifiers is unsound because they could access unstable migration state internals or even modify the state. Instead, pass the minimal info needed in a new MigrationEvent struct, which could be extended in the future if needed. Suggested-by: Peter Xu Signed-off-by: Steve Sistare --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 03/13] migration: convert to NotifierWithReturn
On 08.02.24 19:53, Steve Sistare wrote: Change all migration notifiers to type NotifierWithReturn, so notifiers can return an error status in a future patch. For now, pass NULL for the notifier error parameter, and do not check the return value. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 02/13] migration: remove error from notifier data
On 08.02.24 19:53, Steve Sistare wrote: Remove the error object from opaque data passed to notifiers. Use the new error parameter passed to the notifier instead. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu --- Would have squashed #1 and #2. Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH V3 01/13] notify: pass error to notifier with return
On 08.02.24 19:53, Steve Sistare wrote: Pass an error object as the third parameter to "notifier with return" notifiers, so clients no longer need to bundle an error object in the opaque data. The new parameter is used in a later patch. Signed-off-by: Steve Sistare Reviewed-by: Peter Xu --- Reviewed-by: David Hildenbrand -- Cheers, David / dhildenb
Re: [PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code
On 07.02.24 12:40, Stefano Garzarella wrote: On Fri, Feb 02, 2024 at 10:53:17PM +0100, David Hildenbrand wrote: This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem that uses up to 256 memslots across all memory devices in "dynamic-memslot" mode (more details in patch #3). One simple fix upfront. With that in place, this series optimizes and extends memory region handling in libvhost-user: * Heavily deduplicate and clean up the memory region handling code * Speeds up GPA->VA translation with many memslots using binary search * Optimize mmap_offset handling to use it as fd_offset for mmap() * Avoid ring remapping when adding a single memory region * Avoid dumping all guest memory, possibly allocating memory in sparse memory mappings when the process crashes I'm being very careful to not break some weird corner case that modern QEMU might no longer trigger, but older ones could have triggered or some other frontend might trigger. The only thing where I am not careful is to forbid memory regions that overlap in GPA space: it doesn't make any sense. With this series, virtio-mem (with dynamic-memslots=on) + qemu-storage-daemon works flawlessly and as expected in my tests. I don't know much about this code, but I didn't find anything wrong with it. Thank you also for the great cleanup! Acked-by: Stefano Garzarella Thanks Stefano for the review, highly appreciated! -- Cheers, David / dhildenb
[PULL v3 3/3] oslib-posix: initialize backend memory objects in parallel
From: Mark Kanda QEMU initializes preallocated backend memory as the objects are parsed from the command line. This is not optimal in some cases (e.g. memory spanning multiple NUMA nodes) because the memory objects are initialized in series. Allow the initialization to occur in parallel (asynchronously). In order to ensure optimal thread placement, asynchronous initialization requires prealloc context threads to be in use. Signed-off-by: Mark Kanda Message-ID: <20240131165327.3154970-2-mark.ka...@oracle.com> Tested-by: Mario Casquero Signed-off-by: David Hildenbrand --- backends/hostmem.c | 7 ++- hw/virtio/virtio-mem.c | 4 +- include/hw/qdev-core.h | 5 ++ include/qemu/osdep.h | 18 +- system/vl.c| 9 +++ util/oslib-posix.c | 131 +++-- util/oslib-win32.c | 8 ++- 7 files changed, 145 insertions(+), 37 deletions(-) diff --git a/backends/hostmem.c b/backends/hostmem.c index 987f6f591e..81a72ce40b 100644 --- a/backends/hostmem.c +++ b/backends/hostmem.c @@ -20,6 +20,7 @@ #include "qom/object_interfaces.h" #include "qemu/mmap-alloc.h" #include "qemu/madvise.h" +#include "hw/qdev-core.h" #ifdef CONFIG_NUMA #include @@ -237,7 +238,7 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, uint64_t sz = memory_region_size(>mr); if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, - backend->prealloc_context, errp)) { + backend->prealloc_context, false, errp)) { return; } backend->prealloc = true; @@ -323,6 +324,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc); void *ptr; uint64_t sz; +bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED); if (!bc->alloc) { return; @@ -402,7 +404,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(>mr), ptr, sz, backend->prealloc_threads, -backend->prealloc_context, errp)) { +backend->prealloc_context, +async, errp)) { return; } } diff --git 
a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c index 99ab989852..ffd119ebac 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -605,7 +605,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, int fd = memory_region_get_fd(>memdev->mr); Error *local_err = NULL; -if (!qemu_prealloc_mem(fd, area, size, 1, NULL, _err)) { +if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, _err)) { static bool warned; /* @@ -1248,7 +1248,7 @@ static int virtio_mem_prealloc_range_cb(VirtIOMEM *vmem, void *arg, int fd = memory_region_get_fd(>memdev->mr); Error *local_err = NULL; -if (!qemu_prealloc_mem(fd, area, size, 1, NULL, _err)) { +if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, _err)) { error_report_err(local_err); return -ENOMEM; } diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h index d47536eadb..9228e96c87 100644 --- a/include/hw/qdev-core.h +++ b/include/hw/qdev-core.h @@ -1083,6 +1083,11 @@ typedef enum MachineInitPhase { */ PHASE_ACCEL_CREATED, +/* + * Late backend objects have been created and initialized. + */ +PHASE_LATE_BACKENDS_CREATED, + /* * machine_class->init has been called, thus creating any embedded * devices and validating machine properties. 
Devices created at diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index c9692cc314..7d359dabc4 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -680,6 +680,8 @@ typedef struct ThreadContext ThreadContext; * @area: start address of the are to preallocate * @sz: the size of the area to preallocate * @max_threads: maximum number of threads to use + * @tc: prealloc context threads pointer, NULL if not in use + * @async: request asynchronous preallocation, requires @tc * @errp: returns an error if this function fails * * Preallocate memory (populate/prefault page tables writable) for the virtual @@ -687,10 +689,24 @@ typedef struct ThreadContext ThreadContext; * each page in the area was faulted in writable at least once, for example, * after allocating file blocks for mapped files. * + * When setting @async, allocation might be performed asynchronously. + * qemu_finish_async_prealloc_mem() must be called to finish any asynchronous + * preallocation.
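The core idea of the parallel preallocation above -- fault pages in from several threads instead of one -- can be sketched without QEMU's ThreadContext machinery. Thread placement and the asynchronous completion API are omitted, and all names here are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NTHREADS 4

typedef struct {
    char *start;
    size_t len;
} PreallocChunk;

/* Write one byte per page so the kernel faults each page in writable. */
static void *prefault_chunk(void *arg)
{
    PreallocChunk *c = arg;

    for (size_t off = 0; off < c->len; off += PAGE_SIZE) {
        ((volatile char *)c->start)[off] = 0;
    }
    return NULL;
}

/* Fault 'len' bytes in parallel; returns 0 on success, -1 on error. */
static int parallel_prealloc(char *area, size_t len)
{
    pthread_t threads[NTHREADS];
    PreallocChunk chunks[NTHREADS];
    size_t per = (len / NTHREADS) / PAGE_SIZE * PAGE_SIZE;

    for (int i = 0; i < NTHREADS; i++) {
        chunks[i].start = area + (size_t)i * per;
        chunks[i].len = (i == NTHREADS - 1) ? len - (size_t)i * per : per;
        if (pthread_create(&threads[i], NULL, prefault_chunk, &chunks[i])) {
            return -1;
        }
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```

The actual patch gains its speedup by running this kind of prefault concurrently across backends at startup, with prealloc context threads ensuring NUMA-aware placement.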
[PULL v3 1/3] hv-balloon: use get_min_alignment() to express 32 GiB alignment
Let's implement the get_min_alignment() callback for memory devices, and copy for the device memory region the alignment of the host memory region. This mimics what virtio-mem does, and allows for re-introducing proper alignment checks for the memory region size (where we don't care about additional device requirements) in memory device core. Message-ID: <20240117135554.787344-2-da...@redhat.com> Reviewed-by: Maciej S. Szmigiero Signed-off-by: David Hildenbrand --- hw/hyperv/hv-balloon.c | 37 + 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 0238365712..ade283335a 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1477,22 +1477,7 @@ static void hv_balloon_ensure_mr(HvBalloon *balloon) balloon->mr = g_new0(MemoryRegion, 1); memory_region_init(balloon->mr, OBJECT(balloon), TYPE_HV_BALLOON, memory_region_size(hostmem_mr)); - -/* - * The VM can indicate an alignment up to 32 GiB. Memory device core can - * usually only handle/guarantee 1 GiB alignment. The user will have to - * specify a larger maxmem eventually. - * - * The memory device core will warn the user in case maxmem might have to be - * increased and will fail plugging the device if there is not sufficient - * space after alignment. - * - * TODO: we could do the alignment ourselves in a slightly bigger region. - * But this feels better, although the warning might be annoying. Maybe - * we can optimize that in the future (e.g., with such a device on the - * cmdline place/size the device memory region differently. 
- */ -balloon->mr->align = MAX(32 * GiB, memory_region_get_alignment(hostmem_mr)); +balloon->mr->align = memory_region_get_alignment(hostmem_mr); } static void hv_balloon_free_mr(HvBalloon *balloon) @@ -1654,6 +1639,25 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md, return balloon->mr; } +static uint64_t hv_balloon_md_get_min_alignment(const MemoryDeviceState *md) +{ +/* + * The VM can indicate an alignment up to 32 GiB. Memory device core can + * usually only handle/guarantee 1 GiB alignment. The user will have to + * specify a larger maxmem eventually. + * + * The memory device core will warn the user in case maxmem might have to be + * increased and will fail plugging the device if there is not sufficient + * space after alignment. + * + * TODO: we could do the alignment ourselves in a slightly bigger region. + * But this feels better, although the warning might be annoying. Maybe + * we can optimize that in the future (e.g., with such a device on the + * cmdline place/size the device memory region differently. + */ +return 32 * GiB; +} + static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md, MemoryDeviceInfo *info) { @@ -1766,5 +1770,6 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data) mdc->get_memory_region = hv_balloon_md_get_memory_region; mdc->decide_memslots = hv_balloon_decide_memslots; mdc->get_memslots = hv_balloon_get_memslots; +mdc->get_min_alignment = hv_balloon_md_get_min_alignment; mdc->fill_device_info = hv_balloon_md_fill_device_info; } -- 2.43.0
[PULL v3 2/3] memory-device: reintroduce memory region size check
We used to check that the memory region size is multiples of the overall requested address alignment for the device memory address. We removed that check, because there are cases (i.e., hv-balloon) where devices unconditionally request an address alignment that has a very large alignment (i.e., 32 GiB), but the actual memory device size might not be multiples of that alignment.

However, this change:
(a) allows for some practically impossible DIMM sizes, like "1GB+1 byte".
(b) allows for DIMMs that partially cover hugetlb pages, previously reported in [1].

Both scenarios don't make any sense: we might even waste memory. So let's reintroduce that check, but only check that the memory region size is multiples of the memory region alignment (i.e., page size, huge page size), but not any additional memory device requirements communicated using md->get_min_alignment().

The following examples now fail again as expected:

(a) 1M with 2M THP
 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
 -object memory-backend-ram,id=mem1,size=1M \
 -device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x20

(b) 1G+1byte
 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
 -object memory-backend-ram,id=mem1,size=1073741825B \
 -device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x20

(c) Unaligned hugetlb size (2M)
 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
 -object memory-backend-file,id=mem1,mem-path=/dev/hugepages/tmp,size=511M \
 -device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x20

(d) Unaligned hugetlb size (1G)
 qemu-system-x86_64 -m 4g,maxmem=16g,slots=1 -S -nodefaults -nographic \
 -object memory-backend-file,id=mem1,mem-path=/dev/hugepages1G/tmp,size=2047M \
 -device pc-dimm,id=dimm1,memdev=mem1
 -> backend memory size must be multiple of 0x4000

Note that this fix depends on a hv-balloon change to communicate its additional alignment
requirements using get_min_alignment() instead of through the memory region. [1] https://lkml.kernel.org/r/f77d641d500324525ac036fe1827b3070de75fc1.1701088320.git.mpriv...@redhat.com Message-ID: <20240117135554.787344-3-da...@redhat.com> Reported-by: Zhenyu Zhang Reported-by: Michal Privoznik Fixes: eb1b7c4bd413 ("memory-device: Drop size alignment check") Tested-by: Zhenyu Zhang Tested-by: Mario Casquero Reviewed-by: Maciej S. Szmigiero Signed-off-by: David Hildenbrand --- hw/mem/memory-device.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/hw/mem/memory-device.c b/hw/mem/memory-device.c index a1b1af26bc..e098585cda 100644 --- a/hw/mem/memory-device.c +++ b/hw/mem/memory-device.c @@ -374,6 +374,20 @@ void memory_device_pre_plug(MemoryDeviceState *md, MachineState *ms, goto out; } +/* + * We always want the memory region size to be multiples of the memory + * region alignment: for example, DIMMs with 1G+1byte size don't make + * any sense. Note that we don't check that the size is multiples + * of any additional alignment requirements the memory device might + * have when it comes to the address in physical address space. + */ +if (!QEMU_IS_ALIGNED(memory_region_size(mr), + memory_region_get_alignment(mr))) { +error_setg(errp, "backend memory size must be multiple of 0x%" + PRIx64, memory_region_get_alignment(mr)); +return; +} + if (legacy_align) { align = *legacy_align; } else { -- 2.43.0
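The reintroduced check boils down to a divisibility test. The helper name below is illustrative, and QEMU_IS_ALIGNED is modeled here as a simple modulo check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Modeled after QEMU's alignment check. */
#define QEMU_IS_ALIGNED(n, m) (((n) % (m)) == 0)

/*
 * The pre-plug check in spirit: reject backend sizes that are not
 * multiples of the memory region's own (page / huge page) alignment.
 * Additional device alignment from get_min_alignment() is deliberately
 * NOT checked here, so hv-balloon's 32 GiB request stays unaffected.
 */
static bool memory_device_size_ok(uint64_t region_size, uint64_t alignment)
{
    return QEMU_IS_ALIGNED(region_size, alignment);
}
```

This reproduces the commit message's examples: a 1M or 511M backend on 2M huge pages is rejected, while properly aligned sizes pass.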
[PULL v3 0/3] Host Memory Backends and Memory devices queue 2024-02-06
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into staging (2024-02-03 13:31:58 +) are available in the Git repository at: https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06-v3 for you to fetch changes up to 04accf43df83aa10f06f7dbda3ecf0db97f0c5a6: oslib-posix: initialize backend memory objects in parallel (2024-02-06 08:15:22 +0100) Hi, "Host Memory Backends" and "Memory devices" queue ("mem"): - Reintroduce memory region size checks for memory devices; the removal led to some undesired side effects - Preallocate memory of memory backends in selected configurations asynchronously (so we preallocate concurrently), to speed up QEMU startup time. -------- David Hildenbrand (2): hv-balloon: use get_min_alignment() to express 32 GiB alignment memory-device: reintroduce memory region size check Mark Kanda (1): oslib-posix: initialize backend memory objects in parallel backends/hostmem.c | 7 ++- hw/hyperv/hv-balloon.c | 37 +++- hw/mem/memory-device.c | 14 + hw/virtio/virtio-mem.c | 4 +- include/hw/qdev-core.h | 5 ++ include/qemu/osdep.h | 18 +- system/vl.c| 9 +++ util/oslib-posix.c | 131 +++-- util/oslib-win32.c | 8 ++- 9 files changed, 180 insertions(+), 53 deletions(-) -- 2.43.0
Re: [PULL v2 0/3] Host Memory Backends and Memory devices queue 2024-02-06
On 06.02.24 08:14, David Hildenbrand wrote: The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into staging (2024-02-03 13:31:58 +) are available in the Git repository at: https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06-v2 for you to fetch changes up to 8eb38ab662e74d618d473a9a52d71c644926c3c0: oslib-posix: initialize backend memory objects in parallel (2024-02-06 08:09:55 +0100) Hi, "Host Memory Backends" and "Memory devices" queue ("mem"): - Reintroduce memory region size checks for memory devices; the removal led to some undesired side effects - Preallocate memory of memory backends in selected configurations asynchronously (so we preallocate concurrently), to speed up QEMU startup time. ... and yet another instance of the same mail address is wrong. Gah. Maybe v3 will do ... -- Cheers, David / dhildenb
[PULL v2 1/3] hv-balloon: use get_min_alignment() to express 32 GiB alignment
Let's implement the get_min_alignment() callback for memory devices, and copy for the device memory region the alignment of the host memory region. This mimics what virtio-mem does, and allows for re-introducing proper alignment checks for the memory region size (where we don't care about additional device requirements) in memory device core. Message-ID: <20240117135554.787344-2-da...@redhat.com> Reviewed-by: Maciej S. Szmigiero Signed-off-by: David Hildenbrand --- hw/hyperv/hv-balloon.c | 37 + 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 0238365712..ade283335a 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1477,22 +1477,7 @@ static void hv_balloon_ensure_mr(HvBalloon *balloon) balloon->mr = g_new0(MemoryRegion, 1); memory_region_init(balloon->mr, OBJECT(balloon), TYPE_HV_BALLOON, memory_region_size(hostmem_mr)); - -/* - * The VM can indicate an alignment up to 32 GiB. Memory device core can - * usually only handle/guarantee 1 GiB alignment. The user will have to - * specify a larger maxmem eventually. - * - * The memory device core will warn the user in case maxmem might have to be - * increased and will fail plugging the device if there is not sufficient - * space after alignment. - * - * TODO: we could do the alignment ourselves in a slightly bigger region. - * But this feels better, although the warning might be annoying. Maybe - * we can optimize that in the future (e.g., with such a device on the - * cmdline place/size the device memory region differently. 
- */ -balloon->mr->align = MAX(32 * GiB, memory_region_get_alignment(hostmem_mr)); +balloon->mr->align = memory_region_get_alignment(hostmem_mr); } static void hv_balloon_free_mr(HvBalloon *balloon) @@ -1654,6 +1639,25 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md, return balloon->mr; } +static uint64_t hv_balloon_md_get_min_alignment(const MemoryDeviceState *md) +{ +/* + * The VM can indicate an alignment up to 32 GiB. Memory device core can + * usually only handle/guarantee 1 GiB alignment. The user will have to + * specify a larger maxmem eventually. + * + * The memory device core will warn the user in case maxmem might have to be + * increased and will fail plugging the device if there is not sufficient + * space after alignment. + * + * TODO: we could do the alignment ourselves in a slightly bigger region. + * But this feels better, although the warning might be annoying. Maybe + * we can optimize that in the future (e.g., with such a device on the + * cmdline place/size the device memory region differently. + */ +return 32 * GiB; +} + static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md, MemoryDeviceInfo *info) { @@ -1766,5 +1770,6 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data) mdc->get_memory_region = hv_balloon_md_get_memory_region; mdc->decide_memslots = hv_balloon_decide_memslots; mdc->get_memslots = hv_balloon_get_memslots; +mdc->get_min_alignment = hv_balloon_md_get_min_alignment; mdc->fill_device_info = hv_balloon_md_fill_device_info; } -- 2.43.0
[PULL v2 0/3] Host Memory Backends and Memory devices queue 2024-02-06
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into staging (2024-02-03 13:31:58 +) are available in the Git repository at: https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06-v2 for you to fetch changes up to 8eb38ab662e74d618d473a9a52d71c644926c3c0: oslib-posix: initialize backend memory objects in parallel (2024-02-06 08:09:55 +0100) Hi, "Host Memory Backends" and "Memory devices" queue ("mem"): - Reintroduce memory region size checks for memory devices; the removal led to some undesired side effects - Preallocate memory of memory backends in selected configurations asynchronously (so we preallocate concurrently), to speed up QEMU startup time. -------- David Hildenbrand (2): hv-balloon: use get_min_alignment() to express 32 GiB alignment memory-device: reintroduce memory region size check Mark Kanda (1): oslib-posix: initialize backend memory objects in parallel backends/hostmem.c | 7 ++- hw/hyperv/hv-balloon.c | 37 +++- hw/mem/memory-device.c | 14 + hw/virtio/virtio-mem.c | 4 +- include/hw/qdev-core.h | 5 ++ include/qemu/osdep.h | 18 +- system/vl.c| 9 +++ util/oslib-posix.c | 131 +++-- util/oslib-win32.c | 8 ++- 9 files changed, 180 insertions(+), 53 deletions(-) -- 2.43.0
Re: [PULL 0/3] Host Memory Backends and Memory devices queue 2024-02-6
On 06.02.24 08:02, David Hildenbrand wrote: The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into staging (2024-02-03 13:31:58 +) are available in the Git repository at: https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06 for you to fetch changes up to a32c31979a852a07d590effa1586f6ab4fa4d784: oslib-posix: initialize backend memory objects in parallel (2024-02-04 17:51:13 +0100) Hi, "Host Memory Backends" and "Memory devices" queue ("mem"): - Reintroduce memory region size checks for memory devices; the removal led to some undesired side effects - Preallocate memory of memory backends in selected configurations asynchronously (so we preallocate concurrently), to speed up QEMU startup time. A mail address in the second patch is wrong. Will send a v2. -- Cheers, David / dhildenb
[PULL 0/3] Host Memory Backends and Memory devices queue 2024-02-6
The following changes since commit 39a6e4f87e7b75a45b08d6dc8b8b7c2954c87440: Merge tag 'pull-qapi-2024-02-03' of https://repo.or.cz/qemu/armbru into staging (2024-02-03 13:31:58 +) are available in the Git repository at: https://github.com/davidhildenbrand/qemu.git tags/mem-2024-02-06 for you to fetch changes up to a32c31979a852a07d590effa1586f6ab4fa4d784: oslib-posix: initialize backend memory objects in parallel (2024-02-04 17:51:13 +0100) Hi, "Host Memory Backends" and "Memory devices" queue ("mem"): - Reintroduce memory region size checks for memory devices; the removal led to some undesired side effects - Preallocate memory of memory backends in selected configurations asynchronously (so we preallocate concurrently), to speed up QEMU startup time. -------- David Hildenbrand (2): hv-balloon: use get_min_alignment() to express 32 GiB alignment memory-device: reintroduce memory region size check Mark Kanda (1): oslib-posix: initialize backend memory objects in parallel backends/hostmem.c | 7 ++- hw/hyperv/hv-balloon.c | 37 +++- hw/mem/memory-device.c | 14 + hw/virtio/virtio-mem.c | 4 +- include/hw/qdev-core.h | 5 ++ include/qemu/osdep.h | 18 +- system/vl.c| 9 +++ util/oslib-posix.c | 131 +++-- util/oslib-win32.c | 8 ++- 9 files changed, 180 insertions(+), 53 deletions(-) -- 2.43.0
[PULL 1/3] hv-balloon: use get_min_alignment() to express 32 GiB alignment
Let's implement the get_min_alignment() callback for memory devices, and copy for the device memory region the alignment of the host memory region. This mimics what virtio-mem does, and allows for re-introducing proper alignment checks for the memory region size (where we don't care about additional device requirements) in memory device core. Message-ID: <20240117135554.787344-2-da...@redhat.com> Reviewed-by: Maciej S. Szmigiero Signed-off-by: David Hildenbrand --- hw/hyperv/hv-balloon.c | 37 + 1 file changed, 21 insertions(+), 16 deletions(-) diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c index 0238365712..ade283335a 100644 --- a/hw/hyperv/hv-balloon.c +++ b/hw/hyperv/hv-balloon.c @@ -1477,22 +1477,7 @@ static void hv_balloon_ensure_mr(HvBalloon *balloon) balloon->mr = g_new0(MemoryRegion, 1); memory_region_init(balloon->mr, OBJECT(balloon), TYPE_HV_BALLOON, memory_region_size(hostmem_mr)); - -/* - * The VM can indicate an alignment up to 32 GiB. Memory device core can - * usually only handle/guarantee 1 GiB alignment. The user will have to - * specify a larger maxmem eventually. - * - * The memory device core will warn the user in case maxmem might have to be - * increased and will fail plugging the device if there is not sufficient - * space after alignment. - * - * TODO: we could do the alignment ourselves in a slightly bigger region. - * But this feels better, although the warning might be annoying. Maybe - * we can optimize that in the future (e.g., with such a device on the - * cmdline place/size the device memory region differently. 
- */ -balloon->mr->align = MAX(32 * GiB, memory_region_get_alignment(hostmem_mr)); +balloon->mr->align = memory_region_get_alignment(hostmem_mr); } static void hv_balloon_free_mr(HvBalloon *balloon) @@ -1654,6 +1639,25 @@ static MemoryRegion *hv_balloon_md_get_memory_region(MemoryDeviceState *md, return balloon->mr; } +static uint64_t hv_balloon_md_get_min_alignment(const MemoryDeviceState *md) +{ +/* + * The VM can indicate an alignment up to 32 GiB. Memory device core can + * usually only handle/guarantee 1 GiB alignment. The user will have to + * specify a larger maxmem eventually. + * + * The memory device core will warn the user in case maxmem might have to be + * increased and will fail plugging the device if there is not sufficient + * space after alignment. + * + * TODO: we could do the alignment ourselves in a slightly bigger region. + * But this feels better, although the warning might be annoying. Maybe + * we can optimize that in the future (e.g., with such a device on the + * cmdline place/size the device memory region differently. + */ +return 32 * GiB; +} + static void hv_balloon_md_fill_device_info(const MemoryDeviceState *md, MemoryDeviceInfo *info) { @@ -1766,5 +1770,6 @@ static void hv_balloon_class_init(ObjectClass *klass, void *data) mdc->get_memory_region = hv_balloon_md_get_memory_region; mdc->decide_memslots = hv_balloon_decide_memslots; mdc->get_memslots = hv_balloon_get_memslots; +mdc->get_min_alignment = hv_balloon_md_get_min_alignment; mdc->fill_device_info = hv_balloon_md_fill_device_info; } -- 2.43.0
Re: [GIT PULL 5/8] util: Add write-only "node-affinity" property for ThreadContext
On 05.02.24 17:13, Claudio Fontana wrote: Hello David, Hi, It would seem to me that a lot of the calling code like qemu_prealloc_mem for example should be sysemu-only, not used for tools, or user mode either right? And the thread_context.c itself should also be sysemu-only, correct? Yes, both should currently only get used for sysemu only. Memory backends and virtio-mem are sysemu-only. -- Cheers, David / dhildenb
Re: [GIT PULL 5/8] util: Add write-only "node-affinity" property for ThreadContext
On 05.02.24 11:14, Claudio Fontana wrote: Hi, Hi Claudio, turning pages back in time, noticed that in recent qemu-img binaries we include an ELF dependency on libnuma.so that seems unused. I think it stems from this commit: commit 10218ae6d006f76410804cc4dc690085b3d008b5 Author: David Hildenbrand Date: Fri Oct 14 15:47:17 2022 +0200 util: Add write-only "node-affinity" property for ThreadContext possibly this hunk? diff --git a/util/meson.build b/util/meson.build index e97cd2d779..c0a7bc54d4 100644 --- a/util/meson.build +++ b/util/meson.build @@ -1,5 +1,5 @@ util_ss.add(files('osdep.c', 'cutils.c', 'unicode.c', 'qemu-timer-common.c')) -util_ss.add(files('thread-context.c')) +util_ss.add(files('thread-context.c'), numa) if not config_host_data.get('CONFIG_ATOMIC64') util_ss.add(files('atomic64.c')) endif I wonder if there is some conditional we could use to avoid the apparently useless dependency to libnuma in the qemu-img binary? the simplest change is probably moving the thread-context stuff out of util (as you say, it's currently only used by QEMU itself). -- Cheers, David / dhildenb
Re: [PATCH v1 11/15] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
On 04.02.24 23:07, Raphael Norwitz wrote: On Sun, Feb 4, 2024 at 9:51 AM David Hildenbrand wrote: On 04.02.24 03:10, Raphael Norwitz wrote: One comment on this one. On Fri, Feb 2, 2024 at 4:56 PM David Hildenbrand wrote: Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 49 +-- 1 file changed, 45 insertions(+), 4 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d036b54ed0..75e47b7bb3 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...) static VuDevRegion * vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) { -unsigned int i; +int low = 0; +int high = dev->nregions - 1; /* * Memory regions cannot overlap in guest physical address space. Each * GPA belongs to exactly one memory region, so there can only be one * match. + * + * We store our memory regions ordered by GPA and can simply perform a + * binary search. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *cur = &dev->regions[i]; +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = &dev->regions[mid]; if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { return cur; } +if (guest_addr >= cur->gpa + cur->size) { +low = mid + 1; +} +if (guest_addr < cur->gpa) { +high = mid - 1; +} } return NULL; } @@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev) static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { +const uint64_t start_gpa = msg_region->guest_phys_addr; +const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; VuDevRegion *r; void *mmap_addr; +int low = 0; +int high = dev->nregions - 1; +unsigned int idx; DPRINT("Adding region %d\n", dev->nregions); DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", @@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) prot = PROT_NONE; } +/* + * We will add memory regions into the array sorted by GPA. Perform a + * binary search to locate the insertion point: it will be at the low + * index. + */ +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = &dev->regions[mid]; + +/* Overlap of GPA addresses. */ Looks like this check will only catch if the new region is fully contained within an existing region. I think we need to check whether either start or end region are in the range, i.e.: That check should cover all cases of overlaps, not just fully contained.
See the QEMU implementation of range_overlaps_range() that contains similar logic: return !(range2->upb < range1->lob || range1->upb < range2->lob); !(range2->upb < range1->lob || range1->upb < range2->lob); = !(range2->upb < range1->lob) && !(range1->upb < range2->lob) = range2->upb >= range1->lob && range1->upb >= range2->lob = range1->lob <= range2->upb && range2->lob <= range1->upb In QEMU, upb is inclusive, if it were exclusive (like we have here): = range1->lob < range2->upb && range2->lob < range1->upb Which is what we have here with: range1->lob = start_gpa range1->upb = end_gpa range2->lob = cur->gpa range2->upb = cur->gpa + cur->size Also if you are interested, see https://stackoverflow.com/questions/3269434/whats-the-most-efficient-way-to-test-if-two-ranges-overlap Thanks! Got it, thanks for the full explanation. With that: Reviewed-by: Raphael Norwitz Thanks! -- Cheers, David / dhildenb
Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation
On 04.02.24 23:01, Raphael Norwitz wrote: On Sun, Feb 4, 2024 at 9:36 AM David Hildenbrand wrote: On 04.02.24 02:35, Raphael Norwitz wrote: As a heads up, I've left Nutanix and updated it in MAINTAINERS. Will be updating it again shortly so tagging these with my new work email. Thanks for the fast review! The mail server already complained to me :) Maybe consider adding yourself as reviewer for vhost as well? (which covers libvhost-user), I took your mail address from git history, not get_maintainers.pl. I don't expect I'll have much time to review code outside of vhost-user-blk/vhost-user-scsi, but happy to add an entry if it helps folks tag me on relevant patches. If it helps, it might make sense to split out libvhost-user into a separate MAINTAINERS section. -- Cheers, David / dhildenb
Re: [PATCH v4 1/1] oslib-posix: initialize backend memory objects in parallel
On 03.02.24 23:43, Dongli Zhang wrote: On 1/31/24 08:53, Mark Kanda wrote: QEMU initializes preallocated backend memory as the objects are parsed from the command line. This is not optimal in some cases (e.g. memory spanning multiple NUMA nodes) because the memory objects are initialized in series. Allow the initialization to occur in parallel (asynchronously). In order to ensure optimal thread placement, asynchronous initialization requires prealloc context threads to be in use. Signed-off-by: Mark Kanda Signed-off-by: David Hildenbrand --- backends/hostmem.c | 7 ++- hw/virtio/virtio-mem.c | 4 +- include/hw/qdev-core.h | 5 ++ include/qemu/osdep.h | 18 +- system/vl.c| 9 +++ util/oslib-posix.c | 131 +++-- util/oslib-win32.c | 8 ++- 7 files changed, 145 insertions(+), 37 deletions(-) diff --git a/backends/hostmem.c b/backends/hostmem.c index 30f69b2cb5..17221e422a 100644 --- a/backends/hostmem.c +++ b/backends/hostmem.c @@ -20,6 +20,7 @@ #include "qom/object_interfaces.h" #include "qemu/mmap-alloc.h" #include "qemu/madvise.h" +#include "hw/qdev-core.h" #ifdef CONFIG_NUMA #include @@ -237,7 +238,7 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, uint64_t sz = memory_region_size(>mr); if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, - backend->prealloc_context, errp)) { + backend->prealloc_context, false, errp)) { return; } backend->prealloc = true; @@ -323,6 +324,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc); void *ptr; uint64_t sz; +bool async = !phase_check(PHASE_LATE_BACKENDS_CREATED); if (!bc->alloc) { return; @@ -398,7 +400,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(>mr), ptr, sz, backend->prealloc_threads, -backend->prealloc_context, errp)) { +backend->prealloc_context, +async, errp)) { return; } } diff --git a/hw/virtio/virtio-mem.c 
b/hw/virtio/virtio-mem.c index 99ab989852..ffd119ebac 100644 --- a/hw/virtio/virtio-mem.c +++ b/hw/virtio/virtio-mem.c @@ -605,7 +605,7 @@ static int virtio_mem_set_block_state(VirtIOMEM *vmem, uint64_t start_gpa, int fd = memory_region_get_fd(>memdev->mr); Error *local_err = NULL; -if (!qemu_prealloc_mem(fd, area, size, 1, NULL, _err)) { +if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, _err)) { static bool warned; /* @@ -1248,7 +1248,7 @@ static int virtio_mem_prealloc_range_cb(VirtIOMEM *vmem, void *arg, int fd = memory_region_get_fd(>memdev->mr); Error *local_err = NULL; -if (!qemu_prealloc_mem(fd, area, size, 1, NULL, _err)) { +if (!qemu_prealloc_mem(fd, area, size, 1, NULL, false, _err)) { error_report_err(local_err); return -ENOMEM; } diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h index 151d968238..83dd9e2485 100644 --- a/include/hw/qdev-core.h +++ b/include/hw/qdev-core.h @@ -1071,6 +1071,11 @@ typedef enum MachineInitPhase { */ PHASE_ACCEL_CREATED, +/* + * Late backend objects have been created and initialized. + */ +PHASE_LATE_BACKENDS_CREATED, + /* * machine_class->init has been called, thus creating any embedded * devices and validating machine properties. 
Devices created at diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h index c9692cc314..7d359dabc4 100644 --- a/include/qemu/osdep.h +++ b/include/qemu/osdep.h @@ -680,6 +680,8 @@ typedef struct ThreadContext ThreadContext; * @area: start address of the are to preallocate * @sz: the size of the area to preallocate * @max_threads: maximum number of threads to use + * @tc: prealloc context threads pointer, NULL if not in use + * @async: request asynchronous preallocation, requires @tc * @errp: returns an error if this function fails * * Preallocate memory (populate/prefault page tables writable) for the virtual @@ -687,10 +689,24 @@ typedef struct ThreadContext ThreadContext; * each page in the area was faulted in writable at least once, for example, * after allocating file blocks for mapped files. * + * When setting @async, allocation might be performed asynchronously. + * qemu_finish_async_prealloc_mem() must be called to finis
Re: [PATCH v1 14/15] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions
On 04.02.24 03:15, Raphael Norwitz wrote: Someone else with more knowledge of the VQ mapping code should also review. On Fri, Feb 2, 2024 at 4:55 PM David Hildenbrand wrote: Currently, we try to remap all rings whenever we add a single new memory region. That doesn't quite make sense, because we already map rings when setting the ring address, and panic if that goes wrong. Likely, that handling was simply copied from set_mem_table code, where we actually have to remap all rings. Remapping all rings might require us to walk quite a lot of memory regions to perform the address translations. Ideally, we'd simply remove that remapping. However, let's be a bit careful. There might be some weird corner cases where we might temporarily remove a single memory region (e.g., resize it), that would have worked for now. Further, a ring might be located on hotplugged memory, and as the VM reboots, we might unplug that memory, to hotplug memory before resetting the ring addresses. So let's unmap affected rings as we remove a memory region, and try dynamically mapping the ring again when required. Signed-off-by: David Hildenbrand Acked-by: Raphael Norwitz Thanks! 
--- subprojects/libvhost-user/libvhost-user.c | 107 -- 1 file changed, 78 insertions(+), 29 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index febeb2eb89..738e84ab63 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +map_ring(VuDev *dev, VuVirtq *vq) +{ +vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr); +vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr); +vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr); + +DPRINT("Setting virtq addresses:\n"); +DPRINT("vring_desc at %p\n", vq->vring.desc); +DPRINT("vring_used at %p\n", vq->vring.used); +DPRINT("vring_avail at %p\n", vq->vring.avail); + +return !(vq->vring.desc && vq->vring.used && vq->vring.avail); +} + static bool Consider changing the function name to indicate that it may actually map a vq? Maybe vu_maybe_map_vq()? I don't think that would be really better. It's an implementation detail that we try to recover in these corner cases by remapping the rings. In the majority of all cases this function will simply check whether the device is broken and the vring was set up properly (which usually implies mapping the rings). So I think in the caller: if (!vu_is_vq_usable()) { return; } is easier to get than: if (!vu_maybe_map_vq()) { return; } Thanks! -- Cheers, David / dhildenb
Re: [PATCH v1 11/15] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
On 04.02.24 03:10, Raphael Norwitz wrote: One comment on this one. On Fri, Feb 2, 2024 at 4:56 PM David Hildenbrand wrote: Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 49 +-- 1 file changed, 45 insertions(+), 4 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d036b54ed0..75e47b7bb3 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...) static VuDevRegion * vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) { -unsigned int i; +int low = 0; +int high = dev->nregions - 1; /* * Memory regions cannot overlap in guest physical address space. Each * GPA belongs to exactly one memory region, so there can only be one * match. + * + * We store our memory regions ordered by GPA and can simply perform a + * binary search. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *cur = &dev->regions[i]; +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = &dev->regions[mid]; if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { return cur; } +if (guest_addr >= cur->gpa + cur->size) { +low = mid + 1; +} +if (guest_addr < cur->gpa) { +high = mid - 1; +} } return NULL; } @@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev) static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { +const uint64_t start_gpa = msg_region->guest_phys_addr; +const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; VuDevRegion *r; void *mmap_addr; +int low = 0; +int high = dev->nregions - 1; +unsigned int idx; DPRINT("Adding region %d\n", dev->nregions); DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", @@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) prot = PROT_NONE; } +/* + * We will add memory regions into the array sorted by GPA. Perform a + * binary search to locate the insertion point: it will be at the low + * index. + */ +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = &dev->regions[mid]; + +/* Overlap of GPA addresses. */ Looks like this check will only catch if the new region is fully contained within an existing region. I think we need to check whether either start or end region are in the range, i.e.: That check should cover all cases of overlaps, not just fully contained.
See the QEMU implementation of range_overlaps_range() that contains similar logic: return !(range2->upb < range1->lob || range1->upb < range2->lob); !(range2->upb < range1->lob || range1->upb < range2->lob); = !(range2->upb < range1->lob) && !(range1->upb < range2->lob) = range2->upb >= range1->lob && range1->upb >= range2->lob = range1->lob <= range2->upb && range2->lob <= range1->upb In QEMU, upb is inclusive, if it were exclusive (like we have here): = range1->lob < range2->upb && range2->lob < range1->upb Which is what we have here with: range1->lob = start_gpa range1->upb = end_gpa range2->lob = cur->gpa range2->upb = cur->gpa + cur->size Also if you are interested, see https://stackoverflow.com/questions/3269434/whats-the-most-efficient-way-to-test-if-two-ranges-overlap Thanks! -- Cheers, David / dhildenb
Re: [PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation
On 04.02.24 02:35, Raphael Norwitz wrote: As a heads up, I've left Nutanix and updated it in MAINTAINERS. Will be updating it again shortly so tagging these with my new work email. Thanks for the fast review! The mail server already complained to me :) Maybe consider adding yourself as reviewer for vhost as well? (which covers libvhost-user), I took your mail address from git history, not get_maintainers.pl. On Fri, Feb 2, 2024 at 4:54 PM David Hildenbrand wrote: We barely had mmap_offset set in the past. With virtio-mem and dynamic-memslots that will change. In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are performing pointer arithmetics, which is wrong. Let's simply use dev_region->mmap_addr instead of "void *mmap_addr". Fixes: ec94c8e621de ("Support adding individual regions in libvhost-user") Fixes: 9bb38019942c ("vhost+postcopy: Send address back to qemu") Cc: Raphael Norwitz Signed-off-by: David Hildenbrand Reviewed-by: Raphael Norwitz -- Cheers, David / dhildenb
[PATCH v1 09/15] libvhost-user: Don't search for duplicates when removing memory regions
We cannot have duplicate memory regions, something would be deeply flawed elsewhere. Let's just stop the search once we found an entry. We'll add more sanity checks when adding memory regions later. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index e1a1b9df88..22154b217f 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -896,8 +896,7 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { i--; found = true; - -/* Continue the search for eventual duplicates. */ +break; } } -- 2.43.0
[PATCH v1 12/15] libvhost-user: Use most of mmap_offset as fd_offset
In the past, QEMU would create memory regions that could partially cover hugetlb pages, making mmap() fail if we would use the mmap_offset as an fd_offset. For that reason, we never used the mmap_offset as an offset into the fd and instead always mapped the fd from the very start. However, that can easily result in us mmap'ing a lot of unnecessary parts of an fd, possibly repeatedly. QEMU nowadays does not create memory regions that partially cover huge pages -- it never really worked with postcopy. QEMU handles merging of regions that partially cover huge pages (due to holes in boot memory) since 2018 in c1ece84e7c93 ("vhost: Huge page align and merge"). Let's be a bit careful and not unconditionally convert the mmap_offset into an fd_offset. Instead, let's simply detect the hugetlb size and pass as much as we can as fd_offset, making sure that we call mmap() with a properly aligned offset. With QEMU and a virtio-mem device that is fully plugged (50GiB using 50 memslots) the qemu-storage daemon process consumes in the VA space 1281GiB before this change and 58GiB after this change. 
Example debug output: Vhost user message Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size:40 Fds: 59 Adding region 50 guest_phys_addr: 0x000d8000 memory_size: 0x4000 userspace_addr 0x7f54ebffe000 mmap_offset 0x000c fd_offset: 0x000c new mmap_offset: 0x mmap_addr: 0x7f7ecc00 Successfully added new region Vhost user message Request: VHOST_USER_ADD_MEM_REG (37) Flags: 0x9 Size:40 Fds: 59 Adding region 51 guest_phys_addr: 0x000dc000 memory_size: 0x4000 userspace_addr 0x7f552bffe000 mmap_offset 0x000c4000 fd_offset: 0x000c4000 new mmap_offset: 0x mmap_addr: 0x7f7e8c00 Successfully added new region Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 50 --- 1 file changed, 45 insertions(+), 5 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 75e47b7bb3..7d8293dc84 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -43,6 +43,8 @@ #include #include #include +#include +#include #ifdef __NR_userfaultfd #include @@ -281,12 +283,36 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static size_t +get_fd_pagesize(int fd) +{ +static size_t pagesize; +#if defined(__linux__) +struct statfs fs; +int ret; + +do { +ret = fstatfs(fd, ); +} while (ret != 0 && errno == EINTR); + +if (!ret && fs.f_type == HUGETLBFS_MAGIC) { +return fs.f_bsize; +} +#endif + +if (!pagesize) { +pagesize = getpagesize(); +} +return pagesize; +} + static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { const uint64_t start_gpa = msg_region->guest_phys_addr; const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; +uint64_t mmap_offset, fd_offset; VuDevRegion *r; void *mmap_addr; int low = 0; @@ -335,11 +361,25 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) idx = low; /* - * We don't use offset argument of mmap() since the mapped address has - * to be 
page aligned, and we use huge pages. + * Convert most of msg_region->mmap_offset to fd_offset. In almost all + * cases, this will leave us with mmap_offset == 0, mmap()'ing only + * what we really need. Only if a memory region would partially cover + * hugetlb pages, we'd get mmap_offset != 0, which usually doesn't happen + * anymore (i.e., modern QEMU). + * + * Note that mmap() with hugetlb would fail if the offset into the file + * is not aligned to the huge page size. */ -mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset, - prot, MAP_SHARED | MAP_NORESERVE, fd, 0); +fd_offset = ALIGN_DOWN(msg_region->mmap_offset, get_fd_pagesize(fd)); +mmap_offset = msg_region->mmap_offset - fd_offset; + +DPRINT("fd_offset: 0x%016"PRIx64"\n", + fd_offset); +DPRINT("adj mmap_offset: 0x%016"PRIx64"\n", + mmap_offset); + +mmap_addr = mmap(0, msg_region->memory_size + mmap_offset, + prot, MAP_SHARED | MAP_NORESERVE, fd, fd_offset); if (mmap_addr == MAP_FAILED) { vu_panic(dev, "region mmap error: %s", strerror(errno)); return; @@ -354,7 +394,7 @@ _vu_add_mem_reg(VuDe
[PATCH v1 13/15] libvhost-user: Factor out vq usability check
Let's factor it out to prepare for further changes. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 24 +++ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 7d8293dc84..febeb2eb89 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,6 +283,12 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +vu_is_vq_usable(VuDev *dev, VuVirtq *vq) +{ +return likely(!dev->broken) && likely(vq->vring.avail); +} + static size_t get_fd_pagesize(int fd) { @@ -2378,8 +2384,7 @@ vu_queue_get_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int *in_bytes, idx = vq->last_avail_idx; total_bufs = in_total = out_total = 0; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { goto done; } @@ -2494,8 +2499,7 @@ vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int in_bytes, bool vu_queue_empty(VuDev *dev, VuVirtq *vq) { -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return true; } @@ -2534,8 +2538,7 @@ vring_notify(VuDev *dev, VuVirtq *vq) static void _vu_queue_notify(VuDev *dev, VuVirtq *vq, bool sync) { -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } @@ -2860,8 +2863,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz) unsigned int head; VuVirtqElement *elem; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return NULL; } @@ -3018,8 +3020,7 @@ vu_queue_fill(VuDev *dev, VuVirtq *vq, { struct vring_used_elem uelem; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if (!vu_is_vq_usable(dev, vq)) { return; } @@ -3048,8 +3049,7 @@ vu_queue_flush(VuDev *dev, VuVirtq *vq, unsigned int count) { uint16_t old, new; -if (unlikely(dev->broken) || -unlikely(!vq->vring.avail)) { +if 
(!vu_is_vq_usable(dev, vq)) { return; } -- 2.43.0
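The factored-out check can be exercised in isolation; a minimal sketch with stubbed-down stand-ins for VuDev/VuVirtq (field names mirror libvhost-user, but the types are reduced to what the check needs, and the likely()/unlikely() annotations are dropped):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for libvhost-user's VuDev/VuVirtq. */
typedef struct { struct { uint16_t *avail; } vring; } VuVirtq;
typedef struct { bool broken; } VuDev;

/* A queue is usable only if the device isn't broken and the ring is mapped. */
static bool vu_is_vq_usable(VuDev *dev, VuVirtq *vq)
{
    return !dev->broken && vq->vring.avail != NULL;
}
```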
[PATCH v1 05/15] libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()
Let's reduce some code duplication and prepare for further changes. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 119 +++--- 1 file changed, 39 insertions(+), 80 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d5b3468e43..d9e2214ad2 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -937,95 +937,23 @@ vu_get_shared_object(VuDev *dev, VhostUserMsg *vmsg) } static bool -vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg) +vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) { -unsigned int i; VhostUserMemory m = vmsg->payload.memory, *memory = -dev->nregions = memory->nregions; - -DPRINT("Nregions: %u\n", memory->nregions); -for (i = 0; i < dev->nregions; i++) { -void *mmap_addr; -VhostUserMemoryRegion *msg_region = >regions[i]; -VuDevRegion *dev_region = >regions[i]; - -DPRINT("Region %d\n", i); -DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", - msg_region->guest_phys_addr); -DPRINT("memory_size: 0x%016"PRIx64"\n", - msg_region->memory_size); -DPRINT("userspace_addr 0x%016"PRIx64"\n", - msg_region->userspace_addr); -DPRINT("mmap_offset 0x%016"PRIx64"\n", - msg_region->mmap_offset); - -dev_region->gpa = msg_region->guest_phys_addr; -dev_region->size = msg_region->memory_size; -dev_region->qva = msg_region->userspace_addr; -dev_region->mmap_offset = msg_region->mmap_offset; +int prot = PROT_READ | PROT_WRITE; +unsigned int i; -/* We don't use offset argument of mmap() since the - * mapped address has to be page aligned, and we use huge - * pages. 
+if (dev->postcopy_listening) { +/* * In postcopy we're using PROT_NONE here to catch anyone * accessing it before we userfault */ -mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_NONE, MAP_SHARED | MAP_NORESERVE, - vmsg->fds[i], 0); - -if (mmap_addr == MAP_FAILED) { -vu_panic(dev, "region mmap error: %s", strerror(errno)); -} else { -dev_region->mmap_addr = (uint64_t)(uintptr_t)mmap_addr; -DPRINT("mmap_addr: 0x%016"PRIx64"\n", - dev_region->mmap_addr); -} - -/* Return the address to QEMU so that it can translate the ufd - * fault addresses back. - */ -msg_region->userspace_addr = dev_region->mmap_addr + - dev_region->mmap_offset; -close(vmsg->fds[i]); -} - -/* Send the message back to qemu with the addresses filled in */ -vmsg->fd_num = 0; -if (!vu_send_reply(dev, dev->sock, vmsg)) { -vu_panic(dev, "failed to respond to set-mem-table for postcopy"); -return false; -} - -/* Wait for QEMU to confirm that it's registered the handler for the - * faults. - */ -if (!dev->read_msg(dev, dev->sock, vmsg) || -vmsg->size != sizeof(vmsg->payload.u64) || -vmsg->payload.u64 != 0) { -vu_panic(dev, "failed to receive valid ack for postcopy set-mem-table"); -return false; +prot = PROT_NONE; } -/* OK, now we can go and register the memory and generate faults */ -(void)generate_faults(dev); - -return false; -} - -static bool -vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) -{ -unsigned int i; -VhostUserMemory m = vmsg->payload.memory, *memory = - vu_remove_all_mem_regs(dev); dev->nregions = memory->nregions; -if (dev->postcopy_listening) { -return vu_set_mem_table_exec_postcopy(dev, vmsg); -} - DPRINT("Nregions: %u\n", memory->nregions); for (i = 0; i < dev->nregions; i++) { void *mmap_addr; @@ -1051,8 +979,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) * mapped address has to be page aligned, and we use huge * pages. 
*/ mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, - vmsg->fds[i], 0); + prot, MAP_SHARED | MAP_NORESERVE, vmsg->fds[i], 0); if (mmap_addr == MAP_FAILED) { vu_panic(dev, "region mmap error: %s", strerror(errno)); @@ -1062,9 +989,41 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg
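The merge works because the two paths differed mainly in the mmap() protection; once that is chosen up front, one mapping loop serves both. A sketch of just that decision (a helper name invented here for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Postcopy listening maps PROT_NONE so any access before userfault
 * registration is caught; the regular path maps read/write.
 */
static int region_mmap_prot(bool postcopy_listening)
{
    return postcopy_listening ? PROT_NONE : (PROT_READ | PROT_WRITE);
}
```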
[PATCH v1 02/15] libvhost-user: Dynamically allocate memory for memory slots
Let's prepare for increasing VHOST_USER_MAX_RAM_SLOTS by dynamically allocating dev->regions. We don't have any ABI guarantees (not dynamically linked), so we can simply change the layout of VuDev. Let's zero out the memory, just as we used to do. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 11 +++ subprojects/libvhost-user/libvhost-user.h | 2 +- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 7e515ed15d..8a5a7a2295 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -2171,6 +2171,8 @@ vu_deinit(VuDev *dev) free(dev->vq); dev->vq = NULL; +free(dev->regions); +dev->regions = NULL; } bool @@ -2205,9 +2207,18 @@ vu_init(VuDev *dev, dev->backend_fd = -1; dev->max_queues = max_queues; +dev->regions = malloc(VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); +if (!dev->regions) { +DPRINT("%s: failed to malloc mem regions\n", __func__); +return false; +} +memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); + dev->vq = malloc(max_queues * sizeof(dev->vq[0])); if (!dev->vq) { DPRINT("%s: failed to malloc virtqueues\n", __func__); +free(dev->regions); +dev->regions = NULL; return false; } diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h index c2352904f0..c882b4e3a2 100644 --- a/subprojects/libvhost-user/libvhost-user.h +++ b/subprojects/libvhost-user/libvhost-user.h @@ -398,7 +398,7 @@ typedef struct VuDevInflightInfo { struct VuDev { int sock; uint32_t nregions; -VuDevRegion regions[VHOST_USER_MAX_RAM_SLOTS]; +VuDevRegion *regions; VuVirtq *vq; VuDevInflightInfo inflight_info; int log_call_fd; -- 2.43.0
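The malloc() + memset() pair in the patch is equivalent to a single calloc(); a reduced sketch of the allocation (struct layout and slot count mirror libvhost-user, but this is a stand-alone stand-in, not the library's code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VHOST_USER_MAX_RAM_SLOTS 32  /* value before patch 3 bumps it */

/* Reduced stand-in for VuDevRegion. */
typedef struct { uint64_t gpa, size, qva, mmap_addr, mmap_offset; } VuDevRegion;

/*
 * calloc() hands back zeroed memory directly, matching the implicit
 * zero-initialization the old in-struct array received.
 */
static VuDevRegion *alloc_regions(void)
{
    return calloc(VHOST_USER_MAX_RAM_SLOTS, sizeof(VuDevRegion));
}
```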
[PATCH v1 08/15] libvhost-user: Don't zero out memory for memory regions
dev->nregions always covers only valid entries. Stop zeroing out other array elements that are unused. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index f99c888b48..e1a1b9df88 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -888,13 +888,9 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); -/* - * Shift all affected entries by 1 to close the hole at index i and - * zero out the last entry. - */ +/* Shift all affected entries by 1 to close the hole at index. */ memmove(dev->regions + i, dev->regions + i + 1, sizeof(VuDevRegion) * (dev->nregions - i - 1)); -memset(dev->regions + dev->nregions - 1, 0, sizeof(VuDevRegion)); DPRINT("Successfully removed a region\n"); dev->nregions--; i--; @@ -2119,7 +2115,6 @@ vu_init(VuDev *dev, DPRINT("%s: failed to malloc mem regions\n", __func__); return false; } -memset(dev->regions, 0, VHOST_USER_MAX_RAM_SLOTS * sizeof(dev->regions[0])); dev->vq = malloc(max_queues * sizeof(dev->vq[0])); if (!dev->vq) { -- 2.43.0
[PATCH v1 11/15] libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
Let's speed up GPA to memory region / virtual address lookup. Store the memory regions ordered by guest physical addresses, and use binary search for address translation, as well as when adding/removing memory regions. Most importantly, this will speed up GPA->VA address translation when we have many memslots. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 49 +-- 1 file changed, 45 insertions(+), 4 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d036b54ed0..75e47b7bb3 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -199,19 +199,30 @@ vu_panic(VuDev *dev, const char *msg, ...) static VuDevRegion * vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) { -unsigned int i; +int low = 0; +int high = dev->nregions - 1; /* * Memory regions cannot overlap in guest physical address space. Each * GPA belongs to exactly one memory region, so there can only be one * match. + * + * We store our memory regions ordered by GPA and can simply perform a + * binary search. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *cur = >regions[i]; +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = >regions[mid]; if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { return cur; } +if (guest_addr >= cur->gpa + cur->size) { +low = mid + 1; +} +if (guest_addr < cur->gpa) { +high = mid - 1; +} } return NULL; } @@ -273,9 +284,14 @@ vu_remove_all_mem_regs(VuDev *dev) static void _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) { +const uint64_t start_gpa = msg_region->guest_phys_addr; +const uint64_t end_gpa = start_gpa + msg_region->memory_size; int prot = PROT_READ | PROT_WRITE; VuDevRegion *r; void *mmap_addr; +int low = 0; +int high = dev->nregions - 1; +unsigned int idx; DPRINT("Adding region %d\n", dev->nregions); DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", @@ -295,6 +311,29 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) prot = PROT_NONE; } +/* + * We will add memory regions into the array sorted by GPA. Perform a + * binary search to locate the insertion point: it will be at the low + * index. + */ +while (low <= high) { +unsigned int mid = low + (high - low) / 2; +VuDevRegion *cur = >regions[mid]; + +/* Overlap of GPA addresses. */ +if (start_gpa < cur->gpa + cur->size && cur->gpa < end_gpa) { +vu_panic(dev, "regions with overlapping guest physical addresses"); +return; +} +if (start_gpa >= cur->gpa + cur->size) { +low = mid + 1; +} +if (start_gpa < cur->gpa) { +high = mid - 1; +} +} +idx = low; + /* * We don't use offset argument of mmap() since the mapped address has * to be page aligned, and we use huge pages. @@ -308,7 +347,9 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) DPRINT("mmap_addr: 0x%016"PRIx64"\n", (uint64_t)(uintptr_t)mmap_addr); -r = >regions[dev->nregions]; +/* Shift all affected entries by 1 to open a hole at idx. 
*/ +r = >regions[idx]; +memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx)); r->gpa = msg_region->guest_phys_addr; r->size = msg_region->memory_size; r->qva = msg_region->userspace_addr; -- 2.43.0
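The lookup half of the patch is self-contained enough to test on its own; a sketch with a reduced region descriptor (same algorithm as vu_gpa_to_mem_region(): regions sorted by GPA, no overlaps, so each address matches at most one entry):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced region descriptor: only the fields the lookup needs. */
typedef struct { uint64_t gpa, size; } Region;

/* Binary search for the region covering gpa; NULL if none does. */
static Region *gpa_to_region(Region *regions, unsigned nregions, uint64_t gpa)
{
    int low = 0, high = (int)nregions - 1;

    while (low <= high) {
        int mid = low + (high - low) / 2;
        Region *cur = &regions[mid];

        if (gpa >= cur->gpa && gpa < cur->gpa + cur->size) {
            return cur;
        }
        if (gpa >= cur->gpa + cur->size) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return NULL;
}
```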
[PATCH v1 07/15] libvhost-user: No need to check for NULL when unmapping
We never add a memory region if mmap() failed. Therefore, no need to check for NULL. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a2baefe84b..f99c888b48 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -247,11 +247,8 @@ vu_remove_all_mem_regs(VuDev *dev) for (i = 0; i < dev->nregions; i++) { VuDevRegion *r = >regions[i]; -void *ma = (void *)(uintptr_t)r->mmap_addr; -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); } dev->nregions = 0; } @@ -888,11 +885,8 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { for (i = 0; i < dev->nregions; i++) { if (reg_equal(>regions[i], msg_region)) { VuDevRegion *r = >regions[i]; -void *ma = (void *) (uintptr_t) r->mmap_addr; -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); /* * Shift all affected entries by 1 to close the hole at index i and -- 2.43.0
[PATCH v1 15/15] libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP
We already use MAP_NORESERVE to deal with sparse memory regions. Let's also set madvise(MADV_DONTDUMP), otherwise a crash of the process can result in us allocating all memory in the mmap'ed region for dumping purposes. This change implies that the mmap'ed rings won't be included in a coredump. If ever required for debugging purposes, we could mark only the mapped rings MADV_DODUMP. Ignore errors during madvise() for now. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 738e84ab63..26c289518c 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -458,6 +458,12 @@ _vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) DPRINT("mmap_addr: 0x%016"PRIx64"\n", (uint64_t)(uintptr_t)mmap_addr); +#if defined(__linux__) +/* Don't include all guest memory in a coredump. */ +madvise(mmap_addr, msg_region->memory_size + mmap_offset, +MADV_DONTDUMP); +#endif + /* Shift all affected entries by 1 to open a hole at idx. */ r = >regions[idx]; memmove(r + 1, r, sizeof(VuDevRegion) * (dev->nregions - idx)); -- 2.43.0
[PATCH v1 01/15] libvhost-user: Fix msg_region->userspace_addr computation
We barely had mmap_offset set in the past. With virtio-mem and dynamic-memslots that will change. In vu_add_mem_reg() and vu_set_mem_table_exec_postcopy(), we are performing pointer arithmetic, which is wrong. Let's simply use dev_region->mmap_addr instead of "void *mmap_addr". Fixes: ec94c8e621de ("Support adding individual regions in libvhost-user") Fixes: 9bb38019942c ("vhost+postcopy: Send address back to qemu") Cc: Raphael Norwitz Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index a3b158c671..7e515ed15d 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -800,8 +800,8 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { * Return the address to QEMU so that it can translate the ufd * fault addresses back. */ -msg_region->userspace_addr = (uintptr_t)(mmap_addr + - dev_region->mmap_offset); +msg_region->userspace_addr = dev_region->mmap_addr + + dev_region->mmap_offset; /* Send the message back to qemu with the addresses filled in. */ vmsg->fd_num = 0; @@ -969,8 +969,8 @@ vu_set_mem_table_exec_postcopy(VuDev *dev, VhostUserMsg *vmsg) /* Return the address to QEMU so that it can translate the ufd * fault addresses back. */ -msg_region->userspace_addr = (uintptr_t)(mmap_addr + - dev_region->mmap_offset); +msg_region->userspace_addr = dev_region->mmap_addr + + dev_region->mmap_offset; close(vmsg->fds[i]); } -- 2.43.0
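The failure mode of the old computation shows up on 32-bit hosts, where casting a pointer-plus-64-bit-offset sum to uintptr_t truncates it. A sketch simulating that with uint32_t standing in for a 32-bit uintptr_t (the function names here are illustrative, not from the library):

```c
#include <assert.h>
#include <stdint.h>

/* Simulates the old code on a 32-bit host: the sum is truncated. */
static uint64_t addr_broken_32bit(uint32_t mmap_addr, uint64_t mmap_offset)
{
    return (uint32_t)(mmap_addr + mmap_offset);   /* high bits lost */
}

/* The fix: mmap_addr lives as uint64_t in VuDevRegion, so the math is 64-bit. */
static uint64_t addr_fixed(uint64_t mmap_addr, uint64_t mmap_offset)
{
    return mmap_addr + mmap_offset;
}
```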
[PATCH v1 00/15] libvhost-user: support more memslots and cleanup memslot handling code
This series adds support for more memslots (509) to libvhost-user, to make it fully compatible with virtio-mem, which uses up to 256 memslots across all memory devices in "dynamic-memslot" mode (more details in patch #3). One simple fix comes upfront. With that in place, this series optimizes and extends memory region handling in libvhost-user:

* Heavily deduplicate and clean up the memory region handling code
* Speed up GPA->VA translation with many memslots using binary search
* Optimize mmap_offset handling to use it as fd_offset for mmap()
* Avoid ring remapping when adding a single memory region
* Avoid dumping all guest memory, possibly allocating memory in sparse memory mappings when the process crashes

I'm being very careful not to break some weird corner case that modern QEMU might no longer trigger, but an older one could have triggered, or some other frontend might trigger. The only thing where I am not careful is forbidding memory regions that overlap in GPA space: such overlaps don't make any sense. With this series, virtio-mem (with dynamic-memslots=on) + qemu-storage-daemon works flawlessly and as expected in my tests.

Cc: Michael S. Tsirkin
Cc: Jason Wang
Cc: Stefan Hajnoczi
Cc: Stefano Garzarella
Cc: Germano Veit Michel
Cc: Raphael Norwitz

David Hildenbrand (15):
  libvhost-user: Fix msg_region->userspace_addr computation
  libvhost-user: Dynamically allocate memory for memory slots
  libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
  libvhost-user: Factor out removing all mem regions
  libvhost-user: Merge vu_set_mem_table_exec_postcopy() into vu_set_mem_table_exec()
  libvhost-user: Factor out adding a memory region
  libvhost-user: No need to check for NULL when unmapping
  libvhost-user: Don't zero out memory for memory regions
  libvhost-user: Don't search for duplicates when removing memory regions
  libvhost-user: Factor out search for memory region by GPA and simplify
  libvhost-user: Speedup gpa_to_mem_region() and vu_gpa_to_va()
  libvhost-user: Use most of mmap_offset as fd_offset
  libvhost-user: Factor out vq usability check
  libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions
  libvhost-user: Mark mmap'ed region memory as MADV_DONTDUMP

 subprojects/libvhost-user/libvhost-user.c | 593 --
 subprojects/libvhost-user/libvhost-user.h | 10 +-
 2 files changed, 332 insertions(+), 271 deletions(-)
--
2.43.0
[PATCH v1 03/15] libvhost-user: Bump up VHOST_USER_MAX_RAM_SLOTS to 509
Let's support up to 509 mem slots, just like vhost in the kernel usually does and the rust vhost-user implementation recently [1] started doing. This is required to properly support memory hotplug, either using multiple DIMMs (ACPI supports up to 256) or using virtio-mem. The 509 used to be the KVM limit, it supported 512, but 3 were used for internal purposes. Currently, KVM supports more than 512, but it usually doesn't make use of more than ~260 (i.e., 256 DIMMs + boot memory), except when other memory devices like PCI devices with BARs are used. So, 509 seems to work well for vhost in the kernel. Details can be found in the QEMU change that made virtio-mem consume up to 256 mem slots across all virtio-mem devices. [2] 509 mem slots implies 509 VMAs/mappings in the worst case (even though, in practice with virtio-mem we won't be seeing more than ~260 in most setups). With max_map_count under Linux defaulting to 64k, 509 mem slots still correspond to less than 1% of the maximum number of mappings. There are plenty left for the application to consume. [1] https://github.com/rust-vmm/vhost/pull/224 [2] https://lore.kernel.org/all/20230926185738.277351-1-da...@redhat.com/ Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.h | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h index c882b4e3a2..deb40e77b3 100644 --- a/subprojects/libvhost-user/libvhost-user.h +++ b/subprojects/libvhost-user/libvhost-user.h @@ -31,10 +31,12 @@ #define VHOST_MEMORY_BASELINE_NREGIONS 8 /* - * Set a reasonable maximum number of ram slots, which will be supported by - * any architecture. + * vhost in the kernel usually supports 509 mem slots. 509 used to be the + * KVM limit, it supported 512, but 3 were used for internal purposes. This + * limit is sufficient to support many DIMMs and virtio-mem in + * "dynamic-memslots" mode. 
*/ -#define VHOST_USER_MAX_RAM_SLOTS 32 +#define VHOST_USER_MAX_RAM_SLOTS 509 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64) -- 2.43.0
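The memory cost of the bump is easy to bound; a sketch with a reduced VuDevRegion (five 64-bit fields, mirroring libvhost-user.h, though this is a stand-in, not the library's definition):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_USER_MAX_RAM_SLOTS 509

/* Reduced stand-in: five 64-bit fields, as in libvhost-user.h. */
typedef struct { uint64_t gpa, size, qva, mmap_addr, mmap_offset; } VuDevRegion;

/*
 * The (now dynamically allocated) regions array grows to 509 entries of
 * 40 bytes each -- roughly 20 KiB per device, negligible next to the
 * mappings themselves.
 */
static size_t regions_array_bytes(void)
{
    return VHOST_USER_MAX_RAM_SLOTS * sizeof(VuDevRegion);
}
```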
[PATCH v1 06/15] libvhost-user: Factor out adding a memory region
Let's factor it out, reducing quite some code duplication and preparing for further changes. If we fail to mmap a region and panic, we now simply don't add that (broken) region. Note that we now increment dev->nregions as we are successfully adding memory regions, and don't increment dev->nregions if anything went wrong. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 168 -- 1 file changed, 60 insertions(+), 108 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index d9e2214ad2..a2baefe84b 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -256,6 +256,61 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static void +_vu_add_mem_reg(VuDev *dev, VhostUserMemoryRegion *msg_region, int fd) +{ +int prot = PROT_READ | PROT_WRITE; +VuDevRegion *r; +void *mmap_addr; + +DPRINT("Adding region %d\n", dev->nregions); +DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", + msg_region->guest_phys_addr); +DPRINT("memory_size: 0x%016"PRIx64"\n", + msg_region->memory_size); +DPRINT("userspace_addr 0x%016"PRIx64"\n", + msg_region->userspace_addr); +DPRINT("mmap_offset 0x%016"PRIx64"\n", + msg_region->mmap_offset); + +if (dev->postcopy_listening) { +/* + * In postcopy we're using PROT_NONE here to catch anyone + * accessing it before we userfault + */ +prot = PROT_NONE; +} + +/* + * We don't use offset argument of mmap() since the mapped address has + * to be page aligned, and we use huge pages. 
+ */ +mmap_addr = mmap(0, msg_region->memory_size + msg_region->mmap_offset, + prot, MAP_SHARED | MAP_NORESERVE, fd, 0); +if (mmap_addr == MAP_FAILED) { +vu_panic(dev, "region mmap error: %s", strerror(errno)); +return; +} +DPRINT("mmap_addr: 0x%016"PRIx64"\n", + (uint64_t)(uintptr_t)mmap_addr); + +r = >regions[dev->nregions]; +r->gpa = msg_region->guest_phys_addr; +r->size = msg_region->memory_size; +r->qva = msg_region->userspace_addr; +r->mmap_addr = (uint64_t)(uintptr_t)mmap_addr; +r->mmap_offset = msg_region->mmap_offset; +dev->nregions++; + +if (dev->postcopy_listening) { +/* + * Return the address to QEMU so that it can translate the ufd + * fault addresses back. + */ +msg_region->userspace_addr = r->mmap_addr + r->mmap_offset; +} +} + static void vmsg_close_fds(VhostUserMsg *vmsg) { @@ -727,10 +782,7 @@ generate_faults(VuDev *dev) { static bool vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { int i; -bool track_ramblocks = dev->postcopy_listening; VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = -VuDevRegion *dev_region = >regions[dev->nregions]; -void *mmap_addr; if (vmsg->fd_num != 1) { vmsg_close_fds(vmsg); @@ -760,69 +812,20 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { * we know all the postcopy client bases have been received, and we * should start generating faults. 
*/ -if (track_ramblocks && +if (dev->postcopy_listening && vmsg->size == sizeof(vmsg->payload.u64) && vmsg->payload.u64 == 0) { (void)generate_faults(dev); return false; } -DPRINT("Adding region: %u\n", dev->nregions); -DPRINT("guest_phys_addr: 0x%016"PRIx64"\n", - msg_region->guest_phys_addr); -DPRINT("memory_size: 0x%016"PRIx64"\n", - msg_region->memory_size); -DPRINT("userspace_addr 0x%016"PRIx64"\n", - msg_region->userspace_addr); -DPRINT("mmap_offset 0x%016"PRIx64"\n", - msg_region->mmap_offset); - -dev_region->gpa = msg_region->guest_phys_addr; -dev_region->size = msg_region->memory_size; -dev_region->qva = msg_region->userspace_addr; -dev_region->mmap_offset = msg_region->mmap_offset; - -/* - * We don't use offset argument of mmap() since the - * mapped address has to be page aligned, and we use huge - * pages. - */ -if (track_ramblocks) { -/* - * In postcopy we're using PROT_NONE here to catch anyone - * accessing it before we userfault. - */ -mmap_addr = mmap(0, dev_region->size + dev_region->mmap_offset, - PROT_NONE, MAP_SHARED | MAP_NORESERVE, -
[PATCH v1 14/15] libvhost-user: Dynamically remap rings after (temporarily?) removing memory regions
Currently, we try to remap all rings whenever we add a single new memory region. That doesn't quite make sense, because we already map rings when setting the ring address, and panic if that goes wrong. Likely, that handling was simply copied from set_mem_table code, where we actually have to remap all rings. Remapping all rings might require us to walk quite a lot of memory regions to perform the address translations. Ideally, we'd simply remove that remapping. However, let's be a bit careful. There might be some weird corner cases where we might temporarily remove a single memory region (e.g., resize it), that would have worked for now. Further, a ring might be located on hotplugged memory, and as the VM reboots, we might unplug that memory, to hotplug memory before resetting the ring addresses. So let's unmap affected rings as we remove a memory region, and try dynamically mapping the ring again when required. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 107 -- 1 file changed, 78 insertions(+), 29 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index febeb2eb89..738e84ab63 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -283,10 +283,75 @@ vu_remove_all_mem_regs(VuDev *dev) dev->nregions = 0; } +static bool +map_ring(VuDev *dev, VuVirtq *vq) +{ +vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr); +vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr); +vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr); + +DPRINT("Setting virtq addresses:\n"); +DPRINT("vring_desc at %p\n", vq->vring.desc); +DPRINT("vring_used at %p\n", vq->vring.used); +DPRINT("vring_avail at %p\n", vq->vring.avail); + +return !(vq->vring.desc && vq->vring.used && vq->vring.avail); +} + static bool vu_is_vq_usable(VuDev *dev, VuVirtq *vq) { -return likely(!dev->broken) && likely(vq->vring.avail); +if (unlikely(dev->broken)) { 
+return false; +} + +if (likely(vq->vring.avail)) { +return true; +} + +/* + * In corner cases, we might temporarily remove a memory region that + * mapped a ring. When removing a memory region we make sure to + * unmap any rings that would be impacted. Let's try to remap if we + * already succeeded mapping this ring once. + */ +if (!vq->vra.desc_user_addr || !vq->vra.used_user_addr || +!vq->vra.avail_user_addr) { +return false; +} +if (map_ring(dev, vq)) { +vu_panic(dev, "remapping queue on access"); +return false; +} +return true; +} + +static void +unmap_rings(VuDev *dev, VuDevRegion *r) +{ +int i; + +for (i = 0; i < dev->max_queues; i++) { +VuVirtq *vq = >vq[i]; +const uintptr_t desc = (uintptr_t)vq->vring.desc; +const uintptr_t used = (uintptr_t)vq->vring.used; +const uintptr_t avail = (uintptr_t)vq->vring.avail; + +if (desc < r->mmap_addr || desc >= r->mmap_addr + r->size) { +continue; +} +if (used < r->mmap_addr || used >= r->mmap_addr + r->size) { +continue; +} +if (avail < r->mmap_addr || avail >= r->mmap_addr + r->size) { +continue; +} + +DPRINT("Unmapping rings of queue %d\n", i); +vq->vring.desc = NULL; +vq->vring.used = NULL; +vq->vring.avail = NULL; +} } static size_t @@ -784,21 +849,6 @@ vu_reset_device_exec(VuDev *dev, VhostUserMsg *vmsg) return false; } -static bool -map_ring(VuDev *dev, VuVirtq *vq) -{ -vq->vring.desc = qva_to_va(dev, vq->vra.desc_user_addr); -vq->vring.used = qva_to_va(dev, vq->vra.used_user_addr); -vq->vring.avail = qva_to_va(dev, vq->vra.avail_user_addr); - -DPRINT("Setting virtq addresses:\n"); -DPRINT("vring_desc at %p\n", vq->vring.desc); -DPRINT("vring_used at %p\n", vq->vring.used); -DPRINT("vring_avail at %p\n", vq->vring.avail); - -return !(vq->vring.desc && vq->vring.used && vq->vring.avail); -} - static bool generate_faults(VuDev *dev) { unsigned int i; @@ -882,7 +932,6 @@ generate_faults(VuDev *dev) { static bool vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { -int i; VhostUserMemoryRegion m = 
vmsg->payload.memreg.region, *msg_region = if (vmsg->fd_num != 1) { @@ -928,19 +977,9 @@ vu_add_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { vmsg->fd_num = 0; DPRINT("Successfully added new region in postcopy\n"); return true;
[PATCH v1 10/15] libvhost-user: Factor out search for memory region by GPA and simplify
Memory regions cannot overlap, and if we ever hit that case something would be really flawed. For example, when vhost code in QEMU decides to increase the size of memory regions to cover full huge pages, it makes sure to never create overlaps, and if there would be overlaps, it would bail out. QEMU commits 48d7c9757749 ("vhost: Merge sections added to temporary list"), c1ece84e7c93 ("vhost: Huge page align and merge") and e7b94a84b6cb ("vhost: Allow adjoining regions") added and clarified that handling and how overlaps are impossible. Consequently, each GPA can belong to at most one memory region, and everything else doesn't make sense. Let's factor out our search to prepare for further changes. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 79 +-- 1 file changed, 45 insertions(+), 34 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 22154b217f..d036b54ed0 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -195,30 +195,47 @@ vu_panic(VuDev *dev, const char *msg, ...) */ } +/* Search for a memory region that covers this guest physical address. */ +static VuDevRegion * +vu_gpa_to_mem_region(VuDev *dev, uint64_t guest_addr) +{ +unsigned int i; + +/* + * Memory regions cannot overlap in guest physical address space. Each + * GPA belongs to exactly one memory region, so there can only be one + * match. + */ +for (i = 0; i < dev->nregions; i++) { +VuDevRegion *cur = >regions[i]; + +if (guest_addr >= cur->gpa && guest_addr < cur->gpa + cur->size) { +return cur; +} +} +return NULL; +} + /* Translate guest physical address to our virtual address. */ void * vu_gpa_to_va(VuDev *dev, uint64_t *plen, uint64_t guest_addr) { -unsigned int i; +VuDevRegion *r; if (*plen == 0) { return NULL; } -/* Find matching memory region. 
*/ -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; - -if ((guest_addr >= r->gpa) && (guest_addr < (r->gpa + r->size))) { -if ((guest_addr + *plen) > (r->gpa + r->size)) { -*plen = r->gpa + r->size - guest_addr; -} -return (void *)(uintptr_t) -guest_addr - r->gpa + r->mmap_addr + r->mmap_offset; -} +r = vu_gpa_to_mem_region(dev, guest_addr); +if (!r) { +return NULL; } -return NULL; +if ((guest_addr + *plen) > (r->gpa + r->size)) { +*plen = r->gpa + r->size - guest_addr; +} +return (void *)(uintptr_t)guest_addr - r->gpa + r->mmap_addr + + r->mmap_offset; } /* Translate qemu virtual address to our virtual address. */ @@ -854,8 +871,8 @@ static inline bool reg_equal(VuDevRegion *vudev_reg, static bool vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { VhostUserMemoryRegion m = vmsg->payload.memreg.region, *msg_region = -unsigned int i; -bool found = false; +unsigned int idx; +VuDevRegion *r; if (vmsg->fd_num > 1) { vmsg_close_fds(vmsg); @@ -882,28 +899,22 @@ vu_rem_mem_reg(VuDev *dev, VhostUserMsg *vmsg) { DPRINT("mmap_offset 0x%016"PRIx64"\n", msg_region->mmap_offset); -for (i = 0; i < dev->nregions; i++) { -if (reg_equal(>regions[i], msg_region)) { -VuDevRegion *r = >regions[i]; - -munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); - -/* Shift all affected entries by 1 to close the hole at index. */ -memmove(dev->regions + i, dev->regions + i + 1, -sizeof(VuDevRegion) * (dev->nregions - i - 1)); -DPRINT("Successfully removed a region\n"); -dev->nregions--; -i--; - -found = true; -break; -} -} - -if (!found) { +r = vu_gpa_to_mem_region(dev, msg_region->guest_phys_addr); +if (!r || !reg_equal(r, msg_region)) { +vmsg_close_fds(vmsg); vu_panic(dev, "Specified region not found\n"); +return false; } +munmap((void *)(uintptr_t)r->mmap_addr, r->size + r->mmap_offset); + +idx = r - dev->regions; +assert(idx < dev->nregions); +/* Shift all affected entries by 1 to close the hole. 
*/ +memmove(r, r + 1, sizeof(VuDevRegion) * (dev->nregions - idx - 1)); +DPRINT("Successfully removed a region\n"); +dev->nregions--; + vmsg_close_fds(vmsg); return false; -- 2.43.0
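The hole-closing step at the end of vu_rem_mem_reg() is a plain array removal; a sketch with a reduced region descriptor (and, as patch 8 notes, no zeroing of the last slot is needed because nregions alone bounds the valid entries):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reduced region descriptor. */
typedef struct { uint64_t gpa, size; } Region;

/* Shift the tail left by one to close the hole at idx; returns new count. */
static unsigned remove_region(Region *regions, unsigned nregions, unsigned idx)
{
    memmove(&regions[idx], &regions[idx + 1],
            sizeof(Region) * (nregions - idx - 1));
    return nregions - 1;
}
```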
[PATCH v1 04/15] libvhost-user: Factor out removing all mem regions
Let's factor it out. Note that the check for MAP_FAILED was wrong as we never set mmap_addr if mmap() failed. We'll remove the NULL check separately. Signed-off-by: David Hildenbrand --- subprojects/libvhost-user/libvhost-user.c | 34 --- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c index 8a5a7a2295..d5b3468e43 100644 --- a/subprojects/libvhost-user/libvhost-user.c +++ b/subprojects/libvhost-user/libvhost-user.c @@ -240,6 +240,22 @@ qva_to_va(VuDev *dev, uint64_t qemu_addr) return NULL; } +static void +vu_remove_all_mem_regs(VuDev *dev) +{ +unsigned int i; + +for (i = 0; i < dev->nregions; i++) { +VuDevRegion *r = >regions[i]; +void *ma = (void *)(uintptr_t)r->mmap_addr; + +if (ma) { +munmap(ma, r->size + r->mmap_offset); +} +} +dev->nregions = 0; +} + static void vmsg_close_fds(VhostUserMsg *vmsg) { @@ -1003,14 +1019,7 @@ vu_set_mem_table_exec(VuDev *dev, VhostUserMsg *vmsg) unsigned int i; VhostUserMemory m = vmsg->payload.memory, *memory = -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; -void *ma = (void *) (uintptr_t) r->mmap_addr; - -if (ma) { -munmap(ma, r->size + r->mmap_offset); -} -} +vu_remove_all_mem_regs(dev); dev->nregions = memory->nregions; if (dev->postcopy_listening) { @@ -2112,14 +2121,7 @@ vu_deinit(VuDev *dev) { unsigned int i; -for (i = 0; i < dev->nregions; i++) { -VuDevRegion *r = >regions[i]; -void *m = (void *) (uintptr_t) r->mmap_addr; -if (m != MAP_FAILED) { -munmap(m, r->size + r->mmap_offset); -} -} -dev->nregions = 0; +vu_remove_all_mem_regs(dev); for (i = 0; i < dev->max_queues; i++) { VuVirtq *vq = >vq[i]; -- 2.43.0