Re: [PATCH v2 0/2] Initialize backend memory objects in parallel
On 1/29/24 1:11 PM, David Hildenbrand wrote: On 22.01.24 16:32, Mark Kanda wrote: v2: - require MADV_POPULATE_WRITE (simplify the implementation) - require prealloc context threads to ensure optimal thread placement - use machine phase 'initialized' to detremine when to allow parallel init QEMU initializes preallocated backend memory when parsing the corresponding objects from the command line. In certain scenarios, such as memory being preallocated across multiple numa nodes, this approach is not optimal due to the unnecessary serialization. This series addresses this issue by initializing the backend memory objects in parallel. I just played with the code, some comments: * I suggest squashing both patches. It doesn't make things clearer if we factor out unconditionally adding contexts to a global list. * Keep the functions MT-capable, at least as long as async=false. That is, don't involve the global list if async=false. virtio-mem will perform preallocation from other threads at some point, where we could see concurrent preallocations for different devices. I made sure that qemu_mem_prealloc() can handle that. * Rename wait_mem_prealloc() to qemu_finish_async_mem_prealloc() and let it report the error / return true/false like qemu_prealloc_mem(). Especially, don't change the existing "qemu_prealloc_mem: preallocating memory failed" error message. * Do the conditional async=false fixup in touch_all_pages(). That means, in qemu_prealloc_mem(), only route the async parameter through. One thing I don't quite like is what happens when multiple threads would try issuing "async=true". It will currently not happen, but we should catch whenever that happens and require that only one thread at a time can perform async preallocs. Maybe we can assert in qemu_prealloc_mem()/ qemu_finish_async_mem_prealloc() that we hold the BQL. Hopefully, that is the case when we start creating memory backends, before starting the main loop. If not, maybe we should just document that async limitation. Ideally, we'd have some async_start(), prealloc(), prealloc(), async_finish() interface, where async_start() would block until another thread called async_finish(), so we never have a mixture. But that would currently be over-engineering. I'll attach the untested, likely broken, code I played with to see what it could look like. Observe how I only conditionally add the context to the list at the end of touch_all_pages(). Thank you very much for the feedback David. I'll take a close look at this for v3. Best regards, -Mark From fe26cc5252f1284efa8e667310609a22c6166324 Mon Sep 17 00:00:00 2001 From: Mark Kanda Date: Mon, 22 Jan 2024 09:32:18 -0600 Subject: [PATCH] oslib-posix: initialize selected backend memory objects in parallel QEMU initializes preallocated backend memory as the objects are parsed from the command line. This is not optimal in some cases (e.g. memory spanning multiple NUMA nodes) because the memory objects are initialized in series. Allow the initialization to occur in parallel. In order to ensure optimal thread placement, parallel initialization requires prealloc context threads to be in use. Signed-off-by: Mark Kanda Signed-off-by: David Hildenbrand --- backends/hostmem.c | 8 ++- hw/virtio/virtio-mem.c | 4 +- include/qemu/osdep.h | 19 +- system/vl.c | 8 +++ util/oslib-posix.c | 130 +++-- util/oslib-win32.c | 8 ++- 6 files changed, 140 insertions(+), 37 deletions(-) diff --git a/backends/hostmem.c b/backends/hostmem.c index 30f69b2cb5..8f602dc86f 100644 --- a/backends/hostmem.c +++ b/backends/hostmem.c @@ -20,6 +20,7 @@ #include "qom/object_interfaces.h" #include "qemu/mmap-alloc.h" #include "qemu/madvise.h" +#include "hw/qdev-core.h" #ifdef CONFIG_NUMA #include @@ -235,9 +236,10 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, int fd = memory_region_get_fd(>mr); void *ptr = memory_region_get_ram_ptr(>mr); uint64_t sz = memory_region_size(>mr); + bool async = !phase_check(PHASE_MACHINE_INITIALIZED); if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, - backend->prealloc_context, errp)) { + backend->prealloc_context, async, errp)) { return; } backend->prealloc = true; @@ -323,6 +325,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc); void *ptr; uint64_t sz; + bool async = !phase_check(PHASE_MACHINE_INITIALIZED); if (!bc->alloc) { return; @@ -398,7 +401,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(>mr), ptr, sz,
Re: [PATCH v2 0/2] Initialize backend memory objects in parallel
On 22.01.24 16:32, Mark Kanda wrote: v2: - require MADV_POPULATE_WRITE (simplify the implementation) - require prealloc context threads to ensure optimal thread placement - use machine phase 'initialized' to detremine when to allow parallel init QEMU initializes preallocated backend memory when parsing the corresponding objects from the command line. In certain scenarios, such as memory being preallocated across multiple numa nodes, this approach is not optimal due to the unnecessary serialization. This series addresses this issue by initializing the backend memory objects in parallel. I just played with the code, some comments: * I suggest squashing both patches. It doesn't make things clearer if we factor out unconditionally adding contexts to a global list. * Keep the functions MT-capable, at least as long as async=false. That is, don't involve the global list if async=false. virtio-mem will perform preallocation from other threads at some point, where we could see concurrent preallocations for different devices. I made sure that qemu_mem_prealloc() can handle that. * Rename wait_mem_prealloc() to qemu_finish_async_mem_prealloc() and let it report the error / return true/false like qemu_prealloc_mem(). Especially, don't change the existing "qemu_prealloc_mem: preallocating memory failed" error message. * Do the conditional async=false fixup in touch_all_pages(). That means, in qemu_prealloc_mem(), only route the async parameter through. One thing I don't quite like is what happens when multiple threads would try issuing "async=true". It will currently not happen, but we should catch whenever that happens and require that only one thread at a time can perform async preallocs. Maybe we can assert in qemu_prealloc_mem()/ qemu_finish_async_mem_prealloc() that we hold the BQL. Hopefully, that is the case when we start creating memory backends, before starting the main loop. If not, maybe we should just document that async limitation. Ideally, we'd have some async_start(), prealloc(), prealloc(), async_finish() interface, where async_start() would block until another thread called async_finish(), so we never have a mixture. But that would currently be over-engineering. I'll attach the untested, likely broken, code I played with to see what it could look like. Observe how I only conditionally add the context to the list at the end of touch_all_pages(). From fe26cc5252f1284efa8e667310609a22c6166324 Mon Sep 17 00:00:00 2001 From: Mark Kanda Date: Mon, 22 Jan 2024 09:32:18 -0600 Subject: [PATCH] oslib-posix: initialize selected backend memory objects in parallel QEMU initializes preallocated backend memory as the objects are parsed from the command line. This is not optimal in some cases (e.g. memory spanning multiple NUMA nodes) because the memory objects are initialized in series. Allow the initialization to occur in parallel. In order to ensure optimal thread placement, parallel initialization requires prealloc context threads to be in use. Signed-off-by: Mark Kanda Signed-off-by: David Hildenbrand --- backends/hostmem.c | 8 ++- hw/virtio/virtio-mem.c | 4 +- include/qemu/osdep.h | 19 +- system/vl.c| 8 +++ util/oslib-posix.c | 130 +++-- util/oslib-win32.c | 8 ++- 6 files changed, 140 insertions(+), 37 deletions(-) diff --git a/backends/hostmem.c b/backends/hostmem.c index 30f69b2cb5..8f602dc86f 100644 --- a/backends/hostmem.c +++ b/backends/hostmem.c @@ -20,6 +20,7 @@ #include "qom/object_interfaces.h" #include "qemu/mmap-alloc.h" #include "qemu/madvise.h" +#include "hw/qdev-core.h" #ifdef CONFIG_NUMA #include @@ -235,9 +236,10 @@ static void host_memory_backend_set_prealloc(Object *obj, bool value, int fd = memory_region_get_fd(>mr); void *ptr = memory_region_get_ram_ptr(>mr); uint64_t sz = memory_region_size(>mr); +bool async = !phase_check(PHASE_MACHINE_INITIALIZED); if (!qemu_prealloc_mem(fd, ptr, sz, backend->prealloc_threads, - backend->prealloc_context, errp)) { + backend->prealloc_context, async, errp)) { return; } backend->prealloc = true; @@ -323,6 +325,7 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) HostMemoryBackendClass *bc = MEMORY_BACKEND_GET_CLASS(uc); void *ptr; uint64_t sz; +bool async = !phase_check(PHASE_MACHINE_INITIALIZED); if (!bc->alloc) { return; @@ -398,7 +401,8 @@ host_memory_backend_memory_complete(UserCreatable *uc, Error **errp) if (backend->prealloc && !qemu_prealloc_mem(memory_region_get_fd(>mr), ptr, sz, backend->prealloc_threads, -backend->prealloc_context, errp)) { +
Re: [PATCH v2 0/2] Initialize backend memory objects in parallel
On 29.01.24 14:39, Mark Kanda wrote: Ping. Any comments? Sorry, fell through the cracks, will review this soonish. -- Cheers, David / dhildenb
Re: [PATCH v2 0/2] Initialize backend memory objects in parallel
Ping. Any comments? Thanks/regards, -Mark On 1/22/24 9:32 AM, Mark Kanda wrote: v2: - require MADV_POPULATE_WRITE (simplify the implementation) - require prealloc context threads to ensure optimal thread placement - use machine phase 'initialized' to detremine when to allow parallel init QEMU initializes preallocated backend memory when parsing the corresponding objects from the command line. In certain scenarios, such as memory being preallocated across multiple numa nodes, this approach is not optimal due to the unnecessary serialization. This series addresses this issue by initializing the backend memory objects in parallel. Mark Kanda (2): oslib-posix: refactor memory prealloc threads oslib-posix: initialize backend memory objects in parallel backends/hostmem.c | 8 +++- hw/virtio/virtio-mem.c | 4 +- include/qemu/osdep.h | 14 +- system/vl.c| 6 +++ util/oslib-posix.c | 103 - util/oslib-win32.c | 8 +++- 6 files changed, 103 insertions(+), 40 deletions(-)
[PATCH v2 0/2] Initialize backend memory objects in parallel
v2: - require MADV_POPULATE_WRITE (simplify the implementation) - require prealloc context threads to ensure optimal thread placement - use machine phase 'initialized' to detremine when to allow parallel init QEMU initializes preallocated backend memory when parsing the corresponding objects from the command line. In certain scenarios, such as memory being preallocated across multiple numa nodes, this approach is not optimal due to the unnecessary serialization. This series addresses this issue by initializing the backend memory objects in parallel. Mark Kanda (2): oslib-posix: refactor memory prealloc threads oslib-posix: initialize backend memory objects in parallel backends/hostmem.c | 8 +++- hw/virtio/virtio-mem.c | 4 +- include/qemu/osdep.h | 14 +- system/vl.c| 6 +++ util/oslib-posix.c | 103 - util/oslib-win32.c | 8 +++- 6 files changed, 103 insertions(+), 40 deletions(-) -- 2.39.3