Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
On 10/31/2016 05:00 PM, Michael R. Hines wrote:
> On 10/18/2016 05:47 AM, Peter Lieven wrote:
>> Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
>>> Peter,
>>>
>>> Greetings from DigitalOcean. We're experiencing the same symptoms
>>> without this patch. [...]
>>>
>>> Is there any chance you have revisited this or have a timeline for it?
>>
>> Hi Michael,
>>
>> the current master already includes some of the patches of this
>> original series. There are still some changes left, but what works for
>> me is the current master plus a patch to util/qemu-coroutine.c
>> [snipped; quoted in full in Peter's message below], plus invoking qemu
>> with the following environment variable set:
>>
>> MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64
>>
>> The last one makes glibc automatically use mmap when the malloced
>> memory exceeds 32kByte.
>
> Peter,
>
> I tested the above patch (and the environment variable) --- it doesn't
> quite come close to as lean of an RSS tally as the original patchset;
> there's still about 70-80 MB of remaining RSS. Any chance you could
> trim the remaining fat before merging this? =)

False alarm! I didn't set the MMAP threshold low enough. Now the results
are on-par with the other patchset. Thank you!
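As a side note on the knob that made the difference here: glibc's
MALLOC_MMAP_THRESHOLD_ environment variable has a programmatic equivalent
in mallopt(). A minimal sketch of the same tuning (glibc-specific,
written for illustration, not part of the series):

/* Equivalent of launching with MALLOC_MMAP_THRESHOLD_=32768: tell
 * glibc's ptmalloc to service any allocation of 32kB or more with a
 * private mmap, so that free() unmaps it and RSS drops right away
 * instead of leaving holes in the brk heap. */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (mallopt(M_MMAP_THRESHOLD, 32 * 1024) != 1) { /* returns 1 on success */
        fprintf(stderr, "mallopt failed\n");
        return 1;
    }
    void *big = malloc(1024 * 1024);   /* served by mmap, not the heap */
    free(big);                         /* munmap: pages go back to the kernel */
    return 0;
}

Setting the threshold explicitly (by either route) also disables glibc's
dynamic mmap-threshold adjustment, which is why the value has to be
chosen low enough — as Michael found out above.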
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
On 10/18/2016 05:47 AM, Peter Lieven wrote:
> Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
>> Peter,
>>
>> Greetings from DigitalOcean. We're experiencing the same symptoms
>> without this patch. [...]
>>
>> Is there any chance you have revisited this or have a timeline for it?
>
> Hi Michael,
>
> the current master already includes some of the patches of this
> original series. There are still some changes left, but what works for
> me is the current master plus a patch to util/qemu-coroutine.c
> [snipped; quoted in full in Peter's message below], plus invoking qemu
> with the following environment variable set:
>
> MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64
>
> The last one makes glibc automatically use mmap when the malloced
> memory exceeds 32kByte.

Peter,

I tested the above patch (and the environment variable) --- it doesn't
quite come close to as lean of an RSS tally as the original patchset;
there's still about 70-80 MB of remaining RSS. Any chance you could trim
the remaining fat before merging this? =)

/*
 * Michael R. Hines
 * Senior Engineer, DigitalOcean.
 */
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Thank you for the response! I'll run off and test that. =)

/*
 * Michael R. Hines
 * Senior Engineer, DigitalOcean.
 */

On 10/18/2016 05:47 AM, Peter Lieven wrote:
> Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
>> Peter,
>>
>> Greetings from DigitalOcean. We're experiencing the same symptoms
>> without this patch. [...]
>>
>> Is there any chance you have revisited this or have a timeline for it?
>
> Hi Michael,
>
> the current master already includes some of the patches of this
> original series. There are still some changes left, but what works for
> me is the current master plus a patch to util/qemu-coroutine.c
> [snipped; quoted in full in Peter's message below], plus invoking qemu
> with the following environment variable set:
>
> MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64
>
> The last one makes glibc automatically use mmap when the malloced
> memory exceeds 32kByte.
>
> Hope this helps,
> Peter
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Am 12.10.2016 um 23:18 schrieb Michael R. Hines:
> Peter,
>
> Greetings from DigitalOcean. We're experiencing the same symptoms
> without this patch. We have, collectively, many gigabytes of
> unplanned-for RSS being used per-hypervisor that we would like to get
> rid of =).
>
> Without explicitly trying this patch (will do that ASAP), we
> immediately noticed that the 192MB mentioned melts away (Yay) when we
> disabled the coroutine thread pool explicitly, with another ~100MB in
> additional stack usage that would likely also go away if we applied
> the entirety of your patch.
>
> Is there any chance you have revisited this or have a timeline for it?

Hi Michael,

the current master already includes some of the patches of this original
series. There are still some changes left, but what works for me is the
current master plus the following patch:

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..3eaef68 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -25,8 +25,6 @@ enum {
 };

 /** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too. */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool. But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
+            /* Slow path; a good place to register the destructor, too. */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
             }
         }
         if (co) {
@@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;

     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;

plus invoking qemu with the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64

The last one makes glibc automatically use mmap when the malloced memory
exceeds 32kByte.

Hope this helps,
Peter
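For anyone trying to reproduce the RSS figures traded in this thread, a
tiny Linux-only helper (my own sketch, not from the series) reads the
same number the participants are quoting:

/* Print the process's resident set size (VmRSS) in kB, as reported by
 * the kernel in /proc/self/status.  Linux-specific. */
#include <stdio.h>
#include <string.h>

static long vm_rss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;

    if (!f) {
        return -1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            sscanf(line + 6, "%ld", &kb);   /* field is reported in kB */
            break;
        }
    }
    fclose(f);
    return kb;
}

int main(void)
{
    printf("RSS: %ld kB\n", vm_rss_kb());
    return 0;
}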
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms without
this patch. We have, collectively, many gigabytes of unplanned-for RSS
being used per-hypervisor that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we immediately
noticed that the 192MB mentioned melts away (Yay) when we disabled the
coroutine thread pool explicitly, with another ~100MB in additional stack
usage that would likely also go away if we applied the entirety of your
patch.

Is there any chance you have revisited this or have a timeline for it?

- Michael

/*
 * Michael R. Hines
 * Senior Engineer, DigitalOcean.
 */

On 06/28/2016 04:01 AM, Peter Lieven wrote:
> I recently found that Qemu is using several hundred megabytes of RSS
> memory more than older versions such as Qemu 2.2.0. So I started
> tracing memory allocation and found 2 major reasons for this.
>
> 1) We changed the qemu coroutine pool to have a per thread and a global
>    release pool. The chosen poolsize and the changed algorithm could
>    lead to up to 192 free coroutines with just a single iothread, each
>    of the coroutines in the pool having 1MB of stack memory.
>
> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed
>    freeing of memory. This led to higher heap allocations which could
>    not effectively be returned to kernel (most likely due to
>    fragmentation).
>
> The following series is what I came up with. Beside the coroutine
> patches I changed some allocations to forcibly use mmap. All these
> allocations are not repeatedly made during runtime, so the impact of
> using mmap should be negligible.
>
> There are still some big malloced allocations left which cannot be
> easily changed (e.g. the pixman buffers in VNC). So it might be an idea
> to set a lower mmap threshold for malloc, since this threshold seems to
> be in the order of several megabytes on modern systems.
>
> Peter Lieven (15):
>   coroutine-ucontext: mmap stack memory
>   coroutine-ucontext: add a switch to monitor maximum stack size
>   coroutine-ucontext: reduce stack size to 64kB
>   coroutine: add a knob to disable the shared release pool
>   util: add a helper to mmap private anonymous memory
>   exec: use mmap for subpages
>   qapi: use mmap for QmpInputVisitor
>   virtio: use mmap for VirtQueue
>   loader: use mmap for ROMs
>   vmware_svga: use mmap for scratch pad
>   qom: use mmap for bigger Objects
>   util: add a function to realloc mmapped memory
>   exec: use mmap for PhysPageMap->nodes
>   vnc-tight: make the encoding palette static
>   vnc: use mmap for VncState
>
>  configure                 | 33 ++--
>  exec.c                    | 11 ---
>  hw/core/loader.c          | 16 +-
>  hw/display/vmware_vga.c   |  3 +-
>  hw/virtio/virtio.c        |  5 +--
>  include/qemu/mmap-alloc.h |  7 +
>  include/qom/object.h      |  1 +
>  qapi/qmp-input-visitor.c  |  5 +--
>  qom/object.c              | 20 ++--
>  ui/vnc-enc-tight.c        | 21 ++---
>  ui/vnc.c                  |  5 +--
>  ui/vnc.h                  |  1 +
>  util/coroutine-ucontext.c | 66 +--
>  util/mmap-alloc.c         | 27 
>  util/qemu-coroutine.c     | 79 ++-
>  15 files changed, 225 insertions(+), 75 deletions(-)
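Patch 5 in the list above ("util: add a helper to mmap private anonymous
memory") is not shown in this thread. A rough sketch of the idea — the
names anon_mmap_alloc/anon_mmap_free are invented here and need not match
the series:

/* Sketch of a private-anonymous-mmap allocation helper.  Illustrative
 * only; the helper actually added by the series may look different. */
#include <stddef.h>
#include <sys/mman.h>

static void *anon_mmap_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

static void anon_mmap_free(void *p, size_t size)
{
    if (p) {
        munmap(p, size);   /* pages go back to the kernel right away */
    }
}

int main(void)
{
    void *buf = anon_mmap_alloc(1024 * 1024);
    if (buf) {
        anon_mmap_free(buf, 1024 * 1024);
    }
    return 0;
}

Because each such allocation lives in its own mapping, freeing it returns
the pages to the kernel immediately — exactly the property the malloc
heap lacks once it is fragmented, as described in the cover letter.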
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Am 28.06.2016 um 16:43 schrieb Peter Lieven:
> Am 28.06.2016 um 14:56 schrieb Dr. David Alan Gilbert:
>> * Peter Lieven (p...@kamp.de) wrote:
>>> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
>>>> [...]
>>>> Does it hit the guard page?
>>> How would that look like? I get segfaults like this:
>>>
>>> segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in
>>> qemu-system-x86_64[555ab6f2c000+794000]
>>>
>>> most of the time error 6. Sometimes error 7. segfault is near the sp.
>> A backtrace would be good.
> Here we go. My old friend nc_sendv_compat ;-)

This has already been fixed in master. My test systems use an older
Qemu ;-)

Peter

> Again the question: Would you go for reducing the stack size and
> eliminating all stack eaters? The static netbuf in nc_sendv_compat is
> no problem.
>
> And: I would go for adding the guard page without MAP_GROWSDOWN and
> mmapping the rest of the stack with this flag if available. So we are
> safe on non-Linux systems, or Linux before 3.9, or merged memory
> regions.
>
> Peter
>
> [backtrace snipped; quoted in full in the message below]
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Am 28.06.2016 um 14:56 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (p...@kamp.de) wrote:
>> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
>>> [...]
>>> Does it hit the guard page?
>> How would that look like? I get segfaults like this:
>>
>> segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in
>> qemu-system-x86_64[555ab6f2c000+794000]
>>
>> most of the time error 6. Sometimes error 7. segfault is near the sp.
> A backtrace would be good.

Here we go. My old friend nc_sendv_compat ;-)

Again the question: Would you go for reducing the stack size and
eliminating all stack eaters? The static netbuf in nc_sendv_compat is no
problem.

And: I would go for adding the guard page without MAP_GROWSDOWN and
mmapping the rest of the stack with this flag if available. So we are
safe on non-Linux systems, or Linux before 3.9, or merged memory regions.

Peter

---

Program received signal SIGSEGV, Segmentation fault.
0x55a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
    at net/net.c:701
(gdb) bt full
#0  0x55a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
    at net/net.c:701
        buf = '\000' ...
        buffer = 0x0
        offset = 0
#1  0x55a2f058 in qemu_deliver_packet_iov (sender=0x565a46b0, flags=0,
    iov=0x77e98d20, iovcnt=1, opaque=0x57802370) at net/net.c:745
        nc = 0x57802370
        ret = 21845
#2  0x55a3132d in qemu_net_queue_deliver (queue=0x57802590,
    sender=0x565a46b0, flags=0, data=0x5659e2a8 "", size=74)
    at net/queue.c:163
        ret = -1
        iov = {iov_base = 0x5659e2a8, iov_len = 74}
#3  0x55a3178b in qemu_net_queue_flush (queue=0x57802590)
    at net/queue.c:260
        packet = 0x5659e280
        ret = 21845
#4  0x55a2eb7a in qemu_flush_or_purge_queued_packets (nc=0x57802370,
    purge=false) at net/net.c:629
No locals.
#5  0x55a2ebe4 in qemu_flush_queued_packets (nc=0x57802370)
    at net/net.c:642
No locals.
#6  0x557747b7 in virtio_net_set_status (vdev=0x56fb32a8, status=7 '\a')
    at /usr/src/qemu-2.5.0/hw/net/virtio-net.c:178
        ncs = 0x57802370
        queue_started = true
        n = 0x56fb32a8
        __func__ = "virtio_net_set_status"
        q = 0x57308b50
        i = 0
        queue_status = 7 '\a'
#7  0x55795501 in virtio_set_status (vdev=0x56fb32a8, val=7 '\a')
    at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:618
        k = 0x5657eb40
        __func__ = "virtio_set_status"
#8  0x557985e6 in virtio_vmstate_change (opaque=0x56fb32a8, running=1,
    state=RUN_STATE_RUNNING) at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:1539
        vdev = 0x56fb32a8
        qbus = 0x56fb3240
        __func__ = "virtio_vmstate_change"
        k = 0x56570420
        backend_run = true
#9  0x558592ae in vm_state_notify (running=1, state=RUN_STATE_RUNNING)
    at vl.c:1601
        e = 0x57320cf0
        next = 0x57af4c40
#10 0x5585737d in vm_start () at vl.c:756
        requested = RUN_STATE_MAX
#11 0x55a209ec in process_incoming_migration_co (opaque=0x566a1600)
    at migration/migration.c:392
        f = 0x566a1600
        local_err = 0x0
        mis = 0x575ab0e0
        ps = POSTCOPY_INCOMING_NONE
        ret = 0
#12 0x55b61efd in coroutine_trampoline (i0=1465036928, i1=21845)
    at util/coroutine-ucontext.c:80
        arg = {p = 0x5752b080, i = {1465036928, 21845}}
        self = 0x5752b080
        co = 0x5752b080
#13 0x75cb7800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x7fffcb40 in ?? ()
No symbol table info available.
#15 0x in ?? ()
No symbol table info available.

> Dave
>
> [rest of quoted cover letter snipped; see Michael's message above for
> the full text]
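The guard-page layout Peter proposes (a PROT_NONE page below the stack,
without relying on MAP_GROWSDOWN) could look roughly like this sketch —
illustrative only, not the code from the series:

/* Map the stack plus one extra page, then revoke access to the lowest
 * page so that a stack overflow faults immediately instead of silently
 * corrupting neighbouring memory.  Stacks grow down on x86. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_coroutine_stack(size_t stack_size, size_t *total)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *base;

    *total = stack_size + page;          /* room for the guard page */
    base = mmap(NULL, *total, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    /* The guard page sits at the lowest address of the mapping. */
    if (mprotect(base, page, PROT_NONE) != 0) {
        munmap(base, *total);
        return NULL;
    }
    return (char *)base + page;          /* usable stack starts here */
}

int main(void)
{
    size_t total;
    void *stack = alloc_coroutine_stack(64 * 1024, &total);
    printf("stack at %p\n", stack);
    if (stack) {
        munmap((char *)stack - sysconf(_SC_PAGESIZE), total);
    }
    return 0;
}

With such a layout, an overflow of the 64kB stack should fault
deterministically at a predictable address in the guard page, making
crashes like the one above easier to diagnose.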
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
----- Original Message -----
> From: "Peter Lieven"
> To: "Paolo Bonzini"
> Cc: qemu-devel@nongnu.org, kw...@redhat.com, "peter maydell"
>     , m...@redhat.com, dgilb...@redhat.com, mre...@redhat.com,
>     kra...@redhat.com
> Sent: Tuesday, June 28, 2016 2:33:02 PM
> Subject: Re: [PATCH 00/15] optimize Qemu RSS usage
>
> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
> >> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> >>> On 28/06/2016 11:01, Peter Lieven wrote:
> >>>> I recently found that Qemu is using several hundred megabytes of
> >>>> RSS memory more than older versions such as Qemu 2.2.0. So I
> >>>> started tracing memory allocation and found 2 major reasons for
> >>>> this.
> >>>>
> >>>> 1) We changed the qemu coroutine pool to have a per thread and a
> >>>>    global release pool. The chosen poolsize and the changed
> >>>>    algorithm could lead to up to 192 free coroutines with just a
> >>>>    single iothread, each of the coroutines in the pool having 1MB
> >>>>    of stack memory.
> >>> But the fix, as you correctly note, is to reduce the stack size. It
> >>> would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> >> To reveal if there are any big stack allocations in the block layer?
> > Yes. Most should be fixed by now, but a handful are probably still
> > there (definitely one in vvfat.c).
> >
> >> As it seems reducing to 64kB breaks live migration in some (non
> >> reproducible) cases.
> > Does it hit the guard page?
>
> How would that look like? I get segfaults like this:
>
> segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in
> qemu-system-x86_64[555ab6f2c000+794000]
>
> most of the time error 6. Sometimes error 7. segfault is near the sp.

You can use "p ((CoroutineUContext*)current)->stack" from gdb to check
the stack base of the currently running coroutine (do it in the thread
that received the segfault). You can also check the instruction with
that ip and try to get a backtrace.

Paolo

> >>>> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to
> >>>>    delayed freeing of memory. This led to higher heap allocations
> >>>>    which could not effectively be returned to kernel (most likely
> >>>>    due to fragmentation).
> >>> I agree that some of the exec.c allocations need some care, but I
> >>> would prefer to use a custom free list or lazy allocation instead
> >>> of mmap.
> >> This would only help if the elements from the free list would be
> >> allocated using mmap? The issue is that RCU delays the freeing so
> >> that the number of concurrent allocations is high and then a bunch
> >> is freed at once. If the memory was malloced it would still have
> >> caused trouble.
> > The free list should improve reuse and fragmentation. I'll take a
> > look at lazy allocation of subpages, too.
>
> Ok, that would be good. And for the PhysPageMap we use mmap and try to
> avoid the realloc?

I think that with lazy allocation of subpages the PhysPageMap will be
much smaller, but I need to check.

Paolo
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
* Peter Lieven (p...@kamp.de) wrote:
> Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
> > > Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> > > > On 28/06/2016 11:01, Peter Lieven wrote:
> > > > > I recently found that Qemu is using several hundred megabytes
> > > > > of RSS memory more than older versions such as Qemu 2.2.0. So I
> > > > > started tracing memory allocation and found 2 major reasons for
> > > > > this.
> > > > >
> > > > > 1) We changed the qemu coroutine pool to have a per thread and
> > > > >    a global release pool. The chosen poolsize and the changed
> > > > >    algorithm could lead to up to 192 free coroutines with just
> > > > >    a single iothread, each of the coroutines in the pool having
> > > > >    1MB of stack memory.
> > > > But the fix, as you correctly note, is to reduce the stack size.
> > > > It would be nice to compile block-obj-y with -Wstack-usage=2048
> > > > too.
> > > To reveal if there are any big stack allocations in the block
> > > layer?
> > Yes. Most should be fixed by now, but a handful are probably still
> > there (definitely one in vvfat.c).
> >
> > > As it seems reducing to 64kB breaks live migration in some (non
> > > reproducible) cases.
> > Does it hit the guard page?
>
> How would that look like? I get segfaults like this:
>
> segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in
> qemu-system-x86_64[555ab6f2c000+794000]
>
> most of the time error 6. Sometimes error 7. segfault is near the sp.

A backtrace would be good.

Dave

> [rest of quote snipped]
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Am 28.06.2016 um 14:29 schrieb Paolo Bonzini:
> > Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> > > On 28/06/2016 11:01, Peter Lieven wrote:
> > > > I recently found that Qemu is using several hundred megabytes of
> > > > RSS memory more than older versions such as Qemu 2.2.0. So I
> > > > started tracing memory allocation and found 2 major reasons for
> > > > this.
> > > >
> > > > 1) We changed the qemu coroutine pool to have a per thread and a
> > > >    global release pool. The chosen poolsize and the changed
> > > >    algorithm could lead to up to 192 free coroutines with just a
> > > >    single iothread, each of the coroutines in the pool having 1MB
> > > >    of stack memory.
> > > But the fix, as you correctly note, is to reduce the stack size. It
> > > would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> > To reveal if there are any big stack allocations in the block layer?
> Yes. Most should be fixed by now, but a handful are probably still
> there (definitely one in vvfat.c).
>
> > As it seems reducing to 64kB breaks live migration in some (non
> > reproducible) cases.
> Does it hit the guard page?

How would that look like? I get segfaults like this:

segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in
qemu-system-x86_64[555ab6f2c000+794000]

most of the time error 6. Sometimes error 7. segfault is near the sp.

> > > > 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to
> > > >    delayed freeing of memory. This led to higher heap allocations
> > > >    which could not effectively be returned to kernel (most likely
> > > >    due to fragmentation).
> > > I agree that some of the exec.c allocations need some care, but I
> > > would prefer to use a custom free list or lazy allocation instead
> > > of mmap.
> > This would only help if the elements from the free list would be
> > allocated using mmap? The issue is that RCU delays the freeing so
> > that the number of concurrent allocations is high and then a bunch is
> > freed at once. If the memory was malloced it would still have caused
> > trouble.
> The free list should improve reuse and fragmentation. I'll take a look
> at lazy allocation of subpages, too.

Ok, that would be good. And for the PhysPageMap we use mmap and try to
avoid the realloc?

Peter
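The mechanism Peter describes — RCU letting allocations pile up and then
freeing them in one batch, with survivors pinning the heap — can be
reproduced with a toy program (my own illustration, not code from the
series):

/* Interleave long-lived and short-lived heap allocations; freeing the
 * short-lived batch barely shrinks RSS because glibc can only return
 * heap memory from the top, and the surviving chunks pin the pages.
 * malloc_trim() can reclaim some interior pages on newer glibc, but
 * fragmentation limits it. */
#include <malloc.h>
#include <stdlib.h>

#define N 10000

int main(void)
{
    void *batch[N], *live[N];

    for (int i = 0; i < N; i++) {
        batch[i] = malloc(4096);   /* freed later, RCU-style, in one go */
        live[i]  = malloc(64);     /* stays live, pinning the heap pages */
    }
    for (int i = 0; i < N; i++) {
        free(batch[i]);            /* RSS barely moves */
    }
    malloc_trim(0);                /* best effort only */
    for (int i = 0; i < N; i++) {
        free(live[i]);
    }
    return 0;
}

Watching VmRSS around the free loop shows the effect; mmap-backed
allocations would not have this problem, which is the motivation behind
both the patch series and the lowered mmap threshold.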
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
> Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> > On 28/06/2016 11:01, Peter Lieven wrote:
> >> I recently found that Qemu is using several hundred megabytes of RSS
> >> memory more than older versions such as Qemu 2.2.0. So I started
> >> tracing memory allocation and found 2 major reasons for this.
> >>
> >> 1) We changed the qemu coroutine pool to have a per thread and a
> >>    global release pool. The chosen poolsize and the changed
> >>    algorithm could lead to up to 192 free coroutines with just a
> >>    single iothread, each of the coroutines in the pool having 1MB of
> >>    stack memory.
> > But the fix, as you correctly note, is to reduce the stack size. It
> > would be nice to compile block-obj-y with -Wstack-usage=2048 too.
>
> To reveal if there are any big stack allocations in the block layer?

Yes. Most should be fixed by now, but a handful are probably still there
(definitely one in vvfat.c).

> As it seems reducing to 64kB breaks live migration in some (non
> reproducible) cases.

Does it hit the guard page?

> >> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to
> >>    delayed freeing of memory. This led to higher heap allocations
> >>    which could not effectively be returned to kernel (most likely
> >>    due to fragmentation).
> > I agree that some of the exec.c allocations need some care, but I
> > would prefer to use a custom free list or lazy allocation instead of
> > mmap.
>
> This would only help if the elements from the free list would be
> allocated using mmap? The issue is that RCU delays the freeing so that
> the number of concurrent allocations is high and then a bunch is freed
> at once. If the memory was malloced it would still have caused trouble.

The free list should improve reuse and fragmentation. I'll take a look
at lazy allocation of subpages, too.

Paolo
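A minimal sketch of the free-list idea Paolo is suggesting, with invented
names: recycle fixed-size elements instead of handing them back to
malloc, so a batch of RCU frees feeds later allocations rather than
fragmenting the heap:

/* Illustrative single-threaded free list for fixed-size nodes; not the
 * code that would land in exec.c. */
#include <stdlib.h>

typedef struct Node {
    struct Node *next;
    char payload[248];         /* whatever the real element holds */
} Node;

static Node *free_list;

static Node *node_alloc(void)
{
    Node *n = free_list;
    if (n) {
        free_list = n->next;   /* reuse a recycled node */
        return n;
    }
    return malloc(sizeof(Node));
}

static void node_free(Node *n)
{
    n->next = free_list;       /* don't free(); keep it for reuse */
    free_list = n;
}

int main(void)
{
    Node *n = node_alloc();
    if (n) {
        node_free(n);
    }
    return 0;
}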
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
Am 28.06.2016 um 13:37 schrieb Paolo Bonzini:
> On 28/06/2016 11:01, Peter Lieven wrote:
> > I recently found that Qemu is using several hundred megabytes of RSS
> > memory more than older versions such as Qemu 2.2.0. So I started
> > tracing memory allocation and found 2 major reasons for this.
> >
> > 1) We changed the qemu coroutine pool to have a per thread and a
> >    global release pool. The chosen poolsize and the changed algorithm
> >    could lead to up to 192 free coroutines with just a single
> >    iothread, each of the coroutines in the pool having 1MB of stack
> >    memory.
> But the fix, as you correctly note, is to reduce the stack size. It
> would be nice to compile block-obj-y with -Wstack-usage=2048 too.

To reveal if there are any big stack allocations in the block layer?

As it seems reducing to 64kB breaks live migration in some (non
reproducible) cases. The question is which way to go? Reduce the stack
size and fix the big stack allocations, or keep the stack size at 1MB?

> > 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to
> >    delayed freeing of memory. This led to higher heap allocations
> >    which could not effectively be returned to kernel (most likely due
> >    to fragmentation).
> I agree that some of the exec.c allocations need some care, but I
> would prefer to use a custom free list or lazy allocation instead of
> mmap.

This would only help if the elements from the free list would be
allocated using mmap? The issue is that RCU delays the freeing so that
the number of concurrent allocations is high and then a bunch is freed
at once. If the memory was malloced it would still have caused trouble.

> Changing allocations to use mmap also is not really useful if you do
> it for objects that are never freed (as in patches 8-9-10-15 at least,
> and probably 11 too, which is one of the most contentious).

9 actually frees the memory ;-) 15 frees the memory as soon as the vnc
client disconnects. The others I agree. Whether the objects in patch 11
are freed needs to be checked.

> In other words, the effort tracking down the allocation is really,
> really appreciated. But the patches look like you only had a hammer at
> hand, and everything looked like a nail. :)

I just have observed that forcing ptmalloc to use mmap for everything
above 4kB significantly reduced the RSS usage.

Peter
Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage
On 28/06/2016 11:01, Peter Lieven wrote:
> I recently found that Qemu is using several hundred megabytes of RSS
> memory more than older versions such as Qemu 2.2.0. So I started
> tracing memory allocation and found 2 major reasons for this.
>
> 1) We changed the qemu coroutine pool to have a per thread and a global
>    release pool. The chosen poolsize and the changed algorithm could
>    lead to up to 192 free coroutines with just a single iothread, each
>    of the coroutines in the pool having 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size. It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.

> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed
>    freeing of memory. This led to higher heap allocations which could
>    not effectively be returned to kernel (most likely due to
>    fragmentation).

I agree that some of the exec.c allocations need some care, but I would
prefer to use a custom free list or lazy allocation instead of mmap.

Changing allocations to use mmap also is not really useful if you do it
for objects that are never freed (as in patches 8-9-10-15 at least, and
probably 11 too, which is one of the most contentious).

In other words, the effort tracking down the allocations is really,
really appreciated. But the patches look like you only had a hammer at
hand, and everything looked like a nail. :)

Paolo

> The following series is what I came up with. Beside the coroutine
> patches I changed some allocations to forcibly use mmap. All these
> allocations are not repeatedly made during runtime, so the impact of
> using mmap should be negligible.
>
> There are still some big malloced allocations left which cannot be
> easily changed (e.g. the pixman buffers in VNC). So it might be an idea
> to set a lower mmap threshold for malloc, since this threshold seems to
> be in the order of several megabytes on modern systems.
>
> [patch list and diffstat snipped; quoted in full in Michael's message
> above]
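For reference, -Wstack-usage=2048 makes GCC warn about every function
whose frame may exceed 2048 bytes, which is how the big stack allocations
discussed in this thread can be found mechanically. A hypothetical
offender (not from the QEMU tree):

/* Compile with:  gcc -Wstack-usage=2048 -c stack-demo.c
 * GCC emits something like "warning: stack usage is 69664 bytes"
 * (exact number varies).  On a 64kB coroutine stack this function
 * alone would overflow into the guard page. */
#include <string.h>

int bad_for_small_stacks(const char *src, size_t len)
{
    char buf[68 * 1024];    /* bigger than the whole coroutine stack */

    if (len > sizeof(buf)) {
        return -1;
    }
    memcpy(buf, src, len);
    return (int)len;
}

The usual fix is either a heap (or mmap) allocation for the buffer or,
as with the static netbuf in nc_sendv_compat mentioned above, hoisting
it out of the stack frame entirely.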