Re: [PATCH v11 13/21] migration-test: Add COLO migration unit test

2026-03-19 Thread Lukas Straub
On Tue, 10 Mar 2026 23:12:25 +0530
Arun Menon  wrote:

> Hi Lukas,
> 
> On Tue, Mar 10, 2026 at 04:29:54PM +0100, Lukas Straub wrote:
> > On Tue, 10 Mar 2026 20:17:57 +0530
> > Arun Menon  wrote:
> >   
> > > Hi Lukas,
> > > 
> > > On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote:  
> > > > Add a COLO migration test for COLO migration and failover.
> > > > 
> > > > Reviewed-by: Fabiano Rosas 
> > > > Tested-by: Fabiano Rosas 
> > > > Reviewed-by: Peter Xu 
> > > > Signed-off-by: Lukas Straub 
> > > > [...]
> > > >
> > > 
> > > I was running the qtests locally, and I encountered a timeout error.
> > > 
> > > Command run: mkdir -p build ; cd build ; make check-qtest-x86_64;
> > > 
> > > Following is the output:
> > > ==
> > > [...]
> > > 
> > > Summary of Failures:
> > > 67/67 qtest+qtest-x86_64 - qemu:qtest-x86_64/migration-test   TIMEOUT   480.05s   killed by signal 15 SIGTERM
> > > Ok:        64
> > > Fail:      0
> > > Skipped:   2
> > > Timeout:   1
> > > ==
> > > 
> > > It seems that the test runner is stuck waiting for some input.
> > > Following is the stack trace
> > > [...]
> > > ==
> > > gstack 128276
> > > Thread 2 (Thread 0x7fdd090716c0 (LWP 128279) "call_rcu"):
> > > #0  0x7fdd0921434d in syscall () from /lib64/libc.so.6
> > > #1  0x557fd604563a in qemu_futex_wait (f=0x557fd60a0190, val=4294967295) at /home/arun/workdir/new/devel/upstream/qemu-priv/include/qemu/futex.h:47
> > > #2  0x557fd604584e in qemu_event_wait (ev=0x557fd60a0190) at ../util/event.c:162
> > > #3  0x557fd6045fde in call_rcu_thread (opaque=0x0) at ../util/rcu.c:304
> > > #4  0x557fd600e8fb in qemu_thread_start (args=0x557fd6beec70) at ../util/qemu-thread-posix.c:414
> > > #5  0x7fdd09193464 in start_thread () from /lib64/libc.so.6
> > > #6  0x7fdd092165ac in __clone3 () from /lib64/libc.so.6
> > > 
> > > Thread 1 (Thread 0x7fdd09073240 (LWP 128276) "migration-test"):
> > > #0  0x7fdd0919b982 in __syscall_cancel_arch () from /lib64/libc.so.6
> > > #1  0x7fdd0918fc3c in __internal_syscall_cancel () from /lib64/libc.so.6
> > > #2  0x7fdd091dfb62 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
> > > #3  0x7fdd091ebb37 in nanosleep () from /lib64/libc.so.6
> > > #4  0x7fdd0921613a in usleep () from /lib64/libc.so.6
> > > #5  0x557fd5fd99cd in wait_for_serial (side=0x557fd6065f08 "dest_serial") at ../tests/qtest/migration/framework.c:82
> > > #6  0x557fd5fe5865 in test_colo_common (args=0x557fd6bfdf50, failover_during_checkpoint=false, primary_failover=true) at ../tests/qtest/migration/colo-tests.c:66
> > 
> > 60    migrate_qmp(from, to, args->connect_uri, NULL, "{}");
> > 61
> > 62    wait_for_migration_status(from, "colo", NULL);
> > 63    wait_for_resume(to, get_dst());
> > 64
> > 65    wait_for_serial("src_serial");
> > 66    wait_for_serial("dest_serial");
> > 
> > Interesting, so the secondary guest is stuck/crashed after entering
> > colo state despite having resumed.
> > 
> > It works fine here on master. And before the merge I have looped the
> > colo tests for a whole day on my machine without any failures.
> > 
> > How often does this happen? What is the commit you are on, host, ASAN,
> > MSAN, UBSAN, configure options? With kvm or without?  
> 
> I have started the run on a fedora-43 VM. I am on commit:
> de61484ec39f418e5c0d4603017695f9ffb8fe24 master branch.

Okay, so the problem is that nested virtualization slows down KVM page
faults (e.g. when updating the dirty bitmap) so much that the guest
needs longer to dirty its memory than the interval between two COLO
checkpoints, and at each checkpoint the dirty bitmap is cleared again.
To make matters worse, the secondary QEMU process is starved by the
primary QEMU for some reason.

So the following happens:
1. After a checkpoint, the primary dirties all its memory pretty quickly,
and afterwards the dirty-memory + serial-output loop runs quickly. So it
is very likely that the guest is in the middle of dirtying memory at any
given time.
2. COLO checkpointing is actually working flawlessly; it only seems
stuck because migration-test is stuck and not echoing the QMP events.
The secondary receives the primary's checkpoint (whose guest is in the
middle of dirtying memory) and both sides resume. The secondary is
starved and spends a lot of time in the dirty-memory phase.
3. Another checkpoint happens before the secondary guest has finished
dirtying memory and has had a chance to send the serial output signal.
Go to 1.
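
To make the loop in steps 1-3 concrete, here is a toy model in C. It is
not QEMU code: the function name, the idea of a fixed per-pass duration
and the sample phase values are assumptions made purely for
illustration. It only captures the condition that the secondary emits
serial output when the rest of its dirty-memory pass fits into one
checkpoint interval.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the checkpoint livelock (illustrative only).
 *
 * delay_ms: interval between two checkpoints (x-checkpoint-delay).
 * pass_ms:  time the starved secondary needs for one full dirty-memory pass.
 * phase_ms: for each checkpoint, how far into its pass the primary guest
 *           was when the checkpoint was taken (the secondary resumes there).
 *
 * Returns the index of the first checkpoint interval in which the
 * secondary finishes its pass (and so can write to the serial port),
 * or -1 if every interval is cut short by the next checkpoint.
 */
int first_interval_with_serial(unsigned delay_ms, unsigned pass_ms,
                               const unsigned *phase_ms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(phase_ms[i] <= pass_ms);
        /* Work left in the pass after resuming from this checkpoint. */
        unsigned remaining_ms = pass_ms - phase_ms[i];
        if (remaining_ms < delay_ms) {
            return (int)i;
        }
    }
    return -1;
}
```

With a 5000 ms pass and checkpoints taken 0, 1000 and 2000 ms into the
pass, a 300 ms delay never lets the secondary finish, which matches the
hang in wait_for_serial("dest_serial").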

Can you try increasing the checkpoint interval in the following line and
see what works for you:
migrate_set_parameter_int(from, "x-checkpoint-delay", 300);

In my tests, raising it to 1000 worked fine, though for a doubly nested
VM I had to raise it to 5000, i.e. it takes up to 5 seconds to dirty
100 MB of memory in a doubly nested VM.
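
For trying out different values without recompiling each time, a small
helper along these lines could supply the delay. This is only a sketch:
the helper name and the environment variable mentioned below are
hypothetical and do not exist in QEMU; only the default of 300 and the
migrate_set_parameter_int() call come from the patch.

```c
#include <errno.h>
#include <stdlib.h>

/* Default taken from the test as posted. */
#define COLO_DEFAULT_CHECKPOINT_DELAY_MS 300

/*
 * Hypothetical helper: parse an override string (e.g. the value of an
 * environment variable) as a positive number of milliseconds. Falls back
 * to the default when the string is missing, empty, malformed or <= 0.
 */
long colo_checkpoint_delay_ms(const char *override)
{
    char *end = NULL;
    long value;

    if (override == NULL || *override == '\0') {
        return COLO_DEFAULT_CHECKPOINT_DELAY_MS;
    }

    errno = 0;
    value = strtol(override, &end, 10);
    if (errno != 0 || *end != '\0' || value <= 0) {
        return COLO_DEFAULT_CHECKPOINT_DELAY_MS;
    }
    return value;
}
```

In the test this could then be used as
migrate_set_parameter_int(from, "x-checkpoint-delay",
colo_checkpoint_delay_ms(getenv("QTEST_COLO_CHECKPOINT_DELAY")));
where QTEST_COLO_CHECKPOINT_DELAY is a made-up variable name.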

Regards,
Lukas Straub

> 
> The configuration comman

Re: [PATCH v11 13/21] migration-test: Add COLO migration unit test

2026-03-10 Thread Arun Menon
Hi Lukas,

On Tue, Mar 10, 2026 at 04:29:54PM +0100, Lukas Straub wrote:
> On Tue, 10 Mar 2026 20:17:57 +0530
> Arun Menon  wrote:
> 
> > Hi Lukas,
> > 
> > On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote:
> > > Add a COLO migration test for COLO migration and failover.
> > > 
> > > Reviewed-by: Fabiano Rosas 
> > > Tested-by: Fabiano Rosas 
> > > Reviewed-by: Peter Xu 
> > > Signed-off-by: Lukas Straub 
> > > ---
> > > >  MAINTAINERS                        |   1 +
> > > >  tests/qtest/meson.build            |   7 +-
> > > >  tests/qtest/migration-test.c       |   1 +
> > > >  tests/qtest/migration/colo-tests.c | 198 +
> > > >  tests/qtest/migration/framework.h  |   5 +
> > > >  5 files changed, 211 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 
> > > d2a1f4cc08223cb944b61e32a6d89e25bf82eacb..1b0ae10750036be00571b7104ad8426c071bb54c
> > >  100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -3875,6 +3875,7 @@ F: migration/colo*
> > >  F: migration/multifd-colo.*
> > >  F: include/migration/colo.h
> > >  F: include/migration/failover.h
> > > +F: tests/qtest/migration/colo-tests.c
> > >  F: docs/COLO-FT.txt
> > >  
> > >  COLO Proxy
> > > diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> > > index 
> > > 25fdbc798010b19e8ec9b6ab55e02d3fb5741398..6a46e2a767de12d978d910ddb6de175bce9810b8
> > >  100644
> > > --- a/tests/qtest/meson.build
> > > +++ b/tests/qtest/meson.build
> > > @@ -374,6 +374,11 @@ if gnutls.found()
> > >endif
> > >  endif
> > >  
> > > +migration_colo_files = []
> > > +if get_option('replication').allowed()
> > > +  migration_colo_files = [files('migration/colo-tests.c')]
> > > +endif
> > > +
> > >  qtests = {
> > >'aspeed_hace-test': files('aspeed-hace-utils.c', 'aspeed_hace-test.c'),
> > >'aspeed_smc-test': files('aspeed-smc-utils.c', 'aspeed_smc-test.c'),
> > > @@ -385,7 +390,7 @@ qtests = {
> > >   'migration/migration-util.c') + 
> > > dbus_vmstate1,
> > >'erst-test': files('erst-test.c'),
> > >'ivshmem-test': [rt, '../../contrib/ivshmem-server/ivshmem-server.c'],
> > > -  'migration-test': test_migration_files + migration_tls_files,
> > > > +  'migration-test': test_migration_files + migration_tls_files + migration_colo_files,
> > >'pxe-test': files('boot-sector.c'),
> > >'pnv-xive2-test': files('pnv-xive2-common.c', 'pnv-xive2-flush-sync.c',
> > >'pnv-xive2-nvpg_bar.c'),
> > > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > > index 08936871741535c926eeac40a7d7c3f461c72fd0..e582f05c7dc2673dbd05a936df8feb6c964b5bbc 100644
> > > > --- a/tests/qtest/migration-test.c
> > > > +++ b/tests/qtest/migration-test.c
> > > > @@ -55,6 +55,7 @@ int main(int argc, char **argv)
> > > >      migration_test_add_precopy(env);
> > > >      migration_test_add_cpr(env);
> > > >      migration_test_add_misc(env);
> > > > +    migration_test_add_colo(env);
> > > >  
> > > >      ret = g_test_run();
> > >  
> > > > diff --git a/tests/qtest/migration/colo-tests.c b/tests/qtest/migration/colo-tests.c
> > > > new file mode 100644
> > > > index ..598a1d3821ed0a90318732702027cebad47352fd
> > > > --- /dev/null
> > > > +++ b/tests/qtest/migration/colo-tests.c
> > > > @@ -0,0 +1,198 @@
> > > > +/*
> > > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > > + *
> > > > + * QTest testcases for COLO migration
> > > > + *
> > > > + * Copyright (c) 2025 Lukas Straub
> > > > + *
> > > > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > > + * See the COPYING file in the top-level directory.
> > > > + *
> > > > + */
> > > > +
> > > > +#include "qemu/osdep.h"
> > > > +#include "libqtest.h"
> > > > +#include "migration/framework.h"
> > > > +#include "migration/migration-qmp.h"
> > > > +#include "migration/migration-util.h"
> > > > +#include "qemu/module.h"
> > > > +
> > > > +static int test_colo_common(MigrateCommon *args,
> > > > +                            bool failover_during_checkpoint,
> > > > +                            bool primary_failover)
> > > > +{
> > > > +    QTestState *from, *to;
> > > > +    void *data_hook = NULL;
> > > > +
> > > > +    /*
> > > > +     * For the COLO test, both VMs will run in parallel. Thus both VMs want to
> > > > +     * open the image read/write at the same time. Using read-only=on is not
> > > > +     * possible here, because ide-hd does not support read-only backing image.
> > > > +     *
> > > > +     * So use -snapshot, where each qemu instance creates its own writable
> > > > +     * snapshot internally while leaving the real image read-only.
> > > > +     */
> > > > +    args->start.opts_source = "-snapshot";
> > > > +    args->start.opts_target = "-snapshot";
> > > > +
> > > > +    /*
> > > > +     * COLO migration code logs many errors when the migration socket
> > > > +     * is shut down, these are expected so we hide
Re: [PATCH v11 13/21] migration-test: Add COLO migration unit test

2026-03-10 Thread Lukas Straub
On Tue, 10 Mar 2026 20:17:57 +0530
Arun Menon  wrote:

> Hi Lukas,
> 
> On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote:
> > Add a COLO migration test for COLO migration and failover.
> > 
> > Reviewed-by: Fabiano Rosas 
> > Tested-by: Fabiano Rosas 
> > Reviewed-by: Peter Xu 
> > Signed-off-by: Lukas Straub 
> > ---
> > >  MAINTAINERS                        |   1 +
> > >  tests/qtest/meson.build            |   7 +-
> > >  tests/qtest/migration-test.c       |   1 +
> > >  tests/qtest/migration/colo-tests.c | 198 +
> > >  tests/qtest/migration/framework.h  |   5 +
> > >  5 files changed, 211 insertions(+), 1 deletion(-)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 
> > d2a1f4cc08223cb944b61e32a6d89e25bf82eacb..1b0ae10750036be00571b7104ad8426c071bb54c
> >  100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3875,6 +3875,7 @@ F: migration/colo*
> >  F: migration/multifd-colo.*
> >  F: include/migration/colo.h
> >  F: include/migration/failover.h
> > +F: tests/qtest/migration/colo-tests.c
> >  F: docs/COLO-FT.txt
> >  
> >  COLO Proxy
> > diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> > index 
> > 25fdbc798010b19e8ec9b6ab55e02d3fb5741398..6a46e2a767de12d978d910ddb6de175bce9810b8
> >  100644
> > --- a/tests/qtest/meson.build
> > +++ b/tests/qtest/meson.build
> > @@ -374,6 +374,11 @@ if gnutls.found()
> >endif
> >  endif
> >  
> > +migration_colo_files = []
> > +if get_option('replication').allowed()
> > +  migration_colo_files = [files('migration/colo-tests.c')]
> > +endif
> > +
> >  qtests = {
> >'aspeed_hace-test': files('aspeed-hace-utils.c', 'aspeed_hace-test.c'),
> >'aspeed_smc-test': files('aspeed-smc-utils.c', 'aspeed_smc-test.c'),
> > @@ -385,7 +390,7 @@ qtests = {
> >   'migration/migration-util.c') + dbus_vmstate1,
> >'erst-test': files('erst-test.c'),
> >'ivshmem-test': [rt, '../../contrib/ivshmem-server/ivshmem-server.c'],
> > -  'migration-test': test_migration_files + migration_tls_files,
> > > +  'migration-test': test_migration_files + migration_tls_files + migration_colo_files,
> >'pxe-test': files('boot-sector.c'),
> >'pnv-xive2-test': files('pnv-xive2-common.c', 'pnv-xive2-flush-sync.c',
> >'pnv-xive2-nvpg_bar.c'),
> > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > > index 08936871741535c926eeac40a7d7c3f461c72fd0..e582f05c7dc2673dbd05a936df8feb6c964b5bbc 100644
> > > --- a/tests/qtest/migration-test.c
> > > +++ b/tests/qtest/migration-test.c
> > > @@ -55,6 +55,7 @@ int main(int argc, char **argv)
> > >      migration_test_add_precopy(env);
> > >      migration_test_add_cpr(env);
> > >      migration_test_add_misc(env);
> > > +    migration_test_add_colo(env);
> > >  
> > >      ret = g_test_run();
> >  
> > > diff --git a/tests/qtest/migration/colo-tests.c b/tests/qtest/migration/colo-tests.c
> > > new file mode 100644
> > > index ..598a1d3821ed0a90318732702027cebad47352fd
> > > --- /dev/null
> > > +++ b/tests/qtest/migration/colo-tests.c
> > > @@ -0,0 +1,198 @@
> > > +/*
> > > + * SPDX-License-Identifier: GPL-2.0-or-later
> > > + *
> > > + * QTest testcases for COLO migration
> > > + *
> > > + * Copyright (c) 2025 Lukas Straub
> > > + *
> > > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > + * See the COPYING file in the top-level directory.
> > > + *
> > > + */
> > > +
> > > +#include "qemu/osdep.h"
> > > +#include "libqtest.h"
> > > +#include "migration/framework.h"
> > > +#include "migration/migration-qmp.h"
> > > +#include "migration/migration-util.h"
> > > +#include "qemu/module.h"
> > > +
> > > +static int test_colo_common(MigrateCommon *args,
> > > +                            bool failover_during_checkpoint,
> > > +                            bool primary_failover)
> > > +{
> > > +    QTestState *from, *to;
> > > +    void *data_hook = NULL;
> > > +
> > > +    /*
> > > +     * For the COLO test, both VMs will run in parallel. Thus both VMs want to
> > > +     * open the image read/write at the same time. Using read-only=on is not
> > > +     * possible here, because ide-hd does not support read-only backing image.
> > > +     *
> > > +     * So use -snapshot, where each qemu instance creates its own writable
> > > +     * snapshot internally while leaving the real image read-only.
> > > +     */
> > > +    args->start.opts_source = "-snapshot";
> > > +    args->start.opts_target = "-snapshot";
> > > +
> > > +    /*
> > > +     * COLO migration code logs many errors when the migration socket
> > > +     * is shut down, these are expected so we hide them here.
> > > +     */
> > > +    args->start.hide_stderr = true;
> > > +
> > > +    /*
> > > +     * Test with yank with out of band capability since that is how it is
> > > +     * used in production.
> > > +     */
> > > +    args->start.oob = true;
> > > +    args->start.caps[MIGRATION_CAPABILITY_X_COLO] = true;
> > > +
> > > +    if (migrate_start(&from, &

Re: [PATCH v11 13/21] migration-test: Add COLO migration unit test

2026-03-10 Thread Arun Menon
Hi Lukas,

On Mon, Mar 02, 2026 at 12:45:28PM +0100, Lukas Straub wrote:
> Add a COLO migration test for COLO migration and failover.
> 
> Reviewed-by: Fabiano Rosas 
> Tested-by: Fabiano Rosas 
> Reviewed-by: Peter Xu 
> Signed-off-by: Lukas Straub 
> ---
>  MAINTAINERS                        |   1 +
>  tests/qtest/meson.build            |   7 +-
>  tests/qtest/migration-test.c       |   1 +
>  tests/qtest/migration/colo-tests.c | 198 +
>  tests/qtest/migration/framework.h  |   5 +
>  5 files changed, 211 insertions(+), 1 deletion(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 
> d2a1f4cc08223cb944b61e32a6d89e25bf82eacb..1b0ae10750036be00571b7104ad8426c071bb54c
>  100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3875,6 +3875,7 @@ F: migration/colo*
>  F: migration/multifd-colo.*
>  F: include/migration/colo.h
>  F: include/migration/failover.h
> +F: tests/qtest/migration/colo-tests.c
>  F: docs/COLO-FT.txt
>  
>  COLO Proxy
> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> index 
> 25fdbc798010b19e8ec9b6ab55e02d3fb5741398..6a46e2a767de12d978d910ddb6de175bce9810b8
>  100644
> --- a/tests/qtest/meson.build
> +++ b/tests/qtest/meson.build
> @@ -374,6 +374,11 @@ if gnutls.found()
>endif
>  endif
>  
> +migration_colo_files = []
> +if get_option('replication').allowed()
> +  migration_colo_files = [files('migration/colo-tests.c')]
> +endif
> +
>  qtests = {
>'aspeed_hace-test': files('aspeed-hace-utils.c', 'aspeed_hace-test.c'),
>'aspeed_smc-test': files('aspeed-smc-utils.c', 'aspeed_smc-test.c'),
> @@ -385,7 +390,7 @@ qtests = {
>   'migration/migration-util.c') + dbus_vmstate1,
>'erst-test': files('erst-test.c'),
>'ivshmem-test': [rt, '../../contrib/ivshmem-server/ivshmem-server.c'],
> -  'migration-test': test_migration_files + migration_tls_files,
> +  'migration-test': test_migration_files + migration_tls_files + migration_colo_files,
>'pxe-test': files('boot-sector.c'),
>'pnv-xive2-test': files('pnv-xive2-common.c', 'pnv-xive2-flush-sync.c',
>'pnv-xive2-nvpg_bar.c'),
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 08936871741535c926eeac40a7d7c3f461c72fd0..e582f05c7dc2673dbd05a936df8feb6c964b5bbc 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -55,6 +55,7 @@ int main(int argc, char **argv)
>      migration_test_add_precopy(env);
>      migration_test_add_cpr(env);
>      migration_test_add_misc(env);
> +    migration_test_add_colo(env);
>  
>      ret = g_test_run();
>  
> diff --git a/tests/qtest/migration/colo-tests.c b/tests/qtest/migration/colo-tests.c
> new file mode 100644
> index ..598a1d3821ed0a90318732702027cebad47352fd
> --- /dev/null
> +++ b/tests/qtest/migration/colo-tests.c
> @@ -0,0 +1,198 @@
> +/*
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + *
> + * QTest testcases for COLO migration
> + *
> + * Copyright (c) 2025 Lukas Straub
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "libqtest.h"
> +#include "migration/framework.h"
> +#include "migration/migration-qmp.h"
> +#include "migration/migration-util.h"
> +#include "qemu/module.h"
> +
> +static int test_colo_common(MigrateCommon *args,
> +                            bool failover_during_checkpoint,
> +                            bool primary_failover)
> +{
> +    QTestState *from, *to;
> +    void *data_hook = NULL;
> +
> +    /*
> +     * For the COLO test, both VMs will run in parallel. Thus both VMs want to
> +     * open the image read/write at the same time. Using read-only=on is not
> +     * possible here, because ide-hd does not support read-only backing image.
> +     *
> +     * So use -snapshot, where each qemu instance creates its own writable
> +     * snapshot internally while leaving the real image read-only.
> +     */
> +    args->start.opts_source = "-snapshot";
> +    args->start.opts_target = "-snapshot";
> +
> +    /*
> +     * COLO migration code logs many errors when the migration socket
> +     * is shut down, these are expected so we hide them here.
> +     */
> +    args->start.hide_stderr = true;
> +
> +    /*
> +     * Test with yank with out of band capability since that is how it is
> +     * used in production.
> +     */
> +    args->start.oob = true;
> +    args->start.caps[MIGRATION_CAPABILITY_X_COLO] = true;
> +
> +    if (migrate_start(&from, &to, args->listen_uri, &args->start)) {
> +        return -1;
> +    }
> +
> +    migrate_set_parameter_int(from, "x-checkpoint-delay", 300);
> +
> +    if (args->start_hook) {
> +        data_hook = args->start_hook(from, to);
> +    }
> +
> +    migrate_ensure_converge(from);
> +    wait_for_serial("src_serial");
> +
> +    migrate_