Re: [PATCH] virtio-pci: Fix the use of an uninitialized irqfd.
Hi,

Am 22.05.24 um 07:10 schrieb Cindy Lu:
> The crash was reported in MAC OS and NixOS, here is the link for this bug
> https://gitlab.com/qemu-project/qemu/-/issues/2334
> https://gitlab.com/qemu-project/qemu/-/issues/2321
>
> The root cause is that the function virtio_pci_set_guest_notifiers() only
> initializes the irqfd when the use_guest_notifier_mask and guest_notifier_mask
> are set.

Sorry, I'm just trying to understand the fix and I'm probably missing
something, but in virtio_pci_set_guest_notifiers() there is:

> bool with_irqfd = msix_enabled(&proxy->pci_dev) &&
>     kvm_msi_via_irqfd_enabled();

and then:

> if ((with_irqfd ||
>      (vdev->use_guest_notifier_mask && k->guest_notifier_mask)) &&
>     assign) {
>     if (with_irqfd) {
>         proxy->vector_irqfd =
>             g_malloc0(sizeof(*proxy->vector_irqfd) *
>                       msix_nr_vectors_allocated(&proxy->pci_dev));
>         r = kvm_virtio_pci_vector_vq_use(proxy, nvqs);

Meaning proxy->vector_irqfd is allocated when with_irqfd is true (even if
vdev->use_guest_notifier_mask && k->guest_notifier_mask is false).

> However, this check is missing in virtio_pci_set_vector().
> So the fix is to add this check.
>
> This fix is verified in vyatta,MacOS,NixOS,fedora system.
>
> The bt tree for this bug is:
> Thread 6 "CPU 0/KVM" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7c817be006c0 (LWP 1269146)]
> kvm_virtio_pci_vq_vector_use () at ../qemu-9.0.0/hw/virtio/virtio-pci.c:817
> 817     if (irqfd->users == 0) {

The crash happens because the irqfd is NULL/invalid here, right?

proxy->vector_irqfd = NULL happens when virtio_pci_set_guest_notifiers() is
called with assign=false or for an unsuccessful call to
virtio_pci_set_guest_notifiers() with assign=true. AFAIU, the issue is that
virtio_pci_set_vector() is called between a call to
virtio_pci_set_guest_notifiers() with assign=false and a successful
virtio_pci_set_guest_notifiers() with assign=true (or before the first such
call).

So I'm trying to understand why adding the check for
vdev->use_guest_notifier_mask && k->guest_notifier_mask is sufficient to fix
the issue.

Thanks!

Best Regards,
Fiona
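The window described above can be sketched as a toy model (illustrative Python with made-up names, not QEMU's actual C code). Note that the guard modeled here is a plain NULL check on vector_irqfd, whereas the proposed fix checks use_guest_notifier_mask && guest_notifier_mask — which is exactly the sufficiency question raised in this mail:

```python
# Toy model of the vector_irqfd lifecycle: it only exists while guest
# notifiers are assigned with irqfd support.

class Proxy:
    def __init__(self):
        self.vector_irqfd = None  # NULL until notifiers are assigned

def set_guest_notifiers(proxy, assign, with_irqfd, nvectors=4):
    """Models virtio_pci_set_guest_notifiers(): allocate on assign
    (when with_irqfd), free again on unassign."""
    if assign and with_irqfd:
        proxy.vector_irqfd = [{"users": 0} for _ in range(nvectors)]
    elif not assign:
        proxy.vector_irqfd = None

def set_vector_unguarded(proxy, vector):
    # pre-fix behavior: unconditional dereference; crashes when
    # vector_irqfd is NULL (a TypeError stands in for the SIGSEGV)
    proxy.vector_irqfd[vector]["users"] += 1

def set_vector_guarded(proxy, vector):
    # the missing check: only touch the irqfd when it was allocated
    if proxy.vector_irqfd is not None:
        proxy.vector_irqfd[vector]["users"] += 1

proxy = Proxy()
set_guest_notifiers(proxy, assign=False, with_irqfd=True)
set_vector_guarded(proxy, 0)   # safe no-op; the unguarded variant crashes
```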
Re: [PULL 6/9] virtio-gpu: fix v2 migration
CC-ing stable, because this already is an issue in 9.0.0 Am 23.05.24 um 00:20 schrieb Fabiano Rosas: > From: Marc-André Lureau > > Commit dfcf74fa ("virtio-gpu: fix scanout migration post-load") broke > forward/backward version migration. Versioning of nested VMSD structures > is not straightforward, as the wire format doesn't have nested > structures versions. Introduce x-scanout-vmstate-version and a field > test to save/load appropriately according to the machine version. > > Fixes: dfcf74fa ("virtio-gpu: fix scanout migration post-load") > Signed-off-by: Marc-André Lureau > Signed-off-by: Peter Xu > Reviewed-by: Fiona Ebner > Tested-by: Fiona Ebner > [fixed long lines] > Signed-off-by: Fabiano Rosas > --- > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 30 ++ > include/hw/virtio/virtio-gpu.h | 1 + > 3 files changed, 24 insertions(+), 8 deletions(-) > > diff --git a/hw/core/machine.c b/hw/core/machine.c > index c7ceb11501..8d6dc69f0e 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -42,6 +42,7 @@ GlobalProperty hw_compat_8_2[] = { > { "migration", "zero-page-detection", "legacy"}, > { TYPE_VIRTIO_IOMMU_PCI, "granule", "4k" }, > { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "64" }, > +{ "virtio-gpu-device", "x-scanout-vmstate-version", "1" }, > }; > const size_t hw_compat_8_2_len = G_N_ELEMENTS(hw_compat_8_2); > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c > index ae831b6b3e..d60b1b2973 100644 > --- a/hw/display/virtio-gpu.c > +++ b/hw/display/virtio-gpu.c > @@ -1166,10 +1166,17 @@ static void virtio_gpu_cursor_bh(void *opaque) > virtio_gpu_handle_cursor(&g->parent_obj.parent_obj, g->cursor_vq); > } > > +static bool scanout_vmstate_after_v2(void *opaque, int version) > +{ > +struct VirtIOGPUBase *base = container_of(opaque, VirtIOGPUBase, > scanout); > +struct VirtIOGPU *gpu = container_of(base, VirtIOGPU, parent_obj); > + > +return gpu->scanout_vmstate_version >= 2; > +} > + > static const VMStateDescription vmstate_virtio_gpu_scanout =
{ > .name = "virtio-gpu-one-scanout", > -.version_id = 2, > -.minimum_version_id = 1, > +.version_id = 1, > .fields = (const VMStateField[]) { > VMSTATE_UINT32(resource_id, struct virtio_gpu_scanout), > VMSTATE_UINT32(width, struct virtio_gpu_scanout), > @@ -1181,12 +1188,18 @@ static const VMStateDescription > vmstate_virtio_gpu_scanout = { > VMSTATE_UINT32(cursor.hot_y, struct virtio_gpu_scanout), > VMSTATE_UINT32(cursor.pos.x, struct virtio_gpu_scanout), > VMSTATE_UINT32(cursor.pos.y, struct virtio_gpu_scanout), > -VMSTATE_UINT32_V(fb.format, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.bytes_pp, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.width, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.height, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.stride, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.offset, struct virtio_gpu_scanout, 2), > +VMSTATE_UINT32_TEST(fb.format, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.bytes_pp, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.width, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.height, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.stride, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.offset, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > VMSTATE_END_OF_LIST() > }, > }; > @@ -1659,6 +1672,7 @@ static Property virtio_gpu_properties[] = { > DEFINE_PROP_BIT("blob", VirtIOGPU, parent_obj.conf.flags, > VIRTIO_GPU_FLAG_BLOB_ENABLED, false), > DEFINE_PROP_SIZE("hostmem", VirtIOGPU, parent_obj.conf.hostmem, 0), > +DEFINE_PROP_UINT8("x-scanout-vmstate-version", VirtIOGPU, > scanout_vmstate_version, 2), > DEFINE_PROP_END_OF_LIST(), > }; > > diff --git a/include/hw/virtio/virtio-gpu.h b/include/hw/virtio/virtio-gpu.h > index 56d6e821bf..7a59379f5a 100644 > --- a/include/hw/virtio/virtio-gpu.h > 
+++ b/include/hw/virtio/virtio-gpu.h > @@ -177,6 +177,7 @@ typedef struct VGPUDMABuf { > struct VirtIOGPU { > VirtIOGPUBase parent_obj; > > +uint8_t scanout_vmstate_version; > uint64_t conf_max_hostmem; > > VirtQueue *ctrl_vq;
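The mechanism of the fix can be sketched with a simplified Python model (not the actual VMState machinery): a per-field test decides, based on a device property pinned by the machine type, whether the fb.* fields go on the wire, so both ends of a migration agree on the stream layout.

```python
def scanout_wire_fields(scanout_vmstate_version):
    """Models vmstate_virtio_gpu_scanout with the field test: the fb.*
    fields are only put on the wire when the device property says
    version >= 2 (the scanout_vmstate_after_v2() test)."""
    fields = ["resource_id", "width", "height", "x", "y",
              "cursor.resource_id", "cursor.hot_x", "cursor.hot_y",
              "cursor.pos.x", "cursor.pos.y"]
    if scanout_vmstate_version >= 2:
        fields += ["fb.format", "fb.bytes_pp", "fb.width",
                   "fb.height", "fb.stride", "fb.offset"]
    return fields

# hw_compat_8_2 pins x-scanout-vmstate-version=1 for 8.2-and-older
# machine types, so an 8.2 machine speaks the old stream on both ends,
# while new machine types (default property value 2) include fb.*:
assert "fb.format" not in scanout_wire_fields(1)
assert "fb.format" in scanout_wire_fields(2)
```

This avoids bumping the nested VMSD's version on the wire, which is what broke cross-version migration in the first place.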
Re: [PATCH 1/2] Revert "monitor: use aio_co_reschedule_self()"
CC-ing stable since 1f25c172f83704e350c0829438d832384084a74d is in 9.0.0 Am 06.05.24 um 21:06 schrieb Stefan Hajnoczi: > Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code > cleanup that uses aio_co_reschedule_self() instead of open coding > coroutine rescheduling. > > Bug RHEL-34618 was reported and Kevin Wolf identified > the root cause. I missed that aio_co_reschedule_self() -> > qemu_get_current_aio_context() only knows about > qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It > does not function correctly when going back from the iohandler_ctx to > qemu_aio_context. > > Go back to open coding the AioContext transitions to avoid this bug. > > This reverts commit 1f25c172f83704e350c0829438d832384084a74d. > > Buglink: https://issues.redhat.com/browse/RHEL-34618 > Signed-off-by: Stefan Hajnoczi > --- > qapi/qmp-dispatch.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c > index f3488afeef..176b549473 100644 > --- a/qapi/qmp-dispatch.c > +++ b/qapi/qmp-dispatch.c > @@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const > QmpCommandList *cmds, QObject *requ > * executing the command handler so that it can make progress if > it > * involves an AIO_WAIT_WHILE(). > */ > -aio_co_reschedule_self(qemu_get_aio_context()); > +aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self()); > +qemu_coroutine_yield(); > } > > monitor_set_cur(qemu_coroutine_self(), cur_mon); > @@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const > QmpCommandList *cmds, QObject *requ > * Move back to iohandler_ctx so that nested event loops for > * qemu_aio_context don't start new monitor commands. > */ > -aio_co_reschedule_self(iohandler_get_aio_context()); > +aio_co_schedule(iohandler_get_aio_context(), > +qemu_coroutine_self()); > +qemu_coroutine_yield(); > } > } else { > /*
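The failure mode can be sketched as follows (a simplified model of my reading of the commit message, not QEMU's actual code): because qemu_get_current_aio_context() misreports iohandler_ctx as qemu_aio_context, the "already there" early-out in aio_co_reschedule_self() fires spuriously and the coroutine never moves.

```python
QEMU_AIO_CTX = "qemu_aio_context"
IOHANDLER_CTX = "iohandler_ctx"

def get_current_aio_context(actual_ctx):
    # only knows about qemu_aio_context/IOThread AioContexts, so code
    # running in iohandler_ctx is misreported as qemu_aio_context
    return QEMU_AIO_CTX if actual_ctx == IOHANDLER_CTX else actual_ctx

def reschedule_self(actual_ctx, new_ctx):
    # models aio_co_reschedule_self(): early-out when "already there"
    if get_current_aio_context(actual_ctx) == new_ctx:
        return actual_ctx  # spurious no-op: the coroutine never moves
    return new_ctx

def schedule_and_yield(actual_ctx, new_ctx):
    # models the open-coded aio_co_schedule() + qemu_coroutine_yield():
    # it moves unconditionally, with no current-context comparison
    return new_ctx

# moving from iohandler_ctx to qemu_aio_context:
assert reschedule_self(IOHANDLER_CTX, QEMU_AIO_CTX) == IOHANDLER_CTX  # stuck
assert schedule_and_yield(IOHANDLER_CTX, QEMU_AIO_CTX) == QEMU_AIO_CTX
```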
Re: block snapshot issue with RBD
Hi, Am 28.05.24 um 20:19 schrieb Jin Cao: > Hi Ilya > > On 5/28/24 11:13 AM, Ilya Dryomov wrote: >> On Mon, May 27, 2024 at 9:06 PM Jin Cao wrote: >>> >>> Supplementary info: VM is paused after "migrate" command. After being >>> resumed with "cont", snapshot_delete_blkdev_internal works again, which >>> is confusing, as disk snapshots generally require I/O to be paused, and a >>> frozen VM satisfies this requirement. >> >> Hi Jin, >> >> This doesn't seem to be related to RBD. Given that the same error is >> observed when using the RBD driver with the raw format, I would dig in >> the direction of migration somehow "installing" the raw format (which >> is on-disk compatible with the rbd format). >> > > Thanks for the hint. > >> Also, did you mean to say "snapshot_blkdev_internal" instead of >> "snapshot_delete_blkdev_internal" in both instances? > > Sorry for my copy-and-paste mistake. Yes, it's snapshot_blkdev_internal. > > -- > Sincerely, > Jin Cao > >> >> Thanks, >> >> Ilya >> >>> >>> -- >>> Sincerely >>> Jin Cao >>> >>> On 5/27/24 10:56 AM, Jin Cao wrote: CC block and migration related address. On 5/27/24 12:03 AM, Jin Cao wrote: > Hi, > > I encountered an RBD block snapshot issue after doing migration. > > Steps > - > > 1. Start QEMU with: > ./qemu-system-x86_64 -name VM -machine q35 -accel kvm -cpu > host,migratable=on -m 2G -boot menu=on,strict=on > rbd:image/ubuntu-22.04-server-cloudimg-amd64.raw -net nic -net user > -cdrom /home/my/path/of/cloud-init.iso -monitor stdio > > 2. Do block snapshot in monitor cmd: snapshot_delete_blkdev_internal. > It works as expected: the snapshot is visible with command `rbd snap ls > pool_name/image_name`. > > 3. Do pseudo migration with monitor cmd: migrate -d > exec:cat>/tmp/vm.out > > 4.
Do block snapshot again with snapshot_delete_blkdev_internal, then > I get: > Error: Block format 'raw' used by device 'ide0-hd0' does not > support internal snapshots > > I was hoping to do the second block snapshot successfully, and it > feels abnormal that the RBD block snapshot function is disrupted after > migration. > > BTW, I get the same block snapshot error when I start QEMU with: > "-drive format=raw,file=rbd:pool_name/image_name" > > My question is: how could I proceed with RBD block snapshot after the > pseudo migration? > > I bisected this issue to d3007d348a ("block: Fix crash when loading snapshot on inactive node"). > diff --git a/block/snapshot.c b/block/snapshot.c > index ec8cf4810b..c4d40e80dd 100644 > --- a/block/snapshot.c > +++ b/block/snapshot.c > @@ -196,8 +196,10 @@ bdrv_snapshot_fallback(BlockDriverState *bs) > int bdrv_can_snapshot(BlockDriverState *bs) > { > BlockDriver *drv = bs->drv; > + > GLOBAL_STATE_CODE(); > -if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) { > + > +if (!drv || !bdrv_is_inserted(bs) || !bdrv_is_writable(bs)) { > return 0; > } > So I guess the issue is that the blockdev is not writable when in the "postmigrate" state? Best Regards, Fiona
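The effect of the bisected check change can be sketched as follows (a hedged Python model assuming, as guessed above, that the node is inactive and hence not writable in the POSTMIGRATE state):

```python
def can_snapshot_old(drv, inserted, read_only, inactive):
    # pre-d3007d348a: only rejected read-only nodes
    return bool(drv and inserted and not read_only)

def can_snapshot_new(drv, inserted, read_only, inactive):
    # post-d3007d348a: requires the node to be writable, and an
    # inactive node (e.g. after migration) is not writable even
    # though it is not read-only
    writable = not read_only and not inactive
    return bool(drv and inserted and writable)

# before migration: both variants allow the snapshot
assert can_snapshot_old("rbd", True, False, False)
assert can_snapshot_new("rbd", True, False, False)
# after "migrate -d exec:...", the VM sits in POSTMIGRATE, node inactive:
assert can_snapshot_old("rbd", True, False, True)      # used to succeed
assert not can_snapshot_new("rbd", True, False, True)  # now fails
```

When bdrv_can_snapshot() returns 0, bdrv_snapshot_fallback() ends up at the 'raw' format node, which produces the "does not support internal snapshots" error quoted above.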
[PATCH v2 0/2] backup: allow specifying minimum cluster size
Based-on: https://lore.kernel.org/qemu-devel/20240429115157.2260885-1-vsement...@yandex-team.ru/

Discussion for v1:
https://lore.kernel.org/qemu-devel/20240308155158.830258-1-f.eb...@proxmox.com/

Changes in v2:
* Use 'size' type in QAPI.
* Remove option in cbw_parse_options(), i.e. before parsing generic
  blockdev options.
* Reword commit messages hoping to describe the issue in a more
  straight-forward way.

In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has a smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

Fiona Ebner (2):
  copy-before-write: allow specifying minimum cluster size
  backup: add minimum cluster size to performance options

 block/backup.c             |  2 +-
 block/block-copy.c         | 22 ++
 block/copy-before-write.c  | 18 +-
 block/copy-before-write.h  |  1 +
 blockdev.c                 |  3 +++
 include/block/block-copy.h |  1 +
 qapi/block-core.json       | 17 ++---
 7 files changed, 55 insertions(+), 9 deletions(-)

-- 
2.39.2
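The align-down behavior described in the cover letter can be sketched with a few lines of Python (a toy model of the alignment arithmetic, not the actual cbw_co_pdiscard_snapshot() code; the 4 MiB fleecing granularity is an assumed example value):

```python
def align_up(x, granularity):
    return ((x + granularity - 1) // granularity) * granularity

def align_down(x, granularity):
    return (x // granularity) * granularity

def discarded_range(offset, length, granularity):
    # models the align-down: shrink the request to granularity-aligned
    # boundaries; a request smaller than one granule vanishes entirely
    start = align_up(offset, granularity)
    end = align_down(offset + length, granularity)
    return (start, end - start) if end > start else None

KiB, MiB = 1024, 1024 * 1024
# block-copy works at 64 KiB, so discards to the source are 64 KiB wide:
assert discarded_range(0, 64 * KiB, 64 * KiB) == (0, 64 * KiB)
# but if the fleecing image uses 4 MiB clusters, the same request is
# aligned down to nothing and silently dropped:
assert discarded_range(0, 64 * KiB, 4 * MiB) is None
# forcing min-cluster-size=4M makes the copy (and discard) requests
# large enough again:
assert discarded_range(0, 4 * MiB, 4 * MiB) == (0, 4 * MiB)
```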
[PATCH v2 2/2] backup: add minimum cluster size to performance options
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has a smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore them. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * Use 'size' type in QAPI. block/backup.c| 2 +- block/copy-before-write.c | 8 block/copy-before-write.h | 1 + blockdev.c| 3 +++ qapi/block-core.json | 9 +++-- 5 files changed, 20 insertions(+), 3 deletions(-) diff --git a/block/backup.c b/block/backup.c index 3dd2e229d2..a1292c01ec 100644 --- a/block/backup.c +++ b/block/backup.c @@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, } cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source, - &bcs, errp); + perf->min_cluster_size, &bcs, errp); if (!cbw) { goto error; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index ef0bc4dcfe..183eed42e5 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -553,6 +553,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp) { @@ -572,6 +573,13 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, qdict_put_str(opts, "file", bdrv_get_node_name(source)); qdict_put_str(opts, "target", bdrv_get_node_name(target)); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %lu > %ld", + min_cluster_size, INT64_MAX); +return NULL; +} +qdict_put_int(opts, "min-cluster-size", (int64_t)min_cluster_size); + top =
bdrv_insert_node(source, opts, flags, errp); if (!top) { return NULL; diff --git a/block/copy-before-write.h b/block/copy-before-write.h index 01af0cd3c4..2a5d4ba693 100644 --- a/block/copy-before-write.h +++ b/block/copy-before-write.h @@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp); void bdrv_cbw_drop(BlockDriverState *bs); diff --git a/blockdev.c b/blockdev.c index 835064ed03..6740663fda 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2655,6 +2655,9 @@ static BlockJob *do_backup_common(BackupCommon *backup, if (backup->x_perf->has_max_chunk) { perf.max_chunk = backup->x_perf->max_chunk; } +if (backup->x_perf->has_min_cluster_size) { +perf.min_cluster_size = backup->x_perf->min_cluster_size; +} } if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) || diff --git a/qapi/block-core.json b/qapi/block-core.json index 8fc0a4b234..f1219a9dfb 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1551,11 +1551,16 @@ # it should not be less than job cluster size which is calculated # as maximum of target image cluster size and 64k. Default 0. # +# @min-cluster-size: Minimum size of blocks used by copy-before-write +# and background copy operations. Has to be a power of 2. No +# effect if smaller than the maximum of the target's cluster size +# and 64 KiB. Default 0. (Since 9.1) +# # Since: 6.0 ## { 'struct': 'BackupPerf', - 'data': { '*use-copy-range': 'bool', -'*max-workers': 'int', '*max-chunk': 'int64' } } + 'data': { '*use-copy-range': 'bool', '*max-workers': 'int', +'*max-chunk': 'int64', '*min-cluster-size': 'size' } } ## # @BackupCommon: -- 2.39.2
[PATCH v2 1/2] copy-before-write: allow specifying minimum cluster size
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has a smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore them. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. The type 'size' (corresponding to uint64_t in C) is used in QAPI to rule out negative inputs and for consistency with already existing @cluster-size parameters. Since block_copy_calculate_cluster_size() uses int64_t for its result, a check that the input is not too large is added in block_copy_state_new() before calling it. The calculation in block_copy_calculate_cluster_size() is done in the target int64_t type. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * Use 'size' type in QAPI. * Remove option in cbw_parse_options(), i.e. before parsing generic blockdev options. block/block-copy.c | 22 ++ block/copy-before-write.c | 10 +- include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 35 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..36eaecaaf4 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, "used.
If the actual block size of the target exceeds " "this default, the backup may be unusable", BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. */ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, + (int64_t)MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size)); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + uint64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +368,18 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %lu > %ld", + min_cluster_size, INT64_MAX); +return NULL; +} else if (min_cluster_size && !is_power_of_2(min_cluster_size)) { +error_setg(errp, "min-cluster-size needs to be a power of 2"); +return NULL; +} + +cluster_size = block_copy_calculate_cluster_size(target->bs, + (int64_t)min_cluster_size, + errp); if (cluster_size < 0) { return NULL; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index cd65524e26..ef0bc4dcfe 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -417,6 +417,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp) qdict_extract_subqdict(options, NULL, "bitmap"); 
qdict_del(options, "on-cbw-error"); qdict_del(options, "cbw-timeout"); +qdict_del(options, "min-cluster-size"); out: visit_free(v); @@ -432,6 +433,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags, BDRVCopyBeforeWriteState *s = bs->opaque; BdrvDirtyBitmap *bitmap = NULL
[PATCH v4 4/5] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap, and the test was changed to use backing images and @sync-mode=write-blocking. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 adapt to changes to mirror bitmap interface rename test from '384' to 'mirror-bitmap' use backing files, copy-mode=write-blocking, larger cluster size] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/mirror-bitmap | 597 tests/qemu-iotests/tests/mirror-bitmap.out | 3191 2 files changed, 3788 insertions(+) create mode 100755 tests/qemu-iotests/tests/mirror-bitmap create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap new file mode 100755 index 00..37bbe0f241 --- /dev/null +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -0,0 +1,597 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. 
+# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 +IMAGE_CLUSTER_SIZE = 128 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing 
re-dirties the one cluster +# that copies out successfully before failure in Group #1. +Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits |= set(bits) + +def dirty_group(self, n): +self.
[PATCH v4 0/5] mirror: allow specifying working bitmap
Changes from v3 (discussion here [3]):
* Improve/fix QAPI documentation.

Changes from v2 (discussion here [2]):
* Cluster size caveats only apply to non-COW diff images, adapt the
  cluster size check and documentation accordingly.
* In the IO test, use backing files (rather than stand-alone diff
  images) in combination with copy-mode=write-blocking and larger
  cluster size for target images, to have a more realistic use-case and
  show that COW prevents ending up with clusters with partial data upon
  unaligned writes.
* Create a separate patch for replacing is_none_mode with sync_mode in
  the MirrorBlockJob struct.
* Disallow using a read-only bitmap (cannot be used as working bitmap).
* Require that the bitmap is enabled at the start of the job.
* Avoid IO test script potentially waiting on a non-existent job when
  the blockdev-mirror QMP command fails.
* Fix pylint issues in the IO test.
* Rename the IO test from sync-bitmap-mirror to mirror-bitmap.

Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop the bitmap-mode parameter and use the passed-in bitmap as the
  working bitmap instead. Users can get the desired behaviors by using
  the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see
  commit message in patch 2/4 for how exactly).
* Add patch to check whether the target image's cluster size is at most
  the mirror job's granularity. Optional, it's an extra safety check
  that's useful when the target is a "diff" image that does not have
  previously synced data.

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for
about 4 years now.
In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/5 actually contains yet another use case, namely doing incremental mirrors to qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. Initially, the qcow2 diff image targets were stand-alone and that suffers from an issue when 'write-blocking' mode is used. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The test thus uses a more realistic approach where the qcow2 targets have backing images and a check is added in patch 5/5 for the cluster size for non-COW targets. However, with e.g. NBD, the cluster size cannot be queried and prohibiting all bitmap mirrors to NBD targets just to prevent the highly specific edge case seems not worth it, so the limitation is rather documented and the check ignores cases where querying the target image's cluster size returns -ENOTSUP. 
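The partial-cluster pitfall described above can be made concrete with a small toy model (assumed sizes; not actual qcow2 allocation logic): an unaligned write copied into a stand-alone, non-COW diff image allocates a whole cluster, and after rebasing, the unwritten half of that cluster shadows the backing file.

```python
CLUSTER = 128 * 1024  # assumed diff-image cluster size (mirror granularity 64K)

def read_after_rebase(diff_clusters, backing, offset):
    # after rebasing the diff onto the backing image: an allocated
    # cluster in the diff shadows the backing file entirely
    cluster = offset // CLUSTER
    if cluster in diff_clusters:
        return diff_clusters[cluster].get(offset, 0)  # zeros where unwritten
    return backing.get(offset, 0)

backing = {64 * 1024: 0xAA}      # pre-existing data in the second 64K half
diff_clusters = {0: {0: 0xBB}}   # mirror copied an unaligned 64K write at
                                 # offset 0: the whole 128K cluster allocates

assert read_after_rebase(diff_clusters, backing, 0) == 0xBB
# the untouched half of the cluster now reads as zeros instead of the
# backing data, even though the backing file still holds it:
assert read_after_rebase(diff_clusters, backing, 64 * 1024) == 0
```

With a backing file attached instead, the partial write triggers copy-on-write of the rest of the cluster, which is why the IO test switched to backing images.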
[0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ [2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/ [3]: https://lore.kernel.org/qemu-devel/20240510131647.1256467-1-f.eb...@proxmox.com/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (3): qapi/block-core: avoid the re-use of MirrorSyncMode for backup block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c | 18 +- block/mirror.c | 101 +- block/monitor/block-hmp-cmds.c |2 +- block/replication.c|2 +- blockdev.c | 127 +- include/block/block_int-global-state.h |7 +- qapi/block-core.json | 62 +- tests/qemu-iotests/tests/mirror-bitmap | 603 tests/qemu-iotests/tests/mirror-bitmap.out | 3198 tests/unit/test-block-iothread.c |2 +- 10 files changed, 4053 insertions(+), 69 deletions(-) create mode 100755 tests/qemu-iotes
[PATCH v4 3/5] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a non-COW "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when mirror still got work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limitation in QAPI. 
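The three emulated @BitmapSyncMode flavours can be sketched in terms of their end states (a set-based Python sketch of my reading of the description above; sets stand in for dirty bitmaps, and set copies/clears stand in for block-dirty-bitmap-merge/-clear):

```python
def mirror_with_bitmap(work):
    # the job uses `work` as its working bitmap, clearing what it copies
    # (simplified: a successful job copies and clears everything)
    work.clear()

def sync(mode, user_bitmap, success=True):
    if mode == "always":
        work = user_bitmap            # pass the caller's bitmap directly
    else:
        work = set(user_bitmap)       # pass a copy to the job
    if success:
        mirror_with_bitmap(work)
        if mode == "on-success":
            user_bitmap.clear()       # reflect the success in the original
    return user_bitmap

assert sync("always", {1, 2, 3}) == set()                     # consumed
assert sync("always", {1, 2, 3}, success=False) == {1, 2, 3}  # resumable
assert sync("never", {1, 2, 3}) == {1, 2, 3}                  # untouched
assert sync("on-success", {1, 2, 3}) == set()
assert sync("on-success", {1, 2, 3}, success=False) == {1, 2, 3}
```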
This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster --- block/mirror.c | 80 +- blockdev.c | 44 +++--- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 35 ++- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 141 insertions(+), 25 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index ca23d6ef65..d3d0698116 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). + */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -820,6 +829,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. 
*/ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (!s->dirty_bitmap_is_local && s->dirty_bitmap) { +bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE && s->dirty_bitmap_is_local) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(&s->common.job)) { goto immediate_exit; @@ -1029,6 +1048,14 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) */ mirror_top_opaque->job = s; +
[PATCH v4 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). A consequence is also tighter introspection: query-qmp-schema no longer reports drive-mirror and blockdev-mirror accepting @sync values they actually reject. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == _job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We 
can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 1bdce3b657..c0597039a5 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == 
MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index d954bec6f1..9633d
[PATCH v4 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
It is more flexible and is done in preparation to support specifying a working bitmap for mirror jobs. In particular, this makes it possible to assert that @sync_mode=full when a bitmap is used. That assertion is just to be sure, of course the mirror QMP commands will be made to fail earlier with a clean error. Signed-off-by: Fiona Ebner --- block/mirror.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c0597039a5..ca23d6ef65 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? 
src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(>common.job)) { goto immediate_exit; @@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name, bool is_mirror, MirrorCopyMode copy_mode, Error **errp) @@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job( s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; -s->is_none_mode = is_none_mode; +s->sync_mode = sync_mode; s->backing_mode = backing_mode; s->zero_target = zero_target; qatomic_set(>copy_mode, copy_mode); @@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState *bs, bool unmap, const char *filter_node_name, MirrorCopyMode copy_mode, Error **errp) { -bool is_none_mode; BlockDriverState *base; GLOBAL_STATE_CODE(); bdrv_graph_rdlock_main_loop(); -is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? 
bdrv_backing_chain_next(bs) : NULL; bdrv_graph_rdunlock_main_loop(); mirror_start_job(job_id, bs, creation_flags, target, replaces, speed, granularity, buf_size, backing_mode, zero_target, on_source_error, on_target_error, unmap, NULL, NULL, - &mirror_job_driver, is_none_mode, base, false, + &mirror_job_driver, mode, base, false, filter_node_name, true, copy_mode, errp); } @@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs, job_id, bs, creation_flags, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN, false, on_error, on_error, true, cb, opaque, - &commit_active_job_driver, false, base, auto_complete, - filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND, - errp); + &commit_active_job_driver, MIRROR_SYNC_MODE_FULL, base, + auto_complete, filter_node_name, false, + MIRROR_COPY_MODE_BACKGROUND, errp); if (!job) { goto error_restore_flags; } -- 2.39.2
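A minimal sketch (Python stand-ins, not QEMU types) of why carrying the full sync mode is more flexible than the old is_none_mode bool:

```python
from enum import Enum

class MirrorSyncMode(Enum):
    TOP = "top"
    FULL = "full"
    NONE = "none"

def choose_backing(sync_mode, src, base):
    # The same decision mirror_exit_common() makes: with sync=none the
    # target keeps the source as backing, otherwise the job's base node.
    return src if sync_mode is MirrorSyncMode.NONE else base

def bitmap_allowed(sync_mode, have_bitmap):
    # The enum lets later patches assert that a working bitmap is only
    # combined with sync=full; a bare is_none_mode bool cannot
    # distinguish full from top at this point.
    return not have_bitmap or sync_mode is MirrorSyncMode.FULL

assert choose_backing(MirrorSyncMode.NONE, "src", "base") == "src"
assert choose_backing(MirrorSyncMode.FULL, "src", "base") == "base"
```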
[PATCH v4 5/5] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target does not do COW and is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the job's granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for the special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- blockdev.c | 57 ++ tests/qemu-iotests/tests/mirror-bitmap | 6 +++ tests/qemu-iotests/tests/mirror-bitmap.out | 7 +++ 3 files changed, 70 insertions(+) diff --git a/blockdev.c b/blockdev.c index 4f72a72dc7..468974108e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp) blockdev_do_action(&action, errp); } +static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target, +BdrvDirtyBitmap *bitmap, +Error **errp) +{ +int ret; +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + +GLOBAL_STATE_CODE(); +GRAPH_RDLOCK_GUARD_MAINLOOP(); + +if (bdrv_backing_chain_next(target)) { +/* + * No need to worry about creating clusters with partial data when the + * target does COW. + */ +return 0; +} + +/* + * If there is no backing file on the target, we cannot rely on COW if our + * backup cluster size is smaller than the target cluster size. Even for + * targets with a backing file, try to avoid COW if possible.
+ */ +ret = bdrv_get_info(target, &bdi); +if (ret == -ENOTSUP) { +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to be small enough then. + */ +return 0; +} else if (ret < 0) { +error_setg_errno(errp, -ret, +"Couldn't determine the cluster size of the target image, " +"which has no backing file"); +return ret; +} + +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of the " + "target image's cluster size %u and the target image has " + "no backing file", + bitmap_granularity, bdi.cluster_size); +return -EINVAL; +} + +return 0; +} + + /* Parameter check and block job starting for drive mirroring. * Caller should hold @device and @target's aio context (must be the same). **/ @@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) { +return; +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { return; } diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap index 37bbe0f241..e8cd482a19 100755 --- a/tests/qemu-iotests/tests/mirror-bitmap +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -584,6 +584,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity to non-COW target --\n") +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=GRANULARITY) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/mirror-bitmap.out
b/tests/qemu-iotests/tests/mirror-bitmap.out index 5c8acc1d69..af605f3803 100644 --- a/tests/qemu-iotests/tests/mirror-bitmap.out +++ b/tests/qemu-iotests/tests/mirror-bitmap.out @@ -3189,3 +3189,10 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DI
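Restated outside of C, the new check accepts a bitmap granularity for a non-COW target only if each mirrored chunk fully covers the target clusters it allocates, i.e. the granularity is at least the target's cluster size and an exact multiple of it (illustrative helper, not QEMU code):

```python
def bitmap_granularity_ok(bitmap_granularity, target_cluster_size):
    """True if mirroring chunks of `bitmap_granularity` bytes can never
    leave a partially-filled cluster in a non-COW target image."""
    return (bitmap_granularity >= target_cluster_size
            and bitmap_granularity % target_cluster_size == 0)

assert bitmap_granularity_ok(64 * 1024, 64 * 1024)       # equal sizes
assert bitmap_granularity_ok(128 * 1024, 64 * 1024)      # larger multiple
assert not bitmap_granularity_ok(32 * 1024, 64 * 1024)   # too small
assert not bitmap_granularity_ok(96 * 1024, 64 * 1024)   # not a multiple
```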
Re: [PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1
Am 21.05.24 um 00:22 schrieb Fabiano Rosas: > Fiona Ebner writes: > >> Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine >> version 8.1 can fail with: >> >>> kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 >>> kvm: Failed to load virtio-net:virtio >>> kvm: error while loading state for instance 0x0 of device >>> ':00:12.0/virtio-net' >>> kvm: load of migration failed: Operation not permitted >> >> The series >> >> 53da8b5a99 virtio-net: Add support for USO features >> 9da1684954 virtio-net: Add USO flags to vhost support. >> f03e0cf63b tap: Add check for USO features >> 2ab0ec3121 tap: Add USO support to tap device. >> >> only landed in QEMU 8.2, so the compatibility flags should be part of >> machine version 8.1. >> >> Moving the flags unfortunately breaks forward migration with machine >> version 8.1 from a binary without this patch to a binary with this >> patch. >> >> Fixes: 53da8b5a99 ("virtio-net: Add support for USO features") >> Signed-off-by: Fiona Ebner > > Reviewed-by: Fabiano Rosas > > I'll get to it eventually, but is this another one where just having > -device virtio-net in the command line when testing cross-version > migration would already have caught the issue? > AFAIU, the guest kernel needs to be recent enough to support the feature too. I don't seem to run into the issue with a Debian 11 (kernel 5.10) guest, but I do run into the issue with an Ubuntu 23.10 (kernel 6.5) guest. Seems like it got added in kernel 6.2 with 418044e1de30 ("drivers/net/virtio_net.c: Added USO support.") Best Regards, Fiona
[PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1
Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine version 8.1 can fail with: > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 > kvm: Failed to load virtio-net:virtio > kvm: error while loading state for instance 0x0 of device > ':00:12.0/virtio-net' > kvm: load of migration failed: Operation not permitted The series 53da8b5a99 virtio-net: Add support for USO features 9da1684954 virtio-net: Add USO flags to vhost support. f03e0cf63b tap: Add check for USO features 2ab0ec3121 tap: Add USO support to tap device. only landed in QEMU 8.2, so the compatibility flags should be part of machine version 8.1. Moving the flags unfortunately breaks forward migration with machine version 8.1 from a binary without this patch to a binary with this patch. Fixes: 53da8b5a99 ("virtio-net: Add support for USO features") Signed-off-by: Fiona Ebner --- hw/core/machine.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index c7ceb11501..95051b80db 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -50,15 +50,15 @@ GlobalProperty hw_compat_8_1[] = { { "ramfb", "x-migrate", "off" }, { "vfio-pci-nohotplug", "x-ramfb-migrate", "off" }, { "igb", "x-pcie-flr-init", "off" }, +{ TYPE_VIRTIO_NET, "host_uso", "off"}, +{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, +{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, }; const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1); GlobalProperty hw_compat_8_0[] = { { "migration", "multifd-flush-after-each-section", "on"}, { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" }, -{ TYPE_VIRTIO_NET, "host_uso", "off"}, -{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, -{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, }; const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); -- 2.39.2
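The reasoning behind moving the flags can be sketched as follows: machine version N applies every hw_compat_M array with M >= N, so a property default belongs in the array named after the last machine version that should still see the old default. This toy model (names mirror the QEMU arrays, but the lookup logic is a simplification) shows that with the entries in hw_compat_8_1, machine 8.1 gets the USO properties turned off while machine 8.2 keeps the new defaults:

```python
# Simplified model of QEMU's cumulative compat-property arrays.
hw_compat = {
    "8.1": [("virtio-net", "host_uso", "off"),
            ("virtio-net", "guest_uso4", "off"),
            ("virtio-net", "guest_uso6", "off")],
    "8.0": [("migration", "multifd-flush-after-each-section", "on")],
}

def compat_props(machine_version):
    """Compat properties applied for a given machine version: all arrays
    named after that version or a newer one. (String comparison is fine
    for the two-component versions used here.)"""
    props = []
    for version in sorted(hw_compat, reverse=True):  # newest first
        if version >= machine_version:
            props.extend(hw_compat[version])
    return props

assert ("virtio-net", "host_uso", "off") in compat_props("8.1")
assert compat_props("8.2") == []  # 8.2 machines keep USO enabled
```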
Re: [PULL 04/17] virtio-net: Add support for USO features
Hi, Am 08.09.23 um 08:44 schrieb Jason Wang: > diff --git a/hw/core/machine.c b/hw/core/machine.c > index da699cf..230aab8 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -38,6 +38,7 @@ > #include "exec/confidential-guest-support.h" > #include "hw/virtio/virtio.h" > #include "hw/virtio/virtio-pci.h" > +#include "hw/virtio/virtio-net.h" > > GlobalProperty hw_compat_8_1[] = {}; > const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1); > @@ -45,6 +46,9 @@ const size_t hw_compat_8_1_len = > G_N_ELEMENTS(hw_compat_8_1); > GlobalProperty hw_compat_8_0[] = { > { "migration", "multifd-flush-after-each-section", "on"}, > { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" }, > +{ TYPE_VIRTIO_NET, "host_uso", "off"}, > +{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, > +{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, > }; > const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); > unfortunately, this broke backwards migration with machine version 8.1 from 8.2 and 9.0 binaries to a 8.1 binary: > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 > kvm: Failed to load virtio-net:virtio > kvm: error while loading state for instance 0x0 of device > ':00:12.0/virtio-net' > kvm: load of migration failed: Operation not permitted Since the series here only landed in 8.2, shouldn't these flags have been added to hw_compat_8_1[] instead? Attempting to fix it by moving the flags will break migration with machine version 8.1 between patched 9.0 and unpatched 9.0 however :( Is there anything that can be done or will it need to stay broken now? CC-ing the migration maintainers. Best Regards, Fiona
Re: [PATCH v4] hw/pflash: fix block write start
Am 16.05.24 um 10:46 schrieb Gerd Hoffmann: > Move the pflash_blk_write_start() call. We need the offset of the > first data write, not the offset for the setup (number-of-bytes) > write. Without this fix u-boot can do block writes to the first > flash block only. > > While being at it drop a leftover FIXME. > > Cc: qemu-sta...@nongnu.org > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2343 > Fixes: fcc79f2e0955 ("hw/pflash: implement update buffer for block writes") Just a minor thing I noticed: this is the commit in v8.1.5. The commit in v9.0.0 is 284a7ee2e2 ("hw/pflash: implement update buffer for block writes"). Best Regards, Fiona
Re: [PATCH v3 5/5] accel/tcg: Always call tcg_flush_jmp_cache() on reset
Hi, Am 03.05.24 um 14:34 schrieb Philippe Mathieu-Daudé: > In commit bb6cf6f016 ("accel/tcg: Factor tcg_cpu_reset_hold() out") > we unfortunately restricted the tcg_flush_jmp_cache() to system > emulation. Move it to the common tcg_exec_cpu_reset_hold() handler > so user emulation gets the jmp_cache initialized when threads > are created. > > Remove the NULL check in tcg_flush_jmp_cache() from commit 4e4fa6c12d > ("accel/tcg: Complete cpu initialization before registration") which > was a band-aid fix for incorrect commit bb6cf6f016. > AFAICT, commit 4e4fa6c12d was already present in v7.2.0, while commit bb6cf6f016 only later in v8.2.0. So is it really fine to remove the NULL check? Best Regards, Fiona
Re: [PATCH v4 0/3] Fix "virtio-gpu: fix scanout migration post-load"
Am 16.05.24 um 10:40 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > v2: > - use a manual version field test (instead of the more complex struct > variant) > > v3: > - introduce machine_check_version() > - drop the VMSD version, and use machine version field test > > v4: > - drop machine_check_version() approach > - property renamed to x-scanout-vmstate-version > > Marc-André Lureau (3): > migration: add "exists" info to load-state-field trace > migration: fix a typo > virtio-gpu: fix v2 migration > > include/hw/virtio/virtio-gpu.h | 1 + > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 24 > migration/vmstate.c| 7 --- > migration/trace-events | 2 +- > 5 files changed, 23 insertions(+), 12 deletions(-) > Reviewed-by: Fiona Ebner Tested-by: Fiona Ebner Tested the same things as for v1/v2, with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues Best Regards, Fiona
Re: [PATCH v2 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Am 13.05.24 um 15:21 schrieb Marc-André Lureau: > > Indeed, it needs: > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c > index 5de90bb62f..3a88eb5e3a 100644 > --- a/hw/display/virtio-gpu.c > +++ b/hw/display/virtio-gpu.c > @@ -1201,7 +1201,7 @@ static const VMStateDescription > vmstate_virtio_gpu_scanout = { > > static const VMStateDescription vmstate_virtio_gpu_scanouts = { > .name = "virtio-gpu-scanouts", > -.version_id = 1, > +.version_id = 2, > > Thanks! With that on top: Tested-by: Fiona Ebner Tested with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues Best Regards, Fiona
Re: [PATCH 1/2] copy-before-write: allow specifying minimum cluster size
Am 26.03.24 um 10:06 schrieb Markus Armbruster: >> @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, >> BdrvChild *target, >> >> GLOBAL_STATE_CODE(); >> >> -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); >> +if (min_cluster_size && !is_power_of_2(min_cluster_size)) { > > min_cluster_size is int64_t, is_power_of_2() takes uint64_t. Bad if > min_cluster_size is negative. Could this happen? > No, because it comes in as a uint32_t via the QAPI (the internal caller added by patch 2/2 from the backup code also gets the value via QAPI and there uint32_t is used too). ---snip--- >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 0a72c590a8..85c8f88f6e 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -4625,12 +4625,18 @@ >> # @on-cbw-error parameter will decide how this failure is handled. >> # Default 0. (Since 7.1) >> # >> +# @min-cluster-size: Minimum size of blocks used by copy-before-write >> +# operations. Has to be a power of 2. No effect if smaller than >> +# the maximum of the target's cluster size and 64 KiB. Default 0. >> +# (Since 9.0) >> +# >> # Since: 6.2 >> ## >> { 'struct': 'BlockdevOptionsCbw', >>'base': 'BlockdevOptionsGenericFormat', >>'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap', >> -'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } } >> +'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32', >> +'*min-cluster-size': 'uint32' } } > > Elsewhere in the schema, we use either 'int' or 'size' for cluster-size. > Why the difference? > The motivation was to disallow negative values up front and have it work with block_copy_calculate_cluster_size(), whose result is an int64_t. If I go with 'int', I'll have to add a check to disallow negative values. If I go with 'size', I'll have to add a check to disallow too large values. Which approach should I go with? Best Regards, Fiona
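The signedness concern Markus raises can be made concrete. In C, passing a negative int64_t to a function taking uint64_t converts the value modulo 2**64, and for INT64_MIN the result is 2**63, which actually is a power of two, so a negative value could slip through such a check if one could ever reach it. A Python sketch of the C conversion (QEMU's real is_power_of_2() lives in include/qemu/host-utils.h; this is a stand-in):

```python
def to_uint64(value):
    """C-style conversion of a signed 64-bit value to uint64_t."""
    return value % (1 << 64)

def is_power_of_2(value):
    # Same bit trick as the C helper, applied to the converted value.
    return value != 0 and (value & (value - 1)) == 0

INT64_MIN = -(1 << 63)
assert to_uint64(INT64_MIN) == 1 << 63
assert is_power_of_2(to_uint64(INT64_MIN))   # -2**63 masquerades as 2**63
assert not is_power_of_2(to_uint64(-3))      # most negatives do not
```

As Fiona notes, the value arrives via a uint32 QAPI type in this series, so a negative value cannot actually occur here.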
Re: [PATCH v2 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Hi, Am 13.05.24 um 09:19 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > v2: > - use a manual version field test (instead of the more complex struct > variant) > Unfortunately, when creating a snapshot with machine type pc-i440fx-9.0 and trying to load it afterwards (both times with patches on top of current master), it'll fail with: > qemu-system-x86_64: virtio-gpu-scanouts: incoming version_id 2 is too new for > local version_id 1 > qemu-system-x86_64: Missing section footer for :00:02.0/virtio-gpu > qemu-system-x86_64: Error -22 while loading VM state Is there a bump to virtio-gpu-scanouts' version_id missing? Best Regards, Fiona
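The quoted failure is the standard vmstate version gate: a loader rejects an incoming section whose version_id is newer than the local VMSD declares, which is why bumping the version on the save side without bumping .version_id on the load side breaks snapshot loading. A simplified sketch of that gate (the real logic is in migration/vmstate.c):

```python
def vmstate_load_accepts(local_version_id, incoming_version_id,
                         minimum_version_id=1):
    """Simplified model of the version check done when loading a
    vmstate section."""
    if incoming_version_id > local_version_id:
        # "incoming version_id N is too new for local version_id M"
        return False
    if incoming_version_id < minimum_version_id:
        return False
    return True

# A save side writing version 2 needs the load side to declare >= 2:
assert not vmstate_load_accepts(local_version_id=1, incoming_version_id=2)
assert vmstate_load_accepts(local_version_id=2, incoming_version_id=2)
```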
[PATCH v3 5/5] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target does not do COW and is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the job's granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for the special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- Changes in v3: * detect when the target does COW and do not error out in that case * treat ENOTSUP differently from other failures when querying the cluster size blockdev.c | 57 ++ tests/qemu-iotests/tests/mirror-bitmap | 6 +++ tests/qemu-iotests/tests/mirror-bitmap.out | 7 +++ 3 files changed, 70 insertions(+) diff --git a/blockdev.c b/blockdev.c index 4f72a72dc7..468974108e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp) blockdev_do_action(&action, errp); } +static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target, +BdrvDirtyBitmap *bitmap, +Error **errp) +{ +int ret; +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + +GLOBAL_STATE_CODE(); +GRAPH_RDLOCK_GUARD_MAINLOOP(); + +if (bdrv_backing_chain_next(target)) { +/* + * No need to worry about creating clusters with partial data when the + * target does COW.
+ */ +return 0; +} + +/* + * If there is no backing file on the target, we cannot rely on COW if our + * backup cluster size is smaller than the target cluster size. Even for + * targets with a backing file, try to avoid COW if possible. + */ +ret = bdrv_get_info(target, &bdi); +if (ret == -ENOTSUP) { +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to be small enough then. + */ +return 0; +} else if (ret < 0) { +error_setg_errno(errp, -ret, +"Couldn't determine the cluster size of the target image, " +"which has no backing file"); +return ret; +} + +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of the " + "target image's cluster size %u and the target image has " + "no backing file", + bitmap_granularity, bdi.cluster_size); +return -EINVAL; +} + +return 0; +} + + /* Parameter check and block job starting for drive mirroring. * Caller should hold @device and @target's aio context (must be the same).
**/ @@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) { +return; +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { return; } diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap index 37bbe0f241..e8cd482a19 100755 --- a/tests/qemu-iotests/tests/mirror-bitmap +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -584,6 +584,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity to non-COW target --\n") +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=GRANULARITY) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/mirror-bitmap.out b/tests/qemu-iotests/tests/mirror-bitmap.out index 5c8acc1d69..af605f3803 100644 --- a/tests/qem
[PATCH v3 0/5] mirror: allow specifying working bitmap
Changes from v2 (discussion here [2]):
* Cluster size caveats only apply to non-COW diff images, adapt the cluster size check and documentation accordingly.
* In the IO test, use backing files (rather than stand-alone diff images) in combination with copy-mode=write-blocking and larger cluster size for target images, to have a more realistic use-case and show that COW prevents ending up with clusters with partial data upon unaligned writes.
* Create a separate patch for replacing is_none_mode with sync_mode in the MirrorBlockJob struct.
* Disallow using a read-only bitmap (cannot be used as working bitmap).
* Require that the bitmap is enabled at the start of the job.
* Avoid IO test script potentially waiting on non-existent job when the blockdev-mirror QMP command fails.
* Fix pylint issues in IO test.
* Rename IO test from sync-bitmap-mirror to mirror-bitmap.
Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly).
* Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data.
Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously mirrored via some external mechanism (e.g. ZFS replication).
We are using the last one in production without any issues for about 4 years now.
In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/5 actually contains yet another use case, namely doing incremental mirrors to qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. Initially, the qcow2 diff image targets were stand-alone and that suffers from an issue when 'write-blocking' mode is used. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The test thus uses a more realistic approach where the qcow2 targets have backing images and a check is added in patch 5/5 for the cluster size for non-COW targets. However, with e.g. NBD, the cluster size cannot be queried and prohibiting all bitmap mirrors to NBD targets just to prevent the highly specific edge case seems not worth it, so the limitation is rather documented and the check ignores cases where querying the target image's cluster size returns -ENOTSUP. 
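The "masking" failure described above can be reproduced with a toy model: a diff target with large clusters receives an unaligned write, which allocates a whole zero-filled cluster (no COW from the source), and rebasing then lets that allocated cluster hide the base image's data. All names and sizes below are illustrative, not QEMU internals:

```python
CLUSTER = 8  # bytes per target cluster, tiny for the demo

def rebase(diff, base):
    """Allocated diff clusters take precedence over base data,
    as in a qcow2 rebase."""
    out = bytearray(base)
    for idx, data in diff.items():
        out[idx * CLUSTER:(idx + 1) * CLUSTER] = data
    return bytes(out)

base = b"A" * 16  # two clusters of real data in the base image

# Unaligned 2-byte write at offset 4: the non-COW diff target allocates
# all of cluster 0, zero-filled outside the written range.
cluster = bytearray(CLUSTER)
cluster[4:6] = b"BB"
diff = {0: bytes(cluster)}

merged = rebase(diff, base)
# Bytes 0..3 of cluster 0 now read back as zeroes instead of "A":
# the partially-filled cluster masked the base image's data.
assert merged[:4] == b"\x00\x00\x00\x00" and merged[4:6] == b"BB"
assert merged[8:] == b"A" * 8  # the untouched cluster survives
```

With a backing file on the target, COW would have filled the rest of cluster 0 from the source before the write, which is what the IO test's backing-file setup demonstrates.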
[0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
[1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/
[2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/

Fabian Grünbichler (1):
  iotests: add test for bitmap mirror

Fiona Ebner (3):
  qapi/block-core: avoid the re-use of MirrorSyncMode for backup
  block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
  blockdev: mirror: check for target's cluster size when using bitmap

John Snow (1):
  mirror: allow specifying working bitmap

 block/backup.c                             |   18 +-
 block/mirror.c                             |  101 +-
 block/monitor/block-hmp-cmds.c             |    2 +-
 block/replication.c                        |    2 +-
 blockdev.c                                 |  127 +-
 include/block/block_int-global-state.h     |    7 +-
 qapi/block-core.json                       |   64 +-
 tests/qemu-iotests/tests/mirror-bitmap     |  603
 tests/qemu-iotests/tests/mirror-bitmap.out | 3198
 tests/unit/test-block-iothread.c           |    2 +-
 10 files changed, 4055 insertions(+), 69 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

-- 
2.39.2
[PATCH v3 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). A consequence is also tighter introspection: query-qmp-schema no longer reports drive-mirror and blockdev-mirror accepting @sync values they actually reject. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster Reviewed-by: Vladimir Sementsov-Ogievskiy --- Changes in v3: * add comment about introspection to commit message as suggested by Markus block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == _job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == 
MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 1bdce3b657..c0597039a5 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - 
MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; dif
[PATCH v3 4/5] iotests: add test for bitmap mirror
From: Fabian Grünbichler

heavily based on/practically forked off iotest 257 for bitmap backups, but:

- no writes to filter node 'mirror-top' between completion and finalization,
  as those seem to deadlock?
- extra set of reference/test mirrors to verify that writes in parallel with
  active mirror work

Intentionally keeping copyright and ownership of original test case to honor
provenance.

The test was originally adapted by Fabian from 257, but has seen rather big
changes, because the interface for mirror with bitmap was changed, i.e.
there is no @bitmap-mode parameter anymore and the bitmap is used as the
working bitmap, and the test was changed to use backing images and
@sync-mode=write-blocking.

Signed-off-by: Fabian Grünbichler
Signed-off-by: Thomas Lamprecht
[FE: rebase for 9.1
     adapt to changes to mirror bitmap interface
     rename test from '384' to 'mirror-bitmap'
     use backing files, copy-mode=write-blocking, larger cluster size]
Signed-off-by: Fiona Ebner
---
Changes in v3:
* avoid script potentially waiting on non-existent job when the
  blockdev-mirror QMP command fails by asserting that there is no error when
  none is expected.
* fix pylint issues
* rename test from sync-bitmap-mirror to mirror-bitmap
* use backing files (rather than stand-alone diff images) in combination
  with copy-mode=write-blocking and larger cluster size for target images,
  to have a more realistic use case and show that COW prevents ending up
  with clusters containing partial data upon unaligned writes

 tests/qemu-iotests/tests/mirror-bitmap     |  597
 tests/qemu-iotests/tests/mirror-bitmap.out | 3191
 2 files changed, 3788 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap
new file mode 100755
index 00..37bbe0f241
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-bitmap
@@ -0,0 +1,597 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=js...@redhat.com

+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+IMAGE_CLUSTER_SIZE = 128 * 1024
+
+
+class Pattern:
+    def __init__(self, byte, offset, size=GRANULARITY):
+        self.byte = byte
+        self.offset = offset
+        self.size = size
+
+    def bits(self, granularity):
+        lower = self.offset // granularity
+        upper = (self.offset + self.size - 1) // granularity
+        return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+    """Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+    def __init__(self, patterns):
+        self.patterns = patterns
+
+    def bits(self, granularity):
+        """Calculate the unique bits dirtied by this pattern grouping"""
+        res = set()
+        for pattern in self.patterns:
+            res |= pattern.bits(granularity)
+        return res
+
+
+GROUPS = [
+    PatternGroup([
+        # Batch 0: 4 clusters
+        Pattern('0x49', 0x0000000),
+        Pattern('0x6c', 0x0100000),   # 1M
+        Pattern('0x6f', 0x2000000),   # 32M
+        Pattern('0x76', 0x3ff0000)]), # 64M - 64K
+    PatternGroup([
+        # Batch 1: 6 clusters (3 new)
+        Pattern('0x65', 0x0000000),   # Full overwrite
+        Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+        Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+        Pattern('0x69', 0x3fe0000)]), # Adjacent-left (64M - 128K)
+    PatternGroup([
+        # Batch 2: 7 clusters (3 new)
+        Pattern('0x74', 0x0010000),   # Adjacent-right
+        Pattern('0x69', 0x00e8000),   # Partial-left (1M-96K)
+        Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+        Pattern('0x67', 0x3fe0000,
+                2*GRANULARITY)]),     # Overwrite [(64M-128K)-64M)
+    PatternGroup([
+        # Batch 3: 8 clusters (5 new)
+        # Carefully chosen such that nothing re-dirties the one cluster
+        # that copies out successfully before failure in Group #1.
+        Pattern('0xaa', 0x0010000,
+                3*GRAN
[PATCH v3 3/5] mirror: allow specifying working bitmap
From: John Snow

for the mirror job. The bitmap's granularity is used as the job's
granularity.

The new @bitmap parameter is marked unstable in the QAPI and can currently
only be used for @sync=full mode.

Clusters initially dirty in the bitmap as well as new writes are copied to
the target.

Using the block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers
can simulate the three kinds of @BitmapSyncMode (which is used by backup):
1. always: default, just pass bitmap as working bitmap.
2. never: copy bitmap and pass copy to the mirror job.
3. on-success: copy bitmap and pass copy to the mirror job and if
   successful, merge bitmap into original afterwards.

When the target image is a non-COW "diff image", i.e. one that was not used
as the target of a previous mirror and the target image's cluster size is
larger than the bitmap's granularity, or when @copy-mode=write-blocking is
used, there is a pitfall, because the cluster in the target image will be
allocated, but not contain all the data corresponding to the same region in
the source image.

An idea to avoid the limitation would be to mark clusters which are affected
by unaligned writes and are not allocated in the target image dirty, so they
would be copied fully later. However, for migration, the invariant that an
actively synced mirror stays actively synced (unless an error happens) is
useful, because without that invariant, migration might inactivate block
devices while the mirror job still has work to do and run into an assertion
failure [0].

Another approach would be to read the missing data from the source upon
unaligned writes to be able to write the full target cluster instead. But
certain targets like NBD do not allow querying the cluster size. To avoid
limiting/breaking the use case of syncing to an existing target, which is
arguably more common than the diff image use case, document the limitation
in QAPI.
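The three emulations listed above can be sketched as a small planning helper. This is an illustration only: the "-copy" bitmap name and the tuple-based command representation are invented, and the direction of the final merge follows the enumeration in the commit message literally.

```python
# Sketch: emulate backup's three BitmapSyncMode behaviors on top of the
# working-bitmap mirror interface, using bitmap-add/merge as building blocks.

def plan(mode, bitmap):
    """Return (setup steps, working bitmap, post-success steps)."""
    if mode == "always":
        # default: just pass the bitmap itself as the working bitmap
        return [], bitmap, []
    copy = bitmap + "-copy"  # hypothetical name for the temporary bitmap
    setup = [("block-dirty-bitmap-add", copy),
             ("block-dirty-bitmap-merge", bitmap, copy)]  # copy <- bitmap
    post = []
    if mode == "on-success":
        # if the job succeeded, merge the working copy back into the original
        post = [("block-dirty-bitmap-merge", copy, bitmap)]
    return setup, copy, post

for mode in ("always", "never", "on-success"):
    print(mode, plan(mode, "bitmap0"))
```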
This patch was originally based on one by Ma Haocong, but it has since been
modified pretty heavily, first by John and then again by Fiona.

[0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/

Suggested-by: Ma Haocong
Signed-off-by: Ma Haocong
Signed-off-by: John Snow
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler
Signed-off-by: Thomas Lamprecht
[FE: rebase for 9.1
     get rid of bitmap mode parameter
     use caller-provided bitmap as working bitmap
     turn bitmap parameter experimental]
Signed-off-by: Fiona Ebner
---
Changes in v3:
* remove duplicate "use" in QAPI description
* clarify that the cluster size caveats only apply to non-COW diff images
* split changing is_none_mode to sync_mode in the job struct into a separate
  patch
* use the shorter sync_mode != none rather than sync_mode == top || full in
  an if condition
* also disallow read-only bitmaps (cannot be used as working bitmap)
* require that the bitmap is enabled at the start of the job

 block/mirror.c                         | 80 +-
 blockdev.c                             | 44 +++---
 include/block/block_int-global-state.h |  5 +-
 qapi/block-core.json                   | 37 +++-
 tests/unit/test-block-iothread.c       |  2 +-
 5 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ca23d6ef65..d3d0698116 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,11 @@ typedef struct MirrorBlockJob {
     size_t buf_size;
     int64_t bdev_length;
     unsigned long *cow_bitmap;
+    /*
+     * Whether the bitmap is created locally or provided by the caller (for
+     * incremental sync).
+ */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -820,6 +829,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (!s->dirty_bitmap_is_local && s->dirty_bitmap) { +bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (s->sync
[PATCH v3 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
It is more flexible and is done in preparation to support specifying a working bitmap for mirror jobs. In particular, this makes it possible to assert that @sync_mode=full when a bitmap is used. That assertion is just to be sure, of course the mirror QMP commands will be made to fail earlier with a clean error. Signed-off-by: Fiona Ebner --- New in v3. block/mirror.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c0597039a5..ca23d6ef65 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? 
src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(>common.job)) { goto immediate_exit; @@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name, bool is_mirror, MirrorCopyMode copy_mode, Error **errp) @@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job( s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; -s->is_none_mode = is_none_mode; +s->sync_mode = sync_mode; s->backing_mode = backing_mode; s->zero_target = zero_target; qatomic_set(>copy_mode, copy_mode); @@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState *bs, bool unmap, const char *filter_node_name, MirrorCopyMode copy_mode, Error **errp) { -bool is_none_mode; BlockDriverState *base; GLOBAL_STATE_CODE(); bdrv_graph_rdlock_main_loop(); -is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? 
bdrv_backing_chain_next(bs) : NULL; bdrv_graph_rdunlock_main_loop(); mirror_start_job(job_id, bs, creation_flags, target, replaces, speed, granularity, buf_size, backing_mode, zero_target, on_source_error, on_target_error, unmap, NULL, NULL, - _job_driver, is_none_mode, base, false, + _job_driver, mode, base, false, filter_node_name, true, copy_mode, errp); } @@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs, job_id, bs, creation_flags, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN, false, on_error, on_error, true, cb, opaque, - _active_job_driver, false, base, auto_complete, - filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND, - errp); + _active_job_driver, MIRROR_SYNC_MODE_FULL, base, + auto_complete, filter_node_name, false, + MIRROR_COPY_MODE_BACKGROUND, errp); if (!job) { goto error_restore_flags; } -- 2.39.2
Re: [PATCH v2 2/4] mirror: allow specifying working bitmap
Am 07.05.24 um 14:15 schrieb Fiona Ebner: > Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy: >> On 07.03.24 16:47, Fiona Ebner wrote: >>> +# @bitmap: The name of a bitmap to use as a working bitmap for >>> +# sync=full mode. This argument must be not be present for other >>> +# sync modes and not at the same time as @granularity. The >>> +# bitmap's granularity is used as the job's granularity. When >>> +# the target is a diff image, i.e. one that should only contain >>> +# the delta and was not synced to previously, the target's >>> +# cluster size must not be larger than the bitmap's granularity. >> >> Could we check this? Like in block_copy_calculate_cluster_size(), we can >> check if target does COW, and if not, we can check that we are safe with >> granularity. >> > > The issue here is (in particular) present when the target does COW, i.e. > in qcow2 diff images, allocated clusters which end up with partial data, > when we don't have the right cluster size. Patch 4/4 adds the check for > the target's cluster size. > Sorry, no. What I said is wrong. It's just that the test does something very pathological and does not even use COW/backing files. All the mirror targets are separate diff images there. So yes, we can do the same as block_copy_calculate_cluster_size() and the issue only appears in the same edge cases as for backup where we can error out early. This also applies to copy-mode=write-blocking AFAICT. >>> +# For a diff image target, using copy-mode=write-blocking should >>> +# not be used, because unaligned writes will lead to allocated >>> +# clusters with partial data in the target image! >> >> Could this be checked? >> > > I don't think so. How should we know if the target already contains data > from a previous full sync or not? 
> > Those caveats when using diff images are unfortunate, and users should > be warned about them of course, but the main/expected use case for the > feature is to sync to the same target multiple times, so I'd hope the > cluster size check in patch 4/4 and mentioning the edge cases in the > documentation is enough here. >
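The arithmetic behind the partial-data pitfall discussed in this thread can be sketched as follows. This is an illustration, not QEMU code; the numbers match the IO test (64 KiB bitmap granularity, 128 KiB target image clusters), and the alignment rule models a non-COW target that must allocate whole clusters.

```python
# Sketch: an unaligned (relative to the target cluster size) write into a
# diff image allocates a full target cluster that only partially holds data,
# "hiding" the corresponding part of the base image after a rebase.

GRANULARITY = 64 * 1024          # bitmap/job granularity
TARGET_CLUSTER = 128 * 1024      # target image cluster size

def allocated_range(write_offset, write_size, cluster=TARGET_CLUSTER):
    """Byte range the target must allocate for a write (cluster-aligned)."""
    start = (write_offset // cluster) * cluster
    end = -(-(write_offset + write_size) // cluster) * cluster  # round up
    return start, end

# A 64 KiB write at offset 64 KiB touches only half of target cluster 0,
# but the whole 128 KiB cluster becomes allocated:
start, end = allocated_range(64 * 1024, 64 * 1024)
stale = (end - start) - 64 * 1024  # bytes without data from the source
print(start, end, stale)
```

With COW/backing files (as the v3 test uses), the allocation would instead pull the missing bytes from the backing image, which is why the check only needs to cover non-COW targets.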
Re: [PATCH 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Am 07.05.24 um 13:19 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > To resolve this, we need to propagate the `vmstate` `version_id` through the > nested structures. Additionally, we should tie specific machine version to a > corresponding `version_id` to maintain migration compatibility. > > `VMS_VSTRUCT` allows specifying the appropriate version of the nested > structure > to use. > > Marc-André Lureau (4): > migration: add "exists" info to load-state-field trace > include/migration: add VMSTATE_VSTRUCT_TEST_VARRAY_UINT32 > virtio-gpu: use a VMState variant for the scanout field > virtio-gpu: add x-vmstate-version > > include/hw/virtio/virtio-gpu.h | 1 + > include/migration/vmstate.h| 12 > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 28 +--- > migration/vmstate.c| 5 +++-- > migration/trace-events | 2 +- > 6 files changed, 39 insertions(+), 10 deletions(-) > Tested-by: Fiona Ebner Thank you for the fix! Tested with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues :) Best Regards, Fiona
Re: [PATCH v2 2/4] mirror: allow specifying working bitmap
Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy: > On 07.03.24 16:47, Fiona Ebner wrote: >> diff --git a/block/mirror.c b/block/mirror.c >> index 1609354db3..5c9a00b574 100644 >> --- a/block/mirror.c >> +++ b/block/mirror.c >> @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { >> BlockDriverState *to_replace; >> /* Used to block operations on the drive-mirror-replace target */ >> Error *replace_blocker; >> - bool is_none_mode; >> + MirrorSyncMode sync_mode; > > Could you please split this change to separate preparation patch? > Will do. >> + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, >> errp)) { > > Why allow read-only bitmaps? > Good catch! This is a left-over from an earlier version. Now that the bitmap shall be used as the working bitmap, it cannot be read-only. I'll change it to BDRV_BITMAP_DEFAULT in v3 of the series. >> +# @bitmap: The name of a bitmap to use as a working bitmap for >> +# sync=full mode. This argument must be not be present for other >> +# sync modes and not at the same time as @granularity. The >> +# bitmap's granularity is used as the job's granularity. When >> +# the target is a diff image, i.e. one that should only contain >> +# the delta and was not synced to previously, the target's >> +# cluster size must not be larger than the bitmap's granularity. > > Could we check this? Like in block_copy_calculate_cluster_size(), we can > check if target does COW, and if not, we can check that we are safe with > granularity. > The issue here is (in particular) present when the target does COW, i.e. in qcow2 diff images, allocated clusters which end up with partial data, when we don't have the right cluster size. Patch 4/4 adds the check for the target's cluster size. >> +# For a diff image target, using copy-mode=write-blocking should >> +# not be used, because unaligned writes will lead to allocated >> +# clusters with partial data in the target image! > > Could this be checked? > I don't think so. 
How should we know if the target already contains data from a previous full sync or not? Those caveats when using diff images are unfortunate, and users should be warned about them of course, but the main/expected use case for the feature is to sync to the same target multiple times, so I'd hope the cluster size check in patch 4/4 and mentioning the edge cases in the documentation is enough here. >> The bitmap >> +# will be enabled after the job finishes. (Since 9.0) > > Hmm. That looks correct. At least for the case, when bitmap is enabled > at that start of job. Suggest to require this. > It's true for any provided bitmap: it will be disabled when the mirror job starts, because we manually set it in bdrv_mirror_top_do_write() and then in mirror_exit_common(), the bitmap will be enabled. Okay, I'll require that it is enabled at the beginning. >> +# >> # @granularity: granularity of the dirty bitmap, default is 64K if the >> # image format doesn't have clusters, 4K if the clusters are >> # smaller than that, else the cluster size. Must be a power of 2 >> @@ -2548,6 +2578,10 @@ >> # disappear from the query list without user intervention. >> # Defaults to true. (Since 3.1) >> # >> +# Features: >> +# >> +# @unstable: Member @bitmap is experimental. >> +# >> # Since: 2.6 > > Y_MODE_BACKGROUND, >> _abort); > > [..] > > Generally looks good to me. > Thank you for the review!
Re: [PULL 5/5] virtio-gpu: fix scanout migration post-load
Am 12.03.24 um 15:02 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > The current post-loading code for scanout has a FIXME: it doesn't take > the resource region/rect into account. But there is more, when adding > blob migration support in commit f66767f75c9, I didn't realize that blob > resources could be used for scanouts. This situationn leads to a crash > during post-load, as they don't have an associated res->image. > > virtio_gpu_do_set_scanout() handle all cases, but requires the > associated virtio_gpu_framebuffer, which is currently not saved during > migration. > > Add a v2 of "virtio-gpu-one-scanout" with the framebuffer fields, so we > can restore blob scanouts, as well as fixing the existing FIXME. > > Signed-off-by: Marc-André Lureau > Reviewed-by: Sebastian Ott Hi, unfortunately, this broke migration from pre-9.0 to 9.0: > vmstate_load_state_field virtio-gpu:virtio-gpu > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.enable > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.conf.max_outputs > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.scanout > vmstate_load_state_field virtio-gpu-one-scanout:resource_id > vmstate_load_state_field virtio-gpu-one-scanout:width > vmstate_load_state_field virtio-gpu-one-scanout:height > vmstate_load_state_field virtio-gpu-one-scanout:x > vmstate_load_state_field virtio-gpu-one-scanout:y > vmstate_load_state_field virtio-gpu-one-scanout:cursor.resource_id > vmstate_load_state_field virtio-gpu-one-scanout:cursor.hot_x > vmstate_load_state_field virtio-gpu-one-scanout:cursor.hot_y > vmstate_load_state_field virtio-gpu-one-scanout:cursor.pos.x > vmstate_load_state_field virtio-gpu-one-scanout:cursor.pos.y > vmstate_load_state_field virtio-gpu-one-scanout:fb.format > vmstate_load_state_field virtio-gpu-one-scanout:fb.bytes_pp > vmstate_load_state_field virtio-gpu-one-scanout:fb.width > vmstate_load_state_field virtio-gpu-one-scanout:fb.height > vmstate_load_state_field 
virtio-gpu-one-scanout:fb.stride > vmstate_load_state_field virtio-gpu-one-scanout:fb.offset > qemu-system-x86_64: Missing section footer for :00:02.0/virtio-gpu > qemu-system-x86_64: Error -22 while loading VM state It wrongly tries to load the fb fields even though they should be guarded by version 2. Looking at it with GDB, in vmstate_load_state(), when we come to field->name == parent_obj.scanout, the > } else if (field->flags & VMS_STRUCT) { > ret = vmstate_load_state(f, field->vmsd, curr_elem, > field->vmsd->version_id); branch will be taken and suddenly we'll have a call to vmstate_load_state() for vmsd==vmstate_virtio_gpu_scanout with version_id==2 rather than version_id==1, because that is field->vmsd->version_id (i.e. the .version_id in VMStateDescription vmstate_virtio_gpu_scanout). Would it have been necessary to version the VMStateDescription vmstate_virtio_gpu_scanouts too using VMS_VSTRUCT (or am I misinterpreting the use case for that)? Best Regards, Fiona
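The version-propagation behavior described above can be modeled with a toy loader. This is purely illustrative: the class names and field lists are invented and hugely simplified compared to QEMU's vmstate code; it only demonstrates why the nested structure's own version_id wins over the version the stream was saved with.

```python
# Toy model of vmstate_load_state() version handling (not QEMU code).

class VMSD:
    def __init__(self, name, version_id, fields):
        self.name, self.version_id, self.fields = name, version_id, fields

class Field:
    def __init__(self, name, version_id=1, vmsd=None):
        self.name, self.version_id, self.vmsd = name, version_id, vmsd

def load_state(vmsd, version_id, loaded):
    """Walk fields, honoring each field's version guard."""
    for f in vmsd.fields:
        if f.version_id > version_id:
            continue  # field is newer than the incoming stream: skip it
        if f.vmsd is not None:
            # VMS_STRUCT branch: recurse with the *nested* vmsd's own
            # version_id, not the version the caller was invoked with
            load_state(f.vmsd, f.vmsd.version_id, loaded)
        else:
            loaded.append(f.name)

scanout = VMSD("virtio-gpu-one-scanout", 2, [
    Field("resource_id"),
    Field("width"),
    Field("fb.format", version_id=2),   # field added in v2
])
scanouts = VMSD("virtio-gpu-scanouts", 1, [Field("scanout", vmsd=scanout)])

loaded = []
load_state(scanouts, 1, loaded)  # pretend the incoming stream is version 1
print(loaded)                    # fb.format is loaded anyway
```

Loading a pre-9.0 (version 1) stream still consumes the fb fields, desynchronizing the stream, which matches the "Missing section footer" failure above and is what a versioned VMS_VSTRUCT-style reference is meant to address.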
Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
Am 29.04.24 um 16:36 schrieb Philippe Mathieu-Daudé: > Hi Fiona, > > On 29/4/24 16:19, Fiona Ebner wrote: > > Not everybody uses an email client that shows the patch content just > after the subject (your first lines wasn't making sense at first). > > Simply duplicating the subject helps to understand: > > Use uint64_t for timeout in nanoseconds ... > Oh, sorry. I'll try to remember that for the future. Should I re-send as a v2? Best Regards, Fiona
[PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
rather than the uint32_t for which the maximum is slightly more than 4
seconds and larger values would overflow. The QAPI interface allows
specifying the number of seconds, so only values 0 to 4 are safe right now;
other values lead to a much lower timeout than a user expects.

The block_copy() call where this is used already takes a uint64_t for the
timeout, so no change is required there.

Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
Reported-by: Friedrich Weber
Signed-off-by: Fiona Ebner
---
 block/copy-before-write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 8aba27a71d..026fa9840f 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -43,7 +43,7 @@ typedef struct BDRVCopyBeforeWriteState {
     BlockCopyState *bcs;
     BdrvChild *target;
     OnCbwError on_cbw_error;
-    uint32_t cbw_timeout_ns;
+    uint64_t cbw_timeout_ns;

 /*
  * @lock: protects access to @access_bitmap, @done_bitmap and
-- 
2.39.2
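The truncation this patch fixes can be demonstrated with a quick sketch (illustrative Python, not QEMU code; it just models what storing the nanosecond value in a 32-bit unsigned field does):

```python
# Sketch of the cbw-timeout overflow: the timeout is given in seconds via
# QAPI, converted to nanoseconds, and was stored in a uint32_t. Anything
# above UINT32_MAX ns (~4.29 s) wraps around to a much smaller timeout.

UINT32_MAX = 2**32 - 1
NS_PER_SEC = 1_000_000_000

def stored_timeout_ns(seconds, width_bits=32):
    """Nanosecond value after truncation to an unsigned integer width."""
    return (seconds * NS_PER_SEC) % (2**width_bits)

print(UINT32_MAX / NS_PER_SEC)   # ~4.29 -- largest safe timeout in seconds
print(stored_timeout_ns(5))      # 5 s wraps to ~0.7 s in a uint32_t
print(stored_timeout_ns(5, 64))  # with uint64_t the value survives
```

This is why only cbw-timeout values of 0 to 4 seconds behaved as expected before the fix.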
Re: [PATCH] Makefile: preserve --jobserver-auth argument when calling ninja
Am 02.04.24 um 10:17 schrieb Martin Hundebøll: > Qemu wraps its call to ninja in a Makefile. Since ninja, as opposed to > make, utilizes all CPU cores by default, the qemu Makefile translates > the absense of a `-jN` argument into `-j1`. This breaks jobserver > functionality, so update the -jN mangling to take the --jobserver-auth > argument into considerationa too. > > Signed-off-by: Martin Hundebøll > --- > Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Makefile b/Makefile > index 8f36990335..183756018f 100644 > --- a/Makefile > +++ b/Makefile > @@ -142,7 +142,7 @@ MAKE.k = $(findstring k,$(firstword $(filter-out > --%,$(MAKEFLAGS > MAKE.q = $(findstring q,$(firstword $(filter-out --%,$(MAKEFLAGS > MAKE.nq = $(if $(word 2, $(MAKE.n) $(MAKE.q)),nq) > NINJAFLAGS = $(if $V,-v) $(if $(MAKE.n), -n) $(if $(MAKE.k), -k0) \ > -$(filter-out -j, $(lastword -j1 $(filter -l% -j%, $(MAKEFLAGS \ > +$(or $(filter -l% -j%, $(MAKEFLAGS)), $(if $(filter > --jobserver-auth=%, $(MAKEFLAGS)),, -j1)) \ > -d keepdepfile > ninja-cmd-goals = $(or $(MAKECMDGOALS), all) > ninja-cmd-goals += $(foreach g, $(MAKECMDGOALS), $(.ninja-goals.$g)) Hi, unfortunately, this patch breaks build when specifying just '-j' as a make flag (i.e. without a number), because it will now end up being passed to ninja: > $ make -j > changing dir to build for make ""... > make[1]: Entering directory '/home/febner/repos/qemu/build' > ninja: fatal: invalid -j parameter Best Regards, Fiona
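A rough Python model of the Makefile expression being discussed (names invented; this just mimics the `$(filter)`/`$(or)` logic to show which flags reach ninja, including the bare `-j` case from the bug report):

```python
# Sketch of the NINJAFLAGS job-count mangling after the patch:
#   $(or $(filter -l% -j%, $(MAKEFLAGS)),
#        $(if $(filter --jobserver-auth=%, $(MAKEFLAGS)),, -j1))

def ninja_j_flags(makeflags):
    """Return the -j/-l flags that would be forwarded to ninja."""
    jobs = [f for f in makeflags if f.startswith(("-j", "-l"))]
    jobserver = any(f.startswith("--jobserver-auth=") for f in makeflags)
    # explicit -j/-l flags win; otherwise -j1 unless a jobserver is active
    return jobs or ([] if jobserver else ["-j1"])

print(ninja_j_flags(["-j8"]))                        # explicit job count
print(ninja_j_flags([]))                             # serial fallback
print(ninja_j_flags(["--jobserver-auth=fifo:/x"]))   # defer to jobserver
print(ninja_j_flags(["-j"]))                         # bare -j leaks through
```

The last case is the regression Fiona reports: a bare `-j` (unlimited jobs) matches the `-j%` filter and is passed verbatim to ninja, which rejects it.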
Re: [PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()
Am 27.03.24 um 20:27 schrieb Kevin Wolf: > Coverity complains that the check introduced in commit 3f934817 suggests > that qiov could be NULL and we dereference it before reaching the check. > In fact, all of the callers pass a non-NULL pointer, so just remove the > misleading check. > > Resolves: Coverity CID 1542668 > Signed-off-by: Kevin Wolf Reviewed-by: Fiona Ebner Thank you for the fix, Fiona
Question about block graph lock limitation with generated co-wrappers
Hi, we have a custom block driver downstream, which currently calls bdrv_get_info() (for its file child) in the bdrv_refresh_limits() callback. However, with graph locking, this doesn't work anymore. AFAICT, the reason is the following: The block driver has a backing file option. During initialization, in bdrv_set_backing_hd(), the graph lock is acquired exclusively. Then the bdrv_refresh_limits() callback is invoked. Now bdrv_get_info() is called, which is a generated co-wrapper. The bdrv_co_get_info_entry() function tries to acquire the graph lock for reading, sees that has_writer is true and so the coroutine will be put to wait, leading to a deadlock. For my specific case, I can move the bdrv_get_info() call to bdrv_open() as a workaround. But I wanted to ask if there is a way to make generated co-wrappers inside an exclusively locked section work? And if not, could we introduce/extend the annotations, so the compiler can catch this kind of issue, i.e. calling a generated co-wrapper while in an exclusively locked section? Best Regards, Fiona
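The deadlock described above can be modeled with a minimal sketch (plain Python, not QEMU's graph-lock implementation; the has_writer/reader-count fields only mirror the description in the text):

```python
# Toy model of the block graph lock: a writer holds the lock exclusively
# (bdrv_set_backing_hd()), then invokes a callback that runs a generated
# co-wrapper, which tries to take the lock for reading. Because has_writer
# is set, the reader coroutine would park itself and wait -- but the waiter
# it is waiting for is the very context that holds the write lock.

class GraphLock:
    def __init__(self):
        self.has_writer = False
        self.readers = 0

    def wrlock(self):
        # exclusive acquisition; assumes no other readers/writers
        assert not self.has_writer and self.readers == 0
        self.has_writer = True

    def try_rdlock(self):
        # mirrors the co-wrapper's read acquisition: if has_writer is
        # set, the coroutine is put to wait instead of proceeding
        if self.has_writer:
            return False
        self.readers += 1
        return True

lock = GraphLock()
lock.wrlock()                        # bdrv_set_backing_hd() takes wrlock
can_read = lock.try_rdlock()         # co-wrapper inside the callback
print(can_read)                      # False -> coroutine waits forever
```

Since the writer only releases the lock after the callback returns, and the callback polls for the reader coroutine, neither side can make progress, which matches the observed hang.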
[PATCH v3 0/4] fix two edge cases related to stream block jobs
Changes in v3: * Also deal with edge case in bdrv_next_cleanup(). Haven't run into an actual issue there, but at least the caller in migration/block.c uses bdrv_nb_sectors() which, while not a coroutine wrapper itself (it's written manually), may call bdrv_refresh_total_sectors(), which is a generated coroutine wrapper, so AFAIU, the block graph can change during that call. And even without that, it's just better to be more consistent with bdrv_next(). Changes in v2: * Ran into another issue while writing the IO test Stefan wanted to have (good call :)), so include a fix for that and add the test. I didn't notice during manual testing, because I hadn't used a scripted QMP 'quit', so there was no race. Fiona Ebner (3): block-backend: fix edge case in bdrv_next() where BDS associated to BB changes block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes iotests: add test for stream job with an unaligned prefetch read Stefan Reiter (1): block/io: accept NULL qiov in bdrv_pad_request block/block-backend.c | 18 ++-- block/io.c| 31 --- .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 4 files changed, 117 insertions(+), 23 deletions(-) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out -- 2.39.2
[PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
Same rationale as for commit "block-backend: fix edge case in bdrv_next() where BDS associated to BB changes". The block graph might change between the bdrv_next() call and the bdrv_next_cleanup() call, so it could be that the associated BDS is not the same that was referenced previously anymore. Instead, rely on bdrv_next() to set it->bs to the BDS it referenced and unreference that one in any case. Signed-off-by: Fiona Ebner --- New in v3. block/block-backend.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 28af1eb17a..db6f9b92a3 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); -if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { -if (it->blk) { -bdrv_unref(blk_bs(it->blk)); -blk_unref(it->blk); -} -} else { -bdrv_unref(it->bs); +bdrv_unref(it->bs); + +if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) { +blk_unref(it->blk); } bdrv_next_reset(it); -- 2.39.2
[PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
The old_bs variable in bdrv_next() is currently determined by looking at the old block backend. However, if the block graph changes before the next bdrv_next() call, it might be that the associated BDS is not the same that was referenced previously. In that case, the wrong BDS is unreferenced, leading to an assertion failure later: > bdrv_unref: Assertion `bs->refcnt > 0' failed. In particular, this can happen in the context of bdrv_flush_all(), when polling for bdrv_co_flush() in the generated co-wrapper leads to a graph change (for example with a stream block job [0]). A racy reproducer: > #!/bin/bash > rm -f /tmp/backing.qcow2 > rm -f /tmp/top.qcow2 > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node0" } } > {"execute": "quit"} > EOF [0]: > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...) > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., > errp=...) > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., > detach_subchain=..., errp=...) > #3 bdrv_drop_filter (bs=..., errp=...) > #4 bdrv_cor_filter_drop (cor_filter_bs=...) > #5 stream_prepare (job=...) > #6 job_prepare_locked (job=...) > #7 job_txn_apply_locked (fn=..., job=...) > #8 job_do_finalize_locked (job=...) > #9 job_exit (opaque=...) > #10 aio_bh_poll (ctx=...) > #11 aio_poll (ctx=..., blocking=...) > #12 bdrv_poll_co (s=...) > #13 bdrv_flush (bs=...) > #14 bdrv_flush_all () > #15 do_vm_stop (state=..., send_stop=...) > #16 vm_shutdown () Signed-off-by: Fiona Ebner --- No changes in v3. New in v2. 
block/block-backend.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 9c4de79e6b..28af1eb17a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); +old_bs = it->bs; + /* First, return all root nodes of BlockBackends. In order to avoid * returning a BDS twice when multiple BBs refer to it, we only return it * if the BB is the first one in the parent list of the BDS. */ if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { BlockBackend *old_blk = it->blk; -old_bs = old_blk ? blk_bs(old_blk) : NULL; - do { it->blk = blk_all_next(it->blk); bs = it->blk ? blk_bs(it->blk) : NULL; @@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) if (bs) { bdrv_ref(bs); bdrv_unref(old_bs); +it->bs = bs; return bs; } it->phase = BDRV_NEXT_MONITOR_OWNED; -} else { -old_bs = it->bs; } /* Then return the monitor-owned BDSes without a BB attached. Ignore all -- 2.39.2
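The idea behind the fix can be illustrated with a toy refcounting model (hypothetical Python sketch, not QEMU code): unreference the node the iterator actually referenced last time, instead of re-deriving it from the block backend, whose attached node may have changed in between:

```python
# Model of the bdrv_next() edge case: if the BDS attached to the BB changes
# between iterations (e.g. stream_prepare() dropping a filter), re-deriving
# old_bs from the backend unrefs the wrong node. Tracking it.bs avoids that.

class Node:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0

def ref(node):
    node.refcnt += 1

def unref(node):
    # corresponds to the failing assertion: bdrv_unref: bs->refcnt > 0
    assert node.refcnt > 0, "bdrv_unref: refcnt > 0 failed"
    node.refcnt -= 1

class Backend:
    def __init__(self, node):
        self.node = node

class Iterator:
    def __init__(self, blk):
        self.blk = blk
        self.bs = None            # node referenced by the previous call

    def next(self):
        old_bs = self.bs          # the fix: use what *we* referenced ...
        bs = self.blk.node        # ... not blk_bs(old_blk)
        ref(bs)
        if old_bs:
            unref(old_bs)
        self.bs = bs
        return bs

a, b = Node("a"), Node("b")
blk = Backend(a)
it = Iterator(blk)
it.next()            # references a
blk.node = b         # graph change between iterations
it.next()            # must unref a (previously referenced), not b
print(a.refcnt, b.refcnt)
```

With the pre-fix behavior (old_bs derived from blk.node after the graph change), the second iteration would unref b while a's reference leaks, tripping the refcount assertion later.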
[PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter

Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter
Signed-off-by: Thomas Lamprecht
[FE: do update bytes and offset in any case
     add reproducer to commit message]
Signed-off-by: Fiona Ebner
---

No changes in v3.
No changes in v2.

 block/io.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
 
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
--
2.39.2
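The offset/bytes adjustment that must happen even without a qiov can be sketched as follows (illustrative Python; the function name is mine, not QEMU's): align the request down at the head and up at the tail so the driver sees an aligned read:

```python
# Sketch of the padding arithmetic in bdrv_pad_request() for the no-qiov
# case: only offset and bytes are adjusted; there is no vector to extend.

def pad_request(offset, nbytes, align):
    """Return (aligned_offset, aligned_bytes) covering [offset, offset+nbytes)."""
    head = offset % align                       # pad->head
    tail = (offset + nbytes) % align
    tail = (align - tail) % align               # pad->tail
    return offset - head, nbytes + head + tail

# The reproducer above: a 1-byte prefetch read at offset 0 while the
# compress filter forces a 64k request alignment.
print(pad_request(0, 1, 65536))   # whole first 64k cluster is read
```

Without the fix, this adjustment was skipped for a NULL qiov, so the subsequent "aligned" read was not actually aligned and hit an assertion failure.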
[PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read
Previously, bdrv_pad_request() could not deal with a NULL qiov when a read needed to be aligned. During prefetch, a stream job will pass a NULL qiov. Add a test case to cover this scenario. By accident, also covers a previous race during shutdown, where block graph changes during iteration in bdrv_flush_all() could lead to unreferencing the wrong block driver state and an assertion failure later. Signed-off-by: Fiona Ebner --- No changes in v3. New in v2. .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 2 files changed, 91 insertions(+) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch new file mode 100755 index 00..546db1d369 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +# group: rw quick +# +# Test what happens when a stream job does an unaligned prefetch read +# which requires padding while having a NULL qiov. +# +# Copyright (C) Proxmox Server Solutions GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. 
+# + +import os +import iotests +from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase + +image_size = 1 * 1024 * 1024 +cluster_size = 64 * 1024 +base = os.path.join(iotests.test_dir, 'base.img') +top = os.path.join(iotests.test_dir, 'top.img') + +class TestStreamUnalignedPrefetch(QMPTestCase): +def setUp(self) -> None: +""" +Create two images: +- base image {base} with {cluster_size // 2} bytes allocated +- top image {top} without any data allocated and coarser + cluster size + +Attach a compress filter for the top image, because that +requires that the request alignment is the top image's cluster +size. +""" +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size // 2), +base, str(image_size)) +qemu_io('-c', f'write 0 {cluster_size // 2}', base) +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size), +top, str(image_size)) + +self.vm = iotests.VM() +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': imgfmt, +'node-name': 'base', +'file': { +'driver': 'file', +'filename': base +} +})) +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': 'compress', +'node-name': 'compress-top', +'file': { +'driver': imgfmt, +'node-name': 'top', +'file': { +'driver': 'file', +'filename': top +}, +'backing': 'base' +} +})) +self.vm.launch() + +def tearDown(self) -> None: +self.vm.shutdown() +os.remove(top) +os.remove(base) + +def test_stream_unaligned_prefetch(self) -> None: +self.vm.cmd('block-stream', job_id='stream', device='compress-top') + + +if __name__ == '__main__': +iotests.main(supported_fmts=['qcow2'], supported_protocols=['file']) diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out new file mode 100644 index 00..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out @@ -0,0 +1,5 @@ +. +-- +Ran 1 tests + +OK -- 2.39.2
[PATCH v2 3/3] iotests: add test for stream job with an unaligned prefetch read
Previously, bdrv_pad_request() could not deal with a NULL qiov when a read needed to be aligned. During prefetch, a stream job will pass a NULL qiov. Add a test case to cover this scenario. By accident, also covers a previous race during shutdown, where block graph changes during iteration in bdrv_flush_all() could lead to unreferencing the wrong block driver state and an assertion failure later. Signed-off-by: Fiona Ebner --- New in v2. .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 2 files changed, 91 insertions(+) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch new file mode 100755 index 00..546db1d369 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +# group: rw quick +# +# Test what happens when a stream job does an unaligned prefetch read +# which requires padding while having a NULL qiov. +# +# Copyright (C) Proxmox Server Solutions GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. 
+# + +import os +import iotests +from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase + +image_size = 1 * 1024 * 1024 +cluster_size = 64 * 1024 +base = os.path.join(iotests.test_dir, 'base.img') +top = os.path.join(iotests.test_dir, 'top.img') + +class TestStreamUnalignedPrefetch(QMPTestCase): +def setUp(self) -> None: +""" +Create two images: +- base image {base} with {cluster_size // 2} bytes allocated +- top image {top} without any data allocated and coarser + cluster size + +Attach a compress filter for the top image, because that +requires that the request alignment is the top image's cluster +size. +""" +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size // 2), +base, str(image_size)) +qemu_io('-c', f'write 0 {cluster_size // 2}', base) +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size), +top, str(image_size)) + +self.vm = iotests.VM() +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': imgfmt, +'node-name': 'base', +'file': { +'driver': 'file', +'filename': base +} +})) +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': 'compress', +'node-name': 'compress-top', +'file': { +'driver': imgfmt, +'node-name': 'top', +'file': { +'driver': 'file', +'filename': top +}, +'backing': 'base' +} +})) +self.vm.launch() + +def tearDown(self) -> None: +self.vm.shutdown() +os.remove(top) +os.remove(base) + +def test_stream_unaligned_prefetch(self) -> None: +self.vm.cmd('block-stream', job_id='stream', device='compress-top') + + +if __name__ == '__main__': +iotests.main(supported_fmts=['qcow2'], supported_protocols=['file']) diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out new file mode 100644 index 00..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out @@ -0,0 +1,5 @@ +. +-- +Ran 1 tests + +OK -- 2.39.2
[PATCH v2 0/3] fix two edge cases related to stream block jobs
Changes in v2: * Ran into another issue while writing the IO test Stefan wanted to have (good call :)), so include a fix for that and add the test. I didn't notice during manual testing, because I hadn't used a scripted QMP 'quit', so there was no race. Fiona Ebner (2): block-backend: fix edge case in bdrv_next() where BDS associated to BB changes iotests: add test for stream job with an unaligned prefetch read Stefan Reiter (1): block/io: accept NULL qiov in bdrv_pad_request block/block-backend.c | 7 +- block/io.c| 31 --- .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 4 files changed, 113 insertions(+), 16 deletions(-) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out -- 2.39.2
[PATCH v2 1/3] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter

Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter
Signed-off-by: Thomas Lamprecht
[FE: do update bytes and offset in any case
     add reproducer to commit message]
Signed-off-by: Fiona Ebner
---

No changes in v2.

 block/io.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
 
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
--
2.39.2
[PATCH v2 2/3] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
The old_bs variable in bdrv_next() is currently determined by looking at the old block backend. However, if the block graph changes before the next bdrv_next() call, it might be that the associated BDS is not the same that was referenced previously. In that case, the wrong BDS is unreferenced, leading to an assertion failure later: > bdrv_unref: Assertion `bs->refcnt > 0' failed. In particular, this can happen in the context of bdrv_flush_all(), when polling for bdrv_co_flush() in the generated co-wrapper leads to a graph change (for example with a stream block job [0]). A racy reproducer: > #!/bin/bash > rm -f /tmp/backing.qcow2 > rm -f /tmp/top.qcow2 > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node0" } } > {"execute": "quit"} > EOF [0]: > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...) > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., > errp=...) > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., > detach_subchain=..., errp=...) > #3 bdrv_drop_filter (bs=..., errp=...) > #4 bdrv_cor_filter_drop (cor_filter_bs=...) > #5 stream_prepare (job=...) > #6 job_prepare_locked (job=...) > #7 job_txn_apply_locked (fn=..., job=...) > #8 job_do_finalize_locked (job=...) > #9 job_exit (opaque=...) > #10 aio_bh_poll (ctx=...) > #11 aio_poll (ctx=..., blocking=...) > #12 bdrv_poll_co (s=...) > #13 bdrv_flush (bs=...) > #14 bdrv_flush_all () > #15 do_vm_stop (state=..., send_stop=...) > #16 vm_shutdown () Signed-off-by: Fiona Ebner --- Not sure if this is the correct fix, or if the call site should rather be adapted somehow? New in v2. 
block/block-backend.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 9c4de79e6b..28af1eb17a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); +old_bs = it->bs; + /* First, return all root nodes of BlockBackends. In order to avoid * returning a BDS twice when multiple BBs refer to it, we only return it * if the BB is the first one in the parent list of the BDS. */ if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { BlockBackend *old_blk = it->blk; -old_bs = old_blk ? blk_bs(old_blk) : NULL; - do { it->blk = blk_all_next(it->blk); bs = it->blk ? blk_bs(it->blk) : NULL; @@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) if (bs) { bdrv_ref(bs); bdrv_unref(old_bs); +it->bs = bs; return bs; } it->phase = BDRV_NEXT_MONITOR_OWNED; -} else { -old_bs = it->bs; } /* Then return the monitor-owned BDSes without a BB attached. Ignore all -- 2.39.2
[PATCH] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter

Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter
Signed-off-by: Thomas Lamprecht
[FE: do update bytes and offset in any case
     add reproducer to commit message]
Signed-off-by: Fiona Ebner
---
 block/io.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
 
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
--
2.39.2
[PATCH 2/2] backup: add minimum cluster size to performance options
Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Backup/block-copy will use at least this granularity for copy operations and in particular, discard requests to the backup source will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored.

Suggested-by: Vladimir Sementsov-Ogievskiy
Signed-off-by: Fiona Ebner
---
 block/backup.c            | 2 +-
 block/copy-before-write.c | 2 ++
 block/copy-before-write.h | 1 +
 blockdev.c                | 3 +++
 qapi/block-core.json      | 9 +++++++--
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3dd2e229d2..a1292c01ec 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     }
 
     cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source,
-                          &bcs, errp);
+                          perf->min_cluster_size, &bcs, errp);
     if (!cbw) {
         goto error;
     }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index f9896c6c1e..55a9272485 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -545,6 +545,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
                                   bool discard_source,
+                                  int64_t min_cluster_size,
                                   BlockCopyState **bcs,
                                   Error **errp)
 {
@@ -563,6 +564,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
     }
     qdict_put_str(opts, "file", bdrv_get_node_name(source));
     qdict_put_str(opts, "target", bdrv_get_node_name(target));
+    qdict_put_int(opts, "min-cluster-size", min_cluster_size);
 
     top = bdrv_insert_node(source, opts, flags, errp);
     if (!top) {
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 01af0cd3c4..dc6cafe7fa 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
                                   bool discard_source,
+                                  int64_t min_cluster_size,
                                   BlockCopyState **bcs,
                                   Error **errp);
 void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/blockdev.c b/blockdev.c
index daceb50460..8e6bdbc94a 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2653,6 +2653,9 @@ static BlockJob *do_backup_common(BackupCommon *backup,
         if (backup->x_perf->has_max_chunk) {
             perf.max_chunk = backup->x_perf->max_chunk;
         }
+        if (backup->x_perf->has_min_cluster_size) {
+            perf.min_cluster_size = backup->x_perf->min_cluster_size;
+        }
     }
 
     if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 85c8f88f6e..ba0836892f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1551,11 +1551,16 @@
 #     it should not be less than job cluster size which is calculated
 #     as maximum of target image cluster size and 64k. Default 0.
 #
+# @min-cluster-size: Minimum size of blocks used by copy-before-write
+#     and background copy operations. Has to be a power of 2. No
+#     effect if smaller than the maximum of the target's cluster size
+#     and 64 KiB. Default 0. (Since 9.0)
+#
 # Since: 6.0
 ##
 { 'struct': 'BackupPerf',
-  'data': { '*use-copy-range': 'bool',
-            '*max-workers': 'int', '*max-chunk': 'int64' } }
+  'data': { '*use-copy-range': 'bool', '*max-workers': 'int',
+            '*max-chunk': 'int64', '*min-cluster-size': 'uint32' } }
 
 ##
 # @BackupCommon:
--
2.39.2
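How the effective cluster size is derived once min-cluster-size enters the picture can be sketched as follows (Python restatement under my own names, not the QEMU API): the maximum of the requested minimum, the 64 KiB default, and the target image's own cluster size, with the minimum required to be a power of two:

```python
# Sketch of the block-copy cluster size calculation with the new option.
# target_cluster_size stands in for what bdrv_get_info() reports (0 if the
# target driver doesn't report one).

DEFAULT_CLUSTER_SIZE = 64 * 1024   # BLOCK_COPY_CLUSTER_SIZE_DEFAULT

def effective_cluster_size(min_cluster_size, target_cluster_size=0):
    if min_cluster_size and (min_cluster_size & (min_cluster_size - 1)):
        raise ValueError("min-cluster-size needs to be a power of 2")
    return max(min_cluster_size, DEFAULT_CLUSTER_SIZE, target_cluster_size)

# fleecing image with 1 MiB granularity, qcow2 target with 64k clusters:
print(effective_cluster_size(1024 * 1024, 64 * 1024))
# smaller than the default has no effect, matching the QAPI doc:
print(effective_cluster_size(4096, 64 * 1024))
```

This makes the "No effect if smaller than the maximum of the target's cluster size and 64 KiB" wording in the QAPI documentation concrete: the option can only raise the granularity, never lower it.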
[PATCH 0/2] backup: allow specifying minimum cluster size
Based-on: https://lore.kernel.org/qemu-devel/20240228141501.455989-1-vsement...@yandex-team.ru/ Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Backup/block-copy will use at least this granularity for copy operations and in particular, discard requests to the backup source will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. Fiona Ebner (2): copy-before-write: allow specifying minimum cluster size backup: add minimum cluster size to performance options block/backup.c | 2 +- block/block-copy.c | 17 + block/copy-before-write.c | 5 - block/copy-before-write.h | 1 + blockdev.c | 3 +++ include/block/block-copy.h | 1 + qapi/block-core.json | 17 ++--- 7 files changed, 37 insertions(+), 9 deletions(-) -- 2.39.2
[PATCH 1/2] copy-before-write: allow specifying minimum cluster size
Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Copy-before-write operations will use at least this granularity and in particular, discard requests to the source node will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. The QAPI uses uint32 so the value will be non-negative, but still fit into a uint64_t. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- block/block-copy.c | 17 + block/copy-before-write.c | 3 ++- include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 23 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..adb1cbb440 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, "used. If the actual block size of the target exceeds " "this default, the backup may be unusable", BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. 
*/ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, + MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size)); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + int64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min_cluster_size && !is_power_of_2(min_cluster_size)) { +error_setg(errp, "min-cluster-size needs to be a power of 2"); +return NULL; +} + +cluster_size = block_copy_calculate_cluster_size(target->bs, + min_cluster_size, errp); if (cluster_size < 0) { return NULL; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index dac57481c5..f9896c6c1e 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -476,7 +476,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags, s->discard_source = flags & BDRV_O_CBW_DISCARD_SOURCE; s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap, - flags & BDRV_O_CBW_DISCARD_SOURCE, errp); + flags & BDRV_O_CBW_DISCARD_SOURCE, + opts->min_cluster_size, errp); if (!s->bcs) { error_prepend(errp, "Cannot create block-copy-state: "); return -EINVAL; diff --git a/include/block/block-copy.h b/include/block/block-copy.h index bdc703bacd..77857c6c68 100644 --- a/include/block/block-copy.h +++ b/include/block/block-copy.h @@ -28,6 +28,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + int64_t min_cluster_size, Error **errp); /* Function should be called prior any actual copy request */ diff --git 
a/qapi/block-core.json b/qapi/block-core.json index 0a72c590a8..85c8f88f6e 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4625,12 +4625,18 @@ # @on-cbw-error
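Condensed, the cluster-size selection implemented by block_copy_calculate_cluster_size() after this patch behaves like the following sketch (warnings and the exact error paths are omitted; `None` stands in for a failed cluster-size query, e.g. via bdrv_get_info()):

```python
BLOCK_COPY_CLUSTER_SIZE_DEFAULT = 64 * 1024  # 64 KiB

def calculate_cluster_size(target_cluster_size, target_does_cow,
                           min_cluster_size=0):
    if target_cluster_size is None:
        if not target_does_cow:
            # fatal in the real code: the backup could be unusable
            raise ValueError("couldn't determine target cluster size")
        # non-fatal for COW targets: trudge on with the default
        return max(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT)
    return max(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT,
               target_cluster_size)
```

The key property of the patch is visible in the last line: the caller-provided minimum is simply folded into the existing MAX() selection.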
Re: [PATCH v3 0/5] backup: discard-source parameter
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > Hi all! The main patch is 04, please look at it for description and > diagram. > > v3: > 02: new patch > 04: take WRITE permission only when discard_source is required > > Vladimir Sementsov-Ogievskiy (5): > block/copy-before-write: fix permission > block/copy-before-write: support unligned snapshot-discard > block/copy-before-write: create block_copy bitmap in filter node > qapi: blockdev-backup: add discard-source parameter > iotests: add backup-discard-source > > block/backup.c| 5 +- > block/block-copy.c| 12 +- > block/copy-before-write.c | 39 - > block/copy-before-write.h | 1 + > block/replication.c | 4 +- > blockdev.c| 2 +- > include/block/block-common.h | 2 + > include/block/block-copy.h| 2 + > include/block/block_int-global-state.h| 2 +- > qapi/block-core.json | 4 + > tests/qemu-iotests/257.out| 112 ++--- > .../qemu-iotests/tests/backup-discard-source | 151 ++ > .../tests/backup-discard-source.out | 5 + > 13 files changed, 271 insertions(+), 70 deletions(-) > create mode 100755 tests/qemu-iotests/tests/backup-discard-source > create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out > Tested-by: Fiona Ebner
Re: [PATCH v3 5/5] iotests: add backup-discard-source
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy: > Add test for a new backup option: discard-source. > > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > .../qemu-iotests/tests/backup-discard-source | 151 ++ > .../tests/backup-discard-source.out | 5 + > 2 files changed, 156 insertions(+) > create mode 100755 tests/qemu-iotests/tests/backup-discard-source > create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out > > diff --git a/tests/qemu-iotests/tests/backup-discard-source > b/tests/qemu-iotests/tests/backup-discard-source > new file mode 100755 > index 00..8a88b0f6c4 > --- /dev/null > +++ b/tests/qemu-iotests/tests/backup-discard-source > @@ -0,0 +1,151 @@ > +#!/usr/bin/env python3 > +# > +# Test removing persistent bitmap from backing > +# > +# Copyright (c) 2022 Virtuozzo International GmbH. > +# Title and copyright year are wrong. Apart from that: Reviewed-by: Fiona Ebner
Re: [PATCH v3 2/5] block/copy-before-write: support unaligned snapshot-discard
On 28.02.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote: > The first thing that crashes on unaligned access here is > bdrv_reset_dirty_bitmap(). The correct way is to align down the > snapshot-discard request. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v3 4/5] qapi: blockdev-backup: add discard-source parameter
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy: > Add a parameter that enables discard-after-copy. That is mostly useful > in "push backup with fleecing" scheme, when source is snapshot-access > format driver node, based on copy-before-write filter snapshot-access > API: > > [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >|| >| root | file >vv > [copy-before-write] >| | >| file| target >v v > [active disk] [temp.img] > > In this case discard-after-copy does two things: > > - discard data in temp.img to save disk space > - avoid further copy-before-write operation in discarded area > > Note that we have to declare WRITE permission on source in > copy-before-write filter, for discard to work. Still we can't take it > unconditionally, as it will break normal backup from RO source. So, we > have to add a parameter and pass it thorough bdrv_open flags. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v3 3/5] block/copy-before-write: create block_copy bitmap in filter node
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > Currently block_copy creates copy_bitmap in source node. But that is in > bad relation with .independent_close=true of copy-before-write filter: > source node may be detached and removed before .bdrv_close() handler > called, which should call block_copy_state_free(), which in turn should > remove copy_bitmap. > > That's all not ideal: it would be better if internal bitmap of > block-copy object is not attached to any node. But that is not possible > now. > > The simplest solution is just create copy_bitmap in filter node, where > anyway two other bitmaps are created. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v2 00/10] mirror: allow switching from background to active mode
Am 07.03.24 um 20:42 schrieb Vladimir Sementsov-Ogievskiy: > On 04.03.24 14:09, Peter Krempa wrote: >> On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote: >>> Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben: On 03.11.23 18:56, Markus Armbruster wrote: > Kevin Wolf writes: >> >> [...] >> > Is the job abstraction a failure? > > We have > > block-job- command since job- command since > - > block-job-set-speed 1.1 > block-job-cancel 1.1 job-cancel 3.0 > block-job-pause 1.3 job-pause 3.0 > block-job-resume 1.3 job-resume 3.0 > block-job-complete 1.3 job-complete 3.0 > block-job-dismiss 2.12 job-dismiss 3.0 > block-job-finalize 2.12 job-finalize 3.0 > block-job-change 8.2 > query-block-jobs 1.1 query-jobs >> >> [...] >> >>> I consider these strictly optional. We don't really have strong reasons >>> to deprecate these commands (they are just thin wrappers), and I think >>> libvirt still uses block-job-* in some places. >> >> Libvirt uses 'block-job-cancel' because it has different semantics from >> 'job-cancel' which libvirt documented as the behaviour of the API that >> uses it. (Semantics regarding the expectation of what is written to the >> destination node at the point when the job is cancelled). >> > > That's the following semantics: > > # Note that if you issue 'block-job-cancel' after 'drive-mirror' has > # indicated (via the event BLOCK_JOB_READY) that the source and > # destination are synchronized, then the event triggered by this > # command changes to BLOCK_JOB_COMPLETED, to indicate that the > # mirroring has ended and the destination now has a point-in-time copy > # tied to the time of the cancellation. > > Hmm. Looking at this, it looks for me, that should probably a > 'block-job-complete" command (as leading to BLOCK_JOB_COMPLETED). > > Actually, what is the difference between block-job-complete and > block-job-cancel(force=false) for mirror in ready state? > > I only see the following differencies: > > 1. 
block-job-complete documents that it completes the job > synchronously.. But looking at mirror code I see it just set > s->should_complete = true, which will be then handled asynchronously.. > So I doubt that documentation is correct. > > 2. block-job-complete will trigger final graph changes. block-job-cancel > will not. > > Is [2] really useful? Seems yes: in case of some failure before starting > migration target, we'd like to continue executing source. So, no reason > to break block-graph in source, better keep it unchanged. > FWIW, we also rely on these special semantics. We allow cloning the disk state of a running guest using drive-mirror (and before finishing, fsfreeze in the guest for consistency). We cannot use block-job-complete there, because we do not want to switch the source's drive. > But I think, such behavior better be setup by mirror-job start > parameter, rather then by special option for cancel (or even compelete) > command, useful only for mirror. > > So, what about the following substitution for block-job-cancel: > > block-job-cancel(force=true) --> use job-cancel > > block-job-cancel(force=false) for backup, stream, commit --> use > job-cancel > > block-job-cancel(force=false) for mirror in ready mode --> > > instead, use block-job-complete. If you don't need final graph > modification which mirror job normally does, use graph-change=false > parameter for blockdev-mirror command. > But yes, having a graph-change parameter would work for us too :) Best Regards, Fiona
[PATCH v2 4/4] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target is a diff image, i.e. one that should only contain the delta and was not synced to previously, too large a cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the job's granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for the special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- blockdev.c | 19 +++ tests/qemu-iotests/tests/bitmap-sync-mirror | 6 ++ .../qemu-iotests/tests/bitmap-sync-mirror.out | 7 +++ 3 files changed, 32 insertions(+) diff --git a/blockdev.c b/blockdev.c index c76eb97a4c..968d44cd3b 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2847,6 +2847,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, } if (bitmap_name) { +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + if (sync != MIRROR_SYNC_MODE_FULL) { error_setg(errp, "Sync mode '%s' not supported with bitmap.", MirrorSyncMode_str(sync)); @@ -2863,6 +2866,22 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to be small enough then. 
+ */ +if (bdrv_get_info(target, &bdi) >= 0) { +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of " + "the target image's cluster size %u", + bitmap_granularity, bdi.cluster_size); +return; +} +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) { return; } diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror index 898f1f4ba4..cbd5cc99cc 100755 --- a/tests/qemu-iotests/tests/bitmap-sync-mirror +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -552,6 +552,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity --\n".format(sync_mode)) +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=(GRANULARITY // 2)) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror.out b/tests/qemu-iotests/tests/bitmap-sync-mirror.out index c05b4788c6..d40ea7d689 100644 --- a/tests/qemu-iotests/tests/bitmap-sync-mirror.out +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror.out @@ -2937,3 +2937,10 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fmirror3" ==> Identical, OK! 
{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap0", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "none", "target": "mirror_target"}} {"error": {"class": "GenericError", "desc": "Sync mode 'none' not supported with bitmap."}} +-- Test bitmap with too small granularity -- + +{"execute": "block-dirty-bitmap-add", "arguments": {"granularity": 32768, "name": "bitmap-small", "node": "drive0"}} +{"return": {}} +{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap-small", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "full", "target": "mirror_target"}} +{"error": {"class": "GenericError", "desc": "Bitmap granularity 32768 is not a multiple of the target image's cluster size 65536"}} + -- 2.39.2
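The added check reduces to requiring that the bitmap granularity is a multiple of the target's cluster size (and at least as large); as a standalone predicate (sketch, not QEMU code):

```python
def bitmap_granularity_ok(bitmap_granularity, target_cluster_size):
    # Copies happen at bitmap granularity, so each copy must cover whole
    # target clusters; otherwise a diff image gets partially filled
    # clusters that "mask" the base image after rebasing.
    return (bitmap_granularity >= target_cluster_size
            and bitmap_granularity % target_cluster_size == 0)
```

This matches the failing QMP reply above: granularity 32768 against a 65536-byte target cluster is rejected, while 65536 or 131072 would pass.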
[PATCH v2 3/4] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 adapt to changes to mirror bitmap interface rename test from '384' to 'bitmap-sync-mirror'] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/bitmap-sync-mirror | 565 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2939 + 2 files changed, 3504 insertions(+) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror new file mode 100755 index 00..898f1f4ba4 --- /dev/null +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -0,0 +1,565 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import math +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. 
+Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits |= set(bits) + +def dirty_group(self, n): +self.dirty_bits(GROUPS[n].bits(self.granularity)) + +def clear(self): +self._bits = set() + +def clear_bits(self, bits
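Pattern.bits() in the iotest above maps a byte range to the set of dirty-bitmap bit indices it touches; isolated, the arithmetic is just:

```python
GRANULARITY = 64 * 1024  # matches the iotest

def pattern_bits(offset, size, granularity=GRANULARITY):
    # First and last bitmap bit covered by [offset, offset + size)
    lower = offset // granularity
    upper = (offset + size - 1) // granularity
    return set(range(lower, upper + 1))

# A granularity-sized write straddling a cluster boundary dirties two bits:
print(sorted(pattern_bits(32 * 1024, 64 * 1024)))  # [0, 1]
```

The `- 1` on the upper bound is what keeps an exactly-aligned write from spuriously dirtying the following bit.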
[PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- I felt like keeping the "Since: X.Y" as before makes the most sense as to not lose history. Or is it necessary to change this for BackupSyncMode (and its members) since it got a new name? block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c | 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == &backup_job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this 
thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 5145eb53e1..1609354db3 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2011,13 +2011,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == 
MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index d954bec6f1..9633d000c0 100644 --- a/blo
[PATCH v2 2/4] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using the block-dirty-bitmap-clear and block-dirty-bitmap-merge APIs, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a fresh "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when the mirror still has work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limitation in QAPI. 
This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner --- block/mirror.c | 95 -- blockdev.c | 39 +-- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 37 +- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 146 insertions(+), 32 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 1609354db3..5c9a00b574 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). 
+ */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -687,7 +692,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -718,7 +727,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -815,6 +825,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if
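The three emulated @BitmapSyncMode behaviors from the commit message can be sketched with Python sets standing in for dirty bitmaps (purely illustrative; a real client would use the block-dirty-bitmap-add, -clear and -merge QMP commands, and the function and mode names here are made up):

```python
def simulate_bitmap_sync_mode(mode, bitmap, run_mirror):
    # bitmap: set of dirty cluster indices owned by the caller.
    # run_mirror(working): runs the mirror job with `working` as the
    # working bitmap (the job clears the bits it copied) and returns
    # True on success.
    if mode == "always":
        run_mirror(bitmap)  # job works directly on the caller's bitmap
        return
    working = set(bitmap)   # "copy bitmap and pass copy to the job"
    success = run_mirror(working)
    if mode == "on-success" and success:
        bitmap.intersection_update(working)  # keep only uncopied bits

def _copies_everything(working):  # toy mirror job: everything succeeds
    working.clear()
    return True

always_bitmap = {1, 2}
simulate_bitmap_sync_mode("always", always_bitmap, _copies_everything)
never_bitmap = {1, 2}
simulate_bitmap_sync_mode("never", never_bitmap, _copies_everything)
on_success_bitmap = {1, 2}
simulate_bitmap_sync_mode("on-success", on_success_bitmap, _copies_everything)
```

After a fully successful run, "always" and "on-success" leave the caller's bitmap clear, while "never" leaves it untouched.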
[PATCH v2 0/4] mirror: allow specifying working bitmap
Changes from RFC/v1 (discussion here [0]): * Add patch to split BackupSyncMode and MirrorSyncMode. * Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly). * Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data. Use cases: * Possibility to resume a failed mirror later. * Possibility to only mirror deltas to a previously mirrored volume. * Possibility to (efficiently) mirror a drive that was previously mirrored via some external mechanism (e.g. ZFS replication). We have been using the last one in production without any issues for about 4 years now. In particular, as mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/4 actually contains yet another use case, namely doing incremental mirrors to stand-alone qcow2 "diff" images that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. But this exposed an issue with the diff image approach. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. 
The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. For the latter case, patch 4/4 adds a check. For the former, the limitation is documented (I'd expect this to be a niche use case in practice). [0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (2): qapi/block-core: avoid the re-use of MirrorSyncMode for backup blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c| 18 +- block/mirror.c| 102 +- block/monitor/block-hmp-cmds.c|2 +- block/replication.c |2 +- blockdev.c| 84 +- include/block/block_int-global-state.h|7 +- qapi/block-core.json | 64 +- tests/qemu-iotests/tests/bitmap-sync-mirror | 571 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2946 + tests/unit/test-block-iothread.c |2 +- 10 files changed, 3729 insertions(+), 69 deletions(-) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out -- 2.39.2
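The rebasing failure described above can be reproduced in a toy byte-level model (not QEMU code; cluster size and data are invented): an unaligned write allocates a whole cluster in the diff image, and rebasing then hides the base image's data in the unwritten part of that cluster.

```python
CLUSTER = 8  # toy cluster size in bytes

def rebase(base, diff, allocated):
    # Per-cluster rebase: an allocated diff cluster completely
    # replaces ("masks") the corresponding base cluster.
    out = bytearray(base)
    for c in allocated:
        out[c * CLUSTER:(c + 1) * CLUSTER] = diff[c * CLUSTER:(c + 1) * CLUSTER]
    return bytes(out)

base = b"A" * 16          # two clusters of previously synced data
diff = bytearray(16)      # diff image starts fully unallocated (zeros)
allocated = set()

# An unaligned 4-byte write lands in the diff image and allocates the
# whole first cluster, with the remaining 4 bytes left zero-filled:
diff[0:4] = b"B" * 4
allocated.add(0)

result = rebase(base, bytes(diff), allocated)
assert result[0:4] == b"BBBB"
assert result[4:8] == b"\x00" * 4  # base data masked by the allocated cluster
assert result[8:16] == b"A" * 8    # untouched cluster still shows the base
```

This is exactly why patch 4/4 requires the bitmap granularity (which copy operations are aligned to) to cover whole target clusters.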
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 13:47 schrieb Fiona Ebner:
> Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy:
>> On 29.02.24 13:11, Fiona Ebner wrote:
>>>
>>> The iotest creates a new target image for each incremental sync which
>>> only records the diff relative to the previous mirror and those diff
>>> images are later rebased onto each other to get the full picture.
>>>
>>> Thus, it can be that a previous mirror job (not just background process
>>> or previous write) already copied a cluster, and in particular, copied
>>> it to a different target!
>>
>> Ah, I understand.
>>
>> For simplicity, let's consider the case when source "cluster size" = "job
>> cluster size" = "bitmap granularity" = "target cluster size".
>>
>> Which types of clusters should we consider when we want to handle a guest
>> write?
>>
>> 1. Clusters that should be copied by the background process
>>
>> These are dirty clusters from the user-given bitmap, or, if we do a
>> full-disk mirror, all clusters not yet copied by the background process.
>>
>> For such clusters we simply ignore the unaligned write. We can even
>> ignore the aligned write too: less disturbing the guest by delays.
>>
>
> Since do_sync_target_write() currently doesn't ignore aligned writes, I
> wouldn't change it. Of course they can count towards the "done_bitmap"
> you propose below.
>
>> 2. Clusters already copied by the background process during this mirror
>> job and not dirtied by the guest since that time.
>>
>> For such clusters we are safe to do the unaligned write, as the target
>> cluster must be allocated.
>>
>
> Right.
>
>> 3. Clusters not marked initially by the dirty bitmap.
>>
>> What to do with them? We can't do an unaligned write. I see two variants:
>>
>> - do an additional read from the source to fill the whole cluster, which
>> seems a bit too heavy
>>
>
> Yes, I'd rather only do that as a last resort.
>
>> - just mark the cluster as dirty for the background job. So we behave
>> like in "background" mode. But why not? The maximum count of such "hacks"
>> is limited to the number of "clear" clusters at the start of the mirror
>> job, which means that we don't seriously affect the convergence. Mirror
>> is guaranteed to converge anyway. And the whole sense of "write-blocking"
>> mode is to have guaranteed convergence. What do you think?
>>
>
> It could lead to a lot of flips between job->actively_synced == true and
> == false. AFAIU, currently, we only switch back from true to false when
> an error happens. While I don't see a concrete issue with it, at least
> it might be unexpected to users, so it had better be documented.
>
> I'll try going with this approach, thanks!
>

These flips are actually a problem. When using live migration with disk mirroring, it's good that an actively synced image stays actively synced. Otherwise, migration could finish at an inconvenient time and try to inactivate the block device while mirror still has something to do, which would lead to an assertion failure [0].

The IO test added by this series is what uses the possibility to sync to "diff images" which contain only the delta. In production, we are only syncing to a previously mirrored target image. Non-aligned writes are not an issue later like with a diff image (even if the initial mirroring happened via ZFS replication outside of QEMU). So copy-mode=write-blocking would work fine for our use case, but if I go with the "mark clusters for unaligned writes dirty" approach, it would not work fine anymore.

Should I rather just document the limitation for the combination "target is a diff image" and copy-mode=write-blocking? I'd still add the check for the granularity and target cluster size. While also only needed for diff images, it would allow using background mode safely for those.

Best Regards,
Fiona

[0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/
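The per-cluster decision discussed above can be sketched as a small policy helper. This is a hypothetical Python model of the proposed write-blocking behavior (the names `dirty` and `done` correspond to the user-given bitmap and the proposed "done_bitmap"), not actual QEMU code:

```python
def handle_guest_write(dirty, done):
    """Decide how to handle an unaligned guest write to one cluster.

    dirty: cluster is still marked in the bitmap driving the background
           copy process.
    done:  cluster was already copied during *this* job (the proposed
           "done_bitmap").
    """
    if dirty:
        # Case 1: the background process will copy the cluster anyway,
        # so the unaligned write can simply be ignored here.
        return "ignore"
    if done:
        # Case 2: the target cluster is known to be allocated by this
        # job, so an unaligned write into it is safe.
        return "write-unaligned"
    # Case 3: the cluster was clear in the initial bitmap; instead of a
    # read-modify-write, mark it dirty and let the background job copy
    # the whole cluster later.
    return "mark-dirty"
```

Note that case 3 is exactly the step that can flip job->actively_synced back to false, which is the concern raised in the reply.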
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 16:46 schrieb Vladimir Sementsov-Ogievskiy:
> On 01.03.24 18:14, Fiona Ebner wrote:
>> Am 01.03.24 um 16:02 schrieb Vladimir Sementsov-Ogievskiy:
>>>>> About documentation: actually, I never liked that we use
>>>>> "MirrorSyncMode" for the backup job. Now it looks more like
>>>>> "BackupSyncMode", having two values supported only by backup.
>>>>>
>>>>> I'm also unsure how mode=full=some_bitmap differs from
>>>>> mode=bitmap=some_bitmap..
>>>>>
>>>>
>>>> With the current patches, it was an error to specify @bitmap for other
>>>> modes than 'incremental' and 'bitmap'.
>>>
>>> Current documentation says:
>>> # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
>>> #     is "bitmap" or "incremental".  Can be present if sync is "full"
>>> #     or "top".  Must not be present otherwise.
>>> #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>>>
>>
>> This is for backup. The documentation (and behavior) for @bitmap added
>> by these patches for mirror is different ;)
>
> I meant backup in "I'm also unsure", just as one more point to not
> consider the backup-bitmap-API as a prototype for the mirror-bitmap-API.
>

Oh, I see. Sorry for the confusion!

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 16:02 schrieb Vladimir Sementsov-Ogievskiy:
> On 01.03.24 17:52, Fiona Ebner wrote:
>> Am 01.03.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
>>>
>>> As we already understood, the (block-)job-api needs some spring-cleaning.
>>> Unfortunately I don't have much time for it, but still I decided to
>>> start by finally deprecating the block-job-* API and moving to job-*..
>>> Probably the bitmap/bitmap-mode/sync APIs also need some optimization,
>>> keeping in mind the new block-dirty-bitmap-merge API.
>>>
>>> So, what I could advise in this situation for new interfaces:
>>>
>>> 1. be minimalistic
>>> 2. add `x-` prefix when unsure
>>>
>>> So, following these two rules, what about an x-bitmap field, which may
>>> be combined only with 'full' mode, and do what you need?
>>>
>>
>> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
>> doesn't need to be renamed if it becomes stable later.
>
> Right, an unstable feature is needed; using "x-" is optional.
>
> A recent discussion about it was in my "vhost-user-blk: live resize
> additional APIs" series:
>
> https://patchew.org/QEMU/20231006202045.1161543-1-vsement...@yandex-team.ru/20231006202045.1161543-5-vsement...@yandex-team.ru/
>
> Following it, I think it's OK to not bother anymore with "x-" prefixes
> and to rely on the unstable feature.
>

Thanks for the confirmation! I'll go without the prefix in the name then.

>>
>>> About documentation: actually, I never liked that we use
>>> "MirrorSyncMode" for the backup job. Now it looks more like
>>> "BackupSyncMode", having two values supported only by backup.
>>>
>>> I'm also unsure how mode=full=some_bitmap differs from
>>> mode=bitmap=some_bitmap..
>>>
>>
>> With the current patches, it was an error to specify @bitmap for other
>> modes than 'incremental' and 'bitmap'.
>
> Current documentation says:
> # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
> #     is "bitmap" or "incremental".  Can be present if sync is "full"
> #     or "top".  Must not be present otherwise.
> #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>

This is for backup. The documentation (and behavior) for @bitmap added by these patches for mirror is different ;)

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
>
> As we already understood, the (block-)job-api needs some spring-cleaning.
> Unfortunately I don't have much time for it, but still I decided to
> start by finally deprecating the block-job-* API and moving to job-*..
> Probably the bitmap/bitmap-mode/sync APIs also need some optimization,
> keeping in mind the new block-dirty-bitmap-merge API.
>
> So, what I could advise in this situation for new interfaces:
>
> 1. be minimalistic
> 2. add `x-` prefix when unsure
>
> So, following these two rules, what about an x-bitmap field, which may
> be combined only with 'full' mode, and do what you need?
>

AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it doesn't need to be renamed if it becomes stable later.

> About documentation: actually, I never liked that we use
> "MirrorSyncMode" for the backup job. Now it looks more like
> "BackupSyncMode", having two values supported only by backup.
>
> I'm also unsure how mode=full=some_bitmap differs from
> mode=bitmap=some_bitmap..
>

With the current patches, it was an error to specify @bitmap for other modes than 'incremental' and 'bitmap'.

> So, I'd suggest simply renaming MirrorSyncMode to BackupSyncMode, and
> adding a separate MirrorSyncMode with only "full", "top" and "none"
> values.
>

Sounds good to me!

[0]: https://gitlab.com/qemu-project/qemu/-/commit/a3c45b3e62962f99338716b1347cfb0d427cea44

Best Regards,
Fiona
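For reference, marking a QAPI member with the unstable feature (instead of an x- prefix) looks roughly like this. The struct and member names here are made up for illustration; only the `'features': [ 'unstable' ]` notation is the actual QAPI mechanism being discussed:

```
{ 'struct': 'SomeMirrorOptions',
  'data': { '*bitmap': { 'type': 'str',
                         'features': [ 'unstable' ] } } }
```

A member declared this way can later be stabilized by dropping the feature, without any rename.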
Re: [PATCH 00/19] Workaround Windows failing to find 64bit SMBIOS entry point with SeaBIOS
Am 29.02.24 um 14:18 schrieb Fiona Ebner:
> Am 27.02.24 um 16:47 schrieb Igor Mammedov:
>> The Windows (10) bootloader, when running on top of SeaBIOS, fails to
>> find the SMBIOSv3 entry point. Tracing it shows that it looks for v2
>> anchor markers only and not v3. Tricking it into believing that the
>> entry point was found lets Windows successfully locate and parse the
>> SMBIOSv3 tables. Whether it will be fixed on the Windows side is not
>> clear, so here goes a workaround.
>>
>> The idea is to try to build v2 tables if the QEMU configuration permits,
>> and fall back to v3 tables otherwise. That will mask the Windows issue
>> from the majority of users.
>> However, if the VM configuration can't be described by v2 tables
>> (typically large VMs), QEMU will use SMBIOSv3 and Windows will hit the
>> issue again. In this case complain to Microsoft and/or use UEFI instead
>> of SeaBIOS (requires a reinstall).
>>
>> Default compat setting of smbios-entry-point-type after this series
>> for pc/q35 machines:
>>   * 9.0-newer: 'auto'
>>   * 8.1-8.2: '64'
>>   * 8.0-older: '32'
>>
>> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/2008
>
> Thank you! I'm happy to confirm that this series works around the issue :)
>

While I still didn't do any in-depth testing (I don't have enough knowledge for that anyway), I have played around a bit more now, checked that nothing obvious breaks with a Linux VM either, and also ran a successful 'make check'. If that is enough, feel free to add:

Tested-by: Fiona Ebner

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 13:00 schrieb Vladimir Sementsov-Ogievskiy:
>
> But anyway, this all could be simply achieved with the
> bitmap-copying/merging API, if we allow passing a user-given bitmap to
> the mirror as the working bitmap.
>
>>
>> I see, I'll drop the 'bitmap-mode' in the next version if nobody
>> complains :)
>>
>
> Good. It's a golden rule: never make public interfaces which you don't
> actually need for production. I myself sometimes violate it and spend
> extra time on developing features, which we later have to just drop as
> "not needed downstream, no sense in upstreaming".
>

Just wondering which new mode I should allow for @MirrorSyncMode then? The documentation states:

> # @incremental: only copy data described by the dirty bitmap.
> #     (since: 2.4)
> #
> # @bitmap: only copy data described by the dirty bitmap. (since: 4.2)
> #     Behavior on completion is determined by the BitmapSyncMode.

For backup, do_backup_common() just maps @incremental to @bitmap + @bitmap-mode == @on-success. Using @bitmap for mirror would lead to being at odds with the documentation, because it mentions the BitmapSyncMode, which mirror won't have. Using @incremental for mirror would be consistent with the documentation, but behave a bit differently from backup. Opinions?

Best Regards,
Fiona
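The backup-side desugaring mentioned above can be modeled like this. This is a Python sketch of the mapping that do_backup_common() performs, not the actual C code:

```python
def desugar_sync_mode(sync, bitmap_mode):
    """Map sync=incremental to sync=bitmap with bitmap-mode=on-success.

    Other sync modes pass through unchanged. Raises on the nonsensical
    combination of incremental with a conflicting bitmap-mode.
    """
    if sync == "incremental":
        if bitmap_mode is not None and bitmap_mode != "on-success":
            raise ValueError("incremental implies bitmap-mode=on-success")
        return "bitmap", "on-success"
    return sync, bitmap_mode
```

The open question for mirror is whether to expose @incremental (matching this desugaring but behaving slightly differently from backup) or @bitmap (whose documentation references BitmapSyncMode, which mirror would not have).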
Re: [PATCH 00/19] Workaround Windows failing to find 64bit SMBIOS entry point with SeaBIOS
Am 27.02.24 um 16:47 schrieb Igor Mammedov:
> The Windows (10) bootloader, when running on top of SeaBIOS, fails to
> find the SMBIOSv3 entry point. Tracing it shows that it looks for v2
> anchor markers only and not v3. Tricking it into believing that the
> entry point was found lets Windows successfully locate and parse the
> SMBIOSv3 tables. Whether it will be fixed on the Windows side is not
> clear, so here goes a workaround.
>
> The idea is to try to build v2 tables if the QEMU configuration permits,
> and fall back to v3 tables otherwise. That will mask the Windows issue
> from the majority of users.
> However, if the VM configuration can't be described by v2 tables
> (typically large VMs), QEMU will use SMBIOSv3 and Windows will hit the
> issue again. In this case complain to Microsoft and/or use UEFI instead
> of SeaBIOS (requires a reinstall).
>
> Default compat setting of smbios-entry-point-type after this series
> for pc/q35 machines:
>   * 9.0-newer: 'auto'
>   * 8.1-8.2: '64'
>   * 8.0-older: '32'
>
> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/2008
>

Thank you! I'm happy to confirm that this series works around the issue :)

Best Regards,
Fiona
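For affected setups, the entry point type discussed above can also be pinned manually via the machine property. This is an illustrative invocation (the rest of the command line is elided; '32' forces the SMBIOS 2.1 entry point that the Windows bootloader searches for):

```
qemu-system-x86_64 -machine q35,smbios-entry-point-type=32 ...
```

With the series applied, 'auto' makes QEMU pick v2 when the configuration fits and v3 otherwise.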
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy:
> On 29.02.24 13:11, Fiona Ebner wrote:
>>
>> The iotest creates a new target image for each incremental sync which
>> only records the diff relative to the previous mirror and those diff
>> images are later rebased onto each other to get the full picture.
>>
>> Thus, it can be that a previous mirror job (not just background process
>> or previous write) already copied a cluster, and in particular, copied
>> it to a different target!
>
> Ah, I understand.
>
> For simplicity, let's consider the case when source "cluster size" = "job
> cluster size" = "bitmap granularity" = "target cluster size".
>
> Which types of clusters should we consider when we want to handle a guest
> write?
>
> 1. Clusters that should be copied by the background process
>
> These are dirty clusters from the user-given bitmap, or, if we do a
> full-disk mirror, all clusters not yet copied by the background process.
>
> For such clusters we simply ignore the unaligned write. We can even
> ignore the aligned write too: less disturbing the guest by delays.
>

Since do_sync_target_write() currently doesn't ignore aligned writes, I wouldn't change it. Of course they can count towards the "done_bitmap" you propose below.

> 2. Clusters already copied by the background process during this mirror
> job and not dirtied by the guest since that time.
>
> For such clusters we are safe to do the unaligned write, as the target
> cluster must be allocated.
>

Right.

> 3. Clusters not marked initially by the dirty bitmap.
>
> What to do with them? We can't do an unaligned write. I see two variants:
>
> - do an additional read from the source to fill the whole cluster, which
> seems a bit too heavy
>

Yes, I'd rather only do that as a last resort.

> - just mark the cluster as dirty for the background job. So we behave
> like in "background" mode. But why not? The maximum count of such "hacks"
> is limited to the number of "clear" clusters at the start of the mirror
> job, which means that we don't seriously affect the convergence. Mirror
> is guaranteed to converge anyway. And the whole sense of "write-blocking"
> mode is to have guaranteed convergence. What do you think?
>

It could lead to a lot of flips between job->actively_synced == true and == false. AFAIU, currently, we only switch back from true to false when an error happens. While I don't see a concrete issue with it, at least it might be unexpected to users, so it had better be documented.

I'll try going with this approach, thanks!

>
> Of course, we can't distinguish the 3 types by one dirty bitmap, so we
> need a second one. For example a "done_bitmap", where we can mark
> clusters that were successfully copied. That would be a kind of
> block-status of the target image. But using a bitmap is a lot better
> than querying block-status from the target.

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 28.02.24 um 17:24 schrieb Vladimir Sementsov-Ogievskiy:
> On 16.02.24 13:55, Fiona Ebner wrote:
>> Previous discussion from when this was sent upstream [0] (it's been a
>> while). I rebased the patches and re-ordered and squashed like
>> suggested back then [1].
>>
>> This implements two new mirror modes:
>>
>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>> - incremental mirror mode as sugar for bitmap + on-success
>>
>> Use cases:
>> * Possibility to resume a failed mirror later.
>> * Possibility to only mirror deltas to a previously mirrored volume.
>> * Possibility to (efficiently) mirror a drive that was previously
>> mirrored via some external mechanism (e.g. ZFS replication).
>>
>> We have been using the last one in production without any issues for
>> about 4 years now. In particular, like mentioned in [2]:
>>
>>> - create bitmap(s)
>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>> - incrementally drive mirror as part of a live migration of VM
>>> - drop bitmap(s)
>
> Actually, which mode do you use, "never", "always" or "conditional"? Or
> do you have a different approach downstream?
>

We are using "conditional", but I think we don't really require any specific mode, because we drop the bitmaps after mirroring (even in the failure case). Fabian, please correct me if I'm wrong.

> Why am I asking:
>
> These modes (for backup) were developed prior to the
> block-dirty-bitmap-merge command, which allowed copying bitmaps as you
> want. With that API, we actually don't need all these modes; instead,
> it's enough to pass a bitmap, which would be _actually_ used by mirror.
>
> So, if you need "never" mode, you just copy your bitmap by
> block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to
> the mirror job.
>
> Or, you pass your bitmap to the mirror job, and have an "always" mode.
>
> And I don't see why we need a "conditional" mode, which actually just
> drops away the progress we actually made. (OK, we failed, but why drop
> the progress of successfully copied clusters?)
>

I'm not sure actually. Maybe John remembers?

I see, I'll drop the 'bitmap-mode' in the next version if nobody complains :)

>
> Using a user-given bitmap in the mirror job also has the additional
> advantage of live progress: up to visualizing the disk copy progress
> via the dirty bitmap contents.
>

Best Regards,
Fiona
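The bitmap-copying approach described above can be done today with existing QMP commands. For example, to emulate the "never" mode, copy the user bitmap into a fresh working bitmap and pass the copy to the mirror job (the node and bitmap names here are made up):

```json
{ "execute": "block-dirty-bitmap-add",
  "arguments": { "node": "drive0", "name": "mirror-work" } }
{ "execute": "block-dirty-bitmap-merge",
  "arguments": { "node": "drive0", "target": "mirror-work",
                 "bitmaps": [ "user-bitmap" ] } }
```

Since the job then consumes only "mirror-work", the original "user-bitmap" is left untouched, which is exactly the "never" semantics; passing "user-bitmap" directly would give "always" semantics.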
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 28.02.24 um 17:06 schrieb Vladimir Sementsov-Ogievskiy:
> On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote:
>> On 16.02.24 13:55, Fiona Ebner wrote:
>>> Now, the IO test added in patch 4/4 actually contains yet another use
>>> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
>>> images, which only contain the delta and can be rebased later. I had to
>>> adapt the IO test, because its output expected the mirror bitmap to
>>> still be dirty, but nowadays the mirror is apparently already done
>>> when the bitmaps are queried. So I thought, I'll just use
>>> 'write-blocking' mode to avoid any potential timing issues.
>>>
>>> But this exposed an issue with the diff image approach. If a write is
>>> not aligned to the granularity of the mirror target, then rebasing the
>>> diff image onto a backing image will not yield the desired result,
>>> because the full cluster is considered to be allocated and will "hide"
>>> some part of the base/backing image. The failure can be seen by either
>>> using 'write-blocking' mode in the IO test or setting the (bitmap)
>>> granularity to 32 KiB rather than the current 64 KiB.
>>>
>>> The question is how to deal with these edge cases? Some possibilities
>>> that would make sense to me:
>>>
>>> For 'background' mode:
>>> * prohibit if the target's cluster size is larger than the bitmap
>>> granularity
>>> * document the limitation
>>>
>>> For 'write-blocking' mode:
>>> * disallow in combination with bitmap mode (would not be happy about
>>> it, because I'd like to use this without diff images)
>>
>> why not just require the same: bitmap granularity must be >= target
>> granularity
>>

For the iotest's use case, that only works for background mode. I'll explain below.

>>> * for writes that are not aligned to the target's cluster size, read
>>> the relevant/missing parts from the source image to be able to write
>>> whole target clusters (seems rather complex)
>>
>> There is another approach: consider an unaligned part of the request
>> that fits in one cluster (we can always split any request into an
>> "aligned" middle part and at most two small "unaligned" parts, each
>> fitting into one cluster).
>>
>> We have two possibilities:
>>
>> 1. the cluster is dirty (marked dirty in the bitmap used by the
>> background process)
>>
>> We can simply ignore this part and rely on the background process. This
>> will not affect the convergence of the mirror job.
>>

Agreed.

>> 2. the cluster is clear (i.e. the background process, or some previous
>> write, already copied it)
>>

The iotest creates a new target image for each incremental sync which only records the diff relative to the previous mirror and those diff images are later rebased onto each other to get the full picture.

Thus, it can be that a previous mirror job (not just background process or previous write) already copied a cluster, and in particular, copied it to a different target!

>> In this case, we are safe to do the unaligned write, as the target
>> cluster must be allocated.

Because the diff image is new, the target's cluster is not necessarily allocated. When using write-blocking and a write of, e.g., 9 bytes to a clear source cluster comes in, only those 9 bytes are written to the target. Now the target's cluster is allocated, but with only those 9 bytes of data. When rebasing, the previously copied cluster is "masked" and when reading the rebased image, we only see the cluster with those 9 bytes (and IIRC, zeroes for the rest of the cluster rather than the previously copied data).

>>
>> (for bitmap-mode, I don't consider here clusters that are clear from
>> the start, which we shouldn't copy in any case)
>>

We do need to copy new writes to any cluster, and with a clear cluster and write-blocking, the issue can manifest.

>
> Hmm, right, and that's exactly the logic we already have in
> do_sync_target_write(). So it's enough to just require that
> bitmap_granularity >= target_granularity
>

Best Regards,
Fiona
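The request splitting described above (an "aligned" middle part plus at most two small "unaligned" edge parts, each contained in a single cluster) can be sketched as follows; this is an illustrative Python helper, not QEMU's implementation:

```python
def split_request(offset, length, cluster_size):
    """Split [offset, offset+length) into head/middle/tail fragments.

    head and tail are the unaligned parts (each contained within one
    cluster); middle is aligned to cluster_size. Empty fragments are
    dropped, so the result has at most three (offset, length) pairs.
    """
    end = offset + length
    # First cluster boundary at or after offset, clamped to the request.
    aligned_start = min(end, -(-offset // cluster_size) * cluster_size)
    # Last cluster boundary at or before end, clamped from below.
    aligned_end = max(aligned_start, (end // cluster_size) * cluster_size)
    head = (offset, aligned_start - offset)
    middle = (aligned_start, aligned_end - aligned_start)
    tail = (aligned_end, end - aligned_end)
    return [frag for frag in (head, middle, tail) if frag[1] > 0]
```

Each resulting unaligned fragment can then be handled per the two cases above (cluster dirty: ignore; cluster clear: write or mark dirty), while the aligned middle is copied as whole clusters.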
Re: [RFC 1/4] drive-mirror: add support for sync=bitmap mode=never
Am 21.02.24 um 07:55 schrieb Markus Armbruster:
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index ab5a93a966..ac05483958 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2181,6 +2181,15 @@
>>  # destination (all the disk, only the sectors allocated in the
>>  # topmost image, or only new I/O).
>>  #
>> +# @bitmap: The name of a bitmap to use for sync=bitmap mode. This
>> +#     argument must be present for bitmap mode and absent otherwise.
>> +#     The bitmap's granularity is used instead of @granularity.
>> +#     (Since 9.0).
>
> What happens when the user specifies @granularity anyway? Error or
> silently ignored?
>

It's an error:

>> +if (bitmap) {
>> +    if (granularity) {
>> +        error_setg(errp, "granularity (%d)"
>> +                   "cannot be specified when a bitmap is provided",
>> +                   granularity);
>> +        return NULL;
>> +    }

>> +#
>> +# @bitmap-mode: Specifies the type of data the bitmap should contain
>> +#     after the operation concludes. Must be present if sync is
>> +#     "bitmap". Must NOT be present otherwise. (Since 9.0)
>
> Members that must be present when and only when some enum member has a
> certain value should perhaps be in a union branch. Perhaps the block
> maintainers have an opinion here.
>

Sounds sensible to me. Considering also the next patches, in the end it could be a union discriminated by @sync: it would contain @bitmap and @bitmap-mode for the 'bitmap' sync mode, only @bitmap for the 'incremental' sync mode (@bitmap-mode needs to be 'on-success' then, so there is no choice for the user), and @granularity for the other sync modes.

Best Regards,
Fiona
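The union idea could look roughly like this in the schema. This is purely illustrative: all type and branch names are invented here and none of this is the actual proposed interface:

```
{ 'union': 'MirrorSyncOptions',
  'base': { 'sync': 'MirrorSyncMode' },
  'discriminator': 'sync',
  'data': { 'bitmap':      'MirrorBitmapOptions',
            'incremental': 'MirrorIncrementalOptions',
            'full':        'MirrorGranularityOptions',
            'top':         'MirrorGranularityOptions',
            'none':        'MirrorGranularityOptions' } }
```

The branch types would then carry @bitmap and @bitmap-mode, only @bitmap, or @granularity respectively, so the "must be present if and only if" constraints become structural instead of documented.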
[RFC 1/4] drive-mirror: add support for sync=bitmap mode=never
From: John Snow This patch adds support for the "BITMAP" sync mode to drive-mirror and blockdev-mirror. It adds support only for the BitmapSyncMode "never," because it's the simplest mode. This mode simply uses a user-provided bitmap as an initial copy manifest, and then does not clear any bits in the bitmap at the conclusion of the operation. Any new writes dirtied during the operation are copied out, in contrast to backup. Note that whether these writes are reflected in the bitmap at the conclusion of the operation depends on whether that bitmap is actually recording! This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily. Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 update version and formatting in QAPI] Signed-off-by: Fiona Ebner --- block/mirror.c | 96 -- blockdev.c | 38 +- include/block/block_int-global-state.h | 4 +- qapi/block-core.json | 25 ++- tests/unit/test-block-iothread.c | 4 +- 5 files changed, 139 insertions(+), 28 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 5145eb53e1..315dff11e2 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -73,6 +73,8 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +BdrvDirtyBitmap *sync_bitmap; +BitmapSyncMode bitmap_mode; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -718,7 +720,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == 
MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -815,6 +818,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (s->sync_bitmap) { +bdrv_dirty_bitmap_set_busy(s->sync_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1011,7 +1024,8 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if ((s->sync_mode == MIRROR_SYNC_MODE_TOP) || +(s->sync_mode == MIRROR_SYNC_MODE_FULL)) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(>common.job)) { goto immediate_exit; @@ -1302,6 +1316,7 @@ static const BlockJobDriver mirror_job_driver = { .run= mirror_run, .prepare= mirror_prepare, .abort = mirror_abort, +.clean = mirror_clean, .pause = mirror_pause, .complete = mirror_complete, .cancel = mirror_cancel, @@ -1320,6 +1335,7 @@ static const BlockJobDriver commit_active_job_driver = { .run= mirror_run, .prepare= mirror_prepare, .abort = mirror_abort, +.clean = mirror_clean, .pause = mirror_pause, .complete = mirror_complete, .cancel = commit_active_cancel, @@ -1712,7 +1728,10 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BdrvDirtyBitmap *bitmap, + BitmapSyncMode bitmap_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name,
[RFC 0/4] mirror: implement incremental and bitmap modes
Previous discussion from when this was sent upstream [0] (it's been a while). I rebased the patches and re-ordered and squashed like suggested back then [1].

This implements two new mirror modes:

- bitmap mirror mode with always/on-success/never bitmap sync mode
- incremental mirror mode as sugar for bitmap + on-success

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for about 4 years now. In particular, like mentioned in [2]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)

Now, the IO test added in patch 4/4 actually contains yet another use case, namely doing incremental mirrors to stand-alone qcow2 "diff" images, which only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues.

But this exposed an issue with the diff image approach. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB.

The question is how to deal with these edge cases? Some possibilities that would make sense to me:

For 'background' mode:
* prohibit if the target's cluster size is larger than the bitmap granularity
* document the limitation

For 'write-blocking' mode:
* disallow in combination with bitmap mode (would not be happy about it, because I'd like to use this without diff images)
* for writes that are not aligned to the target's cluster size, read the relevant/missing parts from the source image to be able to write whole target clusters (seems rather complex)
* document the limitation

[0]: https://lore.kernel.org/qemu-devel/20200218100740.2228521-1-f.gruenbich...@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/d35a76de-78d5-af56-0b34-f7bd2bbd3...@redhat.com/
[2]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/

Fabian Grünbichler (2):
  mirror: move some checks to qmp
  iotests: add test for bitmap mirror

John Snow (2):
  drive-mirror: add support for sync=bitmap mode=never
  drive-mirror: add support for conditional and always bitmap sync modes

 block/mirror.c                                 |   94 +-
 blockdev.c                                     |   70 +-
 include/block/block_int-global-state.h         |    4 +-
 qapi/block-core.json                           |   25 +-
 tests/qemu-iotests/tests/bitmap-sync-mirror    |  550
 .../qemu-iotests/tests/bitmap-sync-mirror.out  | 2810 +
 tests/unit/test-block-iothread.c               |    4 +-
 7 files changed, 3527 insertions(+), 30 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

-- 
2.39.2
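The masking problem with unaligned writes to a fresh diff image can be illustrated with a toy model. This is pure Python with a hypothetical 16-byte cluster size, modeling only the allocation semantics relevant here (a qcow2 cluster, once allocated in the overlay, completely hides the backing file's cluster on read):

```python
CLUSTER = 16

def rebase_read(diff, allocated, backing, cluster):
    """Read one cluster of the rebased image: an allocated diff
    cluster completely hides the backing file's data."""
    if allocated[cluster]:
        return diff[cluster]
    return backing[cluster]

# Backing image: cluster 0 fully populated by a previous mirror.
backing = {0: b"A" * CLUSTER}
# Fresh diff image: nothing allocated yet, reads as zeroes.
diff = {0: bytes(CLUSTER)}
allocated = {0: False}

# A 9-byte unaligned guest write copied in write-blocking mode
# allocates the whole target cluster but fills only 9 bytes.
data = b"B" * 9
diff[0] = data + diff[0][len(data):]
allocated[0] = True

# After rebasing, the tail of the cluster reads as zeroes instead of
# the previously mirrored b"A" data.
result = rebase_read(diff, allocated, backing, 0)
```

With an unallocated diff cluster the read would fall through to the backing data, which is why the problem only appears once a partial write allocates the cluster.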
[RFC 2/4] drive-mirror: add support for conditional and always bitmap sync modes
From: John Snow Teach mirror two new tricks for using bitmaps: Always: no matter what, we synchronize the copy_bitmap back to the sync_bitmap. In effect, this allows us resume a failed mirror at a later date. Conditional: On success only, we sync the bitmap. This is akin to incremental backup modes; we can use this bitmap to later refresh a successfully created mirror. Originally-by: John Snow [FG: add check for bitmap-mode without bitmap switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0] Signed-off-by: Fiona Ebner --- The original patch this was based on came from a WIP git branch and thus has no Signed-off-by trailer from John, see [0]. I added an Originally-by trailer for now. Let me know if I should drop that and wait for John's Signed-off-by instead. [0] https://lore.kernel.org/qemu-devel/1599140071.n44h532eeu.astr...@nora.none/ block/mirror.c | 24 ++-- blockdev.c | 3 +++ 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 315dff11e2..84155b1f78 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -689,8 +689,6 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); - /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ bdrv_ref(src); @@ -796,6 +794,18 @@ static int mirror_exit_common(Job *job) bdrv_drained_end(target_bs); bdrv_unref(target_bs); +if (s->sync_bitmap) { +if (s->bitmap_mode == BITMAP_SYNC_MODE_ALWAYS || +(s->bitmap_mode == BITMAP_SYNC_MODE_ON_SUCCESS && + job->ret == 0 && ret == 0)) { +/* Success; synchronize copy back to sync. 
*/ +bdrv_clear_dirty_bitmap(s->sync_bitmap, NULL); +bdrv_dirty_bitmap_merge_internal(s->sync_bitmap, s->dirty_bitmap, + NULL, true); +} +} +bdrv_release_dirty_bitmap(s->dirty_bitmap); + bs_opaque->job = NULL; bdrv_drained_end(src); @@ -1755,10 +1765,6 @@ static BlockJob *mirror_start_job( " sync mode", MirrorSyncMode_str(sync_mode)); return NULL; -} else if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { -error_setg(errp, - "Bitmap Sync Mode '%s' is not supported by Mirror", - BitmapSyncMode_str(bitmap_mode)); } } else if (bitmap) { error_setg(errp, @@ -1775,6 +1781,12 @@ static BlockJob *mirror_start_job( return NULL; } granularity = bdrv_dirty_bitmap_granularity(bitmap); + +if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { +if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { +return NULL; +} +} } else if (granularity == 0) { granularity = bdrv_get_default_bitmap_granularity(target); } diff --git a/blockdev.c b/blockdev.c index c65d9ded70..aeb9fde9f3 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2873,6 +2873,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) { return; } +} else if (has_bitmap_mode) { +error_setg(errp, "Cannot specify bitmap sync mode without a bitmap"); +return; } if (!replaces) { -- 2.39.2
[RFC 3/4] mirror: move some checks to qmp
From: Fabian Grünbichler and assert the passing conditions in block/mirror.c. while incremental mode was never available for drive-mirror, it makes the interface more uniform w.r.t. backup block jobs. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0] Signed-off-by: Fiona Ebner --- block/mirror.c | 28 +--- blockdev.c | 29 + 2 files changed, 34 insertions(+), 23 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 84155b1f78..15d1c060eb 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -1755,31 +1755,13 @@ static BlockJob *mirror_start_job( GLOBAL_STATE_CODE(); -if (sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(sync_mode)); -return NULL; -} else if (sync_mode == MIRROR_SYNC_MODE_BITMAP) { -if (!bitmap) { -error_setg(errp, "Must provide a valid bitmap name for '%s'" - " sync mode", - MirrorSyncMode_str(sync_mode)); -return NULL; -} -} else if (bitmap) { -error_setg(errp, - "sync mode '%s' is not compatible with bitmaps", - MirrorSyncMode_str(sync_mode)); -return NULL; -} +/* QMP interface protects us from these cases */ +assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); +assert((bitmap && sync_mode == MIRROR_SYNC_MODE_BITMAP) || + (!bitmap && sync_mode != MIRROR_SYNC_MODE_BITMAP)); +assert(!(bitmap && granularity)); if (bitmap) { -if (granularity) { -error_setg(errp, "granularity (%d)" - "cannot be specified when a bitmap is provided", - granularity); -return NULL; -} granularity = bdrv_dirty_bitmap_granularity(bitmap); if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { diff --git a/blockdev.c b/blockdev.c index aeb9fde9f3..519f408359 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2852,7 +2852,36 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, sync = MIRROR_SYNC_MODE_FULL; } +if ((sync == MIRROR_SYNC_MODE_BITMAP) || +(sync == MIRROR_SYNC_MODE_INCREMENTAL)) { +/* done before desugaring 'incremental' to print the right message 
*/ +if (!bitmap_name) { +error_setg(errp, "Must provide a valid bitmap name for " + "'%s' sync mode", MirrorSyncMode_str(sync)); +return; +} +} + +if (sync == MIRROR_SYNC_MODE_INCREMENTAL) { +if (has_bitmap_mode && +bitmap_mode != BITMAP_SYNC_MODE_ON_SUCCESS) { +error_setg(errp, "Bitmap sync mode must be '%s' " + "when using sync mode '%s'", + BitmapSyncMode_str(BITMAP_SYNC_MODE_ON_SUCCESS), + MirrorSyncMode_str(sync)); +return; +} +has_bitmap_mode = true; +sync = MIRROR_SYNC_MODE_BITMAP; +bitmap_mode = BITMAP_SYNC_MODE_ON_SUCCESS; +} + if (bitmap_name) { +if (sync != MIRROR_SYNC_MODE_BITMAP) { +error_setg(errp, "Sync mode '%s' not supported with bitmap.", + MirrorSyncMode_str(sync)); +return; +} if (granularity) { error_setg(errp, "Granularity and bitmap cannot both be set"); return; -- 2.39.2
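The checks moved into blockdev_mirror_common() above, including the desugaring of 'incremental' into 'bitmap' + 'on-success', can be sketched as follows. Only the checks shown in this hunk are modeled; modes are plain strings and the error texts only approximate the C code:

```python
# Hypothetical sketch of the QMP-level validation/desugaring; not QEMU code.
def desugar_mirror_args(sync, bitmap_name, bitmap_mode):
    """Validate sync/bitmap arguments and desugar 'incremental' mode.

    bitmap_mode is None when the caller did not specify it
    (i.e. has_bitmap_mode == false).
    """
    # done before desugaring 'incremental' to print the right message
    if sync in ("bitmap", "incremental") and not bitmap_name:
        raise ValueError(
            f"Must provide a valid bitmap name for '{sync}' sync mode")
    if sync == "incremental":
        if bitmap_mode is not None and bitmap_mode != "on-success":
            raise ValueError("Bitmap sync mode must be 'on-success' "
                             "when using sync mode 'incremental'")
        sync, bitmap_mode = "bitmap", "on-success"
    if bitmap_name and sync != "bitmap":
        raise ValueError(f"Sync mode '{sync}' not supported with bitmap.")
    return sync, bitmap_mode
```

With the checks done here, mirror_start_job() can assert the conditions instead of reporting errors.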
[RFC 4/4] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - no inclusion of not-yet-available full/top sync modes in combination with bitmaps - extra set of reference/test mirrors to verify that writes in parallel with active mirror work intentionally keeping copyright and ownership of original test case to honor provenance. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0, i.e. adapt to renames like vm.command -> vm.cmd, specifying explicit image format for rebase, adapt to new behavior of qemu_img(), dropping of 'status' field in output, etc. rename test from '384' to 'bitmap-sync-mirror'] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/bitmap-sync-mirror | 550 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 + 2 files changed, 3360 insertions(+) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror new file mode 100755 index 00..6cd9b74dac --- /dev/null +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -0,0 +1,550 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import math +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. 
+Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits |= set(bits) + +def dirty_group(self, n): +self.dirty_bits(GROUPS[n].bits(self.granularity)) + +def clear(self): +self._bits = set() + +def clear_bits(self, b
[PATCH] iotests: adapt to output change for recently introduced 'detached header' field
Failure was noticed when running the tests for the qcow2 image format. Fixes: 0bd779e27e ("crypto: Introduce 'detached-header' field in QCryptoBlockInfoLUKS") Signed-off-by: Fiona Ebner --- tests/qemu-iotests/198.out | 2 ++ tests/qemu-iotests/206.out | 1 + 2 files changed, 3 insertions(+) diff --git a/tests/qemu-iotests/198.out b/tests/qemu-iotests/198.out index 805494916f..62fb73fa3e 100644 --- a/tests/qemu-iotests/198.out +++ b/tests/qemu-iotests/198.out @@ -39,6 +39,7 @@ Format specific information: compression type: COMPRESSION_TYPE encrypt: ivgen alg: plain64 +detached header: false hash alg: sha256 cipher alg: aes-256 uuid: ---- @@ -84,6 +85,7 @@ Format specific information: compression type: COMPRESSION_TYPE encrypt: ivgen alg: plain64 +detached header: false hash alg: sha256 cipher alg: aes-256 uuid: ---- diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out index 7e95694777..979f00f9bf 100644 --- a/tests/qemu-iotests/206.out +++ b/tests/qemu-iotests/206.out @@ -114,6 +114,7 @@ Format specific information: refcount bits: 16 encrypt: ivgen alg: plain64 +detached header: false hash alg: sha1 cipher alg: aes-128 uuid: ---- -- 2.39.2
Re: double free or corruption (out) in iscsi virtual machine
Am 17.01.24 um 08:23 schrieb M_O_Bz: > Basic Info: > 1. Issue: I got a "double free or corruption (out)", see the > attachment debug.log for details, the debug.log prints the backtrace of > one virtual machine > 2. Reproduce: currently I can't describe how to reproduce this bug, > because it's in my production environment which includes some special stuff > 3. qemu version: I'm using qemu-6.0.1 > 4. qemu cmdline in short: (check out details in the virtual machine log > message) Hi, sounds like it might be the issue fixed by: https://github.com/qemu/qemu/commit/5080152e2ef6cde7aa692e29880c62bd54acb750 Best Regards, Fiona
Re: [PATCH v2] qemu_init: increase NOFILE soft limit on POSIX
Am 18.12.23 um 11:13 schrieb Fiona Ebner: > In many configurations, e.g. multiple vNICs with multiple queues or > with many Ceph OSDs, the default soft limit of 1024 is not enough. > QEMU is supposed to work fine with file descriptors >= 1024 and does > not use select() on POSIX. Bump the soft limit to the allowed hard > limit to avoid issues with the aforementioned configurations. > > Of course the limit could be raised from the outside, but the man page > of systemd.exec states about 'LimitNOFILE=': > >> Don't use. >> [...] >> Typically applications should increase their soft limit to the hard >> limit on their own, if they are OK with working with file >> descriptors above 1023, > > If the soft limit is already the same as the hard limit, avoid the > superfluous setrlimit call. This can avoid a warning with a strict > seccomp filter blocking setrlimit if NOFILE was already raised before > executing QEMU. > > Buglink: https://bugzilla.proxmox.com/show_bug.cgi?id=4507 > Signed-off-by: Fiona Ebner > --- > Ping
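The logic of the patch can be sketched with Python's resource module standing in for the getrlimit()/setrlimit() calls QEMU makes in C — a rough analogue for illustration, not the actual implementation:

```python
# Sketch: raise the NOFILE soft limit to the hard limit, skipping the
# superfluous setrlimit() call when the soft limit is already there
# (which avoids tripping strict seccomp filters that block setrlimit).
import resource

def raise_nofile_soft_limit():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # raising the soft limit up to the hard limit needs no privileges
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

After the call, file descriptors up to the hard limit can be opened without hitting EMFILE at the default soft limit of 1024.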
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 25.01.24 um 18:22 schrieb Vladimir Sementsov-Ogievskiy: > > Hmm. Taking maximum is not optimal for usual case without > discard-source: user may want to work in smaller granularity than > source, to save disk space. > > In case with discarding we have two possibilities: > > - either take larger granularity for the whole process like you propose > (but this will need and option for CBW?) > - or, fix discarding bitmap in CBW to work like normal discard: it > should be aligned down. This will lead actually to discard-source option > doing nothing.. > > == > But why do you want fleecing image with larger granularity? Is that a > real case or just experimenting? Still we should fix assertion anyway. > Yes, it's a real use case. We do support different storage types and want to allow users to place the fleecing image on a different storage than the original image for flexibility. I ran into the issue when backing up to a target with 1 MiB cluster_size while using a fleecing image on RBD (which has 4 MiB cluster_size by default). In theory, I guess I could look into querying the cluster_size of the backup target and trying to allocate the fleecing image with a small enough cluster_size. But not sure if that would work on all storage combinations, and would require teaching our storage plugin API (which also supports third-party plugins) to perform allocation with a specific cluster size. So not an ideal solution for us. > I think: > > 1. fix discarding bitmap to make aligning-down (will do that for v3) > Thanks! > 2. if we need another logic for block_copy_calculate_cluster_size() it > should be an option. May be explicit "copy-cluster-size" or > "granularity" option for CBW driver and for backup job. And we'll just > check that given cluster-size is power of two >= target_size. > I'll try to implement point 2. That should resolve the issue for our use case. Best Regards, Fiona
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 24.01.24 um 16:03 schrieb Fiona Ebner: > Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy: >> Add a parameter that enables discard-after-copy. That is mostly useful >> in "push backup with fleecing" scheme, when source is snapshot-access >> format driver node, based on copy-before-write filter snapshot-access >> API: >> >> [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >>|| >>| root | file >>vv >> [copy-before-write] >>| | >>| file| target >>v v >> [active disk] [temp.img] >> >> In this case discard-after-copy does two things: >> >> - discard data in temp.img to save disk space >> - avoid further copy-before-write operation in discarded area >> >> Note that we have to declare WRITE permission on source in >> copy-before-write filter, for discard to work. >> >> Signed-off-by: Vladimir Sementsov-Ogievskiy > > Ran into another issue when the cluster_size of the fleecing image is > larger than for the backup target, e.g. > >> #!/bin/bash >> rm /tmp/fleecing.qcow2 >> ./qemu-img create /tmp/disk.qcow2 -f qcow2 1G >> ./qemu-img create /tmp/fleecing.qcow2 -o cluster_size=2M -f qcow2 1G >> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G >> ./qemu-system-x86_64 --qmp stdio \ >> --blockdev >> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \ >> --blockdev >> qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2,discard=unmap >> \ >> --blockdev >> qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \ >> <> {"execute": "qmp_capabilities"} >> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", >> "file": "node0", "target": "node1", "node-name": "node3" } } >> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", >> "file": "node3", "discard": "unmap", "node-name": "snap0" } } >> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": >> "node2", "sync": "full", "job-id": "backup0", "discard-source": true } } >> EOF > > will fail with 
> >> qemu-system-x86_64: ../util/hbitmap.c:570: hbitmap_reset: Assertion >> `QEMU_IS_ALIGNED(count, gran) || (start + count == hb->orig_size)' failed. > > Backtrace shows the assert happens while discarding, when resetting the > BDRVCopyBeforeWriteState access_bitmap > > #6 0x56142a2a in hbitmap_reset (hb=0x57e01b80, start=0, > count=1048576) at ../util/hbitmap.c:570 >> #7 0x55f80764 in bdrv_reset_dirty_bitmap_locked >> (bitmap=0x5850a660, offset=0, bytes=1048576) at >> ../block/dirty-bitmap.c:563 >> #8 0x55f807ab in bdrv_reset_dirty_bitmap (bitmap=0x5850a660, >> offset=0, bytes=1048576) at ../block/dirty-bitmap.c:570 >> #9 0x55f7bb16 in cbw_co_pdiscard_snapshot (bs=0x581a7f60, >> offset=0, bytes=1048576) at ../block/copy-before-write.c:330 >> #10 0x55f8d00a in bdrv_co_pdiscard_snapshot (bs=0x581a7f60, >> offset=0, bytes=1048576) at ../block/io.c:3734 >> #11 0x55fd2380 in snapshot_access_co_pdiscard (bs=0x582b4f60, >> offset=0, bytes=1048576) at ../block/snapshot-access.c:55 >> #12 0x55f8b65d in bdrv_co_pdiscard (child=0x584fe790, offset=0, >> bytes=1048576) at ../block/io.c:3144 >> #13 0x55f78650 in block_copy_task_entry (task=0x57f588f0) at >> ../block/block-copy.c:597 > > My guess for the cause is that in block_copy_calculate_cluster_size() we > only look at the target. But now that we need to discard the source, > we'll also need to consider that for the calculation? > Just querying the source and picking the maximum won't work either, because snapshot-access does not currently implement .bdrv_co_get_info and because copy-before-write (doesn't implement .bdrv_co_get_info and is a filter) will just return the info of its file child. But the discard will go to the target child. If I do 1. .bdrv_co_get_info in snapshot-access: return info from file child 2. .bdrv_co_get_info in copy-before-write: return maximum cluster_size from file child and target child 3. 
block_copy_calculate_cluster_size: return the maximum from source and target then the issue does go away, but I don't know whether that violates any assumptions, and there is probably a better way to avoid the issue? Best Regards, Fiona
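A toy model of the failing assertion may help illustrate the mismatch: per the backtrace, the backup discards in 1 MiB requests (the target's cluster size), but — going by the analysis above — the access bitmap in the copy-before-write filter uses the fleecing image's larger 2 MiB granularity, so the request is neither aligned nor reaching the end of the bitmap. Everything here is a simplified stand-in for hbitmap_reset(), not QEMU code:

```python
# Simplified model of the assertion in hbitmap_reset():
# assert(QEMU_IS_ALIGNED(count, gran) || (start + count == hb->orig_size));
MiB = 1024 * 1024

def is_aligned(value, alignment):
    return value % alignment == 0

def hbitmap_reset_ok(start, count, granularity, orig_size):
    """True iff resetting [start, start+count) passes the alignment checks."""
    return is_aligned(start, granularity) and (
        is_aligned(count, granularity) or start + count == orig_size)
```

A 1 MiB discard against a 2 MiB-granularity bitmap fails the check, while the same request aligned down (or up) to the bitmap granularity would pass — which is why aligning down in the CBW discard path avoids the assertion.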
Re: [PATCH 1/2] virtio-scsi: Attach event vq notifier with no_poll
Am 24.01.24 um 18:38 schrieb Hanna Czenczek: > As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: > don't waste CPU polling the event virtqueue"), we only attach an io_read > notifier for the virtio-scsi event virtqueue instead, and no polling > notifiers. During operation, the event virtqueue is typically > non-empty, but none of the buffers are intended to be used immediately. > Instead, they only get used when certain events occur. Therefore, it > makes no sense to continuously poll it when non-empty, because it is > supposed to be and stay non-empty. > > We do this by using virtio_queue_aio_attach_host_notifier_no_poll() > instead of virtio_queue_aio_attach_host_notifier() for the event > virtqueue. > > Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement > BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use > virtio_queue_aio_attach_host_notifier() for all virtqueues, including > the event virtqueue. This can lead to it being polled again, undoing > the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552. > > Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the > event virtqueue. > > Reported-by: Fiona Ebner > Fixes: 766aa2de0f29b657148e04599320d771c36fd126 >("virtio-scsi: implement BlockDevOps->drained_begin()") > Signed-off-by: Hanna Czenczek Tested-by: Fiona Ebner Reviewed-by: Fiona Ebner
Re: [PATCH 2/2] virtio: Keep notifications disabled during drain
Am 24.01.24 um 18:38 schrieb Hanna Czenczek: > During drain, we do not care about virtqueue notifications, which is why > we remove the handlers on it. When removing those handlers, whether vq > notifications are enabled or not depends on whether we were in polling > mode or not; if not, they are enabled (by default); if so, they have > been disabled by the io_poll_start callback. > > Because we do not care about those notifications after removing the > handlers, this is fine. However, we have to explicitly ensure they are > enabled when re-attaching the handlers, so we will resume receiving > notifications. We do this in virtio_queue_aio_attach_host_notifier*(). > If such a function is called while we are in a polling section, > attaching the notifiers will then invoke the io_poll_start callback, > re-disabling notifications. > > Because we will always miss virtqueue updates in the drained section, we > also need to poll the virtqueue once after attaching the notifiers. > > Buglink: https://issues.redhat.com/browse/RHEL-3934 > Signed-off-by: Hanna Czenczek Tested-by: Fiona Ebner Reviewed-by: Fiona Ebner
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy: > Add a parameter that enables discard-after-copy. That is mostly useful > in "push backup with fleecing" scheme, when source is snapshot-access > format driver node, based on copy-before-write filter snapshot-access > API: > > [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >|| >| root | file >vv > [copy-before-write] >| | >| file| target >v v > [active disk] [temp.img] > > In this case discard-after-copy does two things: > > - discard data in temp.img to save disk space > - avoid further copy-before-write operation in discarded area > > Note that we have to declare WRITE permission on source in > copy-before-write filter, for discard to work. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Ran into another issue when the cluster_size of the fleecing image is larger than for the backup target, e.g. > #!/bin/bash > rm /tmp/fleecing.qcow2 > ./qemu-img create /tmp/disk.qcow2 -f qcow2 1G > ./qemu-img create /tmp/fleecing.qcow2 -o cluster_size=2M -f qcow2 1G > ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \ > --blockdev > qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2,discard=unmap > \ > --blockdev > qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", > "file": "node0", "target": "node1", "node-name": "node3" } } > {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", > "file": "node3", "discard": "unmap", "node-name": "snap0" } } > {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": > "node2", "sync": "full", "job-id": "backup0", "discard-source": true } } > EOF will fail with > qemu-system-x86_64: ../util/hbitmap.c:570: hbitmap_reset: Assertion > `QEMU_IS_ALIGNED(count, gran) 
|| (start + count == hb->orig_size)' failed. Backtrace shows the assert happens while discarding, when resetting the BDRVCopyBeforeWriteState access_bitmap > #6 0x56142a2a in hbitmap_reset (hb=0x57e01b80, start=0, count=1048576) at ../util/hbitmap.c:570 > #7 0x55f80764 in bdrv_reset_dirty_bitmap_locked > (bitmap=0x5850a660, offset=0, bytes=1048576) at > ../block/dirty-bitmap.c:563 > #8 0x55f807ab in bdrv_reset_dirty_bitmap (bitmap=0x5850a660, > offset=0, bytes=1048576) at ../block/dirty-bitmap.c:570 > #9 0x55f7bb16 in cbw_co_pdiscard_snapshot (bs=0x581a7f60, > offset=0, bytes=1048576) at ../block/copy-before-write.c:330 > #10 0x55f8d00a in bdrv_co_pdiscard_snapshot (bs=0x581a7f60, > offset=0, bytes=1048576) at ../block/io.c:3734 > #11 0x55fd2380 in snapshot_access_co_pdiscard (bs=0x582b4f60, > offset=0, bytes=1048576) at ../block/snapshot-access.c:55 > #12 0x55f8b65d in bdrv_co_pdiscard (child=0x584fe790, offset=0, > bytes=1048576) at ../block/io.c:3144 > #13 0x55f78650 in block_copy_task_entry (task=0x57f588f0) at > ../block/block-copy.c:597 My guess for the cause is that in block_copy_calculate_cluster_size() we only look at the target. But now that we need to discard the source, we'll also need to consider that for the calculation? Best Regards, Fiona
[PATCH v3 1/2] ui/clipboard: mark type as not available when there is no data
With VNC, a client can send a non-extended VNC_MSG_CLIENT_CUT_TEXT message with len=0. In qemu_clipboard_set_data(), the clipboard info will be updated, setting data to NULL (because g_memdup(data, size) returns NULL when size is 0). If the client does not set the VNC_ENCODING_CLIPBOARD_EXT feature when setting up the encodings, then the 'request' callback for the clipboard peer is not initialized. Later, because data is NULL, qemu_clipboard_request() can be reached via vdagent_chr_write() and vdagent_clipboard_recv_request() and there, the clipboard owner's 'request' callback will be invoked, but that callback is a NULL pointer. In particular, this can happen when using the KRDC (22.12.3) VNC client. Another scenario leading to the same issue is with two clients (say noVNC and KRDC): The noVNC client sets the extension VNC_FEATURE_CLIPBOARD_EXT and initializes its cbpeer. The KRDC client does not, but triggers a vnc_client_cut_text() (note it's not the _ext variant). There, a new clipboard info with it as the 'owner' is created and qemu_clipboard_set_data() is called, which in turn calls qemu_clipboard_update() with that info. In qemu_clipboard_update(), the notifier for the noVNC client will be called, i.e. vnc_clipboard_notify(), which will also set vs->cbinfo for the noVNC client. The 'owner' in that clipboard info is the clipboard peer for the KRDC client, which did not initialize the 'request' function. That sounds correct to me, it is the owner of that clipboard info. Then when noVNC sends a VNC_MSG_CLIENT_CUT_TEXT message (it did set the VNC_FEATURE_CLIPBOARD_EXT feature correctly, so a check for it passes), that clipboard info is passed to qemu_clipboard_request() and the original segfault still happens. Fix the issue by handling updates with size 0 differently. In particular, mark in the clipboard info that the type is not available. While at it, switch to g_memdup2(), because g_memdup() is deprecated. 
Cc: qemu-sta...@nongnu.org Fixes: CVE-2023-6683 Reported-by: Markus Frank Suggested-by: Marc-André Lureau Signed-off-by: Fiona Ebner --- Changes in v3: * Yet another new approach, setting 'available' to false when no data is passed in during an update. * Update commit message to focus on the fact that non-extended VNC_MSG_CLIENT_CUT_TEXT messages with len=0 are problematic. ui/clipboard.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/ui/clipboard.c b/ui/clipboard.c index 3d14bffaf8..b3f6fa3c9e 100644 --- a/ui/clipboard.c +++ b/ui/clipboard.c @@ -163,9 +163,15 @@ void qemu_clipboard_set_data(QemuClipboardPeer *peer, } g_free(info->types[type].data); -info->types[type].data = g_memdup(data, size); -info->types[type].size = size; -info->types[type].available = true; +if (size) { +info->types[type].data = g_memdup2(data, size); +info->types[type].size = size; +info->types[type].available = true; +} else { +info->types[type].data = NULL; +info->types[type].size = 0; +info->types[type].available = false; +} if (update) { qemu_clipboard_update(info); -- 2.39.2
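The fixed behavior can be modeled with a small sketch — plain dicts instead of the QemuClipboardInfo struct, and the function name is made up:

```python
# Toy model of the fixed qemu_clipboard_set_data() path: an empty update
# must mark the type as not available, instead of leaving it "available"
# with NULL data (which is what g_memdup()/g_memdup2() return for size 0).
def set_clipboard_type(info, type_name, data):
    if data:
        info[type_name] = {"data": bytes(data), "size": len(data),
                           "available": True}
    else:
        # size == 0: record the type as not available rather than
        # available-with-NULL-data, so nobody later tries to request it
        info[type_name] = {"data": None, "size": 0, "available": False}
```

This is exactly the invariant the follow-up patch turns into an assertion: a type that is available but has no data would force a 'request' callback that may not exist.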
[PATCH v3 2/2] ui/clipboard: add asserts for update and request
Should an issue like CVE-2023-6683 ever appear again in the future, it will be more obvious which assumption was violated. Suggested-by: Marc-André Lureau Signed-off-by: Fiona Ebner --- Changes in v3: * Turn check for update into an assertion. * Split out into a separate patch. ui/clipboard.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/ui/clipboard.c b/ui/clipboard.c index b3f6fa3c9e..4264884a6c 100644 --- a/ui/clipboard.c +++ b/ui/clipboard.c @@ -65,12 +65,24 @@ bool qemu_clipboard_check_serial(QemuClipboardInfo *info, bool client) void qemu_clipboard_update(QemuClipboardInfo *info) { +uint32_t type; QemuClipboardNotify notify = { .type = QEMU_CLIPBOARD_UPDATE_INFO, .info = info, }; assert(info->selection < QEMU_CLIPBOARD_SELECTION__COUNT); +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { +/* + * If data is missing, the clipboard owner's 'request' callback needs to + * be set. Otherwise, there is no way to get the clipboard data and + * qemu_clipboard_request() cannot be called. + */ +if (info->types[type].available && !info->types[type].data) { +assert(info->owner && info->owner->request); +} +} + notifier_list_notify(&clipboard_notifiers, &notify); if (cbinfo[info->selection] != info) { @@ -132,6 +144,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info, !info->owner) return; +assert(info->owner->request); + info->types[type].requested = true; info->owner->request(info, type); } -- 2.39.2
[PATCH] block/io_uring: improve error message when init fails
The man page for io_uring_queue_init states: > io_uring_queue_init(3) returns 0 on success and -errno on failure. and the man page for io_uring_setup (which is one of the functions where the return value of io_uring_queue_init() can come from) states: > On error, a negative error code is returned. The caller should not > rely on errno variable. Tested using 'sysctl kernel.io_uring_disabled=2'. Output before this change: > failed to init linux io_uring ring Output after this change: > failed to init linux io_uring ring: Operation not permitted Signed-off-by: Fiona Ebner --- block/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/io_uring.c b/block/io_uring.c index d77ae55745..d11b2051ab 100644 --- a/block/io_uring.c +++ b/block/io_uring.c @@ -432,7 +432,7 @@ LuringState *luring_init(Error **errp) rc = io_uring_queue_init(MAX_ENTRIES, ring, 0); if (rc < 0) { -error_setg_errno(errp, errno, "failed to init linux io_uring ring"); +error_setg_errno(errp, -rc, "failed to init linux io_uring ring"); g_free(s); return NULL; } -- 2.39.2
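A quick sketch of the distinction the patch relies on: io_uring_queue_init(3) returns -errno on failure, while the global errno variable is not reliable at that point, so the negated return value is what must be mapped to a message. The helper name below is made up for illustration; os.strerror() stands in for the strerror() lookup error_setg_errno() performs:

```python
# Map a -errno return code to the error text seen in the improved message.
import os

def format_init_error(rc):
    # rc is the negative return value of io_uring_queue_init()
    return f"failed to init linux io_uring ring: {os.strerror(-rc)}"
```

With 'sysctl kernel.io_uring_disabled=2', the init call fails with -EPERM, so the formatted message names "Operation not permitted" instead of whatever stale value errno happens to hold.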
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
Am 22.01.24 um 18:52 schrieb Hanna Czenczek: > On 22.01.24 18:41, Hanna Czenczek wrote: >> On 05.01.24 15:30, Fiona Ebner wrote: >>> Am 05.01.24 um 14:43 schrieb Fiona Ebner: >>>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini: >>>>> On 1/3/24 12:40, Fiona Ebner wrote: >>>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue >>>>>> with the patch, but I did run into an assertion failure when >>>>>> trying to >>>>>> verify that it fixes my original stuck-guest-IO issue. See below >>>>>> for the >>>>>> backtrace [0]. Hanna wrote in >>>>>> https://issues.redhat.com/browse/RHEL-3934 >>>>>> >>>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq) >>>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, >>>>>>> because >>>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations >>>>>>> acquire the device’s context, so this should be directly callable >>>>>>> from >>>>>>> any context. >>>>>> I guess this is not true anymore now that the AioContext locking was >>>>>> removed? >>>>> Good point and, in fact, even before it was much safer to use >>>>> virtio_queue_notify() instead. Not only does it use the event >>>>> notifier >>>>> handler, but it also calls it in the right thread/AioContext just by >>>>> doing event_notifier_set(). >>>>> >>>> But with virtio_queue_notify() using the event notifier, the >>>> CPU-usage-spike issue is present: >>>> >>>>>> Back to the CPU-usage-spike issue: I experimented around and it >>>>>> doesn't >>>>>> seem to matter whether I notify the virt queue before or after >>>>>> attaching >>>>>> the notifiers. But there's another functional difference. My patch >>>>>> called virtio_queue_notify() which contains this block: >>>>>> >>>>>>> if (vq->host_notifier_enabled) { >>>>>>> event_notifier_set(>host_notifier); >>>>>>> } else if (vq->handle_output) { >>>>>>> vq->handle_output(vdev, vq); >>>>>> In my testing, the first branch was taken, calling >>>>>> event_notifier_set(). 
>>>>>> Hanna's patch uses virtio_queue_notify_vq() and there, >>>>>> vq->handle_output() will be called. That seems to be the relevant >>>>>> difference regarding the CPU-usage-spike issue. >>>> I should mention that this is with a VirtIO SCSI disk. I also attempted >>>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but >>>> didn't manage yet. >>>> >>>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of >>>> the queues (but only one) will always show as nonempty. And then, >>>> run_poll_handlers_once() will always detect progress which explains the >>>> CPU usage. >>>> >>>> The following shows >>>> 1. vq address >>>> 2. number of times vq was passed to >>>> virtio_queue_host_notifier_aio_poll() >>>> 3. number of times the result of virtio_queue_host_notifier_aio_poll() >>>> was true for the vq >>>> >>>>> 0x555fd93f9c80 17162000 0 >>>>> 0x555fd93f9e48 17162000 6 >>>>> 0x555fd93f9ee0 17162000 0 >>>>> 0x555fd93f9d18 17162000 17162000 >>>>> 0x555fd93f9db0 17162000 0 >>>>> 0x555fd93f9f78 17162000 0 >>>> And for the problematic one, the reason it is seen as nonempty is: >>>> >>>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 >>> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and >>> s->events_dropped is false in my testing, so >>> virtio_scsi_handle_event_vq() doesn't do anything. >>> >>>> Those values stay like this while the call counts above increase. >>>> >>>> So something going wrong with the indices when the event notifier is >>>> set >>>> from QEMU side (in the main thread)? >>>> >>>> The guest is Debian 12 with a 6.1 kernel. >> >> So, trying to figure out a new RFC version: >> >> About the stack tra
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy:
> Add a parameter that enables discard-after-copy. That is mostly useful
> in "push backup with fleecing" scheme, when source is snapshot-access
> format driver node, based on copy-before-write filter snapshot-access
> API:
>
> [guest]               [snapshot-access] ~~ blockdev-backup ~~> [backup target]
>    |                         |
>    | root                    | file
>    v                         v
> [copy-before-write]
>    |           |
>    | file      | target
>    v           v
> [active disk] [temp.img]
>
> In this case discard-after-copy does two things:
>
> - discard data in temp.img to save disk space
> - avoid further copy-before-write operation in discarded area
>
> Note that we have to declare WRITE permission on source in
> copy-before-write filter, for discard to work.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy

Unfortunately, setting BLK_PERM_WRITE unconditionally breaks blockdev-backup for a read-only node (even when not using discard-source):

> #!/bin/bash
> ./qemu-img create /tmp/disk.raw -f raw 1G
> ./qemu-img create /tmp/backup.raw -f raw 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw,read-only=true \
> --blockdev raw,node-name=node1,file.driver=file,file.filename=/tmp/backup.raw \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-backup", "arguments": { "device": "node0", "target": "node1", "sync": "full", "job-id": "backup0" } }
> EOF

The above works before applying this patch, but leads to an error afterwards:

> {"error": {"class": "GenericError", "desc": "Block node is read-only"}}

Best Regards, Fiona
Re: [PATCH 3/3] monitor: only run coroutine commands in qemu_aio_context
Am 16.01.24 um 20:00 schrieb Stefan Hajnoczi: > monitor_qmp_dispatcher_co() runs in the iohandler AioContext that is not > polled during nested event loops. The coroutine currently reschedules > itself in the main loop's qemu_aio_context AioContext, which is polled > during nested event loops. One known problem is that QMP device-add > calls drain_call_rcu(), which temporarily drops the BQL, leading to all > sorts of havoc like other vCPU threads re-entering device emulation code > while another vCPU thread is waiting in device emulation code with > aio_poll(). > > Paolo Bonzini suggested running non-coroutine QMP handlers in the > iohandler AioContext. This avoids trouble with nested event loops. His > original idea was to move coroutine rescheduling to > monitor_qmp_dispatch(), but I resorted to moving it to qmp_dispatch() > because we don't know if the QMP handler needs to run in coroutine > context in monitor_qmp_dispatch(). monitor_qmp_dispatch() would have > been nicer since it's associated with the monitor implementation and not > as general as qmp_dispatch(), which is also used by qemu-ga. > > A number of qemu-iotests need updated .out files because the order of > QMP events vs QMP responses has changed. > > Solves Issue #1933. > > Fixes: 7bed89958bfbf40df9ca681cefbdca63abdde39d ("device_core: use > drain_call_rcu in in qmp_device_add") > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2215192 > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2214985 > Buglink: https://issues.redhat.com/browse/RHEL-17369 > Signed-off-by: Stefan Hajnoczi With the patch I can no longer see any do_qmp_dispatch_bh() calls run in vCPU threads. I also did a bit of smoke testing with some other QMP and QGA commands and didn't find any obvious breakage, so: Tested-by: Fiona Ebner P.S. Unfortunately, the patch does not solve the other issue I came across back then [0] with snapshot_save_job_bh() being executed during a vCPU thread's aio_poll(). See also [1]. 
I suppose this would need some other mechanism to solve or could it also be scheduled to the iohandler AioContext? It's not directly related to your patch of course, just mentioning it, because it's a similar theme. [0]: https://lore.kernel.org/qemu-devel/31757c45-695d-4408-468c-c2de560af...@proxmox.com/ [1]: https://gitlab.com/qemu-project/qemu/-/issues/2111 Best Regards, Fiona
Re: [PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
Am 17.01.24 um 13:20 schrieb Marc-André Lureau: > Hi > > On Wed, Jan 17, 2024 at 3:56 PM Fiona Ebner wrote: >> >> Am 17.01.24 um 12:33 schrieb Marc-André Lureau: >>> Hi >>> >>> On Wed, Jan 17, 2024 at 3:30 PM Fiona Ebner wrote: >>>> >>>> Am 17.01.24 um 12:11 schrieb Marc-André Lureau: >>>>> Hi >>>>> >>>>> On Wed, Jan 17, 2024 at 3:01 PM Fiona Ebner wrote: >>>>>> >>>>>> +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; >>>>>> type++) { >>>>>> +if (!info->types[type].data) { >>>>>> +missing_data = true; >>>>>> +} >>>>>> +} >>>>>> +/* >>>>>> + * If data is missing, the clipboard owner's 'request' callback >>>>>> needs to be >>>>>> + * set. Otherwise, there is no way to get the clipboard data and >>>>>> + * qemu_clipboard_request() cannot be called. >>>>>> + */ >>>>>> +if (missing_data && info->owner && !info->owner->request) { >>>>>> +return; >>>>>> +} >>>>> >>>>> It needs to check whether the type is "available". If not data is >>>>> provided, owner should be set as well, it should assert() that. >>>>> >>>>> That should do the job: >>>>> >>>>> for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { >>>>> /* >>>>> * If data is missing, the clipboard owner's 'request' callback needs >>>>> to >>>>> * be set. Otherwise, there is no way to get the clipboard data and >>>>> * qemu_clipboard_request() cannot be called. >>>>> */ >>>>> if (info->types[type].available && !info->types[type].data) { >>>>> assert(info->owner && info->owner->request); >>>>> } >>>>> } >>>>> >>>> >>>> Okay, thanks! But we can't assert, because that doesn't resolve the CVE >>>> as it would still crash. The VNC client might not have the >>>> VNC_FEATURE_CLIPBOARD_EXT feature, and the request callback is currently >>>> only set in that case. But we can return instead of assert to just avoid >>>> clipboard update. I'll send a v3. >>> >>> If it doesn't have VNC_FEATURE_CLIPBOARD_EXT, it shouldn't update the >>> clipboard without data. 
(ClientCutText/ServerCutText always have data, >>> even if 0-length) >>> >> >> But a buggy client should not be able to crash QEMU. With a >> VNC_MSG_CLIENT_CUT_TEXT message, when read_s32(data, 4) == 0, >> vnc_client_cut_text() is called with zero length. Is that supposed to > > Indeed. That case should be handled at the VNC server code level.. A > 0-length text should set clipboard to "", not NULL. > So have vnc_client_cut_text() call qemu_clipboard_set_data() with data="" and size=1 in the 0-length case? I can see such a call with noVNC as the client (which does use extended messages) when clearing the clipboard. But I suppose the call in vnc_client_cut_text_ext() also would need a check before calling qemu_clipboard_set_data(). Or is there anything guaranteeing the size read from the inflated buffer, i.e. uint32_t tsize = read_u32(buf, 0); which is passed to qemu_clipboard_set_data() is not zero? > With Ext, there is an explicit description of data ending with \0: > According to > https://github.com/rfbproto/rfbproto/blob/master/rfbproto.rst#user-content-7727extended-clipboard-pseudo-encoding > "The text must be followed by a terminating null even though the > length is also explicitly given." > > But we should still handle 0-length data cases. > >> happen? The branch for an extended message is only taken when >> read_s32(data, 4) < 0 and Daniel's patch fixes that branch. >> >> I noticed in qemu_clipboard_set_data(): >> >>> info->types[type].data = g_memdup(data, size); >> >> the g_memdup call will return NULL when size == 0 even if data is >> non-NULL. Is that the actual problem in the above scenario? > > Hmm, qemu_clipboard_set_data() shouldn't allow data == NULL, or size == 0. > > When data is requested, the "peer" will call > qemu_clipboard_set_data(). But I can't see a good way for the peer to > return with no data, it should probably update the clipboard info with > available=false. 
> Oh, I guess my suggestion above isn't necessary if we set available=false when size == 0. Sorry, is this what you mean by "no data"? Or do you mean the case where data=""? Or both? Best Regards, Fiona
Re: [PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
Am 17.01.24 um 12:33 schrieb Marc-André Lureau: > Hi > > On Wed, Jan 17, 2024 at 3:30 PM Fiona Ebner wrote: >> >> Am 17.01.24 um 12:11 schrieb Marc-André Lureau: >>> Hi >>> >>> On Wed, Jan 17, 2024 at 3:01 PM Fiona Ebner wrote: >>>> >>>> +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; >>>> type++) { >>>> +if (!info->types[type].data) { >>>> +missing_data = true; >>>> +} >>>> +} >>>> +/* >>>> + * If data is missing, the clipboard owner's 'request' callback needs >>>> to be >>>> + * set. Otherwise, there is no way to get the clipboard data and >>>> + * qemu_clipboard_request() cannot be called. >>>> + */ >>>> +if (missing_data && info->owner && !info->owner->request) { >>>> +return; >>>> +} >>> >>> It needs to check whether the type is "available". If not data is >>> provided, owner should be set as well, it should assert() that. >>> >>> That should do the job: >>> >>> for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { >>> /* >>> * If data is missing, the clipboard owner's 'request' callback needs to >>> * be set. Otherwise, there is no way to get the clipboard data and >>> * qemu_clipboard_request() cannot be called. >>> */ >>> if (info->types[type].available && !info->types[type].data) { >>> assert(info->owner && info->owner->request); >>> } >>> } >>> >> >> Okay, thanks! But we can't assert, because that doesn't resolve the CVE >> as it would still crash. The VNC client might not have the >> VNC_FEATURE_CLIPBOARD_EXT feature, and the request callback is currently >> only set in that case. But we can return instead of assert to just avoid >> clipboard update. I'll send a v3. > > If it doesn't have VNC_FEATURE_CLIPBOARD_EXT, it shouldn't update the > clipboard without data. (ClientCutText/ServerCutText always have data, > even if 0-length) > But a buggy client should not be able to crash QEMU. With a VNC_MSG_CLIENT_CUT_TEXT message, when read_s32(data, 4) == 0, vnc_client_cut_text() is called with zero length. Is that supposed to happen? 
The branch for an extended message is only taken when read_s32(data, 4) < 0 and Daniel's patch fixes that branch. I noticed in qemu_clipboard_set_data(): > info->types[type].data = g_memdup(data, size); the g_memdup call will return NULL when size == 0 even if data is non-NULL. Is that the actual problem in the above scenario? Best Regards, Fiona
Re: [PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
Am 17.01.24 um 12:11 schrieb Marc-André Lureau: > Hi > > On Wed, Jan 17, 2024 at 3:01 PM Fiona Ebner wrote: >> >> +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; >> type++) { >> +if (!info->types[type].data) { >> +missing_data = true; >> +} >> +} >> +/* >> + * If data is missing, the clipboard owner's 'request' callback needs >> to be >> + * set. Otherwise, there is no way to get the clipboard data and >> + * qemu_clipboard_request() cannot be called. >> + */ >> +if (missing_data && info->owner && !info->owner->request) { >> +return; >> +} > > It needs to check whether the type is "available". If not data is > provided, owner should be set as well, it should assert() that. > > That should do the job: > > for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { > /* > * If data is missing, the clipboard owner's 'request' callback needs to > * be set. Otherwise, there is no way to get the clipboard data and > * qemu_clipboard_request() cannot be called. > */ > if (info->types[type].available && !info->types[type].data) { > assert(info->owner && info->owner->request); > } > } > Okay, thanks! But we can't assert, because that doesn't resolve the CVE as it would still crash. The VNC client might not have the VNC_FEATURE_CLIPBOARD_EXT feature, and the request callback is currently only set in that case. But we can return instead of assert to just avoid clipboard update. I'll send a v3. Best Regards, Fiona
[PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
With VNC, it can be that a client sends a VNC_MSG_CLIENT_CUT_TEXT message before sending a VNC_MSG_CLIENT_SET_ENCODINGS message with VNC_ENCODING_CLIPBOARD_EXT for configuring the clipboard extension. This means that qemu_clipboard_request() can be reached (via vnc_client_cut_text_ext()) before vnc_server_cut_text_caps() was called and had the chance to initialize the clipboard peer. In that case, info->owner->request is NULL instead of a function, so attempting to call it in qemu_clipboard_request() results in a segfault. In particular, this can happen when using the KRDC (22.12.3) VNC client on Wayland.

Another code path leading to the same issue in qemu_clipboard_request() is via vdagent_chr_write() and vdagent_clipboard_recv_request() after a non-extended VNC_MSG_CLIENT_CUT_TEXT message with len=0 was sent. In particular, this can happen when using the KRDC (22.12.3) VNC client on X11.

It is not enough to check in ui/vnc.c's protocol_client_msg() if the VNC_FEATURE_CLIPBOARD_EXT feature is enabled before handling an extended clipboard message with vnc_client_cut_text_ext(), because of the following scenario with two clients (say noVNC and KRDC):

The noVNC client sets the extension VNC_FEATURE_CLIPBOARD_EXT and initializes its cbpeer. The KRDC client does not, but triggers a vnc_client_cut_text() (note it's not the _ext variant). There, a new clipboard info with it as the 'owner' is created and qemu_clipboard_set_data() is called, which in turn calls qemu_clipboard_update() with that info. In qemu_clipboard_update(), the notifier for the noVNC client will be called, i.e. vnc_clipboard_notify(), which also sets vs->cbinfo for the noVNC client. The 'owner' in that clipboard info is the clipboard peer for the KRDC client, which did not initialize the 'request' function. That sounds correct to me, it is the owner of that clipboard info.
Then when noVNC sends a VNC_MSG_CLIENT_CUT_TEXT message (it did set the VNC_FEATURE_CLIPBOARD_EXT feature correctly, so a check for it passes), that clipboard info is passed to qemu_clipboard_request() and the original segfault still happens.

Fix the issue by disallowing clipboard update if both, data is missing and the clipboard owner's 'request' callback is not set. Add an assert that the clipboard owner's 'request' callback is set in qemu_clipboard_request() to have a clean error/abort should a similar issue appear in the future.

Fixes: CVE-2023-6683
Reported-by: Markus Frank
Suggested-by: Marc-André Lureau
Signed-off-by: Fiona Ebner
---
Changes in v2:
* Different approach as suggested by Marc-André: Instead of quietly
  returning in qemu_clipboard_request() when no request callback is set,
  prevent clipboard update if there is both, no data and no request
  callback.

 ui/clipboard.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/ui/clipboard.c b/ui/clipboard.c
index 3d14bffaf8..d1bb7c77f2 100644
--- a/ui/clipboard.c
+++ b/ui/clipboard.c
@@ -65,12 +65,28 @@ bool qemu_clipboard_check_serial(QemuClipboardInfo *info, bool client)
 
 void qemu_clipboard_update(QemuClipboardInfo *info)
 {
+    uint32_t type;
+    bool missing_data = false;
     QemuClipboardNotify notify = {
         .type = QEMU_CLIPBOARD_UPDATE_INFO,
         .info = info,
     };
     assert(info->selection < QEMU_CLIPBOARD_SELECTION__COUNT);
 
+    for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; type++) {
+        if (!info->types[type].data) {
+            missing_data = true;
+        }
+    }
+    /*
+     * If data is missing, the clipboard owner's 'request' callback needs to be
+     * set. Otherwise, there is no way to get the clipboard data and
+     * qemu_clipboard_request() cannot be called.
+     */
+    if (missing_data && info->owner && !info->owner->request) {
+        return;
+    }
+
     notifier_list_notify(&clipboard_notifiers, &notify);
 
     if (cbinfo[info->selection] != info) {
@@ -132,6 +148,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info,
         !info->owner)
         return;
 
+    assert(info->owner->request);
+
     info->types[type].requested = true;
     info->owner->request(info, type);
 }
-- 
2.39.2
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 16.01.24 um 13:11 schrieb Fiona Ebner: > Am 15.01.24 um 13:00 schrieb Marc-André Lureau: >>>>> >>>> >>>> The trouble is when qemu_clipboard_update() is called without data & >>>> without a request callback set. We shouldn't allow that as we have no >>>> means to get the clipboard data then. >>>> >>> >>> In the above scenario, I'm pretty sure there is data when >>> qemu_clipboard_update() is called. Just no request callback. If we'd >>> reject this, won't that break clients that do not set the >>> VNC_FEATURE_CLIPBOARD_EXT feature and only use non-extended >>> VNC_MSG_CLIENT_CUT_TEXT messages? >> >> If "data" is already set, then qemu_clipboard_request() returns. >> >> Inverting the condition I suggested above: it's allowed to be the >> clipboard owner if either "data" is set, or a request callback is set. >> > > Oh, sorry. Yes, it seems the problematic case is where data is not set. > But isn't that legitimate when clearing the clipboard? Or is a > VNC_MSG_CLIENT_CUT_TEXT message not valid when len is 0 and should be > rejected? In my testing KRDC does send such a message when the clipboard > is cleared: > >> #1 0x558f1e6a0dac in vnc_client_cut_text (vs=0x558f207754d0, len=0, >> text=0x558f2046e008 "\003 \002\377\005") at ../ui/vnc-clipboard.c:313 >> #2 0x558f1e68e067 in protocol_client_msg (vs=0x558f207754d0, >> data=0x558f2046e000 "\006", len=8) at ../ui/vnc.c:2454 > > Your suggestion would disallow this for clients that do not set the > VNC_FEATURE_CLIPBOARD_EXT feature (and only use non-extended > VNC_MSG_CLIENT_CUT_TEXT messages). > I noticed that there is yet another code path leading to qemu_clipboard_request() and dereferencing the NULL owner->request even if only non-extended VNC_MSG_CLIENT_CUT_TEXT messages are used: > Thread 1 "qemu-system-x86" hit Breakpoint 1, vnc_client_cut_text ( > vs=0x593e6c00, len=0, > text=0x575ea608 > "404078,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100\377\001") at > ../ui/vnc-clipboard.c:310 > (gdb) c > Continuing. 
And later:

> qemu_clipboard_request (info=0x57e576e0, type=QEMU_CLIPBOARD_TYPE_TEXT) at ../ui/clipboard.c:129
> 129 if (info->types[type].data ||
> (gdb) p *info->owner
> $65 = {name = 0x0, notifier = {notify = 0x0, node = {le_next = 0x0, le_prev = 0x0}}, request = 0x0}
> (gdb) bt
> #0 qemu_clipboard_request (info=0x57e576e0, type=QEMU_CLIPBOARD_TYPE_TEXT) at ../ui/clipboard.c:129
> #1 0x558952ce in vdagent_chr_recv_chunk (vd=0x56f67800) at ../ui/vdagent.c:769
> #2 vdagent_chr_write (chr=, buf=0x7fff4ab263e4 "", len=0) at ../ui/vdagent.c:840
> #3 0x55d98830 in qemu_chr_write_buffer (s=s@entry=0x56f67800, buf=buf@entry=0x7fff4ab263c0 "\001", len=36, offset=offset@entry=0x7fffd030, write_all=false) at ../chardev/char.c:122
> #4 0x55d98c3c in qemu_chr_write (s=0x56f67800, buf=buf@entry=0x7fff4ab263c0 "\001", len=len@entry=36, write_all=, write_all@entry=false) at ../chardev/char.c:186
> #5 0x55d9109f in qemu_chr_fe_write (be=be@entry=0x56e59040, buf=buf@entry=0x7fff4ab263c0 "\001", len=len@entry=36) at ../chardev/char-fe.c:42
> #6 0x55900045 in flush_buf (port=0x56e58f40, buf=0x7fff4ab263c0 "\001", len=36) at ../hw/char/virtio-console.c:63
> #7 0x55bea4f3 in do_flush_queued_data (port=0x56e58f40, vq=0x59272558, vdev=0x5926a4d0) at ../hw/char/virtio-serial-bus.c:188
> #8 0x55c201ff in virtio_queue_notify_vq (vq=0x59272558) at ../hw/virtio/virtio.c:2268
> #9 0x55e36dd5 in aio_dispatch_handler (ctx=ctx@entry=0x56e51b10, node=0x7ffed812b9f0) at ../util/aio-posix.c:372
> #10 0x55e37662 in aio_dispatch_handlers (ctx=0x56e51b10) at ../util/aio-posix.c:414
> #11 aio_dispatch (ctx=0x56e51b10) at ../util/aio-posix.c:424
> #12 0x55e4d44e in aio_ctx_dispatch (source=, callback=, user_data=) at ../util/async.c:358
> #13 0x7753e7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #14 0x55e4ecb8 in glib_pollfds_poll () at ../util/main-loop.c:287
> #15 os_host_main_loop_wait (timeout=58675786) at ../util/main-loop.c:310
> #16 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:589
> #17 0x55aa5f63 in qemu_main_loop () at ../system/runstate.c:791
> #18 0x55cadc16 in qemu_default_main () at ../sys
[PATCH] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status
Using fleecing backup like in [0] on a qcow2 image (with metadata preallocation) can lead to the following assertion failure:

> bdrv_co_do_block_status: Assertion `!(ret & BDRV_BLOCK_ZERO)' failed.

In the reproducer [0], it happens because the BDRV_BLOCK_RECURSE flag will be set by the qcow2 driver, so the caller will recursively check the file child. Then the BDRV_BLOCK_ZERO flag ends up being set too. Later up the call chain, in bdrv_co_do_block_status() for the snapshot-access driver, the assertion failure will happen, because both flags are set.

To fix it, clear the recurse flag after the recursive check was done.

In detail:

> #0 qcow2_co_block_status

Returns 0x45 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID.

> #1 bdrv_co_do_block_status

Because of the data flag, bdrv_co_do_block_status() will now also set BDRV_BLOCK_ALLOCATED. Because of the recurse flag, bdrv_co_do_block_status() for the bdrv_file child will be called, which returns 0x16 = BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_ZERO. Now the return value inherits the zero flag. Returns 0x57 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

> #2 bdrv_co_common_block_status_above
> #3 bdrv_co_block_status_above
> #4 bdrv_co_block_status
> #5 cbw_co_snapshot_block_status
> #6 bdrv_co_snapshot_block_status
> #7 snapshot_access_co_block_status
> #8 bdrv_co_do_block_status

Return value is propagated all the way up to here, where the assertion failure happens, because BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO are both set.
> #9 bdrv_co_common_block_status_above
> #10 bdrv_co_block_status_above
> #11 block_copy_block_status
> #12 block_copy_dirty_clusters
> #13 block_copy_common
> #14 block_copy_async_co_entry
> #15 coroutine_trampoline

[0]:

> #!/bin/bash
> rm /tmp/disk.qcow2
> ./qemu-img create /tmp/disk.qcow2 -o preallocation=metadata -f qcow2 1G
> ./qemu-img create /tmp/fleecing.qcow2 -f qcow2 1G
> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \
> --blockdev qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2 \
> --blockdev qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": "node1", "sync": "full", "job-id": "backup0" } }
> EOF

Signed-off-by: Fiona Ebner
---
I'm new to this part of the code, so I'm not sure if it is actually safe to clear the flag? Intuitively, I'd expect it to be only relevant until it was acted upon, but no clue.

 block/io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/io.c b/block/io.c
index 8fa7670571..33150c0359 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2584,6 +2584,16 @@ bdrv_co_do_block_status(BlockDriverState *bs, bool want_zero,
                 ret |= (ret2 & BDRV_BLOCK_ZERO);
             }
         }
+
+        /*
+         * Now that the recursive search was done, clear the flag. Otherwise,
+         * with more complicated block graphs like snapshot-access ->
+         * copy-before-write -> qcow2, where the return value will be propagated
+         * further up to a parent bdrv_co_do_block_status() call, both the
+         * BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO flags would be set, which is
+         * not allowed.
+         */
+        ret &= ~BDRV_BLOCK_RECURSE;
     }
 
 out:
-- 
2.39.2
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 15.01.24 um 13:00 schrieb Marc-André Lureau: >>> >>> The trouble is when qemu_clipboard_update() is called without data & >>> without a request callback set. We shouldn't allow that as we have no >>> means to get the clipboard data then. >>> >> >> In the above scenario, I'm pretty sure there is data when >> qemu_clipboard_update() is called. Just no request callback. If we'd >> reject this, won't that break clients that do not set the >> VNC_FEATURE_CLIPBOARD_EXT feature and only use non-extended >> VNC_MSG_CLIENT_CUT_TEXT messages? > > If "data" is already set, then qemu_clipboard_request() returns. > > Inverting the condition I suggested above: it's allowed to be the > clipboard owner if either "data" is set, or a request callback is set. > Oh, sorry. Yes, it seems the problematic case is where data is not set. But isn't that legitimate when clearing the clipboard? Or is a VNC_MSG_CLIENT_CUT_TEXT message not valid when len is 0 and should be rejected? In my testing KRDC does send such a message when the clipboard is cleared: > #1 0x558f1e6a0dac in vnc_client_cut_text (vs=0x558f207754d0, len=0, > text=0x558f2046e008 "\003 \002\377\005") at ../ui/vnc-clipboard.c:313 > #2 0x558f1e68e067 in protocol_client_msg (vs=0x558f207754d0, > data=0x558f2046e000 "\006", len=8) at ../ui/vnc.c:2454 Your suggestion would disallow this for clients that do not set the VNC_FEATURE_CLIPBOARD_EXT feature (and only use non-extended VNC_MSG_CLIENT_CUT_TEXT messages). Best Regards, Fiona
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 15.01.24 um 12:33 schrieb Marc-André Lureau: > Hi > > On Mon, Jan 15, 2024 at 3:26 PM Fiona Ebner wrote: >> >> Am 15.01.24 um 12:15 schrieb Marc-André Lureau: >>> Hi >>> >>> On Mon, Jan 15, 2024 at 2:45 PM Fiona Ebner wrote: >>>> >>>> Am 14.01.24 um 14:51 schrieb Marc-André Lureau: >>>>>> >>>>>> diff --git a/ui/clipboard.c b/ui/clipboard.c >>>>>> index 3d14bffaf8..c13b54d2e9 100644 >>>>>> --- a/ui/clipboard.c >>>>>> +++ b/ui/clipboard.c >>>>>> @@ -129,7 +129,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info, >>>>>> if (info->types[type].data || >>>>>> info->types[type].requested || >>>>>> !info->types[type].available || >>>>>> -!info->owner) >>>>>> +!info->owner || >>>>>> +!info->owner->request) >>>>>> return; >>>>> >>>>> While that fixes the crash, I think we should handle the situation >>>>> earlier. A clipboard peer shouldn't be allowed to hold the clipboard >>>>> if it doesn't have the data available or a "request" callback set. >>>>> >>>> >>>> Where should initialization of the cbpeer happen so that we are >>>> guaranteed to do it also for clients that do not set the >>>> VNC_FEATURE_CLIPBOARD_EXT feature? Can the vnc_clipboard_request >>>> function be re-used for clients without that feature or will it be >>>> necessary to add some kind of "dummy" request callback for those clients? >>> >>> qemu_clipboard_update() shouldn't accept info as current clipboard if >>> the owner doesn't have the data available or a "request" callback set. >>> This should also be assert() somehow and handled earlier. >>> >> >> The request callback is only initialized in vnc_server_cut_text_caps() >> when the VNC_FEATURE_CLIPBOARD_EXT is enabled. AFAIU, it's perfectly >> fine for clients to use the clipboard with non-extended messages and >> qemu_clipboard_update() should (and currently does) accept those. 
>> >>> In vnc_client_cut_text_ext() we could detect that situation, but with >>> Daniel's "[PATCH] ui: reject extended clipboard message if not >>> activated", this shouldn't happen anymore iiuc. >>> >> >> Daniel's patch doesn't change the behavior for non-extended messages. >> The problem can still happen with two VNC clients. This is the scenario >> described in the lower half of my commit message (and why Daniel >> mentions in his patch that it's not sufficient to fix the CVE). >> >> In short: client A does not set the VNC_FEATURE_CLIPBOARD_EXT feature >> and then uses a non-extended VNC_MSG_CLIENT_CUT_TEXT message. This leads >> to vnc_client_cut_text() being called and setting the clipboard info >> referencing that client. But here, no request callback is initialized, >> that only happens in vnc_server_cut_text_caps() when the >> VNC_FEATURE_CLIPBOARD_EXT is enabled. >> >> When client B does set the VNC_FEATURE_CLIPBOARD_EXT feature and does >> send an extended VNC_MSG_CLIENT_CUT_TEXT message, the request callback >> will be attempted but it isn't set. >> > > The trouble is when qemu_clipboard_update() is called without data & > without a request callback set. We shouldn't allow that as we have no > means to get the clipboard data then. > In the above scenario, I'm pretty sure there is data when qemu_clipboard_update() is called. Just no request callback. If we'd reject this, won't that break clients that do not set the VNC_FEATURE_CLIPBOARD_EXT feature and only use non-extended VNC_MSG_CLIENT_CUT_TEXT messages? Best Regards, Fiona
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 15.01.24 um 12:15 schrieb Marc-André Lureau: > Hi > > On Mon, Jan 15, 2024 at 2:45 PM Fiona Ebner wrote: >> >> Am 14.01.24 um 14:51 schrieb Marc-André Lureau: >>>> >>>> diff --git a/ui/clipboard.c b/ui/clipboard.c >>>> index 3d14bffaf8..c13b54d2e9 100644 >>>> --- a/ui/clipboard.c >>>> +++ b/ui/clipboard.c >>>> @@ -129,7 +129,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info, >>>> if (info->types[type].data || >>>> info->types[type].requested || >>>> !info->types[type].available || >>>> -!info->owner) >>>> +!info->owner || >>>> +!info->owner->request) >>>> return; >>> >>> While that fixes the crash, I think we should handle the situation >>> earlier. A clipboard peer shouldn't be allowed to hold the clipboard >>> if it doesn't have the data available or a "request" callback set. >>> >> >> Where should initialization of the cbpeer happen so that we are >> guaranteed to do it also for clients that do not set the >> VNC_FEATURE_CLIPBOARD_EXT feature? Can the vnc_clipboard_request >> function be re-used for clients without that feature or will it be >> necessary to add some kind of "dummy" request callback for those clients? > > qemu_clipboard_update() shouldn't accept info as current clipboard if > the owner doesn't have the data available or a "request" callback set. > This should also be assert() somehow and handled earlier. > The request callback is only initialized in vnc_server_cut_text_caps() when the VNC_FEATURE_CLIPBOARD_EXT is enabled. AFAIU, it's perfectly fine for clients to use the clipboard with non-extended messages and qemu_clipboard_update() should (and currently does) accept those. > In vnc_client_cut_text_ext() we could detect that situation, but with > Daniel's "[PATCH] ui: reject extended clipboard message if not > activated", this shouldn't happen anymore iiuc. > Daniel's patch doesn't change the behavior for non-extended messages. The problem can still happen with two VNC clients. 
This is the scenario described in the lower half of my commit message (and why Daniel mentions in his patch that it's not sufficient to fix the CVE). In short: client A does not set the VNC_FEATURE_CLIPBOARD_EXT feature and then uses a non-extended VNC_MSG_CLIENT_CUT_TEXT message. This leads to vnc_client_cut_text() being called and setting the clipboard info referencing that client. But here, no request callback is initialized, that only happens in vnc_server_cut_text_caps() when the VNC_FEATURE_CLIPBOARD_EXT is enabled. When client B does set the VNC_FEATURE_CLIPBOARD_EXT feature and does send an extended VNC_MSG_CLIENT_CUT_TEXT message, the request callback will be attempted but it isn't set. Best Regards, Fiona