Re: [PATCH] virtio-pci: Fix the use of an uninitialized irqfd.
Hi,

Am 22.05.24 um 07:10 schrieb Cindy Lu:
> The crash was reported in MAC OS and NixOS, here is the link for this bug
> https://gitlab.com/qemu-project/qemu/-/issues/2334
> https://gitlab.com/qemu-project/qemu/-/issues/2321
>
> The root cause is that the function virtio_pci_set_guest_notifiers() only
> initializes the irqfd when the use_guest_notifier_mask and guest_notifier_mask
> are set.

Sorry, I'm just trying to understand the fix and I'm probably missing
something, but in virtio_pci_set_guest_notifiers() there is:

> bool with_irqfd = msix_enabled(&proxy->pci_dev) &&
>     kvm_msi_via_irqfd_enabled();

and then:

> if ((with_irqfd ||
>      (vdev->use_guest_notifier_mask && k->guest_notifier_mask)) &&
>     assign) {
>     if (with_irqfd) {
>         proxy->vector_irqfd =
>             g_malloc0(sizeof(*proxy->vector_irqfd) *
>                       msix_nr_vectors_allocated(&proxy->pci_dev));
>         r = kvm_virtio_pci_vector_vq_use(proxy, nvqs);

Meaning proxy->vector_irqfd is allocated when with_irqfd is true (even if
vdev->use_guest_notifier_mask && k->guest_notifier_mask is false).

> However, this check is missing in virtio_pci_set_vector().
> So the fix is to add this check.
>
> This fix is verified in vyatta,MacOS,NixOS,fedora system.
>
> The bt tree for this bug is:
> Thread 6 "CPU 0/KVM" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7c817be006c0 (LWP 1269146)]
> kvm_virtio_pci_vq_vector_use () at ../qemu-9.0.0/hw/virtio/virtio-pci.c:817
> 817     if (irqfd->users == 0) {

The crash happens because the irqfd is NULL/invalid here, right?

proxy->vector_irqfd = NULL happens when virtio_pci_set_guest_notifiers() is
called with assign=false or for an unsuccessful call to
virtio_pci_set_guest_notifiers() with assign=true. AFAIU, the issue is that
virtio_pci_set_vector() is called between a call to
virtio_pci_set_guest_notifiers() with assign=false and a successful
virtio_pci_set_guest_notifiers() with assign=true (or before the first such
call).

So I'm trying to understand why adding the check for
vdev->use_guest_notifier_mask && k->guest_notifier_mask is sufficient to fix
the issue.

Thanks!

Best Regards,
Fiona
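The window described above can be sketched as a toy model (illustrative Python with made-up names, not QEMU's actual C code). Note that the guard modeled here is a plain NULL check on vector_irqfd, whereas the proposed fix checks use_guest_notifier_mask && guest_notifier_mask — which is exactly the sufficiency question raised in this mail:

```python
# Toy model of the vector_irqfd lifecycle: it only exists while guest
# notifiers are assigned with irqfd support.

class Proxy:
    def __init__(self):
        self.vector_irqfd = None  # NULL until notifiers are assigned

def set_guest_notifiers(proxy, assign, with_irqfd, nvectors=4):
    """Models virtio_pci_set_guest_notifiers(): allocate on assign
    (when with_irqfd), free again on unassign."""
    if assign and with_irqfd:
        proxy.vector_irqfd = [{"users": 0} for _ in range(nvectors)]
    elif not assign:
        proxy.vector_irqfd = None

def set_vector_unguarded(proxy, vector):
    # pre-fix behavior: unconditional dereference; crashes when
    # vector_irqfd is NULL (a TypeError stands in for the SIGSEGV)
    proxy.vector_irqfd[vector]["users"] += 1

def set_vector_guarded(proxy, vector):
    # the missing check: only touch the irqfd when it was allocated
    if proxy.vector_irqfd is not None:
        proxy.vector_irqfd[vector]["users"] += 1

proxy = Proxy()
set_guest_notifiers(proxy, assign=False, with_irqfd=True)
set_vector_guarded(proxy, 0)   # safe no-op; the unguarded variant crashes
```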
Re: [PULL 6/9] virtio-gpu: fix v2 migration
CC-ing stable, because this already is an issue in 9.0.0 Am 23.05.24 um 00:20 schrieb Fabiano Rosas: > From: Marc-André Lureau > > Commit dfcf74fa ("virtio-gpu: fix scanout migration post-load") broke > forward/backward version migration. Versioning of nested VMSD structures > is not straightforward, as the wire format doesn't have nested > structures versions. Introduce x-scanout-vmstate-version and a field > test to save/load appropriately according to the machine version. > > Fixes: dfcf74fa ("virtio-gpu: fix scanout migration post-load") > Signed-off-by: Marc-André Lureau > Signed-off-by: Peter Xu > Reviewed-by: Fiona Ebner > Tested-by: Fiona Ebner > [fixed long lines] > Signed-off-by: Fabiano Rosas > --- > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 30 ++ > include/hw/virtio/virtio-gpu.h | 1 + > 3 files changed, 24 insertions(+), 8 deletions(-) > > diff --git a/hw/core/machine.c b/hw/core/machine.c > index c7ceb11501..8d6dc69f0e 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -42,6 +42,7 @@ GlobalProperty hw_compat_8_2[] = { > { "migration", "zero-page-detection", "legacy"}, > { TYPE_VIRTIO_IOMMU_PCI, "granule", "4k" }, > { TYPE_VIRTIO_IOMMU_PCI, "aw-bits", "64" }, > +{ "virtio-gpu-device", "x-scanout-vmstate-version", "1" }, > }; > const size_t hw_compat_8_2_len = G_N_ELEMENTS(hw_compat_8_2); > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c > index ae831b6b3e..d60b1b2973 100644 > --- a/hw/display/virtio-gpu.c > +++ b/hw/display/virtio-gpu.c > @@ -1166,10 +1166,17 @@ static void virtio_gpu_cursor_bh(void *opaque) > virtio_gpu_handle_cursor(&g->parent_obj.parent_obj, g->cursor_vq); > } > > +static bool scanout_vmstate_after_v2(void *opaque, int version) > +{ > +struct VirtIOGPUBase *base = container_of(opaque, VirtIOGPUBase, > scanout); > +struct VirtIOGPU *gpu = container_of(base, VirtIOGPU, parent_obj); > + > +return gpu->scanout_vmstate_version >= 2; > +} > + > static const VMStateDescription vmstate_virtio_gpu_scanout =
{ > .name = "virtio-gpu-one-scanout", > -.version_id = 2, > -.minimum_version_id = 1, > +.version_id = 1, > .fields = (const VMStateField[]) { > VMSTATE_UINT32(resource_id, struct virtio_gpu_scanout), > VMSTATE_UINT32(width, struct virtio_gpu_scanout), > @@ -1181,12 +1188,18 @@ static const VMStateDescription > vmstate_virtio_gpu_scanout = { > VMSTATE_UINT32(cursor.hot_y, struct virtio_gpu_scanout), > VMSTATE_UINT32(cursor.pos.x, struct virtio_gpu_scanout), > VMSTATE_UINT32(cursor.pos.y, struct virtio_gpu_scanout), > -VMSTATE_UINT32_V(fb.format, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.bytes_pp, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.width, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.height, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.stride, struct virtio_gpu_scanout, 2), > -VMSTATE_UINT32_V(fb.offset, struct virtio_gpu_scanout, 2), > +VMSTATE_UINT32_TEST(fb.format, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.bytes_pp, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.width, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.height, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.stride, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > +VMSTATE_UINT32_TEST(fb.offset, struct virtio_gpu_scanout, > +scanout_vmstate_after_v2), > VMSTATE_END_OF_LIST() > }, > }; > @@ -1659,6 +1672,7 @@ static Property virtio_gpu_properties[] = { > DEFINE_PROP_BIT("blob", VirtIOGPU, parent_obj.conf.flags, > VIRTIO_GPU_FLAG_BLOB_ENABLED, false), > DEFINE_PROP_SIZE("hostmem", VirtIOGPU, parent_obj.conf.hostmem, 0), > +DEFINE_PROP_UINT8("x-scanout-vmstate-version", VirtIOGPU, > scanout_vmstate_version, 2), > DEFINE_PROP_END_OF_LIST(), > }; > > diff --git a/include/hw/virtio/virtio-gpu.h b/include/hw/virtio/virtio-gpu.h > index 56d6e821bf..7a59379f5a 100644 > --- a/include/hw/virtio/virtio-gpu.h > 
+++ b/include/hw/virtio/virtio-gpu.h > @@ -177,6 +177,7 @@ typedef struct VGPUDMABuf { > struct VirtIOGPU { > VirtIOGPUBase parent_obj; > > +uint8_t scanout_vmstate_version; > uint64_t conf_max_hostmem; > > VirtQueue *ctrl_vq;
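The mechanism of the fix can be sketched with a simplified Python model (not the actual VMState machinery): a per-field test decides, based on a device property pinned by the machine type, whether the fb.* fields go on the wire, so both ends of a migration agree on the stream layout.

```python
def scanout_wire_fields(scanout_vmstate_version):
    """Models vmstate_virtio_gpu_scanout with the field test: the fb.*
    fields are only put on the wire when the device property says
    version >= 2 (the scanout_vmstate_after_v2() test)."""
    fields = ["resource_id", "width", "height", "x", "y",
              "cursor.resource_id", "cursor.hot_x", "cursor.hot_y",
              "cursor.pos.x", "cursor.pos.y"]
    if scanout_vmstate_version >= 2:
        fields += ["fb.format", "fb.bytes_pp", "fb.width",
                   "fb.height", "fb.stride", "fb.offset"]
    return fields

# hw_compat_8_2 pins x-scanout-vmstate-version=1 for 8.2-and-older
# machine types, so an 8.2 machine speaks the old stream on both ends,
# while new machine types (default property value 2) include fb.*:
assert "fb.format" not in scanout_wire_fields(1)
assert "fb.format" in scanout_wire_fields(2)
```

This avoids bumping the nested VMSD's version on the wire, which is what broke cross-version migration in the first place.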
Re: [PATCH 1/2] Revert "monitor: use aio_co_reschedule_self()"
CC-ing stable since 1f25c172f83704e350c0829438d832384084a74d is in 9.0.0 Am 06.05.24 um 21:06 schrieb Stefan Hajnoczi: > Commit 1f25c172f837 ("monitor: use aio_co_reschedule_self()") was a code > cleanup that uses aio_co_reschedule_self() instead of open coding > coroutine rescheduling. > > Bug RHEL-34618 was reported and Kevin Wolf identified > the root cause. I missed that aio_co_reschedule_self() -> > qemu_get_current_aio_context() only knows about > qemu_aio_context/IOThread AioContexts and not about iohandler_ctx. It > does not function correctly when going back from the iohandler_ctx to > qemu_aio_context. > > Go back to open coding the AioContext transitions to avoid this bug. > > This reverts commit 1f25c172f83704e350c0829438d832384084a74d. > > Buglink: https://issues.redhat.com/browse/RHEL-34618 > Signed-off-by: Stefan Hajnoczi > --- > qapi/qmp-dispatch.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c > index f3488afeef..176b549473 100644 > --- a/qapi/qmp-dispatch.c > +++ b/qapi/qmp-dispatch.c > @@ -212,7 +212,8 @@ QDict *coroutine_mixed_fn qmp_dispatch(const > QmpCommandList *cmds, QObject *requ > * executing the command handler so that it can make progress if > it > * involves an AIO_WAIT_WHILE(). > */ > -aio_co_reschedule_self(qemu_get_aio_context()); > +aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self()); > +qemu_coroutine_yield(); > } > > monitor_set_cur(qemu_coroutine_self(), cur_mon); > @@ -226,7 +227,9 @@ QDict *coroutine_mixed_fn qmp_dispatch(const > QmpCommandList *cmds, QObject *requ > * Move back to iohandler_ctx so that nested event loops for > * qemu_aio_context don't start new monitor commands. > */ > -aio_co_reschedule_self(iohandler_get_aio_context()); > +aio_co_schedule(iohandler_get_aio_context(), > +qemu_coroutine_self()); > +qemu_coroutine_yield(); > } > } else { > /*
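The failure mode can be sketched as follows (a simplified model of my reading of the commit message, not QEMU's actual code): because qemu_get_current_aio_context() misreports iohandler_ctx as qemu_aio_context, the "already there" early-out in aio_co_reschedule_self() fires spuriously and the coroutine never moves.

```python
QEMU_AIO_CTX = "qemu_aio_context"
IOHANDLER_CTX = "iohandler_ctx"

def get_current_aio_context(actual_ctx):
    # only knows about qemu_aio_context/IOThread AioContexts, so code
    # running in iohandler_ctx is misreported as qemu_aio_context
    return QEMU_AIO_CTX if actual_ctx == IOHANDLER_CTX else actual_ctx

def reschedule_self(actual_ctx, new_ctx):
    # models aio_co_reschedule_self(): early-out when "already there"
    if get_current_aio_context(actual_ctx) == new_ctx:
        return actual_ctx  # spurious no-op: the coroutine never moves
    return new_ctx

def schedule_and_yield(actual_ctx, new_ctx):
    # models the open-coded aio_co_schedule() + qemu_coroutine_yield():
    # it moves unconditionally, with no current-context comparison
    return new_ctx

# moving from iohandler_ctx to qemu_aio_context:
assert reschedule_self(IOHANDLER_CTX, QEMU_AIO_CTX) == IOHANDLER_CTX  # stuck
assert schedule_and_yield(IOHANDLER_CTX, QEMU_AIO_CTX) == QEMU_AIO_CTX
```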
Re: block snapshot issue with RBD
Hi, Am 28.05.24 um 20:19 schrieb Jin Cao: > Hi Ilya > > On 5/28/24 11:13 AM, Ilya Dryomov wrote: >> On Mon, May 27, 2024 at 9:06 PM Jin Cao wrote: >>> >>> Supplementary info: VM is paused after "migrate" command. After being >>> resumed with "cont", snapshot_delete_blkdev_internal works again, which >>> is confusing, as disk snapshots generally require I/O to be paused, and a >>> frozen VM satisfies this requirement. >> >> Hi Jin, >> >> This doesn't seem to be related to RBD. Given that the same error is >> observed when using the RBD driver with the raw format, I would dig in >> the direction of migration somehow "installing" the raw format (which >> is on-disk compatible with the rbd format). >> > > Thanks for the hint. > >> Also, did you mean to say "snapshot_blkdev_internal" instead of >> "snapshot_delete_blkdev_internal" in both instances? > > Sorry for my copy-and-paste mistake. Yes, it's snapshot_blkdev_internal. > > -- > Sincerely, > Jin Cao > >> >> Thanks, >> >> Ilya >> >>> >>> -- >>> Sincerely >>> Jin Cao >>> >>> On 5/27/24 10:56 AM, Jin Cao wrote: CC block and migration related address. On 5/27/24 12:03 AM, Jin Cao wrote: > Hi, > > I encountered an RBD block snapshot issue after doing migration. > > Steps > - > > 1. Start QEMU with: > ./qemu-system-x86_64 -name VM -machine q35 -accel kvm -cpu > host,migratable=on -m 2G -boot menu=on,strict=on > rbd:image/ubuntu-22.04-server-cloudimg-amd64.raw -net nic -net user > -cdrom /home/my/path/of/cloud-init.iso -monitor stdio > > 2. Do block snapshot in monitor cmd: snapshot_delete_blkdev_internal. > It works as expected: the snapshot is visible with command `rbd snap ls > pool_name/image_name`. > > 3. Do pseudo migration with monitor cmd: migrate -d > exec:cat>/tmp/vm.out > > 4.
Do block snapshot again with snapshot_delete_blkdev_internal, then > I get: > Error: Block format 'raw' used by device 'ide0-hd0' does not > support internal snapshots > > I was hoping to do the second block snapshot successfully, and it > feels abnormal that the RBD block snapshot function is disrupted after > migration. > > BTW, I get the same block snapshot error when I start QEMU with: > "-drive format=raw,file=rbd:pool_name/image_name" > > My question is: how could I proceed with RBD block snapshot after the > pseudo migration? > > I bisected this issue to d3007d348a ("block: Fix crash when loading snapshot on inactive node"). > diff --git a/block/snapshot.c b/block/snapshot.c > index ec8cf4810b..c4d40e80dd 100644 > --- a/block/snapshot.c > +++ b/block/snapshot.c > @@ -196,8 +196,10 @@ bdrv_snapshot_fallback(BlockDriverState *bs) > int bdrv_can_snapshot(BlockDriverState *bs) > { > BlockDriver *drv = bs->drv; > + > GLOBAL_STATE_CODE(); > -if (!drv || !bdrv_is_inserted(bs) || bdrv_is_read_only(bs)) { > + > +if (!drv || !bdrv_is_inserted(bs) || !bdrv_is_writable(bs)) { > return 0; > } > So I guess the issue is that the blockdev is not writable when in the "postmigrate" state? Best Regards, Fiona
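The effect of the bisected check change can be sketched as follows (a hedged Python model assuming, as guessed above, that the node is inactive and hence not writable in the POSTMIGRATE state):

```python
def can_snapshot_old(drv, inserted, read_only, inactive):
    # pre-d3007d348a: only rejected read-only nodes
    return bool(drv and inserted and not read_only)

def can_snapshot_new(drv, inserted, read_only, inactive):
    # post-d3007d348a: requires the node to be writable, and an
    # inactive node (e.g. after migration) is not writable even
    # though it is not read-only
    writable = not read_only and not inactive
    return bool(drv and inserted and writable)

# before migration: both variants allow the snapshot
assert can_snapshot_old("rbd", True, False, False)
assert can_snapshot_new("rbd", True, False, False)
# after "migrate -d exec:...", the VM sits in POSTMIGRATE, node inactive:
assert can_snapshot_old("rbd", True, False, True)      # used to succeed
assert not can_snapshot_new("rbd", True, False, True)  # now fails
```

When bdrv_can_snapshot() returns 0, bdrv_snapshot_fallback() ends up at the 'raw' format node, which produces the "does not support internal snapshots" error quoted above.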
[PATCH v2 0/2] backup: allow specifying minimum cluster size
Based-on: https://lore.kernel.org/qemu-devel/20240429115157.2260885-1-vsement...@yandex-team.ru/

Discussion for v1:
https://lore.kernel.org/qemu-devel/20240308155158.830258-1-f.eb...@proxmox.com/

Changes in v2:
* Use 'size' type in QAPI.
* Remove option in cbw_parse_options(), i.e. before parsing generic
  blockdev options.
* Reword commit messages hoping to describe the issue in a more
  straight-forward way.

In the context of backup fleecing, discarding the source will not work
when the fleecing image has a larger granularity than the one used for
block-copy operations (can happen if the backup target has a smaller
cluster size), because cbw_co_pdiscard_snapshot() will align down the
discard requests and thus effectively ignore them.

To make @discard-source work in such a scenario, allow specifying the
minimum cluster size used for block-copy operations and thus in
particular also the granularity for discard requests to the source.

Fiona Ebner (2):
  copy-before-write: allow specifying minimum cluster size
  backup: add minimum cluster size to performance options

 block/backup.c             |  2 +-
 block/block-copy.c         | 22 ++
 block/copy-before-write.c  | 18 +-
 block/copy-before-write.h  |  1 +
 blockdev.c                 |  3 +++
 include/block/block-copy.h |  1 +
 qapi/block-core.json       | 17 ++---
 7 files changed, 55 insertions(+), 9 deletions(-)

-- 
2.39.2
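The align-down behavior described in the cover letter can be sketched with a few lines of Python (a toy model of the alignment arithmetic, not the actual cbw_co_pdiscard_snapshot() code; the 4 MiB fleecing granularity is an assumed example value):

```python
def align_up(x, granularity):
    return ((x + granularity - 1) // granularity) * granularity

def align_down(x, granularity):
    return (x // granularity) * granularity

def discarded_range(offset, length, granularity):
    # models the align-down: shrink the request to granularity-aligned
    # boundaries; a request smaller than one granule vanishes entirely
    start = align_up(offset, granularity)
    end = align_down(offset + length, granularity)
    return (start, end - start) if end > start else None

KiB, MiB = 1024, 1024 * 1024
# block-copy works at 64 KiB, so discards to the source are 64 KiB wide:
assert discarded_range(0, 64 * KiB, 64 * KiB) == (0, 64 * KiB)
# but if the fleecing image uses 4 MiB clusters, the same request is
# aligned down to nothing and silently dropped:
assert discarded_range(0, 64 * KiB, 4 * MiB) is None
# forcing min-cluster-size=4M makes the copy (and discard) requests
# large enough again:
assert discarded_range(0, 4 * MiB, 4 * MiB) == (0, 4 * MiB)
```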
[PATCH v2 2/2] backup: add minimum cluster size to performance options
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has a smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore them. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * Use 'size' type in QAPI. block/backup.c| 2 +- block/copy-before-write.c | 8 block/copy-before-write.h | 1 + blockdev.c| 3 +++ qapi/block-core.json | 9 +++-- 5 files changed, 20 insertions(+), 3 deletions(-) diff --git a/block/backup.c b/block/backup.c index 3dd2e229d2..a1292c01ec 100644 --- a/block/backup.c +++ b/block/backup.c @@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, } cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source, - &bcs, errp); + perf->min_cluster_size, &bcs, errp); if (!cbw) { goto error; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index ef0bc4dcfe..183eed42e5 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -553,6 +553,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp) { @@ -572,6 +573,13 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, qdict_put_str(opts, "file", bdrv_get_node_name(source)); qdict_put_str(opts, "target", bdrv_get_node_name(target)); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %lu > %ld", + min_cluster_size, INT64_MAX); +return NULL; +} +qdict_put_int(opts, "min-cluster-size", (int64_t)min_cluster_size); + top =
bdrv_insert_node(source, opts, flags, errp); if (!top) { return NULL; diff --git a/block/copy-before-write.h b/block/copy-before-write.h index 01af0cd3c4..2a5d4ba693 100644 --- a/block/copy-before-write.h +++ b/block/copy-before-write.h @@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source, BlockDriverState *target, const char *filter_node_name, bool discard_source, + uint64_t min_cluster_size, BlockCopyState **bcs, Error **errp); void bdrv_cbw_drop(BlockDriverState *bs); diff --git a/blockdev.c b/blockdev.c index 835064ed03..6740663fda 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2655,6 +2655,9 @@ static BlockJob *do_backup_common(BackupCommon *backup, if (backup->x_perf->has_max_chunk) { perf.max_chunk = backup->x_perf->max_chunk; } +if (backup->x_perf->has_min_cluster_size) { +perf.min_cluster_size = backup->x_perf->min_cluster_size; +} } if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) || diff --git a/qapi/block-core.json b/qapi/block-core.json index 8fc0a4b234..f1219a9dfb 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -1551,11 +1551,16 @@ # it should not be less than job cluster size which is calculated # as maximum of target image cluster size and 64k. Default 0. # +# @min-cluster-size: Minimum size of blocks used by copy-before-write +# and background copy operations. Has to be a power of 2. No +# effect if smaller than the maximum of the target's cluster size +# and 64 KiB. Default 0. (Since 9.1) +# # Since: 6.0 ## { 'struct': 'BackupPerf', - 'data': { '*use-copy-range': 'bool', -'*max-workers': 'int', '*max-chunk': 'int64' } } + 'data': { '*use-copy-range': 'bool', '*max-workers': 'int', +'*max-chunk': 'int64', '*min-cluster-size': 'size' } } ## # @BackupCommon: -- 2.39.2
[PATCH v2 1/2] copy-before-write: allow specifying minimum cluster size
In the context of backup fleecing, discarding the source will not work when the fleecing image has a larger granularity than the one used for block-copy operations (can happen if the backup target has a smaller cluster size), because cbw_co_pdiscard_snapshot() will align down the discard requests and thus effectively ignore them. To make @discard-source work in such a scenario, allow specifying the minimum cluster size used for block-copy operations and thus in particular also the granularity for discard requests to the source. The type 'size' (corresponding to uint64_t in C) is used in QAPI to rule out negative inputs and for consistency with already existing @cluster-size parameters. Since block_copy_calculate_cluster_size() uses int64_t for its result, a check that the input is not too large is added in block_copy_state_new() before calling it. The calculation in block_copy_calculate_cluster_size() is done in the target int64_t type. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- Changes in v2: * Use 'size' type in QAPI. * Remove option in cbw_parse_options(), i.e. before parsing generic blockdev options. block/block-copy.c | 22 ++ block/copy-before-write.c | 10 +- include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 35 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..36eaecaaf4 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, "used.
If the actual block size of the target exceeds " "this default, the backup may be unusable", BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. */ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, (int64_t)BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, + (int64_t)MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size)); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + uint64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +368,18 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min_cluster_size > INT64_MAX) { +error_setg(errp, "min-cluster-size too large: %lu > %ld", + min_cluster_size, INT64_MAX); +return NULL; +} else if (min_cluster_size && !is_power_of_2(min_cluster_size)) { +error_setg(errp, "min-cluster-size needs to be a power of 2"); +return NULL; +} + +cluster_size = block_copy_calculate_cluster_size(target->bs, + (int64_t)min_cluster_size, + errp); if (cluster_size < 0) { return NULL; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index cd65524e26..ef0bc4dcfe 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -417,6 +417,7 @@ static BlockdevOptions *cbw_parse_options(QDict *options, Error **errp) qdict_extract_subqdict(options, NULL, "bitmap"); 
qdict_del(options, "on-cbw-error"); qdict_del(options, "cbw-timeout"); +qdict_del(options, "min-cluster-size"); out: visit_free(v); @@ -432,6 +433,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags, BDRVCopyBeforeWriteState *s = bs->opaque; BdrvDirtyBitmap *bitmap = NULL
[PATCH v4 4/5] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap, and the test was changed to use backing images and @sync-mode=write-blocking. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 adapt to changes to mirror bitmap interface rename test from '384' to 'mirror-bitmap' use backing files, copy-mode=write-blocking, larger cluster size] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/mirror-bitmap | 597 tests/qemu-iotests/tests/mirror-bitmap.out | 3191 2 files changed, 3788 insertions(+) create mode 100755 tests/qemu-iotests/tests/mirror-bitmap create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap new file mode 100755 index 00..37bbe0f241 --- /dev/null +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -0,0 +1,597 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. 
+# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 +IMAGE_CLUSTER_SIZE = 128 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing 
re-dirties the one cluster +# that copies out successfully before failure in Group #1. +Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits |= set(bits) + +def dirty_group(self, n): +self.
[PATCH v4 0/5] mirror: allow specifying working bitmap
Changes from v3 (discussion here [3]):
* Improve/fix QAPI documentation.

Changes from v2 (discussion here [2]):
* Cluster size caveats only apply to non-COW diff images, adapt the
  cluster size check and documentation accordingly.
* In the IO test, use backing files (rather than stand-alone diff
  images) in combination with copy-mode=write-blocking and larger
  cluster size for target images, to have a more realistic use-case and
  show that COW prevents ending up with clusters with partial data upon
  unaligned writes.
* Create a separate patch for replacing is_none_mode with sync_mode in
  the MirrorBlockJob struct.
* Disallow using a read-only bitmap (cannot be used as working bitmap).
* Require that the bitmap is enabled at the start of the job.
* Avoid IO test script potentially waiting on a non-existent job when
  the blockdev-mirror QMP command fails.
* Fix pylint issues in the IO test.
* Rename the IO test from sync-bitmap-mirror to mirror-bitmap.

Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop the bitmap-mode parameter and use the passed-in bitmap as the
  working bitmap instead. Users can get the desired behaviors by using
  the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see
  commit message in patch 2/4 for how exactly).
* Add patch to check whether the target image's cluster size is at most
  the mirror job's granularity. Optional, it's an extra safety check
  that's useful when the target is a "diff" image that does not have
  previously synced data.

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for
about 4 years now.
In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/5 actually contains yet another use case, namely doing incremental mirrors to qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. Initially, the qcow2 diff image targets were stand-alone and that suffers from an issue when 'write-blocking' mode is used. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The test thus uses a more realistic approach where the qcow2 targets have backing images and a check is added in patch 5/5 for the cluster size for non-COW targets. However, with e.g. NBD, the cluster size cannot be queried and prohibiting all bitmap mirrors to NBD targets just to prevent the highly specific edge case seems not worth it, so the limitation is rather documented and the check ignores cases where querying the target image's cluster size returns -ENOTSUP. 
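The partial-cluster pitfall described above can be made concrete with a small toy model (assumed sizes; not actual qcow2 allocation logic): an unaligned write copied into a stand-alone, non-COW diff image allocates a whole cluster, and after rebasing, the unwritten half of that cluster shadows the backing file.

```python
CLUSTER = 128 * 1024  # assumed diff-image cluster size (mirror granularity 64K)

def read_after_rebase(diff_clusters, backing, offset):
    # after rebasing the diff onto the backing image: an allocated
    # cluster in the diff shadows the backing file entirely
    cluster = offset // CLUSTER
    if cluster in diff_clusters:
        return diff_clusters[cluster].get(offset, 0)  # zeros where unwritten
    return backing.get(offset, 0)

backing = {64 * 1024: 0xAA}      # pre-existing data in the second 64K half
diff_clusters = {0: {0: 0xBB}}   # mirror copied an unaligned 64K write at
                                 # offset 0: the whole 128K cluster allocates

assert read_after_rebase(diff_clusters, backing, 0) == 0xBB
# the untouched half of the cluster now reads as zeros instead of the
# backing data, even though the backing file still holds it:
assert read_after_rebase(diff_clusters, backing, 64 * 1024) == 0
```

With a backing file attached instead, the partial write triggers copy-on-write of the rest of the cluster, which is why the IO test switched to backing images.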
[0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ [2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/ [3]: https://lore.kernel.org/qemu-devel/20240510131647.1256467-1-f.eb...@proxmox.com/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (3): qapi/block-core: avoid the re-use of MirrorSyncMode for backup block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c | 18 +- block/mirror.c | 101 +- block/monitor/block-hmp-cmds.c |2 +- block/replication.c|2 +- blockdev.c | 127 +- include/block/block_int-global-state.h |7 +- qapi/block-core.json | 62 +- tests/qemu-iotests/tests/mirror-bitmap | 603 tests/qemu-iotests/tests/mirror-bitmap.out | 3198 tests/unit/test-block-iothread.c |2 +- 10 files changed, 4053 insertions(+), 69 deletions(-) create mode 100755 tests/qemu-iotes
[PATCH v4 3/5] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a non-COW "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when mirror still got work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limitation in QAPI. 
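The three emulated @BitmapSyncMode flavours can be sketched in terms of their end states (a set-based Python sketch of my reading of the description above; sets stand in for dirty bitmaps, and set copies/clears stand in for block-dirty-bitmap-merge/-clear):

```python
def mirror_with_bitmap(work):
    # the job uses `work` as its working bitmap, clearing what it copies
    # (simplified: a successful job copies and clears everything)
    work.clear()

def sync(mode, user_bitmap, success=True):
    if mode == "always":
        work = user_bitmap            # pass the caller's bitmap directly
    else:
        work = set(user_bitmap)       # pass a copy to the job
    if success:
        mirror_with_bitmap(work)
        if mode == "on-success":
            user_bitmap.clear()       # reflect the success in the original
    return user_bitmap

assert sync("always", {1, 2, 3}) == set()                     # consumed
assert sync("always", {1, 2, 3}, success=False) == {1, 2, 3}  # resumable
assert sync("never", {1, 2, 3}) == {1, 2, 3}                  # untouched
assert sync("on-success", {1, 2, 3}) == set()
assert sync("on-success", {1, 2, 3}, success=False) == {1, 2, 3}
```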
This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.1 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster --- block/mirror.c | 80 +- blockdev.c | 44 +++--- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 35 ++- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 141 insertions(+), 25 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index ca23d6ef65..d3d0698116 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). + */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -820,6 +829,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. 
*/ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (!s->dirty_bitmap_is_local && s->dirty_bitmap) { +bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE && s->dirty_bitmap_is_local) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(&s->common.job)) { goto immediate_exit; @@ -1029,6 +1048,14 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) */ mirror_top_opaque->job = s; +
[PATCH v4 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). A consequence is also tighter introspection: query-qmp-schema no longer reports drive-mirror and blockdev-mirror accepting @sync values they actually reject. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster Reviewed-by: Vladimir Sementsov-Ogievskiy --- block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == _job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We 
can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 1bdce3b657..c0597039a5 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == 
MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index d954bec6f1..9633d
[PATCH v4 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
It is more flexible and is done in preparation to support specifying a working bitmap for mirror jobs. In particular, this makes it possible to assert that @sync_mode=full when a bitmap is used. That assertion is just to be sure, of course the mirror QMP commands will be made to fail earlier with a clean error. Signed-off-by: Fiona Ebner --- block/mirror.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c0597039a5..ca23d6ef65 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? 
src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(>common.job)) { goto immediate_exit; @@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name, bool is_mirror, MirrorCopyMode copy_mode, Error **errp) @@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job( s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; -s->is_none_mode = is_none_mode; +s->sync_mode = sync_mode; s->backing_mode = backing_mode; s->zero_target = zero_target; qatomic_set(>copy_mode, copy_mode); @@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState *bs, bool unmap, const char *filter_node_name, MirrorCopyMode copy_mode, Error **errp) { -bool is_none_mode; BlockDriverState *base; GLOBAL_STATE_CODE(); bdrv_graph_rdlock_main_loop(); -is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? 
bdrv_backing_chain_next(bs) : NULL; bdrv_graph_rdunlock_main_loop(); mirror_start_job(job_id, bs, creation_flags, target, replaces, speed, granularity, buf_size, backing_mode, zero_target, on_source_error, on_target_error, unmap, NULL, NULL, - &mirror_job_driver, is_none_mode, base, false, + &mirror_job_driver, mode, base, false, filter_node_name, true, copy_mode, errp); } @@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs, job_id, bs, creation_flags, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN, false, on_error, on_error, true, cb, opaque, - &commit_active_job_driver, false, base, auto_complete, - filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND, - errp); + &commit_active_job_driver, MIRROR_SYNC_MODE_FULL, base, + auto_complete, filter_node_name, false, + MIRROR_COPY_MODE_BACKGROUND, errp); if (!job) { goto error_restore_flags; } -- 2.39.2
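A minimal sketch (Python stand-ins, not QEMU types) of why carrying the full sync mode is more flexible than the old is_none_mode bool:

```python
from enum import Enum

class MirrorSyncMode(Enum):
    TOP = "top"
    FULL = "full"
    NONE = "none"

def choose_backing(sync_mode, src, base):
    # The same decision mirror_exit_common() makes: with sync=none the
    # target keeps the source as backing, otherwise the job's base node.
    return src if sync_mode is MirrorSyncMode.NONE else base

def bitmap_allowed(sync_mode, have_bitmap):
    # The enum lets later patches assert that a working bitmap is only
    # combined with sync=full; a bare is_none_mode bool cannot
    # distinguish full from top at this point.
    return not have_bitmap or sync_mode is MirrorSyncMode.FULL

assert choose_backing(MirrorSyncMode.NONE, "src", "base") == "src"
assert choose_backing(MirrorSyncMode.FULL, "src", "base") == "base"
```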
[PATCH v4 5/5] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target does not do COW and is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the job's granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for the special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- blockdev.c | 57 ++ tests/qemu-iotests/tests/mirror-bitmap | 6 +++ tests/qemu-iotests/tests/mirror-bitmap.out | 7 +++ 3 files changed, 70 insertions(+) diff --git a/blockdev.c b/blockdev.c index 4f72a72dc7..468974108e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp) blockdev_do_action(&action, errp); } +static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target, +BdrvDirtyBitmap *bitmap, +Error **errp) +{ +int ret; +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + +GLOBAL_STATE_CODE(); +GRAPH_RDLOCK_GUARD_MAINLOOP(); + +if (bdrv_backing_chain_next(target)) { +/* + * No need to worry about creating clusters with partial data when the + * target does COW. + */ +return 0; +} + +/* + * If there is no backing file on the target, we cannot rely on COW if our + * backup cluster size is smaller than the target cluster size. Even for + * targets with a backing file, try to avoid COW if possible.
+ */ +ret = bdrv_get_info(target, &bdi); +if (ret == -ENOTSUP) { +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to be small enough then. + */ +return 0; +} else if (ret < 0) { +error_setg_errno(errp, -ret, +"Couldn't determine the cluster size of the target image, " +"which has no backing file"); +return ret; +} + +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of the " + "target image's cluster size %u and the target image has " + "no backing file", + bitmap_granularity, bdi.cluster_size); +return -EINVAL; +} + +return 0; +} + + /* Parameter check and block job starting for drive mirroring. * Caller should hold @device and @target's aio context (must be the same). **/ @@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) { +return; +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { return; } diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap index 37bbe0f241..e8cd482a19 100755 --- a/tests/qemu-iotests/tests/mirror-bitmap +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -584,6 +584,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity to non-COW target --\n") +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=GRANULARITY) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/mirror-bitmap.out
b/tests/qemu-iotests/tests/mirror-bitmap.out index 5c8acc1d69..af605f3803 100644 --- a/tests/qemu-iotests/tests/mirror-bitmap.out +++ b/tests/qemu-iotests/tests/mirror-bitmap.out @@ -3189,3 +3189,10 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DI
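Restated outside of C, the new check accepts a bitmap granularity for a non-COW target only if each mirrored chunk fully covers the target clusters it allocates, i.e. the granularity is at least the target's cluster size and an exact multiple of it (illustrative helper, not QEMU code):

```python
def bitmap_granularity_ok(bitmap_granularity, target_cluster_size):
    """True if mirroring chunks of `bitmap_granularity` bytes can never
    leave a partially-filled cluster in a non-COW target image."""
    return (bitmap_granularity >= target_cluster_size
            and bitmap_granularity % target_cluster_size == 0)

assert bitmap_granularity_ok(64 * 1024, 64 * 1024)       # equal sizes
assert bitmap_granularity_ok(128 * 1024, 64 * 1024)      # larger multiple
assert not bitmap_granularity_ok(32 * 1024, 64 * 1024)   # too small
assert not bitmap_granularity_ok(96 * 1024, 64 * 1024)   # not a multiple
```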
Re: [PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1
Am 21.05.24 um 00:22 schrieb Fabiano Rosas: > Fiona Ebner writes: > >> Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine >> version 8.1 can fail with: >> >>> kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 >>> kvm: Failed to load virtio-net:virtio >>> kvm: error while loading state for instance 0x0 of device >>> ':00:12.0/virtio-net' >>> kvm: load of migration failed: Operation not permitted >> >> The series >> >> 53da8b5a99 virtio-net: Add support for USO features >> 9da1684954 virtio-net: Add USO flags to vhost support. >> f03e0cf63b tap: Add check for USO features >> 2ab0ec3121 tap: Add USO support to tap device. >> >> only landed in QEMU 8.2, so the compatibility flags should be part of >> machine version 8.1. >> >> Moving the flags unfortunately breaks forward migration with machine >> version 8.1 from a binary without this patch to a binary with this >> patch. >> >> Fixes: 53da8b5a99 ("virtio-net: Add support for USO features") >> Signed-off-by: Fiona Ebner > > Reviewed-by: Fabiano Rosas > > I'll get to it eventually, but is this another one where just having > -device virtio-net in the command line when testing cross-version > migration would already have caught the issue? > AFAIU, the guest kernel needs to be recent enough to support the feature too. I don't seem to run into the issue with a Debian 11 (kernel 5.10) guest, but I do run into the issue with an Ubuntu 23.10 (kernel 6.5) guest. Seems like it got added in kernel 6.2 with 418044e1de30 ("drivers/net/virtio_net.c: Added USO support.") Best Regards, Fiona
[PATCH] hw/core/machine: move compatibility flags for VirtIO-net USO to machine 8.1
Migration from an 8.2 or 9.0 binary to an 8.1 binary with machine version 8.1 can fail with: > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 > kvm: Failed to load virtio-net:virtio > kvm: error while loading state for instance 0x0 of device > ':00:12.0/virtio-net' > kvm: load of migration failed: Operation not permitted The series 53da8b5a99 virtio-net: Add support for USO features 9da1684954 virtio-net: Add USO flags to vhost support. f03e0cf63b tap: Add check for USO features 2ab0ec3121 tap: Add USO support to tap device. only landed in QEMU 8.2, so the compatibility flags should be part of machine version 8.1. Moving the flags unfortunately breaks forward migration with machine version 8.1 from a binary without this patch to a binary with this patch. Fixes: 53da8b5a99 ("virtio-net: Add support for USO features") Signed-off-by: Fiona Ebner --- hw/core/machine.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/hw/core/machine.c b/hw/core/machine.c index c7ceb11501..95051b80db 100644 --- a/hw/core/machine.c +++ b/hw/core/machine.c @@ -50,15 +50,15 @@ GlobalProperty hw_compat_8_1[] = { { "ramfb", "x-migrate", "off" }, { "vfio-pci-nohotplug", "x-ramfb-migrate", "off" }, { "igb", "x-pcie-flr-init", "off" }, +{ TYPE_VIRTIO_NET, "host_uso", "off"}, +{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, +{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, }; const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1); GlobalProperty hw_compat_8_0[] = { { "migration", "multifd-flush-after-each-section", "on"}, { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" }, -{ TYPE_VIRTIO_NET, "host_uso", "off"}, -{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, -{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, }; const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); -- 2.39.2
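The reasoning behind moving the flags can be sketched as follows: machine version N applies every hw_compat_M array with M >= N, so a property default belongs in the array named after the last machine version that should still see the old default. This toy model (names mirror the QEMU arrays, but the lookup logic is a simplification) shows that with the entries in hw_compat_8_1, machine 8.1 gets the USO properties turned off while machine 8.2 keeps the new defaults:

```python
# Simplified model of QEMU's cumulative compat-property arrays.
hw_compat = {
    "8.1": [("virtio-net", "host_uso", "off"),
            ("virtio-net", "guest_uso4", "off"),
            ("virtio-net", "guest_uso6", "off")],
    "8.0": [("migration", "multifd-flush-after-each-section", "on")],
}

def compat_props(machine_version):
    """Compat properties applied for a given machine version: all arrays
    named after that version or a newer one. (String comparison is fine
    for the two-component versions used here.)"""
    props = []
    for version in sorted(hw_compat, reverse=True):  # newest first
        if version >= machine_version:
            props.extend(hw_compat[version])
    return props

assert ("virtio-net", "host_uso", "off") in compat_props("8.1")
assert compat_props("8.2") == []  # 8.2 machines keep USO enabled
```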
Re: [PULL 04/17] virtio-net: Add support for USO features
Hi, Am 08.09.23 um 08:44 schrieb Jason Wang: > diff --git a/hw/core/machine.c b/hw/core/machine.c > index da699cf..230aab8 100644 > --- a/hw/core/machine.c > +++ b/hw/core/machine.c > @@ -38,6 +38,7 @@ > #include "exec/confidential-guest-support.h" > #include "hw/virtio/virtio.h" > #include "hw/virtio/virtio-pci.h" > +#include "hw/virtio/virtio-net.h" > > GlobalProperty hw_compat_8_1[] = {}; > const size_t hw_compat_8_1_len = G_N_ELEMENTS(hw_compat_8_1); > @@ -45,6 +46,9 @@ const size_t hw_compat_8_1_len = > G_N_ELEMENTS(hw_compat_8_1); > GlobalProperty hw_compat_8_0[] = { > { "migration", "multifd-flush-after-each-section", "on"}, > { TYPE_PCI_DEVICE, "x-pcie-ari-nextfn-1", "on" }, > +{ TYPE_VIRTIO_NET, "host_uso", "off"}, > +{ TYPE_VIRTIO_NET, "guest_uso4", "off"}, > +{ TYPE_VIRTIO_NET, "guest_uso6", "off"}, > }; > const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0); > unfortunately, this broke backwards migration with machine version 8.1 from 8.2 and 9.0 binaries to a 8.1 binary: > kvm: Features 0x1c0010130afffa7 unsupported. Allowed features: 0x10179bfffe7 > kvm: Failed to load virtio-net:virtio > kvm: error while loading state for instance 0x0 of device > ':00:12.0/virtio-net' > kvm: load of migration failed: Operation not permitted Since the series here only landed in 8.2, shouldn't these flags have been added to hw_compat_8_1[] instead? Attempting to fix it by moving the flags will break migration with machine version 8.1 between patched 9.0 and unpatched 9.0 however :( Is there anything that can be done or will it need to stay broken now? CC-ing the migration maintainers. Best Regards, Fiona
Re: [PATCH v4] hw/pflash: fix block write start
Am 16.05.24 um 10:46 schrieb Gerd Hoffmann: > Move the pflash_blk_write_start() call. We need the offset of the > first data write, not the offset for the setup (number-of-bytes) > write. Without this fix u-boot can do block writes to the first > flash block only. > > While being at it drop a leftover FIXME. > > Cc: qemu-sta...@nongnu.org > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2343 > Fixes: fcc79f2e0955 ("hw/pflash: implement update buffer for block writes") Just a minor thing I noticed: this is the commit in v8.1.5. The commit in v9.0.0 is 284a7ee2e2 ("hw/pflash: implement update buffer for block writes"). Best Regards, Fiona
Re: [PATCH v3 5/5] accel/tcg: Always call tcg_flush_jmp_cache() on reset
Hi, Am 03.05.24 um 14:34 schrieb Philippe Mathieu-Daudé: > In commit bb6cf6f016 ("accel/tcg: Factor tcg_cpu_reset_hold() out") > we unfortunately restricted the tcg_flush_jmp_cache() to system > emulation. Move it to the common tcg_exec_cpu_reset_hold() handler > so user emulation gets the jmp_cache initialized when threads > are created. > > Remove the NULL check in tcg_flush_jmp_cache() from commit 4e4fa6c12d > ("accel/tcg: Complete cpu initialization before registration") which > was a band-aid fix for incorrect commit bb6cf6f016. > AFAICT, commit 4e4fa6c12d was already present in v7.2.0, while commit bb6cf6f016 only later in v8.2.0. So is it really fine to remove the NULL check? Best Regards, Fiona
Re: [PATCH v4 0/3] Fix "virtio-gpu: fix scanout migration post-load"
Am 16.05.24 um 10:40 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > v2: > - use a manual version field test (instead of the more complex struct > variant) > > v3: > - introduce machine_check_version() > - drop the VMSD version, and use machine version field test > > v4: > - drop machine_check_version() approach > - property renamed to x-scanout-vmstate-version > > Marc-André Lureau (3): > migration: add "exists" info to load-state-field trace > migration: fix a typo > virtio-gpu: fix v2 migration > > include/hw/virtio/virtio-gpu.h | 1 + > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 24 > migration/vmstate.c| 7 --- > migration/trace-events | 2 +- > 5 files changed, 23 insertions(+), 12 deletions(-) > Reviewed-by: Fiona Ebner Tested-by: Fiona Ebner Tested the same things as for v1/v2, with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues Best Regards, Fiona
Re: [PATCH v2 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Am 13.05.24 um 15:21 schrieb Marc-André Lureau: > > Indeed, it needs: > > diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c > index 5de90bb62f..3a88eb5e3a 100644 > --- a/hw/display/virtio-gpu.c > +++ b/hw/display/virtio-gpu.c > @@ -1201,7 +1201,7 @@ static const VMStateDescription > vmstate_virtio_gpu_scanout = { > > static const VMStateDescription vmstate_virtio_gpu_scanouts = { > .name = "virtio-gpu-scanouts", > -.version_id = 1, > +.version_id = 2, > > Thanks! With that on top: Tested-by: Fiona Ebner Tested with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues Best Regards, Fiona
Re: [PATCH 1/2] copy-before-write: allow specifying minimum cluster size
Am 26.03.24 um 10:06 schrieb Markus Armbruster: >> @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, >> BdrvChild *target, >> >> GLOBAL_STATE_CODE(); >> >> -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); >> +if (min_cluster_size && !is_power_of_2(min_cluster_size)) { > > min_cluster_size is int64_t, is_power_of_2() takes uint64_t. Bad if > min_cluster_size is negative. Could this happen? > No, because it comes in as a uint32_t via the QAPI (the internal caller added by patch 2/2 from the backup code also gets the value via QAPI and there uint32_t is used too). ---snip--- >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 0a72c590a8..85c8f88f6e 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -4625,12 +4625,18 @@ >> # @on-cbw-error parameter will decide how this failure is handled. >> # Default 0. (Since 7.1) >> # >> +# @min-cluster-size: Minimum size of blocks used by copy-before-write >> +# operations. Has to be a power of 2. No effect if smaller than >> +# the maximum of the target's cluster size and 64 KiB. Default 0. >> +# (Since 9.0) >> +# >> # Since: 6.2 >> ## >> { 'struct': 'BlockdevOptionsCbw', >>'base': 'BlockdevOptionsGenericFormat', >>'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap', >> -'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32' } } >> +'*on-cbw-error': 'OnCbwError', '*cbw-timeout': 'uint32', >> +'*min-cluster-size': 'uint32' } } > > Elsewhere in the schema, we use either 'int' or 'size' for cluster-size. > Why the difference? > The motivation was to disallow negative values up front and have it work with block_copy_calculate_cluster_size(), whose result is an int64_t. If I go with 'int', I'll have to add a check to disallow negative values. If I go with 'size', I'll have to add a check to disallow too large values. Which approach should I go with? Best Regards, Fiona
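The signedness concern Markus raises can be made concrete. In C, passing a negative int64_t to a function taking uint64_t converts the value modulo 2**64, and for INT64_MIN the result is 2**63, which actually is a power of two, so a negative value could slip through such a check if one could ever reach it. A Python sketch of the C conversion (QEMU's real is_power_of_2() lives in include/qemu/host-utils.h; this is a stand-in):

```python
def to_uint64(value):
    """C-style conversion of a signed 64-bit value to uint64_t."""
    return value % (1 << 64)

def is_power_of_2(value):
    # Same bit trick as the C helper, applied to the converted value.
    return value != 0 and (value & (value - 1)) == 0

INT64_MIN = -(1 << 63)
assert to_uint64(INT64_MIN) == 1 << 63
assert is_power_of_2(to_uint64(INT64_MIN))   # -2**63 masquerades as 2**63
assert not is_power_of_2(to_uint64(-3))      # most negatives do not
```

As Fiona notes, the value arrives via a uint32 QAPI type in this series, so a negative value cannot actually occur here.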
Re: [PATCH v2 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Hi, Am 13.05.24 um 09:19 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > v2: > - use a manual version field test (instead of the more complex struct > variant) > Unfortunately, when creating a snapshot with machine type pc-i440fx-9.0 and trying to load it afterwards (both times with patches on top of current master), it'll fail with: > qemu-system-x86_64: virtio-gpu-scanouts: incoming version_id 2 is too new for > local version_id 1 > qemu-system-x86_64: Missing section footer for :00:02.0/virtio-gpu > qemu-system-x86_64: Error -22 while loading VM state Is there a bump to virtio-gpu-scanouts' version_id missing? Best Regards, Fiona
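The quoted failure is the standard vmstate version gate: a loader rejects an incoming section whose version_id is newer than the local VMSD declares, which is why bumping the version on the save side without bumping .version_id on the load side breaks snapshot loading. A simplified sketch of that gate (the real logic is in migration/vmstate.c):

```python
def vmstate_load_accepts(local_version_id, incoming_version_id,
                         minimum_version_id=1):
    """Simplified model of the version check done when loading a
    vmstate section."""
    if incoming_version_id > local_version_id:
        # "incoming version_id N is too new for local version_id M"
        return False
    if incoming_version_id < minimum_version_id:
        return False
    return True

# A save side writing version 2 needs the load side to declare >= 2:
assert not vmstate_load_accepts(local_version_id=1, incoming_version_id=2)
assert vmstate_load_accepts(local_version_id=2, incoming_version_id=2)
```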
[PATCH v3 5/5] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target does not do COW and is a diff image, i.e. one that should only contain the delta and was not synced to previously, a too large cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the job's granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for the special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- Changes in v3: * detect when the target does COW and do not error out in that case * treat ENOTSUP differently from other failures when querying the cluster size blockdev.c | 57 ++ tests/qemu-iotests/tests/mirror-bitmap | 6 +++ tests/qemu-iotests/tests/mirror-bitmap.out | 7 +++ 3 files changed, 70 insertions(+) diff --git a/blockdev.c b/blockdev.c index 4f72a72dc7..468974108e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2769,6 +2769,59 @@ void qmp_blockdev_backup(BlockdevBackup *backup, Error **errp) blockdev_do_action(&action, errp); } +static int blockdev_mirror_check_bitmap_granularity(BlockDriverState *target, +BdrvDirtyBitmap *bitmap, +Error **errp) +{ +int ret; +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + +GLOBAL_STATE_CODE(); +GRAPH_RDLOCK_GUARD_MAINLOOP(); + +if (bdrv_backing_chain_next(target)) { +/* + * No need to worry about creating clusters with partial data when the + * target does COW.
+ */ +return 0; +} + +/* + * If there is no backing file on the target, we cannot rely on COW if our + * backup cluster size is smaller than the target cluster size. Even for + * targets with a backing file, try to avoid COW if possible. + */ +ret = bdrv_get_info(target, &bdi); +if (ret == -ENOTSUP) { +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to be small enough then. + */ +return 0; +} else if (ret < 0) { +error_setg_errno(errp, -ret, +"Couldn't determine the cluster size of the target image, " +"which has no backing file"); +return ret; +} + +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of the " + "target image's cluster size %u and the target image has " + "no backing file", + bitmap_granularity, bdi.cluster_size); +return -EINVAL; +} + +return 0; +} + + /* Parameter check and block job starting for drive mirroring. * Caller should hold @device and @target's aio context (must be the same).
**/ @@ -2863,6 +2916,10 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +if (blockdev_mirror_check_bitmap_granularity(target, bitmap, errp)) { +return; +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { return; } diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap index 37bbe0f241..e8cd482a19 100755 --- a/tests/qemu-iotests/tests/mirror-bitmap +++ b/tests/qemu-iotests/tests/mirror-bitmap @@ -584,6 +584,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity to non-COW target --\n") +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=GRANULARITY) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/mirror-bitmap.out b/tests/qemu-iotests/tests/mirror-bitmap.out index 5c8acc1d69..af605f3803 100644 --- a/tests/qem
[PATCH v3 0/5] mirror: allow specifying working bitmap
Changes from v2 (discussion here [2]):
* Cluster size caveats only apply to non-COW diff images, adapt the cluster size check and documentation accordingly.
* In the IO test, use backing files (rather than stand-alone diff images) in combination with copy-mode=write-blocking and larger cluster size for target images, to have a more realistic use-case and show that COW prevents ending up with clusters with partial data upon unaligned writes.
* Create a separate patch for replacing is_none_mode with sync_mode in the MirrorBlockJob struct.
* Disallow using a read-only bitmap (cannot be used as working bitmap).
* Require that the bitmap is enabled at the start of the job.
* Avoid IO test script potentially waiting on non-existent job when the blockdev-mirror QMP command fails.
* Fix pylint issues in IO test.
* Rename IO test from sync-bitmap-mirror to mirror-bitmap.
Changes from RFC/v1 (discussion here [0]):
* Add patch to split BackupSyncMode and MirrorSyncMode.
* Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly).
* Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data.
Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously mirrored via some external mechanism (e.g. ZFS replication).
We are using the last one in production without any issues for about 4 years now.
In particular, like mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/5 actually contains yet another use case, namely doing incremental mirrors to qcow2 "diff" images, that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. Initially, the qcow2 diff image targets were stand-alone and that suffers from an issue when 'write-blocking' mode is used. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. The test thus uses a more realistic approach where the qcow2 targets have backing images and a check is added in patch 5/5 for the cluster size for non-COW targets. However, with e.g. NBD, the cluster size cannot be queried and prohibiting all bitmap mirrors to NBD targets just to prevent the highly specific edge case seems not worth it, so the limitation is rather documented and the check ignores cases where querying the target image's cluster size returns -ENOTSUP. 
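The "masking" failure described above can be reproduced with a toy model: a diff target with large clusters receives an unaligned write, which allocates a whole zero-filled cluster (no COW from the source), and rebasing then lets that allocated cluster hide the base image's data. All names and sizes below are illustrative, not QEMU internals:

```python
CLUSTER = 8  # bytes per target cluster, tiny for the demo

def rebase(diff, base):
    """Allocated diff clusters take precedence over base data,
    as in a qcow2 rebase."""
    out = bytearray(base)
    for idx, data in diff.items():
        out[idx * CLUSTER:(idx + 1) * CLUSTER] = data
    return bytes(out)

base = b"A" * 16  # two clusters of real data in the base image

# Unaligned 2-byte write at offset 4: the non-COW diff target allocates
# all of cluster 0, zero-filled outside the written range.
cluster = bytearray(CLUSTER)
cluster[4:6] = b"BB"
diff = {0: bytes(cluster)}

merged = rebase(diff, base)
# Bytes 0..3 of cluster 0 now read back as zeroes instead of "A":
# the partially-filled cluster masked the base image's data.
assert merged[:4] == b"\x00\x00\x00\x00" and merged[4:6] == b"BB"
assert merged[8:] == b"A" * 8  # the untouched cluster survives
```

With a backing file on the target, COW would have filled the rest of cluster 0 from the source before the write, which is what the IO test's backing-file setup demonstrates.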
[0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c
[1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/
[2]: https://lore.kernel.org/qemu-devel/20240307134711.709816-1-f.eb...@proxmox.com/

Fabian Grünbichler (1):
  iotests: add test for bitmap mirror

Fiona Ebner (3):
  qapi/block-core: avoid the re-use of MirrorSyncMode for backup
  block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
  blockdev: mirror: check for target's cluster size when using bitmap

John Snow (1):
  mirror: allow specifying working bitmap

 block/backup.c                             |   18 +-
 block/mirror.c                             |  101 +-
 block/monitor/block-hmp-cmds.c             |    2 +-
 block/replication.c                        |    2 +-
 blockdev.c                                 |  127 +-
 include/block/block_int-global-state.h     |    7 +-
 qapi/block-core.json                       |   64 +-
 tests/qemu-iotests/tests/mirror-bitmap     |  603
 tests/qemu-iotests/tests/mirror-bitmap.out | 3198
 tests/unit/test-block-iothread.c           |    2 +-
 10 files changed, 4055 insertions(+), 69 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

-- 
2.39.2
[PATCH v3 1/5] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). A consequence is also tighter introspection: query-qmp-schema no longer reports drive-mirror and blockdev-mirror accepting @sync values they actually reject. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner Acked-by: Markus Armbruster Reviewed-by: Vladimir Sementsov-Ogievskiy --- Changes in v3: * add comment about introspection to commit message as suggested by Markus block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c| 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == _job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == 
MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 1bdce3b657..c0597039a5 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2013,13 +2013,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - 
MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; dif
[PATCH v3 4/5] iotests: add test for bitmap mirror
From: Fabian Grünbichler

heavily based on/practically forked off iotest 257 for bitmap backups, but:

- no writes to filter node 'mirror-top' between completion and finalization,
  as those seem to deadlock?
- extra set of reference/test mirrors to verify that writes in parallel with
  active mirror work

Intentionally keeping copyright and ownership of original test case to honor
provenance.

The test was originally adapted by Fabian from 257, but has seen rather big
changes, because the interface for mirror with bitmap was changed, i.e.
there is no @bitmap-mode parameter anymore and the bitmap is used as the
working bitmap, and the test was changed to use backing images and
@sync-mode=write-blocking.

Signed-off-by: Fabian Grünbichler
Signed-off-by: Thomas Lamprecht
[FE: rebase for 9.1
     adapt to changes to mirror bitmap interface
     rename test from '384' to 'mirror-bitmap'
     use backing files, copy-mode=write-blocking, larger cluster size]
Signed-off-by: Fiona Ebner
---
Changes in v3:
* avoid script potentially waiting on non-existent job when the
  blockdev-mirror QMP command fails by asserting that there is no error when
  none is expected.
* fix pylint issues
* rename test from sync-bitmap-mirror to mirror-bitmap
* use backing files (rather than stand-alone diff images) in combination
  with copy-mode=write-blocking and larger cluster size for target images,
  to have a more realistic use case and show that COW prevents ending up
  with clusters containing partial data upon unaligned writes

 tests/qemu-iotests/tests/mirror-bitmap     |  597
 tests/qemu-iotests/tests/mirror-bitmap.out | 3191
 2 files changed, 3788 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-bitmap
 create mode 100644 tests/qemu-iotests/tests/mirror-bitmap.out

diff --git a/tests/qemu-iotests/tests/mirror-bitmap b/tests/qemu-iotests/tests/mirror-bitmap
new file mode 100755
index 00..37bbe0f241
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-bitmap
@@ -0,0 +1,597 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test bitmap-sync mirrors (incremental, differential, and partials)
+#
+# Copyright (c) 2019 John Snow for Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# owner=js...@redhat.com

+import os
+
+import iotests
+from iotests import log, qemu_img
+
+SIZE = 64 * 1024 * 1024
+GRANULARITY = 64 * 1024
+IMAGE_CLUSTER_SIZE = 128 * 1024
+
+
+class Pattern:
+    def __init__(self, byte, offset, size=GRANULARITY):
+        self.byte = byte
+        self.offset = offset
+        self.size = size
+
+    def bits(self, granularity):
+        lower = self.offset // granularity
+        upper = (self.offset + self.size - 1) // granularity
+        return set(range(lower, upper + 1))
+
+
+class PatternGroup:
+    """Grouping of Pattern objects. Initialize with an iterable of Patterns."""
+    def __init__(self, patterns):
+        self.patterns = patterns
+
+    def bits(self, granularity):
+        """Calculate the unique bits dirtied by this pattern grouping"""
+        res = set()
+        for pattern in self.patterns:
+            res |= pattern.bits(granularity)
+        return res
+
+
+GROUPS = [
+    PatternGroup([
+        # Batch 0: 4 clusters
+        Pattern('0x49', 0x0000000),
+        Pattern('0x6c', 0x0100000),   # 1M
+        Pattern('0x6f', 0x2000000),   # 32M
+        Pattern('0x76', 0x3ff0000)]), # 64M - 64K
+    PatternGroup([
+        # Batch 1: 6 clusters (3 new)
+        Pattern('0x65', 0x0000000),   # Full overwrite
+        Pattern('0x77', 0x00f8000),   # Partial-left (1M-32K)
+        Pattern('0x72', 0x2008000),   # Partial-right (32M+32K)
+        Pattern('0x69', 0x3fe0000)]), # Adjacent-left (64M - 128K)
+    PatternGroup([
+        # Batch 2: 7 clusters (3 new)
+        Pattern('0x74', 0x0010000),   # Adjacent-right
+        Pattern('0x69', 0x00e8000),   # Partial-left (1M-96K)
+        Pattern('0x6e', 0x2018000),   # Partial-right (32M+96K)
+        Pattern('0x67', 0x3fe0000,
+                2*GRANULARITY)]),     # Overwrite [(64M-128K)-64M)
+    PatternGroup([
+        # Batch 3: 8 clusters (5 new)
+        # Carefully chosen such that nothing re-dirties the one cluster
+        # that copies out successfully before failure in Group #1.
+        Pattern('0xaa', 0x0010000,
+                3*GRAN
[PATCH v3 3/5] mirror: allow specifying working bitmap
From: John Snow

for the mirror job. The bitmap's granularity is used as the job's
granularity.

The new @bitmap parameter is marked unstable in the QAPI and can currently
only be used for @sync=full mode.

Clusters initially dirty in the bitmap as well as new writes are copied to
the target.

Using the block-dirty-bitmap-clear and block-dirty-bitmap-merge API, callers
can simulate the three kinds of @BitmapSyncMode (which is used by backup):
1. always: default, just pass bitmap as working bitmap.
2. never: copy bitmap and pass copy to the mirror job.
3. on-success: copy bitmap and pass copy to the mirror job and if
   successful, merge bitmap into original afterwards.

When the target image is a non-COW "diff image", i.e. one that was not used
as the target of a previous mirror and the target image's cluster size is
larger than the bitmap's granularity, or when @copy-mode=write-blocking is
used, there is a pitfall, because the cluster in the target image will be
allocated, but not contain all the data corresponding to the same region in
the source image.

An idea to avoid the limitation would be to mark clusters which are affected
by unaligned writes and are not allocated in the target image dirty, so they
would be copied fully later. However, for migration, the invariant that an
actively synced mirror stays actively synced (unless an error happens) is
useful, because without that invariant, migration might inactivate block
devices while the mirror job still has work to do and run into an assertion
failure [0].

Another approach would be to read the missing data from the source upon
unaligned writes to be able to write the full target cluster instead. But
certain targets like NBD do not allow querying the cluster size. To avoid
limiting/breaking the use case of syncing to an existing target, which is
arguably more common than the diff image use case, document the limitation
in QAPI.
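The three emulations listed above can be sketched as a small planning helper. This is an illustration only: the "-copy" bitmap name and the tuple-based command representation are invented, and the direction of the final merge follows the enumeration in the commit message literally.

```python
# Sketch: emulate backup's three BitmapSyncMode behaviors on top of the
# working-bitmap mirror interface, using bitmap-add/merge as building blocks.

def plan(mode, bitmap):
    """Return (setup steps, working bitmap, post-success steps)."""
    if mode == "always":
        # default: just pass the bitmap itself as the working bitmap
        return [], bitmap, []
    copy = bitmap + "-copy"  # hypothetical name for the temporary bitmap
    setup = [("block-dirty-bitmap-add", copy),
             ("block-dirty-bitmap-merge", bitmap, copy)]  # copy <- bitmap
    post = []
    if mode == "on-success":
        # if the job succeeded, merge the working copy back into the original
        post = [("block-dirty-bitmap-merge", copy, bitmap)]
    return setup, copy, post

for mode in ("always", "never", "on-success"):
    print(mode, plan(mode, "bitmap0"))
```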
This patch was originally based on one by Ma Haocong, but it has since been
modified pretty heavily, first by John and then again by Fiona.

[0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/

Suggested-by: Ma Haocong
Signed-off-by: Ma Haocong
Signed-off-by: John Snow
[FG: switch to bdrv_dirty_bitmap_merge_internal]
Signed-off-by: Fabian Grünbichler
Signed-off-by: Thomas Lamprecht
[FE: rebase for 9.1
     get rid of bitmap mode parameter
     use caller-provided bitmap as working bitmap
     turn bitmap parameter experimental]
Signed-off-by: Fiona Ebner
---
Changes in v3:
* remove duplicate "use" in QAPI description
* clarify that the cluster size caveats only apply to non-COW diff images
* split changing is_none_mode to sync_mode in the job struct into a separate
  patch
* use the shorter sync_mode != none rather than sync_mode == top || full in
  an if condition
* also disallow read-only bitmaps (cannot be used as working bitmap)
* require that the bitmap is enabled at the start of the job

 block/mirror.c                         | 80 +-
 blockdev.c                             | 44 +++---
 include/block/block_int-global-state.h |  5 +-
 qapi/block-core.json                   | 37 +++-
 tests/unit/test-block-iothread.c       |  2 +-
 5 files changed, 143 insertions(+), 25 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ca23d6ef65..d3d0698116 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -73,6 +73,11 @@ typedef struct MirrorBlockJob {
     size_t buf_size;
     int64_t bdev_length;
     unsigned long *cow_bitmap;
+    /*
+     * Whether the bitmap is created locally or provided by the caller (for
+     * incremental sync).
+ */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -691,7 +696,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -820,6 +829,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (!s->dirty_bitmap_is_local && s->dirty_bitmap) { +bdrv_dirty_bitmap_set_busy(s->dirty_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1016,7 +1035,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (s->sync
[PATCH v3 2/5] block/mirror: replace is_none_mode with sync_mode in MirrorBlockJob struct
It is more flexible and is done in preparation to support specifying a working bitmap for mirror jobs. In particular, this makes it possible to assert that @sync_mode=full when a bitmap is used. That assertion is just to be sure, of course the mirror QMP commands will be made to fail earlier with a clean error. Signed-off-by: Fiona Ebner --- New in v3. block/mirror.c | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index c0597039a5..ca23d6ef65 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -722,7 +722,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? 
src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -1015,7 +1016,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if (s->sync_mode != MIRROR_SYNC_MODE_NONE) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(>common.job)) { goto immediate_exit; @@ -1714,7 +1715,8 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name, bool is_mirror, MirrorCopyMode copy_mode, Error **errp) @@ -1871,7 +1873,7 @@ static BlockJob *mirror_start_job( s->replaces = g_strdup(replaces); s->on_source_error = on_source_error; s->on_target_error = on_target_error; -s->is_none_mode = is_none_mode; +s->sync_mode = sync_mode; s->backing_mode = backing_mode; s->zero_target = zero_target; qatomic_set(>copy_mode, copy_mode); @@ -2008,20 +2010,18 @@ void mirror_start(const char *job_id, BlockDriverState *bs, bool unmap, const char *filter_node_name, MirrorCopyMode copy_mode, Error **errp) { -bool is_none_mode; BlockDriverState *base; GLOBAL_STATE_CODE(); bdrv_graph_rdlock_main_loop(); -is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == MIRROR_SYNC_MODE_TOP ? 
bdrv_backing_chain_next(bs) : NULL; bdrv_graph_rdunlock_main_loop(); mirror_start_job(job_id, bs, creation_flags, target, replaces, speed, granularity, buf_size, backing_mode, zero_target, on_source_error, on_target_error, unmap, NULL, NULL, - _job_driver, is_none_mode, base, false, + _job_driver, mode, base, false, filter_node_name, true, copy_mode, errp); } @@ -2049,9 +2049,9 @@ BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs, job_id, bs, creation_flags, base, NULL, speed, 0, 0, MIRROR_LEAVE_BACKING_CHAIN, false, on_error, on_error, true, cb, opaque, - _active_job_driver, false, base, auto_complete, - filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND, - errp); + _active_job_driver, MIRROR_SYNC_MODE_FULL, base, + auto_complete, filter_node_name, false, + MIRROR_COPY_MODE_BACKGROUND, errp); if (!job) { goto error_restore_flags; } -- 2.39.2
Re: [PATCH v2 2/4] mirror: allow specifying working bitmap
Am 07.05.24 um 14:15 schrieb Fiona Ebner: > Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy: >> On 07.03.24 16:47, Fiona Ebner wrote: >>> +# @bitmap: The name of a bitmap to use as a working bitmap for >>> +# sync=full mode. This argument must be not be present for other >>> +# sync modes and not at the same time as @granularity. The >>> +# bitmap's granularity is used as the job's granularity. When >>> +# the target is a diff image, i.e. one that should only contain >>> +# the delta and was not synced to previously, the target's >>> +# cluster size must not be larger than the bitmap's granularity. >> >> Could we check this? Like in block_copy_calculate_cluster_size(), we can >> check if target does COW, and if not, we can check that we are safe with >> granularity. >> > > The issue here is (in particular) present when the target does COW, i.e. > in qcow2 diff images, allocated clusters which end up with partial data, > when we don't have the right cluster size. Patch 4/4 adds the check for > the target's cluster size. > Sorry, no. What I said is wrong. It's just that the test does something very pathological and does not even use COW/backing files. All the mirror targets are separate diff images there. So yes, we can do the same as block_copy_calculate_cluster_size() and the issue only appears in the same edge cases as for backup where we can error out early. This also applies to copy-mode=write-blocking AFAICT. >>> +# For a diff image target, using copy-mode=write-blocking should >>> +# not be used, because unaligned writes will lead to allocated >>> +# clusters with partial data in the target image! >> >> Could this be checked? >> > > I don't think so. How should we know if the target already contains data > from a previous full sync or not? 
> > Those caveats when using diff images are unfortunate, and users should > be warned about them of course, but the main/expected use case for the > feature is to sync to the same target multiple times, so I'd hope the > cluster size check in patch 4/4 and mentioning the edge cases in the > documentation is enough here. >
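The arithmetic behind the partial-data pitfall discussed in this thread can be sketched as follows. This is an illustration, not QEMU code; the numbers match the IO test (64 KiB bitmap granularity, 128 KiB target image clusters), and the alignment rule models a non-COW target that must allocate whole clusters.

```python
# Sketch: an unaligned (relative to the target cluster size) write into a
# diff image allocates a full target cluster that only partially holds data,
# "hiding" the corresponding part of the base image after a rebase.

GRANULARITY = 64 * 1024          # bitmap/job granularity
TARGET_CLUSTER = 128 * 1024      # target image cluster size

def allocated_range(write_offset, write_size, cluster=TARGET_CLUSTER):
    """Byte range the target must allocate for a write (cluster-aligned)."""
    start = (write_offset // cluster) * cluster
    end = -(-(write_offset + write_size) // cluster) * cluster  # round up
    return start, end

# A 64 KiB write at offset 64 KiB touches only half of target cluster 0,
# but the whole 128 KiB cluster becomes allocated:
start, end = allocated_range(64 * 1024, 64 * 1024)
stale = (end - start) - 64 * 1024  # bytes without data from the source
print(start, end, stale)
```

With COW/backing files (as the v3 test uses), the allocation would instead pull the missing bytes from the backing image, which is why the check only needs to cover non-COW targets.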
Re: [PATCH 0/4] Fix "virtio-gpu: fix scanout migration post-load"
Am 07.05.24 um 13:19 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > Hi, > > The aforementioned patch breaks virtio-gpu device migrations for versions > pre-9.0/9.0, both forwards and backwards. Versioning of `VMS_STRUCT` is more > complex than it may initially appear, as evidenced in the problematic commit > dfcf74fa68c ("virtio-gpu: fix scanout migration post-load"). > > To resolve this, we need to propagate the `vmstate` `version_id` through the > nested structures. Additionally, we should tie specific machine version to a > corresponding `version_id` to maintain migration compatibility. > > `VMS_VSTRUCT` allows specifying the appropriate version of the nested > structure > to use. > > Marc-André Lureau (4): > migration: add "exists" info to load-state-field trace > include/migration: add VMSTATE_VSTRUCT_TEST_VARRAY_UINT32 > virtio-gpu: use a VMState variant for the scanout field > virtio-gpu: add x-vmstate-version > > include/hw/virtio/virtio-gpu.h | 1 + > include/migration/vmstate.h| 12 > hw/core/machine.c | 1 + > hw/display/virtio-gpu.c| 28 +--- > migration/vmstate.c| 5 +++-- > migration/trace-events | 2 +- > 6 files changed, 39 insertions(+), 10 deletions(-) > Tested-by: Fiona Ebner Thank you for the fix! Tested with an Ubuntu 23.10 VM: Machine type pc-i440fx-8.2: 1. create snapshot with 8.2, load with patched 9.0 2. create snapshot with patched 9.0, load with patched 9.0 and with 8.2 Machine type pc-i440fx-9.0: 1. create snapshot with patched 9.0, load with patched 9.0 No crashes/failures and didn't notice any other issues :) Best Regards, Fiona
Re: [PATCH v2 2/4] mirror: allow specifying working bitmap
Am 02.04.24 um 22:14 schrieb Vladimir Sementsov-Ogievskiy: > On 07.03.24 16:47, Fiona Ebner wrote: >> diff --git a/block/mirror.c b/block/mirror.c >> index 1609354db3..5c9a00b574 100644 >> --- a/block/mirror.c >> +++ b/block/mirror.c >> @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { >> BlockDriverState *to_replace; >> /* Used to block operations on the drive-mirror-replace target */ >> Error *replace_blocker; >> - bool is_none_mode; >> + MirrorSyncMode sync_mode; > > Could you please split this change to separate preparation patch? > Will do. >> + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, >> errp)) { > > Why allow read-only bitmaps? > Good catch! This is a left-over from an earlier version. Now that the bitmap shall be used as the working bitmap, it cannot be read-only. I'll change it to BDRV_BITMAP_DEFAULT in v3 of the series. >> +# @bitmap: The name of a bitmap to use as a working bitmap for >> +# sync=full mode. This argument must be not be present for other >> +# sync modes and not at the same time as @granularity. The >> +# bitmap's granularity is used as the job's granularity. When >> +# the target is a diff image, i.e. one that should only contain >> +# the delta and was not synced to previously, the target's >> +# cluster size must not be larger than the bitmap's granularity. > > Could we check this? Like in block_copy_calculate_cluster_size(), we can > check if target does COW, and if not, we can check that we are safe with > granularity. > The issue here is (in particular) present when the target does COW, i.e. in qcow2 diff images, allocated clusters which end up with partial data, when we don't have the right cluster size. Patch 4/4 adds the check for the target's cluster size. >> +# For a diff image target, using copy-mode=write-blocking should >> +# not be used, because unaligned writes will lead to allocated >> +# clusters with partial data in the target image! > > Could this be checked? > I don't think so. 
How should we know if the target already contains data from a previous full sync or not? Those caveats when using diff images are unfortunate, and users should be warned about them of course, but the main/expected use case for the feature is to sync to the same target multiple times, so I'd hope the cluster size check in patch 4/4 and mentioning the edge cases in the documentation is enough here. >> The bitmap >> +# will be enabled after the job finishes. (Since 9.0) > > Hmm. That looks correct. At least for the case, when bitmap is enabled > at that start of job. Suggest to require this. > It's true for any provided bitmap: it will be disabled when the mirror job starts, because we manually set it in bdrv_mirror_top_do_write() and then in mirror_exit_common(), the bitmap will be enabled. Okay, I'll require that it is enabled at the beginning. >> +# >> # @granularity: granularity of the dirty bitmap, default is 64K if the >> # image format doesn't have clusters, 4K if the clusters are >> # smaller than that, else the cluster size. Must be a power of 2 >> @@ -2548,6 +2578,10 @@ >> # disappear from the query list without user intervention. >> # Defaults to true. (Since 3.1) >> # >> +# Features: >> +# >> +# @unstable: Member @bitmap is experimental. >> +# >> # Since: 2.6 > > Y_MODE_BACKGROUND, >> _abort); > > [..] > > Generally looks good to me. > Thank you for the review!
Re: [PULL 5/5] virtio-gpu: fix scanout migration post-load
Am 12.03.24 um 15:02 schrieb marcandre.lur...@redhat.com: > From: Marc-André Lureau > > The current post-loading code for scanout has a FIXME: it doesn't take > the resource region/rect into account. But there is more, when adding > blob migration support in commit f66767f75c9, I didn't realize that blob > resources could be used for scanouts. This situationn leads to a crash > during post-load, as they don't have an associated res->image. > > virtio_gpu_do_set_scanout() handle all cases, but requires the > associated virtio_gpu_framebuffer, which is currently not saved during > migration. > > Add a v2 of "virtio-gpu-one-scanout" with the framebuffer fields, so we > can restore blob scanouts, as well as fixing the existing FIXME. > > Signed-off-by: Marc-André Lureau > Reviewed-by: Sebastian Ott Hi, unfortunately, this broke migration from pre-9.0 to 9.0: > vmstate_load_state_field virtio-gpu:virtio-gpu > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.enable > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.conf.max_outputs > vmstate_load_state_field virtio-gpu-scanouts:parent_obj.scanout > vmstate_load_state_field virtio-gpu-one-scanout:resource_id > vmstate_load_state_field virtio-gpu-one-scanout:width > vmstate_load_state_field virtio-gpu-one-scanout:height > vmstate_load_state_field virtio-gpu-one-scanout:x > vmstate_load_state_field virtio-gpu-one-scanout:y > vmstate_load_state_field virtio-gpu-one-scanout:cursor.resource_id > vmstate_load_state_field virtio-gpu-one-scanout:cursor.hot_x > vmstate_load_state_field virtio-gpu-one-scanout:cursor.hot_y > vmstate_load_state_field virtio-gpu-one-scanout:cursor.pos.x > vmstate_load_state_field virtio-gpu-one-scanout:cursor.pos.y > vmstate_load_state_field virtio-gpu-one-scanout:fb.format > vmstate_load_state_field virtio-gpu-one-scanout:fb.bytes_pp > vmstate_load_state_field virtio-gpu-one-scanout:fb.width > vmstate_load_state_field virtio-gpu-one-scanout:fb.height > vmstate_load_state_field 
virtio-gpu-one-scanout:fb.stride > vmstate_load_state_field virtio-gpu-one-scanout:fb.offset > qemu-system-x86_64: Missing section footer for :00:02.0/virtio-gpu > qemu-system-x86_64: Error -22 while loading VM state It wrongly tries to load the fb fields even though they should be guarded by version 2. Looking at it with GDB, in vmstate_load_state(), when we come to field->name == parent_obj.scanout, the > } else if (field->flags & VMS_STRUCT) { > ret = vmstate_load_state(f, field->vmsd, curr_elem, > field->vmsd->version_id); branch will be taken and suddenly we'll have a call to vmstate_load_state() for vmsd==vmstate_virtio_gpu_scanout with version_id==2 rather than version_id==1, because that is field->vmsd->version_id (i.e. the .version_id in VMStateDescription vmstate_virtio_gpu_scanout). Would it have been necessary to version the VMStateDescription vmstate_virtio_gpu_scanouts too using VMS_VSTRUCT (or am I misinterpreting the use case for that)? Best Regards, Fiona
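The version-propagation behavior described above can be modeled with a toy loader. This is purely illustrative: the class names and field lists are invented and hugely simplified compared to QEMU's vmstate code; it only demonstrates why the nested structure's own version_id wins over the version the stream was saved with.

```python
# Toy model of vmstate_load_state() version handling (not QEMU code).

class VMSD:
    def __init__(self, name, version_id, fields):
        self.name, self.version_id, self.fields = name, version_id, fields

class Field:
    def __init__(self, name, version_id=1, vmsd=None):
        self.name, self.version_id, self.vmsd = name, version_id, vmsd

def load_state(vmsd, version_id, loaded):
    """Walk fields, honoring each field's version guard."""
    for f in vmsd.fields:
        if f.version_id > version_id:
            continue  # field is newer than the incoming stream: skip it
        if f.vmsd is not None:
            # VMS_STRUCT branch: recurse with the *nested* vmsd's own
            # version_id, not the version the caller was invoked with
            load_state(f.vmsd, f.vmsd.version_id, loaded)
        else:
            loaded.append(f.name)

scanout = VMSD("virtio-gpu-one-scanout", 2, [
    Field("resource_id"),
    Field("width"),
    Field("fb.format", version_id=2),   # field added in v2
])
scanouts = VMSD("virtio-gpu-scanouts", 1, [Field("scanout", vmsd=scanout)])

loaded = []
load_state(scanouts, 1, loaded)  # pretend the incoming stream is version 1
print(loaded)                    # fb.format is loaded anyway
```

Loading a pre-9.0 (version 1) stream still consumes the fb fields, desynchronizing the stream, which matches the "Missing section footer" failure above and is what a versioned VMS_VSTRUCT-style reference is meant to address.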
Re: [PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
Am 29.04.24 um 16:36 schrieb Philippe Mathieu-Daudé: > Hi Fiona, > > On 29/4/24 16:19, Fiona Ebner wrote: > > Not everybody uses an email client that shows the patch content just > after the subject (your first lines wasn't making sense at first). > > Simply duplicating the subject helps to understand: > > Use uint64_t for timeout in nanoseconds ... > Oh, sorry. I'll try to remember that for the future. Should I re-send as a v2? Best Regards, Fiona
[PATCH] block/copy-before-write: use uint64_t for timeout in nanoseconds
rather than the uint32_t for which the maximum is slightly more than 4
seconds and larger values would overflow. The QAPI interface allows
specifying the number of seconds, so only values 0 to 4 are safe right now;
other values lead to a much lower timeout than a user expects.

The block_copy() call where this is used already takes a uint64_t for the
timeout, so no change is required there.

Fixes: 6db7fd1ca9 ("block/copy-before-write: implement cbw-timeout option")
Reported-by: Friedrich Weber
Signed-off-by: Fiona Ebner
---
 block/copy-before-write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 8aba27a71d..026fa9840f 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -43,7 +43,7 @@ typedef struct BDRVCopyBeforeWriteState {
     BlockCopyState *bcs;
     BdrvChild *target;
     OnCbwError on_cbw_error;
-    uint32_t cbw_timeout_ns;
+    uint64_t cbw_timeout_ns;

 /*
  * @lock: protects access to @access_bitmap, @done_bitmap and
-- 
2.39.2
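The truncation this patch fixes can be demonstrated with a quick sketch (illustrative Python, not QEMU code; it just models what storing the nanosecond value in a 32-bit unsigned field does):

```python
# Sketch of the cbw-timeout overflow: the timeout is given in seconds via
# QAPI, converted to nanoseconds, and was stored in a uint32_t. Anything
# above UINT32_MAX ns (~4.29 s) wraps around to a much smaller timeout.

UINT32_MAX = 2**32 - 1
NS_PER_SEC = 1_000_000_000

def stored_timeout_ns(seconds, width_bits=32):
    """Nanosecond value after truncation to an unsigned integer width."""
    return (seconds * NS_PER_SEC) % (2**width_bits)

print(UINT32_MAX / NS_PER_SEC)   # ~4.29 -- largest safe timeout in seconds
print(stored_timeout_ns(5))      # 5 s wraps to ~0.7 s in a uint32_t
print(stored_timeout_ns(5, 64))  # with uint64_t the value survives
```

This is why only cbw-timeout values of 0 to 4 seconds behaved as expected before the fix.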
Re: [PATCH] Makefile: preserve --jobserver-auth argument when calling ninja
Am 02.04.24 um 10:17 schrieb Martin Hundebøll: > Qemu wraps its call to ninja in a Makefile. Since ninja, as opposed to > make, utilizes all CPU cores by default, the qemu Makefile translates > the absense of a `-jN` argument into `-j1`. This breaks jobserver > functionality, so update the -jN mangling to take the --jobserver-auth > argument into considerationa too. > > Signed-off-by: Martin Hundebøll > --- > Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Makefile b/Makefile > index 8f36990335..183756018f 100644 > --- a/Makefile > +++ b/Makefile > @@ -142,7 +142,7 @@ MAKE.k = $(findstring k,$(firstword $(filter-out > --%,$(MAKEFLAGS > MAKE.q = $(findstring q,$(firstword $(filter-out --%,$(MAKEFLAGS > MAKE.nq = $(if $(word 2, $(MAKE.n) $(MAKE.q)),nq) > NINJAFLAGS = $(if $V,-v) $(if $(MAKE.n), -n) $(if $(MAKE.k), -k0) \ > -$(filter-out -j, $(lastword -j1 $(filter -l% -j%, $(MAKEFLAGS \ > +$(or $(filter -l% -j%, $(MAKEFLAGS)), $(if $(filter > --jobserver-auth=%, $(MAKEFLAGS)),, -j1)) \ > -d keepdepfile > ninja-cmd-goals = $(or $(MAKECMDGOALS), all) > ninja-cmd-goals += $(foreach g, $(MAKECMDGOALS), $(.ninja-goals.$g)) Hi, unfortunately, this patch breaks build when specifying just '-j' as a make flag (i.e. without a number), because it will now end up being passed to ninja: > $ make -j > changing dir to build for make ""... > make[1]: Entering directory '/home/febner/repos/qemu/build' > ninja: fatal: invalid -j parameter Best Regards, Fiona
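A rough Python model of the Makefile expression being discussed (names invented; this just mimics the `$(filter)`/`$(or)` logic to show which flags reach ninja, including the bare `-j` case from the bug report):

```python
# Sketch of the NINJAFLAGS job-count mangling after the patch:
#   $(or $(filter -l% -j%, $(MAKEFLAGS)),
#        $(if $(filter --jobserver-auth=%, $(MAKEFLAGS)),, -j1))

def ninja_j_flags(makeflags):
    """Return the -j/-l flags that would be forwarded to ninja."""
    jobs = [f for f in makeflags if f.startswith(("-j", "-l"))]
    jobserver = any(f.startswith("--jobserver-auth=") for f in makeflags)
    # explicit -j/-l flags win; otherwise -j1 unless a jobserver is active
    return jobs or ([] if jobserver else ["-j1"])

print(ninja_j_flags(["-j8"]))                        # explicit job count
print(ninja_j_flags([]))                             # serial fallback
print(ninja_j_flags(["--jobserver-auth=fifo:/x"]))   # defer to jobserver
print(ninja_j_flags(["-j"]))                         # bare -j leaks through
```

The last case is the regression Fiona reports: a bare `-j` (unlimited jobs) matches the `-j%` filter and is passed verbatim to ninja, which rejects it.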
Re: [PATCH] block: Remove unnecessary NULL check in bdrv_pad_request()
Am 27.03.24 um 20:27 schrieb Kevin Wolf: > Coverity complains that the check introduced in commit 3f934817 suggests > that qiov could be NULL and we dereference it before reaching the check. > In fact, all of the callers pass a non-NULL pointer, so just remove the > misleading check. > > Resolves: Coverity CID 1542668 > Signed-off-by: Kevin Wolf Reviewed-by: Fiona Ebner Thank you for the fix, Fiona
Question about block graph lock limitation with generated co-wrappers
Hi, we have a custom block driver downstream, which currently calls bdrv_get_info() (for its file child) in the bdrv_refresh_limits() callback. However, with graph locking, this doesn't work anymore. AFAICT, the reason is the following: The block driver has a backing file option. During initialization, in bdrv_set_backing_hd(), the graph lock is acquired exclusively. Then the bdrv_refresh_limits() callback is invoked. Now bdrv_get_info() is called, which is a generated co-wrapper. The bdrv_co_get_info_entry() function tries to acquire the graph lock for reading, sees that has_writer is true and so the coroutine will be put to wait, leading to a deadlock. For my specific case, I can move the bdrv_get_info() call to bdrv_open() as a workaround. But I wanted to ask if there is a way to make generated co-wrappers inside an exclusively locked section work? And if not, could we introduce/extend the annotations, so the compiler can catch this kind of issue, i.e. calling a generated co-wrapper while in an exclusively locked section? Best Regards, Fiona
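The deadlock described above can be modeled with a minimal sketch (plain Python, not QEMU's graph-lock implementation; the has_writer/reader-count fields only mirror the description in the text):

```python
# Toy model of the block graph lock: a writer holds the lock exclusively
# (bdrv_set_backing_hd()), then invokes a callback that runs a generated
# co-wrapper, which tries to take the lock for reading. Because has_writer
# is set, the reader coroutine would park itself and wait -- but the waiter
# it is waiting for is the very context that holds the write lock.

class GraphLock:
    def __init__(self):
        self.has_writer = False
        self.readers = 0

    def wrlock(self):
        # exclusive acquisition; assumes no other readers/writers
        assert not self.has_writer and self.readers == 0
        self.has_writer = True

    def try_rdlock(self):
        # mirrors the co-wrapper's read acquisition: if has_writer is
        # set, the coroutine is put to wait instead of proceeding
        if self.has_writer:
            return False
        self.readers += 1
        return True

lock = GraphLock()
lock.wrlock()                        # bdrv_set_backing_hd() takes wrlock
can_read = lock.try_rdlock()         # co-wrapper inside the callback
print(can_read)                      # False -> coroutine waits forever
```

Since the writer only releases the lock after the callback returns, and the callback polls for the reader coroutine, neither side can make progress, which matches the observed hang.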
[PATCH v3 0/4] fix two edge cases related to stream block jobs
Changes in v3: * Also deal with edge case in bdrv_next_cleanup(). Haven't run into an actual issue there, but at least the caller in migration/block.c uses bdrv_nb_sectors() which, while not a coroutine wrapper itself (it's written manually), may call bdrv_refresh_total_sectors(), which is a generated coroutine wrapper, so AFAIU, the block graph can change during that call. And even without that, it's just better to be more consistent with bdrv_next(). Changes in v2: * Ran into another issue while writing the IO test Stefan wanted to have (good call :)), so include a fix for that and add the test. I didn't notice during manual testing, because I hadn't used a scripted QMP 'quit', so there was no race. Fiona Ebner (3): block-backend: fix edge case in bdrv_next() where BDS associated to BB changes block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes iotests: add test for stream job with an unaligned prefetch read Stefan Reiter (1): block/io: accept NULL qiov in bdrv_pad_request block/block-backend.c | 18 ++-- block/io.c| 31 --- .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 4 files changed, 117 insertions(+), 23 deletions(-) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out -- 2.39.2
[PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
Same rationale as for commit "block-backend: fix edge case in bdrv_next() where BDS associated to BB changes". The block graph might change between the bdrv_next() call and the bdrv_next_cleanup() call, so it could be that the associated BDS is not the same that was referenced previously anymore. Instead, rely on bdrv_next() to set it->bs to the BDS it referenced and unreference that one in any case. Signed-off-by: Fiona Ebner --- New in v3. block/block-backend.c | 11 --- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 28af1eb17a..db6f9b92a3 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); -if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { -if (it->blk) { -bdrv_unref(blk_bs(it->blk)); -blk_unref(it->blk); -} -} else { -bdrv_unref(it->bs); +bdrv_unref(it->bs); + +if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) { +blk_unref(it->blk); } bdrv_next_reset(it); -- 2.39.2
[PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
The old_bs variable in bdrv_next() is currently determined by looking at the old block backend. However, if the block graph changes before the next bdrv_next() call, it might be that the associated BDS is not the same that was referenced previously. In that case, the wrong BDS is unreferenced, leading to an assertion failure later: > bdrv_unref: Assertion `bs->refcnt > 0' failed. In particular, this can happen in the context of bdrv_flush_all(), when polling for bdrv_co_flush() in the generated co-wrapper leads to a graph change (for example with a stream block job [0]). A racy reproducer: > #!/bin/bash > rm -f /tmp/backing.qcow2 > rm -f /tmp/top.qcow2 > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node0" } } > {"execute": "quit"} > EOF [0]: > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...) > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., > errp=...) > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., > detach_subchain=..., errp=...) > #3 bdrv_drop_filter (bs=..., errp=...) > #4 bdrv_cor_filter_drop (cor_filter_bs=...) > #5 stream_prepare (job=...) > #6 job_prepare_locked (job=...) > #7 job_txn_apply_locked (fn=..., job=...) > #8 job_do_finalize_locked (job=...) > #9 job_exit (opaque=...) > #10 aio_bh_poll (ctx=...) > #11 aio_poll (ctx=..., blocking=...) > #12 bdrv_poll_co (s=...) > #13 bdrv_flush (bs=...) > #14 bdrv_flush_all () > #15 do_vm_stop (state=..., send_stop=...) > #16 vm_shutdown () Signed-off-by: Fiona Ebner --- No changes in v3. New in v2. 
block/block-backend.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 9c4de79e6b..28af1eb17a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); +old_bs = it->bs; + /* First, return all root nodes of BlockBackends. In order to avoid * returning a BDS twice when multiple BBs refer to it, we only return it * if the BB is the first one in the parent list of the BDS. */ if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { BlockBackend *old_blk = it->blk; -old_bs = old_blk ? blk_bs(old_blk) : NULL; - do { it->blk = blk_all_next(it->blk); bs = it->blk ? blk_bs(it->blk) : NULL; @@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) if (bs) { bdrv_ref(bs); bdrv_unref(old_bs); +it->bs = bs; return bs; } it->phase = BDRV_NEXT_MONITOR_OWNED; -} else { -old_bs = it->bs; } /* Then return the monitor-owned BDSes without a BB attached. Ignore all -- 2.39.2
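The idea behind the fix can be illustrated with a toy refcounting model (hypothetical Python sketch, not QEMU code): unreference the node the iterator actually referenced last time, instead of re-deriving it from the block backend, whose attached node may have changed in between:

```python
# Model of the bdrv_next() edge case: if the BDS attached to the BB changes
# between iterations (e.g. stream_prepare() dropping a filter), re-deriving
# old_bs from the backend unrefs the wrong node. Tracking it.bs avoids that.

class Node:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0

def ref(node):
    node.refcnt += 1

def unref(node):
    # corresponds to the failing assertion: bdrv_unref: bs->refcnt > 0
    assert node.refcnt > 0, "bdrv_unref: refcnt > 0 failed"
    node.refcnt -= 1

class Backend:
    def __init__(self, node):
        self.node = node

class Iterator:
    def __init__(self, blk):
        self.blk = blk
        self.bs = None            # node referenced by the previous call

    def next(self):
        old_bs = self.bs          # the fix: use what *we* referenced ...
        bs = self.blk.node        # ... not blk_bs(old_blk)
        ref(bs)
        if old_bs:
            unref(old_bs)
        self.bs = bs
        return bs

a, b = Node("a"), Node("b")
blk = Backend(a)
it = Iterator(blk)
it.next()            # references a
blk.node = b         # graph change between iterations
it.next()            # must unref a (previously referenced), not b
print(a.refcnt, b.refcnt)
```

With the pre-fix behavior (old_bs derived from blk.node after the graph change), the second iteration would unref b while a's reference leaks, tripping the refcount assertion later.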
[PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter

Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter
Signed-off-by: Thomas Lamprecht
[FE: do update bytes and offset in any case
     add reproducer to commit message]
Signed-off-by: Fiona Ebner
---

No changes in v3.
No changes in v2.

 block/io.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
 
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
--
2.39.2
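The offset/bytes adjustment that must happen even without a qiov can be sketched as follows (illustrative Python; the function name is mine, not QEMU's): align the request down at the head and up at the tail so the driver sees an aligned read:

```python
# Sketch of the padding arithmetic in bdrv_pad_request() for the no-qiov
# case: only offset and bytes are adjusted; there is no vector to extend.

def pad_request(offset, nbytes, align):
    """Return (aligned_offset, aligned_bytes) covering [offset, offset+nbytes)."""
    head = offset % align                       # pad->head
    tail = (offset + nbytes) % align
    tail = (align - tail) % align               # pad->tail
    return offset - head, nbytes + head + tail

# The reproducer above: a 1-byte prefetch read at offset 0 while the
# compress filter forces a 64k request alignment.
print(pad_request(0, 1, 65536))   # whole first 64k cluster is read
```

Without the fix, this adjustment was skipped for a NULL qiov, so the subsequent "aligned" read was not actually aligned and hit an assertion failure.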
[PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read
Previously, bdrv_pad_request() could not deal with a NULL qiov when a read needed to be aligned. During prefetch, a stream job will pass a NULL qiov. Add a test case to cover this scenario. By accident, also covers a previous race during shutdown, where block graph changes during iteration in bdrv_flush_all() could lead to unreferencing the wrong block driver state and an assertion failure later. Signed-off-by: Fiona Ebner --- No changes in v3. New in v2. .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 2 files changed, 91 insertions(+) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch new file mode 100755 index 00..546db1d369 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +# group: rw quick +# +# Test what happens when a stream job does an unaligned prefetch read +# which requires padding while having a NULL qiov. +# +# Copyright (C) Proxmox Server Solutions GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. 
+# + +import os +import iotests +from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase + +image_size = 1 * 1024 * 1024 +cluster_size = 64 * 1024 +base = os.path.join(iotests.test_dir, 'base.img') +top = os.path.join(iotests.test_dir, 'top.img') + +class TestStreamUnalignedPrefetch(QMPTestCase): +def setUp(self) -> None: +""" +Create two images: +- base image {base} with {cluster_size // 2} bytes allocated +- top image {top} without any data allocated and coarser + cluster size + +Attach a compress filter for the top image, because that +requires that the request alignment is the top image's cluster +size. +""" +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size // 2), +base, str(image_size)) +qemu_io('-c', f'write 0 {cluster_size // 2}', base) +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size), +top, str(image_size)) + +self.vm = iotests.VM() +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': imgfmt, +'node-name': 'base', +'file': { +'driver': 'file', +'filename': base +} +})) +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': 'compress', +'node-name': 'compress-top', +'file': { +'driver': imgfmt, +'node-name': 'top', +'file': { +'driver': 'file', +'filename': top +}, +'backing': 'base' +} +})) +self.vm.launch() + +def tearDown(self) -> None: +self.vm.shutdown() +os.remove(top) +os.remove(base) + +def test_stream_unaligned_prefetch(self) -> None: +self.vm.cmd('block-stream', job_id='stream', device='compress-top') + + +if __name__ == '__main__': +iotests.main(supported_fmts=['qcow2'], supported_protocols=['file']) diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out new file mode 100644 index 00..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out @@ -0,0 +1,5 @@ +. +-- +Ran 1 tests + +OK -- 2.39.2
[PATCH v2 3/3] iotests: add test for stream job with an unaligned prefetch read
Previously, bdrv_pad_request() could not deal with a NULL qiov when a read needed to be aligned. During prefetch, a stream job will pass a NULL qiov. Add a test case to cover this scenario. By accident, also covers a previous race during shutdown, where block graph changes during iteration in bdrv_flush_all() could lead to unreferencing the wrong block driver state and an assertion failure later. Signed-off-by: Fiona Ebner --- New in v2. .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 2 files changed, 91 insertions(+) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch new file mode 100755 index 00..546db1d369 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +# group: rw quick +# +# Test what happens when a stream job does an unaligned prefetch read +# which requires padding while having a NULL qiov. +# +# Copyright (C) Proxmox Server Solutions GmbH +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. 
+# + +import os +import iotests +from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase + +image_size = 1 * 1024 * 1024 +cluster_size = 64 * 1024 +base = os.path.join(iotests.test_dir, 'base.img') +top = os.path.join(iotests.test_dir, 'top.img') + +class TestStreamUnalignedPrefetch(QMPTestCase): +def setUp(self) -> None: +""" +Create two images: +- base image {base} with {cluster_size // 2} bytes allocated +- top image {top} without any data allocated and coarser + cluster size + +Attach a compress filter for the top image, because that +requires that the request alignment is the top image's cluster +size. +""" +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size // 2), +base, str(image_size)) +qemu_io('-c', f'write 0 {cluster_size // 2}', base) +qemu_img_create('-f', imgfmt, +'-o', 'cluster_size={}'.format(cluster_size), +top, str(image_size)) + +self.vm = iotests.VM() +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': imgfmt, +'node-name': 'base', +'file': { +'driver': 'file', +'filename': base +} +})) +self.vm.add_blockdev(self.vm.qmp_to_opts({ +'driver': 'compress', +'node-name': 'compress-top', +'file': { +'driver': imgfmt, +'node-name': 'top', +'file': { +'driver': 'file', +'filename': top +}, +'backing': 'base' +} +})) +self.vm.launch() + +def tearDown(self) -> None: +self.vm.shutdown() +os.remove(top) +os.remove(base) + +def test_stream_unaligned_prefetch(self) -> None: +self.vm.cmd('block-stream', job_id='stream', device='compress-top') + + +if __name__ == '__main__': +iotests.main(supported_fmts=['qcow2'], supported_protocols=['file']) diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out new file mode 100644 index 00..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out @@ -0,0 +1,5 @@ +. +-- +Ran 1 tests + +OK -- 2.39.2
[PATCH v2 0/3] fix two edge cases related to stream block jobs
Changes in v2: * Ran into another issue while writing the IO test Stefan wanted to have (good call :)), so include a fix for that and add the test. I didn't notice during manual testing, because I hadn't used a scripted QMP 'quit', so there was no race. Fiona Ebner (2): block-backend: fix edge case in bdrv_next() where BDS associated to BB changes iotests: add test for stream job with an unaligned prefetch read Stefan Reiter (1): block/io: accept NULL qiov in bdrv_pad_request block/block-backend.c | 7 +- block/io.c| 31 --- .../tests/stream-unaligned-prefetch | 86 +++ .../tests/stream-unaligned-prefetch.out | 5 ++ 4 files changed, 113 insertions(+), 16 deletions(-) create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out -- 2.39.2
[PATCH v2 1/3] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter

Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter
Signed-off-by: Thomas Lamprecht
[FE: do update bytes and offset in any case
     add reproducer to commit message]
Signed-off-by: Fiona Ebner
---

No changes in v2.

 block/io.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
 
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
--
2.39.2
[PATCH v2 2/3] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
The old_bs variable in bdrv_next() is currently determined by looking at the old block backend. However, if the block graph changes before the next bdrv_next() call, it might be that the associated BDS is not the same that was referenced previously. In that case, the wrong BDS is unreferenced, leading to an assertion failure later: > bdrv_unref: Assertion `bs->refcnt > 0' failed. In particular, this can happen in the context of bdrv_flush_all(), when polling for bdrv_co_flush() in the generated co-wrapper leads to a graph change (for example with a stream block job [0]). A racy reproducer: > #!/bin/bash > rm -f /tmp/backing.qcow2 > rm -f /tmp/top.qcow2 > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2 > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2 > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": > "node0" } } > {"execute": "quit"} > EOF [0]: > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...) > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., > errp=...) > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., > detach_subchain=..., errp=...) > #3 bdrv_drop_filter (bs=..., errp=...) > #4 bdrv_cor_filter_drop (cor_filter_bs=...) > #5 stream_prepare (job=...) > #6 job_prepare_locked (job=...) > #7 job_txn_apply_locked (fn=..., job=...) > #8 job_do_finalize_locked (job=...) > #9 job_exit (opaque=...) > #10 aio_bh_poll (ctx=...) > #11 aio_poll (ctx=..., blocking=...) > #12 bdrv_poll_co (s=...) > #13 bdrv_flush (bs=...) > #14 bdrv_flush_all () > #15 do_vm_stop (state=..., send_stop=...) > #16 vm_shutdown () Signed-off-by: Fiona Ebner --- Not sure if this is the correct fix, or if the call site should rather be adapted somehow? New in v2. 
block/block-backend.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/block/block-backend.c b/block/block-backend.c index 9c4de79e6b..28af1eb17a 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) /* Must be called from the main loop */ assert(qemu_get_current_aio_context() == qemu_get_aio_context()); +old_bs = it->bs; + /* First, return all root nodes of BlockBackends. In order to avoid * returning a BDS twice when multiple BBs refer to it, we only return it * if the BB is the first one in the parent list of the BDS. */ if (it->phase == BDRV_NEXT_BACKEND_ROOTS) { BlockBackend *old_blk = it->blk; -old_bs = old_blk ? blk_bs(old_blk) : NULL; - do { it->blk = blk_all_next(it->blk); bs = it->blk ? blk_bs(it->blk) : NULL; @@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it) if (bs) { bdrv_ref(bs); bdrv_unref(old_bs); +it->bs = bs; return bs; } it->phase = BDRV_NEXT_MONITOR_OWNED; -} else { -old_bs = it->bs; } /* Then return the monitor-owned BDSes without a BB attached. Ignore all -- 2.39.2
[PATCH] block/io: accept NULL qiov in bdrv_pad_request
From: Stefan Reiter

Some operations, e.g. block-stream, perform reads while discarding the results (only copy-on-read matters). In this case, they will pass NULL as the target QEMUIOVector, which will however trip bdrv_pad_request, since it wants to extend its passed vector. In particular, this is the case for the blk_co_preadv() call in stream_populate().

If there is no qiov, no operation can be done with it, but the bytes and offset still need to be updated, so the subsequent aligned read will actually be aligned and not run into an assertion failure.

In particular, this can happen when the request alignment of the top node is larger than the allocated part of the bottom node, in which case padding becomes necessary. For example:

> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF

Originally-by: Stefan Reiter
Signed-off-by: Thomas Lamprecht
[FE: do update bytes and offset in any case
     add reproducer to commit message]
Signed-off-by: Fiona Ebner
---
 block/io.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
         return 0;
     }
 
-    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
-                                  &sliced_head, &sliced_tail,
-                                  &sliced_niov);
+    /*
+     * For prefetching in stream_populate(), no qiov is passed along, because
+     * only copy-on-read matters.
+     */
+    if (qiov && *qiov) {
+        sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                      &sliced_head, &sliced_tail,
+                                      &sliced_niov);
 
-    /* Guaranteed by bdrv_check_request32() */
-    assert(*bytes <= SIZE_MAX);
-    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
-                                  sliced_head, *bytes);
-    if (ret < 0) {
-        bdrv_padding_finalize(pad);
-        return ret;
+        /* Guaranteed by bdrv_check_request32() */
+        assert(*bytes <= SIZE_MAX);
+        ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                      sliced_head, *bytes);
+        if (ret < 0) {
+            bdrv_padding_finalize(pad);
+            return ret;
+        }
+        *qiov = &pad->local_qiov;
+        *qiov_offset = 0;
     }
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
-    *qiov = &pad->local_qiov;
-    *qiov_offset = 0;
     if (padded) {
         *padded = true;
     }
--
2.39.2
[PATCH 2/2] backup: add minimum cluster size to performance options
Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Backup/block-copy will use at least this granularity for copy operations and in particular, discard requests to the backup source will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored.

Suggested-by: Vladimir Sementsov-Ogievskiy
Signed-off-by: Fiona Ebner
---
 block/backup.c            | 2 +-
 block/copy-before-write.c | 2 ++
 block/copy-before-write.h | 1 +
 blockdev.c                | 3 +++
 qapi/block-core.json      | 9 +++++++--
 5 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index 3dd2e229d2..a1292c01ec 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -458,7 +458,7 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs,
     }
 
     cbw = bdrv_cbw_append(bs, target, filter_node_name, discard_source,
-                          &bcs, errp);
+                          perf->min_cluster_size, &bcs, errp);
     if (!cbw) {
         goto error;
     }
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index f9896c6c1e..55a9272485 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -545,6 +545,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
                                   bool discard_source,
+                                  int64_t min_cluster_size,
                                   BlockCopyState **bcs,
                                   Error **errp)
 {
@@ -563,6 +564,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
     }
     qdict_put_str(opts, "file", bdrv_get_node_name(source));
     qdict_put_str(opts, "target", bdrv_get_node_name(target));
+    qdict_put_int(opts, "min-cluster-size", min_cluster_size);
 
     top = bdrv_insert_node(source, opts, flags, errp);
     if (!top) {
diff --git a/block/copy-before-write.h b/block/copy-before-write.h
index 01af0cd3c4..dc6cafe7fa 100644
--- a/block/copy-before-write.h
+++ b/block/copy-before-write.h
@@ -40,6 +40,7 @@ BlockDriverState *bdrv_cbw_append(BlockDriverState *source,
                                   BlockDriverState *target,
                                   const char *filter_node_name,
                                   bool discard_source,
+                                  int64_t min_cluster_size,
                                   BlockCopyState **bcs,
                                   Error **errp);
 void bdrv_cbw_drop(BlockDriverState *bs);
diff --git a/blockdev.c b/blockdev.c
index daceb50460..8e6bdbc94a 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2653,6 +2653,9 @@ static BlockJob *do_backup_common(BackupCommon *backup,
         if (backup->x_perf->has_max_chunk) {
             perf.max_chunk = backup->x_perf->max_chunk;
         }
+        if (backup->x_perf->has_min_cluster_size) {
+            perf.min_cluster_size = backup->x_perf->min_cluster_size;
+        }
     }
 
     if ((backup->sync == MIRROR_SYNC_MODE_BITMAP) ||
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 85c8f88f6e..ba0836892f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1551,11 +1551,16 @@
 #     it should not be less than job cluster size which is calculated
 #     as maximum of target image cluster size and 64k. Default 0.
 #
+# @min-cluster-size: Minimum size of blocks used by copy-before-write
+#     and background copy operations. Has to be a power of 2. No
+#     effect if smaller than the maximum of the target's cluster size
+#     and 64 KiB. Default 0. (Since 9.0)
+#
 # Since: 6.0
 ##
 { 'struct': 'BackupPerf',
-  'data': { '*use-copy-range': 'bool',
-            '*max-workers': 'int', '*max-chunk': 'int64' } }
+  'data': { '*use-copy-range': 'bool', '*max-workers': 'int',
+            '*max-chunk': 'int64', '*min-cluster-size': 'uint32' } }
 
 ##
 # @BackupCommon:
--
2.39.2
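How the effective cluster size is derived once min-cluster-size enters the picture can be sketched as follows (Python restatement under my own names, not the QEMU API): the maximum of the requested minimum, the 64 KiB default, and the target image's own cluster size, with the minimum required to be a power of two:

```python
# Sketch of the block-copy cluster size calculation with the new option.
# target_cluster_size stands in for what bdrv_get_info() reports (0 if the
# target driver doesn't report one).

DEFAULT_CLUSTER_SIZE = 64 * 1024   # BLOCK_COPY_CLUSTER_SIZE_DEFAULT

def effective_cluster_size(min_cluster_size, target_cluster_size=0):
    if min_cluster_size and (min_cluster_size & (min_cluster_size - 1)):
        raise ValueError("min-cluster-size needs to be a power of 2")
    return max(min_cluster_size, DEFAULT_CLUSTER_SIZE, target_cluster_size)

# fleecing image with 1 MiB granularity, qcow2 target with 64k clusters:
print(effective_cluster_size(1024 * 1024, 64 * 1024))
# smaller than the default has no effect, matching the QAPI doc:
print(effective_cluster_size(4096, 64 * 1024))
```

This makes the "No effect if smaller than the maximum of the target's cluster size and 64 KiB" wording in the QAPI documentation concrete: the option can only raise the granularity, never lower it.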
[PATCH 0/2] backup: allow specifying minimum cluster size
Based-on: https://lore.kernel.org/qemu-devel/20240228141501.455989-1-vsement...@yandex-team.ru/ Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Backup/block-copy will use at least this granularity for copy operations and in particular, discard requests to the backup source will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. Fiona Ebner (2): copy-before-write: allow specifying minimum cluster size backup: add minimum cluster size to performance options block/backup.c | 2 +- block/block-copy.c | 17 + block/copy-before-write.c | 5 - block/copy-before-write.h | 1 + blockdev.c | 3 +++ include/block/block-copy.h | 1 + qapi/block-core.json | 17 ++--- 7 files changed, 37 insertions(+), 9 deletions(-) -- 2.39.2
[PATCH 1/2] copy-before-write: allow specifying minimum cluster size
Useful to make discard-source work in the context of backup fleecing when the fleecing image has a larger granularity than the backup target. Copy-before-write operations will use at least this granularity and in particular, discard requests to the source node will too. If the granularity is too small, they will just be aligned down in cbw_co_pdiscard_snapshot() and thus effectively ignored. The QAPI uses uint32 so the value will be non-negative, but still fit into a uint64_t. Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- block/block-copy.c | 17 + block/copy-before-write.c | 3 ++- include/block/block-copy.h | 1 + qapi/block-core.json | 8 +++- 4 files changed, 23 insertions(+), 6 deletions(-) diff --git a/block/block-copy.c b/block/block-copy.c index 7e3b378528..adb1cbb440 100644 --- a/block/block-copy.c +++ b/block/block-copy.c @@ -310,6 +310,7 @@ void block_copy_set_copy_opts(BlockCopyState *s, bool use_copy_range, } static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, + int64_t min_cluster_size, Error **errp) { int ret; @@ -335,7 +336,7 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, "used. If the actual block size of the target exceeds " "this default, the backup may be unusable", BLOCK_COPY_CLUSTER_SIZE_DEFAULT); -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } else if (ret < 0 && !target_does_cow) { error_setg_errno(errp, -ret, "Couldn't determine the cluster size of the target image, " @@ -345,16 +346,18 @@ static int64_t block_copy_calculate_cluster_size(BlockDriverState *target, return ret; } else if (ret < 0 && target_does_cow) { /* Not fatal; just trudge on ahead. 
*/ -return BLOCK_COPY_CLUSTER_SIZE_DEFAULT; +return MAX(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT); } -return MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size); +return MAX(min_cluster_size, + MAX(BLOCK_COPY_CLUSTER_SIZE_DEFAULT, bdi.cluster_size)); } BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + int64_t min_cluster_size, Error **errp) { ERRP_GUARD(); @@ -365,7 +368,13 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, GLOBAL_STATE_CODE(); -cluster_size = block_copy_calculate_cluster_size(target->bs, errp); +if (min_cluster_size && !is_power_of_2(min_cluster_size)) { +error_setg(errp, "min-cluster-size needs to be a power of 2"); +return NULL; +} + +cluster_size = block_copy_calculate_cluster_size(target->bs, + min_cluster_size, errp); if (cluster_size < 0) { return NULL; } diff --git a/block/copy-before-write.c b/block/copy-before-write.c index dac57481c5..f9896c6c1e 100644 --- a/block/copy-before-write.c +++ b/block/copy-before-write.c @@ -476,7 +476,8 @@ static int cbw_open(BlockDriverState *bs, QDict *options, int flags, s->discard_source = flags & BDRV_O_CBW_DISCARD_SOURCE; s->bcs = block_copy_state_new(bs->file, s->target, bs, bitmap, - flags & BDRV_O_CBW_DISCARD_SOURCE, errp); + flags & BDRV_O_CBW_DISCARD_SOURCE, + opts->min_cluster_size, errp); if (!s->bcs) { error_prepend(errp, "Cannot create block-copy-state: "); return -EINVAL; diff --git a/include/block/block-copy.h b/include/block/block-copy.h index bdc703bacd..77857c6c68 100644 --- a/include/block/block-copy.h +++ b/include/block/block-copy.h @@ -28,6 +28,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target, BlockDriverState *copy_bitmap_bs, const BdrvDirtyBitmap *bitmap, bool discard_source, + int64_t min_cluster_size, Error **errp); /* Function should be called prior any actual copy request */ diff --git 
a/qapi/block-core.json b/qapi/block-core.json index 0a72c590a8..85c8f88f6e 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -4625,12 +4625,18 @@ # @on-cbw-error
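Condensed, the cluster-size selection implemented by block_copy_calculate_cluster_size() after this patch behaves like the following sketch (warnings and the exact error paths are omitted; `None` stands in for a failed cluster-size query, e.g. via bdrv_get_info()):

```python
BLOCK_COPY_CLUSTER_SIZE_DEFAULT = 64 * 1024  # 64 KiB

def calculate_cluster_size(target_cluster_size, target_does_cow,
                           min_cluster_size=0):
    if target_cluster_size is None:
        if not target_does_cow:
            # fatal in the real code: the backup could be unusable
            raise ValueError("couldn't determine target cluster size")
        # non-fatal for COW targets: trudge on with the default
        return max(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT)
    return max(min_cluster_size, BLOCK_COPY_CLUSTER_SIZE_DEFAULT,
               target_cluster_size)
```

The key property of the patch is visible in the last line: the caller-provided minimum is simply folded into the existing MAX() selection.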
Re: [PATCH v3 0/5] backup: discard-source parameter
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > Hi all! The main patch is 04, please look at it for description and > diagram. > > v3: > 02: new patch > 04: take WRITE permission only when discard_source is required > > Vladimir Sementsov-Ogievskiy (5): > block/copy-before-write: fix permission > block/copy-before-write: support unligned snapshot-discard > block/copy-before-write: create block_copy bitmap in filter node > qapi: blockdev-backup: add discard-source parameter > iotests: add backup-discard-source > > block/backup.c| 5 +- > block/block-copy.c| 12 +- > block/copy-before-write.c | 39 - > block/copy-before-write.h | 1 + > block/replication.c | 4 +- > blockdev.c| 2 +- > include/block/block-common.h | 2 + > include/block/block-copy.h| 2 + > include/block/block_int-global-state.h| 2 +- > qapi/block-core.json | 4 + > tests/qemu-iotests/257.out| 112 ++--- > .../qemu-iotests/tests/backup-discard-source | 151 ++ > .../tests/backup-discard-source.out | 5 + > 13 files changed, 271 insertions(+), 70 deletions(-) > create mode 100755 tests/qemu-iotests/tests/backup-discard-source > create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out > Tested-by: Fiona Ebner
Re: [PATCH v3 5/5] iotests: add backup-discard-source
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy: > Add test for a new backup option: discard-source. > > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > .../qemu-iotests/tests/backup-discard-source | 151 ++ > .../tests/backup-discard-source.out | 5 + > 2 files changed, 156 insertions(+) > create mode 100755 tests/qemu-iotests/tests/backup-discard-source > create mode 100644 tests/qemu-iotests/tests/backup-discard-source.out > > diff --git a/tests/qemu-iotests/tests/backup-discard-source > b/tests/qemu-iotests/tests/backup-discard-source > new file mode 100755 > index 00..8a88b0f6c4 > --- /dev/null > +++ b/tests/qemu-iotests/tests/backup-discard-source > @@ -0,0 +1,151 @@ > +#!/usr/bin/env python3 > +# > +# Test removing persistent bitmap from backing > +# > +# Copyright (c) 2022 Virtuozzo International GmbH. > +# Title and copyright year are wrong. Apart from that: Reviewed-by: Fiona Ebner
Re: [PATCH v3 2/5] block/copy-before-write: support unaligned snapshot-discard
On 28.02.24 at 15:14, Vladimir Sementsov-Ogievskiy wrote: > The first thing that crashes on unaligned access here is > bdrv_reset_dirty_bitmap(). The correct way is to align down the > snapshot-discard request. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v3 4/5] qapi: blockdev-backup: add discard-source parameter
Am 28.02.24 um 15:15 schrieb Vladimir Sementsov-Ogievskiy: > Add a parameter that enables discard-after-copy. That is mostly useful > in "push backup with fleecing" scheme, when source is snapshot-access > format driver node, based on copy-before-write filter snapshot-access > API: > > [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >|| >| root | file >vv > [copy-before-write] >| | >| file| target >v v > [active disk] [temp.img] > > In this case discard-after-copy does two things: > > - discard data in temp.img to save disk space > - avoid further copy-before-write operation in discarded area > > Note that we have to declare WRITE permission on source in > copy-before-write filter, for discard to work. Still we can't take it > unconditionally, as it will break normal backup from RO source. So, we > have to add a parameter and pass it thorough bdrv_open flags. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v3 3/5] block/copy-before-write: create block_copy bitmap in filter node
Am 28.02.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy: > Currently block_copy creates copy_bitmap in source node. But that is in > bad relation with .independent_close=true of copy-before-write filter: > source node may be detached and removed before .bdrv_close() handler > called, which should call block_copy_state_free(), which in turn should > remove copy_bitmap. > > That's all not ideal: it would be better if internal bitmap of > block-copy object is not attached to any node. But that is not possible > now. > > The simplest solution is just create copy_bitmap in filter node, where > anyway two other bitmaps are created. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fiona Ebner
Re: [PATCH v2 00/10] mirror: allow switching from background to active mode
Am 07.03.24 um 20:42 schrieb Vladimir Sementsov-Ogievskiy: > On 04.03.24 14:09, Peter Krempa wrote: >> On Mon, Mar 04, 2024 at 11:48:54 +0100, Kevin Wolf wrote: >>> Am 28.02.2024 um 19:07 hat Vladimir Sementsov-Ogievskiy geschrieben: On 03.11.23 18:56, Markus Armbruster wrote: > Kevin Wolf writes: >> >> [...] >> > Is the job abstraction a failure? > > We have > > block-job- command since job- command since > - > block-job-set-speed 1.1 > block-job-cancel 1.1 job-cancel 3.0 > block-job-pause 1.3 job-pause 3.0 > block-job-resume 1.3 job-resume 3.0 > block-job-complete 1.3 job-complete 3.0 > block-job-dismiss 2.12 job-dismiss 3.0 > block-job-finalize 2.12 job-finalize 3.0 > block-job-change 8.2 > query-block-jobs 1.1 query-jobs >> >> [...] >> >>> I consider these strictly optional. We don't really have strong reasons >>> to deprecate these commands (they are just thin wrappers), and I think >>> libvirt still uses block-job-* in some places. >> >> Libvirt uses 'block-job-cancel' because it has different semantics from >> 'job-cancel' which libvirt documented as the behaviour of the API that >> uses it. (Semantics regarding the expectation of what is written to the >> destination node at the point when the job is cancelled). >> > > That's the following semantics: > > # Note that if you issue 'block-job-cancel' after 'drive-mirror' has > # indicated (via the event BLOCK_JOB_READY) that the source and > # destination are synchronized, then the event triggered by this > # command changes to BLOCK_JOB_COMPLETED, to indicate that the > # mirroring has ended and the destination now has a point-in-time copy > # tied to the time of the cancellation. > > Hmm. Looking at this, it looks for me, that should probably a > 'block-job-complete" command (as leading to BLOCK_JOB_COMPLETED). > > Actually, what is the difference between block-job-complete and > block-job-cancel(force=false) for mirror in ready state? > > I only see the following differencies: > > 1. 
block-job-complete documents that it completes the job > synchronously.. But looking at mirror code I see it just set > s->should_complete = true, which will be then handled asynchronously.. > So I doubt that documentation is correct. > > 2. block-job-complete will trigger final graph changes. block-job-cancel > will not. > > Is [2] really useful? Seems yes: in case of some failure before starting > migration target, we'd like to continue executing source. So, no reason > to break block-graph in source, better keep it unchanged. > FWIW, we also rely on these special semantics. We allow cloning the disk state of a running guest using drive-mirror (and before finishing, fsfreeze in the guest for consistency). We cannot use block-job-complete there, because we do not want to switch the source's drive. > But I think, such behavior better be setup by mirror-job start > parameter, rather then by special option for cancel (or even compelete) > command, useful only for mirror. > > So, what about the following substitution for block-job-cancel: > > block-job-cancel(force=true) --> use job-cancel > > block-job-cancel(force=false) for backup, stream, commit --> use > job-cancel > > block-job-cancel(force=false) for mirror in ready mode --> > > instead, use block-job-complete. If you don't need final graph > modification which mirror job normally does, use graph-change=false > parameter for blockdev-mirror command. > But yes, having a graph-change parameter would work for us too :) Best Regards, Fiona
[PATCH v2 4/4] blockdev: mirror: check for target's cluster size when using bitmap
When using mirror with a bitmap and the target is a diff image, i.e. one that should only contain the delta and was not synced to previously, too large a cluster size for the target can be problematic. In particular, when the mirror sends data to the target aligned to the job's granularity, but not aligned to the larger target image's cluster size, the target's cluster would be allocated but only be filled partially. When rebasing such a diff image later, the corresponding cluster of the base image would get "masked" and the part of the cluster not in the diff image is not accessible anymore. Unfortunately, it is not always possible to check for the target image's cluster size, e.g. when it's NBD. Because the limitation is already documented in the QAPI description for the @bitmap parameter and it's only required for the special diff image use-case, simply skip the check then. Signed-off-by: Fiona Ebner --- blockdev.c | 19 +++ tests/qemu-iotests/tests/bitmap-sync-mirror | 6 ++ .../qemu-iotests/tests/bitmap-sync-mirror.out | 7 +++ 3 files changed, 32 insertions(+) diff --git a/blockdev.c b/blockdev.c index c76eb97a4c..968d44cd3b 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2847,6 +2847,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, } if (bitmap_name) { +BlockDriverInfo bdi; +uint32_t bitmap_granularity; + if (sync != MIRROR_SYNC_MODE_FULL) { error_setg(errp, "Sync mode '%s' not supported with bitmap.", MirrorSyncMode_str(sync)); @@ -2863,6 +2866,22 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, return; } +bitmap_granularity = bdrv_dirty_bitmap_granularity(bitmap); +/* + * Ignore if unable to get the info, e.g. when target is NBD. It's only + * relevant for syncing to a diff image and the documentation already + * states that the target's cluster size needs to be small enough then. 
+ */ +if (bdrv_get_info(target, &bdi) >= 0) { +if (bitmap_granularity < bdi.cluster_size || +bitmap_granularity % bdi.cluster_size != 0) { +error_setg(errp, "Bitmap granularity %u is not a multiple of " + "the target image's cluster size %u", + bitmap_granularity, bdi.cluster_size); +return; +} +} + if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) { return; } diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror index 898f1f4ba4..cbd5cc99cc 100755 --- a/tests/qemu-iotests/tests/bitmap-sync-mirror +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -552,6 +552,12 @@ def test_mirror_api(): bitmap=bitmap) log('') +log("-- Test bitmap with too small granularity --\n".format(sync_mode)) +vm.qmp_log("block-dirty-bitmap-add", node=drive0.node, + name="bitmap-small", granularity=(GRANULARITY // 2)) +blockdev_mirror(drive0.vm, drive0.node, "mirror_target", "full", +job_id='api_job', bitmap="bitmap-small") +log('') def main(): for bsync_mode in ("never", "on-success", "always"): diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror.out b/tests/qemu-iotests/tests/bitmap-sync-mirror.out index c05b4788c6..d40ea7d689 100644 --- a/tests/qemu-iotests/tests/bitmap-sync-mirror.out +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror.out @@ -2937,3 +2937,10 @@ qemu_img compare "TEST_DIR/PID-img" "TEST_DIR/PID-fmirror3" ==> Identical, OK! 
{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap0", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "none", "target": "mirror_target"}} {"error": {"class": "GenericError", "desc": "Sync mode 'none' not supported with bitmap."}} +-- Test bitmap with too small granularity -- + +{"execute": "block-dirty-bitmap-add", "arguments": {"granularity": 32768, "name": "bitmap-small", "node": "drive0"}} +{"return": {}} +{"execute": "blockdev-mirror", "arguments": {"bitmap": "bitmap-small", "device": "drive0", "filter-node-name": "mirror-top", "job-id": "api_job", "sync": "full", "target": "mirror_target"}} +{"error": {"class": "GenericError", "desc": "Bitmap granularity 32768 is not a multiple of the target image's cluster size 65536"}} + -- 2.39.2
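The added check reduces to requiring that the bitmap granularity is a multiple of the target's cluster size (and at least as large); as a standalone predicate (sketch, not QEMU code):

```python
def bitmap_granularity_ok(bitmap_granularity, target_cluster_size):
    # Copies happen at bitmap granularity, so each copy must cover whole
    # target clusters; otherwise a diff image gets partially filled
    # clusters that "mask" the base image after rebasing.
    return (bitmap_granularity >= target_cluster_size
            and bitmap_granularity % target_cluster_size == 0)
```

This matches the failing QMP reply above: granularity 32768 against a 65536-byte target cluster is rejected, while 65536 or 131072 would pass.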
[PATCH v2 3/4] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - extra set of reference/test mirrors to verify that writes in parallel with active mirror work Intentionally keeping copyright and ownership of original test case to honor provenance. The test was originally adapted by Fabian from 257, but has seen rather big changes, because the interface for mirror with bitmap was changed, i.e. no @bitmap-mode parameter anymore and bitmap is used as the working bitmap. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 adapt to changes to mirror bitmap interface rename test from '384' to 'bitmap-sync-mirror'] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/bitmap-sync-mirror | 565 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2939 + 2 files changed, 3504 insertions(+) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror new file mode 100755 index 00..898f1f4ba4 --- /dev/null +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -0,0 +1,565 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import math +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. 
+Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits |= set(bits) + +def dirty_group(self, n): +self.dirty_bits(GROUPS[n].bits(self.granularity)) + +def clear(self): +self._bits = set() + +def clear_bits(self, bits
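Pattern.bits() in the iotest above maps a byte range to the set of dirty-bitmap bit indices it touches; isolated, the arithmetic is just:

```python
GRANULARITY = 64 * 1024  # matches the iotest

def pattern_bits(offset, size, granularity=GRANULARITY):
    # First and last bitmap bit covered by [offset, offset + size)
    lower = offset // granularity
    upper = (offset + size - 1) // granularity
    return set(range(lower, upper + 1))

# A granularity-sized write straddling a cluster boundary dirties two bits:
print(sorted(pattern_bits(32 * 1024, 64 * 1024)))  # [0, 1]
```

The `- 1` on the upper bound is what keeps an exactly-aligned write from spuriously dirtying the following bit.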
[PATCH v2 1/4] qapi/block-core: avoid the re-use of MirrorSyncMode for backup
Backup supports all modes listed in MirrorSyncMode, while mirror does not. Introduce BackupSyncMode by copying the current MirrorSyncMode and drop the variants mirror does not support from MirrorSyncMode as well as the corresponding manual check in mirror_start(). Suggested-by: Vladimir Sementsov-Ogievskiy Signed-off-by: Fiona Ebner --- I felt like keeping the "Since: X.Y" as before makes the most sense as to not lose history. Or is it necessary to change this for BackupSyncMode (and its members) since it got a new name? block/backup.c | 18 - block/mirror.c | 7 --- block/monitor/block-hmp-cmds.c | 2 +- block/replication.c | 2 +- blockdev.c | 26 - include/block/block_int-global-state.h | 2 +- qapi/block-core.json | 27 +- 7 files changed, 47 insertions(+), 37 deletions(-) diff --git a/block/backup.c b/block/backup.c index ec29d6b810..1cc4e055c6 100644 --- a/block/backup.c +++ b/block/backup.c @@ -37,7 +37,7 @@ typedef struct BackupBlockJob { BdrvDirtyBitmap *sync_bitmap; -MirrorSyncMode sync_mode; +BackupSyncMode sync_mode; BitmapSyncMode bitmap_mode; BlockdevOnError on_source_error; BlockdevOnError on_target_error; @@ -111,7 +111,7 @@ void backup_do_checkpoint(BlockJob *job, Error **errp) assert(block_job_driver(job) == &backup_job_driver); -if (backup_job->sync_mode != MIRROR_SYNC_MODE_NONE) { +if (backup_job->sync_mode != BACKUP_SYNC_MODE_NONE) { error_setg(errp, "The backup job only supports block checkpoint in" " sync=none mode"); return; @@ -231,11 +231,11 @@ static void backup_init_bcs_bitmap(BackupBlockJob *job) uint64_t estimate; BdrvDirtyBitmap *bcs_bitmap = block_copy_dirty_bitmap(job->bcs); -if (job->sync_mode == MIRROR_SYNC_MODE_BITMAP) { +if (job->sync_mode == BACKUP_SYNC_MODE_BITMAP) { bdrv_clear_dirty_bitmap(bcs_bitmap, NULL); bdrv_dirty_bitmap_merge_internal(bcs_bitmap, job->sync_bitmap, NULL, true); -} else if (job->sync_mode == MIRROR_SYNC_MODE_TOP) { +} else if (job->sync_mode == BACKUP_SYNC_MODE_TOP) { /* * We can't hog the coroutine to initialize this 
thoroughly. * Set a flag and resume work when we are able to yield safely. @@ -254,7 +254,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) backup_init_bcs_bitmap(s); -if (s->sync_mode == MIRROR_SYNC_MODE_TOP) { +if (s->sync_mode == BACKUP_SYNC_MODE_TOP) { int64_t offset = 0; int64_t count; @@ -282,7 +282,7 @@ static int coroutine_fn backup_run(Job *job, Error **errp) block_copy_set_skip_unallocated(s->bcs, false); } -if (s->sync_mode == MIRROR_SYNC_MODE_NONE) { +if (s->sync_mode == BACKUP_SYNC_MODE_NONE) { /* * All bits are set in bcs bitmap to allow any cluster to be copied. * This does not actually require them to be copied. @@ -354,7 +354,7 @@ static const BlockJobDriver backup_job_driver = { BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, BlockDriverState *target, int64_t speed, - MirrorSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, + BackupSyncMode sync_mode, BdrvDirtyBitmap *sync_bitmap, BitmapSyncMode bitmap_mode, bool compress, const char *filter_node_name, @@ -376,8 +376,8 @@ BlockJob *backup_job_create(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); /* QMP interface protects us from these cases */ -assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); -assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP); +assert(sync_mode != BACKUP_SYNC_MODE_INCREMENTAL); +assert(sync_bitmap || sync_mode != BACKUP_SYNC_MODE_BITMAP); if (bs == target) { error_setg(errp, "Source and target cannot be the same"); diff --git a/block/mirror.c b/block/mirror.c index 5145eb53e1..1609354db3 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -2011,13 +2011,6 @@ void mirror_start(const char *job_id, BlockDriverState *bs, GLOBAL_STATE_CODE(); -if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) || -(mode == MIRROR_SYNC_MODE_BITMAP)) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(mode)); -return; -} - bdrv_graph_rdlock_main_loop(); is_none_mode = mode == MIRROR_SYNC_MODE_NONE; base = mode == 
MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL; diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c index d954bec6f1..9633d000c0 100644 --- a/blo
[PATCH v2 2/4] mirror: allow specifying working bitmap
From: John Snow for the mirror job. The bitmap's granularity is used as the job's granularity. The new @bitmap parameter is marked unstable in the QAPI and can currently only be used for @sync=full mode. Clusters initially dirty in the bitmap as well as new writes are copied to the target. Using the block-dirty-bitmap-clear and block-dirty-bitmap-merge APIs, callers can simulate the three kinds of @BitmapSyncMode (which is used by backup): 1. always: default, just pass bitmap as working bitmap. 2. never: copy bitmap and pass copy to the mirror job. 3. on-success: copy bitmap and pass copy to the mirror job and if successful, merge bitmap into original afterwards. When the target image is a fresh "diff image", i.e. one that was not used as the target of a previous mirror and the target image's cluster size is larger than the bitmap's granularity, or when @copy-mode=write-blocking is used, there is a pitfall, because the cluster in the target image will be allocated, but not contain all the data corresponding to the same region in the source image. An idea to avoid the limitation would be to mark clusters which are affected by unaligned writes and are not allocated in the target image dirty, so they would be copied fully later. However, for migration, the invariant that an actively synced mirror stays actively synced (unless an error happens) is useful, because without that invariant, migration might inactivate block devices when the mirror still has work to do and run into an assertion failure [0]. Another approach would be to read the missing data from the source upon unaligned writes to be able to write the full target cluster instead. But certain targets like NBD do not allow querying the cluster size. To avoid limiting/breaking the use case of syncing to an existing target, which is arguably more common than the diff image use case, document the limitation in QAPI. 
This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily, first by John and then again by Fiona. [0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/ Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 get rid of bitmap mode parameter use caller-provided bitmap as working bitmap turn bitmap parameter experimental] Signed-off-by: Fiona Ebner --- block/mirror.c | 95 -- blockdev.c | 39 +-- include/block/block_int-global-state.h | 5 +- qapi/block-core.json | 37 +- tests/unit/test-block-iothread.c | 2 +- 5 files changed, 146 insertions(+), 32 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 1609354db3..5c9a00b574 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -73,6 +73,11 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +/* + * Whether the bitmap is created locally or provided by the caller (for + * incremental sync). 
+ */ +bool dirty_bitmap_is_local; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -687,7 +692,11 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); +if (s->dirty_bitmap_is_local) { +bdrv_release_dirty_bitmap(s->dirty_bitmap); +} else { +bdrv_enable_dirty_bitmap(s->dirty_bitmap); +} /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ @@ -718,7 +727,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -815,6 +825,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if
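The three emulated @BitmapSyncMode behaviors from the commit message can be sketched with Python sets standing in for dirty bitmaps (purely illustrative; a real client would use the block-dirty-bitmap-add, -clear and -merge QMP commands, and the function and mode names here are made up):

```python
def simulate_bitmap_sync_mode(mode, bitmap, run_mirror):
    # bitmap: set of dirty cluster indices owned by the caller.
    # run_mirror(working): runs the mirror job with `working` as the
    # working bitmap (the job clears the bits it copied) and returns
    # True on success.
    if mode == "always":
        run_mirror(bitmap)  # job works directly on the caller's bitmap
        return
    working = set(bitmap)   # "copy bitmap and pass copy to the job"
    success = run_mirror(working)
    if mode == "on-success" and success:
        bitmap.intersection_update(working)  # keep only uncopied bits

def _copies_everything(working):  # toy mirror job: everything succeeds
    working.clear()
    return True

always_bitmap = {1, 2}
simulate_bitmap_sync_mode("always", always_bitmap, _copies_everything)
never_bitmap = {1, 2}
simulate_bitmap_sync_mode("never", never_bitmap, _copies_everything)
on_success_bitmap = {1, 2}
simulate_bitmap_sync_mode("on-success", on_success_bitmap, _copies_everything)
```

After a fully successful run, "always" and "on-success" leave the caller's bitmap clear, while "never" leaves it untouched.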
[PATCH v2 0/4] mirror: allow specifying working bitmap
Changes from RFC/v1 (discussion here [0]): * Add patch to split BackupSyncMode and MirrorSyncMode. * Drop bitmap-mode parameter and use passed-in bitmap as the working bitmap instead. Users can get the desired behaviors by using the block-dirty-bitmap-clear and block-dirty-bitmap-merge calls (see commit message in patch 2/4 for how exactly). * Add patch to check whether target image's cluster size is at most mirror job's granularity. Optional, it's an extra safety check that's useful when the target is a "diff" image that does not have previously synced data. Use cases: * Possibility to resume a failed mirror later. * Possibility to only mirror deltas to a previously mirrored volume. * Possibility to (efficiently) mirror a drive that was previously mirrored via some external mechanism (e.g. ZFS replication). We have been using the last one in production without any issues for about 4 years now. In particular, as mentioned in [1]: > - create bitmap(s) > - (incrementally) replicate storage volume(s) out of band (using ZFS) > - incrementally drive mirror as part of a live migration of VM > - drop bitmap(s) Now, the IO test added in patch 4/4 actually contains yet another use case, namely doing incremental mirrors to stand-alone qcow2 "diff" images that only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues. But this exposed an issue with the diff image approach. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. 
The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB. For the latter case, patch 4/4 adds a check. For the former, the limitation is documented (I'd expect this to be a niche use case in practice). [0]: https://lore.kernel.org/qemu-devel/b91dba34-7969-4d51-ba40-96a91038c...@yandex-team.ru/T/#m4ae27dc8ca1fb053e0a32cc4ffa2cfab6646805c [1]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/ Fabian Grünbichler (1): iotests: add test for bitmap mirror Fiona Ebner (2): qapi/block-core: avoid the re-use of MirrorSyncMode for backup blockdev: mirror: check for target's cluster size when using bitmap John Snow (1): mirror: allow specifying working bitmap block/backup.c| 18 +- block/mirror.c| 102 +- block/monitor/block-hmp-cmds.c|2 +- block/replication.c |2 +- blockdev.c| 84 +- include/block/block_int-global-state.h|7 +- qapi/block-core.json | 64 +- tests/qemu-iotests/tests/bitmap-sync-mirror | 571 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2946 + tests/unit/test-block-iothread.c |2 +- 10 files changed, 3729 insertions(+), 69 deletions(-) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out -- 2.39.2
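The rebasing failure described above can be reproduced in a toy byte-level model (not QEMU code; cluster size and data are invented): an unaligned write allocates a whole cluster in the diff image, and rebasing then hides the base image's data in the unwritten part of that cluster.

```python
CLUSTER = 8  # toy cluster size in bytes

def rebase(base, diff, allocated):
    # Per-cluster rebase: an allocated diff cluster completely
    # replaces ("masks") the corresponding base cluster.
    out = bytearray(base)
    for c in allocated:
        out[c * CLUSTER:(c + 1) * CLUSTER] = diff[c * CLUSTER:(c + 1) * CLUSTER]
    return bytes(out)

base = b"A" * 16          # two clusters of previously synced data
diff = bytearray(16)      # diff image starts fully unallocated (zeros)
allocated = set()

# An unaligned 4-byte write lands in the diff image and allocates the
# whole first cluster, with the remaining 4 bytes left zero-filled:
diff[0:4] = b"B" * 4
allocated.add(0)

result = rebase(base, bytes(diff), allocated)
assert result[0:4] == b"BBBB"
assert result[4:8] == b"\x00" * 4  # base data masked by the allocated cluster
assert result[8:16] == b"A" * 8    # untouched cluster still shows the base
```

This is exactly why patch 4/4 requires the bitmap granularity (which copy operations are aligned to) to cover whole target clusters.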
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 13:47 schrieb Fiona Ebner:
> Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy:
>> On 29.02.24 13:11, Fiona Ebner wrote:
>>>
>>> The iotest creates a new target image for each incremental sync which
>>> only records the diff relative to the previous mirror and those diff
>>> images are later rebased onto each other to get the full picture.
>>>
>>> Thus, it can be that a previous mirror job (not just background process
>>> or previous write) already copied a cluster, and in particular, copied
>>> it to a different target!
>>
>> Ah, I understand.
>>
>> For simplicity, let's consider the case when source "cluster size" = "job
>> cluster size" = "bitmap granularity" = "target cluster size".
>>
>> Which types of clusters should we consider when we want to handle a guest
>> write?
>>
>> 1. Clusters that should be copied by the background process
>>
>> These are dirty clusters from the user-given bitmap, or, if we do a
>> full-disk mirror, all clusters not yet copied by the background process.
>>
>> For such clusters we simply ignore the unaligned write. We can even
>> ignore the aligned write too: less disturbing the guest by delays.
>>
>
> Since do_sync_target_write() currently doesn't ignore aligned writes, I
> wouldn't change it. Of course they can count towards the "done_bitmap"
> you propose below.
>
>> 2. Clusters already copied by the background process during this mirror
>> job and not dirtied by the guest since that time.
>>
>> For such clusters we are safe to do the unaligned write, as the target
>> cluster must be allocated.
>>
>
> Right.
>
>> 3. Clusters not marked initially by the dirty bitmap.
>>
>> What to do with them? We can't do an unaligned write. I see two variants:
>>
>> - do an additional read from the source to fill the whole cluster, which
>> seems a bit too heavy
>>
>
> Yes, I'd rather only do that as a last resort.
>
>> - just mark the cluster as dirty for the background job. So we behave
>> like in "background" mode. But why not? The maximum count of such "hacks"
>> is limited to the number of "clear" clusters at the start of the mirror
>> job, which means that we don't seriously affect the convergence. Mirror
>> is guaranteed to converge anyway. And the whole sense of "write-blocking"
>> mode is to have guaranteed convergence. What do you think?
>>
>
> It could lead to a lot of flips between job->actively_synced == true and
> == false. AFAIU, currently, we only switch back from true to false when
> an error happens. While I don't see a concrete issue with it, at least
> it might be unexpected to users, so it had better be documented.
>
> I'll try going with this approach, thanks!
>

These flips are actually a problem. When using live migration with disk mirroring, it's good that an actively synced image stays actively synced. Otherwise, migration could finish at an inconvenient time and try to inactivate the block device while mirror still has something to do, which would lead to an assertion failure [0].

The IO test added by this series is what uses the possibility to sync to "diff images" which contain only the delta. In production, we are only syncing to a previously mirrored target image. Non-aligned writes are not an issue later like with a diff image (even if the initial mirroring happened via ZFS replication outside of QEMU). So copy-mode=write-blocking would work fine for our use case, but if I go with the "mark clusters for unaligned writes dirty" approach, it would not work fine anymore.

Should I rather just document the limitation for the combination "target is a diff image" and copy-mode=write-blocking? I'd still add the check for the granularity and target cluster size. While also only needed for diff images, it would allow using background mode safely for those.

Best Regards,
Fiona

[0]: https://lore.kernel.org/qemu-devel/1db7f571-cb7f-c293-04cc-cd856e060...@proxmox.com/
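The per-cluster decision discussed above can be sketched as a small policy helper. This is a hypothetical Python model of the proposed write-blocking behavior (the names `dirty` and `done` correspond to the user-given bitmap and the proposed "done_bitmap"), not actual QEMU code:

```python
def handle_guest_write(dirty, done):
    """Decide how to handle an unaligned guest write to one cluster.

    dirty: cluster is still marked in the bitmap driving the background
           copy process.
    done:  cluster was already copied during *this* job (the proposed
           "done_bitmap").
    """
    if dirty:
        # Case 1: the background process will copy the cluster anyway,
        # so the unaligned write can simply be ignored here.
        return "ignore"
    if done:
        # Case 2: the target cluster is known to be allocated by this
        # job, so an unaligned write into it is safe.
        return "write-unaligned"
    # Case 3: the cluster was clear in the initial bitmap; instead of a
    # read-modify-write, mark it dirty and let the background job copy
    # the whole cluster later.
    return "mark-dirty"
```

Note that case 3 is exactly the step that can flip job->actively_synced back to false, which is the concern raised in the reply.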
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 16:46 schrieb Vladimir Sementsov-Ogievskiy:
> On 01.03.24 18:14, Fiona Ebner wrote:
>> Am 01.03.24 um 16:02 schrieb Vladimir Sementsov-Ogievskiy:
>>>>> About documentation: actually, I never liked that we use
>>>>> "MirrorSyncMode" for the backup job. Now it looks more like
>>>>> "BackupSyncMode", having two values supported only by backup.
>>>>>
>>>>> I'm also unsure how mode=full=some_bitmap differs from
>>>>> mode=bitmap=some_bitmap..
>>>>>
>>>>
>>>> With the current patches, it was an error to specify @bitmap for other
>>>> modes than 'incremental' and 'bitmap'.
>>>
>>> Current documentation says:
>>> # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
>>> #     is "bitmap" or "incremental".  Can be present if sync is "full"
>>> #     or "top".  Must not be present otherwise.
>>> #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>>>
>>
>> This is for backup. The documentation (and behavior) for @bitmap added
>> by these patches for mirror is different ;)
>
> I meant backup in "I'm also unsure", just as one more point to not
> consider the backup-bitmap-API as a prototype for the mirror-bitmap-API.
>

Oh, I see. Sorry for the confusion!

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 16:02 schrieb Vladimir Sementsov-Ogievskiy:
> On 01.03.24 17:52, Fiona Ebner wrote:
>> Am 01.03.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
>>>
>>> As we already understood, the (block-)job-api needs some spring-cleaning.
>>> Unfortunately I don't have much time for it, but still I decided to
>>> start by finally deprecating the block-job-* API and moving to job-*..
>>> Probably the bitmap/bitmap-mode/sync APIs also need some optimization,
>>> keeping in mind the new block-dirty-bitmap-merge API.
>>>
>>> So, what I could advise in this situation for new interfaces:
>>>
>>> 1. be minimalistic
>>> 2. add `x-` prefix when unsure
>>>
>>> So, following these two rules, what about an x-bitmap field, which may
>>> be combined only with 'full' mode, and do what you need?
>>>
>>
>> AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it
>> doesn't need to be renamed if it becomes stable later.
>
> Right, an unstable feature is needed; using "x-" is optional.
>
> A recent discussion about it was in my "vhost-user-blk: live resize
> additional APIs" series:
>
> https://patchew.org/QEMU/20231006202045.1161543-1-vsement...@yandex-team.ru/20231006202045.1161543-5-vsement...@yandex-team.ru/
>
> Following it, I think it's OK to not bother anymore with "x-" prefixes
> and to rely on the unstable feature.
>

Thanks for the confirmation! I'll go without the prefix in the name then.

>>
>>> About documentation: actually, I never liked that we use
>>> "MirrorSyncMode" for the backup job. Now it looks more like
>>> "BackupSyncMode", having two values supported only by backup.
>>>
>>> I'm also unsure how mode=full=some_bitmap differs from
>>> mode=bitmap=some_bitmap..
>>>
>>
>> With the current patches, it was an error to specify @bitmap for other
>> modes than 'incremental' and 'bitmap'.
>
> Current documentation says:
> # @bitmap: The name of a dirty bitmap to use.  Must be present if sync
> #     is "bitmap" or "incremental".  Can be present if sync is "full"
> #     or "top".  Must not be present otherwise.
> #     (Since 2.4 (drive-backup), 3.1 (blockdev-backup))
>

This is for backup. The documentation (and behavior) for @bitmap added by these patches for mirror is different ;)

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 01.03.24 um 15:14 schrieb Vladimir Sementsov-Ogievskiy:
>
> As we already understood, the (block-)job-api needs some spring-cleaning.
> Unfortunately I don't have much time for it, but still I decided to
> start by finally deprecating the block-job-* API and moving to job-*..
> Probably the bitmap/bitmap-mode/sync APIs also need some optimization,
> keeping in mind the new block-dirty-bitmap-merge API.
>
> So, what I could advise in this situation for new interfaces:
>
> 1. be minimalistic
> 2. add `x-` prefix when unsure
>
> So, following these two rules, what about an x-bitmap field, which may
> be combined only with 'full' mode, and do what you need?
>

AFAIU, it should rather be marked as @unstable in QAPI [0]? Then it doesn't need to be renamed if it becomes stable later.

> About documentation: actually, I never liked that we use
> "MirrorSyncMode" for the backup job. Now it looks more like
> "BackupSyncMode", having two values supported only by backup.
>
> I'm also unsure how mode=full=some_bitmap differs from
> mode=bitmap=some_bitmap..
>

With the current patches, it was an error to specify @bitmap for other modes than 'incremental' and 'bitmap'.

> So, I'd suggest simply renaming MirrorSyncMode to BackupSyncMode, and
> adding a separate MirrorSyncMode with only "full", "top" and "none"
> values.
>

Sounds good to me!

[0]: https://gitlab.com/qemu-project/qemu/-/commit/a3c45b3e62962f99338716b1347cfb0d427cea44

Best Regards,
Fiona
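For reference, marking a QAPI member with the unstable feature (instead of an x- prefix) looks roughly like this. The struct and member names here are made up for illustration; only the `'features': [ 'unstable' ]` notation is the actual QAPI mechanism being discussed:

```
{ 'struct': 'SomeMirrorOptions',
  'data': { '*bitmap': { 'type': 'str',
                         'features': [ 'unstable' ] } } }
```

A member declared this way can later be stabilized by dropping the feature, without any rename.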
Re: [PATCH 00/19] Workaround Windows failing to find 64bit SMBIOS entry point with SeaBIOS
Am 29.02.24 um 14:18 schrieb Fiona Ebner:
> Am 27.02.24 um 16:47 schrieb Igor Mammedov:
>> The Windows (10) bootloader, when running on top of SeaBIOS, fails to
>> find the SMBIOSv3 entry point. Tracing it shows that it looks for v2
>> anchor markers only and not v3. Tricking it into believing that the
>> entry point was found lets Windows successfully locate and parse the
>> SMBIOSv3 tables. Whether it will be fixed on the Windows side is not
>> clear, so here goes a workaround.
>>
>> The idea is to try to build v2 tables if the QEMU configuration permits,
>> and fall back to v3 tables otherwise. That will mask the Windows issue
>> from the majority of users.
>> However, if the VM configuration can't be described by v2 tables
>> (typically large VMs), QEMU will use SMBIOSv3 and Windows will hit the
>> issue again. In this case complain to Microsoft and/or use UEFI instead
>> of SeaBIOS (requires a reinstall).
>>
>> Default compat setting of smbios-entry-point-type after this series
>> for pc/q35 machines:
>>   * 9.0-newer: 'auto'
>>   * 8.1-8.2: '64'
>>   * 8.0-older: '32'
>>
>> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/2008
>
> Thank you! I'm happy to confirm that this series works around the issue :)
>

While I still didn't do any in-depth testing (I don't have enough knowledge for that anyway), I have played around a bit more now, checked that nothing obvious breaks with a Linux VM either, and also ran a successful 'make check'. If that is enough, feel free to add:

Tested-by: Fiona Ebner

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 13:00 schrieb Vladimir Sementsov-Ogievskiy:
>
> But anyway, this all could be simply achieved with the
> bitmap-copying/merging API, if we allow passing a user-given bitmap to
> the mirror as the working bitmap.
>
>>
>> I see, I'll drop the 'bitmap-mode' in the next version if nobody
>> complains :)
>>
>
> Good. It's a golden rule: never make public interfaces which you don't
> actually need for production. I myself sometimes violate it and spend
> extra time on developing features, which we later have to just drop as
> "not needed downstream, no sense in upstreaming".
>

Just wondering which new mode I should allow for @MirrorSyncMode then? The documentation states:

> # @incremental: only copy data described by the dirty bitmap.
> #     (since: 2.4)
> #
> # @bitmap: only copy data described by the dirty bitmap. (since: 4.2)
> #     Behavior on completion is determined by the BitmapSyncMode.

For backup, do_backup_common() just maps @incremental to @bitmap + @bitmap-mode == @on-success. Using @bitmap for mirror would lead to being at odds with the documentation, because it mentions the BitmapSyncMode, which mirror won't have. Using @incremental for mirror would be consistent with the documentation, but behave a bit differently from backup. Opinions?

Best Regards,
Fiona
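The backup-side desugaring mentioned above can be modeled like this. This is a Python sketch of the mapping that do_backup_common() performs, not the actual C code:

```python
def desugar_sync_mode(sync, bitmap_mode):
    """Map sync=incremental to sync=bitmap with bitmap-mode=on-success.

    Other sync modes pass through unchanged. Raises on the nonsensical
    combination of incremental with a conflicting bitmap-mode.
    """
    if sync == "incremental":
        if bitmap_mode is not None and bitmap_mode != "on-success":
            raise ValueError("incremental implies bitmap-mode=on-success")
        return "bitmap", "on-success"
    return sync, bitmap_mode
```

The open question for mirror is whether to expose @incremental (matching this desugaring but behaving slightly differently from backup) or @bitmap (whose documentation references BitmapSyncMode, which mirror would not have).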
Re: [PATCH 00/19] Workaround Windows failing to find 64bit SMBIOS entry point with SeaBIOS
Am 27.02.24 um 16:47 schrieb Igor Mammedov:
> The Windows (10) bootloader, when running on top of SeaBIOS, fails to
> find the SMBIOSv3 entry point. Tracing it shows that it looks for v2
> anchor markers only and not v3. Tricking it into believing that the
> entry point was found lets Windows successfully locate and parse the
> SMBIOSv3 tables. Whether it will be fixed on the Windows side is not
> clear, so here goes a workaround.
>
> The idea is to try to build v2 tables if the QEMU configuration permits,
> and fall back to v3 tables otherwise. That will mask the Windows issue
> from the majority of users.
> However, if the VM configuration can't be described by v2 tables
> (typically large VMs), QEMU will use SMBIOSv3 and Windows will hit the
> issue again. In this case complain to Microsoft and/or use UEFI instead
> of SeaBIOS (requires a reinstall).
>
> Default compat setting of smbios-entry-point-type after this series
> for pc/q35 machines:
>   * 9.0-newer: 'auto'
>   * 8.1-8.2: '64'
>   * 8.0-older: '32'
>
> Fixes: https://gitlab.com/qemu-project/qemu/-/issues/2008
>

Thank you! I'm happy to confirm that this series works around the issue :)

Best Regards,
Fiona
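For affected setups, the entry point type discussed above can also be pinned manually via the machine property. This is an illustrative invocation (the rest of the command line is elided; '32' forces the SMBIOS 2.1 entry point that the Windows bootloader searches for):

```
qemu-system-x86_64 -machine q35,smbios-entry-point-type=32 ...
```

With the series applied, 'auto' makes QEMU pick v2 when the configuration fits and v3 otherwise.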
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 29.02.24 um 12:48 schrieb Vladimir Sementsov-Ogievskiy:
> On 29.02.24 13:11, Fiona Ebner wrote:
>>
>> The iotest creates a new target image for each incremental sync which
>> only records the diff relative to the previous mirror and those diff
>> images are later rebased onto each other to get the full picture.
>>
>> Thus, it can be that a previous mirror job (not just background process
>> or previous write) already copied a cluster, and in particular, copied
>> it to a different target!
>
> Ah, I understand.
>
> For simplicity, let's consider the case when source "cluster size" = "job
> cluster size" = "bitmap granularity" = "target cluster size".
>
> Which types of clusters should we consider when we want to handle a guest
> write?
>
> 1. Clusters that should be copied by the background process
>
> These are dirty clusters from the user-given bitmap, or, if we do a
> full-disk mirror, all clusters not yet copied by the background process.
>
> For such clusters we simply ignore the unaligned write. We can even
> ignore the aligned write too: less disturbing the guest by delays.
>

Since do_sync_target_write() currently doesn't ignore aligned writes, I wouldn't change it. Of course they can count towards the "done_bitmap" you propose below.

> 2. Clusters already copied by the background process during this mirror
> job and not dirtied by the guest since that time.
>
> For such clusters we are safe to do the unaligned write, as the target
> cluster must be allocated.
>

Right.

> 3. Clusters not marked initially by the dirty bitmap.
>
> What to do with them? We can't do an unaligned write. I see two variants:
>
> - do an additional read from the source to fill the whole cluster, which
> seems a bit too heavy
>

Yes, I'd rather only do that as a last resort.

> - just mark the cluster as dirty for the background job. So we behave
> like in "background" mode. But why not? The maximum count of such "hacks"
> is limited to the number of "clear" clusters at the start of the mirror
> job, which means that we don't seriously affect the convergence. Mirror
> is guaranteed to converge anyway. And the whole sense of "write-blocking"
> mode is to have guaranteed convergence. What do you think?
>

It could lead to a lot of flips between job->actively_synced == true and == false. AFAIU, currently, we only switch back from true to false when an error happens. While I don't see a concrete issue with it, at least it might be unexpected to users, so it had better be documented.

I'll try going with this approach, thanks!

>
> Of course, we can't distinguish the 3 types by one dirty bitmap, so we
> need a second one. For example a "done_bitmap", where we can mark
> clusters that were successfully copied. That would be a kind of
> block-status of the target image. But using a bitmap is a lot better
> than querying block-status from the target.

Best Regards,
Fiona
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 28.02.24 um 17:24 schrieb Vladimir Sementsov-Ogievskiy:
> On 16.02.24 13:55, Fiona Ebner wrote:
>> Previous discussion from when this was sent upstream [0] (it's been a
>> while). I rebased the patches and re-ordered and squashed like
>> suggested back then [1].
>>
>> This implements two new mirror modes:
>>
>> - bitmap mirror mode with always/on-success/never bitmap sync mode
>> - incremental mirror mode as sugar for bitmap + on-success
>>
>> Use cases:
>> * Possibility to resume a failed mirror later.
>> * Possibility to only mirror deltas to a previously mirrored volume.
>> * Possibility to (efficiently) mirror a drive that was previously
>> mirrored via some external mechanism (e.g. ZFS replication).
>>
>> We have been using the last one in production without any issues for
>> about 4 years now. In particular, like mentioned in [2]:
>>
>>> - create bitmap(s)
>>> - (incrementally) replicate storage volume(s) out of band (using ZFS)
>>> - incrementally drive mirror as part of a live migration of VM
>>> - drop bitmap(s)
>
> Actually, which mode do you use, "never", "always" or "conditional"? Or
> do you have a different approach downstream?
>

We are using "conditional", but I think we don't really require any specific mode, because we drop the bitmaps after mirroring (even in the failure case). Fabian, please correct me if I'm wrong.

> Why am I asking:
>
> These modes (for backup) were developed prior to the
> block-dirty-bitmap-merge command, which allowed copying bitmaps as you
> want. With that API, we actually don't need all these modes; instead,
> it's enough to pass a bitmap, which would be _actually_ used by mirror.
>
> So, if you need "never" mode, you just copy your bitmap by
> block-dirty-bitmap-add + block-dirty-bitmap-merge, and pass a copy to
> the mirror job.
>
> Or, you pass your bitmap to the mirror job, and have an "always" mode.
>
> And I don't see why we need a "conditional" mode, which actually just
> drops away the progress we actually made. (OK, we failed, but why drop
> the progress of successfully copied clusters?)
>

I'm not sure actually. Maybe John remembers?

I see, I'll drop the 'bitmap-mode' in the next version if nobody complains :)

>
> Using a user-given bitmap in the mirror job also has the additional
> advantage of live progress: up to visualizing the disk copy progress
> via the dirty bitmap contents.
>

Best Regards,
Fiona
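The bitmap-copying approach described above can be done today with existing QMP commands. For example, to emulate the "never" mode, copy the user bitmap into a fresh working bitmap and pass the copy to the mirror job (the node and bitmap names here are made up):

```json
{ "execute": "block-dirty-bitmap-add",
  "arguments": { "node": "drive0", "name": "mirror-work" } }
{ "execute": "block-dirty-bitmap-merge",
  "arguments": { "node": "drive0", "target": "mirror-work",
                 "bitmaps": [ "user-bitmap" ] } }
```

Since the job then consumes only "mirror-work", the original "user-bitmap" is left untouched, which is exactly the "never" semantics; passing "user-bitmap" directly would give "always" semantics.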
Re: [RFC 0/4] mirror: implement incremental and bitmap modes
Am 28.02.24 um 17:06 schrieb Vladimir Sementsov-Ogievskiy:
> On 28.02.24 19:00, Vladimir Sementsov-Ogievskiy wrote:
>> On 16.02.24 13:55, Fiona Ebner wrote:
>>> Now, the IO test added in patch 4/4 actually contains yet another use
>>> case, namely doing incremental mirrors to stand-alone qcow2 "diff"
>>> images, which only contain the delta and can be rebased later. I had to
>>> adapt the IO test, because its output expected the mirror bitmap to
>>> still be dirty, but nowadays the mirror is apparently already done
>>> when the bitmaps are queried. So I thought, I'll just use
>>> 'write-blocking' mode to avoid any potential timing issues.
>>>
>>> But this exposed an issue with the diff image approach. If a write is
>>> not aligned to the granularity of the mirror target, then rebasing the
>>> diff image onto a backing image will not yield the desired result,
>>> because the full cluster is considered to be allocated and will "hide"
>>> some part of the base/backing image. The failure can be seen by either
>>> using 'write-blocking' mode in the IO test or setting the (bitmap)
>>> granularity to 32 KiB rather than the current 64 KiB.
>>>
>>> The question is how to deal with these edge cases? Some possibilities
>>> that would make sense to me:
>>>
>>> For 'background' mode:
>>> * prohibit if the target's cluster size is larger than the bitmap
>>> granularity
>>> * document the limitation
>>>
>>> For 'write-blocking' mode:
>>> * disallow in combination with bitmap mode (would not be happy about
>>> it, because I'd like to use this without diff images)
>>
>> why not just require the same: bitmap granularity must be >= target
>> granularity
>>

For the iotest's use case, that only works for background mode. I'll explain below.

>>> * for writes that are not aligned to the target's cluster size, read
>>> the relevant/missing parts from the source image to be able to write
>>> whole target clusters (seems rather complex)
>>
>> There is another approach: consider an unaligned part of the request
>> that fits in one cluster (we can always split any request into an
>> "aligned" middle part and at most two small "unaligned" parts, each
>> fitting into one cluster).
>>
>> We have two possibilities:
>>
>> 1. the cluster is dirty (marked dirty in the bitmap used by the
>> background process)
>>
>> We can simply ignore this part and rely on the background process. This
>> will not affect the convergence of the mirror job.
>>

Agreed.

>> 2. the cluster is clear (i.e. the background process, or some previous
>> write, already copied it)
>>

The iotest creates a new target image for each incremental sync which only records the diff relative to the previous mirror and those diff images are later rebased onto each other to get the full picture.

Thus, it can be that a previous mirror job (not just background process or previous write) already copied a cluster, and in particular, copied it to a different target!

>> In this case, we are safe to do the unaligned write, as the target
>> cluster must be allocated.

Because the diff image is new, the target's cluster is not necessarily allocated. When using write-blocking and a write of, e.g., 9 bytes to a clear source cluster comes in, only those 9 bytes are written to the target. Now the target's cluster is allocated, but with only those 9 bytes of data. When rebasing, the previously copied cluster is "masked" and when reading the rebased image, we only see the cluster with those 9 bytes (and IIRC, zeroes for the rest of the cluster rather than the previously copied data).

>>
>> (for bitmap-mode, I don't consider here clusters that are clear from
>> the start, which we shouldn't copy in any case)
>>

We do need to copy new writes to any cluster, and with a clear cluster and write-blocking, the issue can manifest.

>
> Hmm, right, and that's exactly the logic we already have in
> do_sync_target_write(). So it's enough to just require that
> bitmap_granularity >= target_granularity
>

Best Regards,
Fiona
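The request splitting described above (an "aligned" middle part plus at most two small "unaligned" edge parts, each contained in a single cluster) can be sketched as follows; this is an illustrative Python helper, not QEMU's implementation:

```python
def split_request(offset, length, cluster_size):
    """Split [offset, offset+length) into head/middle/tail fragments.

    head and tail are the unaligned parts (each contained within one
    cluster); middle is aligned to cluster_size. Empty fragments are
    dropped, so the result has at most three (offset, length) pairs.
    """
    end = offset + length
    # First cluster boundary at or after offset, clamped to the request.
    aligned_start = min(end, -(-offset // cluster_size) * cluster_size)
    # Last cluster boundary at or before end, clamped from below.
    aligned_end = max(aligned_start, (end // cluster_size) * cluster_size)
    head = (offset, aligned_start - offset)
    middle = (aligned_start, aligned_end - aligned_start)
    tail = (aligned_end, end - aligned_end)
    return [frag for frag in (head, middle, tail) if frag[1] > 0]
```

Each resulting unaligned fragment can then be handled per the two cases above (cluster dirty: ignore; cluster clear: write or mark dirty), while the aligned middle is copied as whole clusters.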
Re: [RFC 1/4] drive-mirror: add support for sync=bitmap mode=never
Am 21.02.24 um 07:55 schrieb Markus Armbruster:
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index ab5a93a966..ac05483958 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -2181,6 +2181,15 @@
>>  # destination (all the disk, only the sectors allocated in the
>>  # topmost image, or only new I/O).
>>  #
>> +# @bitmap: The name of a bitmap to use for sync=bitmap mode. This
>> +#     argument must be present for bitmap mode and absent otherwise.
>> +#     The bitmap's granularity is used instead of @granularity.
>> +#     (Since 9.0).
>
> What happens when the user specifies @granularity anyway? Error or
> silently ignored?
>

It's an error:

>> +if (bitmap) {
>> +    if (granularity) {
>> +        error_setg(errp, "granularity (%d)"
>> +                   "cannot be specified when a bitmap is provided",
>> +                   granularity);
>> +        return NULL;
>> +    }

>> +#
>> +# @bitmap-mode: Specifies the type of data the bitmap should contain
>> +#     after the operation concludes. Must be present if sync is
>> +#     "bitmap". Must NOT be present otherwise. (Since 9.0)
>
> Members that must be present when and only when some enum member has a
> certain value should perhaps be in a union branch. Perhaps the block
> maintainers have an opinion here.
>

Sounds sensible to me. Considering also the next patches, in the end it could be a union discriminated by @sync: it would contain @bitmap and @bitmap-mode for the 'bitmap' sync mode, only @bitmap for the 'incremental' sync mode (@bitmap-mode needs to be 'on-success' then, so there is no choice for the user), and @granularity for the other sync modes.

Best Regards,
Fiona
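The union idea could look roughly like this in the schema. This is purely illustrative: all type and branch names are invented here and none of this is the actual proposed interface:

```
{ 'union': 'MirrorSyncOptions',
  'base': { 'sync': 'MirrorSyncMode' },
  'discriminator': 'sync',
  'data': { 'bitmap':      'MirrorBitmapOptions',
            'incremental': 'MirrorIncrementalOptions',
            'full':        'MirrorGranularityOptions',
            'top':         'MirrorGranularityOptions',
            'none':        'MirrorGranularityOptions' } }
```

The branch types would then carry @bitmap and @bitmap-mode, only @bitmap, or @granularity respectively, so the "must be present if and only if" constraints become structural instead of documented.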
[RFC 1/4] drive-mirror: add support for sync=bitmap mode=never
From: John Snow This patch adds support for the "BITMAP" sync mode to drive-mirror and blockdev-mirror. It adds support only for the BitmapSyncMode "never," because it's the simplest mode. This mode simply uses a user-provided bitmap as an initial copy manifest, and then does not clear any bits in the bitmap at the conclusion of the operation. Any new writes dirtied during the operation are copied out, in contrast to backup. Note that whether these writes are reflected in the bitmap at the conclusion of the operation depends on whether that bitmap is actually recording! This patch was originally based on one by Ma Haocong, but it has since been modified pretty heavily. Suggested-by: Ma Haocong Signed-off-by: Ma Haocong Signed-off-by: John Snow [FG: switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0 update version and formatting in QAPI] Signed-off-by: Fiona Ebner --- block/mirror.c | 96 -- blockdev.c | 38 +- include/block/block_int-global-state.h | 4 +- qapi/block-core.json | 25 ++- tests/unit/test-block-iothread.c | 4 +- 5 files changed, 139 insertions(+), 28 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 5145eb53e1..315dff11e2 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -51,7 +51,7 @@ typedef struct MirrorBlockJob { BlockDriverState *to_replace; /* Used to block operations on the drive-mirror-replace target */ Error *replace_blocker; -bool is_none_mode; +MirrorSyncMode sync_mode; BlockMirrorBackingMode backing_mode; /* Whether the target image requires explicit zero-initialization */ bool zero_target; @@ -73,6 +73,8 @@ typedef struct MirrorBlockJob { size_t buf_size; int64_t bdev_length; unsigned long *cow_bitmap; +BdrvDirtyBitmap *sync_bitmap; +BitmapSyncMode bitmap_mode; BdrvDirtyBitmap *dirty_bitmap; BdrvDirtyBitmapIter *dbi; uint8_t *buf; @@ -718,7 +720,8 @@ static int mirror_exit_common(Job *job) _abort); if (!abort && s->backing_mode == 
MIRROR_SOURCE_BACKING_CHAIN) { -BlockDriverState *backing = s->is_none_mode ? src : s->base; +BlockDriverState *backing; +backing = s->sync_mode == MIRROR_SYNC_MODE_NONE ? src : s->base; BlockDriverState *unfiltered_target = bdrv_skip_filters(target_bs); if (bdrv_cow_bs(unfiltered_target) != backing) { @@ -815,6 +818,16 @@ static void mirror_abort(Job *job) assert(ret == 0); } +/* Always called after commit/abort. */ +static void mirror_clean(Job *job) +{ +MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job); + +if (s->sync_bitmap) { +bdrv_dirty_bitmap_set_busy(s->sync_bitmap, false); +} +} + static void coroutine_fn mirror_throttle(MirrorBlockJob *s) { int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); @@ -1011,7 +1024,8 @@ static int coroutine_fn mirror_run(Job *job, Error **errp) mirror_free_init(s); s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); -if (!s->is_none_mode) { +if ((s->sync_mode == MIRROR_SYNC_MODE_TOP) || +(s->sync_mode == MIRROR_SYNC_MODE_FULL)) { ret = mirror_dirty_init(s); if (ret < 0 || job_is_cancelled(>common.job)) { goto immediate_exit; @@ -1302,6 +1316,7 @@ static const BlockJobDriver mirror_job_driver = { .run= mirror_run, .prepare= mirror_prepare, .abort = mirror_abort, +.clean = mirror_clean, .pause = mirror_pause, .complete = mirror_complete, .cancel = mirror_cancel, @@ -1320,6 +1335,7 @@ static const BlockJobDriver commit_active_job_driver = { .run= mirror_run, .prepare= mirror_prepare, .abort = mirror_abort, +.clean = mirror_clean, .pause = mirror_pause, .complete = mirror_complete, .cancel = commit_active_cancel, @@ -1712,7 +1728,10 @@ static BlockJob *mirror_start_job( BlockCompletionFunc *cb, void *opaque, const BlockJobDriver *driver, - bool is_none_mode, BlockDriverState *base, + MirrorSyncMode sync_mode, + BdrvDirtyBitmap *bitmap, + BitmapSyncMode bitmap_mode, + BlockDriverState *base, bool auto_complete, const char *filter_node_name,
[RFC 0/4] mirror: implement incremental and bitmap modes
Previous discussion from when this was sent upstream [0] (it's been a while). I rebased the patches and re-ordered and squashed like suggested back then [1].

This implements two new mirror modes:

- bitmap mirror mode with always/on-success/never bitmap sync mode
- incremental mirror mode as sugar for bitmap + on-success

Use cases:
* Possibility to resume a failed mirror later.
* Possibility to only mirror deltas to a previously mirrored volume.
* Possibility to (efficiently) mirror a drive that was previously
  mirrored via some external mechanism (e.g. ZFS replication).

We have been using the last one in production without any issues for about 4 years now. In particular, like mentioned in [2]:

> - create bitmap(s)
> - (incrementally) replicate storage volume(s) out of band (using ZFS)
> - incrementally drive mirror as part of a live migration of VM
> - drop bitmap(s)

Now, the IO test added in patch 4/4 actually contains yet another use case, namely doing incremental mirrors to stand-alone qcow2 "diff" images, which only contain the delta and can be rebased later. I had to adapt the IO test, because its output expected the mirror bitmap to still be dirty, but nowadays the mirror is apparently already done when the bitmaps are queried. So I thought, I'll just use 'write-blocking' mode to avoid any potential timing issues.

But this exposed an issue with the diff image approach. If a write is not aligned to the granularity of the mirror target, then rebasing the diff image onto a backing image will not yield the desired result, because the full cluster is considered to be allocated and will "hide" some part of the base/backing image. The failure can be seen by either using 'write-blocking' mode in the IO test or setting the (bitmap) granularity to 32 KiB rather than the current 64 KiB.

The question is how to deal with these edge cases? Some possibilities that would make sense to me:

For 'background' mode:
* prohibit if the target's cluster size is larger than the bitmap granularity
* document the limitation

For 'write-blocking' mode:
* disallow in combination with bitmap mode (would not be happy about it, because I'd like to use this without diff images)
* for writes that are not aligned to the target's cluster size, read the relevant/missing parts from the source image to be able to write whole target clusters (seems rather complex)
* document the limitation

[0]: https://lore.kernel.org/qemu-devel/20200218100740.2228521-1-f.gruenbich...@proxmox.com/
[1]: https://lore.kernel.org/qemu-devel/d35a76de-78d5-af56-0b34-f7bd2bbd3...@redhat.com/
[2]: https://lore.kernel.org/qemu-devel/1599127031.9uxdp5h9o2.astr...@nora.none/

Fabian Grünbichler (2):
  mirror: move some checks to qmp
  iotests: add test for bitmap mirror

John Snow (2):
  drive-mirror: add support for sync=bitmap mode=never
  drive-mirror: add support for conditional and always bitmap sync modes

 block/mirror.c                                 |   94 +-
 blockdev.c                                     |   70 +-
 include/block/block_int-global-state.h         |    4 +-
 qapi/block-core.json                           |   25 +-
 tests/qemu-iotests/tests/bitmap-sync-mirror    |  550
 .../qemu-iotests/tests/bitmap-sync-mirror.out  | 2810 +
 tests/unit/test-block-iothread.c               |    4 +-
 7 files changed, 3527 insertions(+), 30 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror
 create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out

-- 
2.39.2
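The masking problem with unaligned writes to a fresh diff image can be illustrated with a toy model. This is pure Python with a hypothetical 16-byte cluster size, modeling only the allocation semantics relevant here (a qcow2 cluster, once allocated in the overlay, completely hides the backing file's cluster on read):

```python
CLUSTER = 16

def rebase_read(diff, allocated, backing, cluster):
    """Read one cluster of the rebased image: an allocated diff
    cluster completely hides the backing file's data."""
    if allocated[cluster]:
        return diff[cluster]
    return backing[cluster]

# Backing image: cluster 0 fully populated by a previous mirror.
backing = {0: b"A" * CLUSTER}
# Fresh diff image: nothing allocated yet, reads as zeroes.
diff = {0: bytes(CLUSTER)}
allocated = {0: False}

# A 9-byte unaligned guest write copied in write-blocking mode
# allocates the whole target cluster but fills only 9 bytes.
data = b"B" * 9
diff[0] = data + diff[0][len(data):]
allocated[0] = True

# After rebasing, the tail of the cluster reads as zeroes instead of
# the previously mirrored b"A" data.
result = rebase_read(diff, allocated, backing, 0)
```

With an unallocated diff cluster the read would fall through to the backing data, which is why the problem only appears once a partial write allocates the cluster.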
[RFC 2/4] drive-mirror: add support for conditional and always bitmap sync modes
From: John Snow Teach mirror two new tricks for using bitmaps: Always: no matter what, we synchronize the copy_bitmap back to the sync_bitmap. In effect, this allows us resume a failed mirror at a later date. Conditional: On success only, we sync the bitmap. This is akin to incremental backup modes; we can use this bitmap to later refresh a successfully created mirror. Originally-by: John Snow [FG: add check for bitmap-mode without bitmap switch to bdrv_dirty_bitmap_merge_internal] Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0] Signed-off-by: Fiona Ebner --- The original patch this was based on came from a WIP git branch and thus has no Signed-off-by trailer from John, see [0]. I added an Originally-by trailer for now. Let me know if I should drop that and wait for John's Signed-off-by instead. [0] https://lore.kernel.org/qemu-devel/1599140071.n44h532eeu.astr...@nora.none/ block/mirror.c | 24 ++-- blockdev.c | 3 +++ 2 files changed, 21 insertions(+), 6 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 315dff11e2..84155b1f78 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -689,8 +689,6 @@ static int mirror_exit_common(Job *job) bdrv_unfreeze_backing_chain(mirror_top_bs, target_bs); } -bdrv_release_dirty_bitmap(s->dirty_bitmap); - /* Make sure that the source BDS doesn't go away during bdrv_replace_node, * before we can call bdrv_drained_end */ bdrv_ref(src); @@ -796,6 +794,18 @@ static int mirror_exit_common(Job *job) bdrv_drained_end(target_bs); bdrv_unref(target_bs); +if (s->sync_bitmap) { +if (s->bitmap_mode == BITMAP_SYNC_MODE_ALWAYS || +(s->bitmap_mode == BITMAP_SYNC_MODE_ON_SUCCESS && + job->ret == 0 && ret == 0)) { +/* Success; synchronize copy back to sync. 
*/ +bdrv_clear_dirty_bitmap(s->sync_bitmap, NULL); +bdrv_dirty_bitmap_merge_internal(s->sync_bitmap, s->dirty_bitmap, + NULL, true); +} +} +bdrv_release_dirty_bitmap(s->dirty_bitmap); + bs_opaque->job = NULL; bdrv_drained_end(src); @@ -1755,10 +1765,6 @@ static BlockJob *mirror_start_job( " sync mode", MirrorSyncMode_str(sync_mode)); return NULL; -} else if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { -error_setg(errp, - "Bitmap Sync Mode '%s' is not supported by Mirror", - BitmapSyncMode_str(bitmap_mode)); } } else if (bitmap) { error_setg(errp, @@ -1775,6 +1781,12 @@ static BlockJob *mirror_start_job( return NULL; } granularity = bdrv_dirty_bitmap_granularity(bitmap); + +if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { +if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_DEFAULT, errp)) { +return NULL; +} +} } else if (granularity == 0) { granularity = bdrv_get_default_bitmap_granularity(target); } diff --git a/blockdev.c b/blockdev.c index c65d9ded70..aeb9fde9f3 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2873,6 +2873,9 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, if (bdrv_dirty_bitmap_check(bitmap, BDRV_BITMAP_ALLOW_RO, errp)) { return; } +} else if (has_bitmap_mode) { +error_setg(errp, "Cannot specify bitmap sync mode without a bitmap"); +return; } if (!replaces) { -- 2.39.2
[RFC 3/4] mirror: move some checks to qmp
From: Fabian Grünbichler and assert the passing conditions in block/mirror.c. while incremental mode was never available for drive-mirror, it makes the interface more uniform w.r.t. backup block jobs. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0] Signed-off-by: Fiona Ebner --- block/mirror.c | 28 +--- blockdev.c | 29 + 2 files changed, 34 insertions(+), 23 deletions(-) diff --git a/block/mirror.c b/block/mirror.c index 84155b1f78..15d1c060eb 100644 --- a/block/mirror.c +++ b/block/mirror.c @@ -1755,31 +1755,13 @@ static BlockJob *mirror_start_job( GLOBAL_STATE_CODE(); -if (sync_mode == MIRROR_SYNC_MODE_INCREMENTAL) { -error_setg(errp, "Sync mode '%s' not supported", - MirrorSyncMode_str(sync_mode)); -return NULL; -} else if (sync_mode == MIRROR_SYNC_MODE_BITMAP) { -if (!bitmap) { -error_setg(errp, "Must provide a valid bitmap name for '%s'" - " sync mode", - MirrorSyncMode_str(sync_mode)); -return NULL; -} -} else if (bitmap) { -error_setg(errp, - "sync mode '%s' is not compatible with bitmaps", - MirrorSyncMode_str(sync_mode)); -return NULL; -} +/* QMP interface protects us from these cases */ +assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL); +assert((bitmap && sync_mode == MIRROR_SYNC_MODE_BITMAP) || + (!bitmap && sync_mode != MIRROR_SYNC_MODE_BITMAP)); +assert(!(bitmap && granularity)); if (bitmap) { -if (granularity) { -error_setg(errp, "granularity (%d)" - "cannot be specified when a bitmap is provided", - granularity); -return NULL; -} granularity = bdrv_dirty_bitmap_granularity(bitmap); if (bitmap_mode != BITMAP_SYNC_MODE_NEVER) { diff --git a/blockdev.c b/blockdev.c index aeb9fde9f3..519f408359 100644 --- a/blockdev.c +++ b/blockdev.c @@ -2852,7 +2852,36 @@ static void blockdev_mirror_common(const char *job_id, BlockDriverState *bs, sync = MIRROR_SYNC_MODE_FULL; } +if ((sync == MIRROR_SYNC_MODE_BITMAP) || +(sync == MIRROR_SYNC_MODE_INCREMENTAL)) { +/* done before desugaring 'incremental' to print the right message 
*/ +if (!bitmap_name) { +error_setg(errp, "Must provide a valid bitmap name for " + "'%s' sync mode", MirrorSyncMode_str(sync)); +return; +} +} + +if (sync == MIRROR_SYNC_MODE_INCREMENTAL) { +if (has_bitmap_mode && +bitmap_mode != BITMAP_SYNC_MODE_ON_SUCCESS) { +error_setg(errp, "Bitmap sync mode must be '%s' " + "when using sync mode '%s'", + BitmapSyncMode_str(BITMAP_SYNC_MODE_ON_SUCCESS), + MirrorSyncMode_str(sync)); +return; +} +has_bitmap_mode = true; +sync = MIRROR_SYNC_MODE_BITMAP; +bitmap_mode = BITMAP_SYNC_MODE_ON_SUCCESS; +} + if (bitmap_name) { +if (sync != MIRROR_SYNC_MODE_BITMAP) { +error_setg(errp, "Sync mode '%s' not supported with bitmap.", + MirrorSyncMode_str(sync)); +return; +} if (granularity) { error_setg(errp, "Granularity and bitmap cannot both be set"); return; -- 2.39.2
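The checks moved into blockdev_mirror_common() above, including the desugaring of 'incremental' into 'bitmap' + 'on-success', can be sketched as follows. Only the checks shown in this hunk are modeled; modes are plain strings and the error texts only approximate the C code:

```python
# Hypothetical sketch of the QMP-level validation/desugaring; not QEMU code.
def desugar_mirror_args(sync, bitmap_name, bitmap_mode):
    """Validate sync/bitmap arguments and desugar 'incremental' mode.

    bitmap_mode is None when the caller did not specify it
    (i.e. has_bitmap_mode == false).
    """
    # done before desugaring 'incremental' to print the right message
    if sync in ("bitmap", "incremental") and not bitmap_name:
        raise ValueError(
            f"Must provide a valid bitmap name for '{sync}' sync mode")
    if sync == "incremental":
        if bitmap_mode is not None and bitmap_mode != "on-success":
            raise ValueError("Bitmap sync mode must be 'on-success' "
                             "when using sync mode 'incremental'")
        sync, bitmap_mode = "bitmap", "on-success"
    if bitmap_name and sync != "bitmap":
        raise ValueError(f"Sync mode '{sync}' not supported with bitmap.")
    return sync, bitmap_mode
```

With the checks done here, mirror_start_job() can assert the conditions instead of reporting errors.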
[RFC 4/4] iotests: add test for bitmap mirror
From: Fabian Grünbichler heavily based on/practically forked off iotest 257 for bitmap backups, but: - no writes to filter node 'mirror-top' between completion and finalization, as those seem to deadlock? - no inclusion of not-yet-available full/top sync modes in combination with bitmaps - extra set of reference/test mirrors to verify that writes in parallel with active mirror work intentionally keeping copyright and ownership of original test case to honor provenance. Signed-off-by: Fabian Grünbichler Signed-off-by: Thomas Lamprecht [FE: rebase for 9.0, i.e. adapt to renames like vm.command -> vm.cmd, specifying explicit image format for rebase, adapt to new behavior of qemu_img(), dropping of 'status' field in output, etc. rename test from '384' to 'bitmap-sync-mirror'] Signed-off-by: Fiona Ebner --- tests/qemu-iotests/tests/bitmap-sync-mirror | 550 .../qemu-iotests/tests/bitmap-sync-mirror.out | 2810 + 2 files changed, 3360 insertions(+) create mode 100755 tests/qemu-iotests/tests/bitmap-sync-mirror create mode 100644 tests/qemu-iotests/tests/bitmap-sync-mirror.out diff --git a/tests/qemu-iotests/tests/bitmap-sync-mirror b/tests/qemu-iotests/tests/bitmap-sync-mirror new file mode 100755 index 00..6cd9b74dac --- /dev/null +++ b/tests/qemu-iotests/tests/bitmap-sync-mirror @@ -0,0 +1,550 @@ +#!/usr/bin/env python3 +# group: rw +# +# Test bitmap-sync mirrors (incremental, differential, and partials) +# +# Copyright (c) 2019 John Snow for Red Hat, Inc. +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# +# owner=js...@redhat.com + +import math +import os + +import iotests +from iotests import log, qemu_img + +SIZE = 64 * 1024 * 1024 +GRANULARITY = 64 * 1024 + + +class Pattern: +def __init__(self, byte, offset, size=GRANULARITY): +self.byte = byte +self.offset = offset +self.size = size + +def bits(self, granularity): +lower = self.offset // granularity +upper = (self.offset + self.size - 1) // granularity +return set(range(lower, upper + 1)) + + +class PatternGroup: +"""Grouping of Pattern objects. Initialize with an iterable of Patterns.""" +def __init__(self, patterns): +self.patterns = patterns + +def bits(self, granularity): +"""Calculate the unique bits dirtied by this pattern grouping""" +res = set() +for pattern in self.patterns: +res |= pattern.bits(granularity) +return res + + +GROUPS = [ +PatternGroup([ +# Batch 0: 4 clusters +Pattern('0x49', 0x000), +Pattern('0x6c', 0x010), # 1M +Pattern('0x6f', 0x200), # 32M +Pattern('0x76', 0x3ff)]), # 64M - 64K +PatternGroup([ +# Batch 1: 6 clusters (3 new) +Pattern('0x65', 0x000), # Full overwrite +Pattern('0x77', 0x00f8000), # Partial-left (1M-32K) +Pattern('0x72', 0x2008000), # Partial-right (32M+32K) +Pattern('0x69', 0x3fe)]), # Adjacent-left (64M - 128K) +PatternGroup([ +# Batch 2: 7 clusters (3 new) +Pattern('0x74', 0x001), # Adjacent-right +Pattern('0x69', 0x00e8000), # Partial-left (1M-96K) +Pattern('0x6e', 0x2018000), # Partial-right (32M+96K) +Pattern('0x67', 0x3fe, +2*GRANULARITY)]), # Overwrite [(64M-128K)-64M) +PatternGroup([ +# Batch 3: 8 clusters (5 new) +# Carefully chosen such that nothing re-dirties the one cluster +# that copies out successfully before failure in Group #1. 
+Pattern('0xaa', 0x001, +3*GRANULARITY), # Overwrite and 2x Adjacent-right +Pattern('0xbb', 0x00d8000), # Partial-left (1M-160K) +Pattern('0xcc', 0x2028000), # Partial-right (32M+160K) +Pattern('0xdd', 0x3fc)]), # New; leaving a gap to the right +] + + +class EmulatedBitmap: +def __init__(self, granularity=GRANULARITY): +self._bits = set() +self.granularity = granularity + +def dirty_bits(self, bits): +self._bits |= set(bits) + +def dirty_group(self, n): +self.dirty_bits(GROUPS[n].bits(self.granularity)) + +def clear(self): +self._bits = set() + +def clear_bits(self, b
[PATCH] iotests: adapt to output change for recently introduced 'detached header' field
Failure was noticed when running the tests for the qcow2 image format. Fixes: 0bd779e27e ("crypto: Introduce 'detached-header' field in QCryptoBlockInfoLUKS") Signed-off-by: Fiona Ebner --- tests/qemu-iotests/198.out | 2 ++ tests/qemu-iotests/206.out | 1 + 2 files changed, 3 insertions(+) diff --git a/tests/qemu-iotests/198.out b/tests/qemu-iotests/198.out index 805494916f..62fb73fa3e 100644 --- a/tests/qemu-iotests/198.out +++ b/tests/qemu-iotests/198.out @@ -39,6 +39,7 @@ Format specific information: compression type: COMPRESSION_TYPE encrypt: ivgen alg: plain64 +detached header: false hash alg: sha256 cipher alg: aes-256 uuid: ---- @@ -84,6 +85,7 @@ Format specific information: compression type: COMPRESSION_TYPE encrypt: ivgen alg: plain64 +detached header: false hash alg: sha256 cipher alg: aes-256 uuid: ---- diff --git a/tests/qemu-iotests/206.out b/tests/qemu-iotests/206.out index 7e95694777..979f00f9bf 100644 --- a/tests/qemu-iotests/206.out +++ b/tests/qemu-iotests/206.out @@ -114,6 +114,7 @@ Format specific information: refcount bits: 16 encrypt: ivgen alg: plain64 +detached header: false hash alg: sha1 cipher alg: aes-128 uuid: ---- -- 2.39.2
Re: double free or corruption (out) in iscsi virtual machine
Am 17.01.24 um 08:23 schrieb M_O_Bz: > Basic Info: > 1. Issue: I got a "double free or corruption (out)", see the > attachment debug.log for details, the debug.log prints the backtrace of > one virtual machine > 2. Reproduce: currently I can't describe how to reproduce this bug, > because it's in my production environment which includes some special stuff > 3. qemu version: I'm using qemu-6.0.1 > 4. qemu cmdline in short: (check out details in the virtual machine log > message) Hi, sounds like it might be the issue fixed by: https://github.com/qemu/qemu/commit/5080152e2ef6cde7aa692e29880c62bd54acb750 Best Regards, Fiona
Re: [PATCH v2] qemu_init: increase NOFILE soft limit on POSIX
Am 18.12.23 um 11:13 schrieb Fiona Ebner: > In many configurations, e.g. multiple vNICs with multiple queues or > with many Ceph OSDs, the default soft limit of 1024 is not enough. > QEMU is supposed to work fine with file descriptors >= 1024 and does > not use select() on POSIX. Bump the soft limit to the allowed hard > limit to avoid issues with the aforementioned configurations. > > Of course the limit could be raised from the outside, but the man page > of systemd.exec states about 'LimitNOFILE=': > >> Don't use. >> [...] >> Typically applications should increase their soft limit to the hard >> limit on their own, if they are OK with working with file >> descriptors above 1023, > > If the soft limit is already the same as the hard limit, avoid the > superfluous setrlimit call. This can avoid a warning with a strict > seccomp filter blocking setrlimit if NOFILE was already raised before > executing QEMU. > > Buglink: https://bugzilla.proxmox.com/show_bug.cgi?id=4507 > Signed-off-by: Fiona Ebner > --- > Ping
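The logic of the patch can be sketched with Python's resource module standing in for the getrlimit()/setrlimit() calls QEMU makes in C — a rough analogue for illustration, not the actual implementation:

```python
# Sketch: raise the NOFILE soft limit to the hard limit, skipping the
# superfluous setrlimit() call when the soft limit is already there
# (which avoids tripping strict seccomp filters that block setrlimit).
import resource

def raise_nofile_soft_limit():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # raising the soft limit up to the hard limit needs no privileges
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

After the call, file descriptors up to the hard limit can be opened without hitting EMFILE at the default soft limit of 1024.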
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 25.01.24 um 18:22 schrieb Vladimir Sementsov-Ogievskiy: > > Hmm. Taking maximum is not optimal for usual case without > discard-source: user may want to work in smaller granularity than > source, to save disk space. > > In case with discarding we have two possibilities: > > - either take larger granularity for the whole process like you propose > (but this will need and option for CBW?) > - or, fix discarding bitmap in CBW to work like normal discard: it > should be aligned down. This will lead actually to discard-source option > doing nothing.. > > == > But why do you want fleecing image with larger granularity? Is that a > real case or just experimenting? Still we should fix assertion anyway. > Yes, it's a real use case. We do support different storage types and want to allow users to place the fleecing image on a different storage than the original image for flexibility. I ran into the issue when backing up to a target with 1 MiB cluster_size while using a fleecing image on RBD (which has 4 MiB cluster_size by default). In theory, I guess I could look into querying the cluster_size of the backup target and trying to allocate the fleecing image with a small enough cluster_size. But not sure if that would work on all storage combinations, and would require teaching our storage plugin API (which also supports third-party plugins) to perform allocation with a specific cluster size. So not an ideal solution for us. > I think: > > 1. fix discarding bitmap to make aligning-down (will do that for v3) > Thanks! > 2. if we need another logic for block_copy_calculate_cluster_size() it > should be an option. May be explicit "copy-cluster-size" or > "granularity" option for CBW driver and for backup job. And we'll just > check that given cluster-size is power of two >= target_size. > I'll try to implement point 2. That should resolve the issue for our use case. Best Regards, Fiona
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 24.01.24 um 16:03 schrieb Fiona Ebner: > Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy: >> Add a parameter that enables discard-after-copy. That is mostly useful >> in "push backup with fleecing" scheme, when source is snapshot-access >> format driver node, based on copy-before-write filter snapshot-access >> API: >> >> [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >>|| >>| root | file >>vv >> [copy-before-write] >>| | >>| file| target >>v v >> [active disk] [temp.img] >> >> In this case discard-after-copy does two things: >> >> - discard data in temp.img to save disk space >> - avoid further copy-before-write operation in discarded area >> >> Note that we have to declare WRITE permission on source in >> copy-before-write filter, for discard to work. >> >> Signed-off-by: Vladimir Sementsov-Ogievskiy > > Ran into another issue when the cluster_size of the fleecing image is > larger than for the backup target, e.g. > >> #!/bin/bash >> rm /tmp/fleecing.qcow2 >> ./qemu-img create /tmp/disk.qcow2 -f qcow2 1G >> ./qemu-img create /tmp/fleecing.qcow2 -o cluster_size=2M -f qcow2 1G >> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G >> ./qemu-system-x86_64 --qmp stdio \ >> --blockdev >> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \ >> --blockdev >> qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2,discard=unmap >> \ >> --blockdev >> qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \ >> <> {"execute": "qmp_capabilities"} >> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", >> "file": "node0", "target": "node1", "node-name": "node3" } } >> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", >> "file": "node3", "discard": "unmap", "node-name": "snap0" } } >> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": >> "node2", "sync": "full", "job-id": "backup0", "discard-source": true } } >> EOF > > will fail with 
> >> qemu-system-x86_64: ../util/hbitmap.c:570: hbitmap_reset: Assertion >> `QEMU_IS_ALIGNED(count, gran) || (start + count == hb->orig_size)' failed. > > Backtrace shows the assert happens while discarding, when resetting the > BDRVCopyBeforeWriteState access_bitmap > > #6 0x56142a2a in hbitmap_reset (hb=0x57e01b80, start=0, > count=1048576) at ../util/hbitmap.c:570 >> #7 0x55f80764 in bdrv_reset_dirty_bitmap_locked >> (bitmap=0x5850a660, offset=0, bytes=1048576) at >> ../block/dirty-bitmap.c:563 >> #8 0x55f807ab in bdrv_reset_dirty_bitmap (bitmap=0x5850a660, >> offset=0, bytes=1048576) at ../block/dirty-bitmap.c:570 >> #9 0x55f7bb16 in cbw_co_pdiscard_snapshot (bs=0x581a7f60, >> offset=0, bytes=1048576) at ../block/copy-before-write.c:330 >> #10 0x55f8d00a in bdrv_co_pdiscard_snapshot (bs=0x581a7f60, >> offset=0, bytes=1048576) at ../block/io.c:3734 >> #11 0x55fd2380 in snapshot_access_co_pdiscard (bs=0x582b4f60, >> offset=0, bytes=1048576) at ../block/snapshot-access.c:55 >> #12 0x55f8b65d in bdrv_co_pdiscard (child=0x584fe790, offset=0, >> bytes=1048576) at ../block/io.c:3144 >> #13 0x55f78650 in block_copy_task_entry (task=0x57f588f0) at >> ../block/block-copy.c:597 > > My guess for the cause is that in block_copy_calculate_cluster_size() we > only look at the target. But now that we need to discard the source, > we'll also need to consider that for the calculation? > Just querying the source and picking the maximum won't work either, because snapshot-access does not currently implement .bdrv_co_get_info and because copy-before-write (doesn't implement .bdrv_co_get_info and is a filter) will just return the info of its file child. But the discard will go to the target child. If I do 1. .bdrv_co_get_info in snapshot-access: return info from file child 2. .bdrv_co_get_info in copy-before-write: return maximum cluster_size from file child and target child 3. 
block_copy_calculate_cluster_size: return the maximum from source and target then the issue does go away, but I don't know whether that violates any assumptions, and there is probably a better way to avoid the issue? Best Regards, Fiona
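A toy model of the failing assertion may help illustrate the mismatch: per the backtrace, the backup discards in 1 MiB requests (the target's cluster size), but — going by the analysis above — the access bitmap in the copy-before-write filter uses the fleecing image's larger 2 MiB granularity, so the request is neither aligned nor reaching the end of the bitmap. Everything here is a simplified stand-in for hbitmap_reset(), not QEMU code:

```python
# Simplified model of the assertion in hbitmap_reset():
# assert(QEMU_IS_ALIGNED(count, gran) || (start + count == hb->orig_size));
MiB = 1024 * 1024

def is_aligned(value, alignment):
    return value % alignment == 0

def hbitmap_reset_ok(start, count, granularity, orig_size):
    """True iff resetting [start, start+count) passes the alignment checks."""
    return is_aligned(start, granularity) and (
        is_aligned(count, granularity) or start + count == orig_size)
```

A 1 MiB discard against a 2 MiB-granularity bitmap fails the check, while the same request aligned down (or up) to the bitmap granularity would pass — which is why aligning down in the CBW discard path avoids the assertion.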
Re: [PATCH 1/2] virtio-scsi: Attach event vq notifier with no_poll
Am 24.01.24 um 18:38 schrieb Hanna Czenczek: > As of commit 38738f7dbbda90fbc161757b7f4be35b52205552 ("virtio-scsi: > don't waste CPU polling the event virtqueue"), we only attach an io_read > notifier for the virtio-scsi event virtqueue instead, and no polling > notifiers. During operation, the event virtqueue is typically > non-empty, but none of the buffers are intended to be used immediately. > Instead, they only get used when certain events occur. Therefore, it > makes no sense to continuously poll it when non-empty, because it is > supposed to be and stay non-empty. > > We do this by using virtio_queue_aio_attach_host_notifier_no_poll() > instead of virtio_queue_aio_attach_host_notifier() for the event > virtqueue. > > Commit 766aa2de0f29b657148e04599320d771c36fd126 ("virtio-scsi: implement > BlockDevOps->drained_begin()") however has virtio_scsi_drained_end() use > virtio_queue_aio_attach_host_notifier() for all virtqueues, including > the event virtqueue. This can lead to it being polled again, undoing > the benefit of commit 38738f7dbbda90fbc161757b7f4be35b52205552. > > Fix it by using virtio_queue_aio_attach_host_notifier_no_poll() for the > event virtqueue. > > Reported-by: Fiona Ebner > Fixes: 766aa2de0f29b657148e04599320d771c36fd126 >("virtio-scsi: implement BlockDevOps->drained_begin()") > Signed-off-by: Hanna Czenczek Tested-by: Fiona Ebner Reviewed-by: Fiona Ebner
Re: [PATCH 2/2] virtio: Keep notifications disabled during drain
Am 24.01.24 um 18:38 schrieb Hanna Czenczek: > During drain, we do not care about virtqueue notifications, which is why > we remove the handlers on it. When removing those handlers, whether vq > notifications are enabled or not depends on whether we were in polling > mode or not; if not, they are enabled (by default); if so, they have > been disabled by the io_poll_start callback. > > Because we do not care about those notifications after removing the > handlers, this is fine. However, we have to explicitly ensure they are > enabled when re-attaching the handlers, so we will resume receiving > notifications. We do this in virtio_queue_aio_attach_host_notifier*(). > If such a function is called while we are in a polling section, > attaching the notifiers will then invoke the io_poll_start callback, > re-disabling notifications. > > Because we will always miss virtqueue updates in the drained section, we > also need to poll the virtqueue once after attaching the notifiers. > > Buglink: https://issues.redhat.com/browse/RHEL-3934 > Signed-off-by: Hanna Czenczek Tested-by: Fiona Ebner Reviewed-by: Fiona Ebner
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy: > Add a parameter that enables discard-after-copy. That is mostly useful > in "push backup with fleecing" scheme, when source is snapshot-access > format driver node, based on copy-before-write filter snapshot-access > API: > > [guest] [snapshot-access] ~~ blockdev-backup ~~> [backup target] >|| >| root | file >vv > [copy-before-write] >| | >| file| target >v v > [active disk] [temp.img] > > In this case discard-after-copy does two things: > > - discard data in temp.img to save disk space > - avoid further copy-before-write operation in discarded area > > Note that we have to declare WRITE permission on source in > copy-before-write filter, for discard to work. > > Signed-off-by: Vladimir Sementsov-Ogievskiy Ran into another issue when the cluster_size of the fleecing image is larger than for the backup target, e.g. > #!/bin/bash > rm /tmp/fleecing.qcow2 > ./qemu-img create /tmp/disk.qcow2 -f qcow2 1G > ./qemu-img create /tmp/fleecing.qcow2 -o cluster_size=2M -f qcow2 1G > ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G > ./qemu-system-x86_64 --qmp stdio \ > --blockdev > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \ > --blockdev > qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2,discard=unmap > \ > --blockdev > qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \ > < {"execute": "qmp_capabilities"} > {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", > "file": "node0", "target": "node1", "node-name": "node3" } } > {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", > "file": "node3", "discard": "unmap", "node-name": "snap0" } } > {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": > "node2", "sync": "full", "job-id": "backup0", "discard-source": true } } > EOF will fail with > qemu-system-x86_64: ../util/hbitmap.c:570: hbitmap_reset: Assertion > `QEMU_IS_ALIGNED(count, gran) 
|| (start + count == hb->orig_size)' failed. Backtrace shows the assert happens while discarding, when resetting the BDRVCopyBeforeWriteState access_bitmap > #6 0x56142a2a in hbitmap_reset (hb=0x57e01b80, start=0, count=1048576) at ../util/hbitmap.c:570 > #7 0x55f80764 in bdrv_reset_dirty_bitmap_locked > (bitmap=0x5850a660, offset=0, bytes=1048576) at > ../block/dirty-bitmap.c:563 > #8 0x55f807ab in bdrv_reset_dirty_bitmap (bitmap=0x5850a660, > offset=0, bytes=1048576) at ../block/dirty-bitmap.c:570 > #9 0x55f7bb16 in cbw_co_pdiscard_snapshot (bs=0x581a7f60, > offset=0, bytes=1048576) at ../block/copy-before-write.c:330 > #10 0x55f8d00a in bdrv_co_pdiscard_snapshot (bs=0x581a7f60, > offset=0, bytes=1048576) at ../block/io.c:3734 > #11 0x55fd2380 in snapshot_access_co_pdiscard (bs=0x582b4f60, > offset=0, bytes=1048576) at ../block/snapshot-access.c:55 > #12 0x55f8b65d in bdrv_co_pdiscard (child=0x584fe790, offset=0, > bytes=1048576) at ../block/io.c:3144 > #13 0x55f78650 in block_copy_task_entry (task=0x57f588f0) at > ../block/block-copy.c:597 My guess for the cause is that in block_copy_calculate_cluster_size() we only look at the target. But now that we need to discard the source, we'll also need to consider that for the calculation? Best Regards, Fiona
[PATCH v3 1/2] ui/clipboard: mark type as not available when there is no data
With VNC, a client can send a non-extended VNC_MSG_CLIENT_CUT_TEXT message with len=0. In qemu_clipboard_set_data(), the clipboard info will be updated, setting data to NULL (because g_memdup(data, size) returns NULL when size is 0). If the client does not set the VNC_ENCODING_CLIPBOARD_EXT feature when setting up the encodings, then the 'request' callback for the clipboard peer is not initialized. Later, because data is NULL, qemu_clipboard_request() can be reached via vdagent_chr_write() and vdagent_clipboard_recv_request() and there, the clipboard owner's 'request' callback will be invoked, but that callback is a NULL pointer. In particular, this can happen when using the KRDC (22.12.3) VNC client. Another scenario leading to the same issue is with two clients (say noVNC and KRDC): The noVNC client sets the extension VNC_FEATURE_CLIPBOARD_EXT and initializes its cbpeer. The KRDC client does not, but triggers a vnc_client_cut_text() (note it's not the _ext variant). There, a new clipboard info with it as the 'owner' is created and qemu_clipboard_set_data() is called, which in turn calls qemu_clipboard_update() with that info. In qemu_clipboard_update(), the notifier for the noVNC client will be called, i.e. vnc_clipboard_notify(), which will also set vs->cbinfo for the noVNC client. The 'owner' in that clipboard info is the clipboard peer for the KRDC client, which did not initialize the 'request' function. That sounds correct to me, it is the owner of that clipboard info. Then when noVNC sends a VNC_MSG_CLIENT_CUT_TEXT message (it did set the VNC_FEATURE_CLIPBOARD_EXT feature correctly, so a check for it passes), that clipboard info is passed to qemu_clipboard_request() and the original segfault still happens. Fix the issue by handling updates with size 0 differently. In particular, mark in the clipboard info that the type is not available. While at it, switch to g_memdup2(), because g_memdup() is deprecated. 
Cc: qemu-sta...@nongnu.org Fixes: CVE-2023-6683 Reported-by: Markus Frank Suggested-by: Marc-André Lureau Signed-off-by: Fiona Ebner --- Changes in v3: * Yet another new approach, setting 'available' to false when no data is passed in during an update. * Update commit message to focus on the fact that non-extended VNC_MSG_CLIENT_CUT_TEXT messages with len=0 are problematic. ui/clipboard.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/ui/clipboard.c b/ui/clipboard.c index 3d14bffaf8..b3f6fa3c9e 100644 --- a/ui/clipboard.c +++ b/ui/clipboard.c @@ -163,9 +163,15 @@ void qemu_clipboard_set_data(QemuClipboardPeer *peer, } g_free(info->types[type].data); -info->types[type].data = g_memdup(data, size); -info->types[type].size = size; -info->types[type].available = true; +if (size) { +info->types[type].data = g_memdup2(data, size); +info->types[type].size = size; +info->types[type].available = true; +} else { +info->types[type].data = NULL; +info->types[type].size = 0; +info->types[type].available = false; +} if (update) { qemu_clipboard_update(info); -- 2.39.2
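The fixed behavior can be modeled with a small sketch — plain dicts instead of the QemuClipboardInfo struct, and the function name is made up:

```python
# Toy model of the fixed qemu_clipboard_set_data() path: an empty update
# must mark the type as not available, instead of leaving it "available"
# with NULL data (which is what g_memdup()/g_memdup2() return for size 0).
def set_clipboard_type(info, type_name, data):
    if data:
        info[type_name] = {"data": bytes(data), "size": len(data),
                           "available": True}
    else:
        # size == 0: record the type as not available rather than
        # available-with-NULL-data, so nobody later tries to request it
        info[type_name] = {"data": None, "size": 0, "available": False}
```

This is exactly the invariant the follow-up patch turns into an assertion: a type that is available but has no data would force a 'request' callback that may not exist.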
[PATCH v3 2/2] ui/clipboard: add asserts for update and request
Should an issue like CVE-2023-6683 ever appear again in the future, it will be more obvious which assumption was violated. Suggested-by: Marc-André Lureau Signed-off-by: Fiona Ebner --- Changes in v3: * Turn check for update into an assertion. * Split out into a separate patch. ui/clipboard.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/ui/clipboard.c b/ui/clipboard.c index b3f6fa3c9e..4264884a6c 100644 --- a/ui/clipboard.c +++ b/ui/clipboard.c @@ -65,12 +65,24 @@ bool qemu_clipboard_check_serial(QemuClipboardInfo *info, bool client) void qemu_clipboard_update(QemuClipboardInfo *info) { +uint32_t type; QemuClipboardNotify notify = { .type = QEMU_CLIPBOARD_UPDATE_INFO, .info = info, }; assert(info->selection < QEMU_CLIPBOARD_SELECTION__COUNT); +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { +/* + * If data is missing, the clipboard owner's 'request' callback needs to + * be set. Otherwise, there is no way to get the clipboard data and + * qemu_clipboard_request() cannot be called. + */ +if (info->types[type].available && !info->types[type].data) { +assert(info->owner && info->owner->request); +} +} + notifier_list_notify(&clipboard_notifiers, &notify); if (cbinfo[info->selection] != info) { @@ -132,6 +144,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info, !info->owner) return; +assert(info->owner->request); + info->types[type].requested = true; info->owner->request(info, type); } -- 2.39.2
[PATCH] block/io_uring: improve error message when init fails
The man page for io_uring_queue_init states: > io_uring_queue_init(3) returns 0 on success and -errno on failure. and the man page for io_uring_setup (which is one of the functions where the return value of io_uring_queue_init() can come from) states: > On error, a negative error code is returned. The caller should not > rely on errno variable. Tested using 'sysctl kernel.io_uring_disabled=2'. Output before this change: > failed to init linux io_uring ring Output after this change: > failed to init linux io_uring ring: Operation not permitted Signed-off-by: Fiona Ebner --- block/io_uring.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/io_uring.c b/block/io_uring.c index d77ae55745..d11b2051ab 100644 --- a/block/io_uring.c +++ b/block/io_uring.c @@ -432,7 +432,7 @@ LuringState *luring_init(Error **errp) rc = io_uring_queue_init(MAX_ENTRIES, ring, 0); if (rc < 0) { -error_setg_errno(errp, errno, "failed to init linux io_uring ring"); +error_setg_errno(errp, -rc, "failed to init linux io_uring ring"); g_free(s); return NULL; } -- 2.39.2
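A quick sketch of the distinction the patch relies on: io_uring_queue_init(3) returns -errno on failure, while the global errno variable is not reliable at that point, so the negated return value is what must be mapped to a message. The helper name below is made up for illustration; os.strerror() stands in for the strerror() lookup error_setg_errno() performs:

```python
# Map a -errno return code to the error text seen in the improved message.
import os

def format_init_error(rc):
    # rc is the negative return value of io_uring_queue_init()
    return f"failed to init linux io_uring ring: {os.strerror(-rc)}"
```

With 'sysctl kernel.io_uring_disabled=2', the init call fails with -EPERM, so the formatted message names "Operation not permitted" instead of whatever stale value errno happens to hold.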
Re: [RFC 0/3] aio-posix: call ->poll_end() when removing AioHandler
Am 22.01.24 um 18:52 schrieb Hanna Czenczek: > On 22.01.24 18:41, Hanna Czenczek wrote: >> On 05.01.24 15:30, Fiona Ebner wrote: >>> Am 05.01.24 um 14:43 schrieb Fiona Ebner: >>>> Am 03.01.24 um 14:35 schrieb Paolo Bonzini: >>>>> On 1/3/24 12:40, Fiona Ebner wrote: >>>>>> I'm happy to report that I cannot reproduce the CPU-usage-spike issue >>>>>> with the patch, but I did run into an assertion failure when >>>>>> trying to >>>>>> verify that it fixes my original stuck-guest-IO issue. See below >>>>>> for the >>>>>> backtrace [0]. Hanna wrote in >>>>>> https://issues.redhat.com/browse/RHEL-3934 >>>>>> >>>>>>> I think it’s sufficient to simply call virtio_queue_notify_vq(vq) >>>>>>> after the virtio_queue_aio_attach_host_notifier(vq, ctx) call, >>>>>>> because >>>>>>> both virtio-scsi’s and virtio-blk’s .handle_output() implementations >>>>>>> acquire the device’s context, so this should be directly callable >>>>>>> from >>>>>>> any context. >>>>>> I guess this is not true anymore now that the AioContext locking was >>>>>> removed? >>>>> Good point and, in fact, even before it was much safer to use >>>>> virtio_queue_notify() instead. Not only does it use the event >>>>> notifier >>>>> handler, but it also calls it in the right thread/AioContext just by >>>>> doing event_notifier_set(). >>>>> >>>> But with virtio_queue_notify() using the event notifier, the >>>> CPU-usage-spike issue is present: >>>> >>>>>> Back to the CPU-usage-spike issue: I experimented around and it >>>>>> doesn't >>>>>> seem to matter whether I notify the virt queue before or after >>>>>> attaching >>>>>> the notifiers. But there's another functional difference. My patch >>>>>> called virtio_queue_notify() which contains this block: >>>>>> >>>>>>> if (vq->host_notifier_enabled) { >>>>>>> event_notifier_set(>host_notifier); >>>>>>> } else if (vq->handle_output) { >>>>>>> vq->handle_output(vdev, vq); >>>>>> In my testing, the first branch was taken, calling >>>>>> event_notifier_set(). 
>>>>>> Hanna's patch uses virtio_queue_notify_vq() and there, >>>>>> vq->handle_output() will be called. That seems to be the relevant >>>>>> difference regarding the CPU-usage-spike issue. >>>> I should mention that this is with a VirtIO SCSI disk. I also attempted >>>> to reproduce the CPU-usage-spike issue with a VirtIO block disk, but >>>> didn't manage yet. >>>> >>>> What I noticed is that in virtio_queue_host_notifier_aio_poll(), one of >>>> the queues (but only one) will always show as nonempty. And then, >>>> run_poll_handlers_once() will always detect progress which explains the >>>> CPU usage. >>>> >>>> The following shows >>>> 1. vq address >>>> 2. number of times vq was passed to >>>> virtio_queue_host_notifier_aio_poll() >>>> 3. number of times the result of virtio_queue_host_notifier_aio_poll() >>>> was true for the vq >>>> >>>>> 0x555fd93f9c80 17162000 0 >>>>> 0x555fd93f9e48 17162000 6 >>>>> 0x555fd93f9ee0 17162000 0 >>>>> 0x555fd93f9d18 17162000 17162000 >>>>> 0x555fd93f9db0 17162000 0 >>>>> 0x555fd93f9f78 17162000 0 >>>> And for the problematic one, the reason it is seen as nonempty is: >>>> >>>>> 0x555fd93f9d18 shadow_avail_idx 8 last_avail_idx 0 >>> vring_avail_idx(vq) also gives 8 here. This is the vs->event_vq and >>> s->events_dropped is false in my testing, so >>> virtio_scsi_handle_event_vq() doesn't do anything. >>> >>>> Those values stay like this while the call counts above increase. >>>> >>>> So something going wrong with the indices when the event notifier is >>>> set >>>> from QEMU side (in the main thread)? >>>> >>>> The guest is Debian 12 with a 6.1 kernel. >> >> So, trying to figure out a new RFC version: >> >> About the stack tra
Re: [PATCH v2 3/4] qapi: blockdev-backup: add discard-source parameter
Am 17.01.24 um 17:07 schrieb Vladimir Sementsov-Ogievskiy:
> Add a parameter that enables discard-after-copy. That is mostly useful
> in "push backup with fleecing" scheme, when source is snapshot-access
> format driver node, based on copy-before-write filter snapshot-access
> API:
>
> [guest]               [snapshot-access] ~~ blockdev-backup ~~> [backup target]
>    |                         |
>    | root                    | file
>    v                         v
> [copy-before-write]
>    |           |
>    | file      | target
>    v           v
> [active disk] [temp.img]
>
> In this case discard-after-copy does two things:
>
> - discard data in temp.img to save disk space
> - avoid further copy-before-write operation in discarded area
>
> Note that we have to declare WRITE permission on source in
> copy-before-write filter, for discard to work.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy

Unfortunately, setting BLK_PERM_WRITE unconditionally breaks blockdev-backup for a read-only node (even when not using discard-source):

> #!/bin/bash
> ./qemu-img create /tmp/disk.raw -f raw 1G
> ./qemu-img create /tmp/backup.raw -f raw 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev raw,node-name=node0,file.driver=file,file.filename=/tmp/disk.raw,read-only=true \
> --blockdev raw,node-name=node1,file.driver=file,file.filename=/tmp/backup.raw \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-backup", "arguments": { "device": "node0", "target": "node1", "sync": "full", "job-id": "backup0" } }
> EOF

The above works before applying this patch, but leads to an error afterwards:

> {"error": {"class": "GenericError", "desc": "Block node is read-only"}}

Best Regards, Fiona
Re: [PATCH 3/3] monitor: only run coroutine commands in qemu_aio_context
Am 16.01.24 um 20:00 schrieb Stefan Hajnoczi: > monitor_qmp_dispatcher_co() runs in the iohandler AioContext that is not > polled during nested event loops. The coroutine currently reschedules > itself in the main loop's qemu_aio_context AioContext, which is polled > during nested event loops. One known problem is that QMP device-add > calls drain_call_rcu(), which temporarily drops the BQL, leading to all > sorts of havoc like other vCPU threads re-entering device emulation code > while another vCPU thread is waiting in device emulation code with > aio_poll(). > > Paolo Bonzini suggested running non-coroutine QMP handlers in the > iohandler AioContext. This avoids trouble with nested event loops. His > original idea was to move coroutine rescheduling to > monitor_qmp_dispatch(), but I resorted to moving it to qmp_dispatch() > because we don't know if the QMP handler needs to run in coroutine > context in monitor_qmp_dispatch(). monitor_qmp_dispatch() would have > been nicer since it's associated with the monitor implementation and not > as general as qmp_dispatch(), which is also used by qemu-ga. > > A number of qemu-iotests need updated .out files because the order of > QMP events vs QMP responses has changed. > > Solves Issue #1933. > > Fixes: 7bed89958bfbf40df9ca681cefbdca63abdde39d ("device_core: use > drain_call_rcu in in qmp_device_add") > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2215192 > Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2214985 > Buglink: https://issues.redhat.com/browse/RHEL-17369 > Signed-off-by: Stefan Hajnoczi With the patch I can no longer see any do_qmp_dispatch_bh() calls run in vCPU threads. I also did a bit of smoke testing with some other QMP and QGA commands and didn't find any obvious breakage, so: Tested-by: Fiona Ebner P.S. Unfortunately, the patch does not solve the other issue I came across back then [0] with snapshot_save_job_bh() being executed during a vCPU thread's aio_poll(). See also [1]. 
I suppose this would need some other mechanism to solve or could it also be scheduled to the iohandler AioContext? It's not directly related to your patch of course, just mentioning it, because it's a similar theme. [0]: https://lore.kernel.org/qemu-devel/31757c45-695d-4408-468c-c2de560af...@proxmox.com/ [1]: https://gitlab.com/qemu-project/qemu/-/issues/2111 Best Regards, Fiona
Re: [PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
Am 17.01.24 um 13:20 schrieb Marc-André Lureau: > Hi > > On Wed, Jan 17, 2024 at 3:56 PM Fiona Ebner wrote: >> >> Am 17.01.24 um 12:33 schrieb Marc-André Lureau: >>> Hi >>> >>> On Wed, Jan 17, 2024 at 3:30 PM Fiona Ebner wrote: >>>> >>>> Am 17.01.24 um 12:11 schrieb Marc-André Lureau: >>>>> Hi >>>>> >>>>> On Wed, Jan 17, 2024 at 3:01 PM Fiona Ebner wrote: >>>>>> >>>>>> +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; >>>>>> type++) { >>>>>> +if (!info->types[type].data) { >>>>>> +missing_data = true; >>>>>> +} >>>>>> +} >>>>>> +/* >>>>>> + * If data is missing, the clipboard owner's 'request' callback >>>>>> needs to be >>>>>> + * set. Otherwise, there is no way to get the clipboard data and >>>>>> + * qemu_clipboard_request() cannot be called. >>>>>> + */ >>>>>> +if (missing_data && info->owner && !info->owner->request) { >>>>>> +return; >>>>>> +} >>>>> >>>>> It needs to check whether the type is "available". If not data is >>>>> provided, owner should be set as well, it should assert() that. >>>>> >>>>> That should do the job: >>>>> >>>>> for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { >>>>> /* >>>>> * If data is missing, the clipboard owner's 'request' callback needs >>>>> to >>>>> * be set. Otherwise, there is no way to get the clipboard data and >>>>> * qemu_clipboard_request() cannot be called. >>>>> */ >>>>> if (info->types[type].available && !info->types[type].data) { >>>>> assert(info->owner && info->owner->request); >>>>> } >>>>> } >>>>> >>>> >>>> Okay, thanks! But we can't assert, because that doesn't resolve the CVE >>>> as it would still crash. The VNC client might not have the >>>> VNC_FEATURE_CLIPBOARD_EXT feature, and the request callback is currently >>>> only set in that case. But we can return instead of assert to just avoid >>>> clipboard update. I'll send a v3. >>> >>> If it doesn't have VNC_FEATURE_CLIPBOARD_EXT, it shouldn't update the >>> clipboard without data. 
(ClientCutText/ServerCutText always have data, >>> even if 0-length) >>> >> >> But a buggy client should not be able to crash QEMU. With a >> VNC_MSG_CLIENT_CUT_TEXT message, when read_s32(data, 4) == 0, >> vnc_client_cut_text() is called with zero length. Is that supposed to > > Indeed. That case should be handled at the VNC server code level.. A > 0-length text should set clipboard to "", not NULL. > So have vnc_client_cut_text() call qemu_clipboard_set_data() with data="" and size=1 in the 0-length case? I can see such a call with noVNC as the client (which does use extended messages) when clearing the clipboard. But I suppose the call in vnc_client_cut_text_ext() also would need a check before calling qemu_clipboard_set_data(). Or is there anything guaranteeing the size read from the inflated buffer, i.e. uint32_t tsize = read_u32(buf, 0); which is passed to qemu_clipboard_set_data() is not zero? > With Ext, there is an explicit description of data ending with \0: > According to > https://github.com/rfbproto/rfbproto/blob/master/rfbproto.rst#user-content-7727extended-clipboard-pseudo-encoding > "The text must be followed by a terminating null even though the > length is also explicitly given." > > But we should still handle 0-length data cases. > >> happen? The branch for an extended message is only taken when >> read_s32(data, 4) < 0 and Daniel's patch fixes that branch. >> >> I noticed in qemu_clipboard_set_data(): >> >>> info->types[type].data = g_memdup(data, size); >> >> the g_memdup call will return NULL when size == 0 even if data is >> non-NULL. Is that the actual problem in the above scenario? > > Hmm, qemu_clipboard_set_data() shouldn't allow data == NULL, or size == 0. > > When data is requested, the "peer" will call > qemu_clipboard_set_data(). But I can't see a good way for the peer to > return with no data, it should probably update the clipboard info with > available=false. 
> Oh, I guess my suggestion above isn't necessary if we set available=false when size == 0. Sorry, is this what you mean by "no data"? Or do you mean the case where data=""? Or both? Best Regards, Fiona
Re: [PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
Am 17.01.24 um 12:33 schrieb Marc-André Lureau: > Hi > > On Wed, Jan 17, 2024 at 3:30 PM Fiona Ebner wrote: >> >> Am 17.01.24 um 12:11 schrieb Marc-André Lureau: >>> Hi >>> >>> On Wed, Jan 17, 2024 at 3:01 PM Fiona Ebner wrote: >>>> >>>> +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; >>>> type++) { >>>> +if (!info->types[type].data) { >>>> +missing_data = true; >>>> +} >>>> +} >>>> +/* >>>> + * If data is missing, the clipboard owner's 'request' callback needs >>>> to be >>>> + * set. Otherwise, there is no way to get the clipboard data and >>>> + * qemu_clipboard_request() cannot be called. >>>> + */ >>>> +if (missing_data && info->owner && !info->owner->request) { >>>> +return; >>>> +} >>> >>> It needs to check whether the type is "available". If not data is >>> provided, owner should be set as well, it should assert() that. >>> >>> That should do the job: >>> >>> for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { >>> /* >>> * If data is missing, the clipboard owner's 'request' callback needs to >>> * be set. Otherwise, there is no way to get the clipboard data and >>> * qemu_clipboard_request() cannot be called. >>> */ >>> if (info->types[type].available && !info->types[type].data) { >>> assert(info->owner && info->owner->request); >>> } >>> } >>> >> >> Okay, thanks! But we can't assert, because that doesn't resolve the CVE >> as it would still crash. The VNC client might not have the >> VNC_FEATURE_CLIPBOARD_EXT feature, and the request callback is currently >> only set in that case. But we can return instead of assert to just avoid >> clipboard update. I'll send a v3. > > If it doesn't have VNC_FEATURE_CLIPBOARD_EXT, it shouldn't update the > clipboard without data. (ClientCutText/ServerCutText always have data, > even if 0-length) > But a buggy client should not be able to crash QEMU. With a VNC_MSG_CLIENT_CUT_TEXT message, when read_s32(data, 4) == 0, vnc_client_cut_text() is called with zero length. Is that supposed to happen? 
The branch for an extended message is only taken when read_s32(data, 4) < 0 and Daniel's patch fixes that branch. I noticed in qemu_clipboard_set_data(): > info->types[type].data = g_memdup(data, size); the g_memdup call will return NULL when size == 0 even if data is non-NULL. Is that the actual problem in the above scenario? Best Regards, Fiona
Re: [PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
Am 17.01.24 um 12:11 schrieb Marc-André Lureau: > Hi > > On Wed, Jan 17, 2024 at 3:01 PM Fiona Ebner wrote: >> >> +for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; >> type++) { >> +if (!info->types[type].data) { >> +missing_data = true; >> +} >> +} >> +/* >> + * If data is missing, the clipboard owner's 'request' callback needs >> to be >> + * set. Otherwise, there is no way to get the clipboard data and >> + * qemu_clipboard_request() cannot be called. >> + */ >> +if (missing_data && info->owner && !info->owner->request) { >> +return; >> +} > > It needs to check whether the type is "available". If not data is > provided, owner should be set as well, it should assert() that. > > That should do the job: > > for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT; type++) { > /* > * If data is missing, the clipboard owner's 'request' callback needs to > * be set. Otherwise, there is no way to get the clipboard data and > * qemu_clipboard_request() cannot be called. > */ > if (info->types[type].available && !info->types[type].data) { > assert(info->owner && info->owner->request); > } > } > Okay, thanks! But we can't assert, because that doesn't resolve the CVE as it would still crash. The VNC client might not have the VNC_FEATURE_CLIPBOARD_EXT feature, and the request callback is currently only set in that case. But we can return instead of assert to just avoid clipboard update. I'll send a v3. Best Regards, Fiona
[PATCH v2] ui/clipboard: ensure data is available or request callback is set upon update
With VNC, it can be that a client sends a VNC_MSG_CLIENT_CUT_TEXT message before sending a VNC_MSG_CLIENT_SET_ENCODINGS message with VNC_ENCODING_CLIPBOARD_EXT for configuring the clipboard extension. This means that qemu_clipboard_request() can be reached (via vnc_client_cut_text_ext()) before vnc_server_cut_text_caps() was called and had the chance to initialize the clipboard peer. In that case, info->owner->request is NULL instead of a function, so attempting to call it in qemu_clipboard_request() results in a segfault. In particular, this can happen when using the KRDC (22.12.3) VNC client on Wayland.

Another code path leading to the same issue in qemu_clipboard_request() is via vdagent_chr_write() and vdagent_clipboard_recv_request() after a non-extended VNC_MSG_CLIENT_CUT_TEXT message with len=0 was sent. In particular, this can happen when using the KRDC (22.12.3) VNC client on X11.

It is not enough to check in ui/vnc.c's protocol_client_msg() if the VNC_FEATURE_CLIPBOARD_EXT feature is enabled before handling an extended clipboard message with vnc_client_cut_text_ext(), because of the following scenario with two clients (say noVNC and KRDC):

The noVNC client sets the extension VNC_FEATURE_CLIPBOARD_EXT and initializes its cbpeer. The KRDC client does not, but triggers a vnc_client_cut_text() (note it's not the _ext variant). There, a new clipboard info with it as the 'owner' is created and qemu_clipboard_set_data() is called, which in turn calls qemu_clipboard_update() with that info. In qemu_clipboard_update(), the notifier for the noVNC client will be called, i.e. vnc_clipboard_notify(), which also sets vs->cbinfo for the noVNC client. The 'owner' in that clipboard info is the clipboard peer for the KRDC client, which did not initialize the 'request' function. That sounds correct to me, it is the owner of that clipboard info.
Then when noVNC sends a VNC_MSG_CLIENT_CUT_TEXT message (it did set the VNC_FEATURE_CLIPBOARD_EXT feature correctly, so a check for it passes), that clipboard info is passed to qemu_clipboard_request() and the original segfault still happens.

Fix the issue by disallowing clipboard update if both, data is missing and the clipboard owner's 'request' callback is not set. Add an assert that the clipboard owner's 'request' callback is set in qemu_clipboard_request() to have a clean error/abort should a similar issue appear in the future.

Fixes: CVE-2023-6683
Reported-by: Markus Frank
Suggested-by: Marc-André Lureau
Signed-off-by: Fiona Ebner
---
Changes in v2:
* Different approach as suggested by Marc-André: Instead of quietly
  returning in qemu_clipboard_request() when no request callback is set,
  prevent clipboard update if there is both, no data and no request
  callback.

 ui/clipboard.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/ui/clipboard.c b/ui/clipboard.c
index 3d14bffaf8..d1bb7c77f2 100644
--- a/ui/clipboard.c
+++ b/ui/clipboard.c
@@ -65,12 +65,28 @@ bool qemu_clipboard_check_serial(QemuClipboardInfo *info, bool client)
 
 void qemu_clipboard_update(QemuClipboardInfo *info)
 {
+    uint32_t type;
+    bool missing_data = false;
     QemuClipboardNotify notify = {
         .type = QEMU_CLIPBOARD_UPDATE_INFO,
         .info = info,
     };
     assert(info->selection < QEMU_CLIPBOARD_SELECTION__COUNT);
 
+    for (type = 0; type < QEMU_CLIPBOARD_TYPE__COUNT && !missing_data; type++) {
+        if (!info->types[type].data) {
+            missing_data = true;
+        }
+    }
+    /*
+     * If data is missing, the clipboard owner's 'request' callback needs to be
+     * set. Otherwise, there is no way to get the clipboard data and
+     * qemu_clipboard_request() cannot be called.
+     */
+    if (missing_data && info->owner && !info->owner->request) {
+        return;
+    }
+
     notifier_list_notify(&clipboard_notifiers, &notify);
 
     if (cbinfo[info->selection] != info) {
@@ -132,6 +148,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info,
         !info->owner)
         return;
 
+    assert(info->owner->request);
+
     info->types[type].requested = true;
     info->owner->request(info, type);
 }
-- 
2.39.2
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 16.01.24 um 13:11 schrieb Fiona Ebner: > Am 15.01.24 um 13:00 schrieb Marc-André Lureau: >>>>> >>>> >>>> The trouble is when qemu_clipboard_update() is called without data & >>>> without a request callback set. We shouldn't allow that as we have no >>>> means to get the clipboard data then. >>>> >>> >>> In the above scenario, I'm pretty sure there is data when >>> qemu_clipboard_update() is called. Just no request callback. If we'd >>> reject this, won't that break clients that do not set the >>> VNC_FEATURE_CLIPBOARD_EXT feature and only use non-extended >>> VNC_MSG_CLIENT_CUT_TEXT messages? >> >> If "data" is already set, then qemu_clipboard_request() returns. >> >> Inverting the condition I suggested above: it's allowed to be the >> clipboard owner if either "data" is set, or a request callback is set. >> > > Oh, sorry. Yes, it seems the problematic case is where data is not set. > But isn't that legitimate when clearing the clipboard? Or is a > VNC_MSG_CLIENT_CUT_TEXT message not valid when len is 0 and should be > rejected? In my testing KRDC does send such a message when the clipboard > is cleared: > >> #1 0x558f1e6a0dac in vnc_client_cut_text (vs=0x558f207754d0, len=0, >> text=0x558f2046e008 "\003 \002\377\005") at ../ui/vnc-clipboard.c:313 >> #2 0x558f1e68e067 in protocol_client_msg (vs=0x558f207754d0, >> data=0x558f2046e000 "\006", len=8) at ../ui/vnc.c:2454 > > Your suggestion would disallow this for clients that do not set the > VNC_FEATURE_CLIPBOARD_EXT feature (and only use non-extended > VNC_MSG_CLIENT_CUT_TEXT messages). > I noticed that there is yet another code path leading to qemu_clipboard_request() and dereferencing the NULL owner->request even if only non-extended VNC_MSG_CLIENT_CUT_TEXT messages are used: > Thread 1 "qemu-system-x86" hit Breakpoint 1, vnc_client_cut_text ( > vs=0x593e6c00, len=0, > text=0x575ea608 > "404078,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100\377\001") at > ../ui/vnc-clipboard.c:310 > (gdb) c > Continuing. 
And later:

> qemu_clipboard_request (info=0x57e576e0, type=QEMU_CLIPBOARD_TYPE_TEXT) at ../ui/clipboard.c:129
> 129 if (info->types[type].data ||
> (gdb) p *info->owner
> $65 = {name = 0x0, notifier = {notify = 0x0, node = {le_next = 0x0, le_prev = 0x0}}, request = 0x0}
> (gdb) bt
> #0 qemu_clipboard_request (info=0x57e576e0, type=QEMU_CLIPBOARD_TYPE_TEXT) at ../ui/clipboard.c:129
> #1 0x558952ce in vdagent_chr_recv_chunk (vd=0x56f67800) at ../ui/vdagent.c:769
> #2 vdagent_chr_write (chr=, buf=0x7fff4ab263e4 "", len=0) at ../ui/vdagent.c:840
> #3 0x55d98830 in qemu_chr_write_buffer (s=s@entry=0x56f67800, buf=buf@entry=0x7fff4ab263c0 "\001", len=36, offset=offset@entry=0x7fffd030, write_all=false) at ../chardev/char.c:122
> #4 0x55d98c3c in qemu_chr_write (s=0x56f67800, buf=buf@entry=0x7fff4ab263c0 "\001", len=len@entry=36, write_all=, write_all@entry=false) at ../chardev/char.c:186
> #5 0x55d9109f in qemu_chr_fe_write (be=be@entry=0x56e59040, buf=buf@entry=0x7fff4ab263c0 "\001", len=len@entry=36) at ../chardev/char-fe.c:42
> #6 0x55900045 in flush_buf (port=0x56e58f40, buf=0x7fff4ab263c0 "\001", len=36) at ../hw/char/virtio-console.c:63
> #7 0x55bea4f3 in do_flush_queued_data (port=0x56e58f40, vq=0x59272558, vdev=0x5926a4d0) at ../hw/char/virtio-serial-bus.c:188
> #8 0x55c201ff in virtio_queue_notify_vq (vq=0x59272558) at ../hw/virtio/virtio.c:2268
> #9 0x55e36dd5 in aio_dispatch_handler (ctx=ctx@entry=0x56e51b10, node=0x7ffed812b9f0) at ../util/aio-posix.c:372
> #10 0x55e37662 in aio_dispatch_handlers (ctx=0x56e51b10) at ../util/aio-posix.c:414
> #11 aio_dispatch (ctx=0x56e51b10) at ../util/aio-posix.c:424
> #12 0x55e4d44e in aio_ctx_dispatch (source=, callback=, user_data=) at ../util/async.c:358
> #13 0x7753e7a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
> #14 0x55e4ecb8 in glib_pollfds_poll () at ../util/main-loop.c:287
> #15 os_host_main_loop_wait (timeout=58675786) at ../util/main-loop.c:310
> #16 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:589
> #17 0x55aa5f63 in qemu_main_loop () at ../system/runstate.c:791
> #18 0x55cadc16 in qemu_default_main () at ../sys
[PATCH] block/io: clear BDRV_BLOCK_RECURSE flag after recursing in bdrv_co_block_status
Using fleecing backup like in [0] on a qcow2 image (with metadata preallocation) can lead to the following assertion failure:

> bdrv_co_do_block_status: Assertion `!(ret & BDRV_BLOCK_ZERO)' failed.

In the reproducer [0], it happens because the BDRV_BLOCK_RECURSE flag will be set by the qcow2 driver, so the caller will recursively check the file child. Then the BDRV_BLOCK_ZERO flag ends up being set too. Later up the call chain, in bdrv_co_do_block_status() for the snapshot-access driver, the assertion failure will happen, because both flags are set.

To fix it, clear the recurse flag after the recursive check was done.

In detail:

> #0 qcow2_co_block_status

Returns 0x45 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID.

> #1 bdrv_co_do_block_status

Because of the data flag, bdrv_co_do_block_status() will now also set BDRV_BLOCK_ALLOCATED. Because of the recurse flag, bdrv_co_do_block_status() for the bdrv_file child will be called, which returns 0x16 = BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_ZERO. Now the return value inherits the zero flag. Returns 0x57 = BDRV_BLOCK_RECURSE | BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_ZERO.

> #2 bdrv_co_common_block_status_above
> #3 bdrv_co_block_status_above
> #4 bdrv_co_block_status
> #5 cbw_co_snapshot_block_status
> #6 bdrv_co_snapshot_block_status
> #7 snapshot_access_co_block_status
> #8 bdrv_co_do_block_status

Return value is propagated all the way up to here, where the assertion failure happens, because BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO are both set.
> #9 bdrv_co_common_block_status_above
> #10 bdrv_co_block_status_above
> #11 block_copy_block_status
> #12 block_copy_dirty_clusters
> #13 block_copy_common
> #14 block_copy_async_co_entry
> #15 coroutine_trampoline

[0]:

> #!/bin/bash
> rm /tmp/disk.qcow2
> ./qemu-img create /tmp/disk.qcow2 -o preallocation=metadata -f qcow2 1G
> ./qemu-img create /tmp/fleecing.qcow2 -f qcow2 1G
> ./qemu-img create /tmp/backup.qcow2 -f qcow2 1G
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/disk.qcow2 \
> --blockdev qcow2,node-name=node1,file.driver=file,file.filename=/tmp/fleecing.qcow2 \
> --blockdev qcow2,node-name=node2,file.driver=file,file.filename=/tmp/backup.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "copy-before-write", "file": "node0", "target": "node1", "node-name": "node3" } }
> {"execute": "blockdev-add", "arguments": { "driver": "snapshot-access", "file": "node3", "node-name": "snap0" } }
> {"execute": "blockdev-backup", "arguments": { "device": "snap0", "target": "node1", "sync": "full", "job-id": "backup0" } }
> EOF

Signed-off-by: Fiona Ebner
---
I'm new to this part of the code, so I'm not sure if it is actually safe to clear the flag? Intuitively, I'd expect it to be only relevant until it was acted upon, but no clue.

 block/io.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/io.c b/block/io.c
index 8fa7670571..33150c0359 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2584,6 +2584,16 @@ bdrv_co_do_block_status(BlockDriverState *bs, bool want_zero,
                 ret |= (ret2 & BDRV_BLOCK_ZERO);
             }
         }
+
+        /*
+         * Now that the recursive search was done, clear the flag. Otherwise,
+         * with more complicated block graphs like snapshot-access ->
+         * copy-before-write -> qcow2, where the return value will be propagated
+         * further up to a parent bdrv_co_do_block_status() call, both the
+         * BDRV_BLOCK_RECURSE and BDRV_BLOCK_ZERO flags would be set, which is
+         * not allowed.
+         */
+        ret &= ~BDRV_BLOCK_RECURSE;
     }
 
 out:
-- 
2.39.2
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 15.01.24 um 13:00 schrieb Marc-André Lureau: >>> >>> The trouble is when qemu_clipboard_update() is called without data & >>> without a request callback set. We shouldn't allow that as we have no >>> means to get the clipboard data then. >>> >> >> In the above scenario, I'm pretty sure there is data when >> qemu_clipboard_update() is called. Just no request callback. If we'd >> reject this, won't that break clients that do not set the >> VNC_FEATURE_CLIPBOARD_EXT feature and only use non-extended >> VNC_MSG_CLIENT_CUT_TEXT messages? > > If "data" is already set, then qemu_clipboard_request() returns. > > Inverting the condition I suggested above: it's allowed to be the > clipboard owner if either "data" is set, or a request callback is set. > Oh, sorry. Yes, it seems the problematic case is where data is not set. But isn't that legitimate when clearing the clipboard? Or is a VNC_MSG_CLIENT_CUT_TEXT message not valid when len is 0 and should be rejected? In my testing KRDC does send such a message when the clipboard is cleared: > #1 0x558f1e6a0dac in vnc_client_cut_text (vs=0x558f207754d0, len=0, > text=0x558f2046e008 "\003 \002\377\005") at ../ui/vnc-clipboard.c:313 > #2 0x558f1e68e067 in protocol_client_msg (vs=0x558f207754d0, > data=0x558f2046e000 "\006", len=8) at ../ui/vnc.c:2454 Your suggestion would disallow this for clients that do not set the VNC_FEATURE_CLIPBOARD_EXT feature (and only use non-extended VNC_MSG_CLIENT_CUT_TEXT messages). Best Regards, Fiona
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 15.01.24 um 12:33 schrieb Marc-André Lureau: > Hi > > On Mon, Jan 15, 2024 at 3:26 PM Fiona Ebner wrote: >> >> Am 15.01.24 um 12:15 schrieb Marc-André Lureau: >>> Hi >>> >>> On Mon, Jan 15, 2024 at 2:45 PM Fiona Ebner wrote: >>>> >>>> Am 14.01.24 um 14:51 schrieb Marc-André Lureau: >>>>>> >>>>>> diff --git a/ui/clipboard.c b/ui/clipboard.c >>>>>> index 3d14bffaf8..c13b54d2e9 100644 >>>>>> --- a/ui/clipboard.c >>>>>> +++ b/ui/clipboard.c >>>>>> @@ -129,7 +129,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info, >>>>>> if (info->types[type].data || >>>>>> info->types[type].requested || >>>>>> !info->types[type].available || >>>>>> -!info->owner) >>>>>> +!info->owner || >>>>>> +!info->owner->request) >>>>>> return; >>>>> >>>>> While that fixes the crash, I think we should handle the situation >>>>> earlier. A clipboard peer shouldn't be allowed to hold the clipboard >>>>> if it doesn't have the data available or a "request" callback set. >>>>> >>>> >>>> Where should initialization of the cbpeer happen so that we are >>>> guaranteed to do it also for clients that do not set the >>>> VNC_FEATURE_CLIPBOARD_EXT feature? Can the vnc_clipboard_request >>>> function be re-used for clients without that feature or will it be >>>> necessary to add some kind of "dummy" request callback for those clients? >>> >>> qemu_clipboard_update() shouldn't accept info as current clipboard if >>> the owner doesn't have the data available or a "request" callback set. >>> This should also be assert() somehow and handled earlier. >>> >> >> The request callback is only initialized in vnc_server_cut_text_caps() >> when the VNC_FEATURE_CLIPBOARD_EXT is enabled. AFAIU, it's perfectly >> fine for clients to use the clipboard with non-extended messages and >> qemu_clipboard_update() should (and currently does) accept those. 
>> >>> In vnc_client_cut_text_ext() we could detect that situation, but with >>> Daniel's "[PATCH] ui: reject extended clipboard message if not >>> activated", this shouldn't happen anymore iiuc. >>> >> >> Daniel's patch doesn't change the behavior for non-extended messages. >> The problem can still happen with two VNC clients. This is the scenario >> described in the lower half of my commit message (and why Daniel >> mentions in his patch that it's not sufficient to fix the CVE). >> >> In short: client A does not set the VNC_FEATURE_CLIPBOARD_EXT feature >> and then uses a non-extended VNC_MSG_CLIENT_CUT_TEXT message. This leads >> to vnc_client_cut_text() being called and setting the clipboard info >> referencing that client. But here, no request callback is initialized, >> that only happens in vnc_server_cut_text_caps() when the >> VNC_FEATURE_CLIPBOARD_EXT is enabled. >> >> When client B does set the VNC_FEATURE_CLIPBOARD_EXT feature and does >> send an extended VNC_MSG_CLIENT_CUT_TEXT message, the request callback >> will be attempted but it isn't set. >> > > The trouble is when qemu_clipboard_update() is called without data & > without a request callback set. We shouldn't allow that as we have no > means to get the clipboard data then. > In the above scenario, I'm pretty sure there is data when qemu_clipboard_update() is called. Just no request callback. If we'd reject this, won't that break clients that do not set the VNC_FEATURE_CLIPBOARD_EXT feature and only use non-extended VNC_MSG_CLIENT_CUT_TEXT messages? Best Regards, Fiona
Re: [PATCH] ui/clipboard: avoid crash upon request when clipboard peer is not initialized
Am 15.01.24 um 12:15 schrieb Marc-André Lureau: > Hi > > On Mon, Jan 15, 2024 at 2:45 PM Fiona Ebner wrote: >> >> Am 14.01.24 um 14:51 schrieb Marc-André Lureau: >>>> >>>> diff --git a/ui/clipboard.c b/ui/clipboard.c >>>> index 3d14bffaf8..c13b54d2e9 100644 >>>> --- a/ui/clipboard.c >>>> +++ b/ui/clipboard.c >>>> @@ -129,7 +129,8 @@ void qemu_clipboard_request(QemuClipboardInfo *info, >>>> if (info->types[type].data || >>>> info->types[type].requested || >>>> !info->types[type].available || >>>> -!info->owner) >>>> +!info->owner || >>>> +!info->owner->request) >>>> return; >>> >>> While that fixes the crash, I think we should handle the situation >>> earlier. A clipboard peer shouldn't be allowed to hold the clipboard >>> if it doesn't have the data available or a "request" callback set. >>> >> >> Where should initialization of the cbpeer happen so that we are >> guaranteed to do it also for clients that do not set the >> VNC_FEATURE_CLIPBOARD_EXT feature? Can the vnc_clipboard_request >> function be re-used for clients without that feature or will it be >> necessary to add some kind of "dummy" request callback for those clients? > > qemu_clipboard_update() shouldn't accept info as current clipboard if > the owner doesn't have the data available or a "request" callback set. > This should also be assert() somehow and handled earlier. > The request callback is only initialized in vnc_server_cut_text_caps() when the VNC_FEATURE_CLIPBOARD_EXT is enabled. AFAIU, it's perfectly fine for clients to use the clipboard with non-extended messages and qemu_clipboard_update() should (and currently does) accept those. > In vnc_client_cut_text_ext() we could detect that situation, but with > Daniel's "[PATCH] ui: reject extended clipboard message if not > activated", this shouldn't happen anymore iiuc. > Daniel's patch doesn't change the behavior for non-extended messages. The problem can still happen with two VNC clients. 
This is the scenario described in the lower half of my commit message (and why Daniel mentions in his patch that it's not sufficient to fix the CVE). In short: client A does not set the VNC_FEATURE_CLIPBOARD_EXT feature and then uses a non-extended VNC_MSG_CLIENT_CUT_TEXT message. This leads to vnc_client_cut_text() being called and setting the clipboard info referencing that client. But here, no request callback is initialized, that only happens in vnc_server_cut_text_caps() when the VNC_FEATURE_CLIPBOARD_EXT is enabled. When client B does set the VNC_FEATURE_CLIPBOARD_EXT feature and does send an extended VNC_MSG_CLIENT_CUT_TEXT message, the request callback will be attempted but it isn't set. Best Regards, Fiona