Re: [PATCH 0/6] hw/nvme: enhanced protection information (64-bit guard)

2022-02-16 Thread Keith Busch
On Mon, Feb 14, 2022 at 01:30:23PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This adds support for one possible new protection information format
> introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard
> and 48-bit reference tag. This version does not support storage tags.
> 
> Like the CRC16 support already present, this uses a software
> implementation of CRC64 (so it is naturally pretty slow), but it's good
> enough for verification purposes.
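A bit-at-a-time software CRC64 of the kind described above can be sketched as follows. This is a minimal Python model, assuming the reflected form of the NVMe 64-bit guard polynomial (0x9a6c9329ac4bc9b5, per the CRC-64/NVME definition); the real QEMU/Linux implementations are table-driven C:

```python
# Sketch of the TP4068 64-bit guard CRC (CRC-64/NVME): reflected
# polynomial 0x9a6c9329ac4bc9b5, init and xorout all-ones.  A real
# implementation would be table-driven; this bit-at-a-time loop is
# "naturally pretty slow" but fine for verification.
POLY = 0x9A6C9329AC4BC9B5
MASK = (1 << 64) - 1

def crc64_nvme(data: bytes, crc: int = MASK) -> int:
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift one bit out; apply the polynomial if it was set.
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ MASK  # final inversion (xorout)
```

The all-ones init and xorout mean the CRC of an empty buffer is 0, and the standard `b"123456789"` test vector can be used as a known-answer check.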
> 
> This goes hand-in-hand with the support that Keith submitted for the
> Linux kernel[1].
> 
> [1]: 
> https://lore.kernel.org/linux-nvme/20220201190128.3075065-1-kbu...@kernel.org/

Other than the comment on 6/6, the series looks good to me.

Reviewed-by: Keith Busch 



Re: [PATCH 6/6] hw/nvme: 64-bit pi support

2022-02-16 Thread Keith Busch
On Mon, Feb 14, 2022 at 01:30:29PM +0100, Klaus Jensen wrote:
> @@ -384,6 +389,12 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, 
> Error **errp)
>  return -1;
>  }
>  
> +if (ns->params.pif != NVME_PI_GUARD_16 &&
> +ns->params.pif != NVME_PI_GUARD_64) {
> +error_setg(errp, "invalid 'pif'");
> +return -1;
> +}

In addition, the requested metadata size ('params.ms') should be checked
against the requested PI option. The function currently just checks
against 8 bytes, but the 64b guard requires at least 16 bytes.
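The suggested constraint could look like the following sketch (Python for illustration only; the `NVME_PI_GUARD_16`/`NVME_PI_GUARD_64` names come from the patch, while the concrete enum values and the `PI_TUPLE_SIZE` helper are assumptions based on the PI tuple layouts: a 16-bit guard tuple is 2b guard + 2b app tag + 4b ref tag = 8 bytes, a 64-bit guard tuple is 8b guard + 2b app tag + 6b ref tag = 16 bytes):

```python
# Assumed pif encodings: 0 = 16-bit guard, 2 = 64-bit guard.
NVME_PI_GUARD_16 = 0
NVME_PI_GUARD_64 = 2

# Minimum metadata bytes needed to hold one PI tuple per guard type.
PI_TUPLE_SIZE = {NVME_PI_GUARD_16: 8, NVME_PI_GUARD_64: 16}

def check_pi_constraints(pif, ms):
    """Return an error string, or None if the constraints hold."""
    if pif not in PI_TUPLE_SIZE:
        return "invalid 'pif'"
    if ms < PI_TUPLE_SIZE[pif]:
        return ("metadata size (%d) too small for protection information "
                "format (needs at least %d bytes)" % (ms, PI_TUPLE_SIZE[pif]))
    return None
```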

Otherwise, looks great.



Re: [PATCH v2] tests/qemu-iotests: Rework the checks and spots using GNU sed

2022-02-16 Thread Eric Blake
On Wed, Feb 16, 2022 at 01:54:54PM +0100, Thomas Huth wrote:
> Instead of failing the iotests if GNU sed is not available (or skipping
> them completely in the check-block.sh script), it would be better to
> simply skip the bash-based tests that rely on GNU sed, so that the other
> tests could still be run. Thus we now explicitly use "gsed" (either as
> direct program or as a wrapper around "sed" if it's the GNU version)
> in the spots that rely on the GNU sed behavior. Statements that use the
> "-r" parameter of sed have been switched to use "-E" instead, since this
> switch is supported by all sed versions on our supported build hosts
> (most also support "-r", but macOS' sed only supports "-E"). With all
> these changes in place, we then can also remove the sed checks from the
> check-block.sh script, so that "make check-block" can now be run on
> systems without GNU sed, too.
> 
> Signed-off-by: Thomas Huth 
> ---
>  I've checked that this still works fine with "make vm-build-freebsd",
>  "make vm-build-netbsd" and "make vm-build-openbsd" and the Cirrus-CI
>  macOS tasks.
> 
>  tests/check-block.sh | 12 --
>  tests/qemu-iotests/271   |  2 +-
>  tests/qemu-iotests/common.filter | 65 
>  tests/qemu-iotests/common.rc | 45 +++---
>  4 files changed, 57 insertions(+), 67 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PATCH v4 13/18] iotests/image-fleecing: add test-case for fleecing format node

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/tests/image-fleecing | 64 -
 tests/qemu-iotests/tests/image-fleecing.out | 76 -
 2 files changed, 120 insertions(+), 20 deletions(-)

diff --git a/tests/qemu-iotests/tests/image-fleecing 
b/tests/qemu-iotests/tests/image-fleecing
index a58b5a1781..909fc0a7ad 100755
--- a/tests/qemu-iotests/tests/image-fleecing
+++ b/tests/qemu-iotests/tests/image-fleecing
@@ -49,12 +49,17 @@ remainder = [('0xd5', '0x108000',  '32k'), # Right-end of 
partial-left [1]
  ('0xdc', '32M',   '32k'), # Left-end of partial-right [2]
  ('0xcd', '0x3ff', '64k')] # patterns[3]
 
-def do_test(use_cbw, base_img_path, fleece_img_path, nbd_sock_path, vm):
+def do_test(use_cbw, use_snapshot_access_filter, base_img_path,
+fleece_img_path, nbd_sock_path, vm):
 log('--- Setting up images ---')
 log('')
 
 assert qemu_img('create', '-f', iotests.imgfmt, base_img_path, '64M') == 0
-assert qemu_img('create', '-f', 'qcow2', fleece_img_path, '64M') == 0
+if use_snapshot_access_filter:
+assert use_cbw
+assert qemu_img('create', '-f', 'raw', fleece_img_path, '64M') == 0
+else:
+assert qemu_img('create', '-f', 'qcow2', fleece_img_path, '64M') == 0
 
 for p in patterns:
 qemu_io('-f', iotests.imgfmt,
@@ -81,16 +86,23 @@ def do_test(use_cbw, base_img_path, fleece_img_path, 
nbd_sock_path, vm):
 log('')
 
 
-# create tmp_node backed by src_node
-log(vm.qmp('blockdev-add', {
-'driver': 'qcow2',
-'node-name': tmp_node,
-'file': {
+if use_snapshot_access_filter:
+log(vm.qmp('blockdev-add', {
+'node-name': tmp_node,
 'driver': 'file',
 'filename': fleece_img_path,
-},
-'backing': src_node,
-}))
+}))
+else:
+# create tmp_node backed by src_node
+log(vm.qmp('blockdev-add', {
+'driver': 'qcow2',
+'node-name': tmp_node,
+'file': {
+'driver': 'file',
+'filename': fleece_img_path,
+},
+'backing': src_node,
+}))
 
 # Establish CBW from source to fleecing node
 if use_cbw:
@@ -102,6 +114,13 @@ def do_test(use_cbw, base_img_path, fleece_img_path, 
nbd_sock_path, vm):
 }))
 
 log(vm.qmp('qom-set', path=qom_path, property='drive', value='fl-cbw'))
+
+if use_snapshot_access_filter:
+log(vm.qmp('blockdev-add', {
+'driver': 'snapshot-access',
+'node-name': 'fl-access',
+'file': 'fl-cbw',
+}))
 else:
 log(vm.qmp('blockdev-backup',
job_id='fleecing',
@@ -109,16 +128,18 @@ def do_test(use_cbw, base_img_path, fleece_img_path, 
nbd_sock_path, vm):
target=tmp_node,
sync='none'))
 
+export_node = 'fl-access' if use_snapshot_access_filter else tmp_node
+
 log('')
 log('--- Setting up NBD Export ---')
 log('')
 
-nbd_uri = 'nbd+unix:///%s?socket=%s' % (tmp_node, nbd_sock_path)
+nbd_uri = 'nbd+unix:///%s?socket=%s' % (export_node, nbd_sock_path)
 log(vm.qmp('nbd-server-start',
{'addr': {'type': 'unix',
  'data': {'path': nbd_sock_path}}}))
 
-log(vm.qmp('nbd-server-add', device=tmp_node))
+log(vm.qmp('nbd-server-add', device=export_node))
 
 log('')
 log('--- Sanity Check ---')
@@ -151,7 +172,11 @@ def do_test(use_cbw, base_img_path, fleece_img_path, 
nbd_sock_path, vm):
 log('--- Cleanup ---')
 log('')
 
+log(vm.qmp('nbd-server-stop'))
+
 if use_cbw:
+if use_snapshot_access_filter:
+log(vm.qmp('blockdev-del', node_name='fl-access'))
 log(vm.qmp('qom-set', path=qom_path, property='drive', value=src_node))
 log(vm.qmp('blockdev-del', node_name='fl-cbw'))
 else:
@@ -160,7 +185,6 @@ def do_test(use_cbw, base_img_path, fleece_img_path, 
nbd_sock_path, vm):
 assert e is not None
 log(e, filters=[iotests.filter_qmp_event])
 
-log(vm.qmp('nbd-server-stop'))
 log(vm.qmp('blockdev-del', node_name=tmp_node))
 vm.shutdown()
 
@@ -177,17 +201,21 @@ def do_test(use_cbw, base_img_path, fleece_img_path, 
nbd_sock_path, vm):
 log('Done')
 
 
-def test(use_cbw):
+def test(use_cbw, use_snapshot_access_filter):
 with iotests.FilePath('base.img') as base_img_path, \
  iotests.FilePath('fleece.img') as fleece_img_path, \
  iotests.FilePath('nbd.sock',
   base_dir=iotests.sock_dir) as nbd_sock_path, \
  iotests.VM() as vm:
-do_test(use_cbw, base_img_path, fleece_img_path, nbd_sock_path, vm)
+do_test(use_cbw, use_snapshot_access_filter, base_img_path,
+fleece_img_path, nbd_sock_path, vm)
 
 
 log('=== Test backup(sync=none) based fleecing ===

[PATCH v4 11/18] block: introduce snapshot-access filter

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
The filter simply utilizes snapshot-access API of underlying block
node.

In further patches we want to use it like this:

[guest]   [NBD export]
   ||
   | root   | root
   v file   v
[copy-before-write]<--[snapshot-access]
   |   |
   | file  | target
   v   v
[active-disk] [temp.img]

This way, the NBD client will be able to read the snapshotted state of
the active disk while the guest continues to write to it. This is known
as "fleecing", and it currently uses another scheme based on a temporary
qcow2 image whose backing file is the active disk. The new scheme comes
with several benefits - see the next commit.

The other possible application is exporting internal snapshots of
qcow2, like this:

[guest]  [NBD export]
   |  |
   | root | root
   v   file   v
[qcow2]<-[snapshot-access]

For this, we'll need to implement snapshot-access API handlers in the
qcow2 driver, and improve the snapshot-access filter (and API) to make
it possible to select a snapshot by name.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json|   4 +-
 block/snapshot-access.c | 132 
 MAINTAINERS |   1 +
 block/meson.build   |   1 +
 4 files changed, 137 insertions(+), 1 deletion(-)
 create mode 100644 block/snapshot-access.c

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 3bab597506..a904755e98 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2914,13 +2914,14 @@
 # @blkreplay: Since 4.2
 # @compress: Since 5.0
 # @copy-before-write: Since 6.2
+# @snapshot-access: Since 7.0
 #
 # Since: 2.9
 ##
 { 'enum': 'BlockdevDriver',
   'data': [ 'blkdebug', 'blklogwrites', 'blkreplay', 'blkverify', 'bochs',
 'cloop', 'compress', 'copy-before-write', 'copy-on-read', 'dmg',
-'file', 'ftp', 'ftps', 'gluster',
+'file', 'snapshot-access', 'ftp', 'ftps', 'gluster',
 {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
 {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
 'http', 'https', 'iscsi',
@@ -4267,6 +4268,7 @@
   'rbd':'BlockdevOptionsRbd',
   'replication': { 'type': 'BlockdevOptionsReplication',
'if': 'CONFIG_REPLICATION' },
+  'snapshot-access': 'BlockdevOptionsGenericFormat',
   'ssh':'BlockdevOptionsSsh',
   'throttle':   'BlockdevOptionsThrottle',
   'vdi':'BlockdevOptionsGenericFormat',
diff --git a/block/snapshot-access.c b/block/snapshot-access.c
new file mode 100644
index 00..77b87c1946
--- /dev/null
+++ b/block/snapshot-access.c
@@ -0,0 +1,132 @@
+/*
+ * snapshot_access block driver
+ *
+ * Copyright (c) 2022 Virtuozzo International GmbH.
+ *
+ * Author:
+ *  Sementsov-Ogievskiy Vladimir 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see .
+ */
+
+#include "qemu/osdep.h"
+
+#include "sysemu/block-backend.h"
+#include "qemu/cutils.h"
+#include "block/block_int.h"
+
+static coroutine_fn int
+snapshot_access_co_preadv_part(BlockDriverState *bs,
+   int64_t offset, int64_t bytes,
+   QEMUIOVector *qiov, size_t qiov_offset,
+   BdrvRequestFlags flags)
+{
+if (flags) {
+return -ENOTSUP;
+}
+
+return bdrv_co_preadv_snapshot(bs->file, offset, bytes, qiov, qiov_offset);
+}
+
+static int coroutine_fn
+snapshot_access_co_block_status(BlockDriverState *bs,
+bool want_zero, int64_t offset,
+int64_t bytes, int64_t *pnum,
+int64_t *map, BlockDriverState **file)
+{
+return bdrv_co_snapshot_block_status(bs->file->bs, want_zero, offset,
+ bytes, pnum, map, file);
+}
+
+static int coroutine_fn snapshot_access_co_pdiscard(BlockDriverState *bs,
+ int64_t offset, int64_t bytes)
+{
+return bdrv_co_pdiscard_snapshot(bs->file->bs, offset, bytes);
+}
+
+static int coroutine_fn
+snapshot_access_co_pwrite_zeroes(BlockDriverState *bs,
+ int64_t offset, int64_t bytes,
+ BdrvRequestFlags flags)
+{
+return -ENOTSUP;
+

[PATCH v4 09/18] block/reqlist: add reqlist_wait_all()

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Add a function to wait for all intersecting requests,
to be used in a further commit.
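The contract is easy to model: reqlist_wait_one() waits for (at most) one conflicting request and reports whether it found any, and reqlist_wait_all() simply loops until no conflict remains. A toy Python model of that contract follows (hypothetical names mirroring the C API; dropping a request stands in for the real coroutine wait):

```python
class Req:
    def __init__(self, offset, nbytes):
        self.offset, self.bytes = offset, nbytes

def intersects(req, offset, nbytes):
    # Half-open ranges [offset, offset + nbytes) overlap?
    return req.offset < offset + nbytes and offset < req.offset + req.bytes

def reqlist_wait_one(reqs, offset, nbytes):
    """Wait for one conflicting request (here: remove it) and report
    whether a conflict was found."""
    for req in reqs:
        if intersects(req, offset, nbytes):
            reqs.remove(req)  # stands in for waiting on the request's queue
            return True
    return False

def reqlist_wait_all(reqs, offset, nbytes):
    # Loop until no intersecting request remains; the caller must not
    # produce new requests in this region in parallel, or this may spin.
    while reqlist_wait_one(reqs, offset, nbytes):
        pass
```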

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Nikita Lapshin 
Reviewed-by: Hanna Reitz 
---
 include/block/reqlist.h | 8 
 block/reqlist.c | 8 
 2 files changed, 16 insertions(+)

diff --git a/include/block/reqlist.h b/include/block/reqlist.h
index 0fa1eef259..5253497bae 100644
--- a/include/block/reqlist.h
+++ b/include/block/reqlist.h
@@ -53,6 +53,14 @@ BlockReq *reqlist_find_conflict(BlockReqList *reqs, int64_t 
offset,
 bool coroutine_fn reqlist_wait_one(BlockReqList *reqs, int64_t offset,
int64_t bytes, CoMutex *lock);
 
+/*
+ * Wait for all intersecting requests. It just calls reqlist_wait_one() in a
+ * loop; the caller is responsible for not producing new requests in this
+ * region in parallel, otherwise reqlist_wait_all() may never return.
+ */
+void coroutine_fn reqlist_wait_all(BlockReqList *reqs, int64_t offset,
+   int64_t bytes, CoMutex *lock);
+
 /*
  * Shrink request and wake all waiting coroutines (maybe some of them are not
  * intersecting with shrunk request).
diff --git a/block/reqlist.c b/block/reqlist.c
index 09fecbd48c..08cb57cfa4 100644
--- a/block/reqlist.c
+++ b/block/reqlist.c
@@ -58,6 +58,14 @@ bool coroutine_fn reqlist_wait_one(BlockReqList *reqs, 
int64_t offset,
 return true;
 }
 
+void coroutine_fn reqlist_wait_all(BlockReqList *reqs, int64_t offset,
+   int64_t bytes, CoMutex *lock)
+{
+while (reqlist_wait_one(reqs, offset, bytes, lock)) {
+/* continue */
+}
+}
+
 void coroutine_fn reqlist_shrink_req(BlockReq *req, int64_t new_bytes)
 {
 if (new_bytes == req->bytes) {
-- 
2.31.1




[PATCH v4 08/18] block/dirty-bitmap: introduce bdrv_dirty_bitmap_status()

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Add a convenience function, similar to bdrv_block_status(), to get the
status of a dirty bitmap.
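The contract matches bdrv_block_status(): report whether the bitmap is dirty at the queried offset and how long the run of identical bits is. A small Python model of that behavior (a sketch only; the real code below builds on hbitmap_next_dirty() and hbitmap_next_zero()):

```python
def bitmap_status(bits, start, count):
    """Return (dirty, pnum): whether bits[start] is set, and the length
    of the run of identical bits within [start, start + count)."""
    assert 0 <= start and count > 0 and start + count <= len(bits)
    first = bits[start]
    pnum = 1
    while pnum < count and bits[start + pnum] == first:
        pnum += 1
    return first, pnum
```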

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/dirty-bitmap.h |  2 ++
 include/qemu/hbitmap.h   | 12 
 block/dirty-bitmap.c |  6 ++
 util/hbitmap.c   | 33 +
 4 files changed, 53 insertions(+)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index f95d350b70..6528336c4c 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -115,6 +115,8 @@ int64_t bdrv_dirty_bitmap_next_zero(BdrvDirtyBitmap 
*bitmap, int64_t offset,
 bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap *bitmap,
 int64_t start, int64_t end, int64_t max_dirty_count,
 int64_t *dirty_start, int64_t *dirty_count);
+bool bdrv_dirty_bitmap_status(BdrvDirtyBitmap *bitmap, int64_t offset,
+  int64_t bytes, int64_t *count);
 BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   Error **errp);
 
diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index 5e71b6d6f7..5bd986aa44 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -340,6 +340,18 @@ bool hbitmap_next_dirty_area(const HBitmap *hb, int64_t 
start, int64_t end,
  int64_t max_dirty_count,
  int64_t *dirty_start, int64_t *dirty_count);
 
+/*
+ * hbitmap_status:
+ * @hb: The HBitmap to operate on
+ * @start: The bit to start from
+ * @count: Number of bits to process
+ * @pnum: Out-parameter. How many bits have the same value, starting from @start
+ *
+ * Returns true if the bitmap is dirty at @start, false otherwise.
+ */
+bool hbitmap_status(const HBitmap *hb, int64_t start, int64_t count,
+int64_t *pnum);
+
 /**
  * hbitmap_iter_next:
  * @hbi: HBitmapIter to operate on.
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 94a0276833..08d56845ad 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -875,6 +875,12 @@ bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap 
*bitmap,
dirty_start, dirty_count);
 }
 
+bool bdrv_dirty_bitmap_status(BdrvDirtyBitmap *bitmap, int64_t offset,
+  int64_t bytes, int64_t *count)
+{
+return hbitmap_status(bitmap->bitmap, offset, bytes, count);
+}
+
 /**
  * bdrv_merge_dirty_bitmap: merge src into dest.
  * Ensures permissions on bitmaps are reasonable; use for public API.
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 305b894a63..dd0501d9a7 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -301,6 +301,39 @@ bool hbitmap_next_dirty_area(const HBitmap *hb, int64_t 
start, int64_t end,
 return true;
 }
 
+bool hbitmap_status(const HBitmap *hb, int64_t start, int64_t count,
+int64_t *pnum)
+{
+int64_t next_dirty, next_zero;
+
+assert(start >= 0);
+assert(count > 0);
+assert(start + count <= hb->orig_size);
+
+next_dirty = hbitmap_next_dirty(hb, start, count);
+if (next_dirty == -1) {
+*pnum = count;
+return false;
+}
+
+if (next_dirty > start) {
+*pnum = next_dirty - start;
+return false;
+}
+
+assert(next_dirty == start);
+
+next_zero = hbitmap_next_zero(hb, start, count);
+if (next_zero == -1) {
+*pnum = count;
+return true;
+}
+
+assert(next_zero > start);
+*pnum = next_zero - start;
+return true;
+}
+
 bool hbitmap_empty(const HBitmap *hb)
 {
 return hb->count == 0;
-- 
2.31.1




[PATCH v4 12/18] block: copy-before-write: realize snapshot-access API

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Current scheme of image fleecing looks like this:

[guest][NBD export]
  |  |
  |root  | root
  v  v
[copy-before-write] -> [temp.qcow2]
  | target  |
  |file |backing
  v |
[active disk] <-+

 - On guest writes, the copy-before-write filter copies old data from the
   active disk to temp.qcow2. The fleecing client (NBD export) then reads
   changed regions from the temp.qcow2 image and unchanged regions from
   the active disk through the backing link.

This patch makes possible new image fleecing scheme:

[guest]   [NBD export]
   ||
   | root   | root
   v file   v
[copy-before-write]<--[x-snapshot-access]
   |   |
   | file  | target
   v   v
[active-disk] [temp.img]

 - copy-before-write does CBW operations and also provides
   snapshot-access API. The API may be accessed through
   x-snapshot-access driver.

Benefits of the new scheme:

1. Access control: if the remote client tries to read data that is not
   covered by the original dirty bitmap used on copy-before-write open,
   the client gets -EACCES.

2. Discard support: if the remote client issues a DISCARD, then in
   addition to discarding the data in temp.img, it informs the
   block-copy process not to copy these clusters. A subsequent read
   from the discarded area will return -EACCES. This is significant:
   when the fleecing user reads data that was not yet copied to
   temp.img, we can avoid copying it on a further guest write.

3. Synchronisation between client reads and block-copy writes is more
   efficient. In the old scheme we just rely on the
   BDRV_REQ_SERIALISING flag used for writes to temp.qcow2. The new
   scheme is less blocking:
 - fleecing reads are never blocked: if the data region is untouched
   or in flight, we just read from the active disk, otherwise we read
   from temp.img
 - writes to temp.img are not blocked by fleecing reads
 - still, guest writes are of course blocked by in-flight fleecing
   reads that currently read from the active disk - that's the minimum
   necessary blocking

4. The temporary image may be of any format, as we don't rely on the
   backing feature.

5. Permission relations are simplified. With the old scheme we have to
   share the write permission on the target child of copy-before-write,
   otherwise the backing link conflicts with the copy-before-write file
   child write permissions. With the new scheme we don't have a backing
   link, and the copy-before-write node may have unshared access to the
   temporary node. (Not realized in this commit, will be in future.)

6. Having control over fleecing reads, we'll be able to implement
   alternative behavior on failed copy-before-write operations.
   Currently we just fail the guest request (that's the historical
   behavior of backup). But in some scenarios that's bad behavior:
   better to mark the backup as failed but not break the guest request.
   With the new scheme we can simply unset some bits in a bitmap on CBW
   failure, and further fleecing reads will return -EACCES, or
   something like this. (Not implemented in this commit, will be in
   future.) An additional application of this is implementing a timeout
   for CBW operations.
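Points 1 and 2 boil down to a simple decision on each fleecing read: regions outside the access bitmap yield -EACCES, regions already copied are served from temp.img, and everything else is read live from the active disk. A per-cluster Python sketch of that dispatch (hypothetical names; the real logic in the copy-before-write filter works on byte ranges and bitmaps, not sets of cluster indices):

```python
EACCES = -13  # errno value returned to the fleecing client

def fleecing_read_source(cluster, access_bitmap, done_bitmap):
    """Decide where a fleecing read of one cluster is served from."""
    if cluster not in access_bitmap:   # not covered, or discarded by client
        return EACCES
    if cluster in done_bitmap:         # already copied by CBW operations
        return 'temp.img'
    return 'active-disk'               # untouched: snapshotted data is live

def fleecing_discard(cluster, access_bitmap, copy_bitmap):
    # DISCARD drops read access and also tells block-copy not to copy
    # the cluster on a later guest write.
    access_bitmap.discard(cluster)
    copy_bitmap.discard(cluster)
```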

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/copy-before-write.c | 212 +-
 1 file changed, 211 insertions(+), 1 deletion(-)

diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 91a2288b66..a8c88f64eb 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -33,12 +33,37 @@
 #include "block/block-copy.h"
 
 #include "block/copy-before-write.h"
+#include "block/reqlist.h"
 
 #include "qapi/qapi-visit-block-core.h"
 
 typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
+
+/*
+ * @lock: protects access to @access_bitmap, @done_bitmap and
+ * @frozen_read_reqs
+ */
+CoMutex lock;
+
+/*
+ * @access_bitmap: represents areas allowed for reading by fleecing user.
+ * Reading from non-dirty areas leads to -EACCES.
+ */
+BdrvDirtyBitmap *access_bitmap;
+
+/*
+ * @done_bitmap: represents areas that was successfully copied to @target 
by
+ * copy-before-write operations.
+ */
+BdrvDirtyBitmap *done_bitmap;
+
+/*
+ * @frozen_read_reqs: current read requests for fleecing user in bs->file
+ * node. These areas must not be rewritten by guest.
+ */
+BlockReqList frozen_read_reqs;
 } BDRVCopyBeforeWriteState;
 
 static coroutine_fn int cbw_co_preadv(
@@ -48,10 +73,20 @@ static coroutine_fn int cbw_co_preadv(
 return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
 }
 
+/*
+ * Do copy-before-write operation.
+ *
+ * On failure guest request must be failed too.
+ *
+ * On success, we also wait for all in-flight fleecing read requests in source
+ * node, and it's guaranteed that aft

[PATCH v4 16/18] block: blk_root(): return non-const pointer

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
In the following patch we'll want to pass blk children to block-copy.
Const pointers are not enough, so return a non-const pointer from
blk_root().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/sysemu/block-backend.h | 2 +-
 block/block-backend.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index e5e1524f06..904d70f49c 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -277,7 +277,7 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, 
int64_t off_in,
int64_t bytes, BdrvRequestFlags read_flags,
BdrvRequestFlags write_flags);
 
-const BdrvChild *blk_root(BlockBackend *blk);
+BdrvChild *blk_root(BlockBackend *blk);
 
 int blk_make_empty(BlockBackend *blk, Error **errp);
 
diff --git a/block/block-backend.c b/block/block-backend.c
index 4ff6b4d785..97913acfcd 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2464,7 +2464,7 @@ int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, 
int64_t off_in,
   bytes, read_flags, write_flags);
 }
 
-const BdrvChild *blk_root(BlockBackend *blk)
+BdrvChild *blk_root(BlockBackend *blk)
 {
 return blk->root;
 }
-- 
2.31.1




[PATCH v4 18/18] iotests/image-fleecing: test push backup with fleecing

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/tests/image-fleecing | 121 ++--
 tests/qemu-iotests/tests/image-fleecing.out |  63 ++
 2 files changed, 152 insertions(+), 32 deletions(-)

diff --git a/tests/qemu-iotests/tests/image-fleecing 
b/tests/qemu-iotests/tests/image-fleecing
index 33995612be..903cd50be9 100755
--- a/tests/qemu-iotests/tests/image-fleecing
+++ b/tests/qemu-iotests/tests/image-fleecing
@@ -49,9 +49,15 @@ remainder = [('0xd5', '0x108000',  '32k'), # Right-end of 
partial-left [1]
  ('0xdc', '32M',   '32k'), # Left-end of partial-right [2]
  ('0xcd', '0x3ff', '64k')] # patterns[3]
 
-def do_test(use_cbw, use_snapshot_access_filter, base_img_path,
-fleece_img_path, nbd_sock_path, vm,
+def do_test(vm, use_cbw, use_snapshot_access_filter, base_img_path,
+fleece_img_path, nbd_sock_path=None,
+target_img_path=None,
 bitmap=False):
+push_backup = target_img_path is not None
+assert (nbd_sock_path is not None) != push_backup
+if push_backup:
+assert use_cbw
+
 log('--- Setting up images ---')
 log('')
 
@@ -65,6 +71,9 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 else:
 assert qemu_img('create', '-f', 'qcow2', fleece_img_path, '64M') == 0
 
+if push_backup:
+assert qemu_img('create', '-f', 'qcow2', target_img_path, '64M') == 0
+
 for p in patterns:
 qemu_io('-f', iotests.imgfmt,
 '-c', 'write -P%s %s %s' % p, base_img_path)
@@ -139,27 +148,45 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 
 export_node = 'fl-access' if use_snapshot_access_filter else tmp_node
 
-log('')
-log('--- Setting up NBD Export ---')
-log('')
+if push_backup:
+log('')
+log('--- Starting actual backup ---')
+log('')
 
-nbd_uri = 'nbd+unix:///%s?socket=%s' % (export_node, nbd_sock_path)
-log(vm.qmp('nbd-server-start',
-   {'addr': {'type': 'unix',
- 'data': {'path': nbd_sock_path}}}))
+log(vm.qmp('blockdev-add', **{
+'driver': iotests.imgfmt,
+'node-name': 'target',
+'file': {
+'driver': 'file',
+'filename': target_img_path
+}
+}))
+log(vm.qmp('blockdev-backup', device=export_node,
+   sync='full', target='target',
+   immutable_source=True,
+   job_id='push-backup', speed=1))
+else:
+log('')
+log('--- Setting up NBD Export ---')
+log('')
 
-log(vm.qmp('nbd-server-add', device=export_node))
+nbd_uri = 'nbd+unix:///%s?socket=%s' % (export_node, nbd_sock_path)
+log(vm.qmp('nbd-server-start',
+   {'addr': { 'type': 'unix',
+  'data': { 'path': nbd_sock_path } } }))
 
-log('')
-log('--- Sanity Check ---')
-log('')
+log(vm.qmp('nbd-server-add', device=export_node))
 
-for p in patterns + zeroes:
-cmd = 'read -P%s %s %s' % p
-log(cmd)
-out, ret = qemu_io_pipe_and_status('-r', '-f', 'raw', '-c', cmd, 
nbd_uri)
-if ret != 0:
-print(out)
+log('')
+log('--- Sanity Check ---')
+log('')
+
+for p in patterns + zeroes:
+cmd = 'read -P%s %s %s' % p
+log(cmd)
+out, ret = qemu_io_pipe_and_status('-r', '-f', 'raw', '-c', cmd, 
nbd_uri)
+if ret != 0:
+print(out)
 
 log('')
 log('--- Testing COW ---')
@@ -170,6 +197,20 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 log(cmd)
 log(vm.hmp_qemu_io(qom_path, cmd, qdev=True))
 
+if push_backup:
+# Check that previous operations were done during backup, not after
+result = vm.qmp('query-block-jobs')
+if len(result['return']) != 1:
+log('Backup finished too fast, COW is not tested')
+
+result = vm.qmp('block-job-set-speed', device='push-backup', speed=0)
+assert result == {'return': {}}
+
+log(vm.event_wait(name='BLOCK_JOB_COMPLETED',
+  match={'data': {'device': 'push-backup'}}),
+  filters=[iotests.filter_qmp_event])
+log(vm.qmp('blockdev-del', node_name='target'))
+
 log('')
 log('--- Verifying Data ---')
 log('')
@@ -177,15 +218,19 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 for p in patterns + zeroes:
 cmd = 'read -P%s %s %s' % p
 log(cmd)
-out, ret = qemu_io_pipe_and_status('-r', '-f', 'raw', '-c', cmd, 
nbd_uri)
-if ret != 0:
-print(out)
+if push_backup:
+assert qemu_io_silent('-r', '-c', cmd, target_img_path) == 0
+else:
+out, ret = qemu_io_pipe_and_status('-r', '-f', '

[PATCH v4 17/18] qapi: backup: add immutable-source parameter

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
We are on the way to implementing internal backup with the fleecing
scheme, which includes a backup job copying from the fleecing block
driver node (which is the target of the copy-before-write filter) to
the final target of the backup. This job doesn't need its own filter,
as the fleecing block driver node is a kind of snapshot: it's immutable
from the reader's point of view.

Let's add a parameter for backup to not insert a filter but instead
unshare writes on the source. This way the backup job becomes a simple
copying process.
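With the new parameter, a push backup from an immutable fleecing node needs no extra filter. The QMP arguments might be built as in this sketch (based on this series' test; node names and defaults are illustrative):

```python
def make_backup_args(source_node, target_node, immutable=True):
    """Build blockdev-backup arguments for a filterless backup from an
    immutable source (e.g. a snapshot-access fleecing node)."""
    args = {
        'job-id': 'push-backup',
        'device': source_node,   # e.g. the 'fl-access' snapshot-access node
        'target': target_node,
        'sync': 'full',
    }
    if immutable:
        # No copy-before-write filter is inserted; writers on the source
        # are simply unshared for the duration of the job, so
        # filter-node-name must not be set.
        args['immutable-source'] = True
    return args
```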

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json  | 11 ++-
 include/block/block_int.h |  1 +
 block/backup.c| 61 +++
 block/replication.c   |  2 +-
 blockdev.c|  1 +
 5 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index a904755e98..30d44683bf 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1436,6 +1436,15 @@
 #above node specified by @drive. If this option is not 
given,
 #a node name is autogenerated. (Since: 4.2)
 #
+# @immutable-source: If true, assume source is immutable, and don't insert 
filter
+#as no copy-before-write operations are needed. It will
+#fail if there are existing writers on source node.
+#Any attempt to add writer to source node during backup 
will
+#also fail. @filter-node-name must not be set.
+#If false, insert copy-before-write filter above source 
node
+#(see also @filter-node-name parameter).
+#Default is false. (Since 6.2)
+#
 # @x-perf: Performance options. (Since 6.0)
 #
 # Features:
@@ -1455,7 +1464,7 @@
 '*on-source-error': 'BlockdevOnError',
 '*on-target-error': 'BlockdevOnError',
 '*auto-finalize': 'bool', '*auto-dismiss': 'bool',
-'*filter-node-name': 'str',
+'*filter-node-name': 'str', '*immutable-source': 'bool',
 '*x-perf': { 'type': 'BackupPerf',
  'features': [ 'unstable' ] } } }
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index c43315ae6e..0270af29ae 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1348,6 +1348,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 BitmapSyncMode bitmap_mode,
 bool compress,
 const char *filter_node_name,
+bool immutable_source,
 BackupPerf *perf,
 BlockdevOnError on_source_error,
 BlockdevOnError on_target_error,
diff --git a/block/backup.c b/block/backup.c
index 21d5983779..104f8fd835 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -34,6 +34,14 @@ typedef struct BackupBlockJob {
 BlockDriverState *cbw;
 BlockDriverState *source_bs;
 BlockDriverState *target_bs;
+BlockBackend *source_blk;
+BlockBackend *target_blk;
+/*
+ * Note that if backup runs with filter (immutable-source parameter is
+ * false), @cbw is set but @source_blk and @target_blk are NULL.
+ * Otherwise if backup runs without filter (immutable-source parameter is
+ * true), @cbw is NULL but @source_blk and @target_blk are set.
+ */
 
 BdrvDirtyBitmap *sync_bitmap;
 
@@ -102,7 +110,17 @@ static void backup_clean(Job *job)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 block_job_remove_all_bdrv(&s->common);
-bdrv_cbw_drop(s->cbw);
+if (s->cbw) {
+assert(!s->source_blk && !s->target_blk);
+bdrv_cbw_drop(s->cbw);
+} else {
+block_copy_state_free(s->bcs);
+s->bcs = NULL;
+blk_unref(s->source_blk);
+s->source_blk = NULL;
+blk_unref(s->target_blk);
+s->target_blk = NULL;
+}
 }
 
 void backup_do_checkpoint(BlockJob *job, Error **errp)
@@ -357,6 +375,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
   BitmapSyncMode bitmap_mode,
   bool compress,
   const char *filter_node_name,
+  bool immutable_source,
   BackupPerf *perf,
   BlockdevOnError on_source_error,
   BlockdevOnError on_target_error,
@@ -369,6 +388,7 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 int64_t cluster_size;
 BlockDriverState *cbw = NULL;
 BlockCopyState *bcs = NULL;
+BlockBackend *source_blk = NULL, *target_blk = NULL;
 
 assert(bs);
 assert(target);
@@ -377,6 +397,12 @@ BlockJob *backup_job_create(const char *job_id, 
BlockDriverState *bs,
 assert(sync_mode != MIRROR_SYNC_MODE_INCREMENTAL);
 assert(sync_bitmap || sync_mode != MIRROR_SYNC_MODE_BITMAP);
 
+if (

[PATCH v4 05/18] block/block-copy: add block_copy_reset()

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Split block_copy_reset() out of block_copy_reset_unallocated() to be
used separately later.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 include/block/block-copy.h |  1 +
 block/block-copy.c | 21 +
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index b80ad02299..68bbd344b2 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -35,6 +35,7 @@ void block_copy_set_progress_meter(BlockCopyState *s, 
ProgressMeter *pm);
 
 void block_copy_state_free(BlockCopyState *s);
 
+void block_copy_reset(BlockCopyState *s, int64_t offset, int64_t bytes);
 int64_t block_copy_reset_unallocated(BlockCopyState *s,
  int64_t offset, int64_t *count);
 
diff --git a/block/block-copy.c b/block/block-copy.c
index 8aa6ee6a5c..0834e29b6e 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -692,6 +692,18 @@ static int block_copy_is_cluster_allocated(BlockCopyState 
*s, int64_t offset,
 }
 }
 
+void block_copy_reset(BlockCopyState *s, int64_t offset, int64_t bytes)
+{
+QEMU_LOCK_GUARD(&s->lock);
+
+bdrv_reset_dirty_bitmap(s->copy_bitmap, offset, bytes);
+if (s->progress) {
+progress_set_remaining(s->progress,
+   bdrv_get_dirty_count(s->copy_bitmap) +
+   s->in_flight_bytes);
+}
+}
+
 /*
  * Reset bits in copy_bitmap starting at offset if they represent unallocated
  * data in the image. May reset subsequent contiguous bits.
@@ -712,14 +724,7 @@ int64_t block_copy_reset_unallocated(BlockCopyState *s,
 bytes = clusters * s->cluster_size;
 
 if (!ret) {
-qemu_co_mutex_lock(&s->lock);
-bdrv_reset_dirty_bitmap(s->copy_bitmap, offset, bytes);
-if (s->progress) {
-progress_set_remaining(s->progress,
-   bdrv_get_dirty_count(s->copy_bitmap) +
-   s->in_flight_bytes);
-}
-qemu_co_mutex_unlock(&s->lock);
+block_copy_reset(s, offset, bytes);
 }
 
 *count = bytes;
-- 
2.31.1




[PATCH v4 10/18] block/io: introduce block driver snapshot-access API

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Add new block driver handlers and corresponding generic wrappers.
It will be used to allow the copy-before-write filter to provide a
rich fleecing interface in a further commit.

In the future this approach may be used to allow reading qcow2 internal
snapshots, for example to export them through NBD.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h | 27 +++
 block/io.c| 69 +++
 2 files changed, 96 insertions(+)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 27008cfb22..c43315ae6e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -376,6 +376,24 @@ struct BlockDriver {
  */
 void (*bdrv_cancel_in_flight)(BlockDriverState *bs);
 
+/*
+ * Snapshot-access API.
+ *
+ * A block driver may provide a snapshot-access API: special functions to
+ * access some internal "snapshot". The functions are similar to the normal
+ * read/block_status/discard handlers, but don't have any specific handling
+ * in the generic block layer: no serializing, no alignment, no tracked
+ * requests. So, a block driver that implements these APIs is fully
+ * responsible for synchronization with normal I/O requests.
+ */
+int coroutine_fn (*bdrv_co_preadv_snapshot)(BlockDriverState *bs,
+int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset);
+int coroutine_fn (*bdrv_co_snapshot_block_status)(BlockDriverState *bs,
+bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
+int64_t *map, BlockDriverState **file);
+int coroutine_fn (*bdrv_co_pdiscard_snapshot)(BlockDriverState *bs,
+int64_t offset, int64_t bytes);
+
 /*
  * Invalidate any cached meta-data.
  */
@@ -1078,6 +1096,15 @@ extern BlockDriver bdrv_file;
 extern BlockDriver bdrv_raw;
 extern BlockDriver bdrv_qcow2;
 
+int coroutine_fn bdrv_co_preadv_snapshot(BdrvChild *child,
+int64_t offset, int64_t bytes, QEMUIOVector *qiov, size_t qiov_offset);
+int coroutine_fn bdrv_co_snapshot_block_status(BlockDriverState *bs,
+bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
+int64_t *map, BlockDriverState **file);
+int coroutine_fn bdrv_co_pdiscard_snapshot(BlockDriverState *bs,
+int64_t offset, int64_t bytes);
+
+
 int coroutine_fn bdrv_co_preadv(BdrvChild *child,
 int64_t offset, int64_t bytes, QEMUIOVector *qiov,
 BdrvRequestFlags flags);
diff --git a/block/io.c b/block/io.c
index 4e4cb556c5..0bcf09a491 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3587,3 +3587,72 @@ void bdrv_cancel_in_flight(BlockDriverState *bs)
 bs->drv->bdrv_cancel_in_flight(bs);
 }
 }
+
+int coroutine_fn
+bdrv_co_preadv_snapshot(BdrvChild *child, int64_t offset, int64_t bytes,
+QEMUIOVector *qiov, size_t qiov_offset)
+{
+BlockDriverState *bs = child->bs;
+BlockDriver *drv = bs->drv;
+int ret;
+
+if (!drv) {
+return -ENOMEDIUM;
+}
+
+if (!drv->bdrv_co_preadv_snapshot) {
+return -ENOTSUP;
+}
+
+bdrv_inc_in_flight(bs);
+ret = drv->bdrv_co_preadv_snapshot(bs, offset, bytes, qiov, qiov_offset);
+bdrv_dec_in_flight(bs);
+
+return ret;
+}
+
+int coroutine_fn
+bdrv_co_snapshot_block_status(BlockDriverState *bs,
+  bool want_zero, int64_t offset, int64_t bytes,
+  int64_t *pnum, int64_t *map,
+  BlockDriverState **file)
+{
+BlockDriver *drv = bs->drv;
+int ret;
+
+if (!drv) {
+return -ENOMEDIUM;
+}
+
+if (!drv->bdrv_co_snapshot_block_status) {
+return -ENOTSUP;
+}
+
+bdrv_inc_in_flight(bs);
+ret = drv->bdrv_co_snapshot_block_status(bs, want_zero, offset, bytes,
+ pnum, map, file);
+bdrv_dec_in_flight(bs);
+
+return ret;
+}
+
+int coroutine_fn
+bdrv_co_pdiscard_snapshot(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+BlockDriver *drv = bs->drv;
+int ret;
+
+if (!drv) {
+return -ENOMEDIUM;
+}
+
+if (!drv->bdrv_co_pdiscard_snapshot) {
+return -ENOTSUP;
+}
+
+bdrv_inc_in_flight(bs);
+ret = drv->bdrv_co_pdiscard_snapshot(bs, offset, bytes);
+bdrv_dec_in_flight(bs);
+
+return ret;
+}
-- 
2.31.1


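[Editorial note: the three generic wrappers above share one pattern — fail with -ENOMEDIUM when no driver is attached, fail with -ENOTSUP when the driver does not implement the optional handler, and bracket the driver call with in-flight accounting. A minimal Python model of that dispatch logic; all names here are illustrative, not the QEMU API:]

```python
ENOMEDIUM, ENOTSUP = 123, 95    # Linux errno values, for illustration only

class BlockDriverState:
    def __init__(self, drv=None):
        self.drv = drv          # attached driver, or None ("no medium")
        self.in_flight = 0      # request counter, cf. bdrv_inc/dec_in_flight()

def preadv_snapshot(bs, offset, nbytes):
    """Generic wrapper mirroring the structure of bdrv_co_preadv_snapshot()."""
    if bs.drv is None:
        return -ENOMEDIUM                        # no driver attached
    handler = getattr(bs.drv, 'preadv_snapshot', None)
    if handler is None:
        return -ENOTSUP                          # driver lacks the optional API
    bs.in_flight += 1                            # bracket the driver call
    try:
        return handler(offset, nbytes)           # driver-specific implementation
    finally:
        bs.in_flight -= 1

class SnapshotDriver:
    def preadv_snapshot(self, offset, nbytes):
        return nbytes                            # pretend the read succeeded

class PlainDriver:
    pass                                         # no snapshot-access support
```

The block-status and discard wrappers in the patch follow the same skeleton.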


[PATCH v4 04/18] block/copy-before-write: add bitmap open parameter

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
This brings an "incremental" mode to the copy-before-write filter: the user
can specify a bitmap so that the filter will copy only "dirty" areas.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json  | 10 +++-
 block/copy-before-write.c | 51 ++-
 2 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 9a5a3641d0..3bab597506 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4171,11 +4171,19 @@
 #
 # @target: The target for copy-before-write operations.
 #
+# @bitmap: If specified, the copy-before-write filter will do
+#  copy-before-write operations only for dirty regions of the
+#  bitmap. The bitmap size must be equal to the length of the file
+#  and target children of the filter. Note also that the bitmap is
+#  used only to initialize the filter's internal bitmap, so further
+#  modification (or removal) of the specified bitmap does not
+#  influence the filter.
+#
 # Since: 6.2
 ##
 { 'struct': 'BlockdevOptionsCbw',
   'base': 'BlockdevOptionsGenericFormat',
-  'data': { 'target': 'BlockdevRef' } }
+  'data': { 'target': 'BlockdevRef', '*bitmap': 'BlockDirtyBitmap' } }
 
 ##
 # @BlockdevOptions:
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 799223e3fb..91a2288b66 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -34,6 +34,8 @@
 
 #include "block/copy-before-write.h"
 
+#include "qapi/qapi-visit-block-core.h"
+
 typedef struct BDRVCopyBeforeWriteState {
 BlockCopyState *bcs;
 BdrvChild *target;
@@ -145,10 +147,53 @@ static void cbw_child_perm(BlockDriverState *bs, 
BdrvChild *c,
 }
 }
 
+static bool cbw_parse_bitmap_option(QDict *options, BdrvDirtyBitmap **bitmap,
+Error **errp)
+{
+QDict *bitmap_qdict = NULL;
+BlockDirtyBitmap *bmp_param = NULL;
+Visitor *v = NULL;
+bool ret = false;
+
+*bitmap = NULL;
+
+qdict_extract_subqdict(options, &bitmap_qdict, "bitmap.");
+if (!qdict_size(bitmap_qdict)) {
+ret = true;
+goto out;
+}
+
+v = qobject_input_visitor_new_flat_confused(bitmap_qdict, errp);
+if (!v) {
+goto out;
+}
+
+visit_type_BlockDirtyBitmap(v, NULL, &bmp_param, errp);
+if (!bmp_param) {
+goto out;
+}
+
+*bitmap = block_dirty_bitmap_lookup(bmp_param->node, bmp_param->name, NULL,
+errp);
+if (!*bitmap) {
+goto out;
+}
+
+ret = true;
+
+out:
+qapi_free_BlockDirtyBitmap(bmp_param);
+visit_free(v);
+qobject_unref(bitmap_qdict);
+
+return ret;
+}
+
 static int cbw_open(BlockDriverState *bs, QDict *options, int flags,
 Error **errp)
 {
 BDRVCopyBeforeWriteState *s = bs->opaque;
+BdrvDirtyBitmap *bitmap = NULL;
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -163,6 +208,10 @@ static int cbw_open(BlockDriverState *bs, QDict *options, 
int flags,
 return -EINVAL;
 }
 
+if (!cbw_parse_bitmap_option(options, &bitmap, errp)) {
+return -EINVAL;
+}
+
 bs->total_sectors = bs->file->bs->total_sectors;
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
@@ -170,7 +219,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, 
int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
  bs->file->bs->supported_zero_flags);
 
-s->bcs = block_copy_state_new(bs->file, s->target, NULL, errp);
+s->bcs = block_copy_state_new(bs->file, s->target, bitmap, errp);
 if (!s->bcs) {
 error_prepend(errp, "Cannot create block-copy-state: ");
 return -EINVAL;
-- 
2.31.1


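[Editorial note: the behaviour added by this patch — copy-before-write only for clusters that are dirty in the user-supplied bitmap, with that bitmap merely seeding the filter's internal state at open time — can be modelled in a few lines of Python. This is a toy sketch, not the QEMU implementation; the cluster size and names are made up:]

```python
CLUSTER = 64 * 1024

class CopyBeforeWrite:
    """Toy CBW filter: before the guest overwrites a still-dirty cluster,
    its old contents are copied from source to target."""
    def __init__(self, source, target, bitmap=None):
        self.source, self.target = source, target
        nclusters = len(source) // CLUSTER
        # The user bitmap only seeds the internal bitmap; modifying the
        # original afterwards would not influence the filter.
        self.dirty = set(bitmap) if bitmap is not None else set(range(nclusters))

    def write(self, offset, data):
        first = offset // CLUSTER
        last = (offset + len(data) - 1) // CLUSTER
        for c in range(first, last + 1):
            if c in self.dirty:                         # copy-before-write
                self.target[c * CLUSTER:(c + 1) * CLUSTER] = \
                    self.source[c * CLUSTER:(c + 1) * CLUSTER]
                self.dirty.discard(c)                   # copy each cluster once
        self.source[offset:offset + len(data)] = data   # pass the write through
```

With `bitmap={1}`, a write to cluster 0 passes straight through, while a write to cluster 1 first preserves the old data in the target.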


[PATCH v4 15/18] iotests/image-fleecing: add test case with bitmap

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Note that reading zero areas (not dirty in the bitmap) fails; that's
correct.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/tests/image-fleecing | 32 ++--
 tests/qemu-iotests/tests/image-fleecing.out | 84 +
 2 files changed, 108 insertions(+), 8 deletions(-)

diff --git a/tests/qemu-iotests/tests/image-fleecing 
b/tests/qemu-iotests/tests/image-fleecing
index 909fc0a7ad..33995612be 100755
--- a/tests/qemu-iotests/tests/image-fleecing
+++ b/tests/qemu-iotests/tests/image-fleecing
@@ -23,7 +23,7 @@
 # Creator/Owner: John Snow 
 
 import iotests
-from iotests import log, qemu_img, qemu_io, qemu_io_silent
+from iotests import log, qemu_img, qemu_io, qemu_io_silent, qemu_io_pipe_and_status
 
 iotests.script_initialize(
 supported_fmts=['qcow2', 'qcow', 'qed', 'vmdk', 'vhdx', 'raw'],
remainder = [('0xd5', '0x108000',  '32k'), # Right-end of partial-left [1]
  ('0xcd', '0x3ff', '64k')] # patterns[3]
 
 def do_test(use_cbw, use_snapshot_access_filter, base_img_path,
-fleece_img_path, nbd_sock_path, vm):
+fleece_img_path, nbd_sock_path, vm,
+bitmap=False):
 log('--- Setting up images ---')
 log('')
 
 assert qemu_img('create', '-f', iotests.imgfmt, base_img_path, '64M') == 0
+if bitmap:
+assert qemu_img('bitmap', '--add', base_img_path, 'bitmap0') == 0
+
 if use_snapshot_access_filter:
 assert use_cbw
 assert qemu_img('create', '-f', 'raw', fleece_img_path, '64M') == 0
@@ -106,12 +110,17 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 
 # Establish CBW from source to fleecing node
 if use_cbw:
-log(vm.qmp('blockdev-add', {
+fl_cbw = {
 'driver': 'copy-before-write',
 'node-name': 'fl-cbw',
 'file': src_node,
 'target': tmp_node
-}))
+}
+
+if bitmap:
+fl_cbw['bitmap'] = {'node': src_node, 'name': 'bitmap0'}
+
+log(vm.qmp('blockdev-add', fl_cbw))
 
 log(vm.qmp('qom-set', path=qom_path, property='drive', value='fl-cbw'))
 
@@ -148,7 +157,9 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 for p in patterns + zeroes:
 cmd = 'read -P%s %s %s' % p
 log(cmd)
-assert qemu_io_silent('-r', '-f', 'raw', '-c', cmd, nbd_uri) == 0
+out, ret = qemu_io_pipe_and_status('-r', '-f', 'raw', '-c', cmd, nbd_uri)
+if ret != 0:
+print(out)
 
 log('')
 log('--- Testing COW ---')
@@ -166,7 +177,9 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 for p in patterns + zeroes:
 cmd = 'read -P%s %s %s' % p
 log(cmd)
-assert qemu_io_silent('-r', '-f', 'raw', '-c', cmd, nbd_uri) == 0
+out, ret = qemu_io_pipe_and_status('-r', '-f', 'raw', '-c', cmd, nbd_uri)
+if ret != 0:
+print(out)
 
 log('')
 log('--- Cleanup ---')
@@ -201,14 +214,14 @@ def do_test(use_cbw, use_snapshot_access_filter, 
base_img_path,
 log('Done')
 
 
-def test(use_cbw, use_snapshot_access_filter):
+def test(use_cbw, use_snapshot_access_filter, bitmap=False):
 with iotests.FilePath('base.img') as base_img_path, \
  iotests.FilePath('fleece.img') as fleece_img_path, \
  iotests.FilePath('nbd.sock',
   base_dir=iotests.sock_dir) as nbd_sock_path, \
  iotests.VM() as vm:
 do_test(use_cbw, use_snapshot_access_filter, base_img_path,
-fleece_img_path, nbd_sock_path, vm)
+fleece_img_path, nbd_sock_path, vm, bitmap=bitmap)
 
 
 log('=== Test backup(sync=none) based fleecing ===\n')
@@ -219,3 +232,6 @@ test(True, False)
 
 log('=== Test fleecing-format based fleecing ===\n')
 test(True, True)
+
+log('=== Test fleecing-format based fleecing with bitmap ===\n')
+test(True, True, bitmap=True)
diff --git a/tests/qemu-iotests/tests/image-fleecing.out 
b/tests/qemu-iotests/tests/image-fleecing.out
index da0af93388..62e1c1fe42 100644
--- a/tests/qemu-iotests/tests/image-fleecing.out
+++ b/tests/qemu-iotests/tests/image-fleecing.out
@@ -190,6 +190,90 @@ read -P0 0x00f8000 32k
 read -P0 0x201 32k
 read -P0 0x3fe 64k
 
+--- Cleanup ---
+
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+
+--- Confirming writes ---
+
+read -P0xab 0 64k
+read -P0xad 0x00f8000 64k
+read -P0x1d 0x2008000 64k
+read -P0xea 0x3fe 64k
+read -P0xd5 0x108000 32k
+read -P0xdc 32M 32k
+read -P0xcd 0x3ff 64k
+
+Done
+=== Test fleecing-format based fleecing with bitmap ===
+
+--- Setting up images ---
+
+Done
+
+--- Launching VM ---
+
+Done
+
+--- Setting up Fleecing Graph ---
+
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"return": {}}
+
+--- Setting up NBD Export ---
+
+{"return": {}}
+{"return": {}}
+
+--- Sanity Check ---
+
+read -P0x5d 0 64k
+read -P0xd5 1M 64k
+read -P0xdc 32M 64k
+read -P0xcd 0x3ff000

[PATCH v4 14/18] iotests.py: add qemu_io_pipe_and_status()

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Add a helper that returns both status and output, to be used in the
following commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/iotests.py | 4 
 1 file changed, 4 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 6ba65eb1ff..23bc6f686f 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -278,6 +278,10 @@ def qemu_io(*args):
 '''Run qemu-io and return the stdout data'''
 return qemu_tool_pipe_and_status('qemu-io', qemu_io_wrap_args(args))[0]
 
+def qemu_io_pipe_and_status(*args):
+args = qemu_io_args + list(args)
+return qemu_tool_pipe_and_status('qemu-io', args)
+
 def qemu_io_log(*args):
 result = qemu_io(*args)
 log(result, filters=[filter_testfiles, filter_qemu_io])
-- 
2.31.1


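[Editorial note: the helper follows a common test-suite pattern — run a tool, capture its combined output, and hand both the output and the exit status back so a test can print diagnostics only on failure. A generic stand-alone sketch of that pattern using only the standard library, not iotests.py itself:]

```python
import subprocess
import sys

def tool_pipe_and_status(*args):
    """Run a command and return (combined stdout+stderr, exit status)."""
    res = subprocess.run(args, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, text=True)
    return res.stdout, res.returncode

# Print the output only when the command failed, as the fleecing test does.
out, ret = tool_pipe_and_status(sys.executable, '-c', 'print("ok")')
if ret != 0:
    print(out)
```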


[PATCH v4 07/18] block/reqlist: reqlist_find_conflict(): use ranges_overlap()

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Let's reuse the convenient helper.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/reqlist.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/reqlist.c b/block/reqlist.c
index 5e320ba649..09fecbd48c 100644
--- a/block/reqlist.c
+++ b/block/reqlist.c
@@ -13,6 +13,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/range.h"
 
 #include "block/reqlist.h"
 
@@ -35,7 +36,7 @@ BlockReq *reqlist_find_conflict(BlockReqList *reqs, int64_t 
offset,
 BlockReq *r;
 
 QLIST_FOREACH(r, reqs, list) {
-if (offset + bytes > r->offset && offset < r->offset + r->bytes) {
+if (ranges_overlap(offset, bytes, r->offset, r->bytes)) {
 return r;
 }
 }
-- 
2.31.1


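[Editorial note: the helper expresses exactly the half-open-interval intersection test that the hand-written condition encoded. The C helper lives in include/qemu/range.h; the Python sketch below only mirrors its semantics for positive sizes, and checks the equivalence exhaustively on small ranges:]

```python
def ranges_overlap(a_off, a_len, b_off, b_len):
    """True iff [a_off, a_off + a_len) and [b_off, b_off + b_len) intersect."""
    return a_off < b_off + b_len and b_off < a_off + a_len

def old_conflict(offset, nbytes, r_offset, r_bytes):
    """The open-coded condition that this patch replaces."""
    return offset + nbytes > r_offset and offset < r_offset + r_bytes

# Exhaustively confirm the two forms agree on small positive-size ranges.
for a_off in range(6):
    for a_len in range(1, 4):
        for b_off in range(6):
            for b_len in range(1, 4):
                assert ranges_overlap(a_off, a_len, b_off, b_len) == \
                    old_conflict(a_off, a_len, b_off, b_len)
```

Note that adjacent ranges such as [0, 2) and [2, 4) do not overlap under either form.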


[PATCH v4 06/18] block: introduce reqlist

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
Split the intersecting-requests functionality out of block-copy so that
it can be reused in the copy-before-write filter.

Note: while here, fix a tiny typo in MAINTAINERS.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/reqlist.h |  67 +++
 block/block-copy.c  | 116 +---
 block/reqlist.c |  76 ++
 MAINTAINERS |   4 +-
 block/meson.build   |   1 +
 5 files changed, 184 insertions(+), 80 deletions(-)
 create mode 100644 include/block/reqlist.h
 create mode 100644 block/reqlist.c

diff --git a/include/block/reqlist.h b/include/block/reqlist.h
new file mode 100644
index 00..0fa1eef259
--- /dev/null
+++ b/include/block/reqlist.h
@@ -0,0 +1,67 @@
+/*
+ * reqlist API
+ *
+ * Copyright (C) 2013 Proxmox Server Solutions
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Authors:
+ *  Dietmar Maurer (diet...@proxmox.com)
+ *  Vladimir Sementsov-Ogievskiy 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifndef REQLIST_H
+#define REQLIST_H
+
+#include "qemu/coroutine.h"
+
+/*
+ * The API is not thread-safe and shouldn't be. The struct is public so it can
+ * be part of other structures and protected by third-party locks; see
+ * block/block-copy.c for an example.
+ */
+
+typedef struct BlockReq {
+int64_t offset;
+int64_t bytes;
+
+CoQueue wait_queue; /* coroutines blocked on this req */
+QLIST_ENTRY(BlockReq) list;
+} BlockReq;
+
+typedef QLIST_HEAD(, BlockReq) BlockReqList;
+
+/*
+ * Initialize new request and add it to the list. Caller must be sure that
+ * there are no conflicting requests in the list.
+ */
+void reqlist_init_req(BlockReqList *reqs, BlockReq *req, int64_t offset,
+  int64_t bytes);
+/* Search for request in the list intersecting with @offset/@bytes area. */
+BlockReq *reqlist_find_conflict(BlockReqList *reqs, int64_t offset,
+int64_t bytes);
+
+/*
+ * If there are no intersecting requests return false. Otherwise, wait for the
+ * first found intersecting request to finish and return true.
+ *
+ * @lock is passed to qemu_co_queue_wait()
+ * A false return value guarantees that the lock was never released.
+ */
+bool coroutine_fn reqlist_wait_one(BlockReqList *reqs, int64_t offset,
+   int64_t bytes, CoMutex *lock);
+
+/*
+ * Shrink the request and wake all waiting coroutines (some of them may no
+ * longer intersect with the shrunk request).
+ */
+void coroutine_fn reqlist_shrink_req(BlockReq *req, int64_t new_bytes);
+
+/*
+ * Remove request and wake all waiting coroutines. Do not release any memory.
+ */
+void coroutine_fn reqlist_remove_req(BlockReq *req);
+
+#endif /* REQLIST_H */
diff --git a/block/block-copy.c b/block/block-copy.c
index 0834e29b6e..ef948dccec 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -17,6 +17,7 @@
 #include "trace.h"
 #include "qapi/error.h"
 #include "block/block-copy.h"
+#include "block/reqlist.h"
 #include "sysemu/block-backend.h"
 #include "qemu/units.h"
 #include "qemu/coroutine.h"
@@ -83,7 +84,6 @@ typedef struct BlockCopyTask {
  */
 BlockCopyState *s;
 BlockCopyCallState *call_state;
-int64_t offset;
 /*
  * @method can also be set again in the while loop of
  * block_copy_dirty_clusters(), but it is never accessed concurrently
@@ -94,21 +94,17 @@ typedef struct BlockCopyTask {
 BlockCopyMethod method;
 
 /*
- * Fields whose state changes throughout the execution
- * Protected by lock in BlockCopyState.
+ * Generally, req is protected by the lock in BlockCopyState. Still,
+ * req.offset is only set at task creation, so it may be read concurrently
+ * afterwards. req.bytes is changed at most once, and needs protection only
+ * against a parallel read while it is updated in block_copy_task_shrink().
  */
-CoQueue wait_queue; /* coroutines blocked on this task */
-/*
- * Only protect the case of parallel read while updating @bytes
- * value in block_copy_task_shrink().
- */
-int64_t bytes;
-QLIST_ENTRY(BlockCopyTask) list;
+BlockReq req;
 } BlockCopyTask;
 
 static int64_t task_end(BlockCopyTask *task)
 {
-return task->offset + task->bytes;
+return task->req.offset + task->req.bytes;
 }
 
 typedef struct BlockCopyState {
@@ -136,7 +132,7 @@ typedef struct BlockCopyState {
 CoMutex lock;
 int64_t in_flight_bytes;
 BlockCopyMethod method;
-QLIST_HEAD(, BlockCopyTask) tasks; /* All tasks from all block-copy calls 
*/
+BlockReqList reqs;
 QLIST_HEAD(, BlockCopyCallState) calls;
 /*
  * skip_unallocated:
@@ -160,42 +156,6 @@ typedef struct BlockCopyState {
 RateLimit rate_limit;
 } BlockCopyState;
 
-/* Called with lock held */
-static BlockCopyTask *find_conflicting_task(BlockCopyState *s,
-   

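[Editorial note: the reqlist contract — callers must ensure there is no conflict before inserting, and removal wakes everything queued on the request — can be sketched without coroutines. This is an illustrative model only; the real API additionally takes a CoMutex and queues waiters on a CoQueue:]

```python
class BlockReq:
    def __init__(self, offset, nbytes):
        self.offset, self.nbytes = offset, nbytes
        self.waiters = []                 # stands in for the CoQueue

class ReqList(list):
    def find_conflict(self, offset, nbytes):
        """First request intersecting [offset, offset + nbytes), or None."""
        for r in self:
            if offset < r.offset + r.nbytes and r.offset < offset + nbytes:
                return r
        return None

    def init_req(self, offset, nbytes):
        # Callers must be sure there is no conflicting request in the list.
        assert self.find_conflict(offset, nbytes) is None
        req = BlockReq(offset, nbytes)
        self.append(req)
        return req

    def remove_req(self, req):
        """Drop the request and wake everything that queued on it."""
        self.remove(req)
        woken, req.waiters = req.waiters, []
        return woken
```
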
[PATCH v4 02/18] block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return value

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
That simplifies failure handling in existing code and in further new
usages of bdrv_merge_dirty_bitmap().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 include/block/dirty-bitmap.h| 2 +-
 block/dirty-bitmap.c| 9 +++--
 block/monitor/bitmap-qmp-cmds.c | 5 +
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 40950ae3d5..f95d350b70 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -77,7 +77,7 @@ void bdrv_dirty_bitmap_set_persistence(BdrvDirtyBitmap 
*bitmap,
bool persistent);
 void bdrv_dirty_bitmap_set_inconsistent(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_set_busy(BdrvDirtyBitmap *bitmap, bool busy);
-void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
+bool bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
  HBitmap **backup, Error **errp);
 void bdrv_dirty_bitmap_skip_store(BdrvDirtyBitmap *bitmap, bool skip);
 bool bdrv_dirty_bitmap_get(BdrvDirtyBitmap *bitmap, int64_t offset);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 0ef46163e3..94a0276833 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -880,11 +880,14 @@ bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap 
*bitmap,
  * Ensures permissions on bitmaps are reasonable; use for public API.
  *
  * @backup: If provided, make a copy of dest here prior to merge.
+ *
+ * Returns true on success, false on failure. In case of failure bitmaps are
+ * untouched.
  */
-void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
+bool bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
  HBitmap **backup, Error **errp)
 {
-bool ret;
+bool ret = false;
 
 bdrv_dirty_bitmaps_lock(dest->bs);
 if (src->bs != dest->bs) {
@@ -912,6 +915,8 @@ out:
 if (src->bs != dest->bs) {
 bdrv_dirty_bitmaps_unlock(src->bs);
 }
+
+return ret;
 }
 
 /**
diff --git a/block/monitor/bitmap-qmp-cmds.c b/block/monitor/bitmap-qmp-cmds.c
index 9f11deec64..83970b22fa 100644
--- a/block/monitor/bitmap-qmp-cmds.c
+++ b/block/monitor/bitmap-qmp-cmds.c
@@ -259,7 +259,6 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, 
const char *target,
 BlockDriverState *bs;
 BdrvDirtyBitmap *dst, *src, *anon;
 BlockDirtyBitmapMergeSourceList *lst;
-Error *local_err = NULL;
 
 dst = block_dirty_bitmap_lookup(node, target, &bs, errp);
 if (!dst) {
@@ -297,9 +296,7 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, 
const char *target,
 abort();
 }
 
-bdrv_merge_dirty_bitmap(anon, src, NULL, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
+if (!bdrv_merge_dirty_bitmap(anon, src, NULL, errp)) {
 dst = NULL;
 goto out;
 }
-- 
2.31.1




[PATCH v4 03/18] block/block-copy: block_copy_state_new(): add bitmap parameter

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
This will be used in the following commit to bring an "incremental" mode
to the copy-before-write filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block-copy.h |  1 +
 block/block-copy.c | 14 +-
 block/copy-before-write.c  |  2 +-
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/block/block-copy.h b/include/block/block-copy.h
index 99370fa38b..b80ad02299 100644
--- a/include/block/block-copy.h
+++ b/include/block/block-copy.h
@@ -25,6 +25,7 @@ typedef struct BlockCopyState BlockCopyState;
 typedef struct BlockCopyCallState BlockCopyCallState;
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
+ const BdrvDirtyBitmap *bitmap,
  Error **errp);
 
 /* Function should be called prior any actual copy request */
diff --git a/block/block-copy.c b/block/block-copy.c
index abda7a80bd..8aa6ee6a5c 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -384,8 +384,10 @@ static int64_t 
block_copy_calculate_cluster_size(BlockDriverState *target,
 }
 
 BlockCopyState *block_copy_state_new(BdrvChild *source, BdrvChild *target,
+ const BdrvDirtyBitmap *bitmap,
  Error **errp)
 {
+ERRP_GUARD();
 BlockCopyState *s;
 int64_t cluster_size;
 BdrvDirtyBitmap *copy_bitmap;
@@ -402,7 +404,17 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, 
BdrvChild *target,
 return NULL;
 }
 bdrv_disable_dirty_bitmap(copy_bitmap);
-bdrv_set_dirty_bitmap(copy_bitmap, 0, bdrv_dirty_bitmap_size(copy_bitmap));
+if (bitmap) {
+if (!bdrv_merge_dirty_bitmap(copy_bitmap, bitmap, NULL, errp)) {
+error_prepend(errp, "Failed to merge bitmap '%s' to internal "
+  "copy-bitmap: ", bdrv_dirty_bitmap_name(bitmap));
+bdrv_release_dirty_bitmap(copy_bitmap);
+return NULL;
+}
+} else {
+bdrv_set_dirty_bitmap(copy_bitmap, 0,
+  bdrv_dirty_bitmap_size(copy_bitmap));
+}
 
 /*
  * If source is in backing chain of target assume that target is going to 
be
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index 5bdaf0a9d9..799223e3fb 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -170,7 +170,7 @@ static int cbw_open(BlockDriverState *bs, QDict *options, 
int flags,
 ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK) &
  bs->file->bs->supported_zero_flags);
 
-s->bcs = block_copy_state_new(bs->file, s->target, errp);
+s->bcs = block_copy_state_new(bs->file, s->target, NULL, errp);
 if (!s->bcs) {
 error_prepend(errp, "Cannot create block-copy-state: ");
 return -EINVAL;
-- 
2.31.1


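[Editorial note: the two initialization modes of the internal copy bitmap — merge the caller's bitmap when one is given, otherwise mark the whole image dirty — reduce to the following hypothetical sketch. The real code uses HBitmap-backed dirty bitmaps and reports merge failure through errp rather than raising:]

```python
def init_copy_bitmap(nclusters, user_bitmap=None):
    """Seed the copy bitmap the way block_copy_state_new() now does."""
    if user_bitmap is None:
        return set(range(nclusters))          # full copy: everything dirty
    if any(c >= nclusters for c in user_bitmap):
        # merging bitmaps that do not cover the same range fails
        raise ValueError("bitmap does not match image size")
    return set(user_bitmap)                   # incremental: dirty areas only
```
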


[PATCH v4 01/18] block/block-copy: move copy_bitmap initialization to block_copy_state_new()

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
We are going to complicate bitmap initialization in a further commit.
And in the future, the backup job will be able to work without a filter
(when the source is immutable), so we'll need the same bitmap
initialization in the copy-before-write filter and in the backup job.
So, it's reasonable to do it in block-copy.

Note that for now cbw_open() is the only caller of
block_copy_state_new().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Hanna Reitz 
---
 block/block-copy.c| 1 +
 block/copy-before-write.c | 4 
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/block/block-copy.c b/block/block-copy.c
index ce116318b5..abda7a80bd 100644
--- a/block/block-copy.c
+++ b/block/block-copy.c
@@ -402,6 +402,7 @@ BlockCopyState *block_copy_state_new(BdrvChild *source, 
BdrvChild *target,
 return NULL;
 }
 bdrv_disable_dirty_bitmap(copy_bitmap);
+bdrv_set_dirty_bitmap(copy_bitmap, 0, bdrv_dirty_bitmap_size(copy_bitmap));
 
 /*
  * If source is in backing chain of target assume that target is going to 
be
diff --git a/block/copy-before-write.c b/block/copy-before-write.c
index c30a5ff8de..5bdaf0a9d9 100644
--- a/block/copy-before-write.c
+++ b/block/copy-before-write.c
@@ -149,7 +149,6 @@ static int cbw_open(BlockDriverState *bs, QDict *options, 
int flags,
 Error **errp)
 {
 BDRVCopyBeforeWriteState *s = bs->opaque;
-BdrvDirtyBitmap *copy_bitmap;
 
 bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY,
@@ -177,9 +176,6 @@ static int cbw_open(BlockDriverState *bs, QDict *options, 
int flags,
 return -EINVAL;
 }
 
-copy_bitmap = block_copy_dirty_bitmap(s->bcs);
-bdrv_set_dirty_bitmap(copy_bitmap, 0, bdrv_dirty_bitmap_size(copy_bitmap));
-
 return 0;
 }
 
-- 
2.31.1




[PATCH v4 00/18] Make image fleecing more usable

2022-02-16 Thread Vladimir Sementsov-Ogievskiy
v4: Switch to a new fleecing scheme that is more native to QEMU's block
layer; see new patches 10-12 for details.

01,02: add Hanna's r-b
03: add const, add forgotten bdrv_release_dirty_bitmap()
04: rewrite bitmap parameter parsing
05: add Hanna's r-b
06: tiny wording fixes
07: new
08: fix comments, improve interface of new functions
09: fix grammar, add Hanna's and Nikita's r-bs
10-12: new fleecing scheme
tests: updated to use new fleecing scheme

===

This series brings several improvements to the fleecing scheme:

1. support bitmap in copy-before-write filter

2. introduce snapshot-access API and filter, to make a new fleecing
   scheme. See "block: copy-before-write: realize snapshot-access API"
   commit message for details.

3. support the "push backup with fleecing" scheme, where the backup job
   is a client of the common fleecing scheme. That helps when writes to
   the final backup target are slow and we don't want guest writes to
   hang waiting for copy-before-write operations to the final target.

Vladimir Sementsov-Ogievskiy (18):
  block/block-copy: move copy_bitmap initialization to
block_copy_state_new()
  block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return value
  block/block-copy: block_copy_state_new(): add bitmap parameter
  block/copy-before-write: add bitmap open parameter
  block/block-copy: add block_copy_reset()
  block: introduce reqlist
  block/reqlist: reqlist_find_conflict(): use ranges_overlap()
  block/dirty-bitmap: introduce bdrv_dirty_bitmap_status()
  block/reqlist: add reqlist_wait_all()
  block/io: introduce block driver snapshot-access API
  block: introduce snapshot-access filter
  block: copy-before-write: realize snapshot-access API
  iotests/image-fleecing: add test-case for fleecing format node
  iotests.py: add qemu_io_pipe_and_status()
  iotests/image-fleecing: add test case with bitmap
  block: blk_root(): return non-const pointer
  qapi: backup: add immutable-source parameter
  iotests/image-fleecing: test push backup with fleecing

 qapi/block-core.json|  25 +-
 include/block/block-copy.h  |   2 +
 include/block/block_int.h   |  28 +++
 include/block/dirty-bitmap.h|   4 +-
 include/block/reqlist.h |  75 ++
 include/qemu/hbitmap.h  |  12 +
 include/sysemu/block-backend.h  |   2 +-
 block/backup.c  |  61 -
 block/block-backend.c   |   2 +-
 block/block-copy.c  | 150 +--
 block/copy-before-write.c   | 265 +++-
 block/dirty-bitmap.c|  15 +-
 block/io.c  |  69 +
 block/monitor/bitmap-qmp-cmds.c |   5 +-
 block/replication.c |   2 +-
 block/reqlist.c |  85 +++
 block/snapshot-access.c | 132 ++
 blockdev.c  |   1 +
 util/hbitmap.c  |  33 +++
 MAINTAINERS |   5 +-
 block/meson.build   |   2 +
 tests/qemu-iotests/iotests.py   |   4 +
 tests/qemu-iotests/tests/image-fleecing | 175 ++---
 tests/qemu-iotests/tests/image-fleecing.out | 223 +++-
 24 files changed, 1227 insertions(+), 150 deletions(-)
 create mode 100644 include/block/reqlist.h
 create mode 100644 block/reqlist.c
 create mode 100644 block/snapshot-access.c

-- 
2.31.1


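[Editorial note: the scheme the cover letter describes can be condensed into a toy model — guest writes trigger copy-before-write into a temporary image, and the snapshot-access reader serves copied clusters from the temporary image and untouched clusters from the live source. An illustrative Python sketch, not the QEMU implementation:]

```python
CLUSTER = 4   # toy cluster size

class Fleecing:
    """Toy fleecing model: copy-before-write preserves the point-in-time
    view in `temp`; snapshot reads pick `temp` for copied clusters and
    the live source for everything else."""
    def __init__(self, source):
        self.source = source
        self.temp = bytearray(len(source))    # temporary (fleecing) image
        self.copied = set()                   # clusters preserved in temp

    def guest_write(self, offset, data):
        first = offset // CLUSTER
        last = (offset + len(data) - 1) // CLUSTER
        for c in range(first, last + 1):
            if c not in self.copied:          # copy-before-write operation
                self.temp[c * CLUSTER:(c + 1) * CLUSTER] = \
                    self.source[c * CLUSTER:(c + 1) * CLUSTER]
                self.copied.add(c)
        self.source[offset:offset + len(data)] = data

    def snapshot_read(self, offset, nbytes):
        first = offset // CLUSTER
        last = (offset + nbytes - 1) // CLUSTER
        out = bytearray()
        for c in range(first, last + 1):
            base = self.temp if c in self.copied else self.source
            out += base[c * CLUSTER:(c + 1) * CLUSTER]
        skip = offset - first * CLUSTER
        return bytes(out[skip:skip + nbytes])
```

Guest writes keep going to the live image, while the snapshot reader continues to see the unmodified point-in-time contents.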


Re: [PATCH v2] nbd/server: Allow MULTI_CONN for shared writable exports

2022-02-16 Thread Vladimir Sementsov-Ogievskiy

16.02.2022 20:14, Nir Soffer wrote:

On Wed, Feb 16, 2022 at 10:08 AM Vladimir Sementsov-Ogievskiy
 wrote:


16.02.2022 02:24, Eric Blake wrote:

On Tue, Feb 15, 2022 at 09:23:36PM +0200, Nir Soffer wrote:

On Tue, Feb 15, 2022 at 7:22 PM Eric Blake  wrote:


According to the NBD spec, a server advertising
NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will
not see any cache inconsistencies: when properly separated by a single
flush, actions performed by one client will be visible to another
client, regardless of which client did the flush.  We satisfy these
conditions in qemu when our block layer is backed by the local
filesystem (by virtue of the semantics of fdatasync(), and the fact
that qemu itself is not buffering writes beyond flushes).  It is
harder to state whether we satisfy these conditions for network-based
protocols, so the safest course of action is to allow users to opt-in
to advertising multi-conn.  We may later tweak defaults to advertise
by default when the block layer can confirm that the underlying
protocol driver is cache consistent between multiple writers, but for
now, this at least allows savvy users (such as virt-v2v or nbdcopy) to
explicitly start qemu-nbd or qemu-storage-daemon with multi-conn
advertisement in a known-safe setup where the client end can then
benefit from parallel clients.



It makes sense, and will be used by oVirt. Actually we are already using
multiple connections for writing about 2 years, based on your promise
that if every client writes to district  areas this is safe.


I presume s/district/distinct/, but yes, I'm glad we're finally trying
to make the code match existing practice ;)


+++ b/docs/tools/qemu-nbd.rst
@@ -139,8 +139,7 @@ driver options if ``--image-opts`` is specified.
   .. option:: -e, --shared=NUM

 Allow up to *NUM* clients to share the device (default
-  ``1``), 0 for unlimited. Safe for readers, but for now,
-  consistency is not guaranteed between multiple writers.
+  ``1``), 0 for unlimited.



Removing the note means that now consistency is guaranteed between
multiple writers, no?

Or maybe we want to mention here that consistency depends on the protocol
and users can opt in, or refer to the section where this is discussed?


Yeah, a link to the QAPI docs where multi-conn is documented might be
nice, except I'm not sure the best way to do that in our sphinx
documentation setup.


+##
+# @NbdExportMultiConn:
+#
+# Possible settings for advertising NBD multiple client support.
+#
+# @off: Do not advertise multiple clients.
+#
+# @on: Allow multiple clients (for writable clients, this is only safe
+#  if the underlying BDS is cache-consistent, such as when backed
+#  by the raw file driver); ignored if the NBD server was set up
+#  with max-connections of 1.
+#
+# @auto: Behaves like @off if the export is writable, and @on if the
+#export is read-only.
+#
+# Since: 7.0
+##
+{ 'enum': 'NbdExportMultiConn',
+  'data': ['off', 'on', 'auto'] }



Are we going to have --multi-conn=(on|off|auto)?


Oh. The QMP command (which is immediately visible through
nbd-server-add/block-storage-add to qemu and qemu-storage-daemon)
gains "multi-conn":"on", but you may be right that qemu-nbd would want
a command line option (either that, or we accellerate our plans that
qsd should replace qemu-nbd).


+++ b/blockdev-nbd.c
@@ -44,6 +44,11 @@ bool nbd_server_is_running(void)
   return nbd_server || is_qemu_nbd;
   }

+int nbd_server_max_connections(void)
+{
+return nbd_server ? nbd_server->max_connections : -1;
+}



-1 is a little bit strange for a limit, maybe 1 is a better default when
nbd_server == NULL? When can this happen?


In qemu, if you haven't used the QMP command 'nbd-server-start' yet.
In qemu-nbd, always (per the nbd_server_is_running function just
above).  My iotest only covered the qemu/qsd side, not the qemu-nbd
side, so it looks like I need a v3...


+++ b/nbd/server.c



+/*
+ * Determine whether to advertise multi-conn.  Default is auto,
+ * which resolves to on for read-only and off for writable.  But
+ * if the server has max-connections 1, that forces the flag off.



Looks good, this can be enabled automatically based on the driver
if we want, so users using auto can upgrade to multi-conn automatically.


Yes, that's part of why I made it a tri-state with a default of 'auto'.





+ */
+if (!arg->has_multi_conn) {
+arg->multi_conn = NBD_EXPORT_MULTI_CONN_AUTO;
+}
+if (nbd_server_max_connections() == 1) {


+arg->multi_conn = NBD_EXPORT_MULTI_CONN_OFF;

+}


+if (arg->multi_conn == NBD_EXPORT_MULTI_CONN_AUTO) {

+multi_conn = readonly;
+} else {
+multi_conn = arg->multi_conn == NBD_EXPORT_MULTI_CONN_ON;
+}



This part is a little bit confusing - we do:
- initialize arg->multi_conn if it has no value
- set the temporary multi_conn based on the now-initialized arg->multi_conn

I think it will be nicer to se

Re: [PATCH 1/4] python/utils: add enboxify() text decoration utility

2022-02-16 Thread Hanna Reitz

On 16.02.22 17:16, John Snow wrote:


On Tue, Feb 15, 2022, 6:57 PM Philippe Mathieu-Daudé  
wrote:


On 16/2/22 00:53, John Snow wrote:
> On Tue, Feb 15, 2022 at 5:55 PM Eric Blake 
wrote:
>>
>> On Tue, Feb 15, 2022 at 05:08:50PM -0500, John Snow wrote:
>> print(enboxify(msg, width=72, name="commit message"))
>>> ┏━ commit message
━┓
>>> ┃ enboxify() takes a chunk of text and wraps it in a text art
box that ┃
>>> ┃  adheres to a specified width. An optional title label may
be given, ┃
>>> ┃  and any of the individual glyphs used to draw the box may
be        ┃
>>
>> Why do these two lines have a leading space,
>>
>>> ┃ replaced or specified as well.                          ┃
>>
>> but this one doesn't?  It must be an off-by-one corner case
when your
>> choice of space to wrap on is exactly at the wrap column.
>>
>
> Right, you're probably witnessing the right-pad *and* the actual
space.
>
>>>
┗━━┛
>>>
>>> Signed-off-by: John Snow 
>>> ---
>>>   python/qemu/utils/__init__.py | 58
+++
>>>   1 file changed, 58 insertions(+)

>>> +    def _wrap(line: str) -> str:
>>> +        return os.linesep.join([
>>> +            wrapped_line.ljust(lwidth) + suffix
>>> +            for wrapped_line in textwrap.wrap(
>>> +                    line, width=lwidth, initial_indent=prefix,
>>> + subsequent_indent=prefix, replace_whitespace=False,
>>> +                    drop_whitespace=False,
break_on_hyphens=False)
>>
>> Always nice when someone else has written the cool library
function to
>> do all the hard work for you ;)  But this is probably where you
have the off-by-one I called out above.
>>
>
> Yeah, I just didn't want it to eat multiple spaces if they were
> present -- I wanted it to reproduce them faithfully. The tradeoff is
> some silliness near the margins.
>
> Realistically, if I want something any better than what I've done
> here, I should find a library to do it for me instead -- but for the
> sake of highlighting some important information, this may be
> just-enough-juice.

's/^┃  /┃ /' on top ;D


I have to admit that this function is actually very fragile. Last 
night, I did some reading on unicode and emoji encodings and 
discovered that it's *basically impossible* to predict the "visual 
width" of a sequence of unicode codepoints.


Jumping in at random without knowing any of the history (that’s my forte!):

*Clippy face*

It sounds like you want to put a bar to the right of some text in a 
terminal, but you can’t predict the text’s horizontal width, and so you 
can’t work out the number of spaces needed to pad the text with before 
the bar.


Two things come to my mind, if we’re talking about TTY output:
(A) printf '%79s┃\r%s\n' '' "$text"
(B) printf '%s\e[80G┃\n' "$text"

I.e. either printing the bar first, then printing the text over the 
line; or using an ANSI escape sequence to have the TTY position the 
bar.  Both seem to work for me in both konsole and xterm.


Or perhaps you’re really trying to find out how long a piece of text is 
visually (so you can break the line when this exceeds some limit), which 
you can also technically do with ANSI escape sequences, because "\e[6n" 
will return the cursor position to stdin (as "\e[{Y};{X}R").  But 
reading from stdin when there’s no newline is always really stupid, so I 
don’t know if you really want that.


(And you also need to print the text first before you can find out how 
long it is, which is kind of not great.)


Now that I wrote this all it feels like I didn’t help at all, but I put 
work into this mail, so I’ll send it anyway!


Hanna




Re: [PATCH 1/4] python/utils: add enboxify() text decoration utility

2022-02-16 Thread John Snow
On Tue, Feb 15, 2022, 6:57 PM Philippe Mathieu-Daudé 
wrote:

> On 16/2/22 00:53, John Snow wrote:
> > On Tue, Feb 15, 2022 at 5:55 PM Eric Blake  wrote:
> >>
> >> On Tue, Feb 15, 2022 at 05:08:50PM -0500, John Snow wrote:
> >> print(enboxify(msg, width=72, name="commit message"))
> >>> ┏━ commit message
> ━┓
> >>> ┃ enboxify() takes a chunk of text and wraps it in a text art box that
> ┃
> >>> ┃  adheres to a specified width. An optional title label may be given,
> ┃
> >>> ┃  and any of the individual glyphs used to draw the box may be
> ┃
> >>
> >> Why do these two lines have a leading space,
> >>
> >>> ┃ replaced or specified as well.
>  ┃
> >>
> >> but this one doesn't?  It must be an off-by-one corner case when your
> >> choice of space to wrap on is exactly at the wrap column.
> >>
> >
> > Right, you're probably witnessing the right-pad *and* the actual space.
> >
> >>>
> ┗━━┛
> >>>
> >>> Signed-off-by: John Snow 
> >>> ---
> >>>   python/qemu/utils/__init__.py | 58
> +++
> >>>   1 file changed, 58 insertions(+)
>
> >>> +def _wrap(line: str) -> str:
> >>> +return os.linesep.join([
> >>> +wrapped_line.ljust(lwidth) + suffix
> >>> +for wrapped_line in textwrap.wrap(
> >>> +line, width=lwidth, initial_indent=prefix,
> >>> +subsequent_indent=prefix,
> replace_whitespace=False,
> >>> +drop_whitespace=False, break_on_hyphens=False)
> >>
> >> Always nice when someone else has written the cool library function to
> >> do all the hard work for you ;)  But this is probably where you have
> the off-by-one I called out above.
> >>
> >
> > Yeah, I just didn't want it to eat multiple spaces if they were
> > present -- I wanted it to reproduce them faithfully. The tradeoff is
> > some silliness near the margins.
> >
> > Realistically, if I want something any better than what I've done
> > here, I should find a library to do it for me instead -- but for the
> > sake of highlighting some important information, this may be
> > just-enough-juice.
>
> 's/^┃  /┃ /' on top ;D
>

I have to admit that this function is actually very fragile. Last night, I
did some reading on unicode and emoji encodings and discovered that it's
*basically impossible* to predict the "visual width" of a sequence of
unicode codepoints.

So, this function as written will only really work if we stick to
single-codepoint glyphs that can be rendered 1:1 in a monospace font.

I could probably improve it to work with "some" (but certainly not all)
wide glyphs and emoji, but it's a very complex topic and far outside my
specialty. Support for multi-codepoint narrow/halfwidth glyphs is also an
issue. (This affects some Latin characters outside of ascii if they are
encoded using combining codepoints.)

(See https://hsivonen.fi/string-length/ ... It's nasty.)

So I must admit that this function has some very serious limitations to it.
I want to explain why I wrote it, though.

First: Tracebacks make people's eyes cross over. It's a very long sequence
of mumbo jumbo that most people don't read, because it's program debug
information. I don't blame them. Setting apart the error summary visually
is a helpful tool for drawing one's eyes to the most critical pieces of
information.

Second: In my AQMP library, I use the ascii vertical bar | as a left-hand
border decoration to provide a kind of visual quoting mechanism to
illustrate in the logfile which subsequent confusing lines of jargon belong
to the same log entry. I really like this formatting mechanism, but...

Third: If a line of text becomes so long that it wraps in your terminal,
the visual quote mechanism breaks, making the output messy and hard to
read. Forcibly re-wrapping the text in a virtual box is a necessary
mechanism to preserve readability in this circumstance - the lines from
qemu-img et al may be much wider than your terminal column width.

And so, I drew a box instead of just a left border, because I needed to
re-wrap the text anyway. Visually, I believed it to help explain that the
output was being re-formatted to fit in a certain dimensionality.
Unfortunately, it's inadequate.

So ... what to do.

(1) I can just remove the right margin decoration and call the function
visual_quote or something. If any of the lines get too "long" because of
emoji/日本語, it MAY break the indent line, but occasional uses of one or two
wide characters probably won't cause wrapping that breaks the "visual quote
line" on a terminal with at least 85 columns. Essentially it'd still be
broken, but without a solid right border it'd be harder to notice *small*
breakages.

(2) If there is a genuine interest in using visual highlighting techniques
to make iotest failures easier to diagnose (and making sure it is properly
multilingual), I could use the urwid helper library t

Re: [PATCH] tests/qemu-iotests: Rework the checks and spots using GNU sed

2022-02-16 Thread Eric Blake
On Wed, Feb 16, 2022 at 12:39:06PM +0100, Thomas Huth wrote:
> > > -$SED -re 's/[0-9]{4}-[0-9]{2}-[0-9]{2} 
> > > [0-9]{2}:[0-9]{2}:[0-9]{2}/yyyy-mm-dd hh:mm:ss/'
> > > +gsed -re 's/[0-9]{4}-[0-9]{2}-[0-9]{2} 
> > > [0-9]{2}:[0-9]{2}:[0-9]{2}/yyyy-mm-dd hh:mm:ss/'
> > 
> > GNU sed recommends spelling it 'sed -E', not 'sed -r', when using
> > extended regex.  Older POSIX did not support either spelling, but the
> > next version will require -E, as many implementations have it now:
> > https://www.austingroupbugs.net/view.php?id=528
> 
> Thanks for the pointer ... I originally checked "man 1p sed" on
> my system and did not see -r or -E in there, so I thought that
> this must be really something specific to GNU sed. But now that
> you've mentioned this, I just double-checked the build environments
> that we support with QEMU, and seems like -E should be supported
> everywhere:

Yay.

> 
> So I think it should be safe to change these spots that are
> using "-r" to "sed -E" (and not use gsed here).
> 
> > Other than the fact that this was easier to write with ERE, I'm not
> > seeing any other GNU-only extensions in use here; but I'd recommend
> > that as long as we're touching the line, we spell it 'gsed -Ee'
> > instead of -re (here, and in several other places).
> > 
> > >   _filter_qom_path()
> > >   {
> > > -$SED -e '/Attached to:/s/\device[[0-9]\+\]/device[N]/g'
> > > +gsed -e '/Attached to:/s/\device[[0-9]\+\]/device[N]/g'
> > 
> > Here, it is our use of \+ that is a GNU sed extension, although it is
> > fairly easy (but verbose) to translate that one to portable sed
> > (x\+ is the same as xx*).  So gsed is correct.

Then again, since we claim 'sed -E' is portable, we can get the +
operator everywhere by using ERE instead of BRE (and with fewer
leaning toothpicks, another reason I like ERE better than BRE).

> > On the
> > other hand, the use of [[0-9]\+\] looks ugly - it probably does NOT
> > match what we meant (we have a bracket expression '[...]' that matches
> > the 11 characters [ and 0-9, then '\+' to match that bracket
> > expression 1 or more times, then '\]' which in its context is
> > identical to ']' to match a closing ], since only opening [ needs \
> > escaping for literal treatment).  What we probably meant is:
> > 
> > '/Attached to:/s/\device\[[0-9][0-9]*]/device[N]/g'
> > 
> > at which point normal sed would do.
> 
> Ok ... but I'd prefer to clean such spots rather in a second step,
> to make sure not to introduce bugs and to make the review easier.

Yeah, fixing bugs and cleaning up consistent use of sed/gsed/$SED are
worth separating.

> > >   _filter_qemu_io()
> > >   {
> > > -_filter_win32 | $SED -e "s/[0-9]* ops\; [0-9/:. sec]* ([0-9/.inf]* 
> > > [EPTGMKiBbytes]*\/sec and [0-9/.inf]* ops\/sec)/X ops\; XX:XX:XX.X (XXX 
> > > YYY\/sec and XXX ops\/sec)/" \
> > > +_filter_win32 | gsed -e "s/[0-9]* ops\; [0-9/:. sec]* ([0-9/.inf]* 
> > > [EPTGMKiBbytes]*\/sec and [0-9/.inf]* ops\/sec)/X ops\; XX:XX:XX.X (XXX 
> > > YYY\/sec and XXX ops\/sec)/" \
> > >   -e "s/: line [0-9][0-9]*:  *[0-9][0-9]*\( Aborted\| 
> > > Killed\)/:\1/" \
> > >   -e "s/qemu-io> //g"
> > 
> > I'm not seeing anything specific to GNU sed in this (long) sed script;
> > can we relax this one to plain 'sed'?  Use of s#some/text## might be
> > easier than having to s/some\/text//, but that's not essential.
> 
> If I switch that to plain sed, I'm getting errors like this on NetBSD:
> 
> --- /home/qemu/qemu-test.is2SLq/src/tests/qemu-iotests/046.out
> +++ 11296-046.out.bad
> @@ -66,6 +66,7 @@
>  64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  wrote 65536/65536 bytes at offset 2031616
>  64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=6442450944 
> backing_file=TEST_DIR/t.IMGFMT.base backing_fmt=IMGFMT

Huh; not sure what happened that I didn't see.  But I trust your tests
as a more canonical version of "it worked on this platform's sed" than
my "I don't see anything blatantly non-POSIX" ;)

> 
>  == Some concurrent requests touching the same cluster ==
> 
> So I'll keep gsed here for now - we need it for _filter_qemu_io
> anyway since it's calling _filter_win32 that currently needs
> gsed, too.

Yeah, I think your patch is big enough to prove there are places where
it really is easier to rely on gsed than to try and be portable.
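For illustration, the BRE-vs-ERE trade-off discussed above is easy to see with a toy version of the _filter_qom_path substitution (the sample input is invented; the real filter is more involved):

```shell
# BRE: POSIX has no '+' repetition, so "one or more digits" is [0-9][0-9]*
echo 'Attached to: device[42]' | sed 's/device\[[0-9][0-9]*]/device[N]/'
# ERE (-E): '+' is standard here, with fewer leaning toothpicks
echo 'Attached to: device[42]' | sed -E 's/device\[[0-9]+]/device[N]/'
# both print: Attached to: device[N]
```

Only the GNU BRE extension `\+` needs gsed; spelled either of the two ways above, any POSIX sed that supports -E will do.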

> 
> > > @@ -142,7 +142,7 @@ _do_filter_img_create()
> > >   # precedes ", fmt=") and the options part ($options, which starts
> > >   # with "fmt=")
> > >   # (And just echo everything before the first "^Formatting")
> > > -readarray formatting_line < <($SED -e 's/, fmt=/\n/')
> > > +readarray formatting_line < <(gsed -e 's/, fmt=/\n/')
> > 
> > This one looks like it should work with plain 'sed'.
> 
> Using normal sed does not really work for me here. For example
> with NetBSD, I get errors like this:
> 
> --- /home/qemu/qemu-test.cSYvEb/src/tests/qemu-iotests/027.ou

Re: [PATCH v2] tests/qemu-iotests: Rework the checks and spots using GNU sed

2022-02-16 Thread Philippe Mathieu-Daudé via

On 16/2/22 13:54, Thomas Huth wrote:

Instead of failing the iotests if GNU sed is not available (or skipping
them completely in the check-block.sh script), it would be better to
simply skip the bash-based tests that rely on GNU sed, so that the other
tests could still be run. Thus we now explicitly use "gsed" (either as
direct program or as a wrapper around "sed" if it's the GNU version)
in the spots that rely on the GNU sed behavior. Statements that use the
"-r" parameter of sed have been switched to use "-E" instead, since this
switch is supported by all sed versions on our supported build hosts
(most also support "-r", but macOS' sed only supports "-E"). With all
these changes in place, we then can also remove the sed checks from the
check-block.sh script, so that "make check-block" can now be run on
systems without GNU sed, too.

Signed-off-by: Thomas Huth 
---
  I've checked that this still works fine with "make vm-build-freebsd",
  "make vm-build-netbsd" and "make vm-build-openbsd" and the Cirrus-CI
  macOS tasks.

  tests/check-block.sh | 12 --
  tests/qemu-iotests/271   |  2 +-
  tests/qemu-iotests/common.filter | 65 
  tests/qemu-iotests/common.rc | 45 +++---
  4 files changed, 57 insertions(+), 67 deletions(-)


Reviewed-by: Philippe Mathieu-Daudé 





Re: [PATCH v4 00/15] hw/nvme: SR-IOV with Virtualization Enhancements

2022-02-16 Thread Lukasz Maniak
On Fri, Feb 11, 2022 at 08:26:10AM +0100, Klaus Jensen wrote:
> On Jan 26 18:11, Lukasz Maniak wrote:
> > Changes since v3:
> > - Addressed comments to review on pcie: Add support for Single Root I/O
> >   Virtualization (SR/IOV)
> > - Fixed issues reported by checkpatch.pl
> > 
> > Knut Omang (2):
> >   pcie: Add support for Single Root I/O Virtualization (SR/IOV)
> >   pcie: Add some SR/IOV API documentation in docs/pcie_sriov.txt
> > 
> > Lukasz Maniak (4):
> >   hw/nvme: Add support for SR-IOV
> >   hw/nvme: Add support for Primary Controller Capabilities
> >   hw/nvme: Add support for Secondary Controller List
> >   docs: Add documentation for SR-IOV and Virtualization Enhancements
> > 
> > Łukasz Gieryk (9):
> >   pcie: Add a helper to the SR/IOV API
> >   pcie: Add 1.2 version token for the Power Management Capability
> >   hw/nvme: Implement the Function Level Reset
> >   hw/nvme: Make max_ioqpairs and msix_qsize configurable in runtime
> >   hw/nvme: Remove reg_size variable and update BAR0 size calculation
> >   hw/nvme: Calculate BAR attributes in a function
> >   hw/nvme: Initialize capability structures for primary/secondary
> > controllers
> >   hw/nvme: Add support for the Virtualization Management command
> >   hw/nvme: Update the initalization place for the AER queue
> > 
> >  docs/pcie_sriov.txt  | 115 ++
> >  docs/system/devices/nvme.rst |  36 ++
> >  hw/nvme/ctrl.c   | 675 ---
> >  hw/nvme/ns.c |   2 +-
> >  hw/nvme/nvme.h   |  55 ++-
> >  hw/nvme/subsys.c |  75 +++-
> >  hw/nvme/trace-events |   6 +
> >  hw/pci/meson.build   |   1 +
> >  hw/pci/pci.c | 100 --
> >  hw/pci/pcie.c|   5 +
> >  hw/pci/pcie_sriov.c  | 302 
> >  hw/pci/trace-events  |   5 +
> >  include/block/nvme.h |  65 
> >  include/hw/pci/pci.h |  12 +-
> >  include/hw/pci/pci_ids.h |   1 +
> >  include/hw/pci/pci_regs.h|   1 +
> >  include/hw/pci/pcie.h|   6 +
> >  include/hw/pci/pcie_sriov.h  |  77 
> >  include/qemu/typedefs.h  |   2 +
> >  19 files changed, 1460 insertions(+), 81 deletions(-)
> >  create mode 100644 docs/pcie_sriov.txt
> >  create mode 100644 hw/pci/pcie_sriov.c
> >  create mode 100644 include/hw/pci/pcie_sriov.h
> > 
> > -- 
> > 2.25.1
> > 
> > 
> 
> Hi Lukasz,
> 
> Back in v3 you changed this:
> 
> - Secondary controller cannot be set online unless the corresponding VF
>   is enabled (sriov_numvfs set to at least the secondary controller's VF
>   number)
> 
> I'm having issues getting this to work now. As I understand it, this now
> requires that sriov_numvfs is set prior to onlining the devices, i.e.:
> 
>   echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
> 
> However, this causes the kernel to reject it:
> 
>   nvme nvme1: Device not ready; aborting initialisation, CSTS=0x2
>   nvme nvme1: Removing after probe failure status: -19
> 
> Is this the expected behavior? Must I manually bind the device again to
> the nvme driver? Prior to v3 this worked just fine since the VF was
> onlined at this point.
> 
> It would be useful if you added a small "onlining for dummies" section
> to the docs ;)

Hi Klaus,

Yes, this is the expected behavior and yeah it is less user friendly
than in v3.

Yet, after re-examining the NVMe specification, we concluded that this
is how it should work.

This is now the correct minimum flow needed to run a VF-based functional
NVMe controller:
# Unbind all flexible resources from the primary controller
nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

# Reset the primary controller to actually release the resources
echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

# Enable VF
echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

# Assign flexible resources to VF and set it ONLINE
nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 21
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 21
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

# Bind NVMe driver for VF controller
echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind

I will update the docs.

Thanks,
Lukasz



Re: [PATCH v2] nbd/server: Allow MULTI_CONN for shared writable exports

2022-02-16 Thread Markus Armbruster
Eric Blake  writes:

> According to the NBD spec, a server advertising
> NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will
> not see any cache inconsistencies: when properly separated by a single
> flush, actions performed by one client will be visible to another
> client, regardless of which client did the flush.  We satisfy these
> conditions in qemu when our block layer is backed by the local
> filesystem (by virtue of the semantics of fdatasync(), and the fact
> that qemu itself is not buffering writes beyond flushes).  It is
> harder to state whether we satisfy these conditions for network-based
> protocols, so the safest course of action is to allow users to opt-in
> to advertising multi-conn.  We may later tweak defaults to advertise
> by default when the block layer can confirm that the underlying
> protocol driver is cache consistent between multiple writers, but for
> now, this at least allows savvy users (such as virt-v2v or nbdcopy) to
> explicitly start qemu-nbd or qemu-storage-daemon with multi-conn
> advertisement in a known-safe setup where the client end can then
> benefit from parallel clients.
>
> Note, however, that we don't want to advertise MULTI_CONN when we know
> that a second client cannot connect (for historical reasons, qemu-nbd
> defaults to a single connection while nbd-server-add and QMP commands
> default to unlimited connections; but we already have existing means
> to let either style of NBD server creation alter those defaults).  The
> harder part of this patch is setting up an iotest to demonstrate
> behavior of multiple NBD clients to a single server.  It might be
> possible with parallel qemu-io processes, but concisely managing that
> in shell is painful.  I found it easier to do by relying on the libnbd
> project's nbdsh, which means this test will be skipped on platforms
> where that is not available.
>
> Signed-off-by: Eric Blake 
> Fixes: https://bugzilla.redhat.com/1708300
> ---
>
> v1 was in Aug 2021 [1], with further replies in Sep [2] and Oct [3].
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04900.html
> [2] https://lists.gnu.org/archive/html/qemu-devel/2021-09/msg00038.html
> [3] https://lists.gnu.org/archive/html/qemu-devel/2021-10/msg06744.html
>
> Since then, I've tweaked the QAPI to mention 7.0 (instead of 6.2), and
> reworked the logic so that default behavior is unchanged for now
> (advertising multi-conn on a writable export requires opt-in during
> the command line or QMP, but remains default for a readonly export).
> I've also expanded the amount of testing done in the new iotest.
>
>  docs/interop/nbd.txt   |   1 +
>  docs/tools/qemu-nbd.rst|   3 +-
>  qapi/block-export.json |  34 +++-
>  include/block/nbd.h|   3 +-
>  blockdev-nbd.c |   5 +
>  nbd/server.c   |  27 ++-
>  MAINTAINERS|   1 +
>  tests/qemu-iotests/tests/nbd-multiconn | 188 +
>  tests/qemu-iotests/tests/nbd-multiconn.out | 112 
>  9 files changed, 363 insertions(+), 11 deletions(-)
>  create mode 100755 tests/qemu-iotests/tests/nbd-multiconn
>  create mode 100644 tests/qemu-iotests/tests/nbd-multiconn.out
>
> diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
> index bdb0f2a41aca..6c99070b99c8 100644
> --- a/docs/interop/nbd.txt
> +++ b/docs/interop/nbd.txt
> @@ -68,3 +68,4 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
>  * 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports,
>  NBD_CMD_FLAG_FAST_ZERO
>  * 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
> +* 7.0: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
> diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst
> index 6031f9689312..1de785524c36 100644
> --- a/docs/tools/qemu-nbd.rst
> +++ b/docs/tools/qemu-nbd.rst
> @@ -139,8 +139,7 @@ driver options if ``--image-opts`` is specified.
>  .. option:: -e, --shared=NUM
>
>Allow up to *NUM* clients to share the device (default
> -  ``1``), 0 for unlimited. Safe for readers, but for now,
> -  consistency is not guaranteed between multiple writers.
> +  ``1``), 0 for unlimited.
>
>  .. option:: -t, --persistent
>
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index f183522d0d2c..0a27e8ee84f9 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -21,7 +21,9 @@
>  # recreated on the fly while the NBD server is active.
>  # If missing, it will default to denying access (since 4.0).
>  # @max-connections: The maximum number of connections to allow at the same
> -#   time, 0 for unlimited. (since 5.2; default: 0)
> +#   time, 0 for unlimited. Setting this to 1 also stops
> +#   the server from advertising multiple client support
> +#   (since 5.2; default: 0)
>  #
> 

Re: [PATCH v2] nbd/server: Allow MULTI_CONN for shared writable exports

2022-02-16 Thread Nir Soffer
On Wed, Feb 16, 2022 at 12:13 PM Richard W.M. Jones 
wrote:

> On Tue, Feb 15, 2022 at 05:24:14PM -0600, Eric Blake wrote:
> > Oh. The QMP command (which is immediately visible through
> > nbd-server-add/block-storage-add to qemu and qemu-storage-daemon)
> > gains "multi-conn":"on", but you may be right that qemu-nbd would want
> a command line option (either that, or we accelerate our plans that
> > qsd should replace qemu-nbd).
>
> I really hope there will always be something called "qemu-nbd"
> that acts like qemu-nbd.
>

I share this hope. Most projects I work on are based on qemu-nbd.

However, in the oVirt use case, we want to provide an NBD socket for
clients to allow direct access to disks. One of the issues we need to
solve for this is having a way to tell whether qemu-nbd is active, so
we can terminate idle transfers.

The way we do this with the ovirt-imageio server is to query the status
of the transfer, and use the idle time (time since the last request) and
active status (has inflight requests) to detect a stale transfer that
should be terminated. An example use case is a process on a remote host
that started an image transfer and was killed or crashed in the middle
of the transfer without cleaning up properly.

To be more specific, every request to the imageio server (read, write,
flush, zero, options) updates a timestamp in the transfer state. When we
get the status, we report the time since that timestamp was last updated.

Additionally, we keep and report the number of inflight requests, so we
can tell when requests are blocked on inaccessible storage (e.g. a
non-responsive NFS mount).
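The bookkeeping described above can be sketched in a few lines (names are hypothetical, not ovirt-imageio's actual API):

```python
import time

class TransferState:
    """Per-transfer activity tracking, as described above (hypothetical names)."""

    def __init__(self) -> None:
        self.last_activity = time.monotonic()
        self.inflight = 0

    def request_started(self) -> None:
        self.inflight += 1
        self.last_activity = time.monotonic()

    def request_finished(self) -> None:
        self.inflight -= 1
        self.last_activity = time.monotonic()

    def status(self) -> dict:
        # "idle": seconds since the last request touched this transfer.
        # "active": whether requests are currently in flight (they may be
        # stuck on inaccessible storage, e.g. a non-responsive NFS mount).
        return {
            "idle": time.monotonic() - self.last_activity,
            "active": self.inflight > 0,
        }
```

A monitor can then terminate a transfer whose "idle" exceeds some timeout while "active" is false.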

We don't have a way to do this with qemu-nbd, but I guess that using
qemu-storage-daemon, once we have QMP access, will make such monitoring
possible.

Nir


[PATCH v2] tests/qemu-iotests: Rework the checks and spots using GNU sed

2022-02-16 Thread Thomas Huth
Instead of failing the iotests if GNU sed is not available (or skipping
them completely in the check-block.sh script), it would be better to
simply skip the bash-based tests that rely on GNU sed, so that the other
tests could still be run. Thus we now explicitly use "gsed" (either as
direct program or as a wrapper around "sed" if it's the GNU version)
in the spots that rely on the GNU sed behavior. Statements that use the
"-r" parameter of sed have been switched to use "-E" instead, since this
switch is supported by all sed versions on our supported build hosts
(most also support "-r", but macOS' sed only supports "-E"). With all
these changes in place, we then can also remove the sed checks from the
check-block.sh script, so that "make check-block" can now be run on
systems without GNU sed, too.
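A minimal sketch of that kind of detection/wrapper logic (the actual common.rc code may differ):

```shell
# Resolve a GNU sed: prefer an installed "gsed", otherwise fall back to
# plain "sed" if it identifies itself as GNU sed.
if command -v gsed >/dev/null 2>&1; then
    gsed() { command gsed "$@"; }
elif sed --version 2>/dev/null | grep -q 'GNU sed'; then
    gsed() { sed "$@"; }
else
    gsed() { echo "GNU sed not available" >&2; return 1; }
fi

# \+ is a GNU BRE extension, so this substitution needs the wrapper:
echo 'foo123' | gsed -e 's/[0-9]\+/N/'
```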

Signed-off-by: Thomas Huth 
---
 I've checked that this still works fine with "make vm-build-freebsd",
 "make vm-build-netbsd" and "make vm-build-openbsd" and the Cirrus-CI
 macOS tasks.

 tests/check-block.sh | 12 --
 tests/qemu-iotests/271   |  2 +-
 tests/qemu-iotests/common.filter | 65 
 tests/qemu-iotests/common.rc | 45 +++---
 4 files changed, 57 insertions(+), 67 deletions(-)

diff --git a/tests/check-block.sh b/tests/check-block.sh
index 720a46bc36..af0c574812 100755
--- a/tests/check-block.sh
+++ b/tests/check-block.sh
@@ -52,18 +52,6 @@ if LANG=C bash --version | grep -q 'GNU bash, version [123]' 
; then
 skip "bash version too old ==> Not running the qemu-iotests."
 fi
 
-if ! (sed --version | grep 'GNU sed') > /dev/null 2>&1 ; then
-if ! command -v gsed >/dev/null 2>&1; then
-skip "GNU sed not available ==> Not running the qemu-iotests."
-fi
-else
-# Double-check that we're not using BusyBox' sed which says
-# that "This is not GNU sed version 4.0" ...
-if sed --version | grep -q 'not GNU sed' ; then
-skip "BusyBox sed not supported ==> Not running the qemu-iotests."
-fi
-fi
-
 cd tests/qemu-iotests
 
 # QEMU_CHECK_BLOCK_AUTO is used to disable some unstable sub-tests
diff --git a/tests/qemu-iotests/271 b/tests/qemu-iotests/271
index 2775b4d130..c7c2cadda0 100755
--- a/tests/qemu-iotests/271
+++ b/tests/qemu-iotests/271
@@ -896,7 +896,7 @@ _make_test_img -o extended_l2=on 1M
 # Second and third writes in _concurrent_io() are independent and may finish in
 # different order. So, filter offset out to match both possible variants.
 _concurrent_io | $QEMU_IO | _filter_qemu_io | \
-$SED -e 's/\(20480\|40960\)/OFFSET/'
+sed -e 's/\(20480\|40960\)/OFFSET/'
 _concurrent_verify | $QEMU_IO | _filter_qemu_io
 
 # success, all done
diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index 75cc241580..21819db9c3 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -21,44 +21,44 @@
 
 _filter_date()
 {
-$SED -re 's/[0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}/-mm-dd hh:mm:ss/'
+sed -Ee 's/[0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}/-mm-dd hh:mm:ss/'
 }
 
 _filter_vmstate_size()
 {
-$SED -r -e 's/[0-9. ]{5} [KMGT]iB/ SIZE/' \
--e 's/[0-9. ]{5} B/   SIZE/'
+sed -E -e 's/[0-9. ]{5} [KMGT]iB/ SIZE/' \
+   -e 's/[0-9. ]{5} B/   SIZE/'
 }
 
 _filter_generated_node_ids()
 {
-$SED -re 's/\#block[0-9]{3,}/NODE_NAME/'
+sed -Ee 's/\#block[0-9]{3,}/NODE_NAME/'
 }
 
 _filter_qom_path()
 {
-$SED -e '/Attached to:/s/\device[[0-9]\+\]/device[N]/g'
+gsed -e '/Attached to:/s/\device[[0-9]\+\]/device[N]/g'
 }
 
 # replace occurrences of the actual TEST_DIR value with TEST_DIR
 _filter_testdir()
 {
-$SED -e "s#$TEST_DIR/#TEST_DIR/#g" \
- -e "s#$SOCK_DIR/#SOCK_DIR/#g" \
- -e "s#SOCK_DIR/fuse-#TEST_DIR/#g"
+sed -e "s#$TEST_DIR/#TEST_DIR/#g" \
+-e "s#$SOCK_DIR/#SOCK_DIR/#g" \
+-e "s#SOCK_DIR/fuse-#TEST_DIR/#g"
 }
 
 # replace occurrences of the actual IMGFMT value with IMGFMT
 _filter_imgfmt()
 {
-$SED -e "s#$IMGFMT#IMGFMT#g"
+sed -e "s#$IMGFMT#IMGFMT#g"
 }
 
 # Replace error message when the format is not supported and delete
 # the output lines after the first one
 _filter_qemu_img_check()
 {
-$SED -e '/allocated.*fragmented.*compressed clusters/d' \
+gsed -e '/allocated.*fragmented.*compressed clusters/d' \
 -e 's/qemu-img: This image format does not support checks/No errors 
were found on the image./' \
 -e '/Image end offset: [0-9]\+/d'
 }
@@ -66,13 +66,14 @@ _filter_qemu_img_check()
 # Removes \r from messages
 _filter_win32()
 {
-$SED -e 's/\r//g'
+gsed -e 's/\r//g'
 }
 
 # sanitize qemu-io output
 _filter_qemu_io()
 {
-_filter_win32 | $SED -e "s/[0-9]* ops\; [0-9/:. sec]* ([0-9/.inf]* 
[EPTGMKiBbytes]*\/sec and [0-9/.inf]* ops\/sec)/X ops\; XX:XX:XX.X (XXX 
YYY\/sec and XXX ops\/sec)/" \
+_filter_win32 | \
+gsed -e "s/[0-9]* ops\; [0-9/:. sec]* ([0-9/.inf]* [

Re: [PATCH] tests/qemu-iotests: Rework the checks and spots using GNU sed

2022-02-16 Thread Thomas Huth

On 15/02/2022 23.10, Eric Blake wrote:

On Tue, Feb 15, 2022 at 02:20:31PM +0100, Thomas Huth wrote:

Instead of failing the iotests if GNU sed is not available (or skipping
them completely in the check-block.sh script), it would be better to
simply skip the bash-based tests that rely on GNU sed, so that the other
tests could still be run. Thus we now explicitly use "gsed" (either as
direct program or as a wrapper around "sed" if it's the GNU version)
in the spots that rely on the GNU sed behavior. Then we also remove the
sed checks from the check-block.sh script, so that "make check-block"
can now be run on systems without GNU sed, too.

...

diff --git a/tests/qemu-iotests/common.filter b/tests/qemu-iotests/common.filter
index 935217aa65..a3b1708adc 100644
--- a/tests/qemu-iotests/common.filter
+++ b/tests/qemu-iotests/common.filter
@@ -21,58 +21,58 @@
  
  _filter_date()

  {
-$SED -re 's/[0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}/-mm-dd hh:mm:ss/'
+gsed -re 's/[0-9]{4}-[0-9]{2}-[0-9]{2} 
[0-9]{2}:[0-9]{2}:[0-9]{2}/-mm-dd hh:mm:ss/'


GNU sed recommends spelling it 'sed -E', not 'sed -r', when using
extended regex.  Older POSIX did not support either spelling, but the
next version will require -E, as many implementations have it now:
https://www.austingroupbugs.net/view.php?id=528


Thanks for the pointer ... I originally checked "man 1p sed" on
my system and did not see -r or -E in there, so I thought that
this must really be something specific to GNU sed. But now that
you've mentioned this, I just double-checked the build environments
that we support with QEMU, and it seems like -E should be supported
everywhere:

macOS 11:

$ sed --help
sed: illegal option -- -
usage: sed script [-Ealnru] [-i extension] [file ...]
sed [-Ealnu] [-i extension] [-e script] ... [-f script_file] ... [file 
...]


NetBSD 9.2:

$ sed --help
sed: unknown option -- -
Usage:  sed [-aElnru] command [file ...]
sed [-aElnru] [-e command] [-f command_file] [-I[extension]]
[-i[extension]] [file ...]


FreeBSD 12.3:

$ sed --help
sed: illegal option -- -
usage: sed script [-Ealnru] [-i extension] [file ...]
sed [-Ealnu] [-i extension] [-e script] ... [-f script_file] ... [file 
...]


OpenBSD 7.0:

$ sed --help
sed: unknown option -- -
usage: sed [-aEnru] [-i[extension]] command [file ...]
   sed [-aEnru] [-e command] [-f command_file] [-i[extension]] [file ...]


Illumos:

Has -E according to https://illumos.org/man/1/sed


Busybox:

Has -E according to https://www.commandlinux.com/man-page/man1/busybox.1.html


Haiku:

Seems to use GNU sed, so -E is available.


We will likely never run the iotests on Windows, so I did not check
those (but I assume MSYS and friends are using GNU sed, too).


So I think it should be safe to change these spots that are
using "-r" to "sed -E" (and not use gsed here).
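For example, a date filter along the lines of the one in common.filter works unchanged with -E on both GNU and BSD sed (the example input and replacement text here are assumed, not copied from the file):

```shell
# ERE interval expressions like {4} are accepted by both GNU and BSD sed
# when invoked with -E:
echo 'image created 2022-02-16 13:54:54' | \
    sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}/yyyy-mm-dd hh:mm:ss/'
```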


Other than the fact that this was easier to write with ERE, I'm not
seeing any other GNU-only extensions in use here; but I'd recommend
that as long as we're touching the line, we spell it 'gsed -Ee'
instead of -re (here, and in several other places).


  _filter_qom_path()
  {
-$SED -e '/Attached to:/s/\device[[0-9]\+\]/device[N]/g'
+gsed -e '/Attached to:/s/\device[[0-9]\+\]/device[N]/g'


Here, it is our use of \+ that is a GNU sed extension, although it is
fairly easy (but verbose) to translate that one to portable sed
(\+ is the same as *).  So gsed is correct.  On the
other hand, the use of [[0-9]\+\] looks ugly - it probably does NOT
match what we meant (we have a bracket expression '[...]' that matches
the 11 characters [ and 0-9, then '\+' to match that bracket
expression 1 or more times, then '\]' which in its context is
identical to ']' to match a closing ], since only opening [ needs \
escaping for literal treatment).  What we probably meant is:

'/Attached to:/s/\device\[[0-9][0-9]*]/device[N]/g'

at which point normal sed would do.
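A quick demonstration of that mismatch (GNU sed assumed, since \+ is a GNU extension):

```shell
# [[0-9] is a bracket expression matching '[' OR a digit, so the buggy
# pattern matches a digit run before ']' even with no '[' present:
echo '33]' | sed 's/[[0-9]\+\]/MATCH/'        # matches (prints MATCH)
# The intended pattern requires a literal '[', so '33]' is left alone:
echo '33]' | sed 's/\[[0-9][0-9]*]/MATCH/'    # no match (prints 33])
```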


Ok ... but I'd prefer to clean up such spots in a second step,
to make sure not to introduce bugs and to make review easier.


  # Removes \r from messages
  _filter_win32()
  {
-$SED -e 's/\r//g'
+gsed -e 's/\r//g'


Yep, \r is another GNU sed extension.


  }
  
  # sanitize qemu-io output

  _filter_qemu_io()
  {
-_filter_win32 | $SED -e "s/[0-9]* ops\; [0-9/:. sec]* ([0-9/.inf]* 
[EPTGMKiBbytes]*\/sec and [0-9/.inf]* ops\/sec)/X ops\; XX:XX:XX.X (XXX YYY\/sec and XXX 
ops\/sec)/" \
+_filter_win32 | gsed -e "s/[0-9]* ops\; [0-9/:. sec]* ([0-9/.inf]* 
[EPTGMKiBbytes]*\/sec and [0-9/.inf]* ops\/sec)/X ops\; XX:XX:XX.X (XXX YYY\/sec and XXX 
ops\/sec)/" \
  -e "s/: line [0-9][0-9]*:  *[0-9][0-9]*\( Aborted\| Killed\)/:\1/" \
  -e "s/qemu-io> //g"


I'm not seeing anything specific to GNU sed in this (long) sed script;
can we relax this one to plain 'sed'?  Use of s#some/text## might be
easier than having to s/some\/text//, but that's not essential.


If I switch that to plain sed

[PATCH v2 3/3] iotests/graph-changes-while-io: New test

2022-02-16 Thread Hanna Reitz
Test the following scenario:
1. Some block node (null-co) attached to a user (here: NBD server) that
   performs I/O and keeps the node in an I/O thread
2. Repeatedly run blockdev-add/blockdev-del to add/remove an overlay
   to/from that node

Each blockdev-add triggers bdrv_refresh_limits(), and because
blockdev-add runs in the main thread, it does not stop the I/O requests.
I/O can thus happen while the limits are refreshed, and when such a
request sees a temporarily invalid block limit (e.g. alignment is 0),
this may easily crash qemu (or the storage daemon in this case).

The block layer needs to ensure that I/O requests to a node are paused
while that node's BlockLimits are refreshed.

Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
---
 .../qemu-iotests/tests/graph-changes-while-io | 91 +++
 .../tests/graph-changes-while-io.out  |  5 +
 2 files changed, 96 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/graph-changes-while-io
 create mode 100644 tests/qemu-iotests/tests/graph-changes-while-io.out

diff --git a/tests/qemu-iotests/tests/graph-changes-while-io 
b/tests/qemu-iotests/tests/graph-changes-while-io
new file mode 100755
index 00..567e8cf21e
--- /dev/null
+++ b/tests/qemu-iotests/tests/graph-changes-while-io
@@ -0,0 +1,91 @@
+#!/usr/bin/env python3
+# group: rw
+#
+# Test graph changes while I/O is happening
+#
+# Copyright (C) 2022 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+import os
+from threading import Thread
+import iotests
+from iotests import imgfmt, qemu_img, qemu_img_create, QMPTestCase, \
+QemuStorageDaemon
+
+
+top = os.path.join(iotests.test_dir, 'top.img')
+nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')
+
+
+def do_qemu_img_bench() -> None:
+"""
+Do some I/O requests on `nbd_sock`.
+"""
+assert qemu_img('bench', '-f', 'raw', '-c', '200',
+f'nbd+unix:///node0?socket={nbd_sock}') == 0
+
+
+class TestGraphChangesWhileIO(QMPTestCase):
+def setUp(self) -> None:
+# Create an overlay that can be added at runtime on top of the
+# null-co block node that will receive I/O
+assert qemu_img_create('-f', imgfmt, '-F', 'raw', '-b', 'null-co://',
+   top) == 0
+
+# QSD instance with a null-co block node in an I/O thread,
+# exported over NBD (on `nbd_sock`, export name "node0")
+self.qsd = QemuStorageDaemon(
+'--object', 'iothread,id=iothread0',
+'--blockdev', 'null-co,node-name=node0,read-zeroes=true',
+'--nbd-server', f'addr.type=unix,addr.path={nbd_sock}',
+'--export', 'nbd,id=exp0,node-name=node0,iothread=iothread0,' +
+'fixed-iothread=true,writable=true',
+qmp=True
+)
+
+def tearDown(self) -> None:
+self.qsd.stop()
+
+def test_blockdev_add_while_io(self) -> None:
+# Run qemu-img bench in the background
+bench_thr = Thread(target=do_qemu_img_bench)
+bench_thr.start()
+
+# While qemu-img bench is running, repeatedly add and remove an
+# overlay to/from node0
+while bench_thr.is_alive():
+result = self.qsd.qmp('blockdev-add', {
+'driver': imgfmt,
+'node-name': 'overlay',
+'backing': 'node0',
+'file': {
+'driver': 'file',
+'filename': top
+}
+})
+self.assert_qmp(result, 'return', {})
+
+result = self.qsd.qmp('blockdev-del', {
+'node-name': 'overlay'
+})
+self.assert_qmp(result, 'return', {})
+
+bench_thr.join()
+
+if __name__ == '__main__':
+# Format must support raw backing files
+iotests.main(supported_fmts=['qcow', 'qcow2', 'qed'],
+ supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out 
b/tests/qemu-iotests/tests/graph-changes-while-io.out
new file mode 100644
index 00..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/graph-changes-while-io.out
@@ -0,0 +1,5 @@
+.
+--
+Ran 1 tests
+
+OK
-- 
2.34.1




[PATCH v2 2/3] iotests: Allow using QMP with the QSD

2022-02-16 Thread Hanna Reitz
Add a parameter to optionally open a QMP connection when creating a
QemuStorageDaemon instance.

Signed-off-by: Hanna Reitz 
---
 tests/qemu-iotests/iotests.py | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 6ba65eb1ff..6027780180 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -39,6 +39,7 @@
 
 from qemu.machine import qtest
 from qemu.qmp import QMPMessage
+from qemu.aqmp.legacy import QEMUMonitorProtocol
 
 # Use this logger for logging messages directly from the iotests module
 logger = logging.getLogger('qemu.iotests')
@@ -348,14 +349,30 @@ def cmd(self, cmd):
 
 
 class QemuStorageDaemon:
-def __init__(self, *args: str, instance_id: str = 'a'):
+_qmp: Optional[QEMUMonitorProtocol] = None
+_qmpsock: Optional[str] = None
+# Python < 3.8 would complain if this type were not a string literal
+# (importing `annotations` from `__future__` would work; but not on <= 3.6)
+_p: 'Optional[subprocess.Popen[bytes]]' = None
+
+def __init__(self, *args: str, instance_id: str = 'a', qmp: bool = False):
 assert '--pidfile' not in args
 self.pidfile = os.path.join(test_dir, f'qsd-{instance_id}-pid')
 all_args = [qsd_prog] + list(args) + ['--pidfile', self.pidfile]
 
+if qmp:
+self._qmpsock = os.path.join(sock_dir, f'qsd-{instance_id}.sock')
+all_args += ['--chardev',
+ f'socket,id=qmp-sock,path={self._qmpsock}',
+ '--monitor', 'qmp-sock']
+
+self._qmp = QEMUMonitorProtocol(self._qmpsock, server=True)
+
 # Cannot use with here, we want the subprocess to stay around
 # pylint: disable=consider-using-with
 self._p = subprocess.Popen(all_args)
+if self._qmp is not None:
+self._qmp.accept()
 while not os.path.exists(self.pidfile):
 if self._p.poll() is not None:
 cmd = ' '.join(all_args)
@@ -370,11 +387,24 @@ def __init__(self, *args: str, instance_id: str = 'a'):
 
 assert self._pid == self._p.pid
 
+def qmp(self, cmd: str, args: Optional[Dict[str, object]] = None) \
+-> QMPMessage:
+assert self._qmp is not None
+return self._qmp.cmd(cmd, args)
+
 def stop(self, kill_signal=15):
 self._p.send_signal(kill_signal)
 self._p.wait()
 self._p = None
 
+if self._qmp:
+self._qmp.close()
+
+if self._qmpsock is not None:
+try:
+os.remove(self._qmpsock)
+except OSError:
+pass
 try:
 os.remove(self.pidfile)
 except OSError:
-- 
2.34.1




[PATCH v2 1/3] block: Make bdrv_refresh_limits() non-recursive

2022-02-16 Thread Hanna Reitz
bdrv_refresh_limits() recurses down to the node's children.  That does
not seem necessary: When we refresh limits on some node, and then
recurse down and were to change one of its children's BlockLimits, then
that would mean we noticed the changed limits by pure chance.  The fact
that we refresh the parent's limits has nothing to do with it, so the
reason for the change probably happened before this point in time, and
we should have refreshed the limits then.

On the other hand, we do not have infrastructure for noticing that block
limits change after they have been initialized for the first time (this
would require propagating the change upwards to the respective node's
parents), and so evidently we consider this case impossible.

If this case is impossible, then we will not need to recurse down in
bdrv_refresh_limits().  Every node's limits are initialized in
bdrv_open_driver(), and are refreshed whenever its children change.
We want to use the children's limits to get some initial default, but we
can just take them, we do not need to refresh them.

The problem with recursing is that bdrv_refresh_limits() is not atomic.
It begins with zeroing BDS.bl, and only then sets proper, valid limits.
If we do not drain all nodes whose limits are refreshed, then concurrent
I/O requests can encounter invalid request_alignment values and crash
qemu.  Therefore, a recursing bdrv_refresh_limits() requires the whole
subtree to be drained, which is currently not ensured by most callers.

A non-recursive bdrv_refresh_limits() only requires the node in question
to not receive I/O requests, and this is done by most callers in some
way or another:
- bdrv_open_driver() deals with a new node with no parents yet
- bdrv_set_file_or_backing_noperm() acts on a drained node
- bdrv_reopen_commit() acts only on drained nodes
- bdrv_append() should in theory require the node to be drained; in
  practice most callers just lock the AioContext, which should at least
  be enough to prevent concurrent I/O requests from accessing invalid
  limits

So we can resolve the bug by making bdrv_refresh_limits() non-recursive.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1879437
Signed-off-by: Hanna Reitz 
Reviewed-by: Eric Blake 
---
 block/io.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/io.c b/block/io.c
index 4e4cb556c5..c3e7301613 100644
--- a/block/io.c
+++ b/block/io.c
@@ -189,10 +189,6 @@ void bdrv_refresh_limits(BlockDriverState *bs, Transaction 
*tran, Error **errp)
 QLIST_FOREACH(c, &bs->children, next) {
 if (c->role & (BDRV_CHILD_DATA | BDRV_CHILD_FILTERED | BDRV_CHILD_COW))
 {
-bdrv_refresh_limits(c->bs, tran, errp);
-if (*errp) {
-return;
-}
 bdrv_merge_limits(&bs->bl, &c->bs->bl);
 have_limits = true;
 }
-- 
2.34.1




[PATCH v2 0/3] block: Make bdrv_refresh_limits() non-recursive

2022-02-16 Thread Hanna Reitz
Hi,

v1 with detailed reasoning:
https://lists.nongnu.org/archive/html/qemu-block/2022-02/msg00508.html

This series makes bdrv_refresh_limits() non-recursive so that it is
sufficient for callers to ensure that the node on which they call it
will not receive concurrent I/O requests (instead of ensuring the same
for the whole subtree).

We need to ensure such I/O does not happen because bdrv_refresh_limits()
is not atomic and will produce intermediate invalid values, which will
break concurrent I/O requests that read these values.


v2:
- Use separate `try` block to clean up in patch 2 instead of putting the
  `os.remove()` in the existing one (which would cause the second
  `os.remove()` to be skipped if the first one failed)
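The cleanup pattern being referred to, sketched with hypothetical names: each removal gets its own `try` block so one failure cannot skip the other.

```python
import os
from typing import Optional

def cleanup(qmpsock: Optional[str], pidfile: str) -> None:
    # Separate try blocks: if removing the QMP socket fails (e.g. it was
    # never created), the pidfile must still be removed, and vice versa.
    try:
        if qmpsock is not None:
            os.remove(qmpsock)
    except OSError:
        pass
    try:
        os.remove(pidfile)
    except OSError:
        pass
```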


git-backport-diff against v1:

Key:
[] : patches are identical
[] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/3:[] [--] 'block: Make bdrv_refresh_limits() non-recursive'
002/3:[0005] [FC] 'iotests: Allow using QMP with the QSD'
003/3:[] [--] 'iotests/graph-changes-while-io: New test'


Hanna Reitz (3):
  block: Make bdrv_refresh_limits() non-recursive
  iotests: Allow using QMP with the QSD
  iotests/graph-changes-while-io: New test

 block/io.c|  4 -
 tests/qemu-iotests/iotests.py | 32 ++-
 .../qemu-iotests/tests/graph-changes-while-io | 91 +++
 .../tests/graph-changes-while-io.out  |  5 +
 4 files changed, 127 insertions(+), 5 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/graph-changes-while-io
 create mode 100644 tests/qemu-iotests/tests/graph-changes-while-io.out

-- 
2.34.1




Re: [PATCH v2] nbd/server: Allow MULTI_CONN for shared writable exports

2022-02-16 Thread Richard W.M. Jones
On Tue, Feb 15, 2022 at 05:24:14PM -0600, Eric Blake wrote:
> Oh. The QMP command (which is immediately visible through
> nbd-server-add/block-storage-add to qemu and qemu-storage-daemon)
> gains "multi-conn":"on", but you may be right that qemu-nbd would want
> a command line option (either that, or we accelerate our plans that
> qsd should replace qemu-nbd).

I really hope there will always be something called "qemu-nbd"
that acts like qemu-nbd.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/




Re: [PATCH 2/3] iotests: Allow using QMP with the QSD

2022-02-16 Thread Hanna Reitz

On 15.02.22 23:19, Eric Blake wrote:

On Tue, Feb 15, 2022 at 02:57:26PM +0100, Hanna Reitz wrote:

Add a parameter to optionally open a QMP connection when creating a
QemuStorageDaemon instance.

Signed-off-by: Hanna Reitz 
---
  tests/qemu-iotests/iotests.py | 29 -
  1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 6ba65eb1ff..47e3808ab9 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -39,6 +39,7 @@
  
  from qemu.machine import qtest

  from qemu.qmp import QMPMessage
+from qemu.aqmp.legacy import QEMUMonitorProtocol

I thought we were trying to get rid of aqmp.legacy usage, so this
feels like a temporary regression.  Oh well, not the end of the
testing world.


I fiddled around with the non-legacy interface and wasn’t very 
successful...  I thought since machine.py still uses qemu.aqmp.legacy 
for QEMUMachine, when one is reworked to get rid of it (if that ever 
becomes necessary), then we can just do it here, too.





  def stop(self, kill_signal=15):
  self._p.send_signal(kill_signal)
  self._p.wait()
  self._p = None
  
+if self._qmp:

+self._qmp.close()
+
  try:
+if self._qmpsock is not None:
+os.remove(self._qmpsock)
  os.remove(self.pidfile)
  except OSError:
  pass

Do we need two try: blocks here, to remove self.pidfile even if
os.remove(self._qmpsock) failed?


Honestly, no reason not to use two blocks except it being longer. You’re 
right, I should’ve just done that.



Otherwise, makes sense to me.


Thanks for reviewing!

Hanna




Re: [PATCH 3/3] iotests/graph-changes-while-io: New test

2022-02-16 Thread Hanna Reitz

On 15.02.22 23:22, Eric Blake wrote:

On Tue, Feb 15, 2022 at 02:57:27PM +0100, Hanna Reitz wrote:

Test the following scenario:
1. Some block node (null-co) attached to a user (here: NBD server) that
performs I/O and keeps the node in an I/O thread
2. Repeatedly run blockdev-add/blockdev-del to add/remove an overlay
to/from that node

Each blockdev-add triggers bdrv_refresh_limits(), and because
blockdev-add runs in the main thread, it does not stop the I/O requests.
I/O can thus happen while the limits are refreshed, and when such a
request sees a temporarily invalid block limit (e.g. alignment is 0),
this may easily crash qemu (or the storage daemon in this case).

The block layer needs to ensure that I/O requests to a node are paused
while that node's BlockLimits are refreshed.

Signed-off-by: Hanna Reitz 
---
  .../qemu-iotests/tests/graph-changes-while-io | 91 +++
  .../tests/graph-changes-while-io.out  |  5 +
  2 files changed, 96 insertions(+)
  create mode 100755 tests/qemu-iotests/tests/graph-changes-while-io
  create mode 100644 tests/qemu-iotests/tests/graph-changes-while-io.out

Reviewed-by: Eric Blake 

Since we found this with the help of NBD, should I be considering this
series for my NBD queue, or is there a better block-related maintainer
queue that it should go through?


Well, we found it by using a guest, it’s just that using a guest in the 
iotests is not quite that great, so we need some other way to induce I/O 
(concurrently with monitor commands).  I could’ve used FUSE, too, but NBD 
is always compiled in. :)


In any case, of course I don’t mind who takes this series.  If you want 
to take it, go ahead (and thanks!) – I’ll be sending a v2 to split the 
`try` block in patch 2, though.


Hanna




Re: [PATCH 1/4] block: bdrv_merge_dirty_bitmap: add return value

2022-02-16 Thread Vladimir Sementsov-Ogievskiy

I forgot that I already sent this in another series: [PATCH v3 02/19] 
block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return value

"[PATCH v3 02/19] block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return 
value" is a bit better as it adds a comment, and it has Hanna's r-b.

15.02.2022 20:53, Vladimir Sementsov-Ogievskiy wrote:

Add return value to bdrv_merge_dirty_bitmap() and use it to reduce
error propagation.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/dirty-bitmap.h| 2 +-
  block/dirty-bitmap.c| 6 --
  block/monitor/bitmap-qmp-cmds.c | 5 +
  3 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 40950ae3d5..f95d350b70 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -77,7 +77,7 @@ void bdrv_dirty_bitmap_set_persistence(BdrvDirtyBitmap 
*bitmap,
 bool persistent);
  void bdrv_dirty_bitmap_set_inconsistent(BdrvDirtyBitmap *bitmap);
  void bdrv_dirty_bitmap_set_busy(BdrvDirtyBitmap *bitmap, bool busy);
-void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
+bool bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
   HBitmap **backup, Error **errp);
  void bdrv_dirty_bitmap_skip_store(BdrvDirtyBitmap *bitmap, bool skip);
  bool bdrv_dirty_bitmap_get(BdrvDirtyBitmap *bitmap, int64_t offset);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 0ef46163e3..d16b96ee62 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -881,10 +881,10 @@ bool bdrv_dirty_bitmap_next_dirty_area(BdrvDirtyBitmap 
*bitmap,
   *
   * @backup: If provided, make a copy of dest here prior to merge.
   */
-void bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
+bool bdrv_merge_dirty_bitmap(BdrvDirtyBitmap *dest, const BdrvDirtyBitmap *src,
   HBitmap **backup, Error **errp)
  {
-bool ret;
+bool ret = false;
  
  bdrv_dirty_bitmaps_lock(dest->bs);

  if (src->bs != dest->bs) {
@@ -912,6 +912,8 @@ out:
  if (src->bs != dest->bs) {
  bdrv_dirty_bitmaps_unlock(src->bs);
  }
+
+return ret;
  }
  
  /**

diff --git a/block/monitor/bitmap-qmp-cmds.c b/block/monitor/bitmap-qmp-cmds.c
index 9f11deec64..83970b22fa 100644
--- a/block/monitor/bitmap-qmp-cmds.c
+++ b/block/monitor/bitmap-qmp-cmds.c
@@ -259,7 +259,6 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, 
const char *target,
  BlockDriverState *bs;
  BdrvDirtyBitmap *dst, *src, *anon;
  BlockDirtyBitmapMergeSourceList *lst;
-Error *local_err = NULL;
  
  dst = block_dirty_bitmap_lookup(node, target, &bs, errp);

  if (!dst) {
@@ -297,9 +296,7 @@ BdrvDirtyBitmap *block_dirty_bitmap_merge(const char *node, 
const char *target,
  abort();
  }
  
-bdrv_merge_dirty_bitmap(anon, src, NULL, &local_err);

-if (local_err) {
-error_propagate(errp, local_err);
+if (!bdrv_merge_dirty_bitmap(anon, src, NULL, errp)) {
  dst = NULL;
  goto out;
  }




--
Best regards,
Vladimir



Re: [PATCH v2] nbd/server: Allow MULTI_CONN for shared writable exports

2022-02-16 Thread Vladimir Sementsov-Ogievskiy

16.02.2022 02:24, Eric Blake wrote:

On Tue, Feb 15, 2022 at 09:23:36PM +0200, Nir Soffer wrote:

On Tue, Feb 15, 2022 at 7:22 PM Eric Blake  wrote:


According to the NBD spec, a server advertising
NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will
not see any cache inconsistencies: when properly separated by a single
flush, actions performed by one client will be visible to another
client, regardless of which client did the flush.  We satisfy these
conditions in qemu when our block layer is backed by the local
filesystem (by virtue of the semantics of fdatasync(), and the fact
that qemu itself is not buffering writes beyond flushes).  It is
harder to state whether we satisfy these conditions for network-based
protocols, so the safest course of action is to allow users to opt-in
to advertising multi-conn.  We may later tweak defaults to advertise
by default when the block layer can confirm that the underlying
protocol driver is cache consistent between multiple writers, but for
now, this at least allows savvy users (such as virt-v2v or nbdcopy) to
explicitly start qemu-nbd or qemu-storage-daemon with multi-conn
advertisement in a known-safe setup where the client end can then
benefit from parallel clients.



It makes sense, and will be used by oVirt. Actually we are already using
multiple connections for writing about 2 years, based on your promise
that if every client writes to district  areas this is safe.


I presume s/district/distinct/, but yes, I'm glad we're finally trying
to make the code match existing practice ;)


+++ b/docs/tools/qemu-nbd.rst
@@ -139,8 +139,7 @@ driver options if ``--image-opts`` is specified.
  .. option:: -e, --shared=NUM

Allow up to *NUM* clients to share the device (default
-  ``1``), 0 for unlimited. Safe for readers, but for now,
-  consistency is not guaranteed between multiple writers.
+  ``1``), 0 for unlimited.



Removing the note means that now consistency is guaranteed between
multiple writers, no?

Or maybe we want to mention here that consistency depends on the protocol
and users can opt in, or refer to the section where this is discussed?


Yeah, a link to the QAPI docs where multi-conn is documented might be
nice, except I'm not sure the best way to do that in our sphinx
documentation setup.


+##
+# @NbdExportMultiConn:
+#
+# Possible settings for advertising NBD multiple client support.
+#
+# @off: Do not advertise multiple clients.
+#
+# @on: Allow multiple clients (for writable clients, this is only safe
+#      if the underlying BDS is cache-consistent, such as when backed
+#      by the raw file driver); ignored if the NBD server was set up
+#      with max-connections of 1.
+#
+# @auto: Behaves like @off if the export is writable, and @on if the
+#        export is read-only.
+#
+# Since: 7.0
+##
+{ 'enum': 'NbdExportMultiConn',
+  'data': ['off', 'on', 'auto'] }
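Assuming the enum is wired into block-export-add as the patch suggests, opting a writable export in over QMP might look like the following (the export id and node name are hypothetical):

```json
{ "execute": "block-export-add",
  "arguments": { "type": "nbd", "id": "export0",
                 "node-name": "disk0", "writable": true,
                 "multi-conn": "on" } }
```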



Are we going to have --multi-conn=(on|off|auto)?


Oh. The QMP command (which is immediately visible through
nbd-server-add/block-storage-add to qemu and qemu-storage-daemon)
gains "multi-conn":"on", but you may be right that qemu-nbd would want
a command line option (either that, or we accelerate our plans that
qsd should replace qemu-nbd).


+++ b/blockdev-nbd.c
@@ -44,6 +44,11 @@ bool nbd_server_is_running(void)
  return nbd_server || is_qemu_nbd;
  }

+int nbd_server_max_connections(void)
+{
+    return nbd_server ? nbd_server->max_connections : -1;
+}



-1 is a little bit strange for a limit; maybe 1 is a better default
when nbd_server == NULL? When can this happen?


In qemu, if you haven't used the QMP command 'nbd-server-start' yet.
In qemu-nbd, always (per the nbd_server_is_running function just
above).  My iotest only covered the qemu/qsd side, not the qemu-nbd
side, so it looks like I need a v3...


+++ b/nbd/server.c



+    /*
+     * Determine whether to advertise multi-conn.  Default is auto,
+     * which resolves to on for read-only and off for writable.  But
+     * if the server has max-connections 1, that forces the flag off.


Looks good, this can be enabled automatically based on the driver
if we want, so users using auto can upgrade to multi-conn automatically.


Yes, that's part of why I made it a tri-state with a default of 'auto'.





+     */
+    if (!arg->has_multi_conn) {
+        arg->multi_conn = NBD_EXPORT_MULTI_CONN_AUTO;
+    }
+    if (nbd_server_max_connections() == 1) {
+        arg->multi_conn = NBD_EXPORT_MULTI_CONN_OFF;
+    }
+    if (arg->multi_conn == NBD_EXPORT_MULTI_CONN_AUTO) {
+        multi_conn = readonly;
+    } else {
+        multi_conn = arg->multi_conn == NBD_EXPORT_MULTI_CONN_ON;
+    }



This part is a little bit confusing - we do:
- initialize arg->multi_conn if it has no value
- set the temporary multi_conn based on the now-initialized arg->multi_conn

I think it would be nicer to separate the argument parsing, so there is no need
to initialize it here or have arg->has_multi_conn, but I did not ch