Re: [PATCH v2 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen

On Jul  7 18:56, Klaus Jensen wrote:

On Jul  7 17:57, Hannes Reinecke wrote:

On 7/7/21 5:49 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch, the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want, since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
hw/nvme/nvme.h   | 15 ---
hw/nvme/ctrl.c   | 14 ++
hw/nvme/ns.c | 18 ++
hw/nvme/subsys.c |  3 +++
4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..83ffabade4cf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,20 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
typedef struct NvmeCtrl NvmeCtrl;
typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+} NvmeBus;
+
#define TYPE_NVME_SUBSYS "nvme-subsys"
#define NVME_SUBSYS(obj) \
OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
typedef struct NvmeSubsystem {
DeviceState parent_obj;
+NvmeBus bus;
uint8_t subnqn[256];
NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +373,6 @@ typedef struct NvmeCQueue {
QTAILQ_HEAD(, NvmeRequest) req_list;
} NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
#define TYPE_NVME "nvme"
#define NVME(obj) \
OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..9a3b3a27c293 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6514,16 +6514,14 @@ static void nvme_exit(PCIDevice *pci_dev)
nvme_ctrl_reset(n);
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (n->subsys) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
+}
}
-nvme_ns_cleanup(ns);


So who is removing the namespaces, then?
I would have expected some cleanup action from the subsystem, seeing 
that we reparent to that ...




Since we "move" the namespaces to the subsystem, and since the 
subsystem is non-hotpluggable, they will (and can) not be removed. In 
the case that there is no subsystem, nvme_ns_unrealize() will be 
called for each child namespace on the controller NvmeBus.



-}
-
-if (n->subsys) {
nvme_subsys_unregister_ctrl(n->subsys, n);
}
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..b7cf1494e75b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -441,6 +441,15 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
}
}
+static void nvme_ns_unrealize(DeviceState *dev)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+}
+
static void nvme_ns_realize(DeviceState *dev, Error **errp)
{
NvmeNamespace *ns = NVME_NS(dev);
@@ -462,6 +471,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
   "linked to an nvme-subsys device");
return;
}
+} else {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}


What happens if that fails?
Will we abort? Not create the namespace?



Good point!

It can actually only fail if the bus implements check_address(), which 
it does not, so it always succeeds; it should assert instead.




Nah, the 'if' is fine. If check_address() is implemented at some 
point, errp will be set and the invocation of qemu will stop with an error. 
So I think the error handling is fine as-is.




[PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmaps

2021-07-07 Thread Eric Blake
Enhance the test to demonstrate behavior of qemu-img with a qcow2
image containing an inconsistent bitmap, and rename it now that we
support useful iotest names.

While at it, fix a missing newline in the error message thus exposed.

Signed-off-by: Eric Blake 
---
 block/dirty-bitmap.c  |  2 +-
 .../{291 => tests/qemu-img-bitmaps}   | 13 +++-
 .../{291.out => tests/qemu-img-bitmaps.out}   | 32 ++-
 3 files changed, 44 insertions(+), 3 deletions(-)
 rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (92%)
 rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (82%)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 68d295d6e3ed..0ef46163e3ea 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -193,7 +193,7 @@ int bdrv_dirty_bitmap_check(const BdrvDirtyBitmap *bitmap, 
uint32_t flags,
 error_setg(errp, "Bitmap '%s' is inconsistent and cannot be used",
bitmap->name);
 error_append_hint(errp, "Try block-dirty-bitmap-remove to delete"
-  " this bitmap from disk");
+  " this bitmap from disk\n");
 return -1;
 }

diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/tests/qemu-img-bitmaps
similarity index 92%
rename from tests/qemu-iotests/291
rename to tests/qemu-iotests/tests/qemu-img-bitmaps
index 20efb080a6c0..76cd9e31e850 100755
--- a/tests/qemu-iotests/291
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -3,7 +3,7 @@
 #
 # Test qemu-img bitmap handling
 #
-# Copyright (C) 2018-2020 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -32,6 +32,7 @@ _cleanup()
 trap "_cleanup; exit \$status" 0 1 2 3 15

 # get standard environment, filters and checks
+cd ..
 . ./common.rc
 . ./common.filter
 . ./common.nbd
@@ -129,6 +130,16 @@ $QEMU_IMG map --output=json --image-opts \

 nbd_server_stop

+echo
+echo "=== Check handling of inconsistent bitmap ==="
+echo
+
+$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
+$QEMU_IMG bitmap --add "$TEST_IMG" b4
+$QEMU_IMG bitmap --remove "$TEST_IMG" b1
+_img_info --format-specific | _filter_irrelevant_img_info
+$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"
+
 # success, all done
 echo '*** done'
 rm -f $seq.full
diff --git a/tests/qemu-iotests/291.out 
b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
similarity index 82%
rename from tests/qemu-iotests/291.out
rename to tests/qemu-iotests/tests/qemu-img-bitmaps.out
index 018d6b103f87..17b34eaed30f 100644
--- a/tests/qemu-iotests/291.out
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
@@ -1,4 +1,4 @@
-QA output created by 291
+QA output created by qemu-img-bitmaps

 === Initial image setup ===

@@ -115,4 +115,34 @@ Format specific information:
 [{ "start": 0, "length": 2097152, "depth": 0, "present": true, "zero": false, 
"data": true, "offset": OFFSET},
 { "start": 2097152, "length": 1048576, "depth": 0, "present": false, "zero": 
false, "data": false},
 { "start": 3145728, "length": 7340032, "depth": 0, "present": true, "zero": 
false, "data": true, "offset": OFFSET}]
+
+=== Check handling of inconsistent bitmap ===
+
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 10 MiB (10485760 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+bitmaps:
+[0]:
+flags:
+[0]: in-use
+[1]: auto
+name: b2
+granularity: 65536
+[1]:
+flags:
+[0]: in-use
+name: b0
+granularity: 65536
+[2]:
+flags:
+[0]: auto
+name: b4
+granularity: 65536
+corrupt: false
+qemu-img: Failed to populate bitmap b0: Bitmap 'b0' is inconsistent and cannot 
be used
+Try block-dirty-bitmap-remove to delete this bitmap from disk
 *** done
-- 
2.31.1




[PATCH 2/2] qemu-img: Add --skip-broken for 'convert --bitmaps'

2021-07-07 Thread Eric Blake
The point of 'qemu-img convert --bitmaps' is to be a convenience for
actions that are already possible through a string of smaller
'qemu-img bitmap' sub-commands.  One situation not accounted for
already is that if a source image contains an inconsistent bitmap (for
example, because a qemu process died abruptly before flushing bitmap
state), the user MUST delete those inconsistent bitmaps before
anything else useful can be done with the image.

We don't want to delete inconsistent bitmaps by default: although a
corrupt bitmap is only a loss of optimization rather than a corruption
of user-visible data, it is still nice to require the user to opt in
to the fact that they are aware of the loss of the bitmap.  Still,
requiring the user to check 'qemu-img info' to see whether bitmaps are
consistent, then use 'qemu-img bitmap --remove' to remove offenders,
all before using 'qemu-img convert', is a lot more work than just
adding a knob 'qemu-img convert --bitmaps --skip-broken' which opts in
to skipping the broken bitmaps.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1946084
Signed-off-by: Eric Blake 
---
 docs/tools/qemu-img.rst   |  8 +++-
 qemu-img.c| 20 +--
 tests/qemu-iotests/tests/qemu-img-bitmaps |  4 
 tests/qemu-iotests/tests/qemu-img-bitmaps.out | 14 +
 4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 1d8470eada0e..5cf1c764597b 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -414,7 +414,7 @@ Command description:
   4
 Error on reading data

-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l 
SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] 
FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps [--skip-broken]] [-U] [-C] [-c] [-p] [-q] [-n] 
[-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o 
OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m 
NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME

   Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
   to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
@@ -456,6 +456,12 @@ Command description:
   *NUM_COROUTINES* specifies how many coroutines work in parallel during
   the convert process (defaults to 8).

+  Use of ``--bitmaps`` requests that any persistent bitmaps present in
+  the original are also copied to the destination.  If any bitmap is
+  inconsistent in the source, the conversion will fail unless
+  ``--skip-broken`` is also specified to copy only the consistent
+  bitmaps.
+
 .. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F 
BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE]

   Create the new disk image *FILENAME* of size *SIZE* and format
diff --git a/qemu-img.c b/qemu-img.c
index 68a4d298098f..e8b012f39c0c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -82,6 +82,7 @@ enum {
 OPTION_MERGE = 274,
 OPTION_BITMAPS = 275,
 OPTION_FORCE = 276,
+OPTION_SKIP_BROKEN = 277,
 };

 typedef enum OutputFormat {
@@ -2101,7 +2102,8 @@ static int convert_do_copy(ImgConvertState *s)
 return s->ret;
 }

-static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst)
+static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst,
+bool skip_broken)
 {
 BdrvDirtyBitmap *bm;
 Error *err = NULL;
@@ -2113,6 +2115,10 @@ static int convert_copy_bitmaps(BlockDriverState *src, 
BlockDriverState *dst)
 continue;
 }
 name = bdrv_dirty_bitmap_name(bm);
+if (skip_broken && bdrv_dirty_bitmap_inconsistent(bm)) {
+warn_report("Skipping inconsistent bitmap %s", name);
+continue;
+}
 qmp_block_dirty_bitmap_add(dst->node_name, name,
true, bdrv_dirty_bitmap_granularity(bm),
true, true,
@@ -2167,6 +2173,7 @@ static int img_convert(int argc, char **argv)
 bool force_share = false;
 bool explict_min_sparse = false;
 bool bitmaps = false;
+bool skip_broken = false;
 int64_t rate_limit = 0;

 ImgConvertState s = (ImgConvertState) {
@@ -2188,6 +2195,7 @@ static int img_convert(int argc, char **argv)
 {"salvage", no_argument, 0, OPTION_SALVAGE},
 {"target-is-zero", no_argument, 0, OPTION_TARGET_IS_ZERO},
 {"bitmaps", no_argument, 0, OPTION_BITMAPS},
+{"skip-broken", no_argument, 0, OPTION_SKIP_BROKEN},
 {0, 0, 0, 0}
 };
 c = getopt_long(argc, argv, 

[PATCH 0/2] Let 'qemu-img convert --bitmaps' skip inconsistent bitmaps

2021-07-07 Thread Eric Blake
This is mostly a convenience factor as one could already use 'qemu-img
info' to learn which bitmaps are broken and then 'qemu-img bitmap
--remove' to nuke them before calling 'qemu-img convert --bitmaps',
but it does have the advantage that the copied file is usable without
extra efforts and the broken bitmap is not deleted from the source
file.

Eric Blake (2):
  iotests: Improve and rename test 291 to qemu-img-bitmaps
  qemu-img: Add --skip-broken for 'convert --bitmaps'

 docs/tools/qemu-img.rst   |  8 +++-
 block/dirty-bitmap.c  |  2 +-
 qemu-img.c| 20 +++-
 .../{291 => tests/qemu-img-bitmaps}   | 17 ++-
 .../{291.out => tests/qemu-img-bitmaps.out}   | 46 ++-
 5 files changed, 87 insertions(+), 6 deletions(-)
 rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (88%)
 rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (76%)

-- 
2.31.1




Re: [PATCH] block/rbd: fix type of task->complete

2021-07-07 Thread Connor Kuehl
On 7/7/21 11:04 AM, Peter Lieven wrote:
> task->complete is a bool not an integer.
> 
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 01a7b94d62..dcf82b15b8 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
> uint64_t size)
>  static void qemu_rbd_finish_bh(void *opaque)
>  {
>  RBDTask *task = opaque;
> -task->complete = 1;
> +task->complete = true;
>  aio_co_wake(task->co);
>  }
>  
> 

Hi Peter,

What tree/QEMU git SHA does this apply to? I am having trouble
finding the definitions for RBDTask and qemu_rbd_finish_bh, and the
patch won't apply to my few-minutes-old clone of upstream.

Connor




Re: [PATCH 0/2] Remove deprecated qemu-img backing file without format

2021-07-07 Thread Eric Blake
On Mon, May 03, 2021 at 04:35:58PM -0500, Eric Blake wrote:
> We've gone enough release cycles without noticeable pushback on our
> intentions, so time to make it harder to create images that can form a
> security hole due to a need for format probing rather than an explicit
> format.
> 
> Eric Blake (2):
>   qcow2: Prohibit backing file changes in 'qemu-img amend'
>   qemu-img: Require -F with -b backing image

Ping.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] nbd: register yank function earlier

2021-07-07 Thread Eric Blake
On Mon, Jul 05, 2021 at 11:07:21AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 04.07.2021 01:07, Lukas Straub wrote:
> > Although unlikely, qemu might hang in nbd_send_request().
> > 
> > Allow recovery in this case by registering the yank function before
> > calling it.
> > 
> > Signed-off-by: Lukas Straub
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 

Thanks, will queue through my NBD tree.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 3/2] qemu-img: Reword 'qemu-img map --output=json' docs

2021-07-07 Thread Nir Soffer
On Wed, Jul 7, 2021 at 9:41 PM Eric Blake  wrote:
>
> Reword the paragraphs to list the JSON key first, rather than in the
> middle of prose.
>
> Suggested-by: Vladimir Sementsov-Ogievskiy 
> Signed-off-by: Eric Blake 
> ---
>  docs/tools/qemu-img.rst | 20 ++--
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
> index d6300f7ee03d..1d8470eada0e 100644
> --- a/docs/tools/qemu-img.rst
> +++ b/docs/tools/qemu-img.rst
> @@ -593,16 +593,16 @@ Command description:
>the ``start``, ``length``, ``offset`` fields;
>it will also include other more specific information:
>
> -  - whether the sectors contain actual data or not (boolean field ``data``;
> -if false, the sectors are either unallocated or stored as optimized
> -all-zero clusters);
> -  - whether the data is known to read as zero (boolean field ``zero``);
> -  - whether the data is actually present (boolean field ``present``);
> -if false, rebasing the backing chain onto a deeper file would pick
> -up data from the deeper file;
> -  - in order to make the output shorter, the target file is expressed as
> -a ``depth``; for example, a depth of 2 refers to the backing file
> -of the backing file of *FILENAME*.
> +  - boolean field ``data``: true if the sectors contain actual data,
> +false if the sectors are either unallocated or stored as optimized
> +all-zero clusters
> +  - boolean field ``zero``: true if the data is known to read as zero
> +  - boolean field ``present``: true if the data belongs to the backing
> +chain, false if rebasing the backing chain onto a deeper file
> +would pick up data from the deeper file;
> +  - integer field ``depth``: the depth within the backing chain at
> +which the data was resolved; for example, a depth of 2 refers to
> +the backing file of the backing file of *FILENAME*.
>
>In JSON format, the ``offset`` field is optional; it is absent in
>cases where ``human`` format would omit the entry or exit with an error.
> --
> 2.31.1

It would be nice if this could be generated from the JSON schema instead
of repeating the type and description of each field, but this is a nice
improvement.

Reviewed-by: Nir Soffer 




[PATCH v3 3/2] qemu-img: Reword 'qemu-img map --output=json' docs

2021-07-07 Thread Eric Blake
Reword the paragraphs to list the JSON key first, rather than in the
middle of prose.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Eric Blake 
---
 docs/tools/qemu-img.rst | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index d6300f7ee03d..1d8470eada0e 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -593,16 +593,16 @@ Command description:
   the ``start``, ``length``, ``offset`` fields;
   it will also include other more specific information:

-  - whether the sectors contain actual data or not (boolean field ``data``;
-if false, the sectors are either unallocated or stored as optimized
-all-zero clusters);
-  - whether the data is known to read as zero (boolean field ``zero``);
-  - whether the data is actually present (boolean field ``present``);
-if false, rebasing the backing chain onto a deeper file would pick
-up data from the deeper file;
-  - in order to make the output shorter, the target file is expressed as
-a ``depth``; for example, a depth of 2 refers to the backing file
-of the backing file of *FILENAME*.
+  - boolean field ``data``: true if the sectors contain actual data,
+false if the sectors are either unallocated or stored as optimized
+all-zero clusters
+  - boolean field ``zero``: true if the data is known to read as zero
+  - boolean field ``present``: true if the data belongs to the backing
+chain, false if rebasing the backing chain onto a deeper file
+would pick up data from the deeper file;
+  - integer field ``depth``: the depth within the backing chain at
+which the data was resolved; for example, a depth of 2 refers to
+the backing file of the backing file of *FILENAME*.

   In JSON format, the ``offset`` field is optional; it is absent in
   cases where ``human`` format would omit the entry or exit with an error.
-- 
2.31.1
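To illustrate the reworded fields, a map entry exercising all four keys could
look like the following. These are illustrative values, not captured
qemu-img output:

```json
[{ "start": 0, "length": 65536, "depth": 0, "present": true,
   "zero": false, "data": true, "offset": 327680 },
 { "start": 65536, "length": 65536, "depth": 2, "present": true,
   "zero": true, "data": false }]
```

The second entry shows data resolved two levels down the backing chain, with
``offset`` omitted as the docs describe.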




[PATCH v3 4/4] replication: Remove workaround

2021-07-07 Thread Lukas Straub
Remove the workaround introduced in commit
6ecbc6c52672db5c13805735ca02784879ce8285
"replication: Avoid blk_make_empty() on read-only child".

It is not needed anymore since s->hidden_disk is guaranteed to be
writable when secondary_do_checkpoint() runs, because replication_start(),
_do_checkpoint() and _stop() are only called by the COLO migration code,
and COLO migration doesn't inactivate disks.

Signed-off-by: Lukas Straub 
---
 block/replication.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index c0d4a6c264..68b46d65a8 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -348,17 +348,7 @@ static void secondary_do_checkpoint(BlockDriverState *bs, 
Error **errp)
 return;
 }

-BlockBackend *blk = blk_new(qemu_get_current_aio_context(),
-BLK_PERM_WRITE, BLK_PERM_ALL);
-blk_insert_bs(blk, s->hidden_disk->bs, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-blk_unref(blk);
-return;
-}
-
-ret = blk_make_empty(blk, errp);
-blk_unref(blk);
+ret = bdrv_make_empty(s->hidden_disk, errp);
 if (ret < 0) {
 return;
 }
--
2.20.1




[PATCH v3 3/4] replication: Properly attach children

2021-07-07 Thread Lukas Straub
The replication driver needs access to the child block-nodes of
its child so it can issue bdrv_make_empty() and bdrv_co_pwritev()
to manage the replication. However, it does this by directly copying
the BdrvChilds, which is wrong.

Fix this by properly attaching the block-nodes with
bdrv_attach_child() and requesting the required permissions.

Signed-off-by: Lukas Straub 
---
 block/replication.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 74adf30f54..c0d4a6c264 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -165,7 +165,12 @@ static void replication_child_perm(BlockDriverState *bs, 
BdrvChild *c,
uint64_t perm, uint64_t shared,
uint64_t *nperm, uint64_t *nshared)
 {
-*nperm = BLK_PERM_CONSISTENT_READ;
+if (role & BDRV_CHILD_PRIMARY) {
+*nperm = BLK_PERM_CONSISTENT_READ;
+} else {
+*nperm = 0;
+}
+
 if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
 *nperm |= BLK_PERM_WRITE;
 }
@@ -552,8 +557,25 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 return;
 }

-s->hidden_disk = hidden_disk;
-s->secondary_disk = secondary_disk;
+bdrv_ref(hidden_disk->bs);
+s->hidden_disk = bdrv_attach_child(bs, hidden_disk->bs, "hidden disk",
+   &child_of_bds, BDRV_CHILD_DATA,
+   &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+aio_context_release(aio_context);
+return;
+}
+
+bdrv_ref(secondary_disk->bs);
+s->secondary_disk = bdrv_attach_child(bs, secondary_disk->bs,
+  "secondary disk", &child_of_bds,
+  BDRV_CHILD_DATA, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+aio_context_release(aio_context);
+return;
+}

 /* start backup job now */
error_setg(&s->blocker,
@@ -659,7 +681,9 @@ static void replication_done(void *opaque, int ret)
 if (ret == 0) {
 s->stage = BLOCK_REPLICATION_DONE;

+bdrv_unref_child(bs, s->secondary_disk);
 s->secondary_disk = NULL;
+bdrv_unref_child(bs, s->hidden_disk);
 s->hidden_disk = NULL;
 s->error = 0;
 } else {
--
2.20.1





[PATCH v3 2/4] replication: Reduce usage of s->hidden_disk and s->secondary_disk

2021-07-07 Thread Lukas Straub
In preparation for the next patch, initialize s->hidden_disk and
s->secondary_disk later and replace access to them with local variables
in the places where they aren't initialized yet.

Signed-off-by: Lukas Straub 
---
 block/replication.c | 45 -
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 50940fbe33..74adf30f54 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -368,27 +368,35 @@ static void reopen_backing_file(BlockDriverState *bs, 
bool writable,
 Error **errp)
 {
 BDRVReplicationState *s = bs->opaque;
+BdrvChild *hidden_disk, *secondary_disk;
 BlockReopenQueue *reopen_queue = NULL;

+/*
+ * s->hidden_disk and s->secondary_disk may not be set yet, as they will
+ * only be set after the children are writable.
+ */
+hidden_disk = bs->file->bs->backing;
+secondary_disk = hidden_disk->bs->backing;
+
 if (writable) {
-s->orig_hidden_read_only = bdrv_is_read_only(s->hidden_disk->bs);
-s->orig_secondary_read_only = bdrv_is_read_only(s->secondary_disk->bs);
+s->orig_hidden_read_only = bdrv_is_read_only(hidden_disk->bs);
+s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
 }

-bdrv_subtree_drained_begin(s->hidden_disk->bs);
-bdrv_subtree_drained_begin(s->secondary_disk->bs);
+bdrv_subtree_drained_begin(hidden_disk->bs);
+bdrv_subtree_drained_begin(secondary_disk->bs);

 if (s->orig_hidden_read_only) {
 QDict *opts = qdict_new();
 qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, hidden_disk->bs,
  opts, true);
 }

 if (s->orig_secondary_read_only) {
 QDict *opts = qdict_new();
 qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, secondary_disk->bs,
  opts, true);
 }

@@ -396,8 +404,8 @@ static void reopen_backing_file(BlockDriverState *bs, bool 
writable,
 bdrv_reopen_multiple(reopen_queue, errp);
 }

-bdrv_subtree_drained_end(s->hidden_disk->bs);
-bdrv_subtree_drained_end(s->secondary_disk->bs);
+bdrv_subtree_drained_end(hidden_disk->bs);
+bdrv_subtree_drained_end(secondary_disk->bs);
 }

 static void backup_job_cleanup(BlockDriverState *bs)
@@ -454,7 +462,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 BlockDriverState *bs = rs->opaque;
 BDRVReplicationState *s;
 BlockDriverState *top_bs;
-BdrvChild *active_disk;
+BdrvChild *active_disk, *hidden_disk, *secondary_disk;
 int64_t active_length, hidden_length, disk_length;
 AioContext *aio_context;
 Error *local_err = NULL;
@@ -499,15 +507,15 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 return;
 }

-s->hidden_disk = active_disk->bs->backing;
-if (!s->hidden_disk->bs || !s->hidden_disk->bs->backing) {
+hidden_disk = active_disk->bs->backing;
+if (!hidden_disk->bs || !hidden_disk->bs->backing) {
 error_setg(errp, "Hidden disk doesn't have backing file");
 aio_context_release(aio_context);
 return;
 }

-s->secondary_disk = s->hidden_disk->bs->backing;
-if (!s->secondary_disk->bs || !bdrv_has_blk(s->secondary_disk->bs)) {
+secondary_disk = hidden_disk->bs->backing;
+if (!secondary_disk->bs || !bdrv_has_blk(secondary_disk->bs)) {
 error_setg(errp, "The secondary disk doesn't have block backend");
 aio_context_release(aio_context);
 return;
@@ -515,8 +523,8 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,

 /* verify the length */
 active_length = bdrv_getlength(active_disk->bs);
-hidden_length = bdrv_getlength(s->hidden_disk->bs);
-disk_length = bdrv_getlength(s->secondary_disk->bs);
+hidden_length = bdrv_getlength(hidden_disk->bs);
+disk_length = bdrv_getlength(secondary_disk->bs);
 if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
 active_length != hidden_length || hidden_length != disk_length) {
 error_setg(errp, "Active disk, hidden disk, secondary disk's 
length"
@@ -526,10 +534,10 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 }

 /* Must be true, or the bdrv_getlength() calls would have failed */
-assert(active_disk->bs->drv && s->hidden_disk->bs->drv);
+assert(active_disk->bs->drv && hidden_disk->bs->drv);

 if 

[PATCH v3 1/4] replication: Remove s->active_disk

2021-07-07 Thread Lukas Straub
s->active_disk is bs->file. Remove it and use local variables instead.

Signed-off-by: Lukas Straub 
---
 block/replication.c | 38 +-
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 52163f2d1f..50940fbe33 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -35,7 +35,6 @@ typedef enum {
 typedef struct BDRVReplicationState {
 ReplicationMode mode;
 ReplicationStage stage;
-BdrvChild *active_disk;
 BlockJob *commit_job;
 BdrvChild *hidden_disk;
 BdrvChild *secondary_disk;
@@ -307,11 +306,15 @@ out:
 return ret;
 }

-static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+static void secondary_do_checkpoint(BlockDriverState *bs, Error **errp)
 {
+BDRVReplicationState *s = bs->opaque;
+BdrvChild *active_disk;
 Error *local_err = NULL;
 int ret;

+active_disk = bs->file;
+
 if (!s->backup_job) {
 error_setg(errp, "Backup job was cancelled unexpectedly");
 return;
@@ -323,13 +326,13 @@ static void secondary_do_checkpoint(BDRVReplicationState 
*s, Error **errp)
 return;
 }

-if (!s->active_disk->bs->drv) {
+if (!active_disk->bs->drv) {
 error_setg(errp, "Active disk %s is ejected",
-   s->active_disk->bs->node_name);
+   active_disk->bs->node_name);
 return;
 }

-ret = bdrv_make_empty(s->active_disk, errp);
+ret = bdrv_make_empty(active_disk, errp);
 if (ret < 0) {
 return;
 }
@@ -451,6 +454,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 BlockDriverState *bs = rs->opaque;
 BDRVReplicationState *s;
 BlockDriverState *top_bs;
+BdrvChild *active_disk;
 int64_t active_length, hidden_length, disk_length;
 AioContext *aio_context;
 Error *local_err = NULL;
@@ -488,15 +492,14 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 case REPLICATION_MODE_PRIMARY:
 break;
 case REPLICATION_MODE_SECONDARY:
-s->active_disk = bs->file;
-if (!s->active_disk || !s->active_disk->bs ||
-!s->active_disk->bs->backing) {
+active_disk = bs->file;
+if (!active_disk || !active_disk->bs || !active_disk->bs->backing) {
 error_setg(errp, "Active disk doesn't have backing file");
 aio_context_release(aio_context);
 return;
 }

-s->hidden_disk = s->active_disk->bs->backing;
+s->hidden_disk = active_disk->bs->backing;
 if (!s->hidden_disk->bs || !s->hidden_disk->bs->backing) {
 error_setg(errp, "Hidden disk doesn't have backing file");
 aio_context_release(aio_context);
@@ -511,7 +514,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 }

 /* verify the length */
-active_length = bdrv_getlength(s->active_disk->bs);
+active_length = bdrv_getlength(active_disk->bs);
 hidden_length = bdrv_getlength(s->hidden_disk->bs);
 disk_length = bdrv_getlength(s->secondary_disk->bs);
 if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
@@ -523,9 +526,9 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 }

 /* Must be true, or the bdrv_getlength() calls would have failed */
-assert(s->active_disk->bs->drv && s->hidden_disk->bs->drv);
+assert(active_disk->bs->drv && s->hidden_disk->bs->drv);

-if (!s->active_disk->bs->drv->bdrv_make_empty ||
+if (!active_disk->bs->drv->bdrv_make_empty ||
 !s->hidden_disk->bs->drv->bdrv_make_empty) {
 error_setg(errp,
"Active disk or hidden disk doesn't support 
make_empty");
@@ -579,7 +582,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
 s->stage = BLOCK_REPLICATION_RUNNING;

 if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
+secondary_do_checkpoint(bs, errp);
 }

 s->error = 0;
@@ -608,7 +611,7 @@ static void replication_do_checkpoint(ReplicationState *rs, 
Error **errp)
 }

 if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
+secondary_do_checkpoint(bs, errp);
 }
 aio_context_release(aio_context);
 }
@@ -645,7 +648,6 @@ static void replication_done(void *opaque, int ret)
 if (ret == 0) {
 s->stage = BLOCK_REPLICATION_DONE;

-s->active_disk = NULL;
 s->secondary_disk = NULL;
 s->hidden_disk = NULL;
 s->error = 0;
@@ -659,11 +661,13 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
 {
 BlockDriverState *bs = rs->opaque;
 BDRVReplicationState *s;
+BdrvChild *active_disk;
 AioContext *aio_context;

 aio_context = 

[PATCH v3 0/4] replication: Properly attach children

2021-07-07 Thread Lukas Straub
Hello Everyone,
A while ago Kevin noticed that the replication driver doesn't properly attach
the children it wants to use. Instead, it directly copies the BdrvChilds from
its backing file, which is wrong. This patchset fixes the problem and removes
the workaround that was put in place back then.

Regards,
Lukas Straub

Changes:

-v3:
-Split up into multiple patches
-Remove s->active_disk
-Clarify child permissions in commit message

-v2: Test for BDRV_CHILD_PRIMARY in replication_child_perm, since
 bs->file might not be set yet. (Vladimir)

Lukas Straub (4):
  replication: Remove s->active_disk
  replication: Reduce usage of s->hidden_disk and s->secondary_disk
  replication: Properly attach children
  replication: Remove workaround

 block/replication.c | 115 +++-
 1 file changed, 72 insertions(+), 43 deletions(-)

--
2.20.1




Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-07 Thread Peter Lieven



> Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
> 
> Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
>>> Am 06.07.2021 um 15:19 schrieb Kevin Wolf :
>>> 
>>> Am 02.07.2021 um 19:23 hat Ilya Dryomov geschrieben:
 This series migrates the qemu rbd driver from the old aio emulation
 to native coroutines and adds write zeroes support which is important
 for block operations.
 
 To achieve this we first bump the librbd requirement to the already
 outdated luminous release of ceph to get rid of some wrappers and
 ifdef'ry in the code.
>>> 
>>> Thanks, applied to the block branch.
>>> 
>>> I've only had a very quick look at the patches, but I think there is one
>>> suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
>>> indirection is probably unnecessary now because aio_co_wake() is thread
>>> safe.
>> 
>> But this is new, isn’t it?
> 
> Not sure in what sense you mean. aio_co_wake() has always been thread
> safe, as far as I know.
> 
> Obviously, the old code didn't use aio_co_wake(), but directly called
> some callbacks, so the part that is new is your coroutine conversion
> that enables getting rid of the BH.
> 
>> We also have this indirection in iscsi and nfs drivers I think.
> 
> Indeed, the resulting code looks the same. In iscsi and nfs it doesn't
> come from an incomplete conversion to coroutines, but they both used
> qemu_coroutine_enter() originally, which resumes the coroutine in the
> current thread...
> 
>> Does it matter that the completion callback is called from an librbd
>> thread? Will the coroutine continue to run in the right thread?
> 
> ...whereas aio_co_wake() resumes the coroutine in the thread where it
> was running before.
> 
> (Before commit 5f50be9b5, this would have been buggy from an librbd
> thread, but now it should work correctly even for threads that are
> neither iothreads nor vcpu threads.)
> 
>> I will have a decent look after my vacation.
> 
> Sounds good, thanks. Enjoy your vacation!


As I had to fire up my laptop to look into another issue anyway, I have sent 
two patches for updating MAINTAINERS and to fix the int vs. bool mix for 
task->complete. As Paolo's fix (5f50be9b5) is relatively new, there may be
other non-obvious problems when removing the BH indirection, and we are close
to soft freeze, so I would leave the BH removal change for 6.2.

Best,
Peter





[PATCH] MAINTAINERS: update block/rbd.c maintainer

2021-07-07 Thread Peter Lieven
adding myself as a designated reviewer.

Signed-off-by: Peter Lieven 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 516db737d1..cfda57e825 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3058,6 +3058,7 @@ F: block/vmdk.c
 
 RBD
 M: Ilya Dryomov 
+R: Peter Lieven 
 L: qemu-block@nongnu.org
 S: Supported
 F: block/rbd.c
-- 
2.17.1





[PATCH] block/rbd: fix type of task->complete

2021-07-07 Thread Peter Lieven
task->complete is a bool not an integer.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 01a7b94d62..dcf82b15b8 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
 static void qemu_rbd_finish_bh(void *opaque)
 {
 RBDTask *task = opaque;
-task->complete = 1;
+task->complete = true;
 aio_co_wake(task->co);
 }
 
-- 
2.17.1





[RFC PATCH 6/6] jobs: remove unnecessary AioContext acquire/release pairs

2021-07-07 Thread Emanuele Giuseppe Esposito
Now that we use the job_mutex, remove unnecessary aio_context_acquire/release
pairs. However, some places still need them, so try to reduce the
aio_context critical sections to the minimum.

This patch is separated from the previous one because here we are removing
locks without substituting them with aio_context_acquire/release pairs.

These sections will also be removed in the future, when the underlying bdrv_*
API will also be free of context locks.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 block/mirror.c |   6 ++
 block/monitor/block-hmp-cmds.c |   6 --
 blockdev.c | 173 -
 blockjob.c |   3 +
 job.c  |   9 +-
 qemu-img.c |   4 -
 6 files changed, 54 insertions(+), 147 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index deefaa6a39..8d30c53690 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1857,6 +1857,7 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
 {
 bool is_none_mode;
 BlockDriverState *base;
+AioContext *aio_context;
 
 if ((mode == MIRROR_SYNC_MODE_INCREMENTAL) ||
 (mode == MIRROR_SYNC_MODE_BITMAP)) {
@@ -1866,11 +1867,16 @@ void mirror_start(const char *job_id, BlockDriverState *bs,
 }
 is_none_mode = mode == MIRROR_SYNC_MODE_NONE;
 base = mode == MIRROR_SYNC_MODE_TOP ? bdrv_backing_chain_next(bs) : NULL;
+
+aio_context = bdrv_get_aio_context(bs);
+aio_context_acquire(aio_context);
 mirror_start_job(job_id, bs, creation_flags, target, replaces,
  speed, granularity, buf_size, backing_mode, zero_target,
  on_source_error, on_target_error, unmap, NULL, NULL,
 &mirror_job_driver, is_none_mode, base, false,
  filter_node_name, true, copy_mode, errp);
+aio_context_release(aio_context);
+
 }
 
 BlockJob *commit_active_start(const char *job_id, BlockDriverState *bs,
diff --git a/block/monitor/block-hmp-cmds.c b/block/monitor/block-hmp-cmds.c
index 3e6670c963..99095afae7 100644
--- a/block/monitor/block-hmp-cmds.c
+++ b/block/monitor/block-hmp-cmds.c
@@ -206,7 +206,6 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
 ret = blk_commit_all();
 } else {
 BlockDriverState *bs;
-AioContext *aio_context;
 
 blk = blk_by_name(device);
 if (!blk) {
@@ -219,12 +218,7 @@ void hmp_commit(Monitor *mon, const QDict *qdict)
 }
 
 bs = bdrv_skip_implicit_filters(blk_bs(blk));
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
 ret = bdrv_commit(bs);
-
-aio_context_release(aio_context);
 }
 if (ret < 0) {
 error_report("'commit' error for '%s': %s", device, strerror(-ret));
diff --git a/blockdev.c b/blockdev.c
index 9255aea6a2..119cb9a539 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -147,13 +147,8 @@ void blockdev_mark_auto_del(BlockBackend *blk)
 
 for (job = block_job_next(NULL); job; job = block_job_next(job)) {
 if (block_job_has_bdrv(job, blk_bs(blk))) {
-AioContext *aio_context = job_get_aiocontext(&job->job);
-aio_context_acquire(aio_context);
-
 job_lock();
job_cancel(&job->job, false);
-
-aio_context_release(aio_context);
 job_unlock();
 }
 }
@@ -1714,7 +1709,6 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 }
 
 aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
 
 /* Paired with .clean() */
 bdrv_drained_begin(bs);
@@ -1726,7 +1720,7 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 
 /* Early check to avoid creating target */
 if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_BACKUP_SOURCE, errp)) {
-goto out;
+return;
 }
 
 flags = bs->open_flags | BDRV_O_RDWR;
@@ -1756,7 +1750,7 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 size = bdrv_getlength(bs);
 if (size < 0) {
 error_setg_errno(errp, -size, "bdrv_getlength failed");
-goto out;
+return;
 }
 
 if (backup->mode != NEW_IMAGE_MODE_EXISTING) {
@@ -1779,7 +1773,7 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 
 if (local_err) {
 error_propagate(errp, local_err);
-goto out;
+return;
 }
 
 options = qdict_new();
@@ -1791,12 +1785,11 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)
 
 target_bs = bdrv_open(backup->target, NULL, options, flags, errp);
 if (!target_bs) {
-goto out;
+return;
 }
 
 /* Honor bdrv_try_set_aio_context() context acquisition requirements. */
 old_context = bdrv_get_aio_context(target_bs);
-aio_context_release(aio_context);
 aio_context_acquire(old_context);
 
 ret = bdrv_try_set_aio_context(target_bs, 

[RFC PATCH 1/6] job: use getter/setters instead of accessing the Job fields directly

2021-07-07 Thread Emanuele Giuseppe Esposito
Using getters/setters gives us stricter control over the struct Job
fields. The struct remains public, because it is also used as a base
class for BlockJobs and others, but all direct accesses to
the fields we want to protect are replaced with getters/setters.
This is in preparation for the locking patches.

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/job.h  | 33 +++
 block.c |  2 +-
 block/commit.c  |  4 +--
 block/mirror.c  | 17 +-
 block/replication.c |  3 +-
 blockdev.c  |  2 +-
 blockjob.c  | 78 -
 job-qmp.c   | 16 ++
 job.c   | 52 +-
 qemu-img.c  |  2 +-
 10 files changed, 151 insertions(+), 58 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..72c7d0f69d 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -303,6 +303,39 @@ void job_txn_unref(JobTxn *txn);
  */
 void job_txn_add_job(JobTxn *txn, Job *job);
 
+/** Returns the @ret field of a given Job. */
+int job_get_ret(Job *job);
+
+/** Returns the AioContext of a given Job. */
+AioContext *job_get_aiocontext(Job *job);
+
+/** Sets the AioContext of a given Job. */
+void job_set_aiocontext(Job *job, AioContext *aio);
+
+/** Returns whether a given Job is busy. */
+bool job_is_busy(Job *job);
+
+/** Returns the Error of a given Job. */
+Error *job_get_err(Job *job);
+
+/** Returns whether a Job has a pause_count > 0. */
+bool job_should_pause(Job *job);
+
+/** Sets the user_paused flag of a given Job to true. */
+void job_set_user_paused(Job *job);
+
+/** Sets the cancelled flag of a given Job. */
+void job_set_cancelled(Job *job, bool cancel);
+
+/** Returns whether a given Job is paused. */
+bool job_is_paused(Job *job);
+
+/** Returns whether a given Job is force cancelled. */
+bool job_is_force_cancel(Job *job);
+
+/** Returns the status of a given Job. */
+JobStatus job_get_status(Job *job);
+
 /**
  * Create a new long-running job and return it.
  *
diff --git a/block.c b/block.c
index acd35cb0cb..1628db2550 100644
--- a/block.c
+++ b/block.c
@@ -5721,7 +5721,7 @@ XDbgBlockGraph *bdrv_get_xdbg_block_graph(Error **errp)
 GSList *el;
 
 xdbg_graph_add_node(gr, job, X_DBG_BLOCK_GRAPH_NODE_TYPE_BLOCK_JOB,
-   job->job.id);
+job->job.id);
 for (el = job->nodes; el; el = el->next) {
 xdbg_graph_add_edge(gr, job, (BdrvChild *)el->data);
 }
diff --git a/block/commit.c b/block/commit.c
index 42792b4556..087865953e 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -367,7 +367,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 goto fail;
 }
 
-s->base = blk_new(s->common.job.aio_context,
+s->base = blk_new(job_get_aiocontext(&s->common.job),
   base_perms,
   BLK_PERM_CONSISTENT_READ
   | BLK_PERM_GRAPH_MOD
@@ -380,7 +380,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 s->base_bs = base;
 
 /* Required permissions are already taken with block_job_add_bdrv() */
-s->top = blk_new(s->common.job.aio_context, 0, BLK_PERM_ALL);
+s->top = blk_new(job_get_aiocontext(&s->common.job), 0, BLK_PERM_ALL);
 ret = blk_insert_bs(s->top, top, errp);
 if (ret < 0) {
 goto fail;
diff --git a/block/mirror.c b/block/mirror.c
index 019f6deaa5..49fffa 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -636,7 +636,7 @@ static int mirror_exit_common(Job *job)
 BlockDriverState *target_bs;
 BlockDriverState *mirror_top_bs;
 Error *local_err = NULL;
-bool abort = job->ret < 0;
+bool abort = job_get_ret(job) < 0;
 int ret = 0;
 
 if (s->prepared) {
@@ -930,7 +930,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
job_yield(&s->common.job);
 }
-s->common.job.cancelled = false;
+job_set_cancelled(&s->common.job, false);
 goto immediate_exit;
 }
 
@@ -1065,7 +1065,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
  * completion.
  */
assert(QLIST_EMPTY(&s->tracked_requests));
-s->common.job.cancelled = false;
+job_set_cancelled(&s->common.job, false);
 need_drain = false;
 break;
 }
@@ -1079,7 +1079,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
job_sleep_ns(&s->common.job, delay_ns);
if (job_is_cancelled(&s->common.job) &&
-(!s->synced || s->common.job.force_cancel))
+(!s->synced || job_is_force_cancel(&s->common.job)))
 {
 break;
 }
@@ -1092,8 +1092,8 @@ immediate_exit:
  * or it was cancelled prematurely so that 

[RFC PATCH 2/6] job: _locked functions and public job_lock/unlock for next patch

2021-07-07 Thread Emanuele Giuseppe Esposito
Create _locked variants of the functions, to make the next patch a little
smaller. Also make the locking functions public, so that they can also be
used by other code that works with the Job struct.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/job.h | 23 +
 job.c  | 85 ++
 2 files changed, 93 insertions(+), 15 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 72c7d0f69d..ba2f9b2660 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -305,6 +305,7 @@ void job_txn_add_job(JobTxn *txn, Job *job);
 
 /** Returns the @ret field of a given Job. */
 int job_get_ret(Job *job);
+int job_get_ret_locked(Job *job);
 
 /** Returns the AioContext of a given Job. */
 AioContext *job_get_aiocontext(Job *job);
@@ -336,6 +337,24 @@ bool job_is_force_cancel(Job *job);
/** Returns the status of a given Job. */
 JobStatus job_get_status(Job *job);
 
+/**
+ * job_lock:
+ *
+ * Take the mutex protecting the list of jobs and their status.
+ * Most functions called by the monitor need to call job_lock
+ * and job_unlock manually.  On the other hand, functions called
+ * by the block jobs themselves and by the block layer will take the
+ * lock for you.
+ */
+void job_lock(void);
+
+/**
+ * job_unlock:
+ *
+ * Release the mutex protecting the list of jobs and their status.
+ */
+void job_unlock(void);
+
 /**
  * Create a new long-running job and return it.
  *
@@ -424,6 +443,7 @@ void job_start(Job *job);
  * Continue the specified job by entering the coroutine.
  */
 void job_enter(Job *job);
+void job_enter_locked(Job *job);
 
 /**
  * @job: The job that is ready to pause.
@@ -462,12 +482,15 @@ bool job_is_internal(Job *job);
 
 /** Returns whether the job is scheduled for cancellation. */
 bool job_is_cancelled(Job *job);
+bool job_is_cancelled_locked(Job *job);
 
 /** Returns whether the job is in a completed state. */
 bool job_is_completed(Job *job);
+bool job_is_completed_locked(Job *job);
 
 /** Returns whether the job is ready to be completed. */
 bool job_is_ready(Job *job);
+bool job_is_ready_locked(Job *job);
 
 /**
  * Request @job to pause at the next pause point. Must be paired with
diff --git a/job.c b/job.c
index 872bbebb01..96fb8e9730 100644
--- a/job.c
+++ b/job.c
@@ -32,6 +32,10 @@
 #include "trace/trace-root.h"
 #include "qapi/qapi-events-job.h"
 
+/* job_mutex protects the jobs list, but also the job operations. */
+static QemuMutex job_mutex;
+
+/* Protected by job_mutex */
 static QLIST_HEAD(, Job) jobs = QLIST_HEAD_INITIALIZER(jobs);
 
 /* Job State Transition Table */
@@ -64,27 +68,22 @@ bool JobVerbTable[JOB_VERB__MAX][JOB_STATUS__MAX] = {
 /* Transactional group of jobs */
 struct JobTxn {
 
-/* Is this txn being cancelled? */
+/* Is this txn being cancelled? Atomic. */
 bool aborting;
 
-/* List of jobs */
+/* List of jobs. Protected by job_mutex. */
 QLIST_HEAD(, Job) jobs;
 
-/* Reference count */
+/* Reference count. Atomic. */
 int refcnt;
 };
 
-/* Right now, this mutex is only needed to synchronize accesses to job->busy
- * and job->sleep_timer, such as concurrent calls to job_do_yield and
- * job_enter. */
-static QemuMutex job_mutex;
-
-static void job_lock(void)
+void job_lock(void)
 {
qemu_mutex_lock(&job_mutex);
 }
 
-static void job_unlock(void)
+void job_unlock(void)
 {
qemu_mutex_unlock(&job_mutex);
 }
@@ -109,11 +108,22 @@ bool job_is_busy(Job *job)
return qatomic_read(&job->busy);
 }
 
-int job_get_ret(Job *job)
+/* Called with job_mutex held. */
+int job_get_ret_locked(Job *job)
 {
 return job->ret;
 }
 
+/* Called with job_mutex *not* held. */
+int job_get_ret(Job *job)
+{
+int ret;
+job_lock();
+ret = job_get_ret_locked(job);
+job_unlock();
+return ret;
+}
+
 Error *job_get_err(Job *job)
 {
 return job->err;
@@ -255,12 +265,24 @@ const char *job_type_str(const Job *job)
 return JobType_str(job_type(job));
 }
 
-bool job_is_cancelled(Job *job)
+/* Called with job_mutex held. */
+bool job_is_cancelled_locked(Job *job)
 {
 return job->cancelled;
 }
 
-bool job_is_ready(Job *job)
+/* Called with job_mutex *not* held. */
+bool job_is_cancelled(Job *job)
+{
+bool ret;
+job_lock();
+ret = job_is_cancelled_locked(job);
+job_unlock();
+return ret;
+}
+
+/* Called with job_mutex held. */
+bool job_is_ready_locked(Job *job)
 {
 switch (job->status) {
 case JOB_STATUS_UNDEFINED:
@@ -282,7 +304,18 @@ bool job_is_ready(Job *job)
 return false;
 }
 
-bool job_is_completed(Job *job)
+/* Called with job_mutex *not* held. */
+bool job_is_ready(Job *job)
+{
+bool ret;
+job_lock();
+ret = job_is_ready_locked(job);
+job_unlock();
+return ret;
+}
+
+/* Called with job_mutex held. */
+bool job_is_completed_locked(Job *job)
 {
 switch (job->status) {
 case JOB_STATUS_UNDEFINED:
@@ -304,6 +337,17 @@ bool job_is_completed(Job *job)
 return false;
 }
 
+/* Called with job_mutex *not* 

[RFC PATCH 5/6] job: use global job_mutex to protect struct Job

2021-07-07 Thread Emanuele Giuseppe Esposito
This lock is going to replace most of the AioContext locks
in the job and blockjob, so that a Job can run in an arbitrary
AioContext.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/block/blockjob_int.h |   1 +
 include/qemu/job.h   |   2 +
 block/backup.c   |   4 +
 block/mirror.c   |  11 +-
 blockdev.c   |  62 
 blockjob.c   |  67 +++--
 job-qmp.c|  55 +++
 job.c| 284 +++
 qemu-img.c   |  15 +-
 9 files changed, 350 insertions(+), 151 deletions(-)

diff --git a/include/block/blockjob_int.h b/include/block/blockjob_int.h
index 6633d83da2..8b91126506 100644
--- a/include/block/blockjob_int.h
+++ b/include/block/blockjob_int.h
@@ -53,6 +53,7 @@ struct BlockJobDriver {
  */
 void (*attached_aio_context)(BlockJob *job, AioContext *new_context);
 
+/* Called with job mutex *not* held. */
 void (*set_speed)(BlockJob *job, int64_t speed);
 };
 
diff --git a/include/qemu/job.h b/include/qemu/job.h
index 4421d08d93..359f4e6b3a 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -49,6 +49,8 @@ typedef struct Job {
 /**
  * The type of this job.
  * Set it in job_create and just read.
+ * All calls into the driver's functions must be made with job_mutex
+ * *not* held, to avoid deadlocks.
  */
 const JobDriver *driver;
 
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..80ce956299 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -315,6 +315,10 @@ static void coroutine_fn backup_pause(Job *job)
 }
 }
 
+/*
+ * Called with job mutex *not* held (we don't want to call block_copy_kick
+ * with the lock held!)
+ */
 static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
 {
 BackupBlockJob *s = container_of(job, BackupBlockJob, common);
diff --git a/block/mirror.c b/block/mirror.c
index 49fffa..deefaa6a39 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1150,9 +1150,11 @@ static void mirror_complete(Job *job, Error **errp)
 s->should_complete = true;
 
 /* If the job is paused, it will be re-entered when it is resumed */
+job_lock();
 if (!job_is_paused(job)) {
-job_enter(job);
+job_enter_locked(job);
 }
+job_unlock();
 }
 
 static void coroutine_fn mirror_pause(Job *job)
@@ -1171,10 +1173,13 @@ static bool mirror_drained_poll(BlockJob *job)
  * from one of our own drain sections, to avoid a deadlock waiting for
  * ourselves.
  */
-if (!job_is_paused(&s->common.job) && !job_is_cancelled(&s->common.job) &&
-!s->in_drain) {
+job_lock();
+if (!job_is_paused(&s->common.job) &&
+!job_is_cancelled_locked(&s->common.job) && !s->in_drain) {
+job_unlock();
 return true;
 }
+job_unlock();
 
 return !!s->in_flight;
 }
diff --git a/blockdev.c b/blockdev.c
index 8e2c15370e..9255aea6a2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -150,9 +150,11 @@ void blockdev_mark_auto_del(BlockBackend *blk)
AioContext *aio_context = job_get_aiocontext(&job->job);
 aio_context_acquire(aio_context);
 
+job_lock();
job_cancel(&job->job, false);
 
 aio_context_release(aio_context);
+job_unlock();
 }
 }
 
@@ -3309,48 +3311,44 @@ out:
 aio_context_release(aio_context);
 }
 
-/* Get a block job using its ID and acquire its AioContext */
-static BlockJob *find_block_job(const char *id, AioContext **aio_context,
-Error **errp)
+/* Get a block job using its ID and acquire its job_lock */
+static BlockJob *find_block_job(const char *id, Error **errp)
 {
 BlockJob *job;
 
 assert(id != NULL);
 
-*aio_context = NULL;
-
+job_lock();
 job = block_job_get(id);
 
 if (!job) {
 error_set(errp, ERROR_CLASS_DEVICE_NOT_ACTIVE,
   "Block job '%s' not found", id);
+job_unlock();
 return NULL;
 }
 
-*aio_context = blk_get_aio_context(job->blk);
-aio_context_acquire(*aio_context);
-
 return job;
 }
 
+/* Called with job_mutex *not* held. */
 void qmp_block_job_set_speed(const char *device, int64_t speed, Error **errp)
 {
-AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job = find_block_job(device, errp);
 
 if (!job) {
 return;
 }
 
 block_job_set_speed(job, speed, errp);
-aio_context_release(aio_context);
+job_unlock();
 }
 
+/* Called with job_mutex *not* held. */
 void qmp_block_job_cancel(const char *device,
   bool has_force, bool force, Error **errp)
 {
-AioContext *aio_context;
-BlockJob *job = find_block_job(device, &aio_context, errp);
+BlockJob *job = find_block_job(device, errp);
 
 if (!job) {
 return;
@@ -3369,13 +3367,13 @@ void qmp_block_job_cancel(const char *device,
 

[RFC PATCH 3/6] job: minor changes to simplify locking

2021-07-07 Thread Emanuele Giuseppe Esposito
Check for a NULL id in job_get, so that in the next patch we can
move job_get inside a single critical section of job_create.

Also add the missing notifier_list_init for the on_idle NotifierList,
which seems to have been forgotten.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 job.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/job.c b/job.c
index 96fb8e9730..48b304c3ff 100644
--- a/job.c
+++ b/job.c
@@ -375,6 +375,10 @@ Job *job_get(const char *id)
 {
 Job *job;
 
+if (!id) {
+return NULL;
+}
+
QLIST_FOREACH(job, &jobs, job_list) {
 if (job->id && !strcmp(id, job->id)) {
 return job;
@@ -406,15 +410,18 @@ void *job_create(const char *job_id, const JobDriver *driver, JobTxn *txn,
 error_setg(errp, "Invalid job ID '%s'", job_id);
 return NULL;
 }
-if (job_get(job_id)) {
-error_setg(errp, "Job ID '%s' already in use", job_id);
-return NULL;
-}
 } else if (!(flags & JOB_INTERNAL)) {
 error_setg(errp, "An explicit job ID is required");
 return NULL;
 }
 
+job_lock();
+if (job_get(job_id)) {
+error_setg(errp, "Job ID '%s' already in use", job_id);
+job_unlock();
+return NULL;
+}
+
 job = g_malloc0(driver->instance_size);
 job->driver= driver;
 job->id= g_strdup(job_id);
@@ -434,6 +441,7 @@ void *job_create(const char *job_id, const JobDriver *driver, JobTxn *txn,
notifier_list_init(&job->on_finalize_completed);
notifier_list_init(&job->on_pending);
notifier_list_init(&job->on_ready);
+notifier_list_init(&job->on_idle);
 
 job_state_transition(job, JOB_STATUS_CREATED);
aio_timer_init(qemu_get_aio_context(), &job->sleep_timer,
-- 
2.31.1




[RFC PATCH 4/6] job.h: categorize job fields

2021-07-07 Thread Emanuele Giuseppe Esposito
This makes it easier to understand what needs to be protected
by a lock and what doesn't.

Signed-off-by: Emanuele Giuseppe Esposito 
---
 include/qemu/job.h | 101 -
 1 file changed, 82 insertions(+), 19 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index ba2f9b2660..4421d08d93 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -40,24 +40,40 @@ typedef struct JobTxn JobTxn;
  * Long-running operation.
  */
 typedef struct Job {
-/** The ID of the job. May be NULL for internal jobs. */
+/**
+ * The ID of the job. May be NULL for internal jobs.
+ * Set it in job_create and just read.
+ */
 char *id;
 
-/** The type of this job. */
+/**
+ * The type of this job.
+ * Set it in job_create and just read.
+ */
 const JobDriver *driver;
 
-/** Reference count of the block job */
+/**
+ * Reference count of the block job.
+ * Protected by job_mutex.
+ */
 int refcnt;
 
-/** Current state; See @JobStatus for details. */
+/**
+ * Current state; See @JobStatus for details.
+ * Protected by job_mutex.
+ */
 JobStatus status;
 
-/** AioContext to run the job coroutine in */
+/**
+ * AioContext to run the job coroutine in.
+ * Atomic.
+ */
 AioContext *aio_context;
 
 /**
  * The coroutine that executes the job.  If not NULL, it is reentered when
  * busy is false and the job is cancelled.
+ * Set it in job_create and just read.
  */
 Coroutine *co;
 
@@ -70,13 +86,15 @@ typedef struct Job {
 /**
  * Counter for pause request. If non-zero, the block job is either paused,
  * or if busy == true will pause itself as soon as possible.
+ * Protected by job_mutex.
  */
 int pause_count;
 
 /**
  * Set to false by the job while the coroutine has yielded and may be
  * re-entered by job_enter(). There may still be I/O or event loop activity
- * pending. Accessed under block_job_mutex (in blockjob.c).
+ * pending.
+ * Protected by job_mutex.
  *
  * When the job is deferred to the main loop, busy is true as long as the
  * bottom half is still pending.
@@ -86,12 +104,14 @@ typedef struct Job {
 /**
  * Set to true by the job while it is in a quiescent state, where
  * no I/O or event loop activity is pending.
+ * Protected by job_mutex.
  */
 bool paused;
 
 /**
  * Set to true if the job is paused by user.  Can be unpaused with the
  * block-job-resume QMP command.
+ * Protected by job_mutex.
  */
 bool user_paused;
 
@@ -100,22 +120,33 @@ typedef struct Job {
  * always be tested just before toggling the busy flag from false
  * to true.  After a job has been cancelled, it should only yield
  * if #aio_poll will ("sooner or later") reenter the coroutine.
+ * Protected by job_mutex.
  */
 bool cancelled;
 
 /**
  * Set to true if the job should abort immediately without waiting
  * for data to be in sync.
+ * Protected by job_mutex.
  */
 bool force_cancel;
 
-/** Set to true when the job has deferred work to the main loop. */
+/**
+ * Set to true when the job has deferred work to the main loop.
+ * Protected by job_mutex.
+ */
 bool deferred_to_main_loop;
 
-/** True if this job should automatically finalize itself */
+/**
+ * True if this job should automatically finalize itself.
+ * Set it in job_create and just read.
+ */
 bool auto_finalize;
 
-/** True if this job should automatically dismiss itself */
+/**
+ * True if this job should automatically dismiss itself.
+ * Set it in job_create and just read.
+ */
 bool auto_dismiss;
 
 ProgressMeter progress;
@@ -124,6 +155,7 @@ typedef struct Job {
  * Return code from @run and/or @prepare callback(s).
  * Not final until the job has reached the CONCLUDED status.
  * 0 on success, -errno on failure.
+ * Protected by job_mutex.
  */
 int ret;
 
@@ -131,37 +163,68 @@ typedef struct Job {
  * Error object for a failed job.
  * If job->ret is nonzero and an error object was not set, it will be set
  * to strerror(-job->ret) during job_completed.
+ * Protected by job_mutex.
  */
 Error *err;
 
-/** The completion function that will be called when the job completes.  */
+/**
+ * The completion function that will be called when the job completes.
+ * Set it in job_create and just read.
+ */
 BlockCompletionFunc *cb;
 
-/** The opaque value that is passed to the completion function.  */
+/**
+ * The opaque value that is passed to the completion function.
+ * Set it in job_create and just read.
+ */
 void *opaque;
 
-/** Notifiers called when a cancelled job is finalised */
+/**
+ * Notifiers called when a cancelled job is finalised.
+  

[RFC PATCH 0/6] job: replace AioContext lock with job_mutex

2021-07-07 Thread Emanuele Giuseppe Esposito
This is a continuation of the work to reduce (and possibly get rid of) the 
usage of the AioContext lock, by introducing smaller granularity locks to 
keep thread safety.

This series aims to:
1) remove the aiocontext lock and substitute it with the already existing
   global job_mutex
2) fix what looks like an oversight when moving the blockjob.c logic
   into the more generic job.c: job_mutex was introduced especially to
   protect the job->busy flag, but it seems that it was not used in subsequent
   patches, because there are multiple code sections that directly
   access the field without any locking.
3) use job_mutex instead of the aiocontext_lock
4) extend the reach of the job_mutex to protect all shared fields
   that the job structure has.

The reason why we propose to use the existing job_mutex and not make one for
each job is to keep things as simple as possible for now, and because
the jobs are not in the execution critical path, so we can afford
some delays.
Having a lock per job would increase overall complexity and
increase the chances of deadlocks (one good example could be the job
transactions, where multiple jobs are grouped together).
Anyways, the per-job mutex can always be added in the future.

Patches 1-4 are in preparation for patch 5. They try to simplify and clarify
the job_mutex usage. Patch 5 tries to add proper synchronization to the job
structure, replacing the AioContext lock when necessary.
Patch 6 just removes AioContext locks that are now unneeded.


RFC: I am not sure the way I laid out the locks is ideal.
But their usage should not introduce deadlocks. I also made sure
the series passes all qemu-iotests.

What is very clear from this patch is that it
is strictly related to the bdrv_* and lower level calls, because
they also internally check or even use the aiocontext lock.
Therefore, in order to make it work, I temporarily added some
aio_context_acquire/release pairs around the functions that
still assert for them or assume they are held, and temporarily
unlock (unlock() - lock()).

I also apologize for the amount of changes in patch 5; any suggestion on
how to improve the patch layout is very much appreciated.

Emanuele Giuseppe Esposito (6):
  job: use getter/setters instead of accessing the Job fields directly
  job: _locked functions and public job_lock/unlock for next patch
  job: minor changes to simplify locking
  job.h: categorize job fields
  job: use global job_mutex to protect struct Job
  jobs: remove unnecessary AioContext acquire/release pairs

 include/block/blockjob_int.h   |   1 +
 include/qemu/job.h | 159 ++--
 block.c|   2 +-
 block/backup.c |   4 +
 block/commit.c |   4 +-
 block/mirror.c |  30 ++-
 block/monitor/block-hmp-cmds.c |   6 -
 block/replication.c|   3 +-
 blockdev.c | 235 ++
 blockjob.c | 140 +++
 job-qmp.c  |  65 +++--
 job.c  | 432 ++---
 qemu-img.c |  19 +-
 13 files changed, 724 insertions(+), 376 deletions(-)

-- 
2.31.1




Re: [PATCH v2 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen

On Jul  7 17:57, Hannes Reinecke wrote:

On 7/7/21 5:49 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 15 ---
 hw/nvme/ctrl.c   | 14 ++
 hw/nvme/ns.c | 18 ++
 hw/nvme/subsys.c |  3 +++
 4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..83ffabade4cf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,20 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+NvmeBus bus;
 uint8_t subnqn[256];
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +373,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..9a3b3a27c293 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6514,16 +6514,14 @@ static void nvme_exit(PCIDevice *pci_dev)
 nvme_ctrl_reset(n);
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (n->subsys) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
+}
 }
-nvme_ns_cleanup(ns);


So who is removing the namespaces, then?
I would have expected some cleanup action from the subsystem, seeing 
that we reparent to that ...




Since we "move" the namespaces to the subsystem, and since the subsystem 
is non-hotpluggable, they will not (and cannot) be removed. In the case 
that there is no subsystem, nvme_ns_unrealize() will be called for each 
child namespace on the controller NvmeBus.



-}
-
-if (n->subsys) {
 nvme_subsys_unregister_ctrl(n->subsys, n);
 }
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..b7cf1494e75b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -441,6 +441,15 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 }
 }
+static void nvme_ns_unrealize(DeviceState *dev)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+}
+
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
@@ -462,6 +471,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
"linked to an nvme-subsys device");
 return;
 }
+} else {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}


What happens if that fails?
Will we abort? Not create the namespace?



Good point!

It can actually only fail if the bus implements check_address(), which 
it does not, so it always succeeds; it should assert instead.



 }
 if (nvme_ns_setup(ns, errp)) {
@@ -552,6 +569,7 @@ static void nvme_ns_class_init(ObjectClass *oc, void *data)
 dc->bus_type = TYPE_NVME_BUS;
 dc->realize = nvme_ns_realize;
+dc->unrealize = nvme_ns_unrealize;
 device_class_set_props(dc, nvme_ns_props);
 dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..93c35950d69d 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -50,6 +50,9 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
 {
 NvmeSubsystem *subsys = NVME_SUBSYS(dev);
+qbus_create_inplace(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev,
+dev->id);
+
 nvme_subsys_setup(subsys);
 }


Cheers,

Hannes
--
Dr. Hannes Reinecke

Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen

On Jul  7 17:14, Stefan Hajnoczi wrote:

On Wed, Jul 07, 2021 at 12:43:56PM +0200, Hannes Reinecke wrote:

On 7/7/21 11:53 AM, Klaus Jensen wrote:
> On Jul  7 09:49, Hannes Reinecke wrote:
> > On 7/6/21 11:33 AM, Klaus Jensen wrote:
> > > From: Klaus Jensen 
> > >
> > > Prior to this patch the nvme-ns devices are always children of the
> > > NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
> > > unrealized when the parent device is removed. However, when subsystems
> > > are involved, this is not what we want since the namespaces may be
> > > attached to other controllers as well.
> > >
> > > This patch adds an additional NvmeBus on the subsystem device. When
> > > nvme-ns devices are realized, if the parent controller device is linked
> > > to a subsystem, the parent bus is set to the subsystem one instead. This
> > > makes sure that namespaces are kept alive and not unrealized.
> > >
> > > Signed-off-by: Klaus Jensen 
> > > ---
> > >  hw/nvme/nvme.h   | 18 ++
> > >  hw/nvme/ctrl.c   |  8 +---
> > >  hw/nvme/ns.c | 32 +---
> > >  hw/nvme/subsys.c |  4 
> > >  4 files changed, 44 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > > index c4065467d877..9401e212f9f7 100644
> > > --- a/hw/nvme/nvme.h
> > > +++ b/hw/nvme/nvme.h
> > > @@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES >
> > > NVME_NSID_BROADCAST - 1);
> > >  typedef struct NvmeCtrl NvmeCtrl;
> > >  typedef struct NvmeNamespace NvmeNamespace;
> > > +#define TYPE_NVME_BUS "nvme-bus"
> > > +OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
> > > +
> > > +typedef struct NvmeBus {
> > > +    BusState parent_bus;
> > > +    bool is_subsys;
> > > +} NvmeBus;
> > > +
> > >  #define TYPE_NVME_SUBSYS "nvme-subsys"
> > >  #define NVME_SUBSYS(obj) \
> > >  OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
> > >  typedef struct NvmeSubsystem {
> > >  DeviceState parent_obj;
> > > +    NvmeBus bus;
> > >  uint8_t subnqn[256];
> > >  NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
> > > @@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
> > >  QTAILQ_HEAD(, NvmeRequest) req_list;
> > >  } NvmeCQueue;
> > > -#define TYPE_NVME_BUS "nvme-bus"
> > > -#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
> > > -
> > > -typedef struct NvmeBus {
> > > -    BusState parent_bus;
> > > -} NvmeBus;
> > > -
> > >  #define TYPE_NVME "nvme"
> > >  #define NVME(obj) \
> > >  OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > > @@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
> > >  static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
> > >  {
> > > -    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
> > > +    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
> > >  return NULL;
> > >  }
> > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > index 90e3ee2b70ee..7c8fca36d9a5 100644
> > > --- a/hw/nvme/ctrl.c
> > > +++ b/hw/nvme/ctrl.c
> > > @@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
> > >  for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> > >  ns = nvme_ns(n, i);
> > > -    if (!ns) {
> > > -    continue;
> > > +    if (ns) {
> > > +    ns->attached--;
> > >  }
> > > +    }
> > > -    nvme_ns_cleanup(ns);
> > > +    if (n->subsys) {
> > > +    nvme_subsys_unregister_ctrl(n->subsys, n);
> > >  }
> > >  if (n->subsys) {
> > > diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> > > index 3c4f5b8c714a..612a2786d75d 100644
> > > --- a/hw/nvme/ns.c
> > > +++ b/hw/nvme/ns.c
> > > @@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
> > >  static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > >  {
> > >  NvmeNamespace *ns = NVME_NS(dev);
> > > -    BusState *s = qdev_get_parent_bus(dev);
> > > -    NvmeCtrl *n = NVME(s->parent);
> > > -    NvmeSubsystem *subsys = n->subsys;
> > > +    BusState *qbus = qdev_get_parent_bus(dev);
> > > +    NvmeBus *bus = NVME_BUS(qbus);
> > > +    NvmeCtrl *n = NULL;
> > > +    NvmeSubsystem *subsys = NULL;
> > >  uint32_t nsid = ns->params.nsid;
> > >  int i;
> > > -    if (!n->subsys) {
> > > +    if (bus->is_subsys) {
> > > +    subsys = NVME_SUBSYS(qbus->parent);
> > > +    } else {
> > > +    n = NVME(qbus->parent);
> > > +    subsys = n->subsys;
> > > +    }
> > > +
> > > +    if (subsys) {
> > > +    /*
> > > + * If this namespace belongs to a subsystem (through a
> > > link on the
> > > + * controller device), reparent the device.
> > > + */
> > > +    if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
> > > +    return;
> > > +    }
> > > +    } else {
> > >  if (ns->params.detached) {
> > >  error_setg(errp, "detached requires that the nvme
> > > device is "
> > >     "linked to an nvme-subsys device");
> > > @@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState

Re: [PATCH] block/replication.c: Properly attach children

2021-07-07 Thread Vladimir Sementsov-Ogievskiy

07.07.2021 17:53, Lukas Straub wrote:

Hi,
Thanks for your review. More below.

Btw: There is a overview of the replication design in
docs/block-replication.txt

On Wed, 7 Jul 2021 16:01:31 +0300
Vladimir Sementsov-Ogievskiy  wrote:


06.07.2021 19:11, Lukas Straub wrote:

The replication driver needs access to the children block-nodes of
it's child so it can issue bdrv_make_empty to manage the replication.
However, it does this by directly copying the BdrvChilds, which is
wrong.

Fix this by properly attaching the block-nodes with
bdrv_attach_child().

Also, remove a workaround introduced in commit
6ecbc6c52672db5c13805735ca02784879ce8285
"replication: Avoid blk_make_empty() on read-only child".

Signed-off-by: Lukas Straub 
---

-v2: Test for BDRV_CHILD_PRIMARY in replication_child_perm, since
   bs->file might not be set yet. (Vladimir)

   block/replication.c | 94 +
   1 file changed, 61 insertions(+), 33 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 52163f2d1f..fd8cb728a3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -166,7 +166,12 @@ static void replication_child_perm(BlockDriverState *bs, 
BdrvChild *c,
  uint64_t perm, uint64_t shared,
  uint64_t *nperm, uint64_t *nshared)
   {
-*nperm = BLK_PERM_CONSISTENT_READ;
+if (role & BDRV_CHILD_PRIMARY) {
+*nperm = BLK_PERM_CONSISTENT_READ;
+} else {
+*nperm = 0;
+}


Why do you drop READ access for the other children? You don't mention it in 
the commit message..

Upd: OK, now I see that we are not going to read from the hidden_disk child, and that's the 
only "other" child worth mentioning.
Still, we should be sure that the hidden_disk child gets WRITE permission in case 
we are going to call bdrv_make_empty on it.


The code below that in replication_child_perm() should make sure of
that, or am I misunderstanding it?

Or do you mean that it should always request WRITE regardless of
bs->open_flags & BDRV_O_INACTIVE?


Probably that.

I mean the following:

Do you know the circumstances when secondary_do_checkpoint() is called? With 
new code, could it be called when we don't have WRITE permission on 
hidden_disk? If it could, it's a bug.

And this patch should answer this question, because pre-patch we create blk 
with explicitly set BLK_PERM_WRITE. After-patch we rely on existing code in 
replication_child_perm().




+
   if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
   *nperm |= BLK_PERM_WRITE;
   }
@@ -340,17 +345,7 @@ static void secondary_do_checkpoint(BDRVReplicationState 
*s, Error **errp)
   return;
   }
   
-BlockBackend *blk = blk_new(qemu_get_current_aio_context(),

-BLK_PERM_WRITE, BLK_PERM_ALL);
-blk_insert_bs(blk, s->hidden_disk->bs, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-blk_unref(blk);
-return;
-}
-
-ret = blk_make_empty(blk, errp);
-blk_unref(blk);
+ret = bdrv_make_empty(s->hidden_disk, errp);


So, here you rely on BLK_PERM_WRITE being set in replication_child_perm()... 
Probably that's OK, but the logic has changed. Shouldn't we always require write 
permission in replication_child_perm() for hidden_disk?


   if (ret < 0) {
   return;
   }
@@ -365,27 +360,35 @@ static void reopen_backing_file(BlockDriverState *bs, 
bool writable,
   Error **errp)
   {
   BDRVReplicationState *s = bs->opaque;
+BdrvChild *hidden_disk, *secondary_disk;
   BlockReopenQueue *reopen_queue = NULL;
   
+/*

+ * s->hidden_disk and s->secondary_disk may not be set yet, as they will
+ * only be set after the children are writable.
+ */
+hidden_disk = bs->file->bs->backing;
+secondary_disk = hidden_disk->bs->backing;
+
   if (writable) {
-s->orig_hidden_read_only = bdrv_is_read_only(s->hidden_disk->bs);
-s->orig_secondary_read_only = bdrv_is_read_only(s->secondary_disk->bs);
+s->orig_hidden_read_only = bdrv_is_read_only(hidden_disk->bs);
+s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
   }
   
-bdrv_subtree_drained_begin(s->hidden_disk->bs);

-bdrv_subtree_drained_begin(s->secondary_disk->bs);
+bdrv_subtree_drained_begin(hidden_disk->bs);
+bdrv_subtree_drained_begin(secondary_disk->bs);


That kind of staff may be done as a separate preparation patch, with 
no-logic-change refactoring, this makes the main logic-change patch simpler.


Yes, makes sense. Will do in the next version.

   
   if (s->orig_hidden_read_only) {

   QDict *opts = qdict_new();
   qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, hidden_disk->bs,
 

Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Stefan Hajnoczi
On Wed, Jul 07, 2021 at 12:43:56PM +0200, Hannes Reinecke wrote:
> On 7/7/21 11:53 AM, Klaus Jensen wrote:
> > On Jul  7 09:49, Hannes Reinecke wrote:
> > > On 7/6/21 11:33 AM, Klaus Jensen wrote:
> > > > From: Klaus Jensen 
> > > > 
> > > > Prior to this patch the nvme-ns devices are always children of the
> > > > NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
> > > > unrealized when the parent device is removed. However, when subsystems
> > > > are involved, this is not what we want since the namespaces may be
> > > > attached to other controllers as well.
> > > > 
> > > > This patch adds an additional NvmeBus on the subsystem device. When
> > > > nvme-ns devices are realized, if the parent controller device is linked
> > > > to a subsystem, the parent bus is set to the subsystem one instead. This
> > > > makes sure that namespaces are kept alive and not unrealized.
> > > > 
> > > > Signed-off-by: Klaus Jensen 
> > > > ---
> > > >  hw/nvme/nvme.h   | 18 ++
> > > >  hw/nvme/ctrl.c   |  8 +---
> > > >  hw/nvme/ns.c | 32 +---
> > > >  hw/nvme/subsys.c |  4 
> > > >  4 files changed, 44 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > > > index c4065467d877..9401e212f9f7 100644
> > > > --- a/hw/nvme/nvme.h
> > > > +++ b/hw/nvme/nvme.h
> > > > @@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES >
> > > > NVME_NSID_BROADCAST - 1);
> > > >  typedef struct NvmeCtrl NvmeCtrl;
> > > >  typedef struct NvmeNamespace NvmeNamespace;
> > > > +#define TYPE_NVME_BUS "nvme-bus"
> > > > +OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
> > > > +
> > > > +typedef struct NvmeBus {
> > > > +    BusState parent_bus;
> > > > +    bool is_subsys;
> > > > +} NvmeBus;
> > > > +
> > > >  #define TYPE_NVME_SUBSYS "nvme-subsys"
> > > >  #define NVME_SUBSYS(obj) \
> > > >  OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
> > > >  typedef struct NvmeSubsystem {
> > > >  DeviceState parent_obj;
> > > > +    NvmeBus bus;
> > > >  uint8_t subnqn[256];
> > > >  NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
> > > > @@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
> > > >  QTAILQ_HEAD(, NvmeRequest) req_list;
> > > >  } NvmeCQueue;
> > > > -#define TYPE_NVME_BUS "nvme-bus"
> > > > -#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
> > > > -
> > > > -typedef struct NvmeBus {
> > > > -    BusState parent_bus;
> > > > -} NvmeBus;
> > > > -
> > > >  #define TYPE_NVME "nvme"
> > > >  #define NVME(obj) \
> > > >  OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > > > @@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
> > > >  static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
> > > >  {
> > > > -    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
> > > > +    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
> > > >  return NULL;
> > > >  }
> > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > index 90e3ee2b70ee..7c8fca36d9a5 100644
> > > > --- a/hw/nvme/ctrl.c
> > > > +++ b/hw/nvme/ctrl.c
> > > > @@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
> > > >  for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> > > >  ns = nvme_ns(n, i);
> > > > -    if (!ns) {
> > > > -    continue;
> > > > +    if (ns) {
> > > > +    ns->attached--;
> > > >  }
> > > > +    }
> > > > -    nvme_ns_cleanup(ns);
> > > > +    if (n->subsys) {
> > > > +    nvme_subsys_unregister_ctrl(n->subsys, n);
> > > >  }
> > > >  if (n->subsys) {
> > > > diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> > > > index 3c4f5b8c714a..612a2786d75d 100644
> > > > --- a/hw/nvme/ns.c
> > > > +++ b/hw/nvme/ns.c
> > > > @@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
> > > >  static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > > >  {
> > > >  NvmeNamespace *ns = NVME_NS(dev);
> > > > -    BusState *s = qdev_get_parent_bus(dev);
> > > > -    NvmeCtrl *n = NVME(s->parent);
> > > > -    NvmeSubsystem *subsys = n->subsys;
> > > > +    BusState *qbus = qdev_get_parent_bus(dev);
> > > > +    NvmeBus *bus = NVME_BUS(qbus);
> > > > +    NvmeCtrl *n = NULL;
> > > > +    NvmeSubsystem *subsys = NULL;
> > > >  uint32_t nsid = ns->params.nsid;
> > > >  int i;
> > > > -    if (!n->subsys) {
> > > > +    if (bus->is_subsys) {
> > > > +    subsys = NVME_SUBSYS(qbus->parent);
> > > > +    } else {
> > > > +    n = NVME(qbus->parent);
> > > > +    subsys = n->subsys;
> > > > +    }
> > > > +
> > > > +    if (subsys) {
> > > > +    /*
> > > > + * If this namespace belongs to a subsystem (through a
> > > > link on the
> > > > + * controller device), reparent the device.
> > > > + */
> > > > +    if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
> > > > +    return;
> > > > +    }
> > > > +    } else {
> > > >  if (ns->params.detached) {
> > > >   

Re: [PATCH v2 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Hannes Reinecke

On 7/7/21 5:49 PM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 15 ---
  hw/nvme/ctrl.c   | 14 ++
  hw/nvme/ns.c | 18 ++
  hw/nvme/subsys.c |  3 +++
  4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..83ffabade4cf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,20 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
  typedef struct NvmeCtrl NvmeCtrl;
  typedef struct NvmeNamespace NvmeNamespace;
  
+#define TYPE_NVME_BUS "nvme-bus"

+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+} NvmeBus;
+
  #define TYPE_NVME_SUBSYS "nvme-subsys"
  #define NVME_SUBSYS(obj) \
  OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
  
  typedef struct NvmeSubsystem {

  DeviceState parent_obj;
+NvmeBus bus;
  uint8_t subnqn[256];
  
  NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];

@@ -365,13 +373,6 @@ typedef struct NvmeCQueue {
  QTAILQ_HEAD(, NvmeRequest) req_list;
  } NvmeCQueue;
  
-#define TYPE_NVME_BUS "nvme-bus"

-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
  #define TYPE_NVME "nvme"
  #define NVME(obj) \
  OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..9a3b3a27c293 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6514,16 +6514,14 @@ static void nvme_exit(PCIDevice *pci_dev)
  
  nvme_ctrl_reset(n);
  
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {

-ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (n->subsys) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
+}
  }
  
-nvme_ns_cleanup(ns);


So who is removing the namespaces, then?
I would have expected some cleanup action from the subsystem, seeing 
that we reparent to that ...



-}
-
-if (n->subsys) {
  nvme_subsys_unregister_ctrl(n->subsys, n);
  }
  
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c

index 3c4f5b8c714a..b7cf1494e75b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -441,6 +441,15 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
  }
  }
  
+static void nvme_ns_unrealize(DeviceState *dev)

+{
+NvmeNamespace *ns = NVME_NS(dev);
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+}
+
  static void nvme_ns_realize(DeviceState *dev, Error **errp)
  {
  NvmeNamespace *ns = NVME_NS(dev);
@@ -462,6 +471,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 "linked to an nvme-subsys device");
  return;
  }
+} else {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}


What happens if that fails?
Will we abort? Not create the namespace?


  }
  
  if (nvme_ns_setup(ns, errp)) {

@@ -552,6 +569,7 @@ static void nvme_ns_class_init(ObjectClass *oc, void *data)
  
  dc->bus_type = TYPE_NVME_BUS;

  dc->realize = nvme_ns_realize;
+dc->unrealize = nvme_ns_unrealize;
  device_class_set_props(dc, nvme_ns_props);
  dc->desc = "Virtual NVMe namespace";
  }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..93c35950d69d 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -50,6 +50,9 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
  {
  NvmeSubsystem *subsys = NVME_SUBSYS(dev);
  
+qbus_create_inplace(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev,

+dev->id);
+
  nvme_subsys_setup(subsys);
  }
  


Cheers,

Hannes
--
Dr. Hannes Reinecke        Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



[PATCH v2 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen
From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 15 ---
 hw/nvme/ctrl.c   | 14 ++
 hw/nvme/ns.c | 18 ++
 hw/nvme/subsys.c |  3 +++
 4 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..83ffabade4cf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,20 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
 
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+NvmeBus bus;
 uint8_t subnqn[256];
 
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +373,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..9a3b3a27c293 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6514,16 +6514,14 @@ static void nvme_exit(PCIDevice *pci_dev)
 
 nvme_ctrl_reset(n);
 
-for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (n->subsys) {
+for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
+ns = nvme_ns(n, i);
+if (ns) {
+ns->attached--;
+}
 }
 
-nvme_ns_cleanup(ns);
-}
-
-if (n->subsys) {
 nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..b7cf1494e75b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -441,6 +441,15 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 }
 }
 
+static void nvme_ns_unrealize(DeviceState *dev)
+{
+NvmeNamespace *ns = NVME_NS(dev);
+
+nvme_ns_drain(ns);
+nvme_ns_shutdown(ns);
+nvme_ns_cleanup(ns);
+}
+
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
@@ -462,6 +471,14 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
"linked to an nvme-subsys device");
 return;
 }
+} else {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}
 }
 
 if (nvme_ns_setup(ns, errp)) {
@@ -552,6 +569,7 @@ static void nvme_ns_class_init(ObjectClass *oc, void *data)
 
 dc->bus_type = TYPE_NVME_BUS;
 dc->realize = nvme_ns_realize;
+dc->unrealize = nvme_ns_unrealize;
 device_class_set_props(dc, nvme_ns_props);
 dc->desc = "Virtual NVMe namespace";
 }
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 92caa604a280..93c35950d69d 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -50,6 +50,9 @@ static void nvme_subsys_realize(DeviceState *dev, Error 
**errp)
 {
 NvmeSubsystem *subsys = NVME_SUBSYS(dev);
 
+qbus_create_inplace(&subsys->bus, sizeof(NvmeBus), TYPE_NVME_BUS, dev,
+dev->id);
+
 nvme_subsys_setup(subsys);
 }
 
-- 
2.32.0




[PATCH v2 3/4] hw/nvme: unregister controller with subsystem at exit

2021-07-07 Thread Klaus Jensen
From: Klaus Jensen 

Make sure the controller is unregistered from the subsystem when device
is removed.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 1 +
 hw/nvme/ctrl.c   | 4 
 hw/nvme/subsys.c | 5 +
 3 files changed, 10 insertions(+)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 0868359a1e86..c4065467d877 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -50,6 +50,7 @@ typedef struct NvmeSubsystem {
 } NvmeSubsystem;
 
 int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp);
+void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n);
 
 static inline NvmeCtrl *nvme_subsys_ctrl(NvmeSubsystem *subsys,
  uint32_t cntlid)
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index dd1801510032..90e3ee2b70ee 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6523,6 +6523,10 @@ static void nvme_exit(PCIDevice *pci_dev)
 nvme_ns_cleanup(ns);
 }
 
+if (n->subsys) {
+nvme_subsys_unregister_ctrl(n->subsys, n);
+}
+
 g_free(n->cq);
 g_free(n->sq);
 g_free(n->aer_reqs);
diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index dc7a96862f37..92caa604a280 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -32,6 +32,11 @@ int nvme_subsys_register_ctrl(NvmeCtrl *n, Error **errp)
 return cntlid;
 }
 
+void nvme_subsys_unregister_ctrl(NvmeSubsystem *subsys, NvmeCtrl *n)
+{
+subsys->ctrls[n->cntlid] = NULL;
+}
+
 static void nvme_subsys_setup(NvmeSubsystem *subsys)
 {
 const char *nqn = subsys->params.nqn ?
-- 
2.32.0




[PATCH v2 2/4] hw/nvme: mark nvme-subsys non-hotpluggable

2021-07-07 Thread Klaus Jensen
From: Klaus Jensen 

We currently lack the infrastructure to handle subsystem hotplugging, so
disable it.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/subsys.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 192223d17ca1..dc7a96862f37 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -61,6 +61,7 @@ static void nvme_subsys_class_init(ObjectClass *oc, void 
*data)
 
 dc->realize = nvme_subsys_realize;
 dc->desc = "Virtual NVMe subsystem";
+dc->hotpluggable = false;
 
 device_class_set_props(dc, nvme_subsystem_props);
 }
-- 
2.32.0




[PATCH v2 1/4] hw/nvme: remove NvmeCtrl parameter from ns setup/check functions

2021-07-07 Thread Klaus Jensen
From: Klaus Jensen 

The nvme_ns_setup and nvme_ns_check_constraints should not depend on the
controller state. Refactor and remove it.

Reviewed-by: Hannes Reinecke 
Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h |  2 +-
 hw/nvme/ctrl.c |  2 +-
 hw/nvme/ns.c   | 37 ++---
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 56f8eceed2ad..0868359a1e86 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -246,7 +246,7 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns)
 }
 
 void nvme_ns_init_format(NvmeNamespace *ns);
-int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
+int nvme_ns_setup(NvmeNamespace *ns, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_shutdown(NvmeNamespace *ns);
 void nvme_ns_cleanup(NvmeNamespace *ns);
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 629b0d38c2a2..dd1801510032 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6498,7 +6498,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 ns = >namespace;
 ns->params.nsid = 1;
 
-if (nvme_ns_setup(n, ns, errp)) {
+if (nvme_ns_setup(ns, errp)) {
 return;
 }
 
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 4275c3db6301..3c4f5b8c714a 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -346,8 +346,7 @@ static void nvme_zoned_ns_shutdown(NvmeNamespace *ns)
 assert(ns->nr_open_zones == 0);
 }
 
-static int nvme_ns_check_constraints(NvmeCtrl *n, NvmeNamespace *ns,
- Error **errp)
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
 {
 if (!ns->blkconf.blk) {
 error_setg(errp, "block backend not configured");
@@ -366,20 +365,6 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, 
NvmeNamespace *ns,
 return -1;
 }
 
-if (!n->subsys) {
-if (ns->params.detached) {
-error_setg(errp, "detached requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return -1;
-}
-
-if (ns->params.shared) {
-error_setg(errp, "shared requires that the nvme device is "
-   "linked to an nvme-subsys device");
-return -1;
-}
-}
-
 if (ns->params.zoned) {
 if (ns->params.max_active_zones) {
 if (ns->params.max_open_zones > ns->params.max_active_zones) {
@@ -411,9 +396,9 @@ static int nvme_ns_check_constraints(NvmeCtrl *n, NvmeNamespace *ns,
 return 0;
 }
 
-int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+int nvme_ns_setup(NvmeNamespace *ns, Error **errp)
 {
-if (nvme_ns_check_constraints(n, ns, errp)) {
+if (nvme_ns_check_constraints(ns, errp)) {
 return -1;
 }
 
@@ -465,7 +450,21 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 uint32_t nsid = ns->params.nsid;
 int i;
 
-if (nvme_ns_setup(n, ns, errp)) {
+if (!n->subsys) {
+if (ns->params.detached) {
+error_setg(errp, "detached requires that the nvme device is "
+   "linked to an nvme-subsys device");
+return;
+}
+
+if (ns->params.shared) {
+error_setg(errp, "shared requires that the nvme device is "
+   "linked to an nvme-subsys device");
+return;
+}
+}
+
+if (nvme_ns_setup(ns, errp)) {
 return;
 }
 
-- 
2.32.0




[PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-07 Thread Klaus Jensen
From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and forth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note that, in the case where there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
re-added, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1, and I'd really like to at least fix the
unplugging for 6.1; this gets the job done.

  [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
  hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
  hw/nvme: mark nvme-subsys non-hotpluggable
  hw/nvme: unregister controller with subsystem at exit
  hw/nvme: fix controller hot unplugging

 hw/nvme/nvme.h   | 18 +---
 hw/nvme/ctrl.c   | 14 ++--
 hw/nvme/ns.c | 55 +++-
 hw/nvme/subsys.c |  9 
 4 files changed, 63 insertions(+), 33 deletions(-)

-- 
2.32.0




Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen

On Jul  7 12:43, Hannes Reinecke wrote:

On 7/7/21 11:53 AM, Klaus Jensen wrote:

On Jul  7 09:49, Hannes Reinecke wrote:

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 18 ++
 hw/nvme/ctrl.c   |  8 +---
 hw/nvme/ns.c | 32 +---
 hw/nvme/subsys.c |  4 
 4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST - 1);

 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+    BusState parent_bus;
+    bool is_subsys;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+    NvmeBus bus;
 uint8_t subnqn[256];
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-    BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
-    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
 return NULL;
 }
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
-    if (!ns) {
-    continue;
+    if (ns) {
+    ns->attached--;
 }
+    }
-    nvme_ns_cleanup(ns);
+    if (n->subsys) {
+    nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
-    BusState *s = qdev_get_parent_bus(dev);
-    NvmeCtrl *n = NVME(s->parent);
-    NvmeSubsystem *subsys = n->subsys;
+    BusState *qbus = qdev_get_parent_bus(dev);
+    NvmeBus *bus = NVME_BUS(qbus);
+    NvmeCtrl *n = NULL;
+    NvmeSubsystem *subsys = NULL;
 uint32_t nsid = ns->params.nsid;
 int i;
-    if (!n->subsys) {
+    if (bus->is_subsys) {
+    subsys = NVME_SUBSYS(qbus->parent);
+    } else {
+    n = NVME(qbus->parent);
+    subsys = n->subsys;
+    }
+
+    if (subsys) {
+    /*
+     * If this namespace belongs to a subsystem (through a link on the
+     * controller device), reparent the device.
+     */
+    if (!qdev_set_parent_bus(dev, >bus.parent_bus, errp)) {
+    return;
+    }
+    } else {
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device is "
    "linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 if (!nsid) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-    if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+    if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
 continue;
 }
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 return;
 }
 } else {
-    if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+    if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
 error_setg(errp, "namespace id '%d' already allocated", nsid);
 return;
 }
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 }
 }
-    nvme_attach_ns(n, ns);
+    if (n) {
+    

[PULL 02/13] virtio: Clarify MR transaction optimization

2021-07-07 Thread Michael S. Tsirkin
From: Greg Kurz 

The device model batching its ioeventfds in a single MR transaction is
an optimization. Clarify this in virtio-scsi, virtio-blk and generic
virtio code. Also clarify that the transaction must commit before
closing ioeventfds so that no one is tempted to merge the loops
in the start functions' error path and in the stop functions.

Signed-off-by: Greg Kurz 
Message-Id: <162125799728.1394228.339855768563326832.st...@bahia.lan>
Reviewed-by: Michael S. Tsirkin 
Signed-off-by: Michael S. Tsirkin 
Reviewed-by: Stefan Hajnoczi 
---
 hw/block/dataplane/virtio-blk.c | 16 
 hw/scsi/virtio-scsi-dataplane.c | 16 
 hw/virtio/virtio.c  | 16 
 3 files changed, 48 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index cd81893d1d..252c3a7a23 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -198,6 +198,10 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 goto fail_guest_notifiers;
 }
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
 memory_region_transaction_begin();
 
 /* Set up virtqueue notify */
@@ -211,6 +215,10 @@ int virtio_blk_data_plane_start(VirtIODevice *vdev)
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 }
 
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
 memory_region_transaction_commit();
 
 while (j--) {
@@ -330,12 +338,20 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 
 aio_context_release(s->ctx);
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
 memory_region_transaction_begin();
 
 for (i = 0; i < nvqs; i++) {
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 }
 
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
 memory_region_transaction_commit();
 
 for (i = 0; i < nvqs; i++) {
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 28e003250a..18eb824c97 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -152,6 +152,10 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 goto fail_guest_notifiers;
 }
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
 memory_region_transaction_begin();
 
 rc = virtio_scsi_set_host_notifier(s, vs->ctrl_vq, 0);
@@ -198,6 +202,10 @@ fail_host_notifiers:
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 }
 
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
 memory_region_transaction_commit();
 
 for (i = 0; i < vq_init_count; i++) {
@@ -238,12 +246,20 @@ void virtio_scsi_dataplane_stop(VirtIODevice *vdev)
 
 blk_drain_all(); /* ensure there are no in-flight requests */
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
 memory_region_transaction_begin();
 
 for (i = 0; i < vs->conf.num_queues + 2; i++) {
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 }
 
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
 memory_region_transaction_commit();
 
 for (i = 0; i < vs->conf.num_queues + 2; i++) {
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index ab516ac614..6dcf3baf56 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3728,6 +3728,10 @@ static int virtio_device_start_ioeventfd_impl(VirtIODevice *vdev)
 VirtioBusState *qbus = VIRTIO_BUS(qdev_get_parent_bus(DEVICE(vdev)));
 int i, n, r, err;
 
+/*
+ * Batch all the host notifiers in a single transaction to avoid
+ * quadratic time complexity in address_space_update_ioeventfds().
+ */
 memory_region_transaction_begin();
 for (n = 0; n < VIRTIO_QUEUE_MAX; n++) {
 VirtQueue *vq = &vdev->vq[n];
@@ -3766,6 +3770,10 @@ assign_error:
 r = virtio_bus_set_host_notifier(qbus, n, false);
 assert(r >= 0);
 }
+/*
+ * The transaction expects the ioeventfds to be open when it
+ * commits. Do it now, before the cleanup loop.
+ */
 memory_region_transaction_commit();
 
 while (--i >= 0) {
@@ 

Re: [PATCH v3 2/2] qemu-img: Make unallocated part of backing chain obvious in map

2021-07-07 Thread Eric Blake
On Sat, Jul 03, 2021 at 10:25:28AM +0300, Vladimir Sementsov-Ogievskiy wrote:
...
> > An obvious solution is to make 'qemu-img map --output=json' add an
> > additional "present":false designation to any cluster lacking an
> > allocation anywhere in the chain, without any change to the "depth"
> > parameter to avoid breaking existing clients.  The iotests have
> > several examples where this distinction demonstrates the additional
> > accuracy.
> > 
> > Signed-off-by: Eric Blake 

> > +++ b/docs/tools/qemu-img.rst
> > @@ -597,6 +597,9 @@ Command description:
> >   if false, the sectors are either unallocated or stored as optimized
> >   all-zero clusters);
> > - whether the data is known to read as zero (boolean field ``zero``);
> > +  - whether the data is actually present (boolean field ``present``);
> > +if false, rebasing the backing chain onto a deeper file would pick
> > +up data from the deeper file;
> 
> Preexisting, but rather strange style of documentation, when described option 
> doesn't go first in the paragraph..

Yeah.  I'll send a followup email with a rewording of those paragraphs
for consideration.

> > +++ b/qemu-img.c
> > @@ -2980,8 +2980,9 @@ static int dump_map_entry(OutputFormat output_format, MapEntry *e,
> >   break;
> >   case OFORMAT_JSON:
> >   printf("{ \"start\": %"PRId64", \"length\": %"PRId64","
> > -   " \"depth\": %"PRId64", \"zero\": %s, \"data\": %s",
> > -   e->start, e->length, e->depth,
> > +   " \"depth\": %"PRId64", \"present\": %s, \"zero\": %s,"
> > +   "\"data\": %s", e->start, e->length, e->depth,
> > +   e->present ? "true" : "false",
> 
> Didn't you want to put present at the end? Still, this shouldn't be 
> significant. And it make sense to keep present, zero and data together.

I wanted it before anything optional, which "offset" is, so it already
can't be at the end.  If I understood Nir correctly, it was more
important that the field always be present (it's easy to write a parser
that searches for terms in the same position, and tolerates a missing
term from an older version, but harder to parse a term that might or
might not be present).
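For illustration, a map entry with the patch applied might look like this (hypothetical values; note that ``present`` always appears, while ``offset`` remains optional):

```json
{ "start": 0, "length": 65536, "depth": 0, "present": true, "zero": false, "data": true, "offset": 327680}
```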

> 
> You missed a whitespace after '"zero": %s,', which is obvious from further 
> test diff hunks.
> 
> With it fixed:
> Reviewed-by: Vladimir Sementsov-Ogievskiy 

Thanks for catching that.  I've updated that, and will queue through
my NBD tree.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PATCH 3/3] linux-aio: limit the batch size using `aio-max-batch` parameter

2021-07-07 Thread Stefano Garzarella
When there are multiple queues attached to the same AIO context,
some requests may experience high latency, since in the worst case
the AIO engine queue is only flushed when it is full (MAX_EVENTS) or
there are no more queues plugged.

Commit 2558cb8dd4 ("linux-aio: increasing MAX_EVENTS to a larger
hardcoded value") changed MAX_EVENTS from 128 to 1024, to increase
the number of in-flight requests. But this change also increased
the potential maximum batch to 1024 elements.

When there is a single queue attached to the AIO context, the issue
is mitigated by laio_io_unplug(), which flushes the queue every time it
is invoked, since no other queues can be plugged.

Let's use the new `aio-max-batch` IOThread parameter to mitigate
this issue, limiting the number of requests in a batch.

We also define a default value (32): this value was obtained by running
some benchmarks and represents a good tradeoff between the latency
increase while a request is queued and the cost of the io_submit(2)
system call.

Signed-off-by: Stefano Garzarella 
---
 block/linux-aio.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/linux-aio.c b/block/linux-aio.c
index 3c0527c2bf..8a7bb136fc 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -28,6 +28,9 @@
  */
 #define MAX_EVENTS 1024
 
+/* Maximum number of requests in a batch. (default value) */
+#define DEFAULT_MAX_BATCH 32
+
 struct qemu_laiocb {
 Coroutine *co;
 LinuxAioState *ctx;
@@ -351,6 +354,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
 LinuxAioState *s = laiocb->ctx;
 struct iocb *iocbs = &laiocb->iocb;
 QEMUIOVector *qiov = laiocb->qiov;
+int64_t max_batch = s->aio_context->aio_max_batch ?: DEFAULT_MAX_BATCH;
 
 switch (type) {
 case QEMU_AIO_WRITE:
@@ -371,7 +375,7 @@ static int laio_do_submit(int fd, struct qemu_laiocb *laiocb, off_t offset,
 s->io_q.in_queue++;
 if (!s->io_q.blocked &&
 (!s->io_q.plugged ||
- s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
+ s->io_q.in_queue >= max_batch)) {
 ioq_submit(s);
 }
 
-- 
2.31.1




[PATCH 1/3] iothread: generalize iothread_set_param/iothread_get_param

2021-07-07 Thread Stefano Garzarella
These changes prepare for the next patches, where we add a new
parameter not related to the poll mechanism.

Let's add two new generic functions (iothread_set_param and
iothread_get_param) that we use to set and get IOThread
parameters.

Signed-off-by: Stefano Garzarella 
---
 iothread.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/iothread.c b/iothread.c
index 2c5ccd7367..103679a16b 100644
--- a/iothread.c
+++ b/iothread.c
@@ -213,7 +213,7 @@ static PollParamInfo poll_shrink_info = {
 "poll-shrink", offsetof(IOThread, poll_shrink),
 };
 
-static void iothread_get_poll_param(Object *obj, Visitor *v,
+static void iothread_get_param(Object *obj, Visitor *v,
 const char *name, void *opaque, Error **errp)
 {
 IOThread *iothread = IOTHREAD(obj);
@@ -223,7 +223,7 @@ static void iothread_get_poll_param(Object *obj, Visitor *v,
 visit_type_int64(v, name, field, errp);
 }
 
-static void iothread_set_poll_param(Object *obj, Visitor *v,
+static bool iothread_set_param(Object *obj, Visitor *v,
 const char *name, void *opaque, Error **errp)
 {
 IOThread *iothread = IOTHREAD(obj);
@@ -232,17 +232,36 @@ static void iothread_set_poll_param(Object *obj, Visitor *v,
 int64_t value;
 
 if (!visit_type_int64(v, name, &value, errp)) {
-return;
+return false;
 }
 
 if (value < 0) {
 error_setg(errp, "%s value must be in range [0, %" PRId64 "]",
info->name, INT64_MAX);
-return;
+return false;
 }
 
 *field = value;
 
+return true;
+}
+
+static void iothread_get_poll_param(Object *obj, Visitor *v,
+const char *name, void *opaque, Error **errp)
+{
+
+iothread_get_param(obj, v, name, opaque, errp);
+}
+
+static void iothread_set_poll_param(Object *obj, Visitor *v,
+const char *name, void *opaque, Error **errp)
+{
+IOThread *iothread = IOTHREAD(obj);
+
+if (!iothread_set_param(obj, v, name, opaque, errp)) {
+return;
+}
+
 if (iothread->ctx) {
 aio_context_set_poll_params(iothread->ctx,
 iothread->poll_max_ns,
-- 
2.31.1




[PATCH 2/3] iothread: add aio-max-batch parameter

2021-07-07 Thread Stefano Garzarella
The `aio-max-batch` parameter will be propagated to AIO engines
and it will be used to control the maximum number of queued requests.

When the number of queued requests reaches `aio-max-batch`, the engine
invokes the system call to forward the requests to the kernel.

This parameter allows us to control the maximum batch size to reduce
the latency that requests might accumulate while queued in the AIO
engine queue.

If `aio-max-batch` is equal to 0 (default value), the AIO engine will
use its default maximum batch size value.

Signed-off-by: Stefano Garzarella 
---
 qapi/misc.json|  6 -
 qapi/qom.json |  7 -
 include/block/aio.h   | 12 +
 include/sysemu/iothread.h |  3 +++
 iothread.c| 55 +++
 monitor/hmp-cmds.c|  2 ++
 util/aio-posix.c  | 12 +
 util/aio-win32.c  |  5 
 util/async.c  |  2 ++
 qemu-options.hx   |  8 --
 10 files changed, 103 insertions(+), 9 deletions(-)

diff --git a/qapi/misc.json b/qapi/misc.json
index 156f98203e..f64bb69f74 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -86,6 +86,9 @@
 # @poll-shrink: how many ns will be removed from polling time, 0 means that
 #   it's not configured (since 2.9)
 #
+# @aio-max-batch: maximum number of requests in a batch for the AIO engine,
+# 0 means that the engine will use its default (since 6.1)
+#
 # Since: 2.0
 ##
 { 'struct': 'IOThreadInfo',
@@ -93,7 +96,8 @@
'thread-id': 'int',
'poll-max-ns': 'int',
'poll-grow': 'int',
-   'poll-shrink': 'int' } }
+   'poll-shrink': 'int',
+   'aio-max-batch': 'int' } }
 
 ##
 # @query-iothreads:
diff --git a/qapi/qom.json b/qapi/qom.json
index 652be317b8..23fd586614 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -516,12 +516,17 @@
 #   algorithm detects it is spending too long polling without
 #   encountering events. 0 selects a default behaviour (default: 0)
 #
+# @aio-max-batch: maximum number of requests in a batch for the AIO engine,
+# 0 means that the engine will use its default
+# (default: 0, since 6.1)
+#
 # Since: 2.0
 ##
 { 'struct': 'IothreadProperties',
   'data': { '*poll-max-ns': 'int',
 '*poll-grow': 'int',
-'*poll-shrink': 'int' } }
+'*poll-shrink': 'int',
+'*aio-max-batch': 'int' } }
 
 ##
 # @MemoryBackendProperties:
diff --git a/include/block/aio.h b/include/block/aio.h
index 10fcae1515..5b7983348f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -232,6 +232,9 @@ struct AioContext {
 int64_t poll_grow;  /* polling time growth factor */
 int64_t poll_shrink;/* polling time shrink factor */
 
+/* AIO engine parameters */
+int64_t aio_max_batch;  /* maximum number of requests in a batch */
+
 /*
  * List of handlers participating in userspace polling.  Protected by
  * ctx->list_lock.  Iterated and modified mostly by the event loop thread
@@ -730,4 +733,13 @@ void aio_context_set_poll_params(AioContext *ctx, int64_t 
max_ns,
  int64_t grow, int64_t shrink,
  Error **errp);
 
+/**
+ * aio_context_set_aio_params:
+ * @ctx: the aio context
+ * @max_batch: maximum number of requests in a batch, 0 means that the
+ * engine will use its default
+ */
+void aio_context_set_aio_params(AioContext *ctx, int64_t max_batch,
+Error **errp);
+
 #endif
diff --git a/include/sysemu/iothread.h b/include/sysemu/iothread.h
index f177142f16..7f714bd136 100644
--- a/include/sysemu/iothread.h
+++ b/include/sysemu/iothread.h
@@ -37,6 +37,9 @@ struct IOThread {
 int64_t poll_max_ns;
 int64_t poll_grow;
 int64_t poll_shrink;
+
+/* AioContext AIO engine parameters */
+int64_t aio_max_batch;
 };
 typedef struct IOThread IOThread;
 
diff --git a/iothread.c b/iothread.c
index 103679a16b..ddbbde61f7 100644
--- a/iothread.c
+++ b/iothread.c
@@ -152,6 +152,24 @@ static void iothread_init_gcontext(IOThread *iothread)
 iothread->main_loop = g_main_loop_new(iothread->worker_context, TRUE);
 }
 
+static void iothread_set_aio_context_params(IOThread *iothread, Error **errp)
+{
+ERRP_GUARD();
+
+aio_context_set_poll_params(iothread->ctx,
+iothread->poll_max_ns,
+iothread->poll_grow,
+iothread->poll_shrink,
+errp);
+if (*errp) {
+return;
+}
+
+aio_context_set_aio_params(iothread->ctx,
+   iothread->aio_max_batch,
+   errp);
+}
+
 static void iothread_complete(UserCreatable *obj, Error **errp)
 {
 Error *local_error = NULL;
@@ -171,11 +189,7 @@ static void 

[PATCH 0/3] linux-aio: limit the batch size to reduce queue latency

2021-07-07 Thread Stefano Garzarella
This series add a new `aio-max-batch` parameter to IOThread, and use it in the
Linux AIO backend to limit the batch size (number of request submitted to the
kernel through io_submit(2)).

Commit 2558cb8dd4 ("linux-aio: increasing MAX_EVENTS to a larger hardcoded
value") changed MAX_EVENTS from 128 to 1024, to increase the number of
in-flight requests. But this change also increased the potential maximum batch
to 1024 elements.

The problem is noticeable when we have a lot of requests in flight and multiple
queues attached to the same AIO context.
In this case we potentially create very large batches. Instead, when we have
a single queue, the batch is limited because when the queue is unplugged,
there is a call to io_submit(2).
In practice, io_submit(2) was called only when there were no more queues
plugged in or when we filled the AIO queue (MAX_EVENTS = 1024).

I ran some benchmarks to choose 32 as the default batch value for Linux AIO.
Below are the kIOPS measured with fio running in the guest (average over 3 runs):

   |   master  |   with this series applied|
   |687f9f7834e| maxbatch=8|maxbatch=16|maxbatch=32|maxbatch=64|
  # queues | 1q  | 4qs | 1q  | 4qs | 1q  | 4qs | 1q  | 4qs | 1q  | 4qs |
-- randread tests -|---|
bs=4k iodepth=1| 193 | 188 | 204 | 198 | 194 | 202 | 201 | 213 | 195 | 201 |
bs=4k iodepth=8| 241 | 265 | 247 | 248 | 249 | 250 | 257 | 269 | 270 | 240 |
bs=4k iodepth=64   | 216 | 202 | 257 | 269 | 269 | 256 | 258 | 271 | 254 | 251 |
bs=4k iodepth=128  | 212 | 177 | 267 | 253 | 285 | 271 | 245 | 281 | 255 | 269 |
bs=16k iodepth=1   | 130 | 133 | 137 | 137 | 130 | 130 | 130 | 130 | 130 | 130 |
bs=16k iodepth=8   | 130 | 137 | 144 | 137 | 131 | 130 | 131 | 131 | 130 | 131 |
bs=16k iodepth=64  | 130 | 104 | 137 | 134 | 131 | 128 | 131 | 128 | 137 | 128 |
bs=16k iodepth=128 | 130 | 101 | 137 | 134 | 131 | 129 | 131 | 129 | 138 | 129 |

1q  = virtio-blk device with a single queue
4qs = virtio-blk device with multiple queues (one queue per vCPU - 4)

I reported only the most significant tests, but I also ran other tests to
make sure there were no regressions; here is the full report:
https://docs.google.com/spreadsheets/d/11X3_5FJu7pnMTlf4ZatRDvsnU9K3EPj6Mn3aJIsE4tI

Test environment:
- Disk: Intel Corporation NVMe Datacenter SSD [Optane]
- CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
- QEMU: qemu-system-x86_64 -machine q35,accel=kvm -smp 4 -m 4096 \
  ... \
  -object iothread,id=iothread0,aio-max-batch=${MAX_BATCH} \
  -device virtio-blk-pci,iothread=iothread0,num-queues=${NUM_QUEUES}

- benchmark: fio --ioengine=libaio --thread --group_reporting \
 --number_ios=20 --direct=1 --filename=/dev/vdb \
 --rw=${TEST} --bs=${BS} --iodepth=${IODEPTH} --numjobs=16

Next steps:
 - benchmark io_uring and use `aio-max-batch` also there
 - make MAX_EVENTS parametric adding a new `aio-max-events` parameter

Comments and suggestions are welcome :-)

Thanks,
Stefano

Stefano Garzarella (3):
  iothread: generalize iothread_set_param/iothread_get_param
  iothread: add aio-max-batch parameter
  linux-aio: limit the batch size using `aio-max-batch` parameter

 qapi/misc.json|  6 ++-
 qapi/qom.json |  7 +++-
 include/block/aio.h   | 12 ++
 include/sysemu/iothread.h |  3 ++
 block/linux-aio.c |  6 ++-
 iothread.c| 82 ++-
 monitor/hmp-cmds.c|  2 +
 util/aio-posix.c  | 12 ++
 util/aio-win32.c  |  5 +++
 util/async.c  |  2 +
 qemu-options.hx   |  8 +++-
 11 files changed, 131 insertions(+), 14 deletions(-)

-- 
2.31.1




Re: [PATCH] block/replication.c: Properly attach children

2021-07-07 Thread Lukas Straub
Hi,
Thanks for your review. More below.

Btw: There is a overview of the replication design in
docs/block-replication.txt

On Wed, 7 Jul 2021 16:01:31 +0300
Vladimir Sementsov-Ogievskiy  wrote:

> 06.07.2021 19:11, Lukas Straub wrote:
> > The replication driver needs access to the child block-nodes of
> > its child so it can issue bdrv_make_empty to manage the replication.
> > However, it does this by directly copying the BdrvChilds, which is
> > wrong.
> > 
> > Fix this by properly attaching the block-nodes with
> > bdrv_attach_child().
> > 
> > Also, remove a workaround introduced in commit
> > 6ecbc6c52672db5c13805735ca02784879ce8285
> > "replication: Avoid blk_make_empty() on read-only child".
> > 
> > Signed-off-by: Lukas Straub 
> > ---
> > 
> > -v2: Test for BDRV_CHILD_PRIMARY in replication_child_perm, since
> >   bs->file might not be set yet. (Vladimir)
> > 
> >   block/replication.c | 94 +
> >   1 file changed, 61 insertions(+), 33 deletions(-)
> > 
> > diff --git a/block/replication.c b/block/replication.c
> > index 52163f2d1f..fd8cb728a3 100644
> > --- a/block/replication.c
> > +++ b/block/replication.c
> > @@ -166,7 +166,12 @@ static void replication_child_perm(BlockDriverState *bs, BdrvChild *c,
> >  uint64_t perm, uint64_t shared,
> >  uint64_t *nperm, uint64_t *nshared)
> >   {
> > -*nperm = BLK_PERM_CONSISTENT_READ;
> > +if (role & BDRV_CHILD_PRIMARY) {
> > +*nperm = BLK_PERM_CONSISTENT_READ;
> > +} else {
> > +*nperm = 0;
> > +}  
> 
> Why do you drop READ access for other children? You don't mention it in 
> the commit message..
> 
> Upd: ok now I see that we are not going to read from hidden_disk child, and 
> that's the only "other" child that worth to mention.
> Still, we should be sure that hidden_disk child gets WRITE permission in case 
> we are going to call bdrv_make_empty on it.

The code below that in replication_child_perm() should make sure of
that, or am I misunderstanding it?

Or do you mean that it should always request WRITE regardless of
bs->open_flags & BDRV_O_INACTIVE?

> > +
> >   if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
> >   *nperm |= BLK_PERM_WRITE;
> >   }
> > @@ -340,17 +345,7 @@ static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> >   return;
> >   }
> >   
> > -BlockBackend *blk = blk_new(qemu_get_current_aio_context(),
> > -BLK_PERM_WRITE, BLK_PERM_ALL);
> > -blk_insert_bs(blk, s->hidden_disk->bs, &local_err);
> > -if (local_err) {
> > -error_propagate(errp, local_err);
> > -blk_unref(blk);
> > -return;
> > -}
> > -
> > -ret = blk_make_empty(blk, errp);
> > -blk_unref(blk);
> > +ret = bdrv_make_empty(s->hidden_disk, errp);  
> 
> So, here you rely on BLK_PERM_WRITE being set in replication_child_perm().. 
> Probably that's OK; however, the logic is changed. Shouldn't we always require 
> write permission in replication_child_perm() for hidden_disk ?
> 
> >   if (ret < 0) {
> >   return;
> >   }
> > @@ -365,27 +360,35 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable,
> >   Error **errp)
> >   {
> >   BDRVReplicationState *s = bs->opaque;
> > +BdrvChild *hidden_disk, *secondary_disk;
> >   BlockReopenQueue *reopen_queue = NULL;
> >   
> > +/*
> > + * s->hidden_disk and s->secondary_disk may not be set yet, as they will
> > + * only be set after the children are writable.
> > + */
> > +hidden_disk = bs->file->bs->backing;
> > +secondary_disk = hidden_disk->bs->backing;
> > +
> >   if (writable) {
> > -s->orig_hidden_read_only = bdrv_is_read_only(s->hidden_disk->bs);
> > -s->orig_secondary_read_only = bdrv_is_read_only(s->secondary_disk->bs);
> > +s->orig_hidden_read_only = bdrv_is_read_only(hidden_disk->bs);
> > +s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
> >   }
> >   
> > -bdrv_subtree_drained_begin(s->hidden_disk->bs);
> > -bdrv_subtree_drained_begin(s->secondary_disk->bs);
> > +bdrv_subtree_drained_begin(hidden_disk->bs);
> > +bdrv_subtree_drained_begin(secondary_disk->bs);  
> 
> That kind of stuff may be done as a separate preparation patch, with 
> no-logic-change refactoring; this makes the main logic-change patch simpler.

Yes, makes sense. Will do in the next version.

> >   
> >   if (s->orig_hidden_read_only) {
> >   QDict *opts = qdict_new();
> >   qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
> > -reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk->bs,
> > +reopen_queue = bdrv_reopen_queue(reopen_queue, hidden_disk->bs,
> >opts, true);
> 

Re: [PATCH] vhost-user: Fix backends without multiqueue support

2021-07-07 Thread Cornelia Huck
On Mon, Jul 05 2021, Kevin Wolf  wrote:

> dev->max_queues was never initialised for backends that don't support
> VHOST_USER_PROTOCOL_F_MQ, so it would use 0 as the maximum number of
> queues to check against and consequently fail for any such backend.
>
> Set it to 1 if the backend doesn't have multiqueue support.
>
> Fixes: c90bd505a3e8210c23d69fecab9ee6f56ec4a161
> Signed-off-by: Kevin Wolf 
> ---
>  hw/virtio/vhost-user.c | 3 +++
>  1 file changed, 3 insertions(+)

Reviewed-by: Cornelia Huck 




Re: [PATCH] blockdev: fix drive-backup transaction endless drained section

2021-07-07 Thread Vladimir Sementsov-Ogievskiy

Forgotten thing :(

Kevin, could you please queue it in your block branch, so that I don't have
to bother Peter with a one-patch pull request?

08.06.2021 20:18, Vladimir Sementsov-Ogievskiy wrote:

drive_backup_prepare() does bdrv_drained_begin() in the hope that
bdrv_drained_end() will be called in drive_backup_clean(). Still, we
need to set state->bs for this to work. That's done too late: a lot of
failure paths in drive_backup_prepare() miss setting state->bs. Fix
that.

Fixes: 2288ccfac96281c316db942d10e3f921c1373064
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/399
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  blockdev.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index f08192deda..094c085962 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1714,6 +1714,7 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
  aio_context = bdrv_get_aio_context(bs);
  aio_context_acquire(aio_context);
  
+state->bs = bs;

  /* Paired with .clean() */
  bdrv_drained_begin(bs);
  
@@ -1813,8 +1814,6 @@ static void drive_backup_prepare(BlkActionState *common, Error **errp)

  }
  }
  
-state->bs = bs;

-
  state->job = do_backup_common(qapi_DriveBackup_base(backup),
bs, target_bs, aio_context,
common->block_job_txn, errp);




--
Best regards,
Vladimir



Re: [PATCH] block/replication.c: Properly attach children

2021-07-07 Thread Vladimir Sementsov-Ogievskiy

06.07.2021 19:11, Lukas Straub wrote:

The replication driver needs access to the child block nodes of
its child so it can issue bdrv_make_empty to manage the replication.
However, it does this by directly copying the BdrvChilds, which is
wrong.

Fix this by properly attaching the block-nodes with
bdrv_attach_child().

Also, remove a workaround introduced in commit
6ecbc6c52672db5c13805735ca02784879ce8285
"replication: Avoid blk_make_empty() on read-only child".

Signed-off-by: Lukas Straub 
---

-v2: Test for BDRV_CHILD_PRIMARY in replication_child_perm, since
  bs->file might not be set yet. (Vladimir)

  block/replication.c | 94 +
  1 file changed, 61 insertions(+), 33 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 52163f2d1f..fd8cb728a3 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -166,7 +166,12 @@ static void replication_child_perm(BlockDriverState *bs, 
BdrvChild *c,
 uint64_t perm, uint64_t shared,
 uint64_t *nperm, uint64_t *nshared)
  {
-*nperm = BLK_PERM_CONSISTENT_READ;
+if (role & BDRV_CHILD_PRIMARY) {
+*nperm = BLK_PERM_CONSISTENT_READ;
+} else {
+*nperm = 0;
+}


Why do you drop READ access for the other children? You don't mention it in
the commit message.

Upd: OK, now I see that we are not going to read from the hidden_disk child,
and that's the only "other" child worth mentioning.
Still, we should make sure that the hidden_disk child gets WRITE permission,
since we are going to call bdrv_make_empty on it.


+
  if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
  *nperm |= BLK_PERM_WRITE;
  }
@@ -340,17 +345,7 @@ static void secondary_do_checkpoint(BDRVReplicationState 
*s, Error **errp)
  return;
  }
  
-BlockBackend *blk = blk_new(qemu_get_current_aio_context(),
-BLK_PERM_WRITE, BLK_PERM_ALL);
-blk_insert_bs(blk, s->hidden_disk->bs, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-blk_unref(blk);
-return;
-}
-
-ret = blk_make_empty(blk, errp);
-blk_unref(blk);
+ret = bdrv_make_empty(s->hidden_disk, errp);


So, here you rely on BLK_PERM_WRITE being set in replication_child_perm()...
Probably that's OK, but the logic has changed. Shouldn't we always require
write permission in replication_child_perm() for hidden_disk?


  if (ret < 0) {
  return;
  }
@@ -365,27 +360,35 @@ static void reopen_backing_file(BlockDriverState *bs, 
bool writable,
  Error **errp)
  {
  BDRVReplicationState *s = bs->opaque;
+BdrvChild *hidden_disk, *secondary_disk;
  BlockReopenQueue *reopen_queue = NULL;
  
+/*

+ * s->hidden_disk and s->secondary_disk may not be set yet, as they will
+ * only be set after the children are writable.
+ */
+hidden_disk = bs->file->bs->backing;
+secondary_disk = hidden_disk->bs->backing;
+
  if (writable) {
-s->orig_hidden_read_only = bdrv_is_read_only(s->hidden_disk->bs);
-s->orig_secondary_read_only = bdrv_is_read_only(s->secondary_disk->bs);
+s->orig_hidden_read_only = bdrv_is_read_only(hidden_disk->bs);
+s->orig_secondary_read_only = bdrv_is_read_only(secondary_disk->bs);
  }
  
-bdrv_subtree_drained_begin(s->hidden_disk->bs);
-bdrv_subtree_drained_begin(s->secondary_disk->bs);
+bdrv_subtree_drained_begin(hidden_disk->bs);
+bdrv_subtree_drained_begin(secondary_disk->bs);


That kind of stuff may be done as a separate preparation patch with
no-logic-change refactoring; this makes the main logic-change patch simpler.

  
  if (s->orig_hidden_read_only) {

  QDict *opts = qdict_new();
  qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->hidden_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, hidden_disk->bs,
   opts, true);
  }
  
  if (s->orig_secondary_read_only) {

  QDict *opts = qdict_new();
  qdict_put_bool(opts, BDRV_OPT_READ_ONLY, !writable);
-reopen_queue = bdrv_reopen_queue(reopen_queue, s->secondary_disk->bs,
+reopen_queue = bdrv_reopen_queue(reopen_queue, secondary_disk->bs,
   opts, true);
  }

Probably it could be a separate patch, if it is needed.
  
@@ -393,8 +396,8 @@ static void reopen_backing_file(BlockDriverState *bs, bool writable,

  bdrv_reopen_multiple(reopen_queue, errp);
  }
  
-bdrv_subtree_drained_end(s->hidden_disk->bs);
-bdrv_subtree_drained_end(s->secondary_disk->bs);
+bdrv_subtree_drained_end(hidden_disk->bs);
+bdrv_subtree_drained_end(secondary_disk->bs);
  }
  
  static void backup_job_cleanup(BlockDriverState *bs)

@@ -451,6 +454,7 

Re: [PATCH v1] block/raw-format: implement .bdrv_get_specific_info handler

2021-07-07 Thread Kevin Wolf
Am 07.07.2021 um 10:50 hat Or Ozeri geschrieben:
> Would you suggest to do this child traversal on bdrv_query_image_info, and
> have it returned as part of the ImageInfo struct?
> In that case, I would add *driver-specific to ImageInfo, in addition to the
> existing *format-specific?

No, extending ImageInfo with a single additional field wouldn't be
generic either. It's not set in stone that your graph must consist of
exactly two block nodes.

> Or should I just do the traversal in img_info (qemu-img.c), avoiding
> the change to the ImageInfo struct?

Yes, img_info() or bdrv_image_info_dump() doing the traversal through
the chain is what I had in mind. Maybe let img_info() collect everything
and then pass a list of ImageInfos instead of just a single one to
bdrv_image_info_dump().

Kevin

> -"Kevin Wolf" <kw...@redhat.com> wrote: -
> To: "Or Ozeri" <o...@il.ibm.com>
> From: "Kevin Wolf" <kw...@redhat.com>
> Date: 07/07/2021 10:52AM
> Cc: qemu-de...@nongnu.org, qemu-block@nongnu.org, to.my.troc...@gmail.com,
> dan...@il.ibm.com, berra...@redhat.com, idryo...@gmail.com
> Subject: [EXTERNAL] Re: [PATCH v1] block/raw-format: implement
> .bdrv_get_specific_info handler
> 
> Am 07.07.2021 um 07:35 hat Or Ozeri geschrieben:
> > When using the raw format, allow exposing specific info by the underlying
> storage.
> > In particular, this will enable RBD images using the raw format to indicate
> > a LUKS2 encrypted image in the output of qemu-img info.
> >
> > Signed-off-by: Or Ozeri <o...@il.ibm.com>
> 
> This doesn't feel right because it would introduce an inconsistency
> (drivers are supposed to return information about their layer, and all
> drivers except raw would still do so) and therefore wouldn't even solve
> the full problem: For non-raw images, the information isn't any less
> useful, but it still wouldn't be available.
> 
> I believe the information is already available in QMP, using
> query-named-block-nodes, because then you get a separate BlockDeviceInfo
> (which contains ImageInfoSpecific) for each node, including the protocol
> node.
> 
> So maybe what we need to do is change qemu-img to iterate the node chain
> (possibly using bdrv_primary_bs()) and print the ImageInfoSpecific for
> each layer if it's present, while indicating which part is for which
> layer.
> 
> So the output could look like this:
> 
> ...
> Driver specific information (qcow2):
> compat: 0.10
> compression type: zlib
> refcount bits: 16
> Driver specific information (rbd):
> encryption format: luks
> 
> Kevin
> 
> 
> 
> 




Re: [PATCH v2 0/6] export/fuse: Allow other users access to the export

2021-07-07 Thread Kevin Wolf
Am 25.06.2021 um 16:23 hat Max Reitz geschrieben:
> Max Reitz (6):
>   export/fuse: Pass default_permissions for mount
>   export/fuse: Add allow-other option
>   export/fuse: Give SET_ATTR_SIZE its own branch
>   export/fuse: Let permissions be adjustable
>   iotests/308: Test +w on read-only FUSE exports
>   iotests/fuse-allow-other: Test allow-other

Thanks, fixed up the comments in the test case as indicated and applied
to the block branch.

Kevin




Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Hannes Reinecke

On 7/7/21 11:53 AM, Klaus Jensen wrote:

On Jul  7 09:49, Hannes Reinecke wrote:

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 18 ++
 hw/nvme/ctrl.c   |  8 +---
 hw/nvme/ns.c | 32 +---
 hw/nvme/subsys.c |  4 
 4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > 
NVME_NSID_BROADCAST - 1);

 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+    BusState parent_bus;
+    bool is_subsys;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+    NvmeBus bus;
 uint8_t subnqn[256];
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-    BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
-    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
 return NULL;
 }
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
-    if (!ns) {
-    continue;
+    if (ns) {
+    ns->attached--;
 }
+    }
-    nvme_ns_cleanup(ns);
+    if (n->subsys) {
+    nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
-    BusState *s = qdev_get_parent_bus(dev);
-    NvmeCtrl *n = NVME(s->parent);
-    NvmeSubsystem *subsys = n->subsys;
+    BusState *qbus = qdev_get_parent_bus(dev);
+    NvmeBus *bus = NVME_BUS(qbus);
+    NvmeCtrl *n = NULL;
+    NvmeSubsystem *subsys = NULL;
 uint32_t nsid = ns->params.nsid;
 int i;
-    if (!n->subsys) {
+    if (bus->is_subsys) {
+    subsys = NVME_SUBSYS(qbus->parent);
+    } else {
+    n = NVME(qbus->parent);
+    subsys = n->subsys;
+    }
+
+    if (subsys) {
+    /*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+    if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+    return;
+    }
+    } else {
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device is "
    "linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, 
Error **errp)

 if (!nsid) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-    if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+    if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
 continue;
 }
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, 
Error **errp)

 return;
 }
 } else {
-    if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+    if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
 error_setg(errp, "namespace id '%d' already allocated", nsid);

 return;
 }
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, 
Error **errp)

 }
 }
-    nvme_attach_ns(n, ns);
+    if (n) {
+    nvme_attach_ns(n, ns);
+    }
 }
 static Property 

Re: [PATCH v2 6/6] iotests/fuse-allow-other: Test allow-other

2021-07-07 Thread Kevin Wolf
Am 25.06.2021 um 16:23 hat Max Reitz geschrieben:
> Signed-off-by: Max Reitz 
> ---
>  tests/qemu-iotests/tests/fuse-allow-other | 175 ++
>  tests/qemu-iotests/tests/fuse-allow-other.out |  88 +
>  2 files changed, 263 insertions(+)
>  create mode 100755 tests/qemu-iotests/tests/fuse-allow-other
>  create mode 100644 tests/qemu-iotests/tests/fuse-allow-other.out
> 
> diff --git a/tests/qemu-iotests/tests/fuse-allow-other 
> b/tests/qemu-iotests/tests/fuse-allow-other
> new file mode 100755
> index 00..a513dbce66
> --- /dev/null
> +++ b/tests/qemu-iotests/tests/fuse-allow-other
> @@ -0,0 +1,175 @@
> +#!/usr/bin/env bash
> +# group: rw
> +#
> +# Test FUSE exports' allow-other option
> +#
> +# Copyright (C) 2021 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +#
> +
> +seq=$(basename "$0")
> +echo "QA output created by $seq"
> +
> +status=1 # failure is the default!
> +
> +_cleanup()
> +{
> +_cleanup_qemu
> +_cleanup_test_img
> +rm -f "$EXT_MP"
> +}
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ../common.rc
> +. ../common.filter
> +. ../common.qemu
> +
> +_supported_fmt generic
> +
> +_supported_proto file # We create the FUSE export manually
> +
> +sudo -n -u nobody true || \
> +_notrun 'Password-less sudo as nobody required to test allow_other'
> +
> +# $1: Export ID
> +# $2: Options (beyond the node-name and ID)
> +# $3: Expected return value (defaults to 'return')
> +# $4: Node to export (defaults to 'node-format')
> +fuse_export_add()
> +{
> +allow_other_not_supported='option allow_other only allowed if'
> +
> +output=$(
> +success_or_failure=yes _send_qemu_cmd $QEMU_HANDLE \
> +"{'execute': 'block-export-add',
> +  'arguments': {
> +  'type': 'fuse',
> +  'id': '$1',
> +  'node-name': '${4:-node-format}',
> +  $2
> +  } }" \
> +"${3:-return}" \
> +"$allow_other_not_supported" \
> +| _filter_imgfmt
> +)
> +
> +if echo "$output" | grep -q "$allow_other_not_supported"; then
> +# Shut down qemu gracefully so it can unmount the export
> +_send_qemu_cmd $QEMU_HANDLE \
> +"{'execute': 'quit'}" \
> +'return'
> +
> +wait=yes _cleanup_qemu
> +
> +_notrun "allow_other not supported"
> +fi
> +
> +echo "$output"
> +}
> +
> +EXT_MP="$TEST_DIR/fuse-export"
> +
> +_make_test_img 64k
> +touch "$EXT_MP"
> +
> +echo
> +echo '=== Test permissions ==='
> +
> +# Test that you can only change permissions on the export with 
> allow-other=true.
> +# We cannot really test the primary reason behind allow-other (i.e. to allow
> +# users other than the current one access to the export), because for that we
> +# would need sudo, which realistically nobody will allow this test to use.
> +# What we can do is test that allow-other=true also enables 
> default_permissions,
> +# i.e. whether we can still read from the file if we remove the read 
> permission.

I don't think this comment is accurate any more now that you're actually
using sudo.

> +# $1: allow-other value ('true' or 'false')

on/off/auto, actually.

I can fix this up while applying, removing the comment block above, and
adjusting this line.

Kevin




Re: [PATCH v2 2/6] export/fuse: Add allow-other option

2021-07-07 Thread Kevin Wolf
Am 25.06.2021 um 16:23 hat Max Reitz geschrieben:
> Without the allow_other mount option, no user (not even root) but the
> one who started qemu/the storage daemon can access the export.  Allow
> users to configure the export such that such accesses are possible.
> 
> While allow_other is probably what users want, we cannot make it an
> unconditional default, because passing it is only possible (for non-root
> users) if the global fuse.conf configuration file allows it.  Thus, the
> default is an 'auto' mode, in which we first try with allow_other, and
> then fall back to without.
> 
> FuseExport.allow_other reports whether allow_other was actually used as
> a mount option or not.  Currently, this information is not used, but a
> future patch will let this field decide whether e.g. an export's UID and
> GID can be changed through chmod.
> 
> One notable thing about 'auto' mode is that libfuse may print error
> messages directly to stderr, and so may fusermount (which it executes).
> Our export code cannot really filter or hide them.  Therefore, if 'auto'
> fails its first attempt and has to fall back, fusermount will print an
> error message that mounting with allow_other failed.
> 
> This behavior necessitates a change to iotest 308, namely we need to
> filter out this error message (because if the first attempt at mounting
> with allow_other succeeds, there will be no such message).
> 
> Furthermore, common.rc's _make_test_img should use allow-other=off for
> FUSE exports, because iotests generally do not need to access images
> from other users, so allow-other=on or allow-other=auto have no
> advantage.  OTOH, allow-other=on will not work on systems where
> user_allow_other is disabled, and with allow-other=auto, we get said
> error message that we would need to filter out again.  Just disabling
> allow-other is simplest.
> 
> Signed-off-by: Max Reitz 
> ---
>  qapi/block-export.json   | 33 -
>  block/export/fuse.c  | 28 +++-
>  tests/qemu-iotests/308   |  6 +-
>  tests/qemu-iotests/common.rc |  6 +-
>  4 files changed, 65 insertions(+), 8 deletions(-)
> 
> diff --git a/qapi/block-export.json b/qapi/block-export.json
> index e819e70cac..0ed63442a8 100644
> --- a/qapi/block-export.json
> +++ b/qapi/block-export.json
> @@ -120,6 +120,23 @@
>   '*logical-block-size': 'size',
>  '*num-queues': 'uint16'} }
>  
> +##
> +# @FuseExportAllowOther:
> +#
> +# Possible allow_other modes for FUSE exports.
> +#
> +# @off: Do not pass allow_other as a mount option.
> +#
> +# @on: Pass allow_other as a mount option.
> +#
> +# @auto: Try mounting with allow_other first, and if that fails, retry
> +#without allow_other.
> +#
> +# Since: 6.1
> +##
> +{ 'enum': 'FuseExportAllowOther',
> +  'data': ['off', 'on', 'auto'] }

Why not use the generic OnOffAuto type from common.json?

But since the external interface is unaffected, we can later change this as
a code cleanup; with soft freeze approaching, I won't consider this a
blocker.

Kevin




Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen

On Jul  6 11:33, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
hw/nvme/nvme.h   | 18 ++
hw/nvme/ctrl.c   |  8 +---
hw/nvme/ns.c | 32 +---
hw/nvme/subsys.c |  4 
4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
typedef struct NvmeCtrl NvmeCtrl;
typedef struct NvmeNamespace NvmeNamespace;

+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+bool is_subsys;
+} NvmeBus;
+
#define TYPE_NVME_SUBSYS "nvme-subsys"
#define NVME_SUBSYS(obj) \
OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)

typedef struct NvmeSubsystem {
DeviceState parent_obj;
+NvmeBus bus;
uint8_t subnqn[256];

NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
QTAILQ_HEAD(, NvmeRequest) req_list;
} NvmeCQueue;

-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
#define TYPE_NVME "nvme"
#define NVME(obj) \
OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {

static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
{
-if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
return NULL;
}

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)

for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (ns) {
+ns->attached--;
}
+}

-nvme_ns_cleanup(ns);
+if (n->subsys) {
+nvme_subsys_unregister_ctrl(n->subsys, n);
}

if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
static void nvme_ns_realize(DeviceState *dev, Error **errp)
{
NvmeNamespace *ns = NVME_NS(dev);
-BusState *s = qdev_get_parent_bus(dev);
-NvmeCtrl *n = NVME(s->parent);
-NvmeSubsystem *subsys = n->subsys;
+BusState *qbus = qdev_get_parent_bus(dev);
+NvmeBus *bus = NVME_BUS(qbus);
+NvmeCtrl *n = NULL;
+NvmeSubsystem *subsys = NULL;
uint32_t nsid = ns->params.nsid;
int i;

-if (!n->subsys) {
+if (bus->is_subsys) {
+subsys = NVME_SUBSYS(qbus->parent);
+} else {
+n = NVME(qbus->parent);
+subsys = n->subsys;
+}


So, I realized that this if is not needed, since the device will always 
attach to the bus from the nvme device, never the 'fake' one from the 
subsystem.



+
+if (subsys) {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}
+} else {
if (ns->params.detached) {
error_setg(errp, "detached requires that the nvme device is "
   "linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)

if (!nsid) {
for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
continue;
}

@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
return;
}
} else {
-if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
error_setg(errp, "namespace id '%d' already allocated", nsid);
return;
}
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
}
}

-nvme_attach_ns(n, ns);
+if (n) {
+nvme_attach_ns(n, ns);
+}

Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Klaus Jensen

On Jul  7 09:49, Hannes Reinecke wrote:

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
 hw/nvme/nvme.h   | 18 ++
 hw/nvme/ctrl.c   |  8 +---
 hw/nvme/ns.c | 32 +---
 hw/nvme/subsys.c |  4 
 4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
 typedef struct NvmeCtrl NvmeCtrl;
 typedef struct NvmeNamespace NvmeNamespace;
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+BusState parent_bus;
+bool is_subsys;
+} NvmeBus;
+
 #define TYPE_NVME_SUBSYS "nvme-subsys"
 #define NVME_SUBSYS(obj) \
 OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
 typedef struct NvmeSubsystem {
 DeviceState parent_obj;
+NvmeBus bus;
 uint8_t subnqn[256];
 NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];
@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
 QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-BusState parent_bus;
-} NvmeBus;
-
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
 OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
 static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
-if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
 return NULL;
 }
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
 ns = nvme_ns(n, i);
-if (!ns) {
-continue;
+if (ns) {
+ns->attached--;
 }
+}
-nvme_ns_cleanup(ns);
+if (n->subsys) {
+nvme_subsys_unregister_ctrl(n->subsys, n);
 }
 if (n->subsys) {
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
 static void nvme_ns_realize(DeviceState *dev, Error **errp)
 {
 NvmeNamespace *ns = NVME_NS(dev);
-BusState *s = qdev_get_parent_bus(dev);
-NvmeCtrl *n = NVME(s->parent);
-NvmeSubsystem *subsys = n->subsys;
+BusState *qbus = qdev_get_parent_bus(dev);
+NvmeBus *bus = NVME_BUS(qbus);
+NvmeCtrl *n = NULL;
+NvmeSubsystem *subsys = NULL;
 uint32_t nsid = ns->params.nsid;
 int i;
-if (!n->subsys) {
+if (bus->is_subsys) {
+subsys = NVME_SUBSYS(qbus->parent);
+} else {
+n = NVME(qbus->parent);
+subsys = n->subsys;
+}
+
+if (subsys) {
+/*
+ * If this namespace belongs to a subsystem (through a link on the
+ * controller device), reparent the device.
+ */
+if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+return;
+}
+} else {
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device is "
"linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 if (!nsid) {
 for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
 continue;
 }
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 return;
 }
 } else {
-if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
 error_setg(errp, "namespace id '%d' already allocated", nsid);
 return;
 }
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 }
 }
-nvme_attach_ns(n, ns);
+if (n) {
+nvme_attach_ns(n, ns);
+}
 }
 static Property nvme_ns_props[] = {
diff --git a/hw/nvme/subsys.c 

Re: [PATCH] hw/ide: Fix crash when plugging a piix3-ide device into the x-remote machine

2021-07-07 Thread Stefan Hajnoczi
On Tue, Jul 06, 2021 at 10:37:03AM +0200, Philippe Mathieu-Daudé wrote:
> Stefan, IIRC the multi-process conclusion was that we have to reject
> PCI devices bridging another (non-PCI) bus, such as ISA / I2C / USB
> / SD / ..., because QEMU registers the bus type globally and the
> command line machinery resolves it to plug user-creatable devices,
> so we cannot share such buses. Is that correct?

I'm not sure I understand, but I'll try:

You can implement an out-of-process USB host controller (a PCI device),
but QEMU will not be aware of devices on this out-of-process USB bus.

If you're referring to a PCI IDE controller that is also exposed on the
ISA bus, then that's hard to do. Maybe there would need to be a separate
ISA-to-PCI bridge device so there's a clean separation between the PCI
device and the ISA portion. The current multi-process QEMU protocol (and
the upcoming vfio-user protocol) support PCI devices but not ISA
devices.

Stefan




RE: [PATCH v1] block/raw-format: implement .bdrv_get_specific_info handler

2021-07-07 Thread Or Ozeri
Would you suggest to do this child traversal on bdrv_query_image_info, and
have it returned as part of the ImageInfo struct? In that case, I would add
*driver-specific to ImageInfo, in addition to the existing *format-specific?

Or should I just do the traversal in img_info (qemu-img.c), avoiding the
change to the ImageInfo struct?

-"Kevin Wolf"  wrote: -
To: "Or Ozeri" 
From: "Kevin Wolf" 
Date: 07/07/2021 10:52AM
Cc: qemu-de...@nongnu.org, qemu-block@nongnu.org, to.my.troc...@gmail.com, dan...@il.ibm.com, berra...@redhat.com, idryo...@gmail.com
Subject: [EXTERNAL] Re: [PATCH v1] block/raw-format: implement .bdrv_get_specific_info handler




Re: [PATCH v1] block/raw-format: implement .bdrv_get_specific_info handler

2021-07-07 Thread Kevin Wolf
Am 07.07.2021 um 07:35 hat Or Ozeri geschrieben:
> When using the raw format, allow exposing specific info by the underlying 
> storage.
> In particular, this will enable RBD images using the raw format to indicate
> a LUKS2 encrypted image in the output of qemu-img info.
> 
> Signed-off-by: Or Ozeri 

This doesn't feel right because it would introduce an inconsistency
(drivers are supposed to return information about their layer, and all
drivers except raw would still do so) and therefore wouldn't even solve
the full problem: For non-raw images, the information isn't any less
useful, but it still wouldn't be available.

I believe the information is already available in QMP, using
query-named-block-nodes, because then you get a separate BlockDeviceInfo
(which contains ImageInfoSpecific) for each node, including the protocol
node.

So maybe what we need to do is change qemu-img to iterate the node chain
(possibly using bdrv_primary_bs()) and print the ImageInfoSpecific for
each layer if it's present, while indicating which part is for which
layer.

So the output could look like this:

...
Driver specific information (qcow2):
compat: 0.10
compression type: zlib
refcount bits: 16
Driver specific information (rbd):
encryption format: luks

Kevin




Re: [PATCH 4/4] hw/nvme: fix controller hot unplugging

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Prior to this patch the nvme-ns devices are always children of the
NvmeBus owned by the NvmeCtrl. This causes the namespaces to be
unrealized when the parent device is removed. However, when subsystems
are involved, this is not what we want since the namespaces may be
attached to other controllers as well.

This patch adds an additional NvmeBus on the subsystem device. When
nvme-ns devices are realized, if the parent controller device is linked
to a subsystem, the parent bus is set to the subsystem one instead. This
makes sure that namespaces are kept alive and not unrealized.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 18 ++
  hw/nvme/ctrl.c   |  8 +---
  hw/nvme/ns.c | 32 +---
  hw/nvme/subsys.c |  4 
  4 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c4065467d877..9401e212f9f7 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -33,12 +33,21 @@ QEMU_BUILD_BUG_ON(NVME_MAX_NAMESPACES > NVME_NSID_BROADCAST 
- 1);
  typedef struct NvmeCtrl NvmeCtrl;
  typedef struct NvmeNamespace NvmeNamespace;
  
+#define TYPE_NVME_BUS "nvme-bus"
+OBJECT_DECLARE_SIMPLE_TYPE(NvmeBus, NVME_BUS)
+
+typedef struct NvmeBus {
+    BusState parent_bus;
+    bool is_subsys;
+} NvmeBus;
+
  #define TYPE_NVME_SUBSYS "nvme-subsys"
  #define NVME_SUBSYS(obj) \
  OBJECT_CHECK(NvmeSubsystem, (obj), TYPE_NVME_SUBSYS)
  
  typedef struct NvmeSubsystem {

  DeviceState parent_obj;
+    NvmeBus bus;
  uint8_t subnqn[256];
  
  NvmeCtrl  *ctrls[NVME_MAX_CONTROLLERS];

@@ -365,13 +374,6 @@ typedef struct NvmeCQueue {
  QTAILQ_HEAD(, NvmeRequest) req_list;
  } NvmeCQueue;
  
-#define TYPE_NVME_BUS "nvme-bus"
-#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
-
-typedef struct NvmeBus {
-    BusState parent_bus;
-} NvmeBus;
-
  #define TYPE_NVME "nvme"
  #define NVME(obj) \
  OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -463,7 +465,7 @@ typedef struct NvmeCtrl {
  
  static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)

  {
-    if (!nsid || nsid > NVME_MAX_NAMESPACES) {
+    if (!n || !nsid || nsid > NVME_MAX_NAMESPACES) {
  return NULL;
  }
  
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c

index 90e3ee2b70ee..7c8fca36d9a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6516,11 +6516,13 @@ static void nvme_exit(PCIDevice *pci_dev)
  
  for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {

  ns = nvme_ns(n, i);
-        if (!ns) {
-            continue;
+        if (ns) {
+            ns->attached--;
         }
+    }
 
-        nvme_ns_cleanup(ns);
+    if (n->subsys) {
+        nvme_subsys_unregister_ctrl(n->subsys, n);
     }
  
  if (n->subsys) {

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3c4f5b8c714a..612a2786d75d 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -444,13 +444,29 @@ void nvme_ns_cleanup(NvmeNamespace *ns)
  static void nvme_ns_realize(DeviceState *dev, Error **errp)
  {
  NvmeNamespace *ns = NVME_NS(dev);
-    BusState *s = qdev_get_parent_bus(dev);
-    NvmeCtrl *n = NVME(s->parent);
-    NvmeSubsystem *subsys = n->subsys;
+    BusState *qbus = qdev_get_parent_bus(dev);
+    NvmeBus *bus = NVME_BUS(qbus);
+    NvmeCtrl *n = NULL;
+    NvmeSubsystem *subsys = NULL;
  uint32_t nsid = ns->params.nsid;
  int i;
  
-    if (!n->subsys) {
+    if (bus->is_subsys) {
+        subsys = NVME_SUBSYS(qbus->parent);
+    } else {
+        n = NVME(qbus->parent);
+        subsys = n->subsys;
+    }
+
+    if (subsys) {
+        /*
+         * If this namespace belongs to a subsystem (through a link on the
+         * controller device), reparent the device.
+         */
+        if (!qdev_set_parent_bus(dev, &subsys->bus.parent_bus, errp)) {
+            return;
+        }
+    } else {
  if (ns->params.detached) {
  error_setg(errp, "detached requires that the nvme device is "
 "linked to an nvme-subsys device");
@@ -470,7 +486,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
  
  if (!nsid) {

  for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
-            if (nvme_ns(n, i) || nvme_subsys_ns(subsys, i)) {
+            if (nvme_subsys_ns(subsys, i) || nvme_ns(n, i)) {
  continue;
  }
  
@@ -483,7 +499,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)

  return;
  }
  } else {
-        if (nvme_ns(n, nsid) || nvme_subsys_ns(subsys, nsid)) {
+        if (nvme_subsys_ns(subsys, nsid) || nvme_ns(n, nsid)) {
  error_setg(errp, "namespace id '%d' already allocated", nsid);
  return;
  }
@@ -509,7 +525,9 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
  }
  }
  
-    nvme_attach_ns(n, ns);
+    if (n) {
+        nvme_attach_ns(n, ns);
+    }
  }
  
  static 

Re: [PATCH 3/4] hw/nvme: unregister controller with subsystem at exit

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

Make sure the controller is unregistered from the subsystem when device
is removed.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h   | 1 +
  hw/nvme/ctrl.c   | 4 
  hw/nvme/subsys.c | 5 +
  3 files changed, 10 insertions(+)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 2/4] hw/nvme: mark nvme-subsys non-hotpluggable

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

We currently lack the infrastructure to handle subsystem hotplugging, so
disable it.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/subsys.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/hw/nvme/subsys.c b/hw/nvme/subsys.c
index 192223d17ca1..dc7a96862f37 100644
--- a/hw/nvme/subsys.c
+++ b/hw/nvme/subsys.c
@@ -61,6 +61,7 @@ static void nvme_subsys_class_init(ObjectClass *oc, void 
*data)
  
  dc->realize = nvme_subsys_realize;

  dc->desc = "Virtual NVMe subsystem";
+    dc->hotpluggable = false;
  
  device_class_set_props(dc, nvme_subsystem_props);

  }


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



Re: [PATCH 1/4] hw/nvme: remove NvmeCtrl parameter from ns setup/check functions

2021-07-07 Thread Hannes Reinecke

On 7/6/21 11:33 AM, Klaus Jensen wrote:

From: Klaus Jensen 

The nvme_ns_setup and nvme_ns_check_constraints should not depend on the
controller state. Refactor and remove it.

Signed-off-by: Klaus Jensen 
---
  hw/nvme/nvme.h |  2 +-
  hw/nvme/ctrl.c |  2 +-
  hw/nvme/ns.c   | 37 ++---
  3 files changed, 20 insertions(+), 21 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke    Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer