date:20210709

Re: [PATCH v2 1/3] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


09.07.2021 18:39, Eric Blake wrote:

Enhance the test to demonstrate existing less-than-stellar behavior of
qemu-img with a qcow2 image containing an inconsistent bitmap: we
don't diagnose the problem until after copying the entire image (a
potentially long time), and when we do diagnose the failure, we still
end up leaving an empty bitmap in the destination.  This mess will be
cleaned up in the next patch.

While at it, rename the test now that we support useful iotest names,
and fix a missing newline in the error message thus exposed.

Signed-off-by: Eric Blake


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir

Re: [RFC PATCH nvme-cli 2/2] nvme-cli/plugins/mi:add support

2021-07-09 Thread Mohit Kapoor

Signed-off-by: Mohit Kapoor

---

diff --git a/Makefile b/Makefile
index 86eb7c6..3ea82dd 100644
--- a/Makefile
+++ b/Makefile
@@ -83,6 +83,13 @@ PLUGIN_OBJS :=	\
 	plugins/wdc/wdc-utils.o			\
 	plugins/huawei/huawei-nvme.o		\
 	plugins/netapp/netapp-nvme.o		\
+	plugins/mi/util/hal/mi-nvme-hal-main.o 		\
+	plugins/mi/util/hal/mi-nvme-qemu/mi-nvme-qemu.o \
+	plugins/mi/util/mi-nvme-util-base.o			\
+	plugins/mi/util/mi-nvme-util-crc.o 			\
+	plugins/mi/util/mi-nvme-util-tool.o 			\
+	plugins/mi/mi-nvme.o			\
+	plugins/mi/mi-nvme-cmd.o			\
 	plugins/toshiba/toshiba-nvme.o		\
 	plugins/micron/micron-nvme.o		\
 	plugins/seagate/seagate-nvme.o 		\
diff --git a/plugins/mi/mi-nvme-cmd.c b/plugins/mi/mi-nvme-cmd.c
new file mode 100644
index 000..3a71170
--- /dev/null
+++ b/plugins/mi/mi-nvme-cmd.c
@@ -0,0 +1,116 @@
+/*
+ * Copyright (C) 2021 Samsung Electronics Inc. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version
+ * 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * ~~
+ * mi-nvme-cmd.c - Implementation of NVMe Management Interface commands in Nvme
+ *
+ * Developer : Mohit Kapoor 
+ */
+
+#include "mi-nvme-cmd.h"
+
+uint32_t gettransmissionunitsize()
+{
+return MCTP_TRANS_UNIT_SIZE_VAL_DEF;
+}
+
+int executecommand(__u8 *cmdobj)
+{
+struct nvme_mi_mctp_message_t message;
+memset(, 0, sizeof(struct nvme_mi_mctp_message_t));
+nvme_mi_admin_cmd_mctp_message adminmessage;
+memset(, 0, sizeof(nvme_mi_admin_cmd_mctp_message));
+bool ret = false;
+uint32_t RequestDataSize = 0;
+struct gencmd_nvmemi *micmd;
+struct gencmd_nvmemi_admin *miadmincmd;
+uint32_t uiMCTP_TUS = gettransmissionunitsize();
+
+ret = initialize(uiMCTP_TUS);
+if (ret == false) {
+return -1;
+}
+
+streply.length = 0;
+streply.errorcode = 0;
+memset(streply.replybuf, 0, sizeof(streply.replybuf));
+
+/*Skipping first element in the structure Size of buffer is 4 bytes each(uint32_t) in the  NVMe_MI_Header structure*/
+memcpy(, cmdobj, sizeof(struct nvme_mi_mctp_message_t) - 8);
+
+/*Check if the incoming command is an MI/MI-Admin Command*/
+if (message.msgPld.nvme_mi_message_header & 0x1000)	{
+miadmincmd = (struct gencmd_nvmemi_admin *)cmdobj;
+/*Skipping first element in the structureSize of buffer is 4 bytes each(uint32_t) in the  NVMe_MI_Header structure*/
+memcpy(, cmdobj, sizeof(nvme_mi_admin_cmd_mctp_message) - 8);
+if (miadmincmd->buf != NULL) {
+adminmessage.msgPld.buffer = miadmincmd->buf;
+}
+ret = execute_nvme_mi_admin_command(adminmessage, &(streply), REPLY_BUFFER_SIZE, miadmincmd->dlen);
+} else if (message.msgPld.nvme_mi_message_header & 0x0800) {
+micmd = (struct gencmd_nvmemi *)cmdobj;
+if (micmd->buf != NULL) {
+message.msgPld.buffer = micmd->buf;
+}
+ret = execute_nvme_mi_command(message, &(streply), REPLY_BUFFER_SIZE, RequestDataSize);
+}
+
+if (!ret) {
+return -1;
+} else {
+return 0;
+}
+}
+
+int getresponsedata(uint8_t *buf, uint32_t size, bool ismicmd)
+{
+uint32_t offset = 0;
+uint32_t length = gettransmissionunitsize() - NVME_MI_HEADER_SIZE;
+uint32_t bytesread = 0;
+
+if (buf == NULL) {
+printf("Error : getresponsedata input buf is NULL\n");
+return -1;
+}
+
+if (ismicmd == true) {
+offset = MCTP_PKT_HEADER_SIZE + NVME_MI_STATUS_SIZE + NVME_MI_HEADER_SIZE;
+length = gettransmissionunitsize() - NVME_MI_HEADER_SIZE - NVME_MI_STATUS_SIZE;
+streply.length = streply.length -  NVME_MI_HEADER_SIZE - NVME_MI_STATUS_SIZE - CRC_SIZE;
+} else {
+offset = MCTP_PKT_HEADER_SIZE + NVME_MI_STATUS_SIZE + NVME_MI_HEADER_SIZE + CQENTRY_SIZE;
+length = gettransmissionunitsize() - NVME_MI_HEADER_SIZE - NVME_MI_STATUS_SIZE - CQENTRY_SIZE;
+streply.length = streply.length - NVME_MI_HEADER_SIZE - NVME_MI_STATUS_SIZE - CRC_SIZE - CQENTRY_SIZE;
+}
+
+if (length > size) {
+length = size;
+}
+
+while (bytesread < streply.length) {
+memcpy(buf + bytesread, streply.replybuf + offset, length);
+offset += length + MCTP_PKT_HEADER_SIZE;
+bytesread += length;
+
+if (bytesread + gettransmissionunitsize() > streply.length) {
+length = streply.length - bytesread;
+} else {
+length = gettransmissionunitsize();
+}
+}
+return 0;
+}
+
+void getresponsemessage(uint8_t *buf, uint32_t size)
+{
+

Re: [PATCH] block: Add option to use driver whitelist even in tools

2021-07-09 Thread Eric Blake

On Fri, Jul 09, 2021 at 06:41:41PM +0200, Kevin Wolf wrote:
> Currently, the block driver whitelists are only applied for the system
> emulator. All other binaries still give unrestricted access to all block
> drivers. There are use cases where this made sense because the main
> concern was avoiding customers running VMs on less optimised block
> drivers and getting bad performance. Allowing the same image format e.g.
> as a target for 'qemu-img convert' is not a problem then.
> 
> However, if the concern is the supportability of the driver in general,
> either in full or when used read-write, not applying the list driver
> whitelist in tools doesn't help - especially since qemu-nbd and
> qemu-storage-daemon now give access to more or less the same operations
> in block drivers as running a system emulator.
> 
> In order to address this, introduce a new configure option that enforces
> the driver whitelist in all binaries.

Is it feasible that someone would want two separate lists: one for
qemu (which runs run efficiently) and another for tools (which ones do
we support at all)?  As written, your patch offers no chance to
distinguish between the two.

Also, is now a good time to join the bandwagon on picking a more
descriptive name (such as 'allow-list') for this terminology?

> +++ b/block.c
> @@ -6162,6 +6162,9 @@ BlockDriverState 
> *bdrv_find_backing_image(BlockDriverState *bs,
>  
>  void bdrv_init(void)
>  {
> +#ifdef CONFIG_BDRV_WHITELIST_TOOLS
> +use_bdrv_whitelist = 1;

Pre-existing, but this variable seems like a natural candidate to be
bool instead of int.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[PATCH] block: Add option to use driver whitelist even in tools

2021-07-09 Thread Kevin Wolf

Currently, the block driver whitelists are only applied for the system
emulator. All other binaries still give unrestricted access to all block
drivers. There are use cases where this made sense because the main
concern was avoiding customers running VMs on less optimised block
drivers and getting bad performance. Allowing the same image format e.g.
as a target for 'qemu-img convert' is not a problem then.

However, if the concern is the supportability of the driver in general,
either in full or when used read-write, not applying the list driver
whitelist in tools doesn't help - especially since qemu-nbd and
qemu-storage-daemon now give access to more or less the same operations
in block drivers as running a system emulator.

In order to address this, introduce a new configure option that enforces
the driver whitelist in all binaries.

Signed-off-by: Kevin Wolf 
---
 configure   | 14 --
 block.c |  3 +++
 meson.build |  1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/configure b/configure
index 650d9c0735..d3f2a362d5 100755
--- a/configure
+++ b/configure
@@ -243,6 +243,7 @@ cross_prefix=""
 audio_drv_list=""
 block_drv_rw_whitelist=""
 block_drv_ro_whitelist=""
+block_drv_whitelist_tools="no"
 host_cc="cc"
 audio_win_int=""
 libs_qga=""
@@ -1003,6 +1004,10 @@ for opt do
   ;;
   --block-drv-ro-whitelist=*) block_drv_ro_whitelist=$(echo "$optarg" | sed -e 
's/,/ /g')
   ;;
+  --enable-block-drv-whitelist-in-tools) block_drv_whitelist_tools="yes"
+  ;;
+  --disable-block-drv-whitelist-in-tools) block_drv_whitelist_tools="no"
+  ;;
   --enable-debug-tcg) debug_tcg="yes"
   ;;
   --disable-debug-tcg) debug_tcg="no"
@@ -1776,10 +1781,12 @@ Advanced options (experts only):
   --block-drv-whitelist=L  Same as --block-drv-rw-whitelist=L
   --block-drv-rw-whitelist=L
set block driver read-write whitelist
-   (affects only QEMU, not qemu-img)
+   (by default affects only QEMU, not tools like 
qemu-img)
   --block-drv-ro-whitelist=L
set block driver read-only whitelist
-   (affects only QEMU, not qemu-img)
+   (by default affects only QEMU, not tools like 
qemu-img)
+  --enable-block-drv-whitelist-in-tools
+   use block whitelist also in tools instead of only 
QEMU
   --enable-trace-backends=B Set trace backend
Available backends: $trace_backend_list
   --with-trace-file=NAME   Full PATH,NAME of file to store traces
@@ -4556,6 +4563,9 @@ if test "$audio_win_int" = "yes" ; then
 fi
 echo "CONFIG_BDRV_RW_WHITELIST=$block_drv_rw_whitelist" >> $config_host_mak
 echo "CONFIG_BDRV_RO_WHITELIST=$block_drv_ro_whitelist" >> $config_host_mak
+if test "$block_drv_whitelist_tools" = "yes" ; then
+  echo "CONFIG_BDRV_WHITELIST_TOOLS=y" >> $config_host_mak
+fi
 if test "$xfs" = "yes" ; then
   echo "CONFIG_XFS=y" >> $config_host_mak
 fi
diff --git a/block.c b/block.c
index be083f389e..e97ce0b1c8 100644
--- a/block.c
+++ b/block.c
@@ -6162,6 +6162,9 @@ BlockDriverState 
*bdrv_find_backing_image(BlockDriverState *bs,
 
 void bdrv_init(void)
 {
+#ifdef CONFIG_BDRV_WHITELIST_TOOLS
+use_bdrv_whitelist = 1;
+#endif
 module_call_init(MODULE_INIT_BLOCK);
 }
 
diff --git a/meson.build b/meson.build
index eb362ee5eb..9b41b9e6d5 100644
--- a/meson.build
+++ b/meson.build
@@ -2859,6 +2859,7 @@ summary_info += {'coroutine pool':
config_host['CONFIG_COROUTINE_POOL'] == '1
 if have_block
   summary_info += {'Block whitelist (rw)': 
config_host['CONFIG_BDRV_RW_WHITELIST']}
   summary_info += {'Block whitelist (ro)': 
config_host['CONFIG_BDRV_RO_WHITELIST']}
+  summary_info += {'Use block whitelist in tools': 
config_host.has_key('CONFIG_BDRV_WHITELIST_TOOLS')}
   summary_info += {'VirtFS support':have_virtfs}
   summary_info += {'build virtiofs daemon': have_virtiofsd}
   summary_info += {'Live block migration': 
config_host.has_key('CONFIG_LIVE_BLOCK_MIGRATION')}
-- 
2.31.1

Re: [RFC PATCH nvme-cli 2/2] nvme-cli/plugins/mi:add support

2021-07-09 Thread Keith Busch

> +int hal_init()
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_init();
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_open()
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_open();
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_close()
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_close();
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_i2c_write(uint8_t *data_out, uint16_t num_bytes)
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_write(data_out, num_bytes);
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_i2c_read(uint8_t *data_in, uint16_t num_bytes)
> +{
> +uint32_t retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_read(data_in, num_bytes);
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}

I'm really not a fan of having non-standard interfaces. If you were
going to do that, though, you should create a struct of function
pointers so that you don't need these repetitive "switch (...)"
statements.

But if we're going to have OOB MI support in toolign, they should all
use the same standard defined interface.

Re: [RFC PATCH 1/2] hw/nvme: add mi device

2021-07-09 Thread Keith Busch

On Fri, Jul 09, 2021 at 07:25:45PM +0530, Padmakar Kalghatgi wrote:
> The following commands are tested with nvme-cli by hooking
> to the cid of the vsock as shown above and use the socket
> send/recieve commands to issue the commands and get the response.

Why sockets? Shouldn't mi targets use smbus for that?

[PATCH v2 1/3] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-09 Thread Eric Blake

Enhance the test to demonstrate existing less-than-stellar behavior of
qemu-img with a qcow2 image containing an inconsistent bitmap: we
don't diagnose the problem until after copying the entire image (a
potentially long time), and when we do diagnose the failure, we still
end up leaving an empty bitmap in the destination.  This mess will be
cleaned up in the next patch.

While at it, rename the test now that we support useful iotest names,
and fix a missing newline in the error message thus exposed.

Signed-off-by: Eric Blake 
---
 block/dirty-bitmap.c  |  2 +-
 .../{291 => tests/qemu-img-bitmaps}   | 19 +++-
 .../{291.out => tests/qemu-img-bitmaps.out}   | 48 ++-
 3 files changed, 66 insertions(+), 3 deletions(-)
 rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (88%)
 rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (75%)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 68d295d6e3ed..0ef46163e3ea 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -193,7 +193,7 @@ int bdrv_dirty_bitmap_check(const BdrvDirtyBitmap *bitmap, 
uint32_t flags,
 error_setg(errp, "Bitmap '%s' is inconsistent and cannot be used",
bitmap->name);
 error_append_hint(errp, "Try block-dirty-bitmap-remove to delete"
-  " this bitmap from disk");
+  " this bitmap from disk\n");
 return -1;
 }

diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/tests/qemu-img-bitmaps
similarity index 88%
rename from tests/qemu-iotests/291
rename to tests/qemu-iotests/tests/qemu-img-bitmaps
index 20efb080a6c0..2f51651d0ce5 100755
--- a/tests/qemu-iotests/291
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -3,7 +3,7 @@
 #
 # Test qemu-img bitmap handling
 #
-# Copyright (C) 2018-2020 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -27,11 +27,13 @@ status=1 # failure is the default!
 _cleanup()
 {
 _cleanup_test_img
+_rm_test_img "$TEST_IMG.copy"
 nbd_server_stop
 }
 trap "_cleanup; exit \$status" 0 1 2 3 15

 # get standard environment, filters and checks
+cd ..
 . ./common.rc
 . ./common.filter
 . ./common.nbd
@@ -129,6 +131,21 @@ $QEMU_IMG map --output=json --image-opts \

 nbd_server_stop

+echo
+echo "=== Check handling of inconsistent bitmap ==="
+echo
+
+$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
+$QEMU_IMG bitmap --add "$TEST_IMG" b4
+$QEMU_IMG bitmap --remove "$TEST_IMG" b1
+_img_info --format-specific | _filter_irrelevant_img_info
+$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy" &&
+echo "unexpected success"
+# Bug - even though we failed at conversion, we left a file around with
+# a bitmap marked as not corrupt
+TEST_IMG=$TEST_IMG.copy _img_info --format-specific \
+| _filter_irrelevant_img_info
+
 # success, all done
 echo '*** done'
 rm -f $seq.full
diff --git a/tests/qemu-iotests/291.out 
b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
similarity index 75%
rename from tests/qemu-iotests/291.out
rename to tests/qemu-iotests/tests/qemu-img-bitmaps.out
index 23411c0ff4d9..b762362075d1 100644
--- a/tests/qemu-iotests/291.out
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
@@ -1,4 +1,4 @@
-QA output created by 291
+QA output created by qemu-img-bitmaps

 === Initial image setup ===

@@ -115,4 +115,50 @@ Format specific information:
 [{ "start": 0, "length": 2097152, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
 { "start": 2097152, "length": 1048576, "depth": 0, "zero": false, "data": 
false},
 { "start": 3145728, "length": 7340032, "depth": 0, "zero": false, "data": 
true, "offset": OFFSET}]
+
+=== Check handling of inconsistent bitmap ===
+
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 10 MiB (10485760 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+bitmaps:
+[0]:
+flags:
+[0]: in-use
+[1]: auto
+name: b2
+granularity: 65536
+[1]:
+flags:
+[0]: in-use
+name: b0
+granularity: 65536
+[2]:
+flags:
+[0]: auto
+name: b4
+granularity: 65536
+corrupt: false
+qemu-img: Failed to populate bitmap b0: Bitmap 'b0' is inconsistent and cannot 
be used
+Try block-dirty-bitmap-remove to delete this bitmap from disk
+image: TEST_DIR/t.IMGFMT.copy
+file format: IMGFMT
+virtual size: 10 MiB (10485760 bytes)
+cluster_size: 65536
+Format specific information:
+bitmaps:
+[0]:
+flags:
+name: b0
+granularity: 65536
+[1]:
+flags:
+[0]: auto
+name: b4
+

[PATCH v2 2/3] qemu-img: Fail fast on convert --bitmaps with inconsistent bitmap

2021-07-09 Thread Eric Blake

Waiting until the end of the convert operation (a potentially
time-consuming task) to finally detect that we can't copy a bitmap is
bad, comparing to failing fast up front.  Furthermore, this prevents
us from leaving a file behind with a bitmap that is not marked as
inconsistent even though it does not have sane contents.

This fixes the problems exposed in the previous patch to the iotest.

Signed-off-by: Eric Blake 
---
 qemu-img.c| 30 +--
 tests/qemu-iotests/tests/qemu-img-bitmaps |  2 --
 tests/qemu-iotests/tests/qemu-img-bitmaps.out | 20 ++---
 3 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 7956a8996512..e84b3c530155 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2101,6 +2101,30 @@ static int convert_do_copy(ImgConvertState *s)
 return s->ret;
 }

+/* Check that bitmaps can be copied, or output an error */
+static int convert_check_bitmaps(BlockDriverState *src)
+{
+BdrvDirtyBitmap *bm;
+
+if (!bdrv_supports_persistent_dirty_bitmap(src)) {
+error_report("Source lacks bitmap support");
+return -1;
+}
+FOR_EACH_DIRTY_BITMAP(src, bm) {
+const char *name;
+
+if (!bdrv_dirty_bitmap_get_persistence(bm)) {
+continue;
+}
+name = bdrv_dirty_bitmap_name(bm);
+if (bdrv_dirty_bitmap_inconsistent(bm)) {
+error_report("Cannot copy inconsistent bitmap '%s'", name);
+return -1;
+}
+}
+return 0;
+}
+
 static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst)
 {
 BdrvDirtyBitmap *bm;
@@ -2127,6 +2151,7 @@ static int convert_copy_bitmaps(BlockDriverState *src, 
BlockDriverState *dst)
   );
 if (err) {
 error_reportf_err(err, "Failed to populate bitmap %s: ", name);
+qmp_block_dirty_bitmap_remove(dst->node_name, name, NULL);
 return -1;
 }
 }
@@ -2552,9 +2577,8 @@ static int img_convert(int argc, char **argv)
 ret = -1;
 goto out;
 }
-if (!bdrv_supports_persistent_dirty_bitmap(blk_bs(s.src[0]))) {
-error_report("Source lacks bitmap support");
-ret = -1;
+ret = convert_check_bitmaps(blk_bs(s.src[0]));
+if (ret < 0) {
 goto out;
 }
 }
diff --git a/tests/qemu-iotests/tests/qemu-img-bitmaps 
b/tests/qemu-iotests/tests/qemu-img-bitmaps
index 2f51651d0ce5..3fde95907515 100755
--- a/tests/qemu-iotests/tests/qemu-img-bitmaps
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -141,8 +141,6 @@ $QEMU_IMG bitmap --remove "$TEST_IMG" b1
 _img_info --format-specific | _filter_irrelevant_img_info
 $QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy" &&
 echo "unexpected success"
-# Bug - even though we failed at conversion, we left a file around with
-# a bitmap marked as not corrupt
 TEST_IMG=$TEST_IMG.copy _img_info --format-specific \
 | _filter_irrelevant_img_info

diff --git a/tests/qemu-iotests/tests/qemu-img-bitmaps.out 
b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
index b762362075d1..546aaa404bba 100644
--- a/tests/qemu-iotests/tests/qemu-img-bitmaps.out
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
@@ -143,22 +143,6 @@ Format specific information:
 name: b4
 granularity: 65536
 corrupt: false
-qemu-img: Failed to populate bitmap b0: Bitmap 'b0' is inconsistent and cannot 
be used
-Try block-dirty-bitmap-remove to delete this bitmap from disk
-image: TEST_DIR/t.IMGFMT.copy
-file format: IMGFMT
-virtual size: 10 MiB (10485760 bytes)
-cluster_size: 65536
-Format specific information:
-bitmaps:
-[0]:
-flags:
-name: b0
-granularity: 65536
-[1]:
-flags:
-[0]: auto
-name: b4
-granularity: 65536
-corrupt: false
+qemu-img: Cannot copy inconsistent bitmap 'b0'
+qemu-img: Could not open 'TEST_DIR/t.IMGFMT.copy': Could not open 
'TEST_DIR/t.IMGFMT.copy': No such file or directory
 *** done
-- 
2.31.1

[PATCH v2 3/3] qemu-img: Add --skip-broken-bitmaps for 'convert --bitmaps'

2021-07-09 Thread Eric Blake

The point of 'qemu-img convert --bitmaps' is to be a convenience for
actions that are already possible through a string of smaller
'qemu-img bitmap' sub-commands.  One situation not accounted for
already is that if a source image contains an inconsistent bitmap (for
example, because a qemu process died abruptly before flushing bitmap
state), the user MUST delete those inconsistent bitmaps before
anything else useful can be done with the image.

We don't want to delete inconsistent bitmaps by default: although a
corrupt bitmap is only a loss of optimization rather than a corruption
of user-visible data, it is still nice to require the user to opt in
to the fact that they are aware of the loss of the bitmap.  Still,
requiring the user to check 'qemu-img info' to see whether bitmaps are
consistent, then use 'qemu-img bitmap --remove' to remove offenders,
all before using 'qemu-img convert', is a lot more work than just
adding a knob 'qemu-img convert --bitmaps --skip-broken-bitmaps' which
opts in to skipping the broken bitmaps.

After testing the new option, also demonstrate the way to manually fix
things (either deleting bad bitmaps, or re-creating them as empty) so
that it is possible to convert without the option.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1946084
Signed-off-by: Eric Blake 
---
 docs/tools/qemu-img.rst   |  8 -
 qemu-img.c| 26 +---
 tests/qemu-iotests/tests/qemu-img-bitmaps | 10 ++
 tests/qemu-iotests/tests/qemu-img-bitmaps.out | 31 +++
 4 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index cfe11478791f..4d407b180450 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -414,7 +414,7 @@ Command description:
   4
 Error on reading data

-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l 
SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] 
FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps [--skip-broken-bitmaps]] [-U] [-C] [-c] [-p] [-q] 
[-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o 
OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m 
NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME

   Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
   to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
@@ -456,6 +456,12 @@ Command description:
   *NUM_COROUTINES* specifies how many coroutines work in parallel during
   the convert process (defaults to 8).

+  Use of ``--bitmaps`` requests that any persistent bitmaps present in
+  the original are also copied to the destination.  If any bitmap is
+  inconsistent in the source, the conversion will fail unless
+  ``--skip-broken-bitmaps`` is also specified to copy only the
+  consistent bitmaps.
+
 .. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F 
BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE]

   Create the new disk image *FILENAME* of size *SIZE* and format
diff --git a/qemu-img.c b/qemu-img.c
index e84b3c530155..661538edd785 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -82,6 +82,7 @@ enum {
 OPTION_MERGE = 274,
 OPTION_BITMAPS = 275,
 OPTION_FORCE = 276,
+OPTION_SKIP_BROKEN = 277,
 };

 typedef enum OutputFormat {
@@ -2102,7 +2103,7 @@ static int convert_do_copy(ImgConvertState *s)
 }

 /* Check that bitmaps can be copied, or output an error */
-static int convert_check_bitmaps(BlockDriverState *src)
+static int convert_check_bitmaps(BlockDriverState *src, bool skip_broken)
 {
 BdrvDirtyBitmap *bm;

@@ -2117,7 +2118,7 @@ static int convert_check_bitmaps(BlockDriverState *src)
 continue;
 }
 name = bdrv_dirty_bitmap_name(bm);
-if (bdrv_dirty_bitmap_inconsistent(bm)) {
+if (!skip_broken && bdrv_dirty_bitmap_inconsistent(bm)) {
 error_report("Cannot copy inconsistent bitmap '%s'", name);
 return -1;
 }
@@ -2125,7 +2126,8 @@ static int convert_check_bitmaps(BlockDriverState *src)
 return 0;
 }

-static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst)
+static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst,
+bool skip_broken)
 {
 BdrvDirtyBitmap *bm;
 Error *err = NULL;
@@ -2137,6 +2139,10 @@ static int convert_copy_bitmaps(BlockDriverState *src, 
BlockDriverState *dst)
 continue;
 }
 name = bdrv_dirty_bitmap_name(bm);
+if (skip_broken && bdrv_dirty_bitmap_inconsistent(bm)) {
+warn_report("Skipping inconsistent bitmap %s", name);

[PATCH v2 0/3] Let 'qemu-img convert --bitmaps' skip inconsistent bitmaps

2021-07-09 Thread Eric Blake

In v2:
- patch 1: clean up leftover file [Vladimir]
- patch 2: new, to fix badness with leaving corrupt destination on
  failed bitmap copy [Vladimir]
- patch 3: spell new option --skip-broken-bitmaps (note that --skip-broken
  as an abbreviation still works) [Vladimir]

Eric Blake (3):
  iotests: Improve and rename test 291 to qemu-img-bitmap
  qemu-img: Fail fast on convert --bitmaps with inconsistent bitmap
  qemu-img: Add --skip-broken-bitmaps for 'convert --bitmaps'

 docs/tools/qemu-img.rst   |  8 ++-
 block/dirty-bitmap.c  |  2 +-
 qemu-img.c| 50 +--
 .../{291 => tests/qemu-img-bitmaps}   | 27 +++-
 .../{291.out => tests/qemu-img-bitmaps.out}   | 63 ++-
 5 files changed, 141 insertions(+), 9 deletions(-)
 rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (82%)
 rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (69%)

-- 
2.31.1

Re: [PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-09 Thread Eric Blake

On Fri, Jul 09, 2021 at 04:49:01PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 09.07.2021 16:16, Eric Blake wrote:
> > On Fri, Jul 09, 2021 at 09:33:50AM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > > +++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
> > 
> > > > +echo
> > > > +echo "=== Check handling of inconsistent bitmap ==="
> > > > +echo
> > > > +
> > > > +$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
> > > > +$QEMU_IMG bitmap --add "$TEST_IMG" b4
> > > > +$QEMU_IMG bitmap --remove "$TEST_IMG" b1
> > > > +_img_info --format-specific | _filter_irrelevant_img_info
> > > > +$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"
> > > 
> > > Worth then removing remaining inconsistent bitmaps and try again?
> > > 
> > > I think you should now remove $TEST_IMG.copy in _cleanup
> > 
> > $TEST_IMG.copy isn't created on failure (or if it is, that in itself
> > is a problem we should be avoiding),
> 
> Seems that's the case:

Oh. Thanks for checking.  Yeah, re-reading the code, we don't attempt
to copy the bitmaps until after the rest of the convert operation is
done.  Which means a fix to pre-validate bitmaps before starting the
convert requires more code, but may be the right thing (failing at the
very end of a long convert attempt, but leaving the file around, is
not ideal compared to failing fast).

Looks like my v2 will be a bit more beefy.
> ./build/qemu-img convert --bitmaps -O qcow2 x y
> qemu-img: Failed to populate bitmap b1: Bitmap 'b1' is inconsistent and 
> cannot be used
> Try block-dirty-bitmap-remove to delete this bitmap from disk[root@kvm 
> master]#
> # ls y
> y
> ./build/qemu-img info y
> image: y
> file format: qcow2
> virtual size: 1 MiB (1048576 bytes)
> disk size: 204 KiB
> cluster_size: 65536
> Format specific information:
> compat: 1.1
> compression type: zlib
> lazy refcounts: false
> bitmaps:
> [0]:
> flags:
> name: b1
> granularity: 65536
> refcount bits: 16
> corrupt: false
> extended l2: false
> 
> 
> WOW! It even contains the bitmap not marked in-use. That's a real bug.

So we left things dangling with a bitmap that has completely different
contents than in the original file.  Yeah, that looks like an issue
worth improving.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[PULL 2/4] qemu-img: Make unallocated part of backing chain obvious in map

2021-07-09 Thread Eric Blake

The recently-added NBD context qemu:allocation-depth is able to
distinguish between locally-present data (even when that data is
sparse) [shown as depth 1 over NBD], and data that could not be found
anywhere in the backing chain [shown as depth 0]; and the libnbd
project was recently patched to give the human-readable name "absent"
to an allocation-depth of 0.  But qemu-img map --output=json predates
that addition, and has the unfortunate behavior that all portions of
the backing chain that resolve without finding a hit in any backing
layer report the same depth as the final backing layer.  This makes it
harder to reconstruct a qcow2 backing chain using just 'qemu-img map'
output, especially when using "backing":null to artificially limit a
backing chain, because it is impossible to distinguish between a
QCOW2_CLUSTER_UNALLOCATED (which defers to a [missing] backing file)
and a QCOW2_CLUSTER_ZERO_PLAIN cluster (which would override any
backing file), since both types of clusters otherwise show as
"data":false,"zero":true" (but note that we can distinguish a
QCOW2_CLUSTER_ZERO_ALLOCATED, which would also have an "offset":
listing).

The task of reconstructing a qcow2 chain was made harder in commit
0da9856851 (nbd: server: Report holes for raw images), because prior
to that point, it was possible to abuse NBD's block status command to
see which portions of a qcow2 file resulted in BDRV_BLOCK_ALLOCATED
(showing up as NBD_STATE_ZERO in isolation) vs. missing from the chain
(showing up as NBD_STATE_ZERO|NBD_STATE_HOLE); but now qemu reports
more accurate sparseness information over NBD.

An obvious solution is to make 'qemu-img map --output=json' add an
additional "present":false designation to any cluster lacking an
allocation anywhere in the chain, without any change to the "depth"
parameter to avoid breaking existing clients.  The iotests have
several examples where this distinction demonstrates the additional
accuracy.

Signed-off-by: Eric Blake 
Message-Id: <20210701190655.2131223-3-ebl...@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 docs/tools/qemu-img.rst   |   3 +
 qapi/block-core.json  |   7 +-
 qemu-img.c|   7 +-
 tests/qemu-iotests/122.out|  84 
 tests/qemu-iotests/154.out| 190 +-
 tests/qemu-iotests/179.out| 133 
 tests/qemu-iotests/209.out|   4 +-
 tests/qemu-iotests/223.out|  56 +++---
 tests/qemu-iotests/244.out|  23 ++-
 tests/qemu-iotests/252.out|  10 +-
 tests/qemu-iotests/274.out|  48 ++---
 tests/qemu-iotests/291.out|  24 +--
 .../tests/nbd-qemu-allocation.out |  16 +-
 13 files changed, 332 insertions(+), 273 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index cfe11478791f..d6300f7ee03d 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -597,6 +597,9 @@ Command description:
 if false, the sectors are either unallocated or stored as optimized
 all-zero clusters);
   - whether the data is known to read as zero (boolean field ``zero``);
+  - whether the data is actually present (boolean field ``present``);
+if false, rebasing the backing chain onto a deeper file would pick
+up data from the deeper file;
   - in order to make the output shorter, the target file is expressed as
 a ``depth``; for example, a depth of 2 refers to the backing file
 of the backing file of *FILENAME*.
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 3114ba69bb46..ced846f48d6b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -261,6 +261,9 @@
 # images in the chain)) before reaching one for which the
 # range is allocated
 #
+# @present: true if this layer provides the data, false if adding a backing
+#   layer could impact this region (since 6.1)
+#
 # @offset: if present, the image file stores the data for this range
 #  in raw format at the given (host) offset
 #
@@ -271,8 +274,8 @@
 ##
 { 'struct': 'MapEntry',
   'data': {'start': 'int', 'length': 'int', 'data': 'bool',
-   'zero': 'bool', 'depth': 'int', '*offset': 'int',
-   '*filename': 'str' } }
+   'zero': 'bool', 'depth': 'int', 'present': 'bool',
+   '*offset': 'int', '*filename': 'str' } }

 ##
 # @BlockdevCacheInfo:
diff --git a/qemu-img.c b/qemu-img.c
index 7956a8996512..68a4d298098f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2980,8 +2980,9 @@ static int dump_map_entry(OutputFormat output_format, 
MapEntry *e,
 break;
 case OFORMAT_JSON:
 printf("{ \"start\": %"PRId64", \"length\": %"PRId64","
-   " \"depth\": %"PRId64", \"zero\": %s, \"data\": %s",
-   e->start, e->length, e->depth,
+   " \"depth\": %"PRId64",

[PULL 3/4] qemu-img: Reword 'qemu-img map --output=json' docs

2021-07-09 Thread Eric Blake

Reword the paragraphs to list the JSON key first, rather than in the
middle of prose.

Suggested-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Eric Blake 
Message-Id: <20210707184125.2551140-1-ebl...@redhat.com>
Reviewed-by: Nir Soffer 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 docs/tools/qemu-img.rst | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index d6300f7ee03d..1d8470eada0e 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -593,16 +593,16 @@ Command description:
   the ``start``, ``length``, ``offset`` fields;
   it will also include other more specific information:

-  - whether the sectors contain actual data or not (boolean field ``data``;
-if false, the sectors are either unallocated or stored as optimized
-all-zero clusters);
-  - whether the data is known to read as zero (boolean field ``zero``);
-  - whether the data is actually present (boolean field ``present``);
-if false, rebasing the backing chain onto a deeper file would pick
-up data from the deeper file;
-  - in order to make the output shorter, the target file is expressed as
-a ``depth``; for example, a depth of 2 refers to the backing file
-of the backing file of *FILENAME*.
+  - boolean field ``data``: true if the sectors contain actual data,
+false if the sectors are either unallocated or stored as optimized
+all-zero clusters
+  - boolean field ``zero``: true if the data is known to read as zero
+  - boolean field ``present``: true if the data belongs to the backing
+chain, false if rebasing the backing chain onto a deeper file
+would pick up data from the deeper file;
+  - integer field ``depth``: the depth within the backing chain at
+which the data was resolved; for example, a depth of 2 refers to
+the backing file of the backing file of *FILENAME*.

   In JSON format, the ``offset`` field is optional; it is absent in
   cases where ``human`` format would omit the entry or exit with an error.
-- 
2.31.1

[PULL 4/4] nbd: register yank function earlier

2021-07-09 Thread Eric Blake

From: Lukas Straub 

Although unlikely, qemu might hang in nbd_send_request().

Allow recovery in this case by registering the yank function before
calling it.

Signed-off-by: Lukas Straub 
Message-Id: <20210704000730.1befb...@gecko.fritz.box>
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Eric Blake 
---
 block/nbd.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 601fccc5bacd..f6ff1c4fb472 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -371,6 +371,9 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
 return -ECONNREFUSED;
 }

+yank_register_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name), nbd_yank,
+   bs);
+
 ret = nbd_handle_updated_info(s->bs, NULL);
 if (ret < 0) {
 /*
@@ -381,6 +384,8 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,

 nbd_send_request(s->ioc, );

+yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
+ nbd_yank, bs);
 object_unref(OBJECT(s->ioc));
 s->ioc = NULL;

@@ -390,9 +395,6 @@ int coroutine_fn 
nbd_co_do_establish_connection(BlockDriverState *bs,
 qio_channel_set_blocking(s->ioc, false, NULL);
 qio_channel_attach_aio_context(s->ioc, bdrv_get_aio_context(bs));

-yank_register_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name), nbd_yank,
-   bs);
-
 /* successfully connected */
 s->state = NBD_CLIENT_CONNECTED;
 qemu_co_queue_restart_all(>free_sema);
-- 
2.31.1

[PULL 1/4] iotests: Improve and rename test 309 to nbd-qemu-allocation

2021-07-09 Thread Eric Blake

Enhance the test to inspect what qemu-nbd is advertising during
handshake, and rename it now that we support useful iotest names.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210701190655.2131223-2-ebl...@redhat.com>
---
 .../qemu-iotests/{309 => tests/nbd-qemu-allocation}  |  5 -
 .../{309.out => tests/nbd-qemu-allocation.out}   | 12 +++-
 2 files changed, 15 insertions(+), 2 deletions(-)
 rename tests/qemu-iotests/{309 => tests/nbd-qemu-allocation} (95%)
 rename tests/qemu-iotests/{309.out => tests/nbd-qemu-allocation.out} (81%)

diff --git a/tests/qemu-iotests/309 
b/tests/qemu-iotests/tests/nbd-qemu-allocation
similarity index 95%
rename from tests/qemu-iotests/309
rename to tests/qemu-iotests/tests/nbd-qemu-allocation
index b90b279994c9..4ee73db8033b 100755
--- a/tests/qemu-iotests/309
+++ b/tests/qemu-iotests/tests/nbd-qemu-allocation
@@ -3,7 +3,7 @@
 #
 # Test qemu-nbd -A
 #
-# Copyright (C) 2018-2020 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
 #
 # This program is free software; you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
@@ -32,6 +32,7 @@ _cleanup()
 trap "_cleanup; exit \$status" 0 1 2 3 15

 # get standard environment, filters and checks
+cd ..
 . ./common.rc
 . ./common.filter
 . ./common.nbd
@@ -57,6 +58,8 @@ echo
 $QEMU_IMG map --output=json -f qcow2 "$TEST_IMG"
 IMG="driver=nbd,server.type=unix,server.path=$nbd_unix_socket"
 nbd_server_start_unix_socket -r -f qcow2 -A "$TEST_IMG"
+# Inspect what the server is exposing
+$QEMU_NBD --list -k $nbd_unix_socket
 # Normal -f raw NBD block status loses access to allocation information
 $QEMU_IMG map --output=json --image-opts \
 "$IMG" | _filter_qemu_img_map
diff --git a/tests/qemu-iotests/309.out 
b/tests/qemu-iotests/tests/nbd-qemu-allocation.out
similarity index 81%
rename from tests/qemu-iotests/309.out
rename to tests/qemu-iotests/tests/nbd-qemu-allocation.out
index db75bb6b0df9..c51022b2a38d 100644
--- a/tests/qemu-iotests/309.out
+++ b/tests/qemu-iotests/tests/nbd-qemu-allocation.out
@@ -1,4 +1,4 @@
-QA output created by 309
+QA output created by nbd-qemu-allocation

 === Initial image setup ===

@@ -14,6 +14,16 @@ wrote 2097152/2097152 bytes at offset 1048576
 [{ "start": 0, "length": 1048576, "depth": 1, "zero": false, "data": true, 
"offset": 327680},
 { "start": 1048576, "length": 2097152, "depth": 0, "zero": false, "data": 
true, "offset": 327680},
 { "start": 3145728, "length": 1048576, "depth": 1, "zero": true, "data": 
false}]
+exports available: 1
+ export: ''
+  size:  4194304
+  flags: 0x58f ( readonly flush fua df multi cache )
+  min block: 1
+  opt block: 4096
+  max block: 33554432
+  available meta contexts: 2
+   base:allocation
+   qemu:allocation-depth
 [{ "start": 0, "length": 3145728, "depth": 0, "zero": false, "data": true, 
"offset": OFFSET},
 { "start": 3145728, "length": 1048576, "depth": 0, "zero": true, "data": 
false, "offset": OFFSET}]
 [{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": true, 
"offset": OFFSET},
-- 
2.31.1

[RFC PATCH 1/2] hw/nvme: add mi device

2021-07-09 Thread Padmakar Kalghatgi


The enclosed patch contains the implementation of certain
commands of nvme-mi specification.The MI commands are useful
to manage/configure/monitor the device.Eventhough the MI commands
can be sent via the inband NVMe-MI send/recieve commands, the idea 
here is to emulate the sideband interface for MI.


Since the nvme-mi specification deals in communicating
to the nvme subsystem via. a sideband interface, in this
qemu implementation, virtual-vsock is used for making the
sideband communication, the guest VM needs to make the
connection to the specific cid of the vsock of the qemu host.

One needs to specify the following command in the launch to
specify the nvme-mi device, cid and to setup the vsock:
-device nvme-mi,bus=
-device vhost-vsock-pci, guest-cid=

The following commands are tested with nvme-cli by hooking
to the cid of the vsock as shown above and use the socket
send/recieve commands to issue the commands and get the response.

we are planning to push the changes for nvme-cli as well to test the
MI functionality.

As the connection can be established by the guest VM at any point,
we have created a thread which is looking for a connection request.
Please suggest if there is a native/better way to handle this.

This module makes use of the NvmeCtrl structure of the nvme module,
to fetch relevant information of the nvme device which are used in
some of the mi commands. Eventhough certain commands might require
modification to the nvme module, currently we have currently refrained
from making changes to the nvme module.

Since this is our first foray into implementing a new module in qemu,
we will be happy to receive comments, suggestions to improve the current
implementation.

cmd-name  cmd-type
*
read nvme-mi dsnvme-mi
configuration set  nvme-mi
configuration get  nvme-mi
vpd read   nvme-mi
vpd write  nvme-mi
controller Health Staus Poll   nvme-mi
nvme subsystem health status poll  nvme-mi
identify   nvme-admin
get log page   nvme-admin
get features   nvme-admin

Signed-off-by: Padmakar Kalghatgi

---
hw/nvme/meson.build |   2 +-
hw/nvme/nvme-mi.c   | 676 
hw/nvme/nvme-mi.h   | 288 ++
3 files changed, 965 insertions(+), 1 deletion(-)
create mode 100644 hw/nvme/nvme-mi.c
create mode 100644 hw/nvme/nvme-mi.h

diff --git a/hw/nvme/meson.build b/hw/nvme/meson.build
index 3cf4004..8dac50e 100644
--- a/hw/nvme/meson.build
+++ b/hw/nvme/meson.build
@@ -1 +1 @@
-softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('ctrl.c', 'dif.c', 
'ns.c', 'subsys.c'))
+softmmu_ss.add(when: 'CONFIG_NVME_PCI', if_true: files('ctrl.c', 'dif.c', 
'ns.c', 'subsys.c', 'nvme-mi.c'))
diff --git a/hw/nvme/nvme-mi.c b/hw/nvme/nvme-mi.c
new file mode 100644
index 000..5e93417
--- /dev/null
+++ b/hw/nvme/nvme-mi.c
@@ -0,0 +1,676 @@
+/*
+ * QEMU NVMe-MI Controller
+ *
+ * Copyright (c) 2021, Samsung Electronics co Ltd.
+ *
+ * Written by Padmakar Kalghatgi 
+ *Arun Kumar Agasar 
+ *Saurav Kumar 
+ *
+ * This code is licensed under the GNU GPL v2 or later.
+ */
+
+/**
+ * Reference Specs: http://www.nvmexpress.org,
+ *
+ *
+ * Usage
+ * -
+ * The nvme-mi device has to be used along with nvme device only
+ *
+ * Add options:
+ *-device  nvme-mi,bus=
+ *-device  vhost-vsock-pci, guest-cid=
+ *
+ * the cid is used to connect to the vsock
+ */
+
+#include "qemu/osdep.h"
+#include "hw/qdev-core.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "nvme.h"
+#include "nvme-mi.h"
+#include "qemu/crc32c.h"
+
+#define NVME_TEMPERATURE 0x143
+#define NVME_TEMPERATURE_WARNING 0x157
+#define NVME_TEMPERATURE_CRITICAL 0x175
+
+static void nvme_mi_send_resp(NvmeMiCtrl *ctrl_mi, uint8_t *resp, uint32_t 
size)
+{
+uint32_t crc_value = crc32c(0x, resp, size);
+size += 4;
+uint32_t retries = 5;
+uint32_t offset = 0;
+uint32_t som = 1;
+uint32_t eom = 0;
+uint32_t pktseq = 0;
+uint32_t mtus = ctrl_mi->mctp_unit_size;
+while (size > 0) {
+uint32_t sizesent = size > mtus ? mtus : size;
+size -= sizesent;
+eom = size > 0 ? 0 : 1;
+g_autofree uint8_t *buf = (uint8_t *)g_malloc(sizesent + 8);
+buf[2] = sizesent + 5;
+buf[7] = (som << 7) | (eom << 6) | (pktseq << 5);
+som = 0;
+memcpy(buf + 8, resp + offset, sizesent);
+offset += sizesent;
+if (size <= 0) {
+memcpy(buf + 4 + offset , _value, sizeof(crc_value));
+}
+retries = 5;
+while (retries > 0) {
+int32_t nsend = send(ctrl_mi->sock_desc, buf, sizesent + 8, 0);
+if (nsend < 0) {
+retries--;
+

Re: [PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


09.07.2021 16:16, Eric Blake wrote:

On Fri, Jul 09, 2021 at 09:33:50AM +0300, Vladimir Sementsov-Ogievskiy wrote:

+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps



+echo
+echo "=== Check handling of inconsistent bitmap ==="
+echo
+
+$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
+$QEMU_IMG bitmap --add "$TEST_IMG" b4
+$QEMU_IMG bitmap --remove "$TEST_IMG" b1
+_img_info --format-specific | _filter_irrelevant_img_info
+$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"


Worth then removing remaining inconsistent bitmaps and try again?

I think you should now remove $TEST_IMG.copy in _cleanup


$TEST_IMG.copy isn't created on failure (or if it is, that in itself
is a problem we should be avoiding),


Seems that's the case:
./build/qemu-img create -f qcow2 x 1M
./build/qemu-img bitmap --add x b1
./build/qemu-io x
qemu-io> abort
Aborted (core dumped)
 ./build/qemu-img info x
image: x
file format: qcow2
virtual size: 1 MiB (1048576 bytes)
disk size: 204 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
bitmaps:
[0]:
flags:
[0]: in-use
[1]: auto
name: b1
granularity: 65536
refcount bits: 16
corrupt: false
extended l2: false


ls y
ls: cannot access 'y': No such file or directory
./build/qemu-img convert --bitmaps -O qcow2 x y
qemu-img: Failed to populate bitmap b1: Bitmap 'b1' is inconsistent and cannot 
be used
Try block-dirty-bitmap-remove to delete this bitmap from disk[root@kvm master]#
# ls y
y
./build/qemu-img info y
image: y
file format: qcow2
virtual size: 1 MiB (1048576 bytes)
disk size: 204 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
bitmaps:
[0]:
flags:
name: b1
granularity: 65536
refcount bits: 16
corrupt: false
extended l2: false


WOW! It even contains the bitmap not marked in-use. That's a real bug.


so as written, there was nothing
that should have needed cleaning up until patch 2.  But your idea
(here and in patch 2) of demonstrating manual cleanup for recovery (in
addition to the goal of patch 2 of skipping broken bitmaps in the
first place) is reasonable, so I'll incorporate that into v2.




--
Best regards,
Vladimir

Re: [PULL 0/5] Block patches

2021-07-09 Thread Peter Maydell

On Thu, 8 Jul 2021 at 14:11, Stefan Hajnoczi  wrote:
>
> The following changes since commit 711c0418c8c1ce3a24346f058b001c4c5a2f0f81:
>
>   Merge remote-tracking branch 'remotes/philmd/tags/mips-20210702' into 
> staging (2021-07-04 14:04:12 +0100)
>
> are available in the Git repository at:
>
>   https://gitlab.com/stefanha/qemu.git tags/block-pull-request
>
> for you to fetch changes up to 9f460c64e13897117f35ffb61f6f5e0102cabc70:
>
>   block/io: Merge discard request alignments (2021-07-06 14:28:55 +0100)
>
> 
> Pull request
>

Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/6.1
for any user-visible changes.

-- PMM

Re: [PATCH 3/2] qemu-img: Improve error for rebase without backing format

2021-07-09 Thread Eric Blake

On Thu, Jul 08, 2021 at 06:58:21PM +0200, Kevin Wolf wrote:
> Am 08.07.2021 um 17:52 hat Eric Blake geschrieben:
> > When removeing support for qemu-img being able to create backing

removing

(is it bad that I don't catch my own typos until seeing them through
the mailing list?)

> > chains without embedded backing formats, we caused a poor error
> > message as caught by iotest 114.  Improve the situation to inform the
> > user what went wrong.
> > 
> > Suggested-by: Kevin Wolf 
> > Signed-off-by: Eric Blake 
> 
> Thanks, applied to the block branch.

And thanks for catching my obvious lack of filtering out local file names.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-09 Thread Eric Blake

On Fri, Jul 09, 2021 at 09:33:50AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > +++ b/tests/qemu-iotests/tests/qemu-img-bitmaps

> > +echo
> > +echo "=== Check handling of inconsistent bitmap ==="
> > +echo
> > +
> > +$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
> > +$QEMU_IMG bitmap --add "$TEST_IMG" b4
> > +$QEMU_IMG bitmap --remove "$TEST_IMG" b1
> > +_img_info --format-specific | _filter_irrelevant_img_info
> > +$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"
> 
> Worth then removing remaining inconsistent bitmaps and try again?
> 
> I think you should now remove $TEST_IMG.copy in _cleanup

$TEST_IMG.copy isn't created on failure (or if it is, that in itself
is a problem we should be avoiding), so as written, there was nothing
that should have needed cleaning up until patch 2.  But your idea
(here and in patch 2) of demonstrating manual cleanup for recovery (in
addition to the goal of patch 2 of skipping broken bitmaps in the
first place) is reasonable, so I'll incorporate that into v2.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

[PULL 27/28] iotests: Test reopening multiple devices at the same time

2021-07-09 Thread Kevin Wolf

From: Alberto Garcia 

This test swaps the images used by two active block devices.

This is now possible thanks to the new ability to run
x-blockdev-reopen on multiple devices at the same time.

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210708114709.206487-6-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/245 | 47 ++
 tests/qemu-iotests/245.out |  4 ++--
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index ca08955207..8bc8101e6d 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -649,6 +649,53 @@ class TestBlockdevReopen(iotests.QMPTestCase):
  '-c', 'read -P 0x40 0x40008 1',
  '-c', 'read -P 0x80 0x40010 1', 
hd_path[0])
 
+# Swap the disk images of two active block devices
+def test_swap_files(self):
+# Add hd0 and hd2 (none of them with backing files)
+opts0 = hd_opts(0)
+result = self.vm.qmp('blockdev-add', conv_keys = False, **opts0)
+self.assert_qmp(result, 'return', {})
+
+opts2 = hd_opts(2)
+result = self.vm.qmp('blockdev-add', conv_keys = False, **opts2)
+self.assert_qmp(result, 'return', {})
+
+# Write different data to both block devices
+self.run_qemu_io("hd0", "write -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "write -P 0xa2 0 1k")
+
+# Check that the data reads correctly
+self.run_qemu_io("hd0", "read  -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa2 0 1k")
+
+# It's not possible to make a block device use an image that
+# is already being used by the other device.
+self.reopen(opts0, {'file': 'hd2-file'},
+"Permission conflict on node 'hd2-file': permissions "
+"'write, resize' are both required by node 'hd2' (uses "
+"node 'hd2-file' as 'file' child) and unshared by node "
+"'hd0' (uses node 'hd2-file' as 'file' child).")
+self.reopen(opts2, {'file': 'hd0-file'},
+"Permission conflict on node 'hd0-file': permissions "
+"'write, resize' are both required by node 'hd0' (uses "
+"node 'hd0-file' as 'file' child) and unshared by node "
+"'hd2' (uses node 'hd0-file' as 'file' child).")
+
+# But we can swap the images if we reopen both devices at the
+# same time
+opts0['file'] = 'hd2-file'
+opts2['file'] = 'hd0-file'
+self.reopenMultiple([opts0, opts2])
+self.run_qemu_io("hd0", "read  -P 0xa2 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa0 0 1k")
+
+# And we can of course come back to the original state
+opts0['file'] = 'hd0-file'
+opts2['file'] = 'hd2-file'
+self.reopenMultiple([opts0, opts2])
+self.run_qemu_io("hd0", "read  -P 0xa0 0 1k")
+self.run_qemu_io("hd2", "read  -P 0xa2 0 1k")
+
 # Misc reopen tests with different block drivers
 @iotests.skip_if_unsupported(['quorum', 'throttle'])
 def test_misc_drivers(self):
diff --git a/tests/qemu-iotests/245.out b/tests/qemu-iotests/245.out
index daf1e51922..4eced19294 100644
--- a/tests/qemu-iotests/245.out
+++ b/tests/qemu-iotests/245.out
@@ -17,8 +17,8 @@ read 1/1 bytes at offset 262152
 read 1/1 bytes at offset 262160
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
-..
+...
 --
-Ran 24 tests
+Ran 25 tests
 
 OK
-- 
2.31.1

[PULL 23/28] qcow2: Fix dangling pointer after reopen for 'file'

2021-07-09 Thread Kevin Wolf

Without an external data file, s->data_file is a second pointer with the
same value as bs->file. When changing bs->file to a different BdrvChild
and freeing the old BdrvChild, s->data_file must also be updated,
otherwise it points to freed memory and causes crashes.

This problem was caught by iotests case 245.

Fixes: df2b7086f169239ebad5d150efa29c9bb6d4f820
Signed-off-by: Kevin Wolf 
Message-Id: <20210708114709.206487-2-kw...@redhat.com>
Reviewed-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Kevin Wolf 
---
 block/qcow2.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cac2eda36..9f1b6461c8 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1926,6 +1926,7 @@ static void qcow2_refresh_limits(BlockDriverState *bs, 
Error **errp)
 static int qcow2_reopen_prepare(BDRVReopenState *state,
 BlockReopenQueue *queue, Error **errp)
 {
+BDRVQcow2State *s = state->bs->opaque;
 Qcow2ReopenState *r;
 int ret;
 
@@ -1956,6 +1957,16 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
 }
 }
 
+/*
+ * Without an external data file, s->data_file points to the same BdrvChild
+ * as bs->file. It needs to be resynced after reopen because bs->file may
+ * be changed. We can't use it in the meantime.
+ */
+if (!has_data_file(state->bs)) {
+assert(s->data_file == state->bs->file);
+s->data_file = NULL;
+}
+
 return 0;
 
 fail:
@@ -1966,7 +1977,16 @@ fail:
 
 static void qcow2_reopen_commit(BDRVReopenState *state)
 {
+BDRVQcow2State *s = state->bs->opaque;
+
 qcow2_update_options_commit(state->bs, state->opaque);
+if (!s->data_file) {
+/*
+ * If we don't have an external data file, s->data_file was cleared by
+ * qcow2_reopen_prepare() and needs to be updated.
+ */
+s->data_file = state->bs->file;
+}
 g_free(state->opaque);
 }
 
@@ -1990,6 +2010,15 @@ static void qcow2_reopen_commit_post(BDRVReopenState 
*state)
 
 static void qcow2_reopen_abort(BDRVReopenState *state)
 {
+BDRVQcow2State *s = state->bs->opaque;
+
+if (!s->data_file) {
+/*
+ * If we don't have an external data file, s->data_file was cleared by
+ * qcow2_reopen_prepare() and needs to be restored.
+ */
+s->data_file = state->bs->file;
+}
 qcow2_update_options_abort(state->bs, state->opaque);
 g_free(state->opaque);
 }
-- 
2.31.1

[PULL 25/28] block: Acquire AioContexts during bdrv_reopen_multiple()

2021-07-09 Thread Kevin Wolf

As the BlockReopenQueue can contain nodes in multiple AioContexts, only
one of which may be locked when AIO_WAIT_WHILE() can be called, we can't
let the caller lock the right contexts. Instead, individually lock the
AioContext of a single node when iterating the queue.

Reintroduce bdrv_reopen() as a wrapper for reopening a single node that
drains the node and temporarily drops the AioContext lock for
bdrv_reopen_multiple().

Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210708114709.206487-4-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  2 ++
 block.c   | 49 ---
 block/replication.c   |  7 +++
 blockdev.c|  5 +
 qemu-io-cmds.c|  7 +--
 5 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 6d42992985..3477290f9a 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -388,6 +388,8 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue 
*bs_queue,
 bool keep_old_opts);
 void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
   Error **errp);
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
diff --git a/block.c b/block.c
index a26465e3da..be083f389e 100644
--- a/block.c
+++ b/block.c
@@ -4124,19 +4124,26 @@ void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
  *
  * All affected nodes must be drained between bdrv_reopen_queue() and
  * bdrv_reopen_multiple().
+ *
+ * To be called from the main thread, with all other AioContexts unlocked.
  */
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp)
 {
 int ret = -1;
 BlockReopenQueueEntry *bs_entry, *next;
+AioContext *ctx;
 Transaction *tran = tran_new();
 g_autoptr(GHashTable) found = NULL;
 g_autoptr(GSList) refresh_list = NULL;
 
+assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 assert(bs_queue != NULL);
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 ret = bdrv_flush(bs_entry->state.bs);
+aio_context_release(ctx);
 if (ret < 0) {
 error_setg_errno(errp, -ret, "Error flushing drive");
 goto abort;
@@ -4145,7 +4152,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 
 QTAILQ_FOREACH(bs_entry, bs_queue, entry) {
 assert(bs_entry->state.bs->quiesce_counter > 0);
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 ret = bdrv_reopen_prepare(_entry->state, bs_queue, tran, errp);
+aio_context_release(ctx);
 if (ret < 0) {
 goto abort;
 }
@@ -4188,7 +4198,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
  * to first element.
  */
 QTAILQ_FOREACH_REVERSE(bs_entry, bs_queue, entry) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 bdrv_reopen_commit(_entry->state);
+aio_context_release(ctx);
 }
 
 tran_commit(tran);
@@ -4197,7 +4210,10 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
 BlockDriverState *bs = bs_entry->state.bs;
 
 if (bs->drv->bdrv_reopen_commit_post) {
+ctx = bdrv_get_aio_context(bs);
+aio_context_acquire(ctx);
 bs->drv->bdrv_reopen_commit_post(_entry->state);
+aio_context_release(ctx);
 }
 }
 
@@ -4208,7 +4224,10 @@ abort:
 tran_abort(tran);
 QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
 if (bs_entry->prepared) {
+ctx = bdrv_get_aio_context(bs_entry->state.bs);
+aio_context_acquire(ctx);
 bdrv_reopen_abort(_entry->state);
+aio_context_release(ctx);
 }
 }
 
@@ -4218,23 +4237,39 @@ cleanup:
 return ret;
 }
 
-int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
-  Error **errp)
+int bdrv_reopen(BlockDriverState *bs, QDict *opts, bool keep_old_opts,
+Error **errp)
 {
-int ret;
+AioContext *ctx = bdrv_get_aio_context(bs);
 BlockReopenQueue *queue;
-QDict *opts = qdict_new();
-
-qdict_put_bool(opts, BDRV_OPT_READ_ONLY, read_only);
+int ret;
 
 bdrv_subtree_drained_begin(bs);
-queue = bdrv_reopen_queue(NULL, bs, opts, true);
+if (ctx != qemu_get_aio_context()) {
+aio_context_release(ctx);
+}
+
+queue = bdrv_reopen_queue(NULL, bs, opts, keep_old_opts);
 ret = bdrv_reopen_multiple(queue, errp);
+
+

[PULL 21/28] qemu-img: Require -F with -b backing image

2021-07-09 Thread Kevin Wolf

From: Eric Blake 

Back in commit d9f059aa6c (qemu-img: Deprecate use of -b without -F),
we deprecated the ability to create a file with a backing image that
requires qemu to perform format probing.  Qemu can still probe older
files for backwards compatibility, but it is time to finish off the
ability to create such images, due to the potential security risk they
present.  Update a couple of iotests affected by the change.

Signed-off-by: Eric Blake 
Message-Id: <20210503213600.569128-3-ebl...@redhat.com>
Reviewed-by: Connor Kuehl 
Signed-off-by: Kevin Wolf 
---
 docs/system/deprecated.rst   | 20 -
 docs/system/removed-features.rst | 19 
 block.c  | 37 ++--
 qemu-img.c   |  6 --
 tests/qemu-iotests/040   |  4 ++--
 tests/qemu-iotests/041   |  6 --
 tests/qemu-iotests/114   | 18 
 tests/qemu-iotests/114.out   | 11 --
 tests/qemu-iotests/301   |  4 +---
 tests/qemu-iotests/301.out   | 16 ++
 10 files changed, 56 insertions(+), 85 deletions(-)

diff --git a/docs/system/deprecated.rst b/docs/system/deprecated.rst
index 9626a1fb57..25d6c4c9a0 100644
--- a/docs/system/deprecated.rst
+++ b/docs/system/deprecated.rst
@@ -282,26 +282,6 @@ this CPU is also deprecated.
 Related binaries
 
 
-qemu-img backing file without format (since 5.1)
-
-
-The use of ``qemu-img create``, ``qemu-img rebase``, or ``qemu-img
-convert`` to create or modify an image that depends on a backing file
-now recommends that an explicit backing format be provided.  This is
-for safety: if QEMU probes a different format than what you thought,
-the data presented to the guest will be corrupt; similarly, presenting
-a raw image to a guest allows a potential security exploit if a future
-probe sees a non-raw image based on guest writes.
-
-To avoid the warning message, or even future refusal to create an
-unsafe image, you must pass ``-o backing_fmt=`` (or the shorthand
-``-F`` during create) to specify the intended backing format.  You may
-use ``qemu-img rebase -u`` to retroactively add a backing format to an
-existing image.  However, be aware that there are already potential
-security risks to blindly using ``qemu-img info`` to probe the format
-of an untrusted backing image, when deciding what format to add into
-an existing image.
-
 Backwards compatibility
 ---
 
diff --git a/docs/system/removed-features.rst b/docs/system/removed-features.rst
index b64ea55194..28bb035043 100644
--- a/docs/system/removed-features.rst
+++ b/docs/system/removed-features.rst
@@ -503,6 +503,25 @@ backing chain should be performed with ``qemu-img rebase 
-u`` either
 before or after the remaining changes being performed by amend, as
 appropriate.
 
+qemu-img backing file without format (removed in 6.1)
+'
+
+The use of ``qemu-img create``, ``qemu-img rebase``, or ``qemu-img
+convert`` to create or modify an image that depends on a backing file
+now requires that an explicit backing format be provided.  This is
+for safety: if QEMU probes a different format than what you thought,
+the data presented to the guest will be corrupt; similarly, presenting
+a raw image to a guest allows a potential security exploit if a future
+probe sees a non-raw image based on guest writes.
+
+To avoid creating unsafe backing chains, you must pass ``-o
+backing_fmt=`` (or the shorthand ``-F`` during create) to specify the
+intended backing format.  You may use ``qemu-img rebase -u`` to
+retroactively add a backing format to an existing image.  However, be
+aware that there are already potential security risks to blindly using
+``qemu-img info`` to probe the format of an untrusted backing image,
+when deciding what format to add into an existing image.
+
 Block devices
 -
 
diff --git a/block.c b/block.c
index acd35cb0cb..ce96585575 100644
--- a/block.c
+++ b/block.c
@@ -5074,7 +5074,7 @@ int coroutine_fn bdrv_co_check(BlockDriverState *bs,
  * -ENOTSUP - format driver doesn't support changing the backing file
  */
 int bdrv_change_backing_file(BlockDriverState *bs, const char *backing_file,
- const char *backing_fmt, bool warn)
+ const char *backing_fmt, bool require)
 {
 BlockDriver *drv = bs->drv;
 int ret;
@@ -5088,10 +5088,8 @@ int bdrv_change_backing_file(BlockDriverState *bs, const 
char *backing_file,
 return -EINVAL;
 }
 
-if (warn && backing_file && !backing_fmt) {
-warn_report("Deprecated use of backing file without explicit "
-"backing format, use of this image requires "
-"potentially unsafe format probing");
+if (require && backing_file && !backing_fmt) {
+return -EINVAL;
 }
 
 if

[PULL 22/28] qemu-img: Improve error for rebase without backing format

2021-07-09 Thread Kevin Wolf

From: Eric Blake 

When removeing support for qemu-img being able to create backing
chains without embedded backing formats, we caused a poor error
message as caught by iotest 114.  Improve the situation to inform the
user what went wrong.

Suggested-by: Kevin Wolf 
Signed-off-by: Eric Blake 
Message-Id: <20210708155228.2666172-1-ebl...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 qemu-img.c | 3 +++
 tests/qemu-iotests/114.out | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/qemu-img.c b/qemu-img.c
index ec0e2fabe5..7c4fc60312 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3767,6 +3767,9 @@ static int img_rebase(int argc, char **argv)
 if (ret == -ENOSPC) {
 error_report("Could not change the backing file to '%s': No "
  "space left in the file header", out_baseimg);
+} else if (ret == -EINVAL && out_baseimg && !out_basefmt) {
+error_report("Could not change the backing file to '%s': backing "
+ "format must be specified", out_baseimg);
 } else if (ret < 0) {
 error_report("Could not change the backing file to '%s': %s",
 out_baseimg, strerror(-ret));
diff --git a/tests/qemu-iotests/114.out b/tests/qemu-iotests/114.out
index 6d638da266..f51dd9d20a 100644
--- a/tests/qemu-iotests/114.out
+++ b/tests/qemu-iotests/114.out
@@ -14,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open 
backing file: Unknow
 no file open, try 'help open'
 read 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-qemu-img: Could not change the backing file to 'TEST_DIR/t.qcow2.base': 
Invalid argument
+qemu-img: Could not change the backing file to 'TEST_DIR/t.qcow2.base': 
backing format must be specified
 read 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done
-- 
2.31.1

[PULL 20/28] qcow2: Prohibit backing file changes in 'qemu-img amend'

2021-07-09 Thread Kevin Wolf

From: Eric Blake 

This was deprecated back in bc5ee6da7 (qcow2: Deprecate use of
qemu-img amend to change backing file), and no one in the meantime has
given any reasons why it should be supported.  Time to make change
attempts a hard error (but for convenience, specifying the _same_
backing chain is not forbidden).  Update a couple of iotests to match.

Signed-off-by: Eric Blake 
Message-Id: <20210503213600.569128-2-ebl...@redhat.com>
Reviewed-by: Connor Kuehl 
Signed-off-by: Kevin Wolf 
---
 docs/system/deprecated.rst   | 12 
 docs/system/removed-features.rst | 12 
 block/qcow2.c| 13 -
 tests/qemu-iotests/061   |  3 +++
 tests/qemu-iotests/061.out   |  3 ++-
 tests/qemu-iotests/082.out   |  6 --
 6 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/docs/system/deprecated.rst b/docs/system/deprecated.rst
index 70e08baff6..9626a1fb57 100644
--- a/docs/system/deprecated.rst
+++ b/docs/system/deprecated.rst
@@ -282,18 +282,6 @@ this CPU is also deprecated.
 Related binaries
 
 
-qemu-img amend to adjust backing file (since 5.1)
-'
-
-The use of ``qemu-img amend`` to modify the name or format of a qcow2
-backing image is deprecated; this functionality was never fully
-documented or tested, and interferes with other amend operations that
-need access to the original backing image (such as deciding whether a
-v3 zero cluster may be left unallocated when converting to a v2
-image).  Rather, any changes to the backing chain should be performed
-with ``qemu-img rebase -u`` either before or after the remaining
-changes being performed by amend, as appropriate.
-
 qemu-img backing file without format (since 5.1)
 
 
diff --git a/docs/system/removed-features.rst b/docs/system/removed-features.rst
index 2b21bd39ab..b64ea55194 100644
--- a/docs/system/removed-features.rst
+++ b/docs/system/removed-features.rst
@@ -491,6 +491,18 @@ topologies described with -smp include all possible cpus, 
i.e.
 The ``enforce-config-section`` property was replaced by the
 ``-global migration.send-configuration={on|off}`` option.
 
+qemu-img amend to adjust backing file (removed in 6.1)
+''
+
+The use of ``qemu-img amend`` to modify the name or format of a qcow2
+backing image was never fully documented or tested, and interferes
+with other amend operations that need access to the original backing
+image (such as deciding whether a v3 zero cluster may be left
+unallocated when converting to a v2 image).  Any changes to the
+backing chain should be performed with ``qemu-img rebase -u`` either
+before or after the remaining changes being performed by amend, as
+appropriate.
+
 Block devices
 -
 
diff --git a/block/qcow2.c b/block/qcow2.c
index ee4530cdbd..0cac2eda36 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -5620,15 +5620,10 @@ static int qcow2_amend_options(BlockDriverState *bs, 
QemuOpts *opts,
 if (backing_file || backing_format) {
 if (g_strcmp0(backing_file, s->image_backing_file) ||
 g_strcmp0(backing_format, s->image_backing_format)) {
-warn_report("Deprecated use of amend to alter the backing file; "
-"use qemu-img rebase instead");
-}
-ret = qcow2_change_backing_file(bs,
-backing_file ?: s->image_backing_file,
-backing_format ?: s->image_backing_format);
-if (ret < 0) {
-error_setg_errno(errp, -ret, "Failed to change the backing file");
-return ret;
+error_setg(errp, "Cannot amend the backing file");
+error_append_hint(errp,
+  "You can use 'qemu-img rebase' instead.\n");
+return -EINVAL;
 }
 }
 
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index e26d94a0df..9507c223bd 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -167,6 +167,9 @@ _make_test_img -o "compat=1.1" 64M
 TEST_IMG="$TEST_IMG.base" _make_test_img -o "compat=1.1" 64M
 $QEMU_IO -c "write -P 0x2a 0 128k" "$TEST_IMG.base" | _filter_qemu_io
 $QEMU_IO -c "read -P 0 0 128k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IMG amend -o "backing_file=$TEST_IMG.base,backing_fmt=qcow2" \
+ "$TEST_IMG" && echo "unexpected pass"
+$QEMU_IMG rebase -u -b "$TEST_IMG.base" -F qcow2 "$TEST_IMG"
 $QEMU_IMG amend -o "backing_file=$TEST_IMG.base,backing_fmt=qcow2" "$TEST_IMG"
 $QEMU_IO -c "read -P 0x2a 0 128k" "$TEST_IMG" | _filter_qemu_io
 _check_test_img
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index ee30da2665..7ecbd4dea8 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -370,7 +370,8 @@ wrote 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 131072/131072 bytes at

[PULL 28/28] block: Make blockdev-reopen stable API

2021-07-09 Thread Kevin Wolf

From: Alberto Garcia 

This patch drops the 'x-' prefix from x-blockdev-reopen.

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210708114709.206487-7-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 qapi/block-core.json|  6 +++---
 blockdev.c  |  2 +-
 tests/qemu-iotests/155  |  2 +-
 tests/qemu-iotests/165  |  2 +-
 tests/qemu-iotests/245  | 10 +-
 tests/qemu-iotests/248  |  2 +-
 tests/qemu-iotests/248.out  |  2 +-
 tests/qemu-iotests/296  |  2 +-
 tests/qemu-iotests/298  |  2 +-
 tests/qemu-iotests/tests/remove-bitmap-from-backing |  4 ++--
 10 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 052520331e..c7a311798a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4219,7 +4219,7 @@
 { 'command': 'blockdev-add', 'data': 'BlockdevOptions', 'boxed': true }
 
 ##
-# @x-blockdev-reopen:
+# @blockdev-reopen:
 #
 # Reopens one or more block devices using the given set of options.
 # Any option not specified will be reset to its default value regardless
@@ -4257,9 +4257,9 @@
 # image does not have a default backing file name as part of its
 # metadata.
 #
-# Since: 4.0
+# Since: 6.1
 ##
-{ 'command': 'x-blockdev-reopen',
+{ 'command': 'blockdev-reopen',
   'data': { 'options': ['BlockdevOptions'] } }
 
 ##
diff --git a/blockdev.c b/blockdev.c
index 5ad0e9070e..3d8ac368a1 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3559,7 +3559,7 @@ fail:
 visit_free(v);
 }
 
-void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
+void qmp_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
 {
 BlockReopenQueue *queue = NULL;
 GSList *drained = NULL;
diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155
index 2947bfb81a..eadda52615 100755
--- a/tests/qemu-iotests/155
+++ b/tests/qemu-iotests/155
@@ -261,7 +261,7 @@ class TestBlockdevMirrorReopen(MirrorBaseClass):
 result = self.vm.qmp('blockdev-add', node_name="backing",
  driver="null-co")
 self.assert_qmp(result, 'return', {})
-result = self.vm.qmp('x-blockdev-reopen', options=[{
+result = self.vm.qmp('blockdev-reopen', options=[{
  'node-name': "target",
  'driver': iotests.imgfmt,
  'file': "target-file",
diff --git a/tests/qemu-iotests/165 b/tests/qemu-iotests/165
index ef4cf14516..ce499946b8 100755
--- a/tests/qemu-iotests/165
+++ b/tests/qemu-iotests/165
@@ -137,7 +137,7 @@ class TestPersistentDirtyBitmap(iotests.QMPTestCase):
 assert sha256_1 == self.getSha256()
 
 # Reopen to RW
-result = self.vm.qmp('x-blockdev-reopen', options=[{
+result = self.vm.qmp('blockdev-reopen', options=[{
 'node-name': 'node0',
 'driver': iotests.imgfmt,
 'file': {
diff --git a/tests/qemu-iotests/245 b/tests/qemu-iotests/245
index 8bc8101e6d..bf8261eec0 100755
--- a/tests/qemu-iotests/245
+++ b/tests/qemu-iotests/245
@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 # group: rw
 #
-# Test cases for the QMP 'x-blockdev-reopen' command
+# Test cases for the QMP 'blockdev-reopen' command
 #
 # Copyright (C) 2018-2019 Igalia, S.L.
 # Author: Alberto Garcia 
@@ -85,16 +85,16 @@ class TestBlockdevReopen(iotests.QMPTestCase):
  "Expected output of %d qemu-io commands, found %d" %
  (found, self.total_io_cmds))
 
-# Run x-blockdev-reopen on a list of block devices
+# Run blockdev-reopen on a list of block devices
 def reopenMultiple(self, opts, errmsg = None):
-result = self.vm.qmp('x-blockdev-reopen', conv_keys=False, 
options=opts)
+result = self.vm.qmp('blockdev-reopen', conv_keys=False, options=opts)
 if errmsg:
 self.assert_qmp(result, 'error/class', 'GenericError')
 self.assert_qmp(result, 'error/desc', errmsg)
 else:
 self.assert_qmp(result, 'return', {})
 
-# Run x-blockdev-reopen on a single block device (specified by
+# Run blockdev-reopen on a single block device (specified by
 # 'opts') but applying 'newopts' on top of it. The original 'opts'
 # dict is unmodified
 def reopen(self, opts, newopts = {}, errmsg = None):
@@ -161,7 +161,7 @@ class TestBlockdevReopen(iotests.QMPTestCase):
 self.reopen(opts, {'file.locking': 'off'}, "Cannot change the option 
'locking'")
 self.reopen(opts, {'file.filename': None}, "Invalid parameter type for 
'options[0].file.filename', expected: string")
 
-# node-name is optional in

[PULL 07/28] block/rbd: add write zeroes support

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.

The rationale for this is as follows (citing Ilya Dryomov current RBD
maintainer):

---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
   and as a consequence always unmap if librbd is too old

   It's not clear what qemu's expectation is but in general Write
   Zeroes is allowed to unmap.  The only guarantee is that subsequent
   reads return zeroes, everything else is a hint.  This is how it is
   specified in the kernel and in the NVMe spec.

   In particular, block/nvme.c implements it as follows:

   if (flags & BDRV_REQ_MAY_UNMAP) {
   cdw12 |= (1 << 25);
   }

   This sets the Deallocate bit.  But if it's not set, the device may
   still deallocate:

   """
   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
   command, and the namespace supports clearing all bytes to 0h in the
   values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
   from a deallocated logical block and its metadata (excluding
   protection information), then for each specified logical block, the
   controller:
   - should deallocate that logical block;

   ...

   If the Deallocate bit is cleared to '0' in a Write Zeroes command,
   and the namespace supports clearing all bytes to 0h in the values
   read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
   a deallocated logical block and its metadata (excluding protection
   information), then, for each specified logical block, the
   controller:
   - may deallocate that logical block;
   """

   
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf

b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags

   Again, it's not clear what qemu expects here, but without it we end
   up in a ridiculous situation where specifying the "don't allow slow
   fallback" switch immediately fails all efficient zeroing requests on
   a device where Write Zeroes is always efficient:

   $ qemu-io -c 'help write' | grep -- '-[zun]'
-n, -- with -z, don't allow slow fallback
-u, -- with -z, allow unmapping
-z, -- write zeroes using blk_co_pwrite_zeroes

   $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
   write failed: Operation not supported
--->8---

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
Message-Id: <20210702172356.11574-6-idryo...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 380ad28861..3152bc8ba0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -75,7 +75,8 @@ typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
 RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
 } RBDAIOCmd;
 
 typedef struct BDRVRBDState {
@@ -999,6 +1000,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
+#endif
+
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
 
@@ -1123,6 +1128,18 @@ static int coroutine_fn 
qemu_rbd_start_co(BlockDriverState *bs,
 case RBD_AIO_FLUSH:
 r = rbd_aio_flush(s->image, c);
 break;
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+case RBD_AIO_WRITE_ZEROES: {
+int zero_flags = 0;
+#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
+}
+#endif
+r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
+break;
+}
+#endif
 default:
 r = -EINVAL;
 }
@@ -1193,6 +1210,16 @@ static int coroutine_fn 
qemu_rbd_co_pdiscard(BlockDriverState *bs,
 return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+static int
+coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+  int count, BdrvRequestFlags flags)
+{
+return qemu_rbd_start_co(bs, offset, count, NULL, flags,
+ RBD_AIO_WRITE_ZEROES);
+}
+#endif
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1473,6 +1500,9 @@ static BlockDriver bdrv_rbd = {
 .bdrv_co_pwritev= qemu_rbd_co_pwritev,
 .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
 .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
+#endif
 
 .bdrv_snapshot_create   = qemu_rbd_snap_create,
 .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
-- 
2.31.1

[PULL 19/28] blockdev: fix drive-backup transaction endless drained section

2021-07-09 Thread Kevin Wolf

From: Vladimir Sementsov-Ogievskiy 

drive_backup_prepare() does bdrv_drained_begin() in hope that
bdrv_drained_end() will be called in drive_backup_clean(). Still we
need to set state->bs for this to work. That's done too late: a lot of
failure paths in drive_backup_prepare() miss setting state->bs. Fix
that.

Fixes: 2288ccfac96281c316db942d10e3f921c1373064
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/399
Signed-off-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210608171852.250775-1-vsement...@virtuozzo.com>
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 blockdev.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index f08192deda..094c085962 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1714,6 +1714,7 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
 
+state->bs = bs;
 /* Paired with .clean() */
 bdrv_drained_begin(bs);
 
@@ -1813,8 +1814,6 @@ static void drive_backup_prepare(BlkActionState *common, 
Error **errp)
 }
 }
 
-state->bs = bs;
-
 state->job = do_backup_common(qapi_DriveBackup_base(backup),
   bs, target_bs, aio_context,
   common->block_job_txn, errp);
-- 
2.31.1

[PULL 17/28] MAINTAINERS: add block/rbd.c reviewer

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

adding myself as a designated reviewer.

Signed-off-by: Peter Lieven 
Message-Id: <20210707180449.32665-2...@kamp.de>
Acked-by: Ilya Dryomov 
Signed-off-by: Kevin Wolf 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 86ec36a062..0e9f338e32 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3061,6 +3061,7 @@ F: block/vmdk.c
 
 RBD
 M: Ilya Dryomov 
+R: Peter Lieven 
 L: qemu-block@nongnu.org
 S: Supported
 F: block/rbd.c
-- 
2.31.1

[PULL 16/28] block/rbd: fix type of task->complete

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

task->complete is a bool not an integer.

Signed-off-by: Peter Lieven 
Message-Id: <20210707180449.32665-1...@kamp.de>
Reviewed-by: Ilya Dryomov 
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 01a7b94d62..dcf82b15b8 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t 
size)
 static void qemu_rbd_finish_bh(void *opaque)
 {
 RBDTask *task = opaque;
-task->complete = 1;
+task->complete = true;
 aio_co_wake(task->co);
 }
 
-- 
2.31.1

[PULL 18/28] vhost-user: Fix backends without multiqueue support

2021-07-09 Thread Kevin Wolf

dev->max_queues was never initialised for backends that don't support
VHOST_USER_PROTOCOL_F_MQ, so it would use 0 as the maximum number of
queues to check against and consequently fail for any such backend.

Set it to 1 if the backend doesn't have multiqueue support.

Fixes: c90bd505a3e8210c23d69fecab9ee6f56ec4a161
Signed-off-by: Kevin Wolf 
Message-Id: <20210705171429.29286-1-kw...@redhat.com>
Reviewed-by: Cornelia Huck 
Reviewed-by: Raphael Norwitz 
Signed-off-by: Kevin Wolf 
---
 hw/virtio/vhost-user.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 1ac4a2ebec..29ea2b4fce 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1913,7 +1913,10 @@ static int vhost_user_backend_init(struct vhost_dev 
*dev, void *opaque,
 if (err < 0) {
 return -EPROTO;
 }
+} else {
+dev->max_queues = 1;
 }
+
 if (dev->num_queues && dev->max_queues < dev->num_queues) {
 error_setg(errp, "The maximum number of queues supported by the "
"backend is %" PRIu64, dev->max_queues);
-- 
2.31.1

[PULL 26/28] block: Support multiple reopening with x-blockdev-reopen

2021-07-09 Thread Kevin Wolf

From: Alberto Garcia 

[ kwolf: Fixed AioContext locking ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210708114709.206487-5-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 qapi/block-core.json  | 18 +++--
 blockdev.c| 81 ++-
 tests/qemu-iotests/155|  9 ++-
 tests/qemu-iotests/165|  4 +-
 tests/qemu-iotests/245| 27 ---
 tests/qemu-iotests/248|  2 +-
 tests/qemu-iotests/248.out|  2 +-
 tests/qemu-iotests/296|  9 ++-
 tests/qemu-iotests/298|  4 +-
 .../tests/remove-bitmap-from-backing  | 18 +++--
 10 files changed, 99 insertions(+), 75 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 4a46552816..052520331e 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4221,13 +4221,15 @@
 ##
 # @x-blockdev-reopen:
 #
-# Reopens a block device using the given set of options. Any option
-# not specified will be reset to its default value regardless of its
-# previous status. If an option cannot be changed or a particular
+# Reopens one or more block devices using the given set of options.
+# Any option not specified will be reset to its default value regardless
+# of its previous status. If an option cannot be changed or a particular
 # driver does not support reopening then the command will return an
-# error.
+# error. All devices in the list are reopened in one transaction, so
+# if one of them fails then the whole transaction is cancelled.
 #
-# The top-level @node-name option (from BlockdevOptions) must be
+# The command receives a list of block devices to reopen. For each one
+# of them, the top-level @node-name option (from BlockdevOptions) must be
 # specified and is used to select the block device to be reopened.
 # Other @node-name options must be either omitted or set to the
 # current name of the appropriate node. This command won't change any
@@ -4247,8 +4249,8 @@
 #
 #  4) NULL: the current child (if any) is detached.
 #
-# Options (1) and (2) are supported in all cases, but at the moment
-# only @backing allows replacing or detaching an existing child.
+# Options (1) and (2) are supported in all cases. Option (3) is
+# supported for @file and @backing, and option (4) for @backing only.
 #
 # Unlike with blockdev-add, the @backing option must always be present
 # unless the node being reopened does not have a backing file and its
@@ -4258,7 +4260,7 @@
 # Since: 4.0
 ##
 { 'command': 'x-blockdev-reopen',
-  'data': 'BlockdevOptions', 'boxed': true }
+  'data': { 'options': ['BlockdevOptions'] } }
 
 ##
 # @blockdev-del:
diff --git a/blockdev.c b/blockdev.c
index 0acbace8fd..5ad0e9070e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3559,51 +3559,60 @@ fail:
 visit_free(v);
 }
 
-void qmp_x_blockdev_reopen(BlockdevOptions *options, Error **errp)
-{
-BlockDriverState *bs;
-AioContext *ctx;
-QObject *obj;
-Visitor *v = qobject_output_visitor_new();
-BlockReopenQueue *queue;
-QDict *qdict;
+void qmp_x_blockdev_reopen(BlockdevOptionsList *reopen_list, Error **errp)
+{
+BlockReopenQueue *queue = NULL;
+GSList *drained = NULL;
+
+/* Add each one of the BDS that we want to reopen to the queue */
+for (; reopen_list != NULL; reopen_list = reopen_list->next) {
+BlockdevOptions *options = reopen_list->value;
+BlockDriverState *bs;
+AioContext *ctx;
+QObject *obj;
+Visitor *v;
+QDict *qdict;
+
+/* Check for the selected node name */
+if (!options->has_node_name) {
+error_setg(errp, "node-name not specified");
+goto fail;
+}
 
-/* Check for the selected node name */
-if (!options->has_node_name) {
-error_setg(errp, "node-name not specified");
-goto fail;
-}
+bs = bdrv_find_node(options->node_name);
+if (!bs) {
+error_setg(errp, "Failed to find node with node-name='%s'",
+   options->node_name);
+goto fail;
+}
 
-bs = bdrv_find_node(options->node_name);
-if (!bs) {
-error_setg(errp, "Failed to find node with node-name='%s'",
-   options->node_name);
-goto fail;
-}
+/* Put all options in a QDict and flatten it */
+v = qobject_output_visitor_new();
+visit_type_BlockdevOptions(v, NULL, , _abort);
+visit_complete(v, );
+visit_free(v);
 
-/* Put all options in a QDict and flatten it */
-visit_type_BlockdevOptions(v, NULL, , _abort);
-visit_complete(v, );
-qdict = qobject_to(QDict, obj);
+qdict = qobject_to(QDict, obj);
 
-qdict_flatten(qdict);
+qdict_flatten(qdict);
 
-/* Perform the reopen operation */
-ctx =

[PULL 24/28] block: Add bdrv_reopen_queue_free()

2021-07-09 Thread Kevin Wolf

From: Alberto Garcia 

Move the code to free a BlockReopenQueue to a separate function.
It will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
  qobject_ref() the value when it continues to be used. This makes
  future memory leaks less likely. ]

Signed-off-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Message-Id: <20210708114709.206487-3-kw...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 include/block/block.h |  1 +
 block.c   | 22 --
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 7ec77ecb1a..6d42992985 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -386,6 +386,7 @@ BlockDriverState *bdrv_new_open_driver(BlockDriver *drv, 
const char *node_name,
 BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue *bs_queue,
 BlockDriverState *bs, QDict *options,
 bool keep_old_opts);
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue);
 int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, Error **errp);
 int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
   Error **errp);
diff --git a/block.c b/block.c
index ce96585575..a26465e3da 100644
--- a/block.c
+++ b/block.c
@@ -4095,6 +4095,19 @@ BlockReopenQueue *bdrv_reopen_queue(BlockReopenQueue 
*bs_queue,
NULL, 0, keep_old_opts);
 }
 
+void bdrv_reopen_queue_free(BlockReopenQueue *bs_queue)
+{
+if (bs_queue) {
+BlockReopenQueueEntry *bs_entry, *next;
+QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
+qobject_unref(bs_entry->state.explicit_options);
+qobject_unref(bs_entry->state.options);
+g_free(bs_entry);
+}
+g_free(bs_queue);
+}
+}
+
 /*
  * Reopen multiple BlockDriverStates atomically & transactionally.
  *
@@ -4197,15 +4210,10 @@ abort:
 if (bs_entry->prepared) {
 bdrv_reopen_abort(_entry->state);
 }
-qobject_unref(bs_entry->state.explicit_options);
-qobject_unref(bs_entry->state.options);
 }
 
 cleanup:
-QTAILQ_FOREACH_SAFE(bs_entry, bs_queue, entry, next) {
-g_free(bs_entry);
-}
-g_free(bs_queue);
+bdrv_reopen_queue_free(bs_queue);
 
 return ret;
 }
@@ -4573,6 +4581,8 @@ static void bdrv_reopen_commit(BDRVReopenState 
*reopen_state)
 /* set BDS specific flags now */
 qobject_unref(bs->explicit_options);
 qobject_unref(bs->options);
+qobject_ref(reopen_state->explicit_options);
+qobject_ref(reopen_state->options);
 
 bs->explicit_options   = reopen_state->explicit_options;
 bs->options= reopen_state->options;
-- 
2.31.1

[PULL 06/28] block/rbd: migrate from aio to coroutines

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
Message-Id: <20210702172356.11574-5-idryo...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 252 +++-
 1 file changed, 90 insertions(+), 162 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index e2028d3db5..380ad28861 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -78,22 +78,6 @@ typedef enum {
 RBD_AIO_FLUSH
 } RBDAIOCmd;
 
-typedef struct RBDAIOCB {
-BlockAIOCB common;
-int64_t ret;
-QEMUIOVector *qiov;
-RBDAIOCmd cmd;
-int error;
-struct BDRVRBDState *s;
-} RBDAIOCB;
-
-typedef struct RADOSCB {
-RBDAIOCB *acb;
-struct BDRVRBDState *s;
-int64_t size;
-int64_t ret;
-} RADOSCB;
-
 typedef struct BDRVRBDState {
 rados_t cluster;
 rados_ioctx_t io_ctx;
@@ -105,6 +89,13 @@ typedef struct BDRVRBDState {
 uint64_t object_size;
 } BDRVRBDState;
 
+typedef struct RBDTask {
+BlockDriverState *bs;
+Coroutine *co;
+bool complete;
+int64_t ret;
+} RBDTask;
+
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
 BlockdevOptionsRbd *opts, bool cache,
 const char *keypairs, const char *secretid,
@@ -337,13 +328,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
char *keypairs_json,
 return ret;
 }
 
-static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
-{
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-}
-
 #ifdef LIBRBD_SUPPORTS_ENCRYPTION
 static int qemu_rbd_convert_luks_options(
 RbdEncryptionOptionsLUKSBase *luks_opts,
@@ -733,46 +717,6 @@ exit:
 return ret;
 }
 
-/*
- * This aio completion is being called from rbd_finish_bh() and runs in qemu
- * BH context.
- */
-static void qemu_rbd_complete_aio(RADOSCB *rcb)
-{
-RBDAIOCB *acb = rcb->acb;
-int64_t r;
-
-r = rcb->ret;
-
-if (acb->cmd != RBD_AIO_READ) {
-if (r < 0) {
-acb->ret = r;
-acb->error = 1;
-} else if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else {
-if (r < 0) {
-qemu_rbd_memset(rcb, 0);
-acb->ret = r;
-acb->error = 1;
-} else if (r < rcb->size) {
-qemu_rbd_memset(rcb, r);
-if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else if (!acb->error) {
-acb->ret = r;
-}
-}
-
-g_free(rcb);
-
-acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-
-qemu_aio_unref(acb);
-}
-
 static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
 {
 const char **vals;
@@ -1122,89 +1066,59 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
uint64_t size)
 return 0;
 }
 
-static const AIOCBInfo rbd_aiocb_info = {
-.aiocb_size = sizeof(RBDAIOCB),
-};
-
-static void rbd_finish_bh(void *opaque)
+static void qemu_rbd_finish_bh(void *opaque)
 {
-RADOSCB *rcb = opaque;
-qemu_rbd_complete_aio(rcb);
+RBDTask *task = opaque;
+task->complete = 1;
+aio_co_wake(task->co);
 }
 
 /*
- * This is the callback function for rbd_aio_read and _write
+ * This is the completion callback function for all rbd aio calls
+ * started from qemu_rbd_start_co().
  *
  * Note: this function is being called from a non qemu thread so
  * we need to be careful about what we do here. Generally we only
  * schedule a BH, and do the rest of the io completion handling
- * from rbd_finish_bh() which runs in a qemu context.
+ * from qemu_rbd_finish_bh() which runs in a qemu context.
  */
-static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
+static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
-RBDAIOCB *acb = rcb->acb;
-
-rcb->ret = rbd_aio_get_return_value(c);
+task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-
-replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
- rbd_finish_bh, rcb);
+aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+qemu_rbd_finish_bh, task);
 }
 
-static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
- int64_t off,
- QEMUIOVector *qiov,
- int64_t size,
- BlockCompletionFunc *cb,
- void *opaque,
- RBDAIOCmd cmd)
+static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
+  uint64_t offset,
+  uint64_t bytes,
+  QEMUIOVector *qiov,
+  int flags,
+  RBDAIOCmd cmd)
 {
-

[PULL 05/28] block/rbd: update s->image_size in qemu_rbd_getlength

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

While at it just call rbd_get_size and avoid rbd_image_info_t.

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
Message-Id: <20210702172356.11574-4-idryo...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 1ebf8f7e48..e2028d3db5 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1304,15 +1304,14 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
 static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
 int r;
 
-r = rbd_stat(s->image, , sizeof(info));
+r = rbd_get_size(s->image, >image_size);
 if (r < 0) {
 return r;
 }
 
-return info.size;
+return s->image_size;
 }
 
 static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
-- 
2.31.1

[PULL 03/28] block/rbd: bump librbd requirement to luminous release

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

Ceph Luminous (version 12.2.z) is almost 4 years old at this point.
Bump the requirement to get rid of the ifdef'ry in the code.
Qemu 6.1 dropped the support for RHEL-7 which was the last supported
OS that required an older librbd.

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
Message-Id: <20210702172356.11574-2-idryo...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 120 
 meson.build |   7 ++-
 2 files changed, 13 insertions(+), 114 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f51adb3646..b4b928bbb9 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -55,24 +55,10 @@
  * leading "\".
  */
 
-/* rbd_aio_discard added in 0.1.2 */
-#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
-#define LIBRBD_SUPPORTS_DISCARD
-#else
-#undef LIBRBD_SUPPORTS_DISCARD
-#endif
-
 #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
 
 #define RBD_MAX_SNAPS 100
 
-/* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
-#ifdef LIBRBD_SUPPORTS_IOVEC
-#define LIBRBD_USE_IOVEC 1
-#else
-#define LIBRBD_USE_IOVEC 0
-#endif
-
 #define RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN 8
 
 static const char rbd_luks_header_verification[
@@ -96,7 +82,6 @@ typedef struct RBDAIOCB {
 BlockAIOCB common;
 int64_t ret;
 QEMUIOVector *qiov;
-char *bounce;
 RBDAIOCmd cmd;
 int error;
 struct BDRVRBDState *s;
@@ -106,7 +91,6 @@ typedef struct RADOSCB {
 RBDAIOCB *acb;
 struct BDRVRBDState *s;
 int64_t size;
-char *buf;
 int64_t ret;
 } RADOSCB;
 
@@ -354,13 +338,9 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
char *keypairs_json,
 
 static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
 {
-if (LIBRBD_USE_IOVEC) {
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-} else {
-memset(rcb->buf + offs, 0, rcb->size - offs);
-}
+RBDAIOCB *acb = rcb->acb;
+iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
+   acb->qiov->size - offs);
 }
 
 #ifdef LIBRBD_SUPPORTS_ENCRYPTION
@@ -787,13 +767,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 
 g_free(rcb);
 
-if (!LIBRBD_USE_IOVEC) {
-if (acb->cmd == RBD_AIO_READ) {
-qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
-}
-qemu_vfree(acb->bounce);
-}
-
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 
 qemu_aio_unref(acb);
@@ -1174,28 +1147,6 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB 
*rcb)
  rbd_finish_bh, rcb);
 }
 
-static int rbd_aio_discard_wrapper(rbd_image_t image,
-   uint64_t off,
-   uint64_t len,
-   rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_DISCARD
-return rbd_aio_discard(image, off, len, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
-static int rbd_aio_flush_wrapper(rbd_image_t image,
- rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_AIO_FLUSH
-return rbd_aio_flush(image, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
 static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
  int64_t off,
  QEMUIOVector *qiov,
@@ -1218,21 +1169,6 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 
 rcb = g_new(RADOSCB, 1);
 
-if (!LIBRBD_USE_IOVEC) {
-if (cmd == RBD_AIO_DISCARD || cmd == RBD_AIO_FLUSH) {
-acb->bounce = NULL;
-} else {
-acb->bounce = qemu_try_blockalign(bs, qiov->size);
-if (acb->bounce == NULL) {
-goto failed;
-}
-}
-if (cmd == RBD_AIO_WRITE) {
-qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-}
-rcb->buf = acb->bounce;
-}
-
 acb->ret = 0;
 acb->error = 0;
 acb->s = s;
@@ -1246,7 +1182,7 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 }
 
 switch (cmd) {
-case RBD_AIO_WRITE: {
+case RBD_AIO_WRITE:
 /*
  * RBD APIs don't allow us to write more than actual size, so in order
  * to support growing images, we resize the image before write
@@ -1258,25 +1194,16 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 goto failed_completion;
 }
 }
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_write(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
 break;
-}
 case RBD_AIO_READ:
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_read(s->image, off, size,

[PULL 15/28] iotests/fuse-allow-other: Test allow-other

2021-07-09 Thread Kevin Wolf

From: Max Reitz 

Signed-off-by: Max Reitz 
Message-Id: <20210625142317.271673-7-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/tests/fuse-allow-other | 168 ++
 tests/qemu-iotests/tests/fuse-allow-other.out |  88 +
 2 files changed, 256 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/fuse-allow-other
 create mode 100644 tests/qemu-iotests/tests/fuse-allow-other.out

diff --git a/tests/qemu-iotests/tests/fuse-allow-other 
b/tests/qemu-iotests/tests/fuse-allow-other
new file mode 100755
index 00..19f494aefb
--- /dev/null
+++ b/tests/qemu-iotests/tests/fuse-allow-other
@@ -0,0 +1,168 @@
+#!/usr/bin/env bash
+# group: rw
+#
+# Test FUSE exports' allow-other option
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+seq=$(basename "$0")
+echo "QA output created by $seq"
+
+status=1   # failure is the default!
+
+_cleanup()
+{
+_cleanup_qemu
+_cleanup_test_img
+rm -f "$EXT_MP"
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ../common.rc
+. ../common.filter
+. ../common.qemu
+
+_supported_fmt generic
+
+_supported_proto file # We create the FUSE export manually
+
+sudo -n -u nobody true || \
+_notrun 'Password-less sudo as nobody required to test allow_other'
+
+# $1: Export ID
+# $2: Options (beyond the node-name and ID)
+# $3: Expected return value (defaults to 'return')
+# $4: Node to export (defaults to 'node-format')
+fuse_export_add()
+{
+allow_other_not_supported='option allow_other only allowed if'
+
+output=$(
+success_or_failure=yes _send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'block-export-add',
+  'arguments': {
+  'type': 'fuse',
+  'id': '$1',
+  'node-name': '${4:-node-format}',
+  $2
+  } }" \
+"${3:-return}" \
+"$allow_other_not_supported" \
+| _filter_imgfmt
+)
+
+if echo "$output" | grep -q "$allow_other_not_supported"; then
+# Shut down qemu gracefully so it can unmount the export
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'quit'}" \
+'return'
+
+wait=yes _cleanup_qemu
+
+_notrun "allow_other not supported"
+fi
+
+echo "$output"
+}
+
+EXT_MP="$TEST_DIR/fuse-export"
+
+_make_test_img 64k
+touch "$EXT_MP"
+
+echo
+echo '=== Test permissions ==='
+
+# $1: allow-other value ('on'/'off'/'auto')
+run_permission_test()
+{
+_launch_qemu \
+-blockdev \
+
"$IMGFMT,node-name=node-format,file.driver=file,file.filename=$TEST_IMG"
+
+_send_qemu_cmd $QEMU_HANDLE \
+"{'execute': 'qmp_capabilities'}" \
+'return'
+
+fuse_export_add 'export' \
+"'mountpoint': '$EXT_MP',
+ 'allow-other': '$1'"
+
+# Should always work
+echo '(Removing all permissions)'
+chmod 000 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+# Should always work
+echo '(Granting u+r)'
+chmod u+r "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+# Should only work with allow-other: Otherwise, no permissions can be
+# granted to the group or others
+echo '(Granting read permissions for everyone)'
+chmod 444 "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-chmod: %a' "$EXT_MP"
+
+echo 'Doing operations as nobody:'
+# Change to TEST_DIR, so nobody will not have to attempt a lookup
+pushd "$TEST_DIR" >/dev/null
+
+# This is already prevented by the permissions (without allow-other, FUSE
+# exports always have o-r), but test it anyway
+sudo -n -u nobody cat fuse-export >/dev/null
+
+# If the only problem were the lack of permissions, we should still be able
+# to stat the export as nobody; it should not work without allow-other,
+# though
+sudo -n -u nobody \
+stat -c 'Permissions seen by nobody: %a' fuse-export 2>&1 \
+| _filter_imgfmt
+
+# To prove the point, revoke read permissions for others and try again
+chmod o-r fuse-export 2>&1 | _filter_testdir | _filter_imgfmt
+
+# Should fail
+sudo -n -u nobody cat fuse-export >/dev/null
+# Should work

[PULL 14/28] iotests/308: Test +w on read-only FUSE exports

2021-07-09 Thread Kevin Wolf

From: Max Reitz 

Test that +w on read-only FUSE exports returns an EROFS error.  u+x on
the other hand should work.  (There is no special reason to choose u+x
here, it simply is like +w another flag that is not set by default.)

Signed-off-by: Max Reitz 
Message-Id: <20210625142317.271673-6-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/308 | 11 +++
 tests/qemu-iotests/308.out |  4 
 2 files changed, 15 insertions(+)

diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index d13a9a969c..6b386bd523 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -170,6 +170,17 @@ fuse_export_add 'export-mp' "'mountpoint': '$EXT_MP'"
 # Check that the export presents the same data as the original image
 $QEMU_IMG compare -f raw -F $IMGFMT -U "$EXT_MP" "$TEST_IMG"
 
+# Some quick chmod tests
+stat -c 'Permissions pre-chmod: %a' "$EXT_MP"
+
+# Verify that we cannot set +w
+chmod u+w "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-+w: %a' "$EXT_MP"
+
+# But that we can set, say, +x (if we are so inclined)
+chmod u+x "$EXT_MP" 2>&1 | _filter_testdir | _filter_imgfmt
+stat -c 'Permissions post-+x: %a' "$EXT_MP"
+
 echo
 echo '=== Mount over existing file ==='
 
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 0e9420645f..fc47bb11a2 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -50,6 +50,10 @@ wrote 67108864/67108864 bytes at offset 0
   } }
 {"return": {}}
 Images are identical.
+Permissions pre-chmod: 400
+chmod: changing permissions of 'TEST_DIR/t.IMGFMT.fuse': Read-only file system
+Permissions post-+w: 400
+Permissions post-+x: 500
 
 === Mount over existing file ===
 {'execute': 'block-export-add',
-- 
2.31.1

[PULL 12/28] export/fuse: Give SET_ATTR_SIZE its own branch

2021-07-09 Thread Kevin Wolf

From: Max Reitz 

In order to support changing other attributes than the file size in
fuse_setattr(), we have to give each its own independent branch.  This
also applies to the only attribute we do support right now.

Signed-off-by: Max Reitz 
Reviewed-by: Kevin Wolf 
Message-Id: <20210625142317.271673-4-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/export/fuse.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 4068250241..26ad644cd7 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -417,20 +417,22 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t 
inode, struct stat *statbuf,
 FuseExport *exp = fuse_req_userdata(req);
 int ret;
 
-if (!exp->writable) {
-fuse_reply_err(req, EACCES);
-return;
-}
-
 if (to_set & ~FUSE_SET_ATTR_SIZE) {
 fuse_reply_err(req, ENOTSUP);
 return;
 }
 
-ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
-if (ret < 0) {
-fuse_reply_err(req, -ret);
-return;
+if (to_set & FUSE_SET_ATTR_SIZE) {
+if (!exp->writable) {
+fuse_reply_err(req, EACCES);
+return;
+}
+
+ret = fuse_do_truncate(exp, statbuf->st_size, true, PREALLOC_MODE_OFF);
+if (ret < 0) {
+fuse_reply_err(req, -ret);
+return;
+}
 }
 
 fuse_getattr(req, inode, fi);
-- 
2.31.1

[PULL 08/28] block/rbd: drop qemu_rbd_refresh_limits

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

librbd supports 1 byte alignment for all aio operations.

Currently, there is no API call to query limits from the Ceph
ObjectStore backend.  So drop the bdrv_refresh_limits completely
until there is such an API call.

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
Message-Id: <20210702172356.11574-7-idryo...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 3152bc8ba0..01a7b94d62 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -240,14 +240,6 @@ done:
 return;
 }
 
-
-static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
-{
-/* XXX Does RBD support AIO on less than 512-byte alignment? */
-bs->bl.request_alignment = 512;
-}
-
-
 static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
  Error **errp)
 {
@@ -1482,7 +1474,6 @@ static BlockDriver bdrv_rbd = {
 .format_name= "rbd",
 .instance_size  = sizeof(BDRVRBDState),
 .bdrv_parse_filename= qemu_rbd_parse_filename,
-.bdrv_refresh_limits= qemu_rbd_refresh_limits,
 .bdrv_file_open = qemu_rbd_open,
 .bdrv_close = qemu_rbd_close,
 .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
-- 
2.31.1

[PULL 13/28] export/fuse: Let permissions be adjustable

2021-07-09 Thread Kevin Wolf

From: Max Reitz 

Allow changing the file mode, UID, and GID through SETATTR.

Without allow_other, UID and GID are not allowed to be changed, because
it would not make sense.  Also, changing group or others' permissions
is not allowed either.

For read-only exports, +w cannot be set.

Signed-off-by: Max Reitz 
Message-Id: <20210625142317.271673-5-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/export/fuse.c | 73 ++---
 1 file changed, 62 insertions(+), 11 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 26ad644cd7..ada9e263eb 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -48,6 +48,10 @@ typedef struct FuseExport {
 bool growable;
 /* Whether allow_other was used as a mount option or not */
 bool allow_other;
+
+mode_t st_mode;
+uid_t st_uid;
+gid_t st_gid;
 } FuseExport;
 
 static GHashTable *exports;
@@ -125,6 +129,13 @@ static int fuse_export_create(BlockExport *blk_exp,
 args->allow_other = FUSE_EXPORT_ALLOW_OTHER_AUTO;
 }
 
+exp->st_mode = S_IFREG | S_IRUSR;
+if (exp->writable) {
+exp->st_mode |= S_IWUSR;
+}
+exp->st_uid = getuid();
+exp->st_gid = getgid();
+
 if (args->allow_other == FUSE_EXPORT_ALLOW_OTHER_AUTO) {
 /* Ignore errors on our first attempt */
 ret = setup_fuse_export(exp, args->mountpoint, true, NULL);
@@ -338,7 +349,6 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 int64_t length, allocated_blocks;
 time_t now = time(NULL);
 FuseExport *exp = fuse_req_userdata(req);
-mode_t mode;
 
 length = blk_getlength(exp->common.blk);
 if (length < 0) {
@@ -353,17 +363,12 @@ static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
 allocated_blocks = DIV_ROUND_UP(allocated_blocks, 512);
 }
 
-mode = S_IFREG | S_IRUSR;
-if (exp->writable) {
-mode |= S_IWUSR;
-}
-
 statbuf = (struct stat) {
 .st_ino = inode,
-.st_mode= mode,
+.st_mode= exp->st_mode,
 .st_nlink   = 1,
-.st_uid = getuid(),
-.st_gid = getgid(),
+.st_uid = exp->st_uid,
+.st_gid = exp->st_gid,
 .st_size= length,
 .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
 .st_blocks  = allocated_blocks,
@@ -409,19 +414,52 @@ static int fuse_do_truncate(const FuseExport *exp, 
int64_t size,
 }
 
 /**
- * Let clients set file attributes.  Only resizing is supported.
+ * Let clients set file attributes.  Only resizing and changing
+ * permissions (st_mode, st_uid, st_gid) is allowed.
+ * Changing permissions is only allowed as far as it will actually
+ * permit access: Read-only exports cannot be given +w, and exports
+ * without allow_other cannot be given a different UID or GID, and
+ * they cannot be given non-owner access.
  */
 static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat 
*statbuf,
  int to_set, struct fuse_file_info *fi)
 {
 FuseExport *exp = fuse_req_userdata(req);
+int supported_attrs;
 int ret;
 
-if (to_set & ~FUSE_SET_ATTR_SIZE) {
+supported_attrs = FUSE_SET_ATTR_SIZE | FUSE_SET_ATTR_MODE;
+if (exp->allow_other) {
+supported_attrs |= FUSE_SET_ATTR_UID | FUSE_SET_ATTR_GID;
+}
+
+if (to_set & ~supported_attrs) {
 fuse_reply_err(req, ENOTSUP);
 return;
 }
 
+/* Do some argument checks first before committing to anything */
+if (to_set & FUSE_SET_ATTR_MODE) {
+/*
+ * Without allow_other, non-owners can never access the export, so do
+ * not allow setting permissions for them
+ */
+if (!exp->allow_other &&
+(statbuf->st_mode & (S_IRWXG | S_IRWXO)) != 0)
+{
+fuse_reply_err(req, EPERM);
+return;
+}
+
+/* +w for read-only exports makes no sense, disallow it */
+if (!exp->writable &&
+(statbuf->st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) != 0)
+{
+fuse_reply_err(req, EROFS);
+return;
+}
+}
+
 if (to_set & FUSE_SET_ATTR_SIZE) {
 if (!exp->writable) {
 fuse_reply_err(req, EACCES);
@@ -435,6 +473,19 @@ static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, 
struct stat *statbuf,
 }
 }
 
+if (to_set & FUSE_SET_ATTR_MODE) {
+/* Ignore FUSE-supplied file type, only change the mode */
+exp->st_mode = (statbuf->st_mode & 0) | S_IFREG;
+}
+
+if (to_set & FUSE_SET_ATTR_UID) {
+exp->st_uid = statbuf->st_uid;
+}
+
+if (to_set & FUSE_SET_ATTR_GID) {
+exp->st_gid = statbuf->st_gid;
+}
+
 fuse_getattr(req, inode, fi);
 }
 
-- 
2.31.1

[PULL 10/28] export/fuse: Pass default_permissions for mount

2021-07-09 Thread Kevin Wolf

From: Max Reitz 

We do not do any permission checks in fuse_open(), so let the kernel do
them.  We already let fuse_getattr() report the proper UNIX permissions,
so this should work the way we want.

This causes a change in 308's reference output, because now opening a
non-writable export with O_RDWR fails already, instead of only actually
attempting to write to it.  (That is an improvement.)

Signed-off-by: Max Reitz 
Message-Id: <20210625142317.271673-2-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 block/export/fuse.c| 8 ++--
 tests/qemu-iotests/308 | 3 ++-
 tests/qemu-iotests/308.out | 2 +-
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 38f74c94da..d0b88e8f80 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -153,8 +153,12 @@ static int setup_fuse_export(FuseExport *exp, const char 
*mountpoint,
 struct fuse_args fuse_args;
 int ret;
 
-/* Needs to match what fuse_init() sets.  Only max_read must be supplied. 
*/
-mount_opts = g_strdup_printf("max_read=%zu", FUSE_MAX_BOUNCE_BYTES);
+/*
+ * max_read needs to match what fuse_init() sets.
+ * max_write need not be supplied.
+ */
+mount_opts = g_strdup_printf("max_read=%zu,default_permissions",
+ FUSE_MAX_BOUNCE_BYTES);
 
 fuse_argv[0] = ""; /* Dummy program name */
 fuse_argv[1] = "-o";
diff --git a/tests/qemu-iotests/308 b/tests/qemu-iotests/308
index f122065d0f..11c28a75f2 100755
--- a/tests/qemu-iotests/308
+++ b/tests/qemu-iotests/308
@@ -215,7 +215,8 @@ echo '=== Writable export ==='
 fuse_export_add 'export-mp' "'mountpoint': '$EXT_MP', 'writable': true"
 
 # Check that writing to the read-only export fails
-$QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -f raw -c 'write -P 42 1M 64k' "$TEST_IMG" 2>&1 \
+| _filter_qemu_io | _filter_testdir | _filter_imgfmt
 
 # But here it should work
 $QEMU_IO -f raw -c 'write -P 42 1M 64k' "$EXT_MP" | _filter_qemu_io
diff --git a/tests/qemu-iotests/308.out b/tests/qemu-iotests/308.out
index 466e7e0267..0e9420645f 100644
--- a/tests/qemu-iotests/308.out
+++ b/tests/qemu-iotests/308.out
@@ -91,7 +91,7 @@ virtual size: 0 B (0 bytes)
   'mountpoint': 'TEST_DIR/t.IMGFMT.fuse', 'writable': true
   } }
 {"return": {}}
-write failed: Permission denied
+qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open 
'TEST_DIR/t.IMGFMT': Permission denied
 wrote 65536/65536 bytes at offset 1048576
 64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote 65536/65536 bytes at offset 1048576
-- 
2.31.1

[PULL 09/28] util/uri: do not check argument of uri_free()

2021-07-09 Thread Kevin Wolf

From: Heinrich Schuchardt 

uri_free() checks if its argument is NULL in uri_clean() and g_free().
There is no need to check the argument before the call.

Signed-off-by: Heinrich Schuchardt 
Message-Id: <20210629063602.4239-1-xypron.g...@gmx.de>
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Richard W.M. Jones 
Signed-off-by: Kevin Wolf 
---
 block/nfs.c |  4 +---
 block/ssh.c |  4 +---
 util/uri.c  | 22 ++
 3 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/block/nfs.c b/block/nfs.c
index 7dff64f489..9aeaefb364 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -147,9 +147,7 @@ out:
 if (qp) {
 query_params_free(qp);
 }
-if (uri) {
-uri_free(uri);
-}
+uri_free(uri);
 return ret;
 }
 
diff --git a/block/ssh.c b/block/ssh.c
index d008caf059..e0fbb4934b 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -237,9 +237,7 @@ static int parse_uri(const char *filename, QDict *options, 
Error **errp)
 return 0;
 
  err:
-if (uri) {
-  uri_free(uri);
-}
+uri_free(uri);
 return -EINVAL;
 }
 
diff --git a/util/uri.c b/util/uri.c
index 8bdef84120..ff72c6005f 100644
--- a/util/uri.c
+++ b/util/uri.c
@@ -1340,7 +1340,7 @@ static void uri_clean(URI *uri)
 
 /**
  * uri_free:
- * @uri:  pointer to an URI
+ * @uri:  pointer to an URI, NULL is ignored
  *
  * Free up the URI struct
  */
@@ -1939,15 +1939,9 @@ step_7:
 val = uri_to_string(res);
 
 done:
-if (ref != NULL) {
-uri_free(ref);
-}
-if (bas != NULL) {
-uri_free(bas);
-}
-if (res != NULL) {
-uri_free(res);
-}
+uri_free(ref);
+uri_free(bas);
+uri_free(res);
 return val;
 }
 
@@ -2190,12 +2184,8 @@ done:
 if (remove_path != 0) {
 ref->path = NULL;
 }
-if (ref != NULL) {
-uri_free(ref);
-}
-if (bas != NULL) {
-uri_free(bas);
-}
+uri_free(ref);
+uri_free(bas);
 
 return val;
 }
-- 
2.31.1

[PULL 11/28] export/fuse: Add allow-other option

2021-07-09 Thread Kevin Wolf

From: Max Reitz 

Without the allow_other mount option, no user (not even root) but the
one who started qemu/the storage daemon can access the export.  Allow
users to configure the export such that such accesses are possible.

While allow_other is probably what users want, we cannot make it an
unconditional default, because passing it is only possible (for non-root
users) if the global fuse.conf configuration file allows it.  Thus, the
default is an 'auto' mode, in which we first try with allow_other, and
then fall back to without.

FuseExport.allow_other reports whether allow_other was actually used as
a mount option or not.  Currently, this information is not used, but a
future patch will let this field decide whether e.g. an export's UID and
GID can be changed through chmod.

One notable thing about 'auto' mode is that libfuse may print error
messages directly to stderr, and so may fusermount (which it executes).
Our export code cannot really filter or hide them.  Therefore, if 'auto'
fails its first attempt and has to fall back, fusermount will print an
error message that mounting with allow_other failed.

This behavior necessitates a change to iotest 308, namely we need to
filter out this error message (because if the first attempt at mounting
with allow_other succeeds, there will be no such message).

Furthermore, common.rc's _make_test_img should use allow-other=off for
FUSE exports, because iotests generally do not need to access images
from other users, so allow-other=on or allow-other=auto have no
advantage.  OTOH, allow-other=on will not work on systems where
user_allow_other is disabled, and with allow-other=auto, we get said
error message that we would need to filter out again.  Just disabling
allow-other is simplest.

Signed-off-by: Max Reitz 
Message-Id: <20210625142317.271673-3-mre...@redhat.com>
Signed-off-by: Kevin Wolf 
---
 qapi/block-export.json   | 33 -
 block/export/fuse.c  | 28 +++-
 tests/qemu-iotests/308   |  6 +-
 tests/qemu-iotests/common.rc |  6 +-
 4 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/qapi/block-export.json b/qapi/block-export.json
index e819e70cac..0ed63442a8 100644
--- a/qapi/block-export.json
+++ b/qapi/block-export.json
@@ -120,6 +120,23 @@
'*logical-block-size': 'size',
 '*num-queues': 'uint16'} }
 
+##
+# @FuseExportAllowOther:
+#
+# Possible allow_other modes for FUSE exports.
+#
+# @off: Do not pass allow_other as a mount option.
+#
+# @on: Pass allow_other as a mount option.
+#
+# @auto: Try mounting with allow_other first, and if that fails, retry
+#without allow_other.
+#
+# Since: 6.1
+##
+{ 'enum': 'FuseExportAllowOther',
+  'data': ['off', 'on', 'auto'] }
+
 ##
 # @BlockExportOptionsFuse:
 #
@@ -132,11 +149,25 @@
 # @growable: Whether writes beyond the EOF should grow the block node
 #accordingly. (default: false)
 #
+# @allow-other: If this is off, only qemu's user is allowed access to
+#   this export.  That cannot be changed even with chmod or
+#   chown.
+#   Enabling this option will allow other users access to
+#   the export with the FUSE mount option "allow_other".
+#   Note that using allow_other as a non-root user requires
+#   user_allow_other to be enabled in the global fuse.conf
+#   configuration file.
+#   In auto mode (the default), the FUSE export driver will
+#   first attempt to mount the export with allow_other, and
+#   if that fails, try again without.
+#   (since 6.1; default: auto)
+#
 # Since: 6.0
 ##
 { 'struct': 'BlockExportOptionsFuse',
   'data': { 'mountpoint': 'str',
-'*growable': 'bool' },
+'*growable': 'bool',
+'*allow-other': 'FuseExportAllowOther' },
   'if': 'defined(CONFIG_FUSE)' }
 
 ##
diff --git a/block/export/fuse.c b/block/export/fuse.c
index d0b88e8f80..4068250241 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -46,6 +46,8 @@ typedef struct FuseExport {
 char *mountpoint;
 bool writable;
 bool growable;
+/* Whether allow_other was used as a mount option or not */
+bool allow_other;
 } FuseExport;
 
 static GHashTable *exports;
@@ -57,7 +59,7 @@ static void fuse_export_delete(BlockExport *exp);
 static void init_exports_table(void);
 
 static int setup_fuse_export(FuseExport *exp, const char *mountpoint,
- Error **errp);
+ bool allow_other, Error **errp);
 static void read_from_fuse_export(void *opaque);
 
 static bool is_regular_file(const char *path, Error **errp);
@@ -118,7 +120,22 @@ static int fuse_export_create(BlockExport *blk_exp,
 exp->writable = blk_exp_args->writable;
 exp->growable = args->growable;
 
-ret = setup_fuse_export(exp, args->mountpoint, errp);
+/* set default */
+if

[PULL 02/28] block/rbd: Add support for rbd image encryption

2021-07-09 Thread Kevin Wolf

From: Or Ozeri 

Starting from ceph Pacific, RBD has built-in support for image-level encryption.
Currently supported formats are LUKS version 1 and 2.

There are 2 new relevant librbd APIs for controlling encryption, both expect an
open image context:

rbd_encryption_format: formats an image (i.e. writes the LUKS header)
rbd_encryption_load: loads encryptor/decryptor to the image IO stack

This commit extends the qemu rbd driver API to support the above.

Signed-off-by: Or Ozeri 
Message-Id: <20210627114635.39326-1-...@il.ibm.com>
Reviewed-by: Ilya Dryomov 
Signed-off-by: Kevin Wolf 
---
 qapi/block-core.json | 110 -
 block/rbd.c  | 361 ++-
 2 files changed, 465 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 3114ba69bb..4a46552816 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -127,6 +127,18 @@
   'extents': ['ImageInfo']
   } }
 
+##
+# @ImageInfoSpecificRbd:
+#
+# @encryption-format: Image encryption format
+#
+# Since: 6.1
+##
+{ 'struct': 'ImageInfoSpecificRbd',
+  'data': {
+  '*encryption-format': 'RbdImageEncryptionFormat'
+  } }
+
 ##
 # @ImageInfoSpecific:
 #
@@ -141,7 +153,8 @@
   # If we need to add block driver specific parameters for
   # LUKS in future, then we'll subclass QCryptoBlockInfoLUKS
   # to define a ImageInfoSpecificLUKS
-  'luks': 'QCryptoBlockInfoLUKS'
+  'luks': 'QCryptoBlockInfoLUKS',
+  'rbd': 'ImageInfoSpecificRbd'
   } }
 
 ##
@@ -3613,6 +3626,94 @@
 { 'enum': 'RbdAuthMode',
   'data': [ 'cephx', 'none' ] }
 
+##
+# @RbdImageEncryptionFormat:
+#
+# Since: 6.1
+##
+{ 'enum': 'RbdImageEncryptionFormat',
+  'data': [ 'luks', 'luks2' ] }
+
+##
+# @RbdEncryptionOptionsLUKSBase:
+#
+# @key-secret: ID of a QCryptoSecret object providing a passphrase
+#  for unlocking the encryption
+#
+# Since: 6.1
+##
+{ 'struct': 'RbdEncryptionOptionsLUKSBase',
+  'data': { 'key-secret': 'str' } }
+
+##
+# @RbdEncryptionCreateOptionsLUKSBase:
+#
+# @cipher-alg: The encryption algorithm
+#
+# Since: 6.1
+##
+{ 'struct': 'RbdEncryptionCreateOptionsLUKSBase',
+  'base': 'RbdEncryptionOptionsLUKSBase',
+  'data': { '*cipher-alg': 'QCryptoCipherAlgorithm' } }
+
+##
+# @RbdEncryptionOptionsLUKS:
+#
+# Since: 6.1
+##
+{ 'struct': 'RbdEncryptionOptionsLUKS',
+  'base': 'RbdEncryptionOptionsLUKSBase',
+  'data': { } }
+
+##
+# @RbdEncryptionOptionsLUKS2:
+#
+# Since: 6.1
+##
+{ 'struct': 'RbdEncryptionOptionsLUKS2',
+  'base': 'RbdEncryptionOptionsLUKSBase',
+  'data': { } }
+
+##
+# @RbdEncryptionCreateOptionsLUKS:
+#
+# Since: 6.1
+##
+{ 'struct': 'RbdEncryptionCreateOptionsLUKS',
+  'base': 'RbdEncryptionCreateOptionsLUKSBase',
+  'data': { } }
+
+##
+# @RbdEncryptionCreateOptionsLUKS2:
+#
+# Since: 6.1
+##
+{ 'struct': 'RbdEncryptionCreateOptionsLUKS2',
+  'base': 'RbdEncryptionCreateOptionsLUKSBase',
+  'data': { } }
+
+##
+# @RbdEncryptionOptions:
+#
+# Since: 6.1
+##
+{ 'union': 'RbdEncryptionOptions',
+  'base': { 'format': 'RbdImageEncryptionFormat' },
+  'discriminator': 'format',
+  'data': { 'luks': 'RbdEncryptionOptionsLUKS',
+'luks2': 'RbdEncryptionOptionsLUKS2' } }
+
+##
+# @RbdEncryptionCreateOptions:
+#
+# Since: 6.1
+##
+{ 'union': 'RbdEncryptionCreateOptions',
+  'base': { 'format': 'RbdImageEncryptionFormat' },
+  'discriminator': 'format',
+  'data': { 'luks': 'RbdEncryptionCreateOptionsLUKS',
+'luks2': 'RbdEncryptionCreateOptionsLUKS2' } }
+
 ##
 # @BlockdevOptionsRbd:
 #
@@ -3628,6 +3729,8 @@
 #
 # @snapshot: Ceph snapshot name.
 #
+# @encrypt: Image encryption options. (Since 6.1)
+#
 # @user: Ceph id name.
 #
 # @auth-client-required: Acceptable authentication modes.
@@ -3650,6 +3753,7 @@
 'image': 'str',
 '*conf': 'str',
 '*snapshot': 'str',
+'*encrypt': 'RbdEncryptionOptions',
 '*user': 'str',
 '*auth-client-required': ['RbdAuthMode'],
 '*key-secret': 'str',
@@ -4403,13 +4507,15 @@
 #point to a snapshot.
 # @size: Size of the virtual disk in bytes
 # @cluster-size: RBD object size
+# @encrypt: Image encryption options. (Since 6.1)
 #
 # Since: 2.12
 ##
 { 'struct': 'BlockdevCreateOptionsRbd',
   'data': { 'location': 'BlockdevOptionsRbd',
 'size': 'size',
-'*cluster-size' :   'size' } }
+'*cluster-size' :   'size',
+'*encrypt' :'RbdEncryptionCreateOptions' } }
 
 ##
 # @BlockdevVmdkSubformat:
diff --git a/block/rbd.c b/block/rbd.c
index 26f64cce7c..f51adb3646 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -73,6 +73,18 @@
 #define LIBRBD_USE_IOVEC 0
 #endif
 
+#define RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN 8
+
+static const char rbd_luks_header_verification[
+RBD_ENCRYPTION_LUKS_HEADER_VERIFICATION_LEN] = {
+'L', 'U', 'K', 'S', 0xBA, 0xBE, 0, 1
+};
+
+static const char rbd_luks2_header_verification[
+

[PULL 04/28] block/rbd: store object_size in BDRVRBDState

2021-07-09 Thread Kevin Wolf

From: Peter Lieven 

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
Message-Id: <20210702172356.11574-3-idryo...@gmail.com>
Signed-off-by: Kevin Wolf 
---
 block/rbd.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index b4b928bbb9..1ebf8f7e48 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
 char *snap;
 char *namespace;
 uint64_t image_size;
+uint64_t object_size;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -958,6 +959,7 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 const QDictEntry *e;
 Error *local_err = NULL;
 char *keypairs, *secretid;
+rbd_image_info_t info;
 int r;
 
 keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
@@ -1035,12 +1037,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 #endif
 }
 
-r = rbd_get_size(s->image, >image_size);
+r = rbd_stat(s->image, , sizeof(info));
 if (r < 0) {
-error_setg_errno(errp, -r, "error getting image size from %s",
+error_setg_errno(errp, -r, "error getting image info from %s",
  s->image_name);
 goto failed_post_open;
 }
+s->image_size = info.size;
+s->object_size = info.obj_size;
 
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
@@ -1253,15 +1257,7 @@ static BlockAIOCB *qemu_rbd_aio_flush(BlockDriverState 
*bs,
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
-int r;
-
-r = rbd_stat(s->image, , sizeof(info));
-if (r < 0) {
-return r;
-}
-
-bdi->cluster_size = info.obj_size;
+bdi->cluster_size = s->object_size;
 return 0;
 }
 
-- 
2.31.1

[PULL 00/28] Block layer patches

2021-07-09 Thread Kevin Wolf

The following changes since commit 9db3065c62a983286d06c207f4981408cf42184d:

  Merge remote-tracking branch 
'remotes/vivier2/tags/linux-user-for-6.1-pull-request' into staging (2021-07-08 
16:30:18 +0100)

are available in the Git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to e60edf69e2f64e818466019313517a2e6d6b63f4:

  block: Make blockdev-reopen stable API (2021-07-09 13:19:11 +0200)


Block layer patches

- Make blockdev-reopen stable
- Remove deprecated qemu-img backing file without format
- rbd: Convert to coroutines and add write zeroes support
- rbd: Updated MAINTAINERS
- export/fuse: Allow other users access to the export
- vhost-user: Fix backends without multiqueue support
- Fix drive-backup transaction endless drained section


Alberto Garcia (4):
  block: Add bdrv_reopen_queue_free()
  block: Support multiple reopening with x-blockdev-reopen
  iotests: Test reopening multiple devices at the same time
  block: Make blockdev-reopen stable API

Eric Blake (3):
  qcow2: Prohibit backing file changes in 'qemu-img amend'
  qemu-img: Require -F with -b backing image
  qemu-img: Improve error for rebase without backing format

Heinrich Schuchardt (1):
  util/uri: do not check argument of uri_free()

Ilya Dryomov (1):
  MAINTAINERS: update block/rbd.c maintainer

Kevin Wolf (3):
  vhost-user: Fix backends without multiqueue support
  qcow2: Fix dangling pointer after reopen for 'file'
  block: Acquire AioContexts during bdrv_reopen_multiple()

Max Reitz (6):
  export/fuse: Pass default_permissions for mount
  export/fuse: Add allow-other option
  export/fuse: Give SET_ATTR_SIZE its own branch
  export/fuse: Let permissions be adjustable
  iotests/308: Test +w on read-only FUSE exports
  iotests/fuse-allow-other: Test allow-other

Or Ozeri (1):
  block/rbd: Add support for rbd image encryption

Peter Lieven (8):
  block/rbd: bump librbd requirement to luminous release
  block/rbd: store object_size in BDRVRBDState
  block/rbd: update s->image_size in qemu_rbd_getlength
  block/rbd: migrate from aio to coroutines
  block/rbd: add write zeroes support
  block/rbd: drop qemu_rbd_refresh_limits
  block/rbd: fix type of task->complete
  MAINTAINERS: add block/rbd.c reviewer

Vladimir Sementsov-Ogievskiy (1):
  blockdev: fix drive-backup transaction endless drained section

 qapi/block-core.json   | 134 +++-
 qapi/block-export.json |  33 +-
 docs/system/deprecated.rst |  32 -
 docs/system/removed-features.rst   |  31 +
 include/block/block.h  |   3 +
 block.c| 108 +--
 block/export/fuse.c| 121 +++-
 block/nfs.c|   4 +-
 block/qcow2.c  |  42 +-
 block/rbd.c| 749 +
 block/replication.c|   7 +
 block/ssh.c|   4 +-
 blockdev.c |  77 ++-
 hw/virtio/vhost-user.c |   3 +
 qemu-img.c |   9 +-
 qemu-io-cmds.c |   7 +-
 util/uri.c |  22 +-
 MAINTAINERS|   3 +-
 meson.build|   7 +-
 tests/qemu-iotests/040 |   4 +-
 tests/qemu-iotests/041 |   6 +-
 tests/qemu-iotests/061 |   3 +
 tests/qemu-iotests/061.out |   3 +-
 tests/qemu-iotests/082.out |   6 +-
 tests/qemu-iotests/114 |  18 +-
 tests/qemu-iotests/114.out |  11 +-
 tests/qemu-iotests/155 |   9 +-
 tests/qemu-iotests/165 |   4 +-
 tests/qemu-iotests/245 |  78 ++-
 tests/qemu-iotests/245.out |   4 +-
 tests/qemu-iotests/248 |   4 +-
 tests/qemu-iotests/248.out |   2 +-
 tests/qemu-iotests/296 |  11 +-
 tests/qemu-iotests/298 |   4 +-
 tests/qemu-iotests/301 |   4 +-
 tests/qemu-iotests/301.out |  16 +-
 tests/qemu-iotests/308 |  20 +-
 tests/qemu-iotests/308.out |   6 +-
 tests/qemu-iotests/common.rc   |

[PULL 01/28] MAINTAINERS: update block/rbd.c maintainer

2021-07-09 Thread Kevin Wolf

From: Ilya Dryomov 

Jason has moved on from working on RBD and Ceph.  I'm taking over
his role upstream.

Signed-off-by: Ilya Dryomov 
Message-Id: <20210519112513.19694-1-idryo...@gmail.com>
Acked-by: Stefano Garzarella 
Signed-off-by: Kevin Wolf 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 684142e12e..86ec36a062 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3060,7 +3060,7 @@ S: Supported
 F: block/vmdk.c
 
 RBD
-M: Jason Dillaman 
+M: Ilya Dryomov 
 L: qemu-block@nongnu.org
 S: Supported
 F: block/rbd.c
-- 
2.31.1

Re: [PATCH v3 1/4] replication: Remove s->active_disk

2021-07-09 Thread Lukas Straub

On Fri, 9 Jul 2021 10:11:15 +0300
Vladimir Sementsov-Ogievskiy  wrote:

> 07.07.2021 21:15, Lukas Straub wrote:
> > s->active_disk is bs->file. Remove it and use local variables instead.
> > 
> > Signed-off-by: Lukas Straub 
> > ---
> >   block/replication.c | 38 +-
> >   1 file changed, 21 insertions(+), 17 deletions(-)
> > 
> > diff --git a/block/replication.c b/block/replication.c
> > index 52163f2d1f..50940fbe33 100644
> > --- a/block/replication.c
> > +++ b/block/replication.c
> > @@ -35,7 +35,6 @@ typedef enum {
> >   typedef struct BDRVReplicationState {
> >   ReplicationMode mode;
> >   ReplicationStage stage;
> > -BdrvChild *active_disk;
> >   BlockJob *commit_job;
> >   BdrvChild *hidden_disk;
> >   BdrvChild *secondary_disk;
> > @@ -307,11 +306,15 @@ out:
> >   return ret;
> >   }
> > 
> > -static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> > +static void secondary_do_checkpoint(BlockDriverState *bs, Error **errp)
> >   {
> > +BDRVReplicationState *s = bs->opaque;
> > +BdrvChild *active_disk;  
> 
> Why not to combine initialization into definition:
> 
> BdrvChild *active_disk = bs->file;

Ok, will fix.

> >   Error *local_err = NULL;
> >   int ret;
> > 
> > +active_disk = bs->file;
> > +
> >   if (!s->backup_job) {
> >   error_setg(errp, "Backup job was cancelled unexpectedly");
> >   return;
> > @@ -323,13 +326,13 @@ static void 
> > secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
> >   return;
> >   }
> > 
> > -if (!s->active_disk->bs->drv) {
> > +if (!active_disk->bs->drv) {
> >   error_setg(errp, "Active disk %s is ejected",
> > -   s->active_disk->bs->node_name);
> > +   active_disk->bs->node_name);
> >   return;
> >   }
> > 
> > -ret = bdrv_make_empty(s->active_disk, errp);
> > +ret = bdrv_make_empty(active_disk, errp);
> >   if (ret < 0) {
> >   return;
> >   }
> > @@ -451,6 +454,7 @@ static void replication_start(ReplicationState *rs, 
> > ReplicationMode mode,
> >   BlockDriverState *bs = rs->opaque;
> >   BDRVReplicationState *s;
> >   BlockDriverState *top_bs;
> > +BdrvChild *active_disk;
> >   int64_t active_length, hidden_length, disk_length;
> >   AioContext *aio_context;
> >   Error *local_err = NULL;
> > @@ -488,15 +492,14 @@ static void replication_start(ReplicationState *rs, 
> > ReplicationMode mode,
> >   case REPLICATION_MODE_PRIMARY:
> >   break;
> >   case REPLICATION_MODE_SECONDARY:
> > -s->active_disk = bs->file;
> > -if (!s->active_disk || !s->active_disk->bs ||
> > -!s->active_disk->bs->backing) {
> > +active_disk = bs->file;  
> 
> Here initializing active_disk only here makes sense: we consider "active 
> disk" only in secondary mode. Right?

Yes.

> > +if (!active_disk || !active_disk->bs || !active_disk->bs->backing) 
> > {
> >   error_setg(errp, "Active disk doesn't have backing file");
> >   aio_context_release(aio_context);
> >   return;
> >   }
> > 
> > -s->hidden_disk = s->active_disk->bs->backing;
> > +s->hidden_disk = active_disk->bs->backing;
> >   if (!s->hidden_disk->bs || !s->hidden_disk->bs->backing) {
> >   error_setg(errp, "Hidden disk doesn't have backing file");
> >   aio_context_release(aio_context);
> > @@ -511,7 +514,7 @@ static void replication_start(ReplicationState *rs, 
> > ReplicationMode mode,
> >   }
> > 
> >   /* verify the length */
> > -active_length = bdrv_getlength(s->active_disk->bs);
> > +active_length = bdrv_getlength(active_disk->bs);
> >   hidden_length = bdrv_getlength(s->hidden_disk->bs);
> >   disk_length = bdrv_getlength(s->secondary_disk->bs);
> >   if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
> > @@ -523,9 +526,9 @@ static void replication_start(ReplicationState *rs, 
> > ReplicationMode mode,
> >   }
> > 
> >   /* Must be true, or the bdrv_getlength() calls would have failed 
> > */
> > -assert(s->active_disk->bs->drv && s->hidden_disk->bs->drv);
> > +assert(active_disk->bs->drv && s->hidden_disk->bs->drv);
> > 
> > -if (!s->active_disk->bs->drv->bdrv_make_empty ||
> > +if (!active_disk->bs->drv->bdrv_make_empty ||
> >   !s->hidden_disk->bs->drv->bdrv_make_empty) {
> >   error_setg(errp,
> >  "Active disk or hidden disk doesn't support 
> > make_empty");
> > @@ -579,7 +582,7 @@ static void replication_start(ReplicationState *rs, 
> > ReplicationMode mode,
> >   s->stage = BLOCK_REPLICATION_RUNNING;
> > 
> >   if (s->mode == REPLICATION_MODE_SECONDARY) {
> > -secondary_do_checkpoint(s, errp);
> > +

Re: [PATCH 3/2] qemu-img: Improve error for rebase without backing format

2021-07-09 Thread Kevin Wolf

Am 08.07.2021 um 17:52 hat Eric Blake geschrieben:
> When removeing support for qemu-img being able to create backing
> chains without embedded backing formats, we caused a poor error
> message as caught by iotest 114.  Improve the situation to inform the
> user what went wrong.
> 
> Suggested-by: Kevin Wolf 
> Signed-off-by: Eric Blake 

> diff --git a/tests/qemu-iotests/114.out b/tests/qemu-iotests/114.out
> index 172454401257..016e9ce3ecfb 100644
> --- a/tests/qemu-iotests/114.out
> +++ b/tests/qemu-iotests/114.out
> @@ -14,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open 
> backing file: Unknow
>  no file open, try 'help open'
>  read 4096/4096 bytes at offset 0
>  4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -qemu-img: Could not change the backing file to 
> '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid 
> argument
> +qemu-img: Could not change the backing file to 
> '/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': backing 
> format must be specified
>  read 4096/4096 bytes at offset 0
>  4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  *** done

Wait, there is a problem in the image path...

I'm squashing in the following for patch 2, and will do the obvious
conflict resolution for this one.

Kevin

diff --git a/tests/qemu-iotests/114 b/tests/qemu-iotests/114
index 3e30b402bc..de6fd327ee 100755
--- a/tests/qemu-iotests/114
+++ b/tests/qemu-iotests/114
@@ -65,7 +65,7 @@ $QEMU_IO -c "open $TEST_IMG" -c "read 0 4k" 2>&1 | 
_filter_qemu_io | _filter_tes
 $QEMU_IO -c "open -o backing.driver=$IMGFMT $TEST_IMG" -c "read 0 4k" | 
_filter_qemu_io

 # Rebase the image, to show that backing format is required.
-$QEMU_IMG rebase -u -b "$TEST_IMG.base" "$TEST_IMG" && echo "unexpected pass"
+($QEMU_IMG rebase -u -b "$TEST_IMG.base" "$TEST_IMG" 2>&1 && echo "unexpected 
pass") | _filter_testdir
 $QEMU_IMG rebase -u -b "$TEST_IMG.base" -F $IMGFMT "$TEST_IMG"
 $QEMU_IO -c "open $TEST_IMG" -c "read 0 4k" 2>&1 | _filter_qemu_io | 
_filter_testdir

diff --git a/tests/qemu-iotests/114.out b/tests/qemu-iotests/114.out
index 1724544012..f51dd9d20a 100644
--- a/tests/qemu-iotests/114.out
+++ b/tests/qemu-iotests/114.out
@@ -14,7 +14,7 @@ qemu-io: can't open device TEST_DIR/t.qcow2: Could not open 
backing file: Unknow
 no file open, try 'help open'
 read 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-qemu-img: Could not change the backing file to 
'/home/eblake/qemu/build/tests/qemu-iotests/scratch/t.qcow2.base': Invalid 
argument
+qemu-img: Could not change the backing file to 'TEST_DIR/t.qcow2.base': 
backing format must be specified
 read 4096/4096 bytes at offset 0
 4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 *** done

Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-09 Thread Kevin Wolf

Am 08.07.2021 um 20:23 hat Peter Lieven geschrieben:
> Am 08.07.2021 um 14:18 schrieb Kevin Wolf :
> > Am 07.07.2021 um 20:13 hat Peter Lieven geschrieben:
> >>> Am 06.07.2021 um 17:25 schrieb Kevin Wolf :
> >>> Am 06.07.2021 um 16:55 hat Peter Lieven geschrieben:
>  I will have a decent look after my vacation.
> >>> 
> >>> Sounds good, thanks. Enjoy your vacation!
> >> 
> >> As I had to fire up my laptop to look into another issue anyway, I
> >> have sent two patches for updating MAINTAINERS and to fix the int vs.
> >> bool mix for task->complete.
> > 
> > I think you need to reevaluate your definition of vacation. ;-)
> 
> Lets talk about this when the kids are grown up. Sometimes sending
> patches can be quite relaxing :-)

Heh, fair enough. :-)

> > But thanks anyway.
> > 
> >> As Paolos fix (5f50be9b5) is relatively new and there are maybe other
> >> non obvious problems when removing the BH indirection and we are close
> >> to soft freeze I would leave the BH removal change for 6.2.
> > 
> > Sure, code cleanups aren't urgent.
> 
> Isn’t the indirection also a slight performance drop?

Yeah, I guess technically it is, though I doubt it's measurable.

Kevin

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Klaus Jensen

On Jul  9 10:51, Hannes Reinecke wrote:
> On 7/9/21 8:55 AM, Klaus Jensen wrote:
> > On Jul  9 08:16, Hannes Reinecke wrote:
> > > On 7/9/21 8:05 AM, Klaus Jensen wrote:
> > > > On Jul  7 17:49, Klaus Jensen wrote:
> > > > > From: Klaus Jensen 
> > > > > 
> > > > > Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
> > > > > discussed a bit back and fourth and I mentioned that the core issue 
> > > > > was
> > > > > an artifact of the parent/child relationship stemming from the qdev
> > > > > setup we have with namespaces attaching to controller through a qdev
> > > > > bus.
> > > > > 
> > > > > The gist of this series is the fourth patch "hw/nvme: fix
> > > > > controller hot
> > > > > unplugging" which basically causes namespaces to be reassigned to a 
> > > > > bus
> > > > > owned by the subsystem if the parent controller is linked to one. This
> > > > > fixes `device_del/add nvme` in such settings.
> > > > > 
> > > > > Note, that in the case that there is no subsystem involved,
> > > > > nvme devices
> > > > > can be removed from the system with `device_del`, but this *will* 
> > > > > cause
> > > > > the namespaces to be removed as well since there is no place (i.e. no
> > > > > subsystem) for them to "linger". And since this series does not add
> > > > > support for hotplugging nvme-ns devices, while an nvme device can be
> > > > > readded, no namespaces can. Support for hotplugging nvme-ns devices is
> > > > > present in [1], but I'd rather not add that since I think '-device
> > > > > nvme-ns' is already a bad design choice.
> > > > > 
> > > > > Now, I do realize that it is not "pretty" to explicitly change the
> > > > > parent bus, so I do have a an RFC patch in queue that replaces the
> > > > > subsystem and namespace devices with objects, but keeps -device shims
> > > > > available for backwards compatibility. This approach will solve the
> > > > > problems properly and should be a better model. However, I
> > > > > don't believe
> > > > > it will make it for 6.1 and I'd really like to at least fix the
> > > > > unplugging for 6.1 and this gets the job done.
> > > > > 
> > > > >  [1]: 20210511073511.32511-1-h...@suse.de
> > > > > 
> > > > > v2:
> > > > > - added R-b's by Hannes for patches 1 through 3
> > > > > - simplified "hw/nvme: fix controller hot unplugging"
> > > > > 
> > > > > Klaus Jensen (4):
> > > > >  hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
> > > > >  hw/nvme: mark nvme-subsys non-hotpluggable
> > > > >  hw/nvme: unregister controller with subsystem at exit
> > > > >  hw/nvme: fix controller hot unplugging
> > > > > 
> > > > > hw/nvme/nvme.h   | 18 +---
> > > > > hw/nvme/ctrl.c   | 14 ++--
> > > > > hw/nvme/ns.c | 55 +++-
> > > > > hw/nvme/subsys.c |  9 
> > > > > 4 files changed, 63 insertions(+), 33 deletions(-)
> > > > > 
> > > > > -- 
> > > > > 2.32.0
> > > > > 
> > > > 
> > > > Applied patches 1 through 3 to nvme-next.
> > > 
> > > So, how do we go about with patch 4?
> > > Without it this whole exercise is a bit pointless, seeing that it
> > > doesn't fix anything.
> > > 
> > 
> > Patch 1-3 are fixes we need anyway, so I thought I might as well apply
> > them :)
> > 
> > > Shall we go with that patch as an interim solution?
> > > Will you replace it with your 'object' patch?
> > > What is the plan?
> > > 
> > 
> > Yes, if acceptable, I would like to use patch 4 as an interim solution.
> > We have a bug we need to fix for 6.1, and I believe this does the job.
> > 
> Oh, yes, it does. But it's ever so slightly ugly with the reparenting stuff.
> But if that's considered an interim solution I'm fine with it.
> 

Definitely interim.

> You can add my 'Reviewed-by: Hannes Reinecke ' tag if you
> like.
> 

Thanks mate!

> > I considered changing the existing nvme-bus to be on the main system
> > bus, but then we break the existing behavior that the namespaces attach
> > to the most recently defined controller in the absence of the shared
> > parameter or an explicit bus parameter.
> > 
> Do we?

Yes, I believe so. Right now there is an implicit and documented
behavior that an nvme-ns device attaches to the most recently defined
nvme device (due to the implicit bus attachment). If we move the bus
from the controller so that the namespace plugs into the subsystem when
realized, then we break existing setups since the namespace won't know
what controller to "attach to" without an additional parameter. Right
now, this is determined by inspecting the parent of the bus it plugged
into.

> My idea was to always attach a namespace to a subsystem (and, if not
> present, create one). The controller would then 'link' to that subsystem.
> The subsystem would have a 'shared' attribute, which would determine if more
> than one controller can be 'linked' to it.
> 

Hmm, why would we want to have a subsystem restricted to one controller?
Or do you mean namespace? Because we already

Re: [PATCH 0/2] Let 'qemu-img convert --bitmaps' skip inconsistent bitmaps

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


08.07.2021 04:29, Eric Blake wrote:

This is mostly a convenience factor as one could already use 'qemu-img
info' to learn which bitmaps are broken and then 'qemu-img bitmap
--remove' to nuke them before calling 'qemu-img convert --bitmaps',
but it does have the advantage that the copied file is usable without
extra efforts and the broken bitmap is not deleted from the source
file.

Eric Blake (2):
   iotests: Improve and rename test 291 to qemu-img-bitmap
   qemu-img: Add --skip-broken for 'convert --bitmaps'

  docs/tools/qemu-img.rst   |  8 +++-
  block/dirty-bitmap.c  |  2 +-
  qemu-img.c| 20 +++-
  .../{291 => tests/qemu-img-bitmaps}   | 17 ++-
  .../{291.out => tests/qemu-img-bitmaps.out}   | 46 ++-
  5 files changed, 87 insertions(+), 6 deletions(-)
  rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (88%)
  rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (76%)



I've applied this onto your nbd branch. This doesn't apply on master.

--
Best regards,
Vladimir

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Hannes Reinecke


On 7/9/21 8:55 AM, Klaus Jensen wrote:

On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and fourth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller 
hot

unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme 
devices

can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have a an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't 
believe

it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.




Patch 1-3 are fixes we need anyway, so I thought I might as well apply 
them :)



Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?



Yes, if acceptable, I would like to use patch 4 as an interim solution. 
We have a bug we need to fix for 6.1, and I believe this does the job.


Oh, yes, it does. But it's ever so slightly ugly with the reparenting 
stuff. But if that's considered an interim solution I'm fine with it.


You can add my 'Reviewed-by: Hannes Reinecke ' tag if you 
like.


I considered changing the existing nvme-bus to be on the main system 
bus, but then we break the existing behavior that the namespaces attach 
to the most recently defined controller in the absence of the shared 
parameter or an explicit bus parameter.



Do we?
My idea was to always attach a namespace to a subsystem (and, if not 
present, create one). The controller would then 'link' to that 
subsystem. The subsystem would have a 'shared' attribute, which would 
determine if more than one controller can be 'linked' to it.


That way we change the relationship between the controller and the 
namespace, as then the namespace would be a child of the subsystem,

and the namespace would need to be detached separately from the controller.

But it fits neatly into the current device model, except the slightly 
awkward 'link' thingie.



Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
    This is the approach that I am taking right now and it works well. 
It allows many-to-many relationships and separates the life times of 
subsystems, namespaces and controllers like you mentioned.




Ah. Would like to see that path, then.

    Conceptually, I also really like that the subsystem and namespace 
are not "devices". One could argue that the namespace is comparable 
to a SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN 
actually "shows up" in the host, the nvme namespace does not.




Well, 'devices' really is an abstraction, and it really depends on what 
you declare a device is. But yes, in the QDEV sense with its strict 
inheritance the nvme topology is not a good fit, agreed.


As for SCSI: the namespace is quite comparable to a SCSI LUN; the NVMe 
controller is roughly equivalent to the 'initiator' on SCSI, and the 
subsystem would match up to the SCSI target.


The problem for NVMe is that the whole

Re: [PATCH v3 4/4] replication: Remove workaround

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


07.07.2021 21:15, Lukas Straub wrote:

Remove the workaround introduced in commit
6ecbc6c52672db5c13805735ca02784879ce8285
"replication: Avoid blk_make_empty() on read-only child".

It is not needed anymore since s->hidden_disk is guaranteed to be
writable when secondary_do_checkpoint() runs. Because replication_start(),
_do_checkpoint() and _stop() are only called by COLO migration code
and COLO-migration doesn't inactivate disks.


If look at replication_child_perm() you should also be sure that it always 
works only with RW disks..

Actually, I think that it would be correct just require BLK_PERM_WRITE in 
replication_child_perm() unconditionally. Let generic layer care about all 
these RD/WR things. In _child_perm() we can require WRITE and don't care. If 
something goes wrong and we can't get WRITE permission we should see clean 
error-out.

Opposite, if we don't require WRITE permission in some case and still do WRITE 
request, it may crash.

Still, this may be considered as a preexisting problem of 
replication_child_perm() and fixed separately.



Signed-off-by: Lukas Straub 


So, for this one commit (with probably updated commit message accordingly to my 
comments, or even rebased on fixed replication_child_perm()):

Reviewed-by: Vladimir Sementsov-Ogievskiy 



---
  block/replication.c | 12 +---
  1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index c0d4a6c264..68b46d65a8 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -348,17 +348,7 @@ static void secondary_do_checkpoint(BlockDriverState *bs, 
Error **errp)
  return;
  }

-BlockBackend *blk = blk_new(qemu_get_current_aio_context(),
-BLK_PERM_WRITE, BLK_PERM_ALL);
-blk_insert_bs(blk, s->hidden_disk->bs, _err);
-if (local_err) {
-error_propagate(errp, local_err);
-blk_unref(blk);
-return;
-}
-
-ret = blk_make_empty(blk, errp);
-blk_unref(blk);
+ret = bdrv_make_empty(s->hidden_disk, errp);
  if (ret < 0) {
  return;
  }
--
2.20.1




--
Best regards,
Vladimir

Re: [PATCH v3 3/4] replication: Properly attach children

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


07.07.2021 21:15, Lukas Straub wrote:

The replication driver needs access to the children block-nodes of
it's child so it can issue bdrv_make_empty() and bdrv_co_pwritev()
to manage the replication. However, it does this by directly copying
the BdrvChilds, which is wrong.

Fix this by properly attaching the block-nodes with
bdrv_attach_child() and requesting the required permissions.

Signed-off-by: Lukas Straub 
---
  block/replication.c | 30 +++---
  1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 74adf30f54..c0d4a6c264 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -165,7 +165,12 @@ static void replication_child_perm(BlockDriverState *bs, 
BdrvChild *c,
 uint64_t perm, uint64_t shared,
 uint64_t *nperm, uint64_t *nshared)
  {
-*nperm = BLK_PERM_CONSISTENT_READ;
+if (role & BDRV_CHILD_PRIMARY) {
+*nperm = BLK_PERM_CONSISTENT_READ;
+} else {
+*nperm = 0;
+}
+
  if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
  *nperm |= BLK_PERM_WRITE;
  }
@@ -552,8 +557,25 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  return;
  }

-s->hidden_disk = hidden_disk;
-s->secondary_disk = secondary_disk;
+bdrv_ref(hidden_disk->bs);
+s->hidden_disk = bdrv_attach_child(bs, hidden_disk->bs, "hidden disk",
+   _of_bds, BDRV_CHILD_DATA,
+   _err);
+if (local_err) {
+error_propagate(errp, local_err);
+aio_context_release(aio_context);
+return;
+}
+
+bdrv_ref(secondary_disk->bs);
+s->secondary_disk = bdrv_attach_child(bs, secondary_disk->bs,
+  "secondary disk", _of_bds,
+  BDRV_CHILD_DATA, _err);
+if (local_err) {
+error_propagate(errp, local_err);
+aio_context_release(aio_context);
+return;
+}


Now looking at this I can say that design is a bit strange:

If these children are children of replication state, we probably want something 
like bdrv_root_attach_child(), and not attach children to the active disk but 
to the state itself (like block-job children).. But than we'll need new class 
of bdrv children (child_of_replication?) and that obviously not worth the 
complexity..

So, we want new children to be children of our filter driver. Than, we probably 
should create them in repliction_open(), together with bs->file..

Still, it should work as is I think:

Reviewed-by: Vladimir Sementsov-Ogievskiy 




  /* start backup job now */
  error_setg(>blocker,
@@ -659,7 +681,9 @@ static void replication_done(void *opaque, int ret)
  if (ret == 0) {
  s->stage = BLOCK_REPLICATION_DONE;

+bdrv_unref_child(bs, s->secondary_disk);
  s->secondary_disk = NULL;
+bdrv_unref_child(bs, s->hidden_disk);
  s->hidden_disk = NULL;
  s->error = 0;
  } else {
--
2.20.1




--
Best regards,
Vladimir

Re: [PATCH v3 2/4] replication: Reduce usage of s->hidden_disk and s->secondary_disk

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


07.07.2021 21:15, Lukas Straub wrote:

In preparation for the next patch, initialize s->hidden_disk and
s->secondary_disk later and replace access to them with local variables
in the places where they aren't initialized yet.

Signed-off-by: Lukas Straub


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir

Re: [PATCH v3 1/4] replication: Remove s->active_disk

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


07.07.2021 21:15, Lukas Straub wrote:

s->active_disk is bs->file. Remove it and use local variables instead.

Signed-off-by: Lukas Straub 
---
  block/replication.c | 38 +-
  1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/block/replication.c b/block/replication.c
index 52163f2d1f..50940fbe33 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -35,7 +35,6 @@ typedef enum {
  typedef struct BDRVReplicationState {
  ReplicationMode mode;
  ReplicationStage stage;
-BdrvChild *active_disk;
  BlockJob *commit_job;
  BdrvChild *hidden_disk;
  BdrvChild *secondary_disk;
@@ -307,11 +306,15 @@ out:
  return ret;
  }

-static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
+static void secondary_do_checkpoint(BlockDriverState *bs, Error **errp)
  {
+BDRVReplicationState *s = bs->opaque;
+BdrvChild *active_disk;


Why not to combine initialization into definition:

BdrvChild *active_disk = bs->file;


  Error *local_err = NULL;
  int ret;

+active_disk = bs->file;
+
  if (!s->backup_job) {
  error_setg(errp, "Backup job was cancelled unexpectedly");
  return;
@@ -323,13 +326,13 @@ static void secondary_do_checkpoint(BDRVReplicationState 
*s, Error **errp)
  return;
  }

-if (!s->active_disk->bs->drv) {
+if (!active_disk->bs->drv) {
  error_setg(errp, "Active disk %s is ejected",
-   s->active_disk->bs->node_name);
+   active_disk->bs->node_name);
  return;
  }

-ret = bdrv_make_empty(s->active_disk, errp);
+ret = bdrv_make_empty(active_disk, errp);
  if (ret < 0) {
  return;
  }
@@ -451,6 +454,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  BlockDriverState *bs = rs->opaque;
  BDRVReplicationState *s;
  BlockDriverState *top_bs;
+BdrvChild *active_disk;
  int64_t active_length, hidden_length, disk_length;
  AioContext *aio_context;
  Error *local_err = NULL;
@@ -488,15 +492,14 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  case REPLICATION_MODE_PRIMARY:
  break;
  case REPLICATION_MODE_SECONDARY:
-s->active_disk = bs->file;
-if (!s->active_disk || !s->active_disk->bs ||
-!s->active_disk->bs->backing) {
+active_disk = bs->file;


Here initializing active_disk only here makes sense: we consider "active disk" 
only in secondary mode. Right?


+if (!active_disk || !active_disk->bs || !active_disk->bs->backing) {
  error_setg(errp, "Active disk doesn't have backing file");
  aio_context_release(aio_context);
  return;
  }

-s->hidden_disk = s->active_disk->bs->backing;
+s->hidden_disk = active_disk->bs->backing;
  if (!s->hidden_disk->bs || !s->hidden_disk->bs->backing) {
  error_setg(errp, "Hidden disk doesn't have backing file");
  aio_context_release(aio_context);
@@ -511,7 +514,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  }

  /* verify the length */
-active_length = bdrv_getlength(s->active_disk->bs);
+active_length = bdrv_getlength(active_disk->bs);
  hidden_length = bdrv_getlength(s->hidden_disk->bs);
  disk_length = bdrv_getlength(s->secondary_disk->bs);
  if (active_length < 0 || hidden_length < 0 || disk_length < 0 ||
@@ -523,9 +526,9 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  }

  /* Must be true, or the bdrv_getlength() calls would have failed */
-assert(s->active_disk->bs->drv && s->hidden_disk->bs->drv);
+assert(active_disk->bs->drv && s->hidden_disk->bs->drv);

-if (!s->active_disk->bs->drv->bdrv_make_empty ||
+if (!active_disk->bs->drv->bdrv_make_empty ||
  !s->hidden_disk->bs->drv->bdrv_make_empty) {
  error_setg(errp,
 "Active disk or hidden disk doesn't support 
make_empty");
@@ -579,7 +582,7 @@ static void replication_start(ReplicationState *rs, 
ReplicationMode mode,
  s->stage = BLOCK_REPLICATION_RUNNING;

  if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
+secondary_do_checkpoint(bs, errp);
  }

  s->error = 0;
@@ -608,7 +611,7 @@ static void replication_do_checkpoint(ReplicationState *rs, 
Error **errp)
  }

  if (s->mode == REPLICATION_MODE_SECONDARY) {
-secondary_do_checkpoint(s, errp);
+secondary_do_checkpoint(bs, errp);
  }
  aio_context_release(aio_context);
  }
@@ -645,7 +648,6 @@ static void replication_done(void *opaque, int ret)
  if (ret == 0) {
  s->stage = BLOCK_REPLICATION_DONE;

-s->active_disk = NULL;
  s->secondary_disk = NULL;

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Klaus Jensen


On Jul  9 08:55, Klaus Jensen wrote:

On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and fourth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have a an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.




Patch 1-3 are fixes we need anyway, so I thought I might as well apply 
them :)



Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?



Yes, if acceptable, I would like to use patch 4 as an interim 
solution. We have a bug we need to fix for 6.1, and I belive this does 
the job.


I considered changing the existing nvme-bus to be on the main system 
bus, but then we break the existing behavior that the namespaces 
attach to the most recently defined controller in the absence of the 
shared parameter or an explicit bus parameter.


Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
  This is the approach that I am taking right now and it works well. 
Itallows many-to-many relationships and separates the life times 
ofsubsystems, namespaces and controllers like you mentioned.


  Conceptually, I also really like that the subsystem and namespace 
arenot "devices". One could argue that the namespace is comparable 
to aSCSI LUN (-device scsi-hd, right?), but where the SCSI LUN 
actually"shows up" in the host, the nvme namespace does not.


  My series handles backwards compatibility by keeping -device 
"shims"around that just wraps the new objects but behaves like it 
used to.The plan would be to deprecate these devices.


  The downside to this approach is that it moves the subsystem and
namespaces out of the "qdev tree (info qtree)" and into the pure QOM
"/objects" tree. Instead of qtree, we can have QMP and HMP commands
for introspection.


2. Make the subsystem a "system bus device"
  This way we add an "nvme-nvm-subsystems" bus as a direct child of 
themain system bus, and we can possibly get rid of the explicit 
-devicenvme-subsys as well. We change the namespace device to plug 
into thatinstead. The nvme controller device still needs to plug 
into the PCIbus, so it cannot be a child of the subsystems bus, 
but can keepusing a link parameter to hook into the subsystem and 
attach to anynamespaces it would like.


  I'm unsure if we can do this without deprecating the existing
namespace device, just like option 1.


  I have not implemented this, so I need to look more into it. It 
seemslike the main thing that this gives us compared to 1) is 
`info qtree`support and we still end up just "wiring" namespace 
attachment withbacklinks

Re: [PATCH] tests/qtest/nvme-test: add persistent memory region test

2021-07-09 Thread Klaus Jensen


On Jun 18 16:04, Gollu Appalanaidu wrote:

This will test the PMR functionality.

Signed-off-by: Gollu Appalanaidu 
---
tests/qtest/nvme-test.c | 78 -
1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/nvme-test.c b/tests/qtest/nvme-test.c
index d32c953a38..6d557be6ca 100644
--- a/tests/qtest/nvme-test.c
+++ b/tests/qtest/nvme-test.c
@@ -13,6 +13,7 @@
#include "libqos/libqtest.h"
#include "libqos/qgraph.h"
#include "libqos/pci.h"
+#include "include/block/nvme.h"

typedef struct QNvme QNvme;

@@ -21,6 +22,9 @@ struct QNvme {
QPCIDevice dev;
};

+static char *t_path;
+#define TEST_IMAGE_SIZE (2 * 1024 * 1024)
+
static void *nvme_get_driver(void *obj, const char *interface)
{
QNvme *nvme = obj;
@@ -66,12 +70,77 @@ static void nvmetest_oob_cmb_test(void *obj, void *data, 
QGuestAllocator *alloc)
g_assert_cmpint(qpci_io_readl(pdev, bar, cmb_bar_size - 1), !=, 0x44332211);
}

+static void nvmetest_pmr_reg_test(void *obj, void *data, QGuestAllocator 
*alloc)
+{
+QNvme *nvme = obj;
+QPCIDevice *pdev = >dev;
+QPCIBar pmr_bar, nvme_bar;
+uint32_t pmrcap, pmrsts;
+
+qpci_device_enable(pdev);
+pmr_bar = qpci_iomap(pdev, 4, NULL);
+
+/* Without Enabling PMRCTL check bar enablemet */
+qpci_io_writel(pdev, pmr_bar, 0, 0xccbbaa99);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x99);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0xaa99);
+
+/* Map NVMe Bar Register to Enable the Mem Region */
+nvme_bar = qpci_iomap(pdev, 0, NULL);
+
+pmrcap = qpci_io_readl(pdev, nvme_bar, 0xe00);
+g_assert_cmpint(NVME_PMRCAP_RDS(pmrcap), ==, 0x1);
+g_assert_cmpint(NVME_PMRCAP_WDS(pmrcap), ==, 0x1);
+g_assert_cmpint(NVME_PMRCAP_BIR(pmrcap), ==, 0x4);
+g_assert_cmpint(NVME_PMRCAP_PMRWBM(pmrcap), ==, 0x2);
+g_assert_cmpint(NVME_PMRCAP_CMSS(pmrcap), ==, 0x1);
+
+/* Enable PMRCTRL */
+qpci_io_writel(pdev, nvme_bar, 0xe04, 0x1);
+
+qpci_io_writel(pdev, pmr_bar, 0, 0x44332211);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), ==, 0x11);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), ==, 0x2211);
+g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), ==, 0x44332211);
+
+pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x0);
+
+/* Disable PMRCTRL */
+qpci_io_writel(pdev, nvme_bar, 0xe04, 0x0);
+
+qpci_io_writel(pdev, pmr_bar, 0, 0x88776655);
+g_assert_cmpint(qpci_io_readb(pdev, pmr_bar, 0), !=, 0x55);
+g_assert_cmpint(qpci_io_readw(pdev, pmr_bar, 0), !=, 0x6655);
+g_assert_cmpint(qpci_io_readl(pdev, pmr_bar, 0), !=, 0x88776655);
+
+pmrsts = qpci_io_readl(pdev, nvme_bar, 0xe08);
+g_assert_cmpint(NVME_PMRSTS_NRDY(pmrsts), ==, 0x1);
+
+qpci_iounmap(pdev, nvme_bar);
+qpci_iounmap(pdev, pmr_bar);
+}
+
static void nvme_register_nodes(void)
{
+int fd, ret;
+t_path = g_strdup("/tmp/qtest.XX");
+
+/* Create a temporary raw image*/
+fd = mkstemp(t_path);
+g_assert(fd >= 0);
+ret = ftruncate(fd, TEST_IMAGE_SIZE);
+g_assert(ret == 0);
+close(fd);
+
+char *pmr_cmd_line = g_strdup_printf("-object memory-backend-file,id=pmr0,"
+ "share=on,mem-path=%s,size=8", 
t_path);
+
QOSGraphEdgeOptions opts = {
.extra_device_opts = "addr=04.0,drive=drv0,serial=foo",
.before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
-   "file.read-zeroes=on,format=raw",
+   "file.read-zeroes=on,format=raw ",
+   pmr_cmd_line,
};

add_qpci_address(, &(QPCIAddress) { .devfn = QPCI_DEVFN(4, 0) });
@@ -83,6 +152,13 @@ static void nvme_register_nodes(void)
qos_add_test("oob-cmb-access", "nvme", nvmetest_oob_cmb_test, 
&(QOSGraphTestOptions) {
.edge.extra_device_opts = "cmb_size_mb=2"
});
+
+qos_add_test("pmr-test-access", "nvme", nvmetest_pmr_reg_test,
+ &(QOSGraphTestOptions) {
+.edge.extra_device_opts = "pmrdev=pmr0"
+});
+
+unlink(t_path);
}

libqos_init(nvme_register_nodes);
--
2.17.1



Applied to nvme-next. I swapped the memory-backend-file with a 
memory-backend-ram so we don't need to setup an actual file.


signature.asc
Description: PGP signature

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Klaus Jensen


On Jul  9 08:16, Hannes Reinecke wrote:

On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and fourth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have a an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.




Patch 1-3 are fixes we need anyway, so I thought I might as well apply 
them :)



Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?



Yes, if acceptable, I would like to use patch 4 as an interim solution. 
We have a bug we need to fix for 6.1, and I belive this does the job.


I considered changing the existing nvme-bus to be on the main system 
bus, but then we break the existing behavior that the namespaces attach 
to the most recently defined controller in the absence of the shared 
parameter or an explicit bus parameter.


Wrt. "the plan", right now, I see two solutions going forward:

1. Introduce new -object's for nvme-nvm-subsystem and nvme-ns
   This is the approach that I am taking right now and it works well. It 
   allows many-to-many relationships and separates the life times of 
   subsystems, namespaces and controllers like you mentioned.


   Conceptually, I also really like that the subsystem and namespace are 
   not "devices". One could argue that the namespace is comparable to a 
   SCSI LUN (-device scsi-hd, right?), but where the SCSI LUN actually 
   "shows up" in the host, the nvme namespace does not.


   My series handles backwards compatibility by keeping -device "shims" 
   around that just wraps the new objects but behaves like it used to. 
   The plan would be to deprecate these devices.


   The downside to this approach is that it moves the subsystem and 
   namespaces out of the "qdev tree (info qtree)" and into the pure QOM 
   "/objects" tree. Instead of qtree, we can have QMP and HMP commands 
   for introspection.


2. Make the subsystem a "system bus device"
   This way we add an "nvme-nvm-subsystems" bus as a direct child of the 
   main system bus, and we can possibly get rid of the explicit -device 
   nvme-subsys as well. We change the namespace device to plug into that 
   instead. The nvme controller device still needs to plug into the PCI 
   bus, so it cannot be a child of the subsystems bus, but can keep 
   using a link parameter to hook into the subsystem and attach to any 
   namespaces it would like.


   I'm unsure if we can do this without deprecating the existing 
   namespace device, just like option 1.


   I have not implemented this, so I need to look more into it. It seems 
   like the main thing that this gives us compared to 1) is `info qtree` 
   support and we still end up just "wiring" namespace attachment with 
   backlinks anyway.


I'm not sure what I

Re: [PATCH 2/2] qemu-img: Add --skip-broken for 'convert --bitmaps'

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


08.07.2021 04:30, Eric Blake wrote:

The point of 'qemu-img convert --bitmaps' is to be a convenience for
actions that are already possible through a string of smaller
'qemu-img bitmap' sub-commands.  One situation not accounted for
already is that if a source image contains an inconsistent bitmap (for
example, because a qemu process died abruptly before flushing bitmap
state), the user MUST delete those inconsistent bitmaps before
anything else useful can be done with the image.

We don't want to delete inconsistent bitmaps by default: although a
corrupt bitmap is only a loss of optimization rather than a corruption
of user-visible data, it is still nice to require the user to opt in
to the fact that they are aware of the loss of the bitmap.  Still,
requiring the user to check 'qemu-img info' to see whether bitmaps are
consistent, then use 'qemu-img bitmap --remove' to remove offenders,
all before using 'qemu-img convert', is a lot more work than just
adding a knob 'qemu-img convert --bitmaps --skip-broken' which opts in
to skipping the broken bitmaps.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1946084
Signed-off-by: Eric Blake 
---
  docs/tools/qemu-img.rst   |  8 +++-
  qemu-img.c| 20 +--
  tests/qemu-iotests/tests/qemu-img-bitmaps |  4 
  tests/qemu-iotests/tests/qemu-img-bitmaps.out | 14 +
  4 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 1d8470eada0e..5cf1c764597b 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -414,7 +414,7 @@ Command description:
4
  Error on reading data

-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t 
CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o OPTIONS] [-l 
SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] 
FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] 
[--target-is-zero] [--bitmaps [--skip-broken]] [-U] [-C] [-c] [-p] [-q] [-n] 
[-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o 
OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m 
NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME


Of course, [--bitmaps [--skip-broken]] looks like --skip-broken is a 
suboption.. But actually it's not so. So, shouldn't it be named more explicit, 
like --skip-broken-bitmaps ? To be sure that we will not interfere in future 
with some other broken things we want to skip? And to avoid strange but correct 
command lines like

qemu-img convert --skip-broken  --bitmaps  src dst




Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
@@ -456,6 +456,12 @@ Command description:
*NUM_COROUTINES* specifies how many coroutines work in parallel during
the convert process (defaults to 8).

+  Use of ``--bitmaps`` requests that any persistent bitmaps present in
+  the original are also copied to the destination.  If any bitmap is
+  inconsistent in the source, the conversion will fail unless
+  ``--skip-broken`` is also specified to copy only the consistent
+  bitmaps.
+
  .. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F 
BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE]

Create the new disk image *FILENAME* of size *SIZE* and format
diff --git a/qemu-img.c b/qemu-img.c
index 68a4d298098f..e8b012f39c0c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -82,6 +82,7 @@ enum {
  OPTION_MERGE = 274,
  OPTION_BITMAPS = 275,
  OPTION_FORCE = 276,
+OPTION_SKIP_BROKEN = 277,
  };

  typedef enum OutputFormat {
@@ -2101,7 +2102,8 @@ static int convert_do_copy(ImgConvertState *s)
  return s->ret;
  }

-static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst)
+static int convert_copy_bitmaps(BlockDriverState *src, BlockDriverState *dst,
+bool skip_broken)
  {
  BdrvDirtyBitmap *bm;
  Error *err = NULL;
@@ -2113,6 +2115,10 @@ static int convert_copy_bitmaps(BlockDriverState *src, 
BlockDriverState *dst)
  continue;
  }
  name = bdrv_dirty_bitmap_name(bm);
+if (skip_broken && bdrv_dirty_bitmap_inconsistent(bm)) {
+warn_report("Skipping inconsistent bitmap %s", name);
+continue;
+}
  qmp_block_dirty_bitmap_add(dst->node_name, name,
 true, bdrv_dirty_bitmap_granularity(bm),
 true, true,
@@ -2167,6 +2173,7 @@ static int img_convert(int argc, char **argv)
  bool force_share = false;
  bool explict_min_sparse = false;
  bool bitmaps = false;
+bool skip_broken = false;
  int64_t rate_limit = 0;

Re: [PATCH 1/2] iotests: Improve and rename test 291 to qemu-img-bitmap

2021-07-09 Thread Vladimir Sementsov-Ogievskiy


08.07.2021 04:30, Eric Blake wrote:

Enhance the test to demonstrate behavior of qemu-img with a qcow2
image containing an inconsistent bitmap, and rename it now that we
support useful iotest names.

While at it, fix a missing newline in the error message thus exposed.

Signed-off-by: Eric Blake 
---
  block/dirty-bitmap.c  |  2 +-
  .../{291 => tests/qemu-img-bitmaps}   | 13 +++-
  .../{291.out => tests/qemu-img-bitmaps.out}   | 32 ++-
  3 files changed, 44 insertions(+), 3 deletions(-)
  rename tests/qemu-iotests/{291 => tests/qemu-img-bitmaps} (92%)
  rename tests/qemu-iotests/{291.out => tests/qemu-img-bitmaps.out} (82%)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 68d295d6e3ed..0ef46163e3ea 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -193,7 +193,7 @@ int bdrv_dirty_bitmap_check(const BdrvDirtyBitmap *bitmap, 
uint32_t flags,
  error_setg(errp, "Bitmap '%s' is inconsistent and cannot be used",
 bitmap->name);
  error_append_hint(errp, "Try block-dirty-bitmap-remove to delete"
-  " this bitmap from disk");
+  " this bitmap from disk\n");
  return -1;
  }

diff --git a/tests/qemu-iotests/291 b/tests/qemu-iotests/tests/qemu-img-bitmaps
similarity index 92%
rename from tests/qemu-iotests/291
rename to tests/qemu-iotests/tests/qemu-img-bitmaps
index 20efb080a6c0..76cd9e31e850 100755
--- a/tests/qemu-iotests/291
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -3,7 +3,7 @@
  #
  # Test qemu-img bitmap handling
  #
-# Copyright (C) 2018-2020 Red Hat, Inc.
+# Copyright (C) 2018-2021 Red Hat, Inc.
  #
  # This program is free software; you can redistribute it and/or modify
  # it under the terms of the GNU General Public License as published by
@@ -32,6 +32,7 @@ _cleanup()
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
+cd ..
  . ./common.rc
  . ./common.filter
  . ./common.nbd
@@ -129,6 +130,16 @@ $QEMU_IMG map --output=json --image-opts \

  nbd_server_stop

+echo
+echo "=== Check handling of inconsistent bitmap ==="
+echo
+
+$QEMU_IO -c abort "$TEST_IMG" 2>/dev/null
+$QEMU_IMG bitmap --add "$TEST_IMG" b4
+$QEMU_IMG bitmap --remove "$TEST_IMG" b1
+_img_info --format-specific | _filter_irrelevant_img_info
+$QEMU_IMG convert --bitmaps -O qcow2 "$TEST_IMG" "$TEST_IMG.copy"


Worth then removing remaining inconsistent bitmaps and try again?

I think you should now remove $TEST_IMG.copy in _cleanup

with squashed in

--- a/tests/qemu-iotests/tests/qemu-img-bitmaps
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps
@@ -27,6 +27,7 @@ status=1 # failure is the default!
 _cleanup()
 {
 _cleanup_test_img
+_rm_test_img "$TEST_IMG.copy"
 nbd_server_stop
 }
 trap "_cl

Tested-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Vladimir Sementsov-Ogievskiy 


+
  # success, all done
  echo '*** done'
  rm -f $seq.full
diff --git a/tests/qemu-iotests/291.out 
b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
similarity index 82%
rename from tests/qemu-iotests/291.out
rename to tests/qemu-iotests/tests/qemu-img-bitmaps.out
index 018d6b103f87..17b34eaed30f 100644
--- a/tests/qemu-iotests/291.out
+++ b/tests/qemu-iotests/tests/qemu-img-bitmaps.out
@@ -1,4 +1,4 @@
-QA output created by 291
+QA output created by qemu-img-bitmaps

  === Initial image setup ===

@@ -115,4 +115,34 @@ Format specific information:
  [{ "start": 0, "length": 2097152, "depth": 0, "present": true, "zero": false, "data": 
true, "offset": OFFSET},
  { "start": 2097152, "length": 1048576, "depth": 0, "present": false, "zero": false, 
"data": false},
  { "start": 3145728, "length": 7340032, "depth": 0, "present": true, "zero": false, 
"data": true, "offset": OFFSET}]
+
+=== Check handling of inconsistent bitmap ===
+
+image: TEST_DIR/t.IMGFMT
+file format: IMGFMT
+virtual size: 10 MiB (10485760 bytes)
+cluster_size: 65536
+backing file: TEST_DIR/t.IMGFMT.base
+backing file format: IMGFMT
+Format specific information:
+bitmaps:
+[0]:
+flags:
+[0]: in-use
+[1]: auto
+name: b2
+granularity: 65536
+[1]:
+flags:
+[0]: in-use
+name: b0
+granularity: 65536
+[2]:
+flags:
+[0]: auto
+name: b4
+granularity: 65536
+corrupt: false
+qemu-img: Failed to populate bitmap b0: Bitmap 'b0' is inconsistent and cannot 
be used
+Try block-dirty-bitmap-remove to delete this bitmap from disk
  *** done




--
Best regards,
Vladimir

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Hannes Reinecke


On 7/9/21 8:05 AM, Klaus Jensen wrote:

On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and fourth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have a an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


So, how do we go about with patch 4?
Without it this whole exercise is a bit pointless, seeing that it 
doesn't fix anything.


Shall we go with that patch as an interim solution?
Will you replace it with your 'object' patch?
What is the plan?

Cheers,

Hannes
--
Dr. Hannes ReineckeKernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Re: [PATCH]: /hw/nvme/ctrl error handling if descriptors are greater than 1024

2021-07-09 Thread Klaus Jensen


On Jul  6 16:13, Padmakar Kalghatgi wrote:

From: padmakar



if the number of descriptors or pages is more than 1024,

dma writes or reads will result in failure. Hence, we check

if the number of descriptors or pages is more than 1024

in the nvme module and return Internal Device error.



Signed-off-by: Padmakar Kalghatgi

---

hw/nvme/ctrl.c   | 14 ++

hw/nvme/trace-events |  1 +

2 files changed, 15 insertions(+)



diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c

index 40a7efc..082592f 100644

--- a/hw/nvme/ctrl.c

+++ b/hw/nvme/ctrl.c

@@ -602,6 +602,20 @@ static uint16_t nvme_map_addr(NvmeCtrl *n, NvmeSg *sg,
hwaddr addr, size_t len)

return NVME_SUCCESS;

}

+/*

+ *  The QEMU has an inherent issue where in if the no.

+ *  of descriptors is more than 1024, it will result in

+ *  failure during the dma write or reads. Hence, we need

+ *  to return the error.

+ */

+

+if (((sg->flags & NVME_SG_DMA) && ((sg->qsg.nsg + 1) > IOV_MAX)) ||

+((sg->iov.niov + 1) > IOV_MAX)) {

+NVME_GUEST_ERR(pci_nvme_ub_sg_desc_toohigh,

+   "number of descriptors is greater than 1024");

+return NVME_INTERNAL_DEV_ERROR;

+}

+

trace_pci_nvme_map_addr(addr, len);

if (nvme_addr_is_cmb(n, addr)) {

diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events

index ea33d0c..bfe1a3b 100644

--- a/hw/nvme/trace-events

+++ b/hw/nvme/trace-events

@@ -202,3 +202,4 @@ pci_nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t
new_head) "completion qu

pci_nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write
for nonexistent queue, sqid=%"PRIu32", ignoring"

pci_nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail)
"submission queue doorbell write value beyond queue size, sqid=%"PRIu32",
new_head=%"PRIu16", ignoring"

pci_nvme_ub_unknown_css_value(void) "unknown value in cc.css field"

+pci_nvme_ub_sg_desc_toohigh(void) "the number of sg descriptors is too
high"

--

2.7.0.windows.1





Applied to nvme-next, but made the error message more generic (this also 
applied to PRPs).


signature.asc
Description: PGP signature

Re: [PATCH v2 0/4] hw/nvme: fix controller hotplugging

2021-07-09 Thread Klaus Jensen


On Jul  7 17:49, Klaus Jensen wrote:

From: Klaus Jensen 

Back in May, Hannes posted a fix[1] to re-enable NVMe PCI hotplug. We
discussed a bit back and fourth and I mentioned that the core issue was
an artifact of the parent/child relationship stemming from the qdev
setup we have with namespaces attaching to controller through a qdev
bus.

The gist of this series is the fourth patch "hw/nvme: fix controller hot
unplugging" which basically causes namespaces to be reassigned to a bus
owned by the subsystem if the parent controller is linked to one. This
fixes `device_del/add nvme` in such settings.

Note, that in the case that there is no subsystem involved, nvme devices
can be removed from the system with `device_del`, but this *will* cause
the namespaces to be removed as well since there is no place (i.e. no
subsystem) for them to "linger". And since this series does not add
support for hotplugging nvme-ns devices, while an nvme device can be
readded, no namespaces can. Support for hotplugging nvme-ns devices is
present in [1], but I'd rather not add that since I think '-device
nvme-ns' is already a bad design choice.

Now, I do realize that it is not "pretty" to explicitly change the
parent bus, so I do have a an RFC patch in queue that replaces the
subsystem and namespace devices with objects, but keeps -device shims
available for backwards compatibility. This approach will solve the
problems properly and should be a better model. However, I don't believe
it will make it for 6.1 and I'd really like to at least fix the
unplugging for 6.1 and this gets the job done.

 [1]: 20210511073511.32511-1-h...@suse.de

v2:
- added R-b's by Hannes for patches 1 through 3
- simplified "hw/nvme: fix controller hot unplugging"

Klaus Jensen (4):
 hw/nvme: remove NvmeCtrl parameter from ns setup/check functions
 hw/nvme: mark nvme-subsys non-hotpluggable
 hw/nvme: unregister controller with subsystem at exit
 hw/nvme: fix controller hot unplugging

hw/nvme/nvme.h   | 18 +---
hw/nvme/ctrl.c   | 14 ++--
hw/nvme/ns.c | 55 +++-
hw/nvme/subsys.c |  9 
4 files changed, 63 insertions(+), 33 deletions(-)

--
2.32.0



Applied patches 1 through 3 to nvme-next.


signature.asc
Description: PGP signature

67 matches

Mail list logo