date:20210504

Re: [PATCH 10/10] qcow2-refcount: check_refblocks(): add separate message for reserved

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Split checking for reserved bits out of aligned offset check.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h  |  1 +
>  block/qcow2-refcount.c | 10 +-
>  2 files changed, 10 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 09/10] qcow2-refcount: check_refcounts_l1(): check reserved bits

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h  | 1 +
>  block/qcow2-refcount.c | 6 ++
>  2 files changed, 7 insertions(+)
> 
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 08/10] qcow2-refcount: improve style of check_refcounts_l1()

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>  - use g_autofree for l1_table
>  - better name for size in bytes variable
>  - reduce code blocks nesting
>  - whitespaces, braces, newlines
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-refcount.c | 97 +-
>  1 file changed, 49 insertions(+), 48 deletions(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 44fc0dd5dc..eb6de3dabd 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1864,71 +1864,72 @@ static int check_refcounts_l1(BlockDriverState *bs,
>int flags, BdrvCheckMode fix, bool active)
>  {
>  BDRVQcow2State *s = bs->opaque;
> -uint64_t *l1_table = NULL, l2_offset, l1_size2;
> +size_t l1_size_bytes = l1_size * L1E_SIZE;
> +g_autofree uint64_t *l1_table = g_try_malloc(l1_size_bytes);

Note that this now happens...

> +uint64_t l2_offset;
>  int i, ret;
>  
> -l1_size2 = l1_size * L1E_SIZE;
> +if (!l1_size) {
> +return 0;

...before you validate whether l1_size is non-zero, which can result in
g_try_malloc(0).  Probably harmless, but it might be better if you declare
 g_autofree uint64_t *l1_table = NULL;
and then initialize it via malloc only after the sanity check.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 07/10] qcow2-refcount: check_refcounts_l2(): check reserved bits

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h  |  1 +
>  block/qcow2-refcount.c | 12 +++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 06/10] qcow2-refcount: check_refcounts_l2(): check l2_bitmap

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Check subcluster bitmap of the l2 entry for different types of
> clusters:
> 
>  - for compressed it must be zero
>  - for allocated check consistency of two parts of the bitmap
>  - for unallocated all subclusters should be unallocated
>(or zero-plain)
> 
> For unallocated clusters we can safely fix the entry by making it
> zero-plain.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-refcount.c | 30 +-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 05/10] qcow2-refcount: fix_l2_entry_by_zero(): also zero L2 entry bitmap

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> We'll reuse the function to fix wrong L2 entry bitmap. Support it now.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-refcount.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index f1e771d742..62d59eb2e9 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1588,7 +1588,8 @@ enum {
>  };
>  
>  /*
> - * Fix L2 entry by making it QCOW2_CLUSTER_ZERO_PLAIN.
> + * Fix L2 entry by making it QCOW2_CLUSTER_ZERO_PLAIN (or maing all its 
> present

making

> + * subclusters QCOW2_SUBCLUSTER_ZERO_PLAIN).
>   *
>   * Function do res->corruptions-- on success, so caller is responsible to do
>   * corresponding res->corruptions++ prior to the call.
> @@ -1605,9 +1606,20 @@ static int fix_l2_entry_by_zero(BlockDriverState *bs, 
> BdrvCheckResult *res,
>  int idx = l2_index * (l2_entry_size(s) / sizeof(uint64_t));
>  uint64_t l2e_offset = l2_offset + (uint64_t)l2_index * l2_entry_size(s);
>  int ign = active ? QCOW2_OL_ACTIVE_L2 : QCOW2_OL_INACTIVE_L2;
> -uint64_t l2_entry = has_subclusters(s) ? 0 : QCOW_OFLAG_ZERO;
>  
> -set_l2_entry(s, l2_table, l2_index, l2_entry);
> +if (has_subclusters(s)) {
> +uint64_t l2_bitmap = get_l2_bitmap(s, l2_table, l2_index);
> +
> +/* Allocated subclusters becomes zero */

become

> +l2_bitmap |= l2_bitmap << 32;
> +l2_bitmap &= QCOW_L2_BITMAP_ALL_ZEROES;
> +
> +set_l2_bitmap(s, l2_table, l2_index, l2_bitmap);
> +set_l2_entry(s, l2_table, l2_index, 0);
> +} else {
> +set_l2_entry(s, l2_table, l2_index, QCOW_OFLAG_ZERO);
> +}
> +
>  ret = qcow2_pre_write_overlap_check(bs, ign, l2e_offset, 
> l2_entry_size(s),
>  false);
>  if (metadata_overlap) {
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 04/10] qcow2-refcount: introduce fix_l2_entry_by_zero()

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Split fix_l2_entry_by_zero() out of check_refcounts_l2() to be
> reused in further patch.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-refcount.c | 87 +-
>  1 file changed, 60 insertions(+), 27 deletions(-)
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 66cbb94ef9..f1e771d742 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1587,6 +1587,54 @@ enum {
>  CHECK_FRAG_INFO = 0x2,  /* update BlockFragInfo counters */
>  };
>  
> +/*
> + * Fix L2 entry by making it QCOW2_CLUSTER_ZERO_PLAIN.
> + *
> + * Function do res->corruptions-- on success, so caller is responsible to do
> + * corresponding res->corruptions++ prior to the call.

This function decrements res->corruptions on success, so the caller is
responsible to increment res->corruptions prior to the call.

> + *
> + * On failure in-memory @l2_table may be modified.
> + */
> +static int fix_l2_entry_by_zero(BlockDriverState *bs, BdrvCheckResult *res,
> +uint64_t l2_offset,
> +uint64_t *l2_table, int l2_index, bool 
> active,
> +bool *metadata_overlap)

Otherwise seems okay.
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 03/10] qcow2: introduce qcow2_parse_compressed_l2_entry() helper

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Add helper to parse compressed l2_entry and use it everywhere instead
> of opencoding.

open-coding

> 
> Note, that in most places we move to precise coffset/csize instead of
> sector-aligned. Still it should work good enough for updating
> refcounts.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h  |  3 ++-
>  block/qcow2-cluster.c  | 15 +++
>  block/qcow2-refcount.c | 36 +---
>  block/qcow2.c  |  9 ++---
>  4 files changed, 36 insertions(+), 27 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 02/10] qcow2: compressed read: simplify cluster descriptor passing

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
> Let's pass the whole L2 entry and not bother with
> L2E_COMPRESSED_OFFSET_SIZE_MASK.
> 
> It also helps further refactoring that adds generic
> qcow2_parse_compressed_l2_entry() helper.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.h |  1 -
>  block/qcow2-cluster.c |  5 ++---
>  block/qcow2.c | 12 +++-
>  3 files changed, 9 insertions(+), 9 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH 01/10] qcow2-refcount: improve style of check_refcounts_l2()

2021-05-04 Thread Eric Blake

On 5/4/21 10:20 AM, Vladimir Sementsov-Ogievskiy wrote:
>  - don't use same name for size in bytes and in entries
>  - use g_autofree for l2_table
>  - add whitespace
>  - fix block comment style
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2-refcount.c | 47 +-
>  1 file changed, 24 insertions(+), 23 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH] qcow2: set bdi->is_dirty

2021-05-04 Thread Kirill Tkhai

On 04.05.2021 19:06, Vladimir Sementsov-Ogievskiy wrote:
> Set bdi->is_dirty, so that qemu-img info could show dirty flag.
> 
> After this commit the following check will show '"dirty-flag": true':
> 
> ./build/qemu-img create -f qcow2 -o lazy_refcounts=on x 1M
> ./build/qemu-io x
> qemu-io> write 0 1M
> 
>  After "write" command success, kill the qemu-io process:
> 
> kill -9 
> 
> ./build/qemu-img info --output=json x
> 
> This will show '"dirty-flag": true' among other things. (before this
> commit it shows '"dirty-flag": false')
> 
> Note, that qcow2's dirty-bit is not a "dirty bit for the image". It
> only protects qcow2 lazy refcounts feature. So, there are a lot of
> conditions when qcow2 session may be not closed correctly, but bit is
> 0. Still, when bit is set, the last session is definitely not finished
> correctly and it's better to report it.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 

Works for me. Thanks,

Tested-by: Kirill Tkhai 

> ---
>  block/qcow2.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 9727ae8fe3..39b91ef940 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -5089,6 +5089,7 @@ static int qcow2_get_info(BlockDriverState *bs, 
> BlockDriverInfo *bdi)
>  BDRVQcow2State *s = bs->opaque;
>  bdi->cluster_size = s->cluster_size;
>  bdi->vm_state_offset = qcow2_vm_state_offset(s);
> +bdi->is_dirty = s->incompatible_features & QCOW2_INCOMPAT_DIRTY;
>  return 0;
>  }
>  
>

Re: [PATCH] qcow2: set bdi->is_dirty

2021-05-04 Thread Eric Blake

On 5/4/21 11:06 AM, Vladimir Sementsov-Ogievskiy wrote:
> Set bdi->is_dirty, so that qemu-img info could show dirty flag.
> 
> After this commit the following check will show '"dirty-flag": true':
> 
> ./build/qemu-img create -f qcow2 -o lazy_refcounts=on x 1M
> ./build/qemu-io x
> qemu-io> write 0 1M
> 
>  After "write" command success, kill the qemu-io process:
> 
> kill -9 
> 
> ./build/qemu-img info --output=json x
> 
> This will show '"dirty-flag": true' among other things. (before this
> commit it shows '"dirty-flag": false')
> 
> Note, that qcow2's dirty-bit is not a "dirty bit for the image". It
> only protects qcow2 lazy refcounts feature. So, there are a lot of
> conditions when qcow2 session may be not closed correctly, but bit is
> 0. Still, when bit is set, the last session is definitely not finished
> correctly and it's better to report it.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/qcow2.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Eric Blake 


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

[PATCH] qcow2: set bdi->is_dirty

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Set bdi->is_dirty, so that qemu-img info could show dirty flag.

After this commit the following check will show '"dirty-flag": true':

./build/qemu-img create -f qcow2 -o lazy_refcounts=on x 1M
./build/qemu-io x
qemu-io> write 0 1M

 After "write" command success, kill the qemu-io process:

kill -9 

./build/qemu-img info --output=json x

This will show '"dirty-flag": true' among other things. (before this
commit it shows '"dirty-flag": false')

Note, that qcow2's dirty-bit is not a "dirty bit for the image". It
only protects qcow2 lazy refcounts feature. So, there are a lot of
conditions when qcow2 session may be not closed correctly, but bit is
0. Still, when bit is set, the last session is definitely not finished
correctly and it's better to report it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 9727ae8fe3..39b91ef940 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -5089,6 +5089,7 @@ static int qcow2_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 BDRVQcow2State *s = bs->opaque;
 bdi->cluster_size = s->cluster_size;
 bdi->vm_state_offset = qcow2_vm_state_offset(s);
+bdi->is_dirty = s->incompatible_features & QCOW2_INCOMPAT_DIRTY;
 return 0;
 }
 
-- 
2.29.2

Re: [PATCH] virtio-blk: drop deprecated scsi=on|off property

2021-05-04 Thread Eduardo Habkost

On Tue, May 04, 2021 at 03:32:55PM +0100, Stefan Hajnoczi wrote:
> On Thu, Apr 29, 2021 at 02:03:52PM -0400, Eduardo Habkost wrote:
> > On Thu, Apr 29, 2021 at 04:52:21PM +0100, Stefan Hajnoczi wrote:
> > > Live migrating old guests from an old QEMU with the SCSI feature bit
> > > enabled will fail with "Features 0x... unsupported. Allowed features:
> > > 0x...". We've followed the QEMU deprecation policy so users have been
> > > warned...
> > > 
> > 
> > Were they really warned, though?  People running
> > "-machine pc-i440fx-2.4" might be completely unaware that it was
> > silently enabling a deprecated feature.
> > 
> > Can we have this documented in a more explicit way?  Maybe just a
> > comment at hw_compat_2_4 would be enough, to warn people doing
> > backports and rebases downstream.
> > 
> > Can we make QEMU refuse to start if using pc-2.4 + virtio-blk
> > together, just to be sure?
> 
> On second thought, do we really want to break pc-2.4 user's QEMU
> command-lines if they have a virtio-blk device?

It depends which command line you are talking about.

I believe we _must_ break the following:
"-machine pc-i440fx-2.4 -device virtio-blk", and
"-machine pc-i440fx-2.4 -device virtio-blk,scsi=on".
Your patch breaks only the latter.

Your patch also breaks the following:
"-machine pc-i440fx-2.4 -device virtio-blk,scsi=off",
which I don't think we should break.

> 
> BTW Peter mentioned libvirt avoids the unnecessary scsi=off:
> https://gitlab.com/libvirt/libvirt/-/commit/ec69f0190be731d12faeac08dbf63325836509a9
> 
> Stefan



-- 
Eduardo

[PATCH 02/10] qcow2: compressed read: simplify cluster descriptor passing

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Let's pass the whole L2 entry and not bother with
L2E_COMPRESSED_OFFSET_SIZE_MASK.

It also helps further refactoring that adds generic
qcow2_parse_compressed_l2_entry() helper.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h |  1 -
 block/qcow2-cluster.c |  5 ++---
 block/qcow2.c | 12 +++-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 0fe5f74ed3..42a0058ab7 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -588,7 +588,6 @@ typedef enum QCow2MetadataOverlap {
 
 #define L1E_OFFSET_MASK 0x00fffe00ULL
 #define L2E_OFFSET_MASK 0x00fffe00ULL
-#define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffULL
 
 #define REFT_OFFSET_MASK 0xfe00ULL
 
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index bd0597842f..04735ee439 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -556,8 +556,7 @@ static int coroutine_fn 
do_perform_cow_write(BlockDriverState *bs,
  * offset needs to be aligned to a cluster boundary.
  *
  * If the cluster is unallocated then *host_offset will be 0.
- * If the cluster is compressed then *host_offset will contain the
- * complete compressed cluster descriptor.
+ * If the cluster is compressed then *host_offset will contain the l2 entry.
  *
  * On entry, *bytes is the maximum number of contiguous bytes starting at
  * offset that we are interested in.
@@ -660,7 +659,7 @@ int qcow2_get_host_offset(BlockDriverState *bs, uint64_t 
offset,
 ret = -EIO;
 goto fail;
 }
-*host_offset = l2_entry & L2E_COMPRESSED_OFFSET_SIZE_MASK;
+*host_offset = l2_entry;
 break;
 case QCOW2_SUBCLUSTER_ZERO_PLAIN:
 case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
diff --git a/block/qcow2.c b/block/qcow2.c
index 9727ae8fe3..746ae85b89 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -74,7 +74,7 @@ typedef struct {
 
 static int coroutine_fn
 qcow2_co_preadv_compressed(BlockDriverState *bs,
-   uint64_t cluster_descriptor,
+   uint64_t l2_entry,
uint64_t offset,
uint64_t bytes,
QEMUIOVector *qiov,
@@ -2177,7 +2177,7 @@ typedef struct Qcow2AioTask {
 
 BlockDriverState *bs;
 QCow2SubclusterType subcluster_type; /* only for read */
-uint64_t host_offset; /* or full descriptor in compressed clusters */
+uint64_t host_offset; /* or l2_entry for compressed read */
 uint64_t offset;
 uint64_t bytes;
 QEMUIOVector *qiov;
@@ -4665,7 +4665,7 @@ qcow2_co_pwritev_compressed_part(BlockDriverState *bs,
 
 static int coroutine_fn
 qcow2_co_preadv_compressed(BlockDriverState *bs,
-   uint64_t cluster_descriptor,
+   uint64_t l2_entry,
uint64_t offset,
uint64_t bytes,
QEMUIOVector *qiov,
@@ -4677,8 +4677,10 @@ qcow2_co_preadv_compressed(BlockDriverState *bs,
 uint8_t *buf, *out_buf;
 int offset_in_cluster = offset_into_cluster(s, offset);
 
-coffset = cluster_descriptor & s->cluster_offset_mask;
-nb_csectors = ((cluster_descriptor >> s->csize_shift) & s->csize_mask) + 1;
+assert(qcow2_get_cluster_type(bs, l2_entry) == QCOW2_CLUSTER_COMPRESSED);
+
+coffset = l2_entry & s->cluster_offset_mask;
+nb_csectors = ((l2_entry >> s->csize_shift) & s->csize_mask) + 1;
 csize = nb_csectors * QCOW2_COMPRESSED_SECTOR_SIZE -
 (coffset & ~QCOW2_COMPRESSED_SECTOR_MASK);
 
-- 
2.29.2

[PATCH 09/10] qcow2-refcount: check_refcounts_l1(): check reserved bits

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h  | 1 +
 block/qcow2-refcount.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index b8b1093b61..58fd7f1678 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -586,6 +586,7 @@ typedef enum QCow2MetadataOverlap {
 (QCOW2_OL_CACHED | QCOW2_OL_INACTIVE_L2)
 
 #define L1E_OFFSET_MASK 0x00fffe00ULL
+#define L1E_RESERVED_MASK 0x7f0001ffULL
 #define L2E_OFFSET_MASK 0x00fffe00ULL
 #define L2E_STD_RESERVED_MASK 0x3f0001feULL
 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index eb6de3dabd..9a20aac0c9 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1903,6 +1903,12 @@ static int check_refcounts_l1(BlockDriverState *bs,
 continue;
 }
 
+if (l1_table[i] & L1E_RESERVED_MASK) {
+fprintf(stderr, "ERROR found L1 entry with reserved bits set: "
+"%" PRIx64, l1_table[i]);
+res->corruptions++;
+}
+
 l2_offset = l1_table[i] & L1E_OFFSET_MASK;
 
 /* Mark L2 table as used */
-- 
2.29.2

[PATCH 08/10] qcow2-refcount: improve style of check_refcounts_l1()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

 - use g_autofree for l1_table
 - better name for size in bytes variable
 - reduce code blocks nesting
 - whitespaces, braces, newlines

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2-refcount.c | 97 +-
 1 file changed, 49 insertions(+), 48 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 44fc0dd5dc..eb6de3dabd 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1864,71 +1864,72 @@ static int check_refcounts_l1(BlockDriverState *bs,
   int flags, BdrvCheckMode fix, bool active)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t *l1_table = NULL, l2_offset, l1_size2;
+size_t l1_size_bytes = l1_size * L1E_SIZE;
+g_autofree uint64_t *l1_table = g_try_malloc(l1_size_bytes);
+uint64_t l2_offset;
 int i, ret;
 
-l1_size2 = l1_size * L1E_SIZE;
+if (!l1_size) {
+return 0;
+}
 
 /* Mark L1 table as used */
 ret = qcow2_inc_refcounts_imrt(bs, res, refcount_table, 
refcount_table_size,
-   l1_table_offset, l1_size2);
+   l1_table_offset, l1_size_bytes);
 if (ret < 0) {
-goto fail;
+return ret;
+}
+
+if (l1_table == NULL) {
+res->check_errors++;
+return -ENOMEM;
 }
 
 /* Read L1 table entries from disk */
-if (l1_size2 > 0) {
-l1_table = g_try_malloc(l1_size2);
-if (l1_table == NULL) {
-ret = -ENOMEM;
-res->check_errors++;
-goto fail;
-}
-ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size2);
-if (ret < 0) {
-fprintf(stderr, "ERROR: I/O error in check_refcounts_l1\n");
-res->check_errors++;
-goto fail;
-}
-for(i = 0;i < l1_size; i++)
-be64_to_cpus(_table[i]);
+ret = bdrv_pread(bs->file, l1_table_offset, l1_table, l1_size_bytes);
+if (ret < 0) {
+fprintf(stderr, "ERROR: I/O error in check_refcounts_l1\n");
+res->check_errors++;
+return ret;
+}
+
+for (i = 0; i < l1_size; i++) {
+be64_to_cpus(_table[i]);
 }
 
 /* Do the actual checks */
-for(i = 0; i < l1_size; i++) {
-l2_offset = l1_table[i];
-if (l2_offset) {
-/* Mark L2 table as used */
-l2_offset &= L1E_OFFSET_MASK;
-ret = qcow2_inc_refcounts_imrt(bs, res,
-   refcount_table, refcount_table_size,
-   l2_offset, s->cluster_size);
-if (ret < 0) {
-goto fail;
-}
+for (i = 0; i < l1_size; i++) {
+if (!l1_table[i]) {
+continue;
+}
 
-/* L2 tables are cluster aligned */
-if (offset_into_cluster(s, l2_offset)) {
-fprintf(stderr, "ERROR l2_offset=%" PRIx64 ": Table is not "
-"cluster aligned; L1 entry corrupted\n", l2_offset);
-res->corruptions++;
-}
+l2_offset = l1_table[i] & L1E_OFFSET_MASK;
 
-/* Process and check L2 entries */
-ret = check_refcounts_l2(bs, res, refcount_table,
- refcount_table_size, l2_offset, flags,
- fix, active);
-if (ret < 0) {
-goto fail;
-}
+/* Mark L2 table as used */
+ret = qcow2_inc_refcounts_imrt(bs, res,
+   refcount_table, refcount_table_size,
+   l2_offset, s->cluster_size);
+if (ret < 0) {
+return ret;
+}
+
+/* L2 tables are cluster aligned */
+if (offset_into_cluster(s, l2_offset)) {
+fprintf(stderr, "ERROR l2_offset=%" PRIx64 ": Table is not "
+"cluster aligned; L1 entry corrupted\n", l2_offset);
+res->corruptions++;
+}
+
+/* Process and check L2 entries */
+ret = check_refcounts_l2(bs, res, refcount_table,
+ refcount_table_size, l2_offset, flags,
+ fix, active);
+if (ret < 0) {
+return ret;
 }
 }
-g_free(l1_table);
-return 0;
 
-fail:
-g_free(l1_table);
-return ret;
+return 0;
 }
 
 /*
-- 
2.29.2

[PATCH 07/10] qcow2-refcount: check_refcounts_l2(): check reserved bits

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h  |  1 +
 block/qcow2-refcount.c | 12 +++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index c0e1e83796..b8b1093b61 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -587,6 +587,7 @@ typedef enum QCow2MetadataOverlap {
 
 #define L1E_OFFSET_MASK 0x00fffe00ULL
 #define L2E_OFFSET_MASK 0x00fffe00ULL
+#define L2E_STD_RESERVED_MASK 0x3f0001feULL
 
 #define REFT_OFFSET_MASK 0xfe00ULL
 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index dc940f3003..44fc0dd5dc 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1682,8 +1682,18 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 int csize;
 l2_entry = get_l2_entry(s, l2_table, i);
 uint64_t l2_bitmap = get_l2_bitmap(s, l2_table, i);
+QCow2ClusterType type = qcow2_get_cluster_type(bs, l2_entry);
 
-switch (qcow2_get_cluster_type(bs, l2_entry)) {
+if (type != QCOW2_CLUSTER_COMPRESSED) {
+/* Check reserved bits of Standard Cluster Descriptor */
+if (l2_entry & L2E_STD_RESERVED_MASK) {
+fprintf(stderr, "ERROR found l2 entry with reserved bits set: "
+"%" PRIx64, l2_entry);
+res->corruptions++;
+}
+}
+
+switch (type) {
 case QCOW2_CLUSTER_COMPRESSED:
 /* Compressed clusters don't have QCOW_OFLAG_COPIED */
 if (l2_entry & QCOW_OFLAG_COPIED) {
-- 
2.29.2

[PATCH 05/10] qcow2-refcount: fix_l2_entry_by_zero(): also zero L2 entry bitmap

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

We'll reuse the function to fix wrong L2 entry bitmap. Support it now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2-refcount.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index f1e771d742..62d59eb2e9 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1588,7 +1588,8 @@ enum {
 };
 
 /*
- * Fix L2 entry by making it QCOW2_CLUSTER_ZERO_PLAIN.
+ * Fix L2 entry by making it QCOW2_CLUSTER_ZERO_PLAIN (or maing all its present
+ * subclusters QCOW2_SUBCLUSTER_ZERO_PLAIN).
  *
  * Function do res->corruptions-- on success, so caller is responsible to do
  * corresponding res->corruptions++ prior to the call.
@@ -1605,9 +1606,20 @@ static int fix_l2_entry_by_zero(BlockDriverState *bs, 
BdrvCheckResult *res,
 int idx = l2_index * (l2_entry_size(s) / sizeof(uint64_t));
 uint64_t l2e_offset = l2_offset + (uint64_t)l2_index * l2_entry_size(s);
 int ign = active ? QCOW2_OL_ACTIVE_L2 : QCOW2_OL_INACTIVE_L2;
-uint64_t l2_entry = has_subclusters(s) ? 0 : QCOW_OFLAG_ZERO;
 
-set_l2_entry(s, l2_table, l2_index, l2_entry);
+if (has_subclusters(s)) {
+uint64_t l2_bitmap = get_l2_bitmap(s, l2_table, l2_index);
+
+/* Allocated subclusters becomes zero */
+l2_bitmap |= l2_bitmap << 32;
+l2_bitmap &= QCOW_L2_BITMAP_ALL_ZEROES;
+
+set_l2_bitmap(s, l2_table, l2_index, l2_bitmap);
+set_l2_entry(s, l2_table, l2_index, 0);
+} else {
+set_l2_entry(s, l2_table, l2_index, QCOW_OFLAG_ZERO);
+}
+
 ret = qcow2_pre_write_overlap_check(bs, ign, l2e_offset, l2_entry_size(s),
 false);
 if (metadata_overlap) {
-- 
2.29.2

[PATCH 06/10] qcow2-refcount: check_refcounts_l2(): check l2_bitmap

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Check subcluster bitmap of the l2 entry for different types of
clusters:

 - for compressed it must be zero
 - for allocated check consistency of two parts of the bitmap
 - for unallocated all subclusters should be unallocated
   (or zero-plain)

For unallocated clusters we can safely fix the entry by making it
zero-plain.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2-refcount.c | 30 +-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 62d59eb2e9..dc940f3003 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1681,6 +1681,7 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 uint64_t coffset;
 int csize;
 l2_entry = get_l2_entry(s, l2_table, i);
+uint64_t l2_bitmap = get_l2_bitmap(s, l2_table, i);
 
 switch (qcow2_get_cluster_type(bs, l2_entry)) {
 case QCOW2_CLUSTER_COMPRESSED:
@@ -1700,6 +1701,14 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 break;
 }
 
+if (l2_bitmap) {
+fprintf(stderr, "ERROR compressed cluster %d with non-zero "
+"subcluster allocation bitmap, entry=0x%" PRIx64 "\n",
+i, l2_entry);
+res->corruptions++;
+break;
+}
+
 /* Mark cluster as used */
 qcow2_parse_compressed_l2_entry(bs, l2_entry, , );
 ret = qcow2_inc_refcounts_imrt(
@@ -1727,13 +1736,19 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 {
 uint64_t offset = l2_entry & L2E_OFFSET_MASK;
 
+if ((l2_bitmap >> 32) & l2_bitmap) {
+res->corruptions++;
+fprintf(stderr, "ERROR offset=%" PRIx64 ": Allocated "
+"cluster has corrupted subcluster allocation bitmap\n",
+offset);
+}
+
 /* Correct offsets are cluster aligned */
 if (offset_into_cluster(s, offset)) {
 bool contains_data;
 res->corruptions++;
 
 if (has_subclusters(s)) {
-uint64_t l2_bitmap = get_l2_bitmap(s, l2_table, i);
 contains_data = (l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC);
 } else {
 contains_data = !(l2_entry & QCOW_OFLAG_ZERO);
@@ -1800,6 +1815,19 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 
 case QCOW2_CLUSTER_ZERO_PLAIN:
 case QCOW2_CLUSTER_UNALLOCATED:
+if (l2_bitmap & QCOW_L2_BITMAP_ALL_ALLOC) {
+res->corruptions++;
+fprintf(stderr, "%s: Unallocated "
+"cluster has non-zero subcluster allocation map\n",
+fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
+if (fix & BDRV_FIX_ERRORS) {
+ret = fix_l2_entry_by_zero(bs, res, l2_offset, l2_table, i,
+   active, _overlap);
+if (metadata_overlap) {
+return ret;
+}
+}
+}
 break;
 
 default:
-- 
2.29.2

[PATCH 10/10] qcow2-refcount: check_refblocks(): add separate message for reserved

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Split checking for reserved bits out of aligned offset check.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h  |  1 +
 block/qcow2-refcount.c | 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 58fd7f1678..fd48a89d45 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -591,6 +591,7 @@ typedef enum QCow2MetadataOverlap {
 #define L2E_STD_RESERVED_MASK 0x3f0001feULL
 
 #define REFT_OFFSET_MASK 0xfe00ULL
+#define REFT_RESERVED_MASK 0x1ffULL
 
 #define INV_OFFSET (-1ULL)
 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 9a20aac0c9..9ae45efc75 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2090,9 +2090,17 @@ static int check_refblocks(BlockDriverState *bs, 
BdrvCheckResult *res,
 
 for(i = 0; i < s->refcount_table_size; i++) {
 uint64_t offset, cluster;
-offset = s->refcount_table[i];
+offset = s->refcount_table[i] & REFT_OFFSET_MASK;
 cluster = offset >> s->cluster_bits;
 
+if (s->refcount_table[i] & REFT_RESERVED_MASK) {
+fprintf(stderr, "ERROR refcount table entry %" PRId64 " has "
+"reserved bits set\n", i);
+res->corruptions++;
+*rebuild = true;
+continue;
+}
+
 /* Refcount blocks are cluster aligned */
 if (offset_into_cluster(s, offset)) {
 fprintf(stderr, "ERROR refcount block %" PRId64 " is not "
-- 
2.29.2

[PATCH 01/10] qcow2-refcount: improve style of check_refcounts_l2()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

 - don't use same name for size in bytes and in entries
 - use g_autofree for l2_table
 - add whitespace
 - fix block comment style

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2-refcount.c | 47 +-
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..2734338625 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1601,23 +1601,22 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
   int flags, BdrvCheckMode fix, bool active)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t *l2_table, l2_entry;
+uint64_t l2_entry;
 uint64_t next_contiguous_offset = 0;
-int i, l2_size, nb_csectors, ret;
+int i, nb_csectors, ret;
+size_t l2_size_bytes = s->l2_size * l2_entry_size(s);
+g_autofree uint64_t *l2_table = g_malloc(l2_size_bytes);
 
 /* Read L2 table from disk */
-l2_size = s->l2_size * l2_entry_size(s);
-l2_table = g_malloc(l2_size);
-
-ret = bdrv_pread(bs->file, l2_offset, l2_table, l2_size);
+ret = bdrv_pread(bs->file, l2_offset, l2_table, l2_size_bytes);
 if (ret < 0) {
 fprintf(stderr, "ERROR: I/O error in check_refcounts_l2\n");
 res->check_errors++;
-goto fail;
+return ret;
 }
 
 /* Do the actual checks */
-for(i = 0; i < s->l2_size; i++) {
+for (i = 0; i < s->l2_size; i++) {
 l2_entry = get_l2_entry(s, l2_table, i);
 
 switch (qcow2_get_cluster_type(bs, l2_entry)) {
@@ -1647,14 +1646,15 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 l2_entry & QCOW2_COMPRESSED_SECTOR_MASK,
 nb_csectors * QCOW2_COMPRESSED_SECTOR_SIZE);
 if (ret < 0) {
-goto fail;
+return ret;
 }
 
 if (flags & CHECK_FRAG_INFO) {
 res->bfi.allocated_clusters++;
 res->bfi.compressed_clusters++;
 
-/* Compressed clusters are fragmented by nature.  Since they
+/*
+ * Compressed clusters are fragmented by nature.  Since they
  * take up sub-sector space but we only have sector granularity
  * I/O we need to re-read the same sectors even for adjacent
  * compressed clusters.
@@ -1700,9 +1700,11 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 if (ret < 0) {
 fprintf(stderr, "ERROR: Overlap check failed\n");
 res->check_errors++;
-/* Something is seriously wrong, so abort checking
- * this L2 table */
-goto fail;
+/*
+ * Something is seriously wrong, so abort checking
+ * this L2 table.
+ */
+return ret;
 }
 
 ret = bdrv_pwrite_sync(bs->file, l2e_offset,
@@ -1712,13 +1714,17 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 fprintf(stderr, "ERROR: Failed to overwrite L2 "
 "table entry: %s\n", strerror(-ret));
 res->check_errors++;
-/* Do not abort, continue checking the rest of this
- * L2 table's entries */
+/*
+ * Do not abort, continue checking the rest of this
+ * L2 table's entries.
+ */
 } else {
 res->corruptions--;
 res->corruptions_fixed++;
-/* Skip marking the cluster as used
- * (it is unused now) */
+/*
+ * Skip marking the cluster as used
+ * (it is unused now).
+ */
 continue;
 }
 }
@@ -1743,7 +1749,7 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
refcount_table_size,
offset, s->cluster_size);
 if (ret < 0) {
-goto fail;
+return ret;
 }
 }
 break;
@@ -1758,12 +1764,7 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 }
 }
 
-g_free(l2_table);
 return 0;
-
-fail:
-g_free(l2_table);
-return ret;
 }
 
 /*
-- 
2.29.2

[PATCH 04/10] qcow2-refcount: introduce fix_l2_entry_by_zero()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Split fix_l2_entry_by_zero() out of check_refcounts_l2() to be
reused in further patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2-refcount.c | 87 +-
 1 file changed, 60 insertions(+), 27 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 66cbb94ef9..f1e771d742 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1587,6 +1587,54 @@ enum {
 CHECK_FRAG_INFO = 0x2,  /* update BlockFragInfo counters */
 };
 
+/*
+ * Fix L2 entry by making it QCOW2_CLUSTER_ZERO_PLAIN.
+ *
+ * Function do res->corruptions-- on success, so caller is responsible to do
+ * corresponding res->corruptions++ prior to the call.
+ *
+ * On failure in-memory @l2_table may be modified.
+ */
+static int fix_l2_entry_by_zero(BlockDriverState *bs, BdrvCheckResult *res,
+uint64_t l2_offset,
+uint64_t *l2_table, int l2_index, bool active,
+bool *metadata_overlap)
+{
+BDRVQcow2State *s = bs->opaque;
+int ret;
+int idx = l2_index * (l2_entry_size(s) / sizeof(uint64_t));
+uint64_t l2e_offset = l2_offset + (uint64_t)l2_index * l2_entry_size(s);
+int ign = active ? QCOW2_OL_ACTIVE_L2 : QCOW2_OL_INACTIVE_L2;
+uint64_t l2_entry = has_subclusters(s) ? 0 : QCOW_OFLAG_ZERO;
+
+set_l2_entry(s, l2_table, l2_index, l2_entry);
+ret = qcow2_pre_write_overlap_check(bs, ign, l2e_offset, l2_entry_size(s),
+false);
+if (metadata_overlap) {
+*metadata_overlap = ret < 0;
+}
+if (ret < 0) {
+fprintf(stderr, "ERROR: Overlap check failed\n");
+goto fail;
+}
+
+ret = bdrv_pwrite_sync(bs->file, l2e_offset, _table[idx],
+   l2_entry_size(s));
+if (ret < 0) {
+fprintf(stderr, "ERROR: Failed to overwrite L2 "
+"table entry: %s\n", strerror(-ret));
+goto fail;
+}
+
+res->corruptions--;
+res->corruptions_fixed++;
+return 0;
+
+fail:
+res->check_errors++;
+return ret;
+}
+
 /*
  * Increases the refcount in the given refcount table for the all clusters
  * referenced in the L2 table. While doing so, performs some checks on L2
@@ -1606,6 +1654,7 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 int i, ret;
 size_t l2_size_bytes = s->l2_size * l2_entry_size(s);
 g_autofree uint64_t *l2_table = g_malloc(l2_size_bytes);
+bool metadata_overlap;
 
 /* Read L2 table from disk */
 ret = bdrv_pread(bs->file, l2_offset, l2_table, l2_size_bytes);
@@ -1685,19 +1734,10 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR",
 offset);
 if (fix & BDRV_FIX_ERRORS) {
-int idx = i * (l2_entry_size(s) / sizeof(uint64_t));
-uint64_t l2e_offset =
-l2_offset + (uint64_t)i * l2_entry_size(s);
-int ign = active ? QCOW2_OL_ACTIVE_L2 :
-   QCOW2_OL_INACTIVE_L2;
-
-l2_entry = has_subclusters(s) ? 0 : QCOW_OFLAG_ZERO;
-set_l2_entry(s, l2_table, i, l2_entry);
-ret = qcow2_pre_write_overlap_check(bs, ign,
-l2e_offset, l2_entry_size(s), false);
-if (ret < 0) {
-fprintf(stderr, "ERROR: Overlap check failed\n");
-res->check_errors++;
+ret = fix_l2_entry_by_zero(bs, res, l2_offset,
+   l2_table, i, active,
+   _overlap);
+if (metadata_overlap) {
 /*
  * Something is seriously wrong, so abort checking
  * this L2 table.
@@ -1705,26 +1745,19 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 return ret;
 }
 
-ret = bdrv_pwrite_sync(bs->file, l2e_offset,
-   _table[idx],
-   l2_entry_size(s));
-if (ret < 0) {
-fprintf(stderr, "ERROR: Failed to overwrite L2 "
-"table entry: %s\n", strerror(-ret));
-res->check_errors++;
-/*
- * Do not abort, continue checking the rest of this
- * L2 table's entries.
- */
-} else {
-

[PATCH 00/10] qcow2 check: check some reserved bits and subcluster bitmaps

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Hi all!

Here are some good refactorings and new (qemu-img check) checks for
qcow2.

Vladimir Sementsov-Ogievskiy (10):
  qcow2-refcount: improve style of check_refcounts_l2()
  qcow2: compressed read: simplify cluster descriptor passing
  qcow2: introduce qcow2_parse_compressed_l2_entry() helper
  qcow2-refcount: introduce fix_l2_entry_by_zero()
  qcow2-refcount: fix_l2_entry_by_zero(): also zero L2 entry bitmap
  qcow2-refcount: check_refcounts_l2(): check l2_bitmap
  qcow2-refcount: check_refcounts_l2(): check reserved bits
  qcow2-refcount: improve style of check_refcounts_l1()
  qcow2-refcount: check_refcounts_l1(): check reserved bits
  qcow2-refcount: check_refblocks(): add separate message for reserved

 block/qcow2.h  |   7 +-
 block/qcow2-cluster.c  |  20 ++-
 block/qcow2-refcount.c | 327 ++---
 block/qcow2.c  |  13 +-
 4 files changed, 239 insertions(+), 128 deletions(-)

-- 
2.29.2

[PATCH 03/10] qcow2: introduce qcow2_parse_compressed_l2_entry() helper

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Add helper to parse compressed l2_entry and use it everywhere instead
of opencoding.

Note, that in most places we move to precise coffset/csize instead of
sector-aligned. Still it should work good enough for updating
refcounts.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/qcow2.h  |  3 ++-
 block/qcow2-cluster.c  | 15 +++
 block/qcow2-refcount.c | 36 +---
 block/qcow2.c  |  9 ++---
 4 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 42a0058ab7..c0e1e83796 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -110,7 +110,6 @@
 
 /* Defined in the qcow2 spec (compressed cluster descriptor) */
 #define QCOW2_COMPRESSED_SECTOR_SIZE 512U
-#define QCOW2_COMPRESSED_SECTOR_MASK (~(QCOW2_COMPRESSED_SECTOR_SIZE - 1ULL))
 
 /* Must be at least 2 to cover COW */
 #define MIN_L2_CACHE_SIZE 2 /* cache entries */
@@ -913,6 +912,8 @@ int qcow2_alloc_compressed_cluster_offset(BlockDriverState 
*bs,
   uint64_t offset,
   int compressed_size,
   uint64_t *host_offset);
+void qcow2_parse_compressed_l2_entry(BlockDriverState *bs, uint64_t l2_entry,
+ uint64_t *coffset, int *csize);
 
 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m);
 void qcow2_alloc_cluster_abort(BlockDriverState *bs, QCowL2Meta *m);
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 04735ee439..70d0570a33 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -2462,3 +2462,18 @@ fail:
 g_free(l1_table);
 return ret;
 }
+
+void qcow2_parse_compressed_l2_entry(BlockDriverState *bs, uint64_t l2_entry,
+ uint64_t *coffset, int *csize)
+{
+BDRVQcow2State *s = bs->opaque;
+int nb_csectors;
+
+assert(qcow2_get_cluster_type(bs, l2_entry) == QCOW2_CLUSTER_COMPRESSED);
+
+*coffset = l2_entry & s->cluster_offset_mask;
+
+nb_csectors = ((l2_entry >> s->csize_shift) & s->csize_mask) + 1;
+*csize = nb_csectors * QCOW2_COMPRESSED_SECTOR_SIZE -
+(*coffset & (QCOW2_COMPRESSED_SECTOR_SIZE - 1));
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2734338625..66cbb94ef9 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1177,11 +1177,11 @@ void qcow2_free_any_cluster(BlockDriverState *bs, 
uint64_t l2_entry,
 switch (ctype) {
 case QCOW2_CLUSTER_COMPRESSED:
 {
-int64_t offset = (l2_entry & s->cluster_offset_mask)
-& QCOW2_COMPRESSED_SECTOR_MASK;
-int size = QCOW2_COMPRESSED_SECTOR_SIZE *
-(((l2_entry >> s->csize_shift) & s->csize_mask) + 1);
-qcow2_free_clusters(bs, offset, size, type);
+uint64_t coffset;
+int csize;
+
+qcow2_parse_compressed_l2_entry(bs, l2_entry, , );
+qcow2_free_clusters(bs, coffset, csize, type);
 }
 break;
 case QCOW2_CLUSTER_NORMAL:
@@ -1247,7 +1247,7 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 bool l1_allocated = false;
 int64_t old_entry, old_l2_offset;
 unsigned slice, slice_size2, n_slices;
-int i, j, l1_modified = 0, nb_csectors;
+int i, j, l1_modified = 0;
 int ret;
 
 assert(addend >= -1 && addend <= 1);
@@ -1318,14 +1318,14 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 
 switch (qcow2_get_cluster_type(bs, entry)) {
 case QCOW2_CLUSTER_COMPRESSED:
-nb_csectors = ((entry >> s->csize_shift) &
-   s->csize_mask) + 1;
 if (addend != 0) {
-uint64_t coffset = (entry & s->cluster_offset_mask)
-& QCOW2_COMPRESSED_SECTOR_MASK;
+uint64_t coffset;
+int csize;
+
+qcow2_parse_compressed_l2_entry(bs, entry,
+, );
 ret = update_refcount(
-bs, coffset,
-nb_csectors * QCOW2_COMPRESSED_SECTOR_SIZE,
+bs, coffset, csize,
 abs(addend), addend < 0,
 QCOW2_DISCARD_SNAPSHOT);
 if (ret < 0) {
@@ -1603,7 +1603,7 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
 BDRVQcow2State *s = bs->opaque;
 uint64_t l2_entry;
 uint64_t next_contiguous_offset = 0;
-int i, nb_csectors, ret;
+int i, ret;
 size_t l2_size_bytes = s->l2_size * l2_entry_size(s);
 g_autofree uint64_t *l2_table = g_malloc(l2_size_bytes);
 
@@ -1617,6 +1617,8 @@ static int

Re: [PATCH] block/rbd: Add support for rbd image encryption

2021-05-04 Thread Daniel P . Berrangé

On Sun, May 02, 2021 at 10:36:17AM +0300, Or Ozeri wrote:
> Starting from ceph Pacific, RBD has built-in support for image-level 
> encryption.
> Currently supported formats are LUKS version 1 and 2.
> 
> There are 2 new relevant librbd APIs for controlling encryption, both expect 
> an
> open image context:
> 
> rbd_encryption_format: formats an image (i.e. writes the LUKS header)
> rbd_encryption_load: loads encryptor/decryptor to the image IO stack

> 
> This commit extends the qemu rbd driver API to support the above.
> 
> Signed-off-by: Or Ozeri 
> ---
>  block/rbd.c  | 230 ++-
>  qapi/block-core.json |  61 
>  2 files changed, 287 insertions(+), 4 deletions(-)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index f098a89c7b..1239e97889 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -108,6 +108,13 @@ typedef struct BDRVRBDState {
>  uint64_t image_size;
>  } BDRVRBDState;
>  
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION
> +typedef int (*RbdEncryptionFunc)(rbd_image_t image,
> + rbd_encryption_format_t format,
> + rbd_encryption_options_t opts,
> + size_t opts_size);
> +#endif
> +
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
>  BlockdevOptionsRbd *opts, bool cache,
>  const char *keypairs, const char *secretid,
> @@ -341,6 +348,115 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
>  }
>  }
>  
> +#ifdef LIBRBD_SUPPORTS_ENCRYPTION
> +static int qemu_rbd_convert_luks_options(
> +RbdEncryptionOptionsLUKSBase *luks_opts,
> +rbd_encryption_algorithm_t *alg,
> +char** passphrase,
> +Error **errp)
> +{
> +int r = 0;
> +
> +if (!luks_opts->has_passphrase_secret) {
> +r = -EINVAL;
> +error_setg_errno(errp, -r, "missing encrypt.passphrase-secret");
> +return r;
> +}
> +
> +*passphrase = qcrypto_secret_lookup_as_utf8(luks_opts->passphrase_secret,
> +errp);
> +if (!*passphrase) {
> +return -EIO;
> +}
> +
> +if (luks_opts->has_cipher_alg) {
> +switch (luks_opts->cipher_alg) {
> +case RBD_ENCRYPTION_ALGORITHM_AES_128: {
> +*alg = RBD_ENCRYPTION_ALGORITHM_AES128;
> +break;
> +}
> +case RBD_ENCRYPTION_ALGORITHM_AES_256: {
> +*alg = RBD_ENCRYPTION_ALGORITHM_AES256;
> +break;
> +}
> +default: {
> +r = -ENOTSUP;
> +error_setg_errno(errp, -r, "unknown encryption algorithm: 
> %u",
> + luks_opts->cipher_alg);
> +return r;
> +}
> +}
> +} else {
> +/* default alg */
> +*alg = RBD_ENCRYPTION_ALGORITHM_AES256;
> +}
> +
> +return 0;
> +}
> +
> +static int qemu_rbd_apply_encryption_function(rbd_image_t image,
> +  RbdEncryptionSpec* spec,
> +  RbdEncryptionFunc func,
> +  Error **errp)
> +{
> +int r = 0;
> +g_autofree char *passphrase = NULL;
> +g_autofree rbd_encryption_options_t opts = NULL;
> +rbd_encryption_format_t format;
> +size_t opts_size;
> +
> +switch (spec->format) {
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS1: {
> +rbd_encryption_luks1_format_options_t *luks1_opts =
> +g_new0(rbd_encryption_luks1_format_options_t, 1);
> +format = RBD_ENCRYPTION_FORMAT_LUKS1;
> +opts = luks1_opts;
> +opts_size = sizeof(rbd_encryption_luks1_format_options_t);
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKS1_base(>u.luks1),
> +_opts->alg, , errp);
> +if (passphrase) {
> +luks1_opts->passphrase = passphrase;
> +luks1_opts->passphrase_size = strlen(passphrase);
> +}
> +break;
> +}
> +case RBD_IMAGE_ENCRYPTION_FORMAT_LUKS2: {
> +rbd_encryption_luks2_format_options_t *luks2_opts =
> +g_new0(rbd_encryption_luks2_format_options_t, 1);
> +format = RBD_ENCRYPTION_FORMAT_LUKS2;
> +opts = luks2_opts;
> +opts_size = sizeof(rbd_encryption_luks2_format_options_t);
> +r = qemu_rbd_convert_luks_options(
> +qapi_RbdEncryptionOptionsLUKS2_base(>u.luks2),
> +_opts->alg, , errp);
> +if (passphrase) {
> +luks2_opts->passphrase = passphrase;
> +luks2_opts->passphrase_size = strlen(passphrase);
> +}
> +break;
> +}
> +default: {
> +

Re: [PATCH] virtio-blk: drop deprecated scsi=on|off property

2021-05-04 Thread Stefan Hajnoczi

On Thu, Apr 29, 2021 at 02:03:52PM -0400, Eduardo Habkost wrote:
> On Thu, Apr 29, 2021 at 04:52:21PM +0100, Stefan Hajnoczi wrote:
> > Live migrating old guests from an old QEMU with the SCSI feature bit
> > enabled will fail with "Features 0x... unsupported. Allowed features:
> > 0x...". We've followed the QEMU deprecation policy so users have been
> > warned...
> > 
> 
> Were they really warned, though?  People running
> "-machine pc-i440fx-2.4" might be completely unaware that it was
> silently enabling a deprecated feature.
> 
> Can we have this documented in a more explicit way?  Maybe just a
> comment at hw_compat_2_4 would be enough, to warn people doing
> backports and rebases downstream.
> 
> Can we make QEMU refuse to start if using pc-2.4 + virtio-blk
> together, just to be sure?

On second thought, do we really want to break pc-2.4 user's QEMU
command-lines if they have a virtio-blk device?

BTW Peter mentioned libvirt avoids the unnecessary scsi=off:
https://gitlab.com/libvirt/libvirt/-/commit/ec69f0190be731d12faeac08dbf63325836509a9

Stefan


signature.asc
Description: PGP signature

Re: [PATCH] virtio-blk: drop deprecated scsi=on|off property

2021-05-04 Thread Stefan Hajnoczi

On Thu, Apr 29, 2021 at 06:16:28PM +0200, Peter Krempa wrote:
> On Thu, Apr 29, 2021 at 16:52:21 +0100, Stefan Hajnoczi wrote:
> > The scsi=on|off property was deprecated in QEMU 5.0 and can be removed
> > completely at this point.
> > 
> > Drop the scsi=on|off option. It was only available on Legacy virtio-blk
> > devices. Linux v5.6 already dropped support for it.
> > 
> > Remove the hw_compat_2_4[] property assignment since scsi=on|off no
> > longer exists. Old guests with Legacy virtio-blk devices no longer see
> > the SCSI host features bit.
> 
> Does this mean that qemu rejects it if it's explicitly enabled on the
> commandline? 

Yes.

Stefan


signature.asc
Description: PGP signature

Re: [PATCH] virtio-blk: drop deprecated scsi=on|off property

2021-05-04 Thread Stefan Hajnoczi

On Thu, Apr 29, 2021 at 02:03:52PM -0400, Eduardo Habkost wrote:
> On Thu, Apr 29, 2021 at 04:52:21PM +0100, Stefan Hajnoczi wrote:
> > The scsi=on|off property was deprecated in QEMU 5.0 and can be removed
> > completely at this point.
> > 
> > Drop the scsi=on|off option. It was only available on Legacy virtio-blk
> > devices. Linux v5.6 already dropped support for it.
> > 
> > Remove the hw_compat_2_4[] property assignment since scsi=on|off no
> > longer exists. Old guests with Legacy virtio-blk devices no longer see
> > the SCSI host features bit.
> > 
> 
> This means pc-2.4 will now break guest ABI if using virtio-blk
> devices, correct?

Yes.

However, this feature was only enabled on Linux hosts, so cross-host OS
migration was always broken and no one noticed. Maybe that configuration
is too niche and QEMU never supported cross-host OS migration, but it
still means that the "pc-2.4" ABI was never solid to begin with :).

> 
> > Live migrating old guests from an old QEMU with the SCSI feature bit
> > enabled will fail with "Features 0x... unsupported. Allowed features:
> > 0x...". We've followed the QEMU deprecation policy so users have been
> > warned...
> > 
> 
> Were they really warned, though?  People running
> "-machine pc-i440fx-2.4" might be completely unaware that it was
> silently enabling a deprecated feature.
> 
> Can we have this documented in a more explicit way?  Maybe just a
> comment at hw_compat_2_4 would be enough, to warn people doing
> backports and rebases downstream.
> 
> Can we make QEMU refuse to start if using pc-2.4 + virtio-blk
> together, just to be sure?
> 
> An alternative would be keeping the property (and the
> hw_compat_2_4 entry) just to keep pc-2.4 working (until pc-2.4 is
> deprecated and removed), but refusing to realize the device if
> the feature is enabled.

Yes, the least invasive approach is to leave the property in place but
refuse to realize the virtio-blk device when scsi=on. The cost is more
cruft, including a useless scsi=off command-line option that will
continue to show up in libvirt-generated QEMU command-lines.

The cautious approach makes sense to me and I'll send a new revision.

Stefan


signature.asc
Description: PGP signature

Re: [ANNOUNCE] libblkio v0.1.0 preview release

2021-05-04 Thread Kevin Wolf

Am 30.04.2021 um 17:49 hat Stefan Hajnoczi geschrieben:
> On Thu, Apr 29, 2021 at 05:51:16PM +0200, Kevin Wolf wrote:
> > Am 29.04.2021 um 16:05 hat Stefan Hajnoczi geschrieben:
> > > Hi,
> > > A preview release of libblkio, a library for high-performance block I/O,
> > > is now available:
> > > 
> > >   https://gitlab.com/libblkio/libblkio/-/tree/v0.1.0
> > > 
> > > Applications are increasingly integrating high-performance I/O
> > > interfaces such as Linux io_uring, userspace device drivers, and
> > > vhost-user device support. The effort required to add each of these
> > > low-level interfaces into an application is relatively high. libblkio
> > > provides a single API for efficiently accessing block devices and
> > > eliminates the need to write custom code for each one.
> > > 
> > > The library is not yet ready to use and currently lacks many features.
> > > In fact, I hope this news doesn't spread too far yet because libblkio is
> > > at a very early stage, but feedback from the QEMU community is necessary
> > > at this time.
> > 
> > I'm a bit worried about the configuration interface. This looks an awful
> > lot like plain QOM properties, which have proven to result in both very
> > verbose (not to say messy) and rather error prone implementations.
> > 
> > You have to write setters/getters for every property, even if it usually
> > ends up doing the same thing, storing the value somewhere. Worse, these
> > getters and setters have to work in very different circumstances, namely
> > initialisation where you usually have to store the value away so that it
> > can be checked for consistency with other properties in .realize() or
> > .complete(), and property updates during runtime. Often enough, we
> > forget the latter and create bugs. If we don't create bugs, we usually
> > start with 'if (realized)' and have two completely different code paths.
> > Another common bug in QOM objects is forgetting to check if mandatory
> > properties were actually set.
> >
> > Did you already consider these problems and decided to go this way
> > anyway, or is this more of an accidental design? And if the former, what
> > were the reasons that made it appealing?
> 
> That's true. Here is the code to reject accesses when the instance is
> not initialized:
> 
>   self.must_be_initialized()?;
> 
> It's very concise but you still need to remember to add it.

Yes, I think the problem isn't the length (in QOM it's commonly four
lines because we don't inline it like that, but same complexity really),
but remembering to have it there.

> The overall reasons for choosing the properties API were:
> 
> 1. It keeps the library's API very minimal (just generic getters/setters
>for primitive types). It minimizes ABI compatibility issues because
>there are no configuration structs or functions exported by the
>library.
> 
> 2. It's trivial to have a string setter/getter that automatically
>converts to the primitive type representation, so application config
>file or command-line values can easily be set.
> 
>This is kind of the inverse of what you're saying. If the library
>offers dedicated interfaces for each configuration value then the
>library doesn't need getters/setters for each one but all
>applications need special-purpose code for each configuration value.
> 
> That said, this is exactly why I published the preview release. If
> someone has a better way to do this or the feedback is just that this is
> bad style, then I'm happy to change it.

I can see the advantages in this. Maybe the problem that I'm seeing is
more about implementing drivers in the Rust code rather than the
external interface in C.

I've been playing a bit with this and maybe a good part of it can be
abstracted away (maybe a bit like qdev properties abstract the
complexity of QOM properties away). If I get somewhere with this, I can
send patches.

There is one more thing I'm wondering right now: Why do we have separate
states for connecting to the backend (created) and configuring it
(initialized)? The property setters check for the right state, but they
don't really do anything that wouldn't be possible in the other state.
A state machine exposed as two boolean rather than a tristate enum feels
a bit awkward, too, but even more so if two states could possibly be
enough.

The reason why I'm asking is that in order to address my points, it
would be best to have separate property accessors for each state, and
two pairs of accessors would make property declarations more managable
than three pairs.

> > Alternatives in QEMU are qdev properties (which are internally QOM
> > properties, but provide default implementations and are at least
> > automatically read-only after realize, avoiding that whole class of
> > bugs) and QAPI.
> > If this was QEMU code, I would of course go for QAPI, but a library is
> > something different and adding the code generator would probably be a
> > bit too much anyway. But the idea in the

Re: [PATCH 2/2] qemu-img: Require -F with -b backing image

2021-05-04 Thread Eric Blake

On 5/3/21 4:45 PM, Eric Blake wrote:
> On 5/3/21 4:36 PM, Eric Blake wrote:
>> Back in commit d9f059aa6c (qemu-img: Deprecate use of -b without -F),
>> we deprecated the ability to create a file with a backing image that
>> requires qemu to perform format probing.  Qemu can still probe older
>> files for backwards compatibility, but it is time to finish off the
>> ability to create such images, due to the potential security risk they
>> present.  Update a couple of iotests affected by the change.
>>
>> Signed-off-by: Eric Blake 
>> ---
>>  docs/system/deprecated.rst   | 20 -
>>  docs/system/removed-features.rst | 19 
>>  block.c  | 37 ++--
>>  qemu-img.c   |  6 --
>>  tests/qemu-iotests/114   | 18 
>>  tests/qemu-iotests/114.out   | 11 --
>>  tests/qemu-iotests/301   |  4 +---
>>  tests/qemu-iotests/301.out   | 16 ++
>>  8 files changed, 50 insertions(+), 81 deletions(-)
> 
> I'll need a followup to fix iotest failures in 40 and 41 (apparently
> they weren't passing backing formats, but I did not catch them in my
> original cleanup of iotests back in commit b66ff2c298)

Squash in:

diff --git i/tests/qemu-iotests/040 w/tests/qemu-iotests/040
index ba7cb34ce8cf..f3677de9dfde 100755
--- i/tests/qemu-iotests/040
+++ w/tests/qemu-iotests/040
@@ -920,8 +920,8 @@ class
TestCommitWithOverriddenBacking(iotests.QMPTestCase):
 def setUp(self):
 qemu_img('create', '-f', iotests.imgfmt, self.img_base_a, '1M')
 qemu_img('create', '-f', iotests.imgfmt, self.img_base_b, '1M')
-qemu_img('create', '-f', iotests.imgfmt, '-b', self.img_base_a, \
- self.img_top)
+qemu_img('create', '-f', iotests.imgfmt, '-b', self.img_base_a,
+ '-F', iotests.imgfmt, self.img_top)

 self.vm = iotests.VM()
 self.vm.launch()
diff --git i/tests/qemu-iotests/041 w/tests/qemu-iotests/041
index 5cc02b24fc7a..db9f5dc540e8 100755
--- i/tests/qemu-iotests/041
+++ w/tests/qemu-iotests/041
@@ -1295,8 +1295,10 @@ class TestReplaces(iotests.QMPTestCase):
 class TestFilters(iotests.QMPTestCase):
 def setUp(self):
 qemu_img('create', '-f', iotests.imgfmt, backing_img, '1M')
-qemu_img('create', '-f', iotests.imgfmt, '-b', backing_img,
test_img)
-qemu_img('create', '-f', iotests.imgfmt, '-b', backing_img,
target_img)
+qemu_img('create', '-f', iotests.imgfmt, '-b', backing_img,
+ '-F', iotests.imgfmt, test_img)
+qemu_img('create', '-f', iotests.imgfmt, '-b', backing_img,
+ '-F', iotests.imgfmt, target_img)

 qemu_io('-c', 'write -P 1 0 512k', backing_img)
 qemu_io('-c', 'write -P 2 512k 512k', test_img)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

2021-05-04 Thread Michael S. Tsirkin

On Tue, May 04, 2021 at 12:57:29PM +0200, Kevin Wolf wrote:
> Am 04.05.2021 um 11:44 hat Michael S. Tsirkin geschrieben:
> > On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> > > Am 04.05.2021 um 10:59 hat Michael S. Tsirkin geschrieben:
> > > > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > > > > 
> > > > > Usually, an error during initialisation means that the configuration 
> > > > > was
> > > > > wrong. Reconnecting won't make the error go away, but just turn the
> > > > > error condition into an endless loop. Avoid this and return errors
> > > > > again.
> > > > 
> > > > So there are several possible reasons for an error:
> > > > 
> > > > 1. remote restarted - we would like to reconnect,
> > > >this was the original use-case for reconnect.
> > > > 
> > > >I am not very happy that we are killing this usecase.
> > > 
> > > This patch is killing it only during initialisation, where it's quite
> > > unlikely compared to other cases and where the current implementation is
> > > rather broken. So reverting the broken feature and going back to a
> > > simpler correct state feels like a good idea to me.
> > > 
> > > The idea is to add the "retry during initialisation" feature back on top
> > > of this, but it requires some more changes in the error paths so that we
> > > can actually distinguish different kinds of errors and don't retry when
> > > we already know that it can't succeed.
> > 
> > Okay ... let's make all this explicit in the commit log though, ok?
> 
> That's fair, I'll add a paragraph addressing this case when merging the
> series, like this:
> 
> Note that this removes the ability to reconnect during
> initialisation (but not during operation) when there is no permanent
> error, but the backend restarts, as the implementation was buggy.
> This feature can be added back in a follow-up series after changing
> error paths to distinguish cases where retrying could help from
> cases with permanent errors.
> 
> > > > 2. qemu detected an error and closed the connection
> > > >looks like we try to handle that by reconnect,
> > > >this is something we should address.
> > > 
> > > Yes, if qemu produces the error locally, retrying is useless.
> > > 
> > > > 3. remote failed due to a bad command from qemu.
> > > >this usecase isn't well supported at the moment.
> > > > 
> > > >How about supporting it on the remote side? I think that if the
> > > >data is well-formed just has a configuration remote can not support
> > > >then instead of closing the connection, remote can wait for
> > > >commands with need_reply set, and respond with an error. Or at
> > > >least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
> > > >If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
> > > >also be reasonable.
> > > > 
> > > >OTOH if qemu is buggy and sends malformed data and remote detects
> > > >that then hacing qemu retry forever is ok, might actually be
> > > >benefitial for debugging.
> > > 
> > > I haven't really checked this case yet, it seems to be less common.
> > > Explicitly communicating an error is certainly better than just cutting
> > > the connection. But as you say, it means QEMU is buggy, so blindly
> > > retrying in this case is kind of acceptable.
> > > 
> > > Raphael suggested that we could limit the number of retries during
> > > initialisation so that it wouldn't result in a hang at least.
> > 
> > not sure how do I feel about random limits ... how would we set the
> > limit?
> 
> To be honest, probably even 1 would already be good enough in practice.
> Make it 5 or something and you definitely cover any realistic case when
> there is no bug involved.
> 
> Even hitting this case once requires bad luck with the timing, so that
> the restart of the backend coincides with already having connected to
> the socket, but not completed the configuration yet, which is a really
> short window. Having the backend drop the connection again in the same
> short window on the second attempt is an almost sure sign of a bug with
> one of the operations done during initialisation.
> 
> Even if this corner case turned out to be a bit less unlikely to happen
> than I'm thinking (which is, it won't happen at all), randomly failing a
> device-add once in a while still feels a lot better than hanging the VM
> once in a while.
> 
> Kevin

Well if backend is e.g. just stuck and connection does not close, then
VM hangs anyway. So IMHO it's not such a big deal.  If we really want to
address this we should handle all this asynchronously. As in make
device-add succeed and then progress in stages but do not block the
monitor. That would be nice but it's a big change in the code.

-- 
MST

Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

2021-05-04 Thread Kevin Wolf

Am 04.05.2021 um 11:44 hat Michael S. Tsirkin geschrieben:
> On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> > Am 04.05.2021 um 10:59 hat Michael S. Tsirkin geschrieben:
> > > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > > > 
> > > > Usually, an error during initialisation means that the configuration was
> > > > wrong. Reconnecting won't make the error go away, but just turn the
> > > > error condition into an endless loop. Avoid this and return errors
> > > > again.
> > > 
> > > So there are several possible reasons for an error:
> > > 
> > > 1. remote restarted - we would like to reconnect,
> > >this was the original use-case for reconnect.
> > > 
> > >I am not very happy that we are killing this usecase.
> > 
> > This patch is killing it only during initialisation, where it's quite
> > unlikely compared to other cases and where the current implementation is
> > rather broken. So reverting the broken feature and going back to a
> > simpler correct state feels like a good idea to me.
> > 
> > The idea is to add the "retry during initialisation" feature back on top
> > of this, but it requires some more changes in the error paths so that we
> > can actually distinguish different kinds of errors and don't retry when
> > we already know that it can't succeed.
> 
> Okay ... let's make all this explicit in the commit log though, ok?

That's fair, I'll add a paragraph addressing this case when merging the
series, like this:

Note that this removes the ability to reconnect during
initialisation (but not during operation) when there is no permanent
error, but the backend restarts, as the implementation was buggy.
This feature can be added back in a follow-up series after changing
error paths to distinguish cases where retrying could help from
cases with permanent errors.

> > > 2. qemu detected an error and closed the connection
> > >looks like we try to handle that by reconnect,
> > >this is something we should address.
> > 
> > Yes, if qemu produces the error locally, retrying is useless.
> > 
> > > 3. remote failed due to a bad command from qemu.
> > >this usecase isn't well supported at the moment.
> > > 
> > >How about supporting it on the remote side? I think that if the
> > >data is well-formed just has a configuration remote can not support
> > >then instead of closing the connection, remote can wait for
> > >commands with need_reply set, and respond with an error. Or at
> > >least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
> > >If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
> > >also be reasonable.
> > > 
> > >OTOH if qemu is buggy and sends malformed data and remote detects
> > >that then hacing qemu retry forever is ok, might actually be
> > >benefitial for debugging.
> > 
> > I haven't really checked this case yet, it seems to be less common.
> > Explicitly communicating an error is certainly better than just cutting
> > the connection. But as you say, it means QEMU is buggy, so blindly
> > retrying in this case is kind of acceptable.
> > 
> > Raphael suggested that we could limit the number of retries during
> > initialisation so that it wouldn't result in a hang at least.
> 
> not sure how do I feel about random limits ... how would we set the
> limit?

To be honest, probably even 1 would already be good enough in practice.
Make it 5 or something and you definitely cover any realistic case when
there is no bug involved.

Even hitting this case once requires bad luck with the timing, so that
the restart of the backend coincides with already having connected to
the socket, but not completed the configuration yet, which is a really
short window. Having the backend drop the connection again in the same
short window on the second attempt is an almost sure sign of a bug with
one of the operations done during initialisation.

Even if this corner case turned out to be a bit less unlikely to happen
than I'm thinking (which is, it won't happen at all), randomly failing a
device-add once in a while still feels a lot better than hanging the VM
once in a while.

Kevin

Re: [PATCH] virtio-blk: drop deprecated scsi=on|off property

2021-05-04 Thread Dr. David Alan Gilbert

* Peter Krempa (pkre...@redhat.com) wrote:
> On Fri, Apr 30, 2021 at 09:42:05 +0200, Markus Armbruster wrote:
> > Eduardo Habkost  writes:
> > 
> > > On Thu, Apr 29, 2021 at 04:52:21PM +0100, Stefan Hajnoczi wrote:
> > >> The scsi=on|off property was deprecated in QEMU 5.0 and can be removed
> > >> completely at this point.
> > >> 
> > >> Drop the scsi=on|off option. It was only available on Legacy virtio-blk
> > >> devices. Linux v5.6 already dropped support for it.
> > >> 
> > >> Remove the hw_compat_2_4[] property assignment since scsi=on|off no
> > >> longer exists. Old guests with Legacy virtio-blk devices no longer see
> > >> the SCSI host features bit.
> > >> 
> > >
> > > This means pc-2.4 will now break guest ABI if using virtio-blk
> > > devices, correct?
> > >
> > > This looks like a sign we should have deprecated pc-2.4 a long
> > > time ago.
> > 
> > The last batch of PC machine type retiring was pc-1.0 to pc-1.3:
> > deprecated in 5.0 (commit 30d2a17b4, Dec 2019), dropped in 6.0 (commit
> > f862ddbb1, just weeks ago).  pc-1.3 was a bit over seven years old when
> > we released 5.0.  pc-2.4 will be six years old by the time we release
> > 6.1.  Fair game?
> 
> As a data-point, libvirt will be dropping support for  (release, not the machine type) in the upcomming release. I'm not sure
> whether that justifies more deprecation though.

What qemu features will you then be relying on?

Dave

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

[PATCH v2 0/5] block permission updated follow-up

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

v2: rebased on Kevin's "[PATCH 0/2] block: Fix Transaction leaks"
1: add assertions and drop extra declaration
2: add Alberto's r-b
3: improve commit message

Based-on: <20210503110555.24001-1-kw...@redhat.com>

Vladimir Sementsov-Ogievskiy (5):
  block: document child argument of bdrv_attach_child_common()
  block-backend: improve blk_root_get_parent_desc()
  block: improve bdrv_child_get_parent_desc()
  block: simplify bdrv_child_user_desc()
  block: improve permission conflict error message

 block.c   | 61 +--
 block/block-backend.c |  9 ++--
 tests/qemu-iotests/283.out|  2 +-
 tests/qemu-iotests/307.out|  2 +-
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 5 files changed, 46 insertions(+), 30 deletions(-)

-- 
2.29.2

[PATCH v2 2/5] block-backend: improve blk_root_get_parent_desc()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

We have different types of parents: block nodes, block backends and
jobs. So, it makes sense to specify type together with name.

While being here also use g_autofree.

iotest 307 output is updated.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 block/block-backend.c  | 9 -
 tests/qemu-iotests/307.out | 2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 6fca9853e1..2b7e9b5192 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -142,19 +142,18 @@ static void blk_root_set_aio_ctx(BdrvChild *child, 
AioContext *ctx,
 static char *blk_root_get_parent_desc(BdrvChild *child)
 {
 BlockBackend *blk = child->opaque;
-char *dev_id;
+g_autofree char *dev_id = NULL;
 
 if (blk->name) {
-return g_strdup(blk->name);
+return g_strdup_printf("block device '%s'", blk->name);
 }
 
 dev_id = blk_get_attached_dev_id(blk);
 if (*dev_id) {
-return dev_id;
+return g_strdup_printf("block device '%s'", dev_id);
 } else {
 /* TODO Callback into the BB owner for something more detailed */
-g_free(dev_id);
-return g_strdup("a block device");
+return g_strdup("unnamed block device");
 }
 }
 
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index daa8ad2da0..66bf2ddb74 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -53,7 +53,7 @@ exports available: 1
 
 === Add a writable export ===
 {"execute": "block-export-add", "arguments": {"description": "This is the 
writable second export", "id": "export1", "name": "export1", "node-name": 
"fmt", "type": "nbd", "writable": true, "writethrough": true}}
-{"error": {"class": "GenericError", "desc": "Conflicts with use by sda as 
'root', which does not allow 'write' on fmt"}}
+{"error": {"class": "GenericError", "desc": "Conflicts with use by block 
device 'sda' as 'root', which does not allow 'write' on fmt"}}
 {"execute": "device_del", "arguments": {"id": "sda"}}
 {"return": {}}
 {"data": {"device": "sda", "path": "/machine/peripheral/sda"}, "event": 
"DEVICE_DELETED", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
-- 
2.29.2

[PATCH v2 1/5] block: document child argument of bdrv_attach_child_common()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

The logic around **child is not obvious: this reference is used not
only to return resulting child, but also to rollback NULL value on
transaction abort.

So, let's add documentation and some assertions.

While being here, drop extra declaration of bdrv_attach_child_noperm().

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 69615fabd1..b9df90d61d 100644
--- a/block.c
+++ b/block.c
@@ -85,14 +85,6 @@ static BlockDriverState *bdrv_open_inherit(const char 
*filename,
 
 static void bdrv_replace_child_noperm(BdrvChild *child,
   BlockDriverState *new_bs);
-static int bdrv_attach_child_noperm(BlockDriverState *parent_bs,
-BlockDriverState *child_bs,
-const char *child_name,
-const BdrvChildClass *child_class,
-BdrvChildRole child_role,
-BdrvChild **child,
-Transaction *tran,
-Error **errp);
 static void bdrv_remove_filter_or_cow_child(BlockDriverState *bs,
 Transaction *tran);
 
@@ -2762,6 +2754,12 @@ static TransactionActionDrv bdrv_attach_child_common_drv 
= {
 
 /*
  * Common part of attaching bdrv child to bs or to blk or to job
+ *
+ * Resulting new child is returned through @child.
+ * At start *@child must be NULL.
+ * @child is saved to a new entry of @tran, so that *@child could be reverted 
to
+ * NULL on abort(). So referenced variable must live at least until transaction
+ * end.
  */
 static int bdrv_attach_child_common(BlockDriverState *child_bs,
 const char *child_name,
@@ -2836,6 +2834,10 @@ static int bdrv_attach_child_common(BlockDriverState 
*child_bs,
 return 0;
 }
 
+/*
+ * Variable referenced by @child must live at least until transaction end.
+ * (see bdrv_attach_child_common() doc for details)
+ */
 static int bdrv_attach_child_noperm(BlockDriverState *parent_bs,
 BlockDriverState *child_bs,
 const char *child_name,
@@ -2918,7 +2920,6 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
child_role, perm, shared_perm, opaque,
, tran, errp);
 if (ret < 0) {
-assert(child == NULL);
 goto out;
 }
 
@@ -2926,6 +2927,9 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
 
 out:
 tran_finalize(tran, ret);
+/* child is unset on failure by bdrv_attach_child_common_abort() */
+assert((ret < 0) == !child);
+
 bdrv_unref(child_bs);
 return child;
 }
@@ -2965,6 +2969,8 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 
 out:
 tran_finalize(tran, ret);
+/* child is unset on failure by bdrv_attach_child_common_abort() */
+assert((ret < 0) == !child);
 
 bdrv_unref(child_bs);
 
-- 
2.29.2

[PATCH v2 5/5] block: improve permission conflict error message

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Now permissions are updated as follows:
 1. do graph modifications ignoring permissions
 2. do permission update

 (of course, we rollback [1] if [2] fails)

So, on stage [2] we can't say which users are "old" and which are
"new" and exist only since [1]. And current error message is a bit
outdated. Let's improve it, to make everything clean.

While being here, add also a comment and some good assertions.

iotests 283, 307, qsd-jobs outputs are updated.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c   | 29 ---
 tests/qemu-iotests/283.out|  2 +-
 tests/qemu-iotests/307.out|  2 +-
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/block.c b/block.c
index 2f73523285..354438d918 100644
--- a/block.c
+++ b/block.c
@@ -2032,20 +2032,35 @@ static char *bdrv_child_user_desc(BdrvChild *c)
 return c->klass->get_parent_desc(c);
 }
 
+/*
+ * Check that @a allows everything that @b needs. @a and @b must reference same
+ * child node.
+ */
 static bool bdrv_a_allow_b(BdrvChild *a, BdrvChild *b, Error **errp)
 {
-g_autofree char *user = NULL;
-g_autofree char *perm_names = NULL;
+g_autofree char *a_user = NULL;
+g_autofree char *a_against = NULL;
+g_autofree char *b_user = NULL;
+g_autofree char *b_perm = NULL;
+
+assert(a->bs);
+assert(a->bs == b->bs);
 
 if ((b->perm & a->shared_perm) == b->perm) {
 return true;
 }
 
-perm_names = bdrv_perm_names(b->perm & ~a->shared_perm);
-user = bdrv_child_user_desc(a);
-error_setg(errp, "Conflicts with use by %s as '%s', which does not "
-   "allow '%s' on %s",
-   user, a->name, perm_names, bdrv_get_node_name(b->bs));
+a_user = bdrv_child_user_desc(a);
+a_against = bdrv_perm_names(b->perm & ~a->shared_perm);
+
+b_user = bdrv_child_user_desc(b);
+b_perm = bdrv_perm_names(b->perm);
+
+error_setg(errp, "Permission conflict on node '%s': %s wants to use it as "
+   "'%s', which requires these permissions: %s. On the other hand 
%s "
+   "wants to use it as '%s', which doesn't share: %s",
+   bdrv_get_node_name(b->bs),
+   b_user, b->name, b_perm, a_user, a->name, a_against);
 
 return false;
 }
diff --git a/tests/qemu-iotests/283.out b/tests/qemu-iotests/283.out
index c9397bfc44..92f3cc1ed5 100644
--- a/tests/qemu-iotests/283.out
+++ b/tests/qemu-iotests/283.out
@@ -5,7 +5,7 @@
 {"execute": "blockdev-add", "arguments": {"driver": "blkdebug", "image": 
"base", "node-name": "other", "take-child-perms": ["write"]}}
 {"return": {}}
 {"execute": "blockdev-backup", "arguments": {"device": "source", "sync": 
"full", "target": "target"}}
-{"error": {"class": "GenericError", "desc": "Cannot append backup-top filter: 
Conflicts with use by node 'source' as 'image', which does not allow 'write' on 
base"}}
+{"error": {"class": "GenericError", "desc": "Cannot append backup-top filter: 
Permission conflict on node 'base': node 'other' wants to use it as 'image', 
which requires these permissions: write. On the other hand node 'source' wants 
to use it as 'image', which doesn't share: write"}}
 
 === backup-top should be gone after job-finalize ===
 
diff --git a/tests/qemu-iotests/307.out b/tests/qemu-iotests/307.out
index 66bf2ddb74..e03932ba4f 100644
--- a/tests/qemu-iotests/307.out
+++ b/tests/qemu-iotests/307.out
@@ -53,7 +53,7 @@ exports available: 1
 
 === Add a writable export ===
 {"execute": "block-export-add", "arguments": {"description": "This is the 
writable second export", "id": "export1", "name": "export1", "node-name": 
"fmt", "type": "nbd", "writable": true, "writethrough": true}}
-{"error": {"class": "GenericError", "desc": "Conflicts with use by block 
device 'sda' as 'root', which does not allow 'write' on fmt"}}
+{"error": {"class": "GenericError", "desc": "Permission conflict on node 
'fmt': unnamed block device wants to use it as 'root', which requires these 
permissions: consistent read, write. On the other hand block device 'sda' wants 
to use it as 'root', which doesn't share: write"}}
 {"execute": "device_del", "arguments": {"id": "sda"}}
 {"return": {}}
 {"data": {"device": "sda", "path": "/machine/peripheral/sda"}, "event": 
"DEVICE_DELETED", "timestamp": {"microseconds": "USECS", "seconds": "SECS"}}
diff --git a/tests/qemu-iotests/tests/qsd-jobs.out 
b/tests/qemu-iotests/tests/qsd-jobs.out
index 9f52255da8..b0596d2c95 100644
--- a/tests/qemu-iotests/tests/qsd-jobs.out
+++ b/tests/qemu-iotests/tests/qsd-jobs.out
@@ -16,7 +16,7 @@ QMP_VERSION
 {"return": {}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"JOB_STATUS_CHANGE", "data": {"status": "created", "id": "job0"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": 
"JOB_STATUS_CHANGE", "data": {"status": "null", "id": "job0"}}
-{"error": {"class":

[PATCH v2 4/5] block: simplify bdrv_child_user_desc()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

All existing parent types (block nodes, block devices, jobs) has the
realization. So, drop unreachable code.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/block.c b/block.c
index 54a3da9311..2f73523285 100644
--- a/block.c
+++ b/block.c
@@ -2029,11 +2029,7 @@ bool bdrv_is_writable(BlockDriverState *bs)
 
 static char *bdrv_child_user_desc(BdrvChild *c)
 {
-if (c->klass->get_parent_desc) {
-return c->klass->get_parent_desc(c);
-}
-
-return g_strdup("another user");
+return c->klass->get_parent_desc(c);
 }
 
 static bool bdrv_a_allow_b(BdrvChild *a, BdrvChild *b, Error **errp)
-- 
2.29.2

[PATCH v2 3/5] block: improve bdrv_child_get_parent_desc()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

We have different types of parents: block nodes, block backends and
jobs. So, it makes sense to specify type together with name.

Next, this handler us used to compose an error message about permission
conflict. And permission conflict occurs in a specific place of block
graph. We shouldn't report name of parent device (as it refers another
place in block graph), but exactly and only the name of the node. So,
use bdrv_get_node_name() directly.

iotest 283 output is updated.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block.c| 2 +-
 tests/qemu-iotests/283.out | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index b9df90d61d..54a3da9311 100644
--- a/block.c
+++ b/block.c
@@ -1152,7 +1152,7 @@ int bdrv_parse_cache_mode(const char *mode, int *flags, 
bool *writethrough)
 static char *bdrv_child_get_parent_desc(BdrvChild *c)
 {
 BlockDriverState *parent = c->opaque;
-return g_strdup(bdrv_get_device_or_node_name(parent));
+return g_strdup_printf("node '%s'", bdrv_get_node_name(parent));
 }
 
 static void bdrv_child_cb_drained_begin(BdrvChild *child)
diff --git a/tests/qemu-iotests/283.out b/tests/qemu-iotests/283.out
index 97e62a4c94..c9397bfc44 100644
--- a/tests/qemu-iotests/283.out
+++ b/tests/qemu-iotests/283.out
@@ -5,7 +5,7 @@
 {"execute": "blockdev-add", "arguments": {"driver": "blkdebug", "image": 
"base", "node-name": "other", "take-child-perms": ["write"]}}
 {"return": {}}
 {"execute": "blockdev-backup", "arguments": {"device": "source", "sync": 
"full", "target": "target"}}
-{"error": {"class": "GenericError", "desc": "Cannot append backup-top filter: 
Conflicts with use by source as 'image', which does not allow 'write' on base"}}
+{"error": {"class": "GenericError", "desc": "Cannot append backup-top filter: 
Conflicts with use by node 'source' as 'image', which does not allow 'write' on 
base"}}
 
 === backup-top should be gone after job-finalize ===
 
-- 
2.29.2

Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

2021-05-04 Thread Michael S. Tsirkin

On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> Am 04.05.2021 um 10:59 hat Michael S. Tsirkin geschrieben:
> > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > > 
> > > Usually, an error during initialisation means that the configuration was
> > > wrong. Reconnecting won't make the error go away, but just turn the
> > > error condition into an endless loop. Avoid this and return errors
> > > again.
> > 
> > So there are several possible reasons for an error:
> > 
> > 1. remote restarted - we would like to reconnect,
> >this was the original use-case for reconnect.
> > 
> >I am not very happy that we are killing this usecase.
> 
> This patch is killing it only during initialisation, where it's quite
> unlikely compared to other cases and where the current implementation is
> rather broken. So reverting the broken feature and going back to a
> simpler correct state feels like a good idea to me.
> 
> The idea is to add the "retry during initialisation" feature back on top
> of this, but it requires some more changes in the error paths so that we
> can actually distinguish different kinds of errors and don't retry when
> we already know that it can't succeed.

Okay ... let's make all this explicit in the commit log though, ok?

> > 2. qemu detected an error and closed the connection
> >looks like we try to handle that by reconnect,
> >this is something we should address.
> 
> Yes, if qemu produces the error locally, retrying is useless.
> 
> > 3. remote failed due to a bad command from qemu.
> >this usecase isn't well supported at the moment.
> > 
> >How about supporting it on the remote side? I think that if the
> >data is well-formed just has a configuration remote can not support
> >then instead of closing the connection, remote can wait for
> >commands with need_reply set, and respond with an error. Or at
> >least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
> >If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
> >also be reasonable.
> > 
> >OTOH if qemu is buggy and sends malformed data and remote detects
> >that then hacing qemu retry forever is ok, might actually be
> >benefitial for debugging.
> 
> I haven't really checked this case yet, it seems to be less common.
> Explicitly communicating an error is certainly better than just cutting
> the connection. But as you say, it means QEMU is buggy, so blindly
> retrying in this case is kind of acceptable.
> 
> Raphael suggested that we could limit the number of retries during
> initialisation so that it wouldn't result in a hang at least.

not sure how do I feel about random limits ... how would we
set the limit?


> > > Additionally, calling vhost_user_blk_disconnect() from the chardev event
> > > handler could result in use-after-free because none of the
> > > initialisation code expects that the device could just go away in the
> > > middle. So removing the call fixes crashes in several places.
> > > For example, using a num-queues setting that is incompatible with the
> > > backend would result in a crash like this (dereferencing dev->opaque,
> > > which is already NULL):
> > > 
> > >  #0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
> > > condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
> > > ../hw/virtio/vhost-user.c:313
> > >  #1  0x55d950d3 in qio_channel_fd_source_dispatch 
> > > (source=0x57c3f750, callback=0x55d0a478 , 
> > > user_data=0x7fffcbf0) at ../io/channel-watch.c:84
> > >  #2  0x77b32a9f in g_main_context_dispatch () at 
> > > /lib64/libglib-2.0.so.0
> > >  #3  0x77b84a98 in g_main_context_iterate.constprop () at 
> > > /lib64/libglib-2.0.so.0
> > >  #4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> > >  #5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
> > > msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
> > >  #6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
> > > config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
> > >  #7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
> > > config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
> > >  #8  0x55cdd150 in vhost_user_blk_device_realize 
> > > (dev=0x57bc60b0, errp=0x7fffcf90) at 
> > > ../hw/block/vhost-user-blk.c:510
> > >  #9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
> > > errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660
> > 
> > Right. So that's definitely something to fix.
> > 
> > > 
> > > Signed-off-by: Kevin Wolf 
> 
> Kevin

Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

2021-05-04 Thread Kevin Wolf

Am 04.05.2021 um 10:59 hat Michael S. Tsirkin geschrieben:
> On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > 
> > Usually, an error during initialisation means that the configuration was
> > wrong. Reconnecting won't make the error go away, but just turn the
> > error condition into an endless loop. Avoid this and return errors
> > again.
> 
> So there are several possible reasons for an error:
> 
> 1. remote restarted - we would like to reconnect,
>this was the original use-case for reconnect.
> 
>I am not very happy that we are killing this usecase.

This patch is killing it only during initialisation, where it's quite
unlikely compared to other cases and where the current implementation is
rather broken. So reverting the broken feature and going back to a
simpler correct state feels like a good idea to me.

The idea is to add the "retry during initialisation" feature back on top
of this, but it requires some more changes in the error paths so that we
can actually distinguish different kinds of errors and don't retry when
we already know that it can't succeed.

> 2. qemu detected an error and closed the connection
>looks like we try to handle that by reconnect,
>this is something we should address.

Yes, if qemu produces the error locally, retrying is useless.

> 3. remote failed due to a bad command from qemu.
>this usecase isn't well supported at the moment.
> 
>How about supporting it on the remote side? I think that if the
>data is well-formed just has a configuration remote can not support
>then instead of closing the connection, remote can wait for
>commands with need_reply set, and respond with an error. Or at
>least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
>If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
>also be reasonable.
> 
>OTOH if qemu is buggy and sends malformed data and remote detects
>that then hacing qemu retry forever is ok, might actually be
>benefitial for debugging.

I haven't really checked this case yet, it seems to be less common.
Explicitly communicating an error is certainly better than just cutting
the connection. But as you say, it means QEMU is buggy, so blindly
retrying in this case is kind of acceptable.

Raphael suggested that we could limit the number of retries during
initialisation so that it wouldn't result in a hang at least.

> > Additionally, calling vhost_user_blk_disconnect() from the chardev event
> > handler could result in use-after-free because none of the
> > initialisation code expects that the device could just go away in the
> > middle. So removing the call fixes crashes in several places.
> > For example, using a num-queues setting that is incompatible with the
> > backend would result in a crash like this (dereferencing dev->opaque,
> > which is already NULL):
> > 
> >  #0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
> > condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
> > ../hw/virtio/vhost-user.c:313
> >  #1  0x55d950d3 in qio_channel_fd_source_dispatch 
> > (source=0x57c3f750, callback=0x55d0a478 , 
> > user_data=0x7fffcbf0) at ../io/channel-watch.c:84
> >  #2  0x77b32a9f in g_main_context_dispatch () at 
> > /lib64/libglib-2.0.so.0
> >  #3  0x77b84a98 in g_main_context_iterate.constprop () at 
> > /lib64/libglib-2.0.so.0
> >  #4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> >  #5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
> > msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
> >  #6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
> > config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
> >  #7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
> > config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
> >  #8  0x55cdd150 in vhost_user_blk_device_realize 
> > (dev=0x57bc60b0, errp=0x7fffcf90) at 
> > ../hw/block/vhost-user-blk.c:510
> >  #9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
> > errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660
> 
> Right. So that's definitely something to fix.
> 
> > 
> > Signed-off-by: Kevin Wolf 

Kevin

[PULL 3/9] simplebench/bench-backup: add --compressed option

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Allow bench compressed backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 scripts/simplebench/bench-backup.py| 55 ++
 scripts/simplebench/bench_block_job.py | 23 +++
 2 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/scripts/simplebench/bench-backup.py 
b/scripts/simplebench/bench-backup.py
index 33a1ecfefa..72eae85bb1 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -23,7 +23,7 @@
 
 import simplebench
 from results_to_text import results_to_text
-from bench_block_job import bench_block_copy, drv_file, drv_nbd
+from bench_block_job import bench_block_copy, drv_file, drv_nbd, drv_qcow2
 
 
 def bench_func(env, case):
@@ -37,29 +37,41 @@ def bench_func(env, case):
 def bench(args):
 test_cases = []
 
-sources = {}
-targets = {}
-for d in args.dir:
-label, path = d.split(':')  # paths with colon not supported
-sources[label] = drv_file(path + '/test-source')
-targets[label] = drv_file(path + '/test-target')
+# paths with colon not supported, so we just split by ':'
+dirs = dict(d.split(':') for d in args.dir)
 
+nbd_drv = None
 if args.nbd:
 nbd = args.nbd.split(':')
 host = nbd[0]
 port = '10809' if len(nbd) == 1 else nbd[1]
-drv = drv_nbd(host, port)
-sources['nbd'] = drv
-targets['nbd'] = drv
+nbd_drv = drv_nbd(host, port)
 
 for t in args.test:
 src, dst = t.split(':')
 
-test_cases.append({
-'id': t,
-'source': sources[src],
-'target': targets[dst]
-})
+if src == 'nbd' and dst == 'nbd':
+raise ValueError("Can't use 'nbd' label for both src and dst")
+
+if (src == 'nbd' or dst == 'nbd') and not nbd_drv:
+raise ValueError("'nbd' label used but --nbd is not given")
+
+if src == 'nbd':
+source = nbd_drv
+else:
+source = drv_file(dirs[src] + '/test-source')
+
+if dst == 'nbd':
+test_cases.append({'id': t, 'source': source, 'target': nbd_drv})
+continue
+
+fname = dirs[dst] + '/test-target'
+if args.compressed:
+fname += '.qcow2'
+target = drv_file(fname)
+if args.compressed:
+target = drv_qcow2(target)
+test_cases.append({'id': t, 'source': source, 'target': target})
 
 binaries = []  # list of (, , [])
 for i, q in enumerate(args.env):
@@ -106,6 +118,13 @@ def bench(args):
 elif opt.startswith('max-workers='):
 x_perf['max-workers'] = int(opt.split('=')[1])
 
+backup_options = {}
+if x_perf:
+backup_options['x-perf'] = x_perf
+
+if args.compressed:
+backup_options['compress'] = True
+
 if is_mirror:
 assert not x_perf
 test_envs.append({
@@ -117,7 +136,7 @@ def bench(args):
 test_envs.append({
 'id': f'backup({label})\n' + '\n'.join(opts),
 'cmd': 'blockdev-backup',
-'cmd-options': {'x-perf': x_perf} if x_perf else {},
+'cmd-options': backup_options,
 'qemu-binary': path
 })
 
@@ -163,5 +182,9 @@ def __call__(self, parser, namespace, values, 
option_string=None):
 p.add_argument('--test', nargs='+', help='''\
 Tests, in form source-dir-label:target-dir-label''',
action=ExtendAction)
+p.add_argument('--compressed', help='''\
+Use compressed backup. It automatically means
+automatically creating qcow2 target with
+lazy_refcounts for each test run''', action='store_true')
 
 bench(p.parse_args())
diff --git a/scripts/simplebench/bench_block_job.py 
b/scripts/simplebench/bench_block_job.py
index 7332845c1c..08f86ed9c1 100755
--- a/scripts/simplebench/bench_block_job.py
+++ b/scripts/simplebench/bench_block_job.py
@@ -21,6 +21,7 @@
 
 import sys
 import os
+import subprocess
 import socket
 import json
 
@@ -77,11 +78,29 @@ def bench_block_job(cmd, cmd_args, qemu_args):
 return {'seconds': (end_ms - start_ms) / 100.0}
 
 
+def get_image_size(path):
+out = subprocess.run(['qemu-img', 'info', '--out=json', path],
+ stdout=subprocess.PIPE, check=True).stdout
+return json.loads(out)['virtual-size']
+
+
 # Bench backup or mirror
 def bench_block_copy(qemu_binary, cmd, cmd_options, source, target):
 """Helper to run bench_block_job() for mirror or backup"""
 assert cmd in ('blockdev-backup', 'blockdev-mirror')
 
+if target['driver'] == 'qcow2':
+try:
+os.remove(target['file']['filename'])
+except OSError:
+pass
+
+subprocess.run(['qemu-img', 'create', '-f', 'qcow2',
+target['file']['filename'],
+str(get_image_size(source['filename']))],
+

[PULL 7/9] simplebench/bench-backup: add --count and --no-initial-run

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Add arguments to set number of test runs per table cell and to disable
initial run that is not counted in results.

It's convenient to set --count 1 --no-initial-run to fast run test
onece, and to set --count to some large enough number for good
precision of the results.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 scripts/simplebench/bench-backup.py | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/scripts/simplebench/bench-backup.py 
b/scripts/simplebench/bench-backup.py
index a2120fcbf0..092fed5816 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -155,7 +155,8 @@ def bench(args):
 'qemu-binary': path
 })
 
-result = simplebench.bench(bench_func, test_envs, test_cases, count=3)
+result = simplebench.bench(bench_func, test_envs, test_cases,
+   count=args.count, initial_run=args.initial_run)
 with open('results.json', 'w') as f:
 json.dump(result, f, indent=4)
 print(results_to_text(result))
@@ -211,4 +212,13 @@ def __call__(self, parser, namespace, values, 
option_string=None):
both: generate two test cases for each src:dst pair''',
default='direct', choices=('direct', 'cached', 'both'))
 
+p.add_argument('--count', type=int, default=3, help='''\
+Number of test runs per table cell''')
+
+# BooleanOptionalAction helps to support --no-initial-run option
+p.add_argument('--initial-run', action=argparse.BooleanOptionalAction,
+   help='''\
+Do additional initial run per cell which doesn't count in result,
+default true''')
+
 bench(p.parse_args())
-- 
2.29.2

Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

2021-05-04 Thread Kevin Wolf

Am 03.05.2021 um 19:01 hat Raphael Norwitz geschrieben:
> So we're not going with the suggestion to retry once or a fixed number
> of times? Any reason why not?

I thought we agreed that we'd add reconnection back in a follow-up
series that also addresses the different kinds of errors and retries
only when it makes sense?

Kevin

> On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > 
> > Usually, an error during initialisation means that the configuration was
> > wrong. Reconnecting won't make the error go away, but just turn the
> > error condition into an endless loop. Avoid this and return errors
> > again.
> > 
> > Additionally, calling vhost_user_blk_disconnect() from the chardev event
> > handler could result in use-after-free because none of the
> > initialisation code expects that the device could just go away in the
> > middle. So removing the call fixes crashes in several places.
> > 
> > For example, using a num-queues setting that is incompatible with the
> > backend would result in a crash like this (dereferencing dev->opaque,
> > which is already NULL):
> > 
> >  #0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
> > condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
> > ../hw/virtio/vhost-user.c:313
> >  #1  0x55d950d3 in qio_channel_fd_source_dispatch 
> > (source=0x57c3f750, callback=0x55d0a478 , 
> > user_data=0x7fffcbf0) at ../io/channel-watch.c:84
> >  #2  0x77b32a9f in g_main_context_dispatch () at 
> > /lib64/libglib-2.0.so.0
> >  #3  0x77b84a98 in g_main_context_iterate.constprop () at 
> > /lib64/libglib-2.0.so.0
> >  #4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> >  #5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
> > msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
> >  #6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
> > config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
> >  #7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
> > config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
> >  #8  0x55cdd150 in vhost_user_blk_device_realize 
> > (dev=0x57bc60b0, errp=0x7fffcf90) at 
> > ../hw/block/vhost-user-blk.c:510
> >  #9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
> > errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  hw/block/vhost-user-blk.c | 59 +++
> >  1 file changed, 17 insertions(+), 42 deletions(-)
> > 
> > diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> > index 7c85248a7b..c0b9958da1 100644
> > --- a/hw/block/vhost-user-blk.c
> > +++ b/hw/block/vhost-user-blk.c
> > @@ -50,6 +50,8 @@ static const int user_feature_bits[] = {
> >  VHOST_INVALID_FEATURE_BIT
> >  };
> >  
> > +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);
> > +
> >  static void vhost_user_blk_update_config(VirtIODevice *vdev, uint8_t 
> > *config)
> >  {
> >  VHostUserBlk *s = VHOST_USER_BLK(vdev);
> > @@ -362,19 +364,6 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
> >  vhost_dev_cleanup(>dev);
> >  }
> >  
> > -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
> > - bool realized);
> > -
> > -static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event)
> > -{
> > -vhost_user_blk_event(opaque, event, false);
> > -}
> > -
> > -static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
> > -{
> > -vhost_user_blk_event(opaque, event, true);
> > -}
> > -
> >  static void vhost_user_blk_chr_closed_bh(void *opaque)
> >  {
> >  DeviceState *dev = opaque;
> > @@ -382,12 +371,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
> >  VHostUserBlk *s = VHOST_USER_BLK(vdev);
> >  
> >  vhost_user_blk_disconnect(dev);
> > -qemu_chr_fe_set_handlers(>chardev, NULL, NULL,
> > -vhost_user_blk_event_oper, NULL, opaque, NULL, true);
> > +qemu_chr_fe_set_handlers(>chardev, NULL, NULL, vhost_user_blk_event,
> > + NULL, opaque, NULL, true);
> >  }
> >  
> > -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
> > - bool realized)
> > +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event)
> >  {
> >  DeviceState *dev = opaque;
> >  VirtIODevice *vdev = VIRTIO_DEVICE(dev);
> > @@ -401,17 +389,7 @@ static void vhost_user_blk_event(void *opaque, 
> > QEMUChrEvent event,
> >  }
> >  break;
> >  case CHR_EVENT_CLOSED:
> > -/*
> > - * Closing the connection should happen differently on device
> > - * initialization and operation stages.
> > - * On initalization, we want to re-start vhost_dev initialization
> > -

[PULL 9/9] MAINTAINERS: update Benchmark util: add git tree

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4c05ff8bba..f9f2acea8f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2531,6 +2531,7 @@ Benchmark util
 M: Vladimir Sementsov-Ogievskiy 
 S: Maintained
 F: scripts/simplebench/
+T: git https://src.openvz.org/scm/~vsementsov/qemu.git simplebench
 
 Transactions helper
 M: Vladimir Sementsov-Ogievskiy 
-- 
2.29.2

[PULL 8/9] simplebench/bench-backup: add --drop-caches argument

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Add an option to drop caches before each test run. It may probably
improve reliability of results when testing in cached mode.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 scripts/simplebench/bench-backup.py |  6 +-
 scripts/simplebench/simplebench.py  | 11 ++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/scripts/simplebench/bench-backup.py 
b/scripts/simplebench/bench-backup.py
index 092fed5816..5a0675c593 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -156,7 +156,8 @@ def bench(args):
 })
 
 result = simplebench.bench(bench_func, test_envs, test_cases,
-   count=args.count, initial_run=args.initial_run)
+   count=args.count, initial_run=args.initial_run,
+   drop_caches=args.drop_caches)
 with open('results.json', 'w') as f:
 json.dump(result, f, indent=4)
 print(results_to_text(result))
@@ -221,4 +222,7 @@ def __call__(self, parser, namespace, values, 
option_string=None):
 Do additional initial run per cell which doesn't count in result,
 default true''')
 
+p.add_argument('--drop-caches', action='store_true', help='''\
+Do "sync; echo 3 > /proc/sys/vm/drop_caches" before each test run''')
+
 bench(p.parse_args())
diff --git a/scripts/simplebench/simplebench.py 
b/scripts/simplebench/simplebench.py
index 27bc4d4715..8efca2af98 100644
--- a/scripts/simplebench/simplebench.py
+++ b/scripts/simplebench/simplebench.py
@@ -19,11 +19,17 @@
 #
 
 import statistics
+import subprocess
 import time
 
 
+def do_drop_caches():
+subprocess.run('sync; echo 3 > /proc/sys/vm/drop_caches', shell=True,
+   check=True)
+
+
 def bench_one(test_func, test_env, test_case, count=5, initial_run=True,
-  slow_limit=100):
+  slow_limit=100, drop_caches=False):
 """Benchmark one test-case
 
 test_func   -- benchmarking function with prototype
@@ -40,6 +46,7 @@ def bench_one(test_func, test_env, test_case, count=5, 
initial_run=True,
 initial_run -- do initial run of test_func, which don't get into result
 slow_limit  -- stop at slow run (that exceedes the slow_limit by seconds).
(initial run is not measured)
+drop_caches -- drop caches before each run
 
 Returns dict with the following fields:
 'runs': list of test_func results
@@ -53,6 +60,7 @@ def bench_one(test_func, test_env, test_case, count=5, 
initial_run=True,
 """
 if initial_run:
 print('  #initial run:')
+do_drop_caches()
 print('   ', test_func(test_env, test_case))
 
 runs = []
@@ -60,6 +68,7 @@ def bench_one(test_func, test_env, test_case, count=5, 
initial_run=True,
 t = time.time()
 
 print('  #run {}'.format(i+1))
+do_drop_caches()
 res = test_func(test_env, test_case)
 print('   ', res)
 runs.append(res)
-- 
2.29.2

[PULL 2/9] simplebench: bench_one(): support count=1

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

statistics.stdev raises if sequence length is less than two. Support
that case by hand.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: John Snow 
---
 scripts/simplebench/simplebench.py | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/scripts/simplebench/simplebench.py 
b/scripts/simplebench/simplebench.py
index 0a3035732c..27bc4d4715 100644
--- a/scripts/simplebench/simplebench.py
+++ b/scripts/simplebench/simplebench.py
@@ -83,7 +83,10 @@ def bench_one(test_func, test_env, test_case, count=5, 
initial_run=True,
 dim = 'seconds'
 result['dimension'] = dim
 result['average'] = statistics.mean(r[dim] for r in succeeded)
-result['stdev'] = statistics.stdev(r[dim] for r in succeeded)
+if len(succeeded) == 1:
+result['stdev'] = 0
+else:
+result['stdev'] = statistics.stdev(r[dim] for r in succeeded)
 
 if len(succeeded) < count:
 result['n-failed'] = count - len(succeeded)
-- 
2.29.2

[PULL 6/9] simplebench/bench-backup: support qcow2 source files

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Add support for qcow2 source. New option says to use test-source.qcow2
instead of test-source. Of course, test-source.qcow2 should be
precreated.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: John Snow 
---
 scripts/simplebench/bench-backup.py| 5 +
 scripts/simplebench/bench_block_job.py | 7 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/scripts/simplebench/bench-backup.py 
b/scripts/simplebench/bench-backup.py
index fbc85f266f..a2120fcbf0 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -58,6 +58,8 @@ def bench(args):
 
 if src == 'nbd':
 source = nbd_drv
+elif args.qcow2_sources:
+source = drv_qcow2(drv_file(dirs[src] + '/test-source.qcow2'))
 else:
 source = drv_file(dirs[src] + '/test-source')
 
@@ -199,6 +201,9 @@ def __call__(self, parser, namespace, values, 
option_string=None):
 Use compressed backup. It automatically means
 automatically creating qcow2 target with
 lazy_refcounts for each test run''', action='store_true')
+p.add_argument('--qcow2-sources', help='''\
+Use test-source.qcow2 images as sources instead of
+test-source raw images''', action='store_true')
 p.add_argument('--target-cache', help='''\
 Setup cache for target nodes. Options:
direct: default, use O_DIRECT and aio=native
diff --git a/scripts/simplebench/bench_block_job.py 
b/scripts/simplebench/bench_block_job.py
index 71d2e489c8..4f03c12169 100755
--- a/scripts/simplebench/bench_block_job.py
+++ b/scripts/simplebench/bench_block_job.py
@@ -88,6 +88,11 @@ def get_image_size(path):
 return json.loads(out)['virtual-size']
 
 
+def get_blockdev_size(obj):
+img = obj['filename'] if 'filename' in obj else obj['file']['filename']
+return get_image_size(img)
+
+
 # Bench backup or mirror
 def bench_block_copy(qemu_binary, cmd, cmd_options, source, target):
 """Helper to run bench_block_job() for mirror or backup"""
@@ -101,7 +106,7 @@ def bench_block_copy(qemu_binary, cmd, cmd_options, source, 
target):
 
 subprocess.run(['qemu-img', 'create', '-f', 'qcow2',
 target['file']['filename'],
-str(get_image_size(source['filename']))],
+str(get_blockdev_size(source))],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL, check=True)
 
-- 
2.29.2

[PULL 5/9] simplebench/bench_block_job: handle error in BLOCK_JOB_COMPLETED

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

We should not report success if there is an error in final event.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: John Snow 
---
 scripts/simplebench/bench_block_job.py | 4 
 1 file changed, 4 insertions(+)

diff --git a/scripts/simplebench/bench_block_job.py 
b/scripts/simplebench/bench_block_job.py
index 8f8385ccce..71d2e489c8 100755
--- a/scripts/simplebench/bench_block_job.py
+++ b/scripts/simplebench/bench_block_job.py
@@ -70,6 +70,10 @@ def bench_block_job(cmd, cmd_args, qemu_args):
 vm.shutdown()
 return {'error': 'block-job failed: ' + str(e),
 'vm-log': vm.get_log()}
+if 'error' in e['data']:
+vm.shutdown()
+return {'error': 'block-job failed: ' + e['data']['error'],
+'vm-log': vm.get_log()}
 end_ms = e['timestamp']['seconds'] * 100 + \
 e['timestamp']['microseconds']
 finally:
-- 
2.29.2

[PULL 1/9] simplebench: bench_one(): add slow_limit argument

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Sometimes one of cells in a testing table runs too slow. And we really
don't want to wait so long. Limit number of runs in this case.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 scripts/simplebench/simplebench.py | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/scripts/simplebench/simplebench.py 
b/scripts/simplebench/simplebench.py
index f61513af90..0a3035732c 100644
--- a/scripts/simplebench/simplebench.py
+++ b/scripts/simplebench/simplebench.py
@@ -19,9 +19,11 @@
 #
 
 import statistics
+import time
 
 
-def bench_one(test_func, test_env, test_case, count=5, initial_run=True):
+def bench_one(test_func, test_env, test_case, count=5, initial_run=True,
+  slow_limit=100):
 """Benchmark one test-case
 
 test_func   -- benchmarking function with prototype
@@ -36,6 +38,8 @@ def bench_one(test_func, test_env, test_case, count=5, 
initial_run=True):
 test_case   -- test case - opaque second argument for test_func
 count   -- how many times to call test_func, to calculate average
 initial_run -- do initial run of test_func, which don't get into result
+slow_limit  -- stop at slow run (that exceedes the slow_limit by seconds).
+   (initial run is not measured)
 
 Returns dict with the following fields:
 'runs': list of test_func results
@@ -53,11 +57,19 @@ def bench_one(test_func, test_env, test_case, count=5, 
initial_run=True):
 
 runs = []
 for i in range(count):
+t = time.time()
+
 print('  #run {}'.format(i+1))
 res = test_func(test_env, test_case)
 print('   ', res)
 runs.append(res)
 
+if time.time() - t > slow_limit:
+print('- run is too slow, stop here')
+break
+
+count = len(runs)
+
 result = {'runs': runs}
 
 succeeded = [r for r in runs if ('seconds' in r or 'iops' in r)]
-- 
2.29.2

Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

2021-05-04 Thread Michael S. Tsirkin

On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> This is a partial revert of commits 77542d43149 and bc79c87bcde.
> 
> Usually, an error during initialisation means that the configuration was
> wrong. Reconnecting won't make the error go away, but just turn the
> error condition into an endless loop. Avoid this and return errors
> again.

So there are several possible reasons for an error:

1. remote restarted - we would like to reconnect,
   this was the original use-case for reconnect.

   I am not very happy that we are killing this usecase.

2. qemu detected an error and closed the connection
   looks like we try to handle that by reconnect,
   this is something we should address.
3. remote failed due to a bad command from qemu.
   this usecase isn't well supported at the moment.

   How about supporting it on the remote side? I think
   that if the data is well-formed just has a configuration
   remote can not support then instead of closing the connection, remote can 
wait
   for commands with need_reply set, and respond with
   an error. Or at least do it
   if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
   If VHOST_USER_SET_VRING_ERR is used then signalling that
   fd might also be reasonable.

   OTOH if qemu is buggy and sends malformed data and remote detects that then
   hacing qemu retry forever is ok, might actually be benefitial for
   debugging.



> Additionally, calling vhost_user_blk_disconnect() from the chardev event
> handler could result in use-after-free because none of the
> initialisation code expects that the device could just go away in the
> middle. So removing the call fixes crashes in several places.
> For example, using a num-queues setting that is incompatible with the
> backend would result in a crash like this (dereferencing dev->opaque,
> which is already NULL):
> 
>  #0  0x55d0a4bd in vhost_user_read_cb (source=0x568f4690, 
> condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffcbf0) at 
> ../hw/virtio/vhost-user.c:313
>  #1  0x55d950d3 in qio_channel_fd_source_dispatch 
> (source=0x57c3f750, callback=0x55d0a478 , 
> user_data=0x7fffcbf0) at ../io/channel-watch.c:84
>  #2  0x77b32a9f in g_main_context_dispatch () at 
> /lib64/libglib-2.0.so.0
>  #3  0x77b84a98 in g_main_context_iterate.constprop () at 
> /lib64/libglib-2.0.so.0
>  #4  0x77b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
>  #5  0x55d0a724 in vhost_user_read (dev=0x57bc62f8, 
> msg=0x7fffcc50) at ../hw/virtio/vhost-user.c:402
>  #6  0x55d0ee6b in vhost_user_get_config (dev=0x57bc62f8, 
> config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
>  #7  0x55d56d46 in vhost_dev_get_config (hdev=0x57bc62f8, 
> config=0x57bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
>  #8  0x55cdd150 in vhost_user_blk_device_realize (dev=0x57bc60b0, 
> errp=0x7fffcf90) at ../hw/block/vhost-user-blk.c:510
>  #9  0x55d08f6d in virtio_device_realize (dev=0x57bc60b0, 
> errp=0x7fffcff0) at ../hw/virtio/virtio.c:3660

Right. So that's definitely something to fix.

> 
> Signed-off-by: Kevin Wolf 
> ---
>  hw/block/vhost-user-blk.c | 59 +++
>  1 file changed, 17 insertions(+), 42 deletions(-)
> 
> diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
> index 7c85248a7b..c0b9958da1 100644
> --- a/hw/block/vhost-user-blk.c
> +++ b/hw/block/vhost-user-blk.c
> @@ -50,6 +50,8 @@ static const int user_feature_bits[] = {
>  VHOST_INVALID_FEATURE_BIT
>  };
>  
> +static void vhost_user_blk_event(void *opaque, QEMUChrEvent event);
> +
>  static void vhost_user_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>  {
>  VHostUserBlk *s = VHOST_USER_BLK(vdev);
> @@ -362,19 +364,6 @@ static void vhost_user_blk_disconnect(DeviceState *dev)
>  vhost_dev_cleanup(>dev);
>  }
>  
> -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
> - bool realized);
> -
> -static void vhost_user_blk_event_realize(void *opaque, QEMUChrEvent event)
> -{
> -vhost_user_blk_event(opaque, event, false);
> -}
> -
> -static void vhost_user_blk_event_oper(void *opaque, QEMUChrEvent event)
> -{
> -vhost_user_blk_event(opaque, event, true);
> -}
> -
>  static void vhost_user_blk_chr_closed_bh(void *opaque)
>  {
>  DeviceState *dev = opaque;
> @@ -382,12 +371,11 @@ static void vhost_user_blk_chr_closed_bh(void *opaque)
>  VHostUserBlk *s = VHOST_USER_BLK(vdev);
>  
>  vhost_user_blk_disconnect(dev);
> -qemu_chr_fe_set_handlers(>chardev, NULL, NULL,
> -vhost_user_blk_event_oper, NULL, opaque, NULL, true);
> +qemu_chr_fe_set_handlers(>chardev, NULL, NULL, vhost_user_blk_event,
> + NULL, opaque, NULL, true);
>  }
>  
> -static void vhost_user_blk_event(void *opaque, QEMUChrEvent event,
> -

[PULL 4/9] simplebench/bench-backup: add target-cache argument

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Allow benchmark with different kinds of target cache.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: John Snow 
---
 scripts/simplebench/bench-backup.py| 33 --
 scripts/simplebench/bench_block_job.py | 10 +---
 2 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/scripts/simplebench/bench-backup.py 
b/scripts/simplebench/bench-backup.py
index 72eae85bb1..fbc85f266f 100755
--- a/scripts/simplebench/bench-backup.py
+++ b/scripts/simplebench/bench-backup.py
@@ -65,13 +65,26 @@ def bench(args):
 test_cases.append({'id': t, 'source': source, 'target': nbd_drv})
 continue
 
-fname = dirs[dst] + '/test-target'
-if args.compressed:
-fname += '.qcow2'
-target = drv_file(fname)
-if args.compressed:
-target = drv_qcow2(target)
-test_cases.append({'id': t, 'source': source, 'target': target})
+if args.target_cache == 'both':
+target_caches = ['direct', 'cached']
+else:
+target_caches = [args.target_cache]
+
+for c in target_caches:
+o_direct = c == 'direct'
+fname = dirs[dst] + '/test-target'
+if args.compressed:
+fname += '.qcow2'
+target = drv_file(fname, o_direct=o_direct)
+if args.compressed:
+target = drv_qcow2(target)
+
+test_id = t
+if args.target_cache == 'both':
+test_id += f'({c})'
+
+test_cases.append({'id': test_id, 'source': source,
+   'target': target})
 
 binaries = []  # list of (, , [])
 for i, q in enumerate(args.env):
@@ -186,5 +199,11 @@ def __call__(self, parser, namespace, values, 
option_string=None):
 Use compressed backup. It automatically means
 automatically creating qcow2 target with
 lazy_refcounts for each test run''', action='store_true')
+p.add_argument('--target-cache', help='''\
+Setup cache for target nodes. Options:
+   direct: default, use O_DIRECT and aio=native
+   cached: use system cache (Qemu default) and aio=threads (Qemu default)
+   both: generate two test cases for each src:dst pair''',
+   default='direct', choices=('direct', 'cached', 'both'))
 
 bench(p.parse_args())
diff --git a/scripts/simplebench/bench_block_job.py 
b/scripts/simplebench/bench_block_job.py
index 08f86ed9c1..8f8385ccce 100755
--- a/scripts/simplebench/bench_block_job.py
+++ b/scripts/simplebench/bench_block_job.py
@@ -115,9 +115,13 @@ def bench_block_copy(qemu_binary, cmd, cmd_options, 
source, target):
 '-blockdev', json.dumps(target)])
 
 
-def drv_file(filename):
-return {'driver': 'file', 'filename': filename,
-'cache': {'direct': True}, 'aio': 'native'}
+def drv_file(filename, o_direct=True):
+node = {'driver': 'file', 'filename': filename}
+if o_direct:
+node['cache'] = {'direct': True}
+node['aio'] = 'native'
+
+return node
 
 
 def drv_nbd(host, port):
-- 
2.29.2

[PULL 0/9] scripts/simplebench patches

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

The following changes since commit 53c5433e84e8935abed8e91d4a2eb813168a0ecf:

  Merge remote-tracking branch 'remotes/rth-gitlab/tags/pull-tcg-20210501' into 
staging (2021-05-02 12:02:46 +0100)

are available in the Git repository at:

  https://src.openvz.org/scm/~vsementsov/qemu.git 
tags/pull-simplebench-2021-05-04

for you to fetch changes up to e34bd02694026722410b80cee02ab7f33f893e9b:

  MAINTAINERS: update Benchmark util: add git tree (2021-05-04 11:37:26 +0300)


scripts/simplebench improvements for 2021-05-04


Vladimir Sementsov-Ogievskiy (9):
  simplebench: bench_one(): add slow_limit argument
  simplebench: bench_one(): support count=1
  simplebench/bench-backup: add --compressed option
  simplebench/bench-backup: add target-cache argument
  simplebench/bench_block_job: handle error in BLOCK_JOB_COMPLETED
  simplebench/bench-backup: support qcow2 source files
  simplebench/bench-backup: add --count and --no-initial-run
  simplebench/bench-backup: add --drop-caches argument
  MAINTAINERS: update Benchmark util: add git tree

 MAINTAINERS|  1 +
 scripts/simplebench/bench-backup.py| 95 --
 scripts/simplebench/bench_block_job.py | 42 +--
 scripts/simplebench/simplebench.py | 28 +-
 4 files changed, 144 insertions(+), 22 deletions(-)

Re: [PATCH v3 0/9] simplebench improvements

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


23.03.2021 16:47, Vladimir Sementsov-Ogievskiy wrote:

Hi all!

Here are some improvements to simplebench lib, to support my
"qcow2: compressed write cache" series.

v3:
01: use simpler logic
02,04-06: add John's r-b
07: use BooleanOptionalAction and
 initial_run=args.initial_run
08: rewrite so that we have a new --drop-caches option

Vladimir Sementsov-Ogievskiy (9):
   simplebench: bench_one(): add slow_limit argument
   simplebench: bench_one(): support count=1
   simplebench/bench-backup: add --compressed option
   simplebench/bench-backup: add target-cache argument
   simplebench/bench_block_job: handle error in BLOCK_JOB_COMPLETED
   simplebench/bench-backup: support qcow2 source files
   simplebench/bench-backup: add --count and --no-initial-run
   simplebench/bench-backup: add --drop-caches argument
   MAINTAINERS: update Benchmark util: add git tree

  MAINTAINERS|  1 +
  scripts/simplebench/bench-backup.py| 95 +-
  scripts/simplebench/bench_block_job.py | 42 +++-
  scripts/simplebench/simplebench.py | 28 +++-
  4 files changed, 144 insertions(+), 22 deletions(-)



Thanks for reviewing, applied to my simplebench branch at
https://src.openvz.org/scm/~vsementsov/qemu.git

I'm going to send pull request now, and see, will Peter take it or not :)

--
Best regards,
Vladimir

[PATCH v2 8/9] test-write-threshold: drop extra includes

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-write-threshold.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tests/unit/test-write-threshold.c 
b/tests/unit/test-write-threshold.c
index 49b1ef7a20..761054eab2 100644
--- a/tests/unit/test-write-threshold.c
+++ b/tests/unit/test-write-threshold.c
@@ -7,8 +7,6 @@
  */
 
 #include "qemu/osdep.h"
-#include "qapi/error.h"
-#include "block/block_int.h"
 #include "block/write-threshold.h"
 
 
-- 
2.29.2

[PATCH v2 5/9] block/write-threshold: don't use aio context lock

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Instead of relying on aio context lock, let's make use of atomic
operations.

The tricky place is bdrv_write_threshold_check_write(): we want
atomically unset bs->write_threshold_offset iff
  offset + bytes > bs->write_threshold_offset
We don't have such atomic operation, so let's go in a loop:

1. fetch wtr atomically
2. if condition satisfied, try cmpxchg (if not satisfied, we are done,
   don't send event)
3. if cmpxchg succeeded, we are done (send event), else go to [1]

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/write-threshold.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/block/write-threshold.c b/block/write-threshold.c
index ae54ee05dc..fbf4e6f5c4 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -2,6 +2,7 @@
  * QEMU System Emulator block write threshold notification
  *
  * Copyright Red Hat, Inc. 2014
+ * Copyright (c) 2021 Virtuozzo International GmbH.
  *
  * Authors:
  *  Francesco Romani 
@@ -21,43 +22,44 @@
 
 uint64_t bdrv_write_threshold_get(const BlockDriverState *bs)
 {
-return bs->write_threshold_offset;
+return qatomic_read(>write_threshold_offset);
 }
 
 void bdrv_write_threshold_set(BlockDriverState *bs, uint64_t threshold_bytes)
 {
-bs->write_threshold_offset = threshold_bytes;
+qatomic_set(>write_threshold_offset, threshold_bytes);
 }
 
 void qmp_block_set_write_threshold(const char *node_name,
uint64_t threshold_bytes,
Error **errp)
 {
-BlockDriverState *bs;
-AioContext *aio_context;
-
-bs = bdrv_find_node(node_name);
+BlockDriverState *bs = bdrv_find_node(node_name);
 if (!bs) {
 error_setg(errp, "Device '%s' not found", node_name);
 return;
 }
 
-aio_context = bdrv_get_aio_context(bs);
-aio_context_acquire(aio_context);
-
 bdrv_write_threshold_set(bs, threshold_bytes);
-
-aio_context_release(aio_context);
 }
 
 void bdrv_write_threshold_check_write(BlockDriverState *bs, int64_t offset,
   int64_t bytes)
 {
 int64_t end = offset + bytes;
-uint64_t wtr = bs->write_threshold_offset;
+uint64_t wtr;
 
-if (wtr > 0 && end > wtr) {
-qapi_event_send_block_write_threshold(bs->node_name, end - wtr, wtr);
-bdrv_write_threshold_set(bs, 0);
+retry:
+wtr = bdrv_write_threshold_get(bs);
+if (wtr == 0 || wtr >= end) {
+return;
 }
+
+if (qatomic_cmpxchg(>write_threshold_offset, wtr, 0) != wtr) {
+/* bs->write_threshold_offset changed in parallel */
+goto retry;
+}
+
+/* We have cleared bs->write_threshold_offset, so let's send event */
+qapi_event_send_block_write_threshold(bs->node_name, end - wtr, wtr);
 }
-- 
2.29.2

[PATCH v2 4/9] block/write-threshold: drop extra APIs

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

bdrv_write_threshold_exceeded() is unused at all.

bdrv_write_threshold_is_set() is used only to double check the value of
bs->write_threshold_offset in tests. No real sense in it (both tests do
check real value with help of bdrv_write_threshold_get())

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/write-threshold.h   | 24 
 block/write-threshold.c   | 19 ---
 tests/unit/test-write-threshold.c |  4 
 3 files changed, 47 deletions(-)

diff --git a/include/block/write-threshold.h b/include/block/write-threshold.h
index 7942dcc368..072bc8f286 100644
--- a/include/block/write-threshold.h
+++ b/include/block/write-threshold.h
@@ -35,30 +35,6 @@ void bdrv_write_threshold_set(BlockDriverState *bs, uint64_t 
threshold_bytes);
  */
 uint64_t bdrv_write_threshold_get(const BlockDriverState *bs);
 
-/*
- * bdrv_write_threshold_is_set
- *
- * Tell if a write threshold is set for a given BDS.
- */
-bool bdrv_write_threshold_is_set(const BlockDriverState *bs);
-
-/*
- * bdrv_write_threshold_exceeded
- *
- * Return the extent of a write request that exceeded the threshold,
- * or zero if the request is below the threshold.
- * Return zero also if the threshold was not set.
- *
- * NOTE: here we assume the following holds for each request this code
- * deals with:
- *
- * assert((req->offset + req->bytes) <= UINT64_MAX)
- *
- * Please not there is *not* an actual C assert().
- */
-uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
-   const BdrvTrackedRequest *req);
-
 /*
  * bdrv_write_threshold_check_write
  *
diff --git a/block/write-threshold.c b/block/write-threshold.c
index 11152c60df..ae54ee05dc 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -24,25 +24,6 @@ uint64_t bdrv_write_threshold_get(const BlockDriverState *bs)
 return bs->write_threshold_offset;
 }
 
-bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
-{
-return bs->write_threshold_offset > 0;
-}
-
-uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
-   const BdrvTrackedRequest *req)
-{
-if (bdrv_write_threshold_is_set(bs)) {
-if (req->offset > bs->write_threshold_offset) {
-return (req->offset - bs->write_threshold_offset) + req->bytes;
-}
-if ((req->offset + req->bytes) > bs->write_threshold_offset) {
-return (req->offset + req->bytes) - bs->write_threshold_offset;
-}
-}
-return 0;
-}
-
 void bdrv_write_threshold_set(BlockDriverState *bs, uint64_t threshold_bytes)
 {
 bs->write_threshold_offset = threshold_bytes;
diff --git a/tests/unit/test-write-threshold.c 
b/tests/unit/test-write-threshold.c
index fd40a815b8..bb5c1a5217 100644
--- a/tests/unit/test-write-threshold.c
+++ b/tests/unit/test-write-threshold.c
@@ -18,8 +18,6 @@ static void test_threshold_not_set_on_init(void)
 BlockDriverState bs;
 memset(, 0, sizeof(bs));
 
-g_assert(!bdrv_write_threshold_is_set());
-
 res = bdrv_write_threshold_get();
 g_assert_cmpint(res, ==, 0);
 }
@@ -33,8 +31,6 @@ static void test_threshold_set_get(void)
 
 bdrv_write_threshold_set(, threshold);
 
-g_assert(bdrv_write_threshold_is_set());
-
 res = bdrv_write_threshold_get();
 g_assert_cmpint(res, ==, threshold);
 }
-- 
2.29.2

[PATCH v2 9/9] block/write-threshold: drop extra includes

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/write-threshold.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/write-threshold.c b/block/write-threshold.c
index fbf4e6f5c4..db271c5537 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -12,10 +12,7 @@
  */
 
 #include "qemu/osdep.h"
-#include "block/block_int.h"
-#include "qemu/coroutine.h"
 #include "block/write-threshold.h"
-#include "qemu/notify.h"
 #include "qapi/error.h"
 #include "qapi/qapi-commands-block-core.h"
 #include "qapi/qapi-events-block-core.h"
-- 
2.29.2

[PATCH v2 7/9] test-write-threshold: drop extra TestStruct structure

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

We don't need this extra logic: it doesn't make code simpler.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-write-threshold.c | 20 +++-
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/tests/unit/test-write-threshold.c 
b/tests/unit/test-write-threshold.c
index 9e9986aefc..49b1ef7a20 100644
--- a/tests/unit/test-write-threshold.c
+++ b/tests/unit/test-write-threshold.c
@@ -37,26 +37,12 @@ static void test_threshold_trigger(void)
 g_assert_cmpuint(bdrv_write_threshold_get(), ==, 0);
 }
 
-typedef struct TestStruct {
-const char *name;
-void (*func)(void);
-} TestStruct;
-
 
 int main(int argc, char **argv)
 {
-size_t i;
-TestStruct tests[] = {
-{ "/write-threshold/not-trigger",
-  test_threshold_not_trigger },
-{ "/write-threshold/trigger",
-  test_threshold_trigger },
-{ NULL, NULL }
-};
-
 g_test_init(, , NULL);
-for (i = 0; tests[i].name != NULL; i++) {
-g_test_add_func(tests[i].name, tests[i].func);
-}
+g_test_add_func("/write-threshold/not-trigger", 
test_threshold_not_trigger);
+g_test_add_func("/write-threshold/trigger", test_threshold_trigger);
+
 return g_test_run();
 }
-- 
2.29.2

[PATCH v2 6/9] test-write-threshold: drop extra tests

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Testing set/get of one 64bit variable doesn't seem necessary. We have a
lot of such variables. Also remaining tests do test set/get anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-write-threshold.c | 43 ---
 1 file changed, 43 deletions(-)

diff --git a/tests/unit/test-write-threshold.c 
b/tests/unit/test-write-threshold.c
index bb5c1a5217..9e9986aefc 100644
--- a/tests/unit/test-write-threshold.c
+++ b/tests/unit/test-write-threshold.c
@@ -12,43 +12,6 @@
 #include "block/write-threshold.h"
 
 
-static void test_threshold_not_set_on_init(void)
-{
-uint64_t res;
-BlockDriverState bs;
-memset(, 0, sizeof(bs));
-
-res = bdrv_write_threshold_get();
-g_assert_cmpint(res, ==, 0);
-}
-
-static void test_threshold_set_get(void)
-{
-uint64_t threshold = 4 * 1024 * 1024;
-uint64_t res;
-BlockDriverState bs;
-memset(, 0, sizeof(bs));
-
-bdrv_write_threshold_set(, threshold);
-
-res = bdrv_write_threshold_get();
-g_assert_cmpint(res, ==, threshold);
-}
-
-static void test_threshold_multi_set_get(void)
-{
-uint64_t threshold1 = 4 * 1024 * 1024;
-uint64_t threshold2 = 15 * 1024 * 1024;
-uint64_t res;
-BlockDriverState bs;
-memset(, 0, sizeof(bs));
-
-bdrv_write_threshold_set(, threshold1);
-bdrv_write_threshold_set(, threshold2);
-res = bdrv_write_threshold_get();
-g_assert_cmpint(res, ==, threshold2);
-}
-
 static void test_threshold_not_trigger(void)
 {
 uint64_t threshold = 4 * 1024 * 1024;
@@ -84,12 +47,6 @@ int main(int argc, char **argv)
 {
 size_t i;
 TestStruct tests[] = {
-{ "/write-threshold/not-set-on-init",
-  test_threshold_not_set_on_init },
-{ "/write-threshold/set-get",
-  test_threshold_set_get },
-{ "/write-threshold/multi-set-get",
-  test_threshold_multi_set_get },
 { "/write-threshold/not-trigger",
   test_threshold_not_trigger },
 { "/write-threshold/trigger",
-- 
2.29.2

[PATCH v2 2/9] block: drop write notifiers

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

They are unused now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h | 12 
 block.c   |  1 -
 block/io.c|  6 --
 3 files changed, 19 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 51f51286a5..eab352f363 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -954,9 +954,6 @@ struct BlockDriverState {
  */
 int64_t total_sectors;
 
-/* Callback before write request is processed */
-NotifierWithReturnList before_write_notifiers;
-
 /* threshold limit for writes, in bytes. "High water mark". */
 uint64_t write_threshold_offset;
 
@@ -1083,15 +1080,6 @@ void bdrv_parse_filename_strip_prefix(const char 
*filename, const char *prefix,
 bool bdrv_backing_overridden(BlockDriverState *bs);
 
 
-/**
- * bdrv_add_before_write_notifier:
- *
- * Register a callback that is invoked before write requests are processed but
- * after any throttling or waiting for overlapping requests.
- */
-void bdrv_add_before_write_notifier(BlockDriverState *bs,
-NotifierWithReturn *notifier);
-
 /**
  * bdrv_add_aio_context_notifier:
  *
diff --git a/block.c b/block.c
index 874c22c43e..e3c6c6ed93 100644
--- a/block.c
+++ b/block.c
@@ -401,7 +401,6 @@ BlockDriverState *bdrv_new(void)
 for (i = 0; i < BLOCK_OP_TYPE_MAX; i++) {
 QLIST_INIT(>op_blockers[i]);
 }
-notifier_with_return_list_init(>before_write_notifiers);
 qemu_co_mutex_init(>reqs_lock);
 qemu_mutex_init(>dirty_bitmap_mutex);
 bs->refcnt = 1;
diff --git a/block/io.c b/block/io.c
index 3520de51bb..1e826ba9e8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3165,12 +3165,6 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, 
QEMUIOVector *qiov)
 return true;
 }
 
-void bdrv_add_before_write_notifier(BlockDriverState *bs,
-NotifierWithReturn *notifier)
-{
-notifier_with_return_list_add(>before_write_notifiers, notifier);
-}
-
 void bdrv_io_plug(BlockDriverState *bs)
 {
 BdrvChild *child;
-- 
2.29.2

[PATCH v2 1/9] block/write-threshold: don't use write notifiers

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

write-notifiers are used only for write-threshold. New code for such
purpose should create filters.

Let's better special-case write-threshold and drop write notifiers at
all. (Actually, write-threshold is special-cased anyway, as the only
user of write-notifiers)

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block_int.h   |  1 -
 include/block/write-threshold.h |  9 +
 block/io.c  |  5 ++-
 block/write-threshold.c | 68 +++--
 4 files changed, 25 insertions(+), 58 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index c823f5b1b3..51f51286a5 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -959,7 +959,6 @@ struct BlockDriverState {
 
 /* threshold limit for writes, in bytes. "High water mark". */
 uint64_t write_threshold_offset;
-NotifierWithReturn write_threshold_notifier;
 
 /* Writing to the list requires the BQL _and_ the dirty_bitmap_mutex.
  * Reading from the list can be done with either the BQL or the
diff --git a/include/block/write-threshold.h b/include/block/write-threshold.h
index c646f267a4..7942dcc368 100644
--- a/include/block/write-threshold.h
+++ b/include/block/write-threshold.h
@@ -59,4 +59,13 @@ bool bdrv_write_threshold_is_set(const BlockDriverState *bs);
 uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
const BdrvTrackedRequest *req);
 
+/*
+ * bdrv_write_threshold_check_write
+ *
+ * Check, does specified request exceeds write threshold. If it is, send
+ * corresponding event and unset write threshold.
+ */
+void bdrv_write_threshold_check_write(BlockDriverState *bs, int64_t offset,
+  int64_t bytes);
+
 #endif
diff --git a/block/io.c b/block/io.c
index 35b6c56efc..3520de51bb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -30,6 +30,7 @@
 #include "block/blockjob_int.h"
 #include "block/block_int.h"
 #include "block/coroutines.h"
+#include "block/write-threshold.h"
 #include "qemu/cutils.h"
 #include "qapi/error.h"
 #include "qemu/error-report.h"
@@ -2008,8 +2009,8 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t 
offset, int64_t bytes,
 } else {
 assert(child->perm & BLK_PERM_WRITE);
 }
-return notifier_with_return_list_notify(>before_write_notifiers,
-req);
+bdrv_write_threshold_check_write(bs, offset, bytes);
+return 0;
 case BDRV_TRACKED_TRUNCATE:
 assert(child->perm & BLK_PERM_RESIZE);
 return 0;
diff --git a/block/write-threshold.c b/block/write-threshold.c
index 85b78dc2a9..11152c60df 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -29,14 +29,6 @@ bool bdrv_write_threshold_is_set(const BlockDriverState *bs)
 return bs->write_threshold_offset > 0;
 }
 
-static void write_threshold_disable(BlockDriverState *bs)
-{
-if (bdrv_write_threshold_is_set(bs)) {
-notifier_with_return_remove(>write_threshold_notifier);
-bs->write_threshold_offset = 0;
-}
-}
-
 uint64_t bdrv_write_threshold_exceeded(const BlockDriverState *bs,
const BdrvTrackedRequest *req)
 {
@@ -51,55 +43,9 @@ uint64_t bdrv_write_threshold_exceeded(const 
BlockDriverState *bs,
 return 0;
 }
 
-static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
-void *opaque)
-{
-BdrvTrackedRequest *req = opaque;
-BlockDriverState *bs = req->bs;
-uint64_t amount = 0;
-
-amount = bdrv_write_threshold_exceeded(bs, req);
-if (amount > 0) {
-qapi_event_send_block_write_threshold(
-bs->node_name,
-amount,
-bs->write_threshold_offset);
-
-/* autodisable to avoid flooding the monitor */
-write_threshold_disable(bs);
-}
-
-return 0; /* should always let other notifiers run */
-}
-
-static void write_threshold_register_notifier(BlockDriverState *bs)
-{
-bs->write_threshold_notifier.notify = before_write_notify;
-bdrv_add_before_write_notifier(bs, >write_threshold_notifier);
-}
-
-static void write_threshold_update(BlockDriverState *bs,
-   int64_t threshold_bytes)
-{
-bs->write_threshold_offset = threshold_bytes;
-}
-
 void bdrv_write_threshold_set(BlockDriverState *bs, uint64_t threshold_bytes)
 {
-if (bdrv_write_threshold_is_set(bs)) {
-if (threshold_bytes > 0) {
-write_threshold_update(bs, threshold_bytes);
-} else {
-write_threshold_disable(bs);
-}
-} else {
-if (threshold_bytes > 0) {
-/* avoid multiple registration */
-write_threshold_register_notifier(bs);
-write_threshold_update(bs, threshold_bytes);
-}
-/* discard bogus disable request */
-}
+

[PATCH v2 3/9] test-write-threshold: rewrite test_threshold_(not_)trigger tests

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

These tests use bdrv_write_threshold_exceeded() API, which is used only
for test. Better is testing real API, which is used in block.c as well.

So, let's call bdrv_write_threshold_check_write(), and check is
bs->write_threshold_offset cleared or not (it's cleared iff threshold
triggered).

Also we get rid of BdrvTrackedRequest use here. Tracked requests are
unrelated to write-threshold since we get rid of write notifiers.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/unit/test-write-threshold.c | 22 --
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/tests/unit/test-write-threshold.c 
b/tests/unit/test-write-threshold.c
index fc1c45a2eb..fd40a815b8 100644
--- a/tests/unit/test-write-threshold.c
+++ b/tests/unit/test-write-threshold.c
@@ -55,41 +55,27 @@ static void test_threshold_multi_set_get(void)
 
 static void test_threshold_not_trigger(void)
 {
-uint64_t amount = 0;
 uint64_t threshold = 4 * 1024 * 1024;
 BlockDriverState bs;
-BdrvTrackedRequest req;
 
 memset(, 0, sizeof(bs));
-memset(, 0, sizeof(req));
-req.offset = 1024;
-req.bytes = 1024;
-
-bdrv_check_request(req.offset, req.bytes, _abort);
 
 bdrv_write_threshold_set(, threshold);
-amount = bdrv_write_threshold_exceeded(, );
-g_assert_cmpuint(amount, ==, 0);
+bdrv_write_threshold_check_write(, 1024, 1024);
+g_assert_cmpuint(bdrv_write_threshold_get(), ==, threshold);
 }
 
 
 static void test_threshold_trigger(void)
 {
-uint64_t amount = 0;
 uint64_t threshold = 4 * 1024 * 1024;
 BlockDriverState bs;
-BdrvTrackedRequest req;
 
 memset(, 0, sizeof(bs));
-memset(, 0, sizeof(req));
-req.offset = (4 * 1024 * 1024) - 1024;
-req.bytes = 2 * 1024;
-
-bdrv_check_request(req.offset, req.bytes, _abort);
 
 bdrv_write_threshold_set(, threshold);
-amount = bdrv_write_threshold_exceeded(, );
-g_assert_cmpuint(amount, >=, 1024);
+bdrv_write_threshold_check_write(, threshold - 1024, 2 * 1024);
+g_assert_cmpuint(bdrv_write_threshold_get(), ==, 0);
 }
 
 typedef struct TestStruct {
-- 
2.29.2

[PATCH v2 0/9] block: refactor write threshold

2021-05-04 Thread Vladimir Sementsov-Ogievskiy

Hi all!

This is a v2 for
"[PATCH] block: simplify write-threshold and drop write notifiers"
Supersedes: <20210421220950.105017-1-vsement...@virtuozzo.com>

v2: split into several patches, improve thread-safety, more refactoring

Vladimir Sementsov-Ogievskiy (9):
  block/write-threshold: don't use write notifiers
  block: drop write notifiers
  test-write-threshold: rewrite test_threshold_(not_)trigger tests
  block/write-threshold: drop extra APIs
  block/write-threshold: don't use aio context lock
  test-write-threshold: drop extra tests
  test-write-threshold: drop extra TestStruct structure
  test-write-threshold: drop extra includes
  block/write-threshold: drop extra includes

 include/block/block_int.h |  13 
 include/block/write-threshold.h   |  25 ++-
 block.c   |   1 -
 block/io.c|  11 +--
 block/write-threshold.c   | 110 +++---
 tests/unit/test-write-threshold.c |  91 ++--
 6 files changed, 39 insertions(+), 212 deletions(-)

-- 
2.29.2

Re: [PATCH v2] Document qemu-img options data_file and data_file_raw

2021-05-04 Thread Max Reitz


On 04.05.21 01:15, Connor Kuehl wrote:

On 4/30/21 9:45 AM, Max Reitz wrote:

+  ``data_file_raw``
+If this option is set to ``on``, QEMU will always keep the external
+data file consistent as a standalone read-only raw image. It does
+this by forwarding updates through to the raw image in addition to
+updating the image metadata. If set to ``off``, QEMU will only
+update the image metadata without forwarding the changes through
+to the raw image. The default value is ``off``.


Hm, what updates and what changes?  I mean, the first part makes sense (the “It 
does this by...”), but the second part doesn’t.  qemu will still forward most 
writes to the data file.  (Not all, but most.)

(Also, nit pick: With data_file_raw=off, the data file is not a raw image.  
(You still call it that in the penultimate sentence.))
When you write data to a qcow2 file with data_file, the data also goes to the 
data_file, most of the time.  The exception is when it can be handled with a 
metadata update, i.e. when it's a zero write or discard.

In addition, such updates (i.e. zero writes, I presume) not happening to the 
data file are usually a minor problem.  The real problem is that without 
data_file_raw, data clusters can be allocated anywhere in the data file, 
whereas with data_file_raw, they are allocated at their respective guest offset 
(i.e. the host offset always equals the guest offset).

I personally would have been fine with the first sentence, but if we want more 
of an explanation...  Perhaps:

<

I found it very helpful. I'll incorporate your explanation into the next
revision.

I'm wondering what the most appropriate trailer would be for the next
revision?

Suggested-by: Max [..]
Co-developed-by: Max [..]

Let me know if you have a strong preference, otherwise I'll go with
Suggested-by:


I’m fine without any tag (if I merge this patch, it’ll get my S-o-b 
anyway :)), but if any, I’d probably go with a Suggested-by, yes.


Max

Re: [PATCH 1/6] block: fix leak of tran in bdrv_root_attach_child

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


03.05.2021 18:51, Alberto Garcia wrote:

On Mon 03 May 2021 01:33:57 PM CEST, Vladimir Sementsov-Ogievskiy 
 wrote:

@@ -2918,12 +2918,18 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState 
*child_bs,
 child_role, perm, shared_perm, opaque,
 , tran, errp);
  if (ret < 0) {
-bdrv_unref(child_bs);
-return NULL;
+goto out;
  }
  
  ret = bdrv_refresh_perms(child_bs, errp);

+if (ret < 0) {
+goto out;
+}
+
+out:


I see why you're doing this last error check, but it looks a bit
weird. My first reaction was to think that I was missing something.

I would remove it.



Hmm. I don't know. And don't insist of course.


--
Best regards,
Vladimir

Re: [PATCH 2/2] block: Fix Transaction leak in bdrv_reopen_multiple()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


03.05.2021 14:05, Kevin Wolf wrote:

Like other error paths, this one needs to call tran_finalize() and clean
up the BlockReopenQueue, too.

Fixes: CID 1452772
Fixes: 72373e40fbc7e4218061a8211384db362d3e7348
Signed-off-by: Kevin Wolf 
---
  block.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 5c0ced6238..69615fabd1 100644
--- a/block.c
+++ b/block.c
@@ -4052,7 +4052,7 @@ int bdrv_reopen_multiple(BlockReopenQueue *bs_queue, 
Error **errp)
  ret = bdrv_flush(bs_entry->state.bs);
  if (ret < 0) {
  error_setg_errno(errp, -ret, "Error flushing drive");
-goto cleanup;
+goto abort;
  }
  }
  




Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir

Re: [PATCH 5/6] block: simplify bdrv_child_user_desc()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


03.05.2021 19:05, Alberto Garcia wrote:

On Mon 03 May 2021 01:34:01 PM CEST, Vladimir Sementsov-Ogievskiy 
 wrote:

All existing parent types (block nodes, block devices, jobs) has the
realization. So, drop unreachable code.

Signed-off-by: Vladimir Sementsov-Ogievskiy 


With the updated description that you propose in your reply to the
patch,


Hmm, is it answer to 4/6, not 5/6 ?



Reviewed-by: Alberto Garcia 

Berto




--
Best regards,
Vladimir

Re: [PATCH 2/6] block: bdrv_reopen_multiple(): fix leak of tran object

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


03.05.2021 18:52, Alberto Garcia wrote:

On Mon 03 May 2021 01:33:58 PM CEST, Vladimir Sementsov-Ogievskiy 
 wrote:

We have one path, where tran object is created, but we don't touch and
don't free it in any way: "goto cleanup" in first loop with calls to
bdrv_flush().

Fix it simply moving tran_new() call below that loop.

Reported-by: Coverity (CID 1452772)
Reported-by: Peter Maydell 
Suggested-by: Peter Maydell 
Fixes: 72373e40fbc7e4218061a8211384db362d3e7348
Signed-off-by: Vladimir Sementsov-Ogievskiy 


Reviewed-by: Alberto Garcia 



Thanks!  Still, now I see that Kevin's patch is better ([PATCH 2/2] block: Fix 
Transaction leak in bdrv_reopen_multiple())


--
Best regards,
Vladimir

Re: [PATCH 0/2] block: Fix Transaction leaks

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


03.05.2021 14:05, Kevin Wolf wrote:

These are two follow-up fixes for Vladimir's "block: update graph
permissions update". The bugs were reported by Coverity.

Kevin Wolf (2):
   block: Fix Transaction leak in bdrv_root_attach_child()
   block: Fix Transaction leak in bdrv_reopen_multiple()

  block.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)



I'll rebase my "[PATCH 0/6] block permission updated follow-up" on top of this.

--
Best regards,
Vladimir

Re: [PATCH 2/2] block: Fix Transaction leak in bdrv_reopen_multiple()

2021-05-04 Thread Vladimir Sementsov-Ogievskiy


03.05.2021 17:33, Kevin Wolf wrote:

Am 03.05.2021 um 15:09 hat Vladimir Sementsov-Ogievskiy geschrieben:

03.05.2021 15:41, Kevin Wolf wrote:

Am 03.05.2021 um 13:40 hat Vladimir Sementsov-Ogievskiy geschrieben:

03.05.2021 14:05, Kevin Wolf wrote:

Like other error paths, this one needs to call tran_finalize() and clean
up the BlockReopenQueue, too.


We don't need the "abort" loop on that path. And clean-up of
BlockReopenQueue is at "cleanup:" label.


The cleanup of the BlockReopenQueue itself is there, but not of all
fields in it. Specifically, these lines are needed:

  qobject_unref(bs_entry->state.explicit_options);
  qobject_unref(bs_entry->state.options);

The references are taken in bdrv_reopen_queue_child() and either used in
commit or released on abort. Doing nothing with them leaks them.


Oops. Somehow I decided they are set in _prepare.




So I'd prefer Peter's suggestion (my "[PATCH 2/6] block:
bdrv_reopen_multiple(): fix leak of tran object")


I don't like it because I think every call that doesn't end in a commit,
should jump to the abort label to make sure that everything that remains
unused because of this is correctly cleaned up.




agree now..

Still, don't we miss these two qobject_unref() calls on success path?


No, they are put to use in bdrv_reopen_commit():

 qobject_unref(bs->explicit_options);
 qobject_unref(bs->options);

 bs->explicit_options   = reopen_state->explicit_options;
 bs->options= reopen_state->options;

Kevin



OK then. I should have checked myself :\

--
Best regards,
Vladimir

72 matches

Mail list logo