Re: [PATCH v2 5/5] vhost: add device started check in migration set log

2020-05-06 Thread Raphael Norwitz
As you correctly point out, this code needs to be looked at more
carefully so that
if the device does disconnect in the background we can handle the migration path
gracefully. In particular, we need to decide whether a migration
should be allowed
to continue if a device disconnects durning the migration stage.

mst, any thoughts?

Have you looked at the suggestion I gave Li Feng to move vhost_dev_cleanup()
into the connection path in vhost-user-blk? I’m not sure if he’s
actively working on it,
but I would prefer if we can find a way to keep some state around
between reconnects
so we aren’t constantly checking dev->started. A device can be stopped
for reasons
other than backend disconnect so I’d rather not reuse this field to
check for backend
disconnect failures.

On Thu, Apr 30, 2020 at 9:57 AM Dima Stepanov  wrote:
>
> If vhost-user daemon is used as a backend for the vhost device, then we
> should consider a possibility of disconnect at any moment. If such
> disconnect happened in the vhost_migration_log() routine the vhost
> device structure will be clean up.
> At the start of the vhost_migration_log() function there is a check:
>   if (!dev->started) {
>   dev->log_enabled = enable;
>   return 0;
>   }
> To be consistent with this check add the same check after calling the
> vhost_dev_set_log() routine. This in general help not to break a

Could you point to the specific asserts which are being triggered?

> migration due the assert() message. But it looks like that this code
> should be revised to handle these errors more carefully.
>
> In case of vhost-user device backend the fail paths should consider the
> state of the device. In this case we should skip some function calls
> during rollback on the error paths, so not to get the NULL dereference
> errors.
>
> Signed-off-by: Dima Stepanov 
> ---
>  hw/virtio/vhost.c | 39 +++
>  1 file changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 3ee50c4..d5ab96d 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -787,6 +787,17 @@ static int vhost_dev_set_features(struct vhost_dev *dev,
>  static int vhost_dev_set_log(struct vhost_dev *dev, bool enable_log)
>  {
>  int r, i, idx;

A couple points here


(1) This will fail the live migration if the device is disconnected.
That my be the right thing
  to do, but if there are cases where migrations can proceed with
a disconnected device,
  this may not be desirable.

(2) This looks racy. As far as I can tell vhost_dev_set_log() is only
called by vhost_migration_log(),
  and as you say one of the first things vhost_migration_log does
is return if dev->started is not
  set. What’s to stop a disconnect from clearing the vdev right
after this check, just before
  vhost_dev_set_features() is called?

As stated above, I would prefer it if we could add some state which
would persist between
reconnects which could then be checked in the vhost-user code before
interacting with
the backend. I understand this will be a much more involved change and
will require a lot
of thought.

Also, regarding (1) above, if the original check in
vhost_migration_log() returns success if the
device is not started why return an error here? I imagine this could
lead to some inconsistent
behavior if the device disconnects before the first check verses
before the second.

> +
> +if (!dev->started) {
> +/*
> + * If vhost-user daemon is used as a backend for the
> + * device and the connection is broken, then the vhost_dev
> + * structure will be reset all its values to 0.
> + * Add additional check for the device state.
> + */
> +return -1;
> +}
> +
>  r = vhost_dev_set_features(dev, enable_log);
>  if (r < 0) {
>  goto err_features;
> @@ -801,12 +812,19 @@ static int vhost_dev_set_log(struct vhost_dev *dev, 
> bool enable_log)
>  }



Re: [PATCH v2 5/5] qemu-img: Add --start-offset and --max-length to map

2020-05-06 Thread Eric Blake

On 5/6/20 4:34 PM, Eyal Moscovici wrote:

The mapping operation of large disks especially ones stored over a
long chain of QCOW2 files can take a long time to finish.
Additionally when mapping fails there was no way recover by
restarting the mapping from the failed location.

The new options, --start-offset and --max-length allows the user to
divide these type of map operations into shorter independent tasks.

Reviewed-by: Eric Blake 


This patch has some changes from v1.  Among others,...


@@ -3041,6 +3045,18 @@ static int img_map(int argc, char **argv)
  case OPTION_OUTPUT:
  output = optarg;
  break;
+case 's':
+start_offset = cvtnum("start offset", optarg);
+if (start_offset < 0) {
+return 1;
+}
+break;


the new semantics of cvtnum() in this series is enough of a difference 
that I would have removed R-b to make sure the updated patch gets 
re-reviewed, if it had been me as author.  But in this case, it does 
look like the changes are all addressed to comments I suggested in v1, 
so I'm fine that you left my R-b.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 3/5] qemu-img: validate image length in img_map

2020-05-06 Thread Eric Blake

On 5/6/20 4:34 PM, Eyal Moscovici wrote:

The code handles this case correctly we merely skip the loop. However it
is probably best to return an explicit error.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
  qemu-img.c | 5 +
  1 file changed, 5 insertions(+)


Reviewed-by: Eric Blake 



diff --git a/qemu-img.c b/qemu-img.c
index 4a06ab7fe8..a1b507a0be 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3086,6 +3086,11 @@ static int img_map(int argc, char **argv)
  }
  
  length = blk_getlength(blk);

+if (length < 0) {
+error_report("Failed to get size for '%s'", filename);
+return 1;
+}
+
  while (curr.start + curr.length < length) {
  int64_t offset = curr.start + curr.length;
  int64_t n;



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 2/5] qemu_img: add error report to cvtnum

2020-05-06 Thread Eric Blake

On 5/6/20 4:34 PM, Eyal Moscovici wrote:

All calls to cvtnum check the return value and print the same error message more
or less. And so error reporting moved to cvtnum to reduce duplicate code and
provide a single error message.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
  qemu-img.c | 63 --
  tests/qemu-iotests/049.out |  4 +--
  2 files changed, 28 insertions(+), 39 deletions(-)





-err = qemu_strtosz(s, NULL, );
-if (err < 0) {
+err = qemu_strtosz(arg_value, NULL, );
+if (err < 0 && err != -ERANGE) {
+error_report("Invalid %s specified! You may use "
+ "k, M, G, T, P or E suffixes for ", arg_name);
+error_report("kilobytes, megabytes, gigabytes, terabytes, "
+ "petabytes and exabytes.");
 return err;
 }
-if (value > INT64_MAX) {
+if (err == -ERANGE || value > INT64_MAX) {
+error_report("Invalid %s specified! Must be less than 8 EiB!",


Copied from our pre-existing errors, but why are we shouting at our 
user?  This would be a good time to s/!/./ to tone it down a bit.



@@ -4491,10 +4488,12 @@ static int img_dd_bs(const char *arg,
  {
  int64_t res;
  
-res = cvtnum(arg);

+res = cvtnum("bs", arg);
  
-if (res <= 0) {

-error_report("invalid number: '%s'", arg);
+if (res < 0) {
+return 1;
+} else if (res == 0) {
+error_report("Invalid bs specified! It cannot be 0.");


Maybe it's worth two functions:

int64_t cvtnum_full(const char *name, const char *value, int64_t min, 
int64_t max)


and then a common helper:

int64_t cvtnum(const char *name, const char *value) {
return cvtnum_full(name, value, 0, INT64_MAX);
}

many existing callers remain with cvtnum(), but callers like this could 
be cvtnum("bs", arg, 1, INT64_MAX).  You'd still have to special-case 
other restrictions, such as whether a number must a power-of-2, but 
that's fewer places.



+++ b/tests/qemu-iotests/049.out
@@ -92,13 +92,13 @@ Formatting 'TEST_DIR/t.qcow2', fmt=qcow2 size=1649267441664 
cluster_size=65536 l
  == 3. Invalid sizes ==
  
  qemu-img create -f qcow2 TEST_DIR/t.qcow2 -- -1024

-qemu-img: Image size must be less than 8 EiB!
+qemu-img: Invalid image size specified! Must be less than 8 EiB!


Nice that you checked for iotest fallout.  Is this really the only 
impacted test?


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 1/5] qemu-img: remove check that cvtnum value > MAX_INT

2020-05-06 Thread Eric Blake

On 5/6/20 4:34 PM, Eyal Moscovici wrote:

Following commit f46bfdbfc8f95cf65d7818ef68a801e063c40332 (util/cutils: Change
qemu_strtosz*() from int64_t to uint64_t) which added a similar check to
cvtnum. As a result there is no need to check it separately outside of cvtnum.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
  qemu-img.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 6a4327aaba..116a9c6349 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4307,7 +4307,7 @@ static int img_bench(int argc, char **argv)
  int64_t sval;
  
  sval = cvtnum(optarg);

-if (sval < 0 || sval > INT_MAX) {
+if (sval < 0) {
  error_report("Invalid buffer size specified");


INT_MAX is smaller than cvtnum's check for INT64_MAX.  This code change 
allows larger buffer sizes, which is probably not a good idea.



  return 1;
  }
@@ -4320,7 +4320,7 @@ static int img_bench(int argc, char **argv)
  int64_t sval;
  
  sval = cvtnum(optarg);

-if (sval < 0 || sval > INT_MAX) {
+if (sval < 0) {
  error_report("Invalid step size specified");
  return 1;
  }
@@ -4493,7 +4493,7 @@ static int img_dd_bs(const char *arg,
  
  res = cvtnum(arg);
  
-if (res <= 0 || res > INT_MAX) {

+if (res <= 0) {
  error_report("invalid number: '%s'", arg);
  return 1;
  }



NACK.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v2 0/5] Additional parameters for qemu_img map

2020-05-06 Thread Eric Blake

On 5/6/20 4:34 PM, Eyal Moscovici wrote:

Hi,

The following series adds two parameters to qemu-img map:
1. start-offset: mapping starting offset.
2. max-length: the length of the mapping.

These parameters proved useful when mapping large disk spread across
long store file chains. It allows us to bound the execution time of each
qemu-img map execution as well as recover from failed mapping
operations. In addition the map operation can divided to
multiple independent tasks.

V2 changes:
1. add error reporting to cvtnum.
2. add image length validation in img_map.
3. rebase over QEMU 5.0.


It's better to post a v2 as a new top-level thread rather than buried 
in-reply-to the v1 thread; among other things, burying a reply can cause 
automated patch tooling to miss the updated series.


But since I see it, I'll review.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 6/8] block/vhdx: drop unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

vhdx doesn't have .bdrv_co_block_status handler, so DATA|ALLOCATED is
always assumed for it. unallocated_blocks_are_zero is useless, drop it.



After the analysis I did in patch 1, this is correct.  No behavior change.

Reviewed-by: Eric Blake 


Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/vhdx.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/block/vhdx.c b/block/vhdx.c
index aedd782604..45963a3166 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1164,9 +1164,6 @@ static int vhdx_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
  
  bdi->cluster_size = s->block_size;
  
-bdi->unallocated_blocks_are_zero =

-(s->params.data_bits & VHDX_PARAMS_HAS_PARENT) == 0;
-
  return 0;
  }
  



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 5/8] block/file-posix: drop unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

raw_co_block_status() in block/file-posix.c never returns 0, so
unallocated_blocks_are_zero is useless. Drop it.


As in 4/8, you are correct that it had no impact on block_status, but it 
did affect qemu-img convert.  So again, removing the clients first makes 
this easier to justify as a cleanup patch.


That said...



Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/file-posix.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 05e094be29..5c01735108 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2878,9 +2878,6 @@ static int coroutine_fn raw_co_pwrite_zeroes(
  
  static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)

  {
-BDRVRawState *s = bs->opaque;
-
-bdi->unallocated_blocks_are_zero = s->discard_zeroes;
  return 0;
  }


the function now does nothing.  Hmm - why does bdrv_get_info() return 
-ENOTSUP if the driver does not implement this function?  Wouldn't it be 
better if the block layer could allow us to omit .bdrv_get_info and do 
the same thing, without us having to write a dummy function that does 
nothing but return 0?  As separate patches, of course, as it would 
require changing several existing bdrv_get_info() callers to behave 
sanely when getting an all-0 success return where they now deal with an 
-ENOTSUP return.


So in the meantime, yes, we need this placeholder.
Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




[PATCH v2 5/5] qemu-img: Add --start-offset and --max-length to map

2020-05-06 Thread Eyal Moscovici
The mapping operation of large disks especially ones stored over a
long chain of QCOW2 files can take a long time to finish.
Additionally when mapping fails there was no way recover by
restarting the mapping from the failed location.

The new options, --start-offset and --max-length allows the user to
divide these type of map operations into shorter independent tasks.

Reviewed-by: Eric Blake 
Acked-by: Mark Kanda 
Co-developed-by: Yoav Elnekave 
Signed-off-by: Yoav Elnekave 
Signed-off-by: Eyal Moscovici 
---
 docs/tools/qemu-img.rst |  2 +-
 qemu-img-cmds.hx|  4 ++--
 qemu-img.c  | 22 +-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index 0080f83a76..f4ffe528ea 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -519,7 +519,7 @@ Command description:
 ``ImageInfoSpecific*`` QAPI object (e.g. ``ImageInfoSpecificQCow2``
 for qcow2 images).
 
-.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] [--output=OFMT] 
[-U] FILENAME
+.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] 
[--start-offset=OFFSET] [--max-length=LEN] [--output=OFMT] [-U] FILENAME
 
   Dump the metadata of image *FILENAME* and its backing file chain.
   In particular, this commands dumps the allocation state of every sector
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index c9c54de1df..35f832816f 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -63,9 +63,9 @@ SRST
 ERST
 
 DEF("map", img_map,
-"map [--object objectdef] [--image-opts] [-f fmt] [--output=ofmt] [-U] 
filename")
+"map [--object objectdef] [--image-opts] [-f fmt] [--start-offset=offset] 
[--max-length=len] [--output=ofmt] [-U] filename")
 SRST
-.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] [--output=OFMT] 
[-U] FILENAME
+.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] 
[--start-offset=OFFSET] [--max-length=LEN] [--output=OFMT] [-U] FILENAME
 ERST
 
 DEF("measure", img_measure,
diff --git a/qemu-img.c b/qemu-img.c
index 0a140fe564..f59b2c0a7c 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3003,6 +3003,8 @@ static int img_map(int argc, char **argv)
 int ret = 0;
 bool image_opts = false;
 bool force_share = false;
+int64_t start_offset = 0;
+int64_t max_length = -1;
 
 fmt = NULL;
 output = NULL;
@@ -3015,9 +3017,11 @@ static int img_map(int argc, char **argv)
 {"object", required_argument, 0, OPTION_OBJECT},
 {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
 {"force-share", no_argument, 0, 'U'},
+{"start-offset", required_argument, 0, 's'},
+{"max-length", required_argument, 0, 'l'},
 {0, 0, 0, 0}
 };
-c = getopt_long(argc, argv, ":f:hU",
+c = getopt_long(argc, argv, ":f:s:l:hU",
 long_options, _index);
 if (c == -1) {
 break;
@@ -3041,6 +3045,18 @@ static int img_map(int argc, char **argv)
 case OPTION_OUTPUT:
 output = optarg;
 break;
+case 's':
+start_offset = cvtnum("start offset", optarg);
+if (start_offset < 0) {
+return 1;
+}
+break;
+case 'l':
+max_length = cvtnum("max length", optarg);
+if (max_length < 0) {
+return 1;
+}
+break;
 case OPTION_OBJECT: {
 QemuOpts *opts;
 opts = qemu_opts_parse_noisily(_object_opts,
@@ -3091,7 +3107,11 @@ static int img_map(int argc, char **argv)
 error_report("Failed to get size for '%s'", filename);
 return 1;
 }
+if (max_length != -1) {
+length = MIN(start_offset + max_length, length);
+}
 
+curr.start = start_offset;
 while (curr.start + curr.length < length) {
 int64_t offset = curr.start + curr.length;
 int64_t n;
-- 
2.17.2 (Apple Git-113)




[PATCH v2 2/5] qemu_img: add error report to cvtnum

2020-05-06 Thread Eyal Moscovici
All calls to cvtnum check the return value and print the same error message more
or less. And so error reporting moved to cvtnum to reduce duplicate code and
provide a single error message.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
 qemu-img.c | 63 --
 tests/qemu-iotests/049.out |  4 +--
 2 files changed, 28 insertions(+), 39 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 116a9c6349..4a06ab7fe8 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -470,16 +470,22 @@ static int add_old_style_options(const char *fmt, 
QemuOpts *opts,
 return 0;
 }
 
-static int64_t cvtnum(const char *s)
+static int64_t cvtnum(const char *arg_name, const char *arg_value)
 {
 int err;
 uint64_t value;
 
-err = qemu_strtosz(s, NULL, );
-if (err < 0) {
+err = qemu_strtosz(arg_value, NULL, );
+if (err < 0 && err != -ERANGE) {
+error_report("Invalid %s specified! You may use "
+ "k, M, G, T, P or E suffixes for ", arg_name);
+error_report("kilobytes, megabytes, gigabytes, terabytes, "
+ "petabytes and exabytes.");
 return err;
 }
-if (value > INT64_MAX) {
+if (err == -ERANGE || value > INT64_MAX) {
+error_report("Invalid %s specified! Must be less than 8 EiB!",
+ arg_name);
 return -ERANGE;
 }
 return value;
@@ -572,16 +578,8 @@ static int img_create(int argc, char **argv)
 if (optind < argc) {
 int64_t sval;
 
-sval = cvtnum(argv[optind++]);
+sval = cvtnum("image size", argv[optind++]);
 if (sval < 0) {
-if (sval == -ERANGE) {
-error_report("Image size must be less than 8 EiB!");
-} else {
-error_report("Invalid image size specified! You may use k, M, "
-  "G, T, P or E suffixes for ");
-error_report("kilobytes, megabytes, gigabytes, terabytes, "
- "petabytes and exabytes.");
-}
 goto fail;
 }
 img_size = (uint64_t)sval;
@@ -2187,8 +2185,10 @@ static int img_convert(int argc, char **argv)
 {
 int64_t sval;
 
-sval = cvtnum(optarg);
-if (sval < 0 || !QEMU_IS_ALIGNED(sval, BDRV_SECTOR_SIZE) ||
+sval = cvtnum("buffer size for sparse output", optarg);
+if (sval < 0) {
+goto fail_getopt;
+} else if (!QEMU_IS_ALIGNED(sval, BDRV_SECTOR_SIZE) ||
 sval / BDRV_SECTOR_SIZE > MAX_BUF_SECTORS) {
 error_report("Invalid buffer size for sparse output specified. 
"
 "Valid sizes are multiples of %llu up to %llu. Select "
@@ -4291,9 +4291,8 @@ static int img_bench(int argc, char **argv)
 break;
 case 'o':
 {
-offset = cvtnum(optarg);
+offset = cvtnum("offset", optarg);
 if (offset < 0) {
-error_report("Invalid offset specified");
 return 1;
 }
 break;
@@ -4306,9 +4305,8 @@ static int img_bench(int argc, char **argv)
 {
 int64_t sval;
 
-sval = cvtnum(optarg);
+sval = cvtnum("buffer size", optarg);
 if (sval < 0) {
-error_report("Invalid buffer size specified");
 return 1;
 }
 
@@ -4319,9 +4317,8 @@ static int img_bench(int argc, char **argv)
 {
 int64_t sval;
 
-sval = cvtnum(optarg);
+sval = cvtnum("step_size", optarg);
 if (sval < 0) {
-error_report("Invalid step size specified");
 return 1;
 }
 
@@ -4491,10 +4488,12 @@ static int img_dd_bs(const char *arg,
 {
 int64_t res;
 
-res = cvtnum(arg);
+res = cvtnum("bs", arg);
 
-if (res <= 0) {
-error_report("invalid number: '%s'", arg);
+if (res < 0) {
+return 1;
+} else if (res == 0) {
+error_report("Invalid bs specified! It cannot be 0.");
 return 1;
 }
 in->bsz = out->bsz = res;
@@ -4506,10 +4505,9 @@ static int img_dd_count(const char *arg,
 struct DdIo *in, struct DdIo *out,
 struct DdInfo *dd)
 {
-dd->count = cvtnum(arg);
+dd->count = cvtnum("count", arg);
 
 if (dd->count < 0) {
-error_report("invalid number: '%s'", arg);
 return 1;
 }
 
@@ -4538,10 +4536,9 @@ static int img_dd_skip(const char *arg,
struct DdIo *in, struct DdIo *out,
struct DdInfo *dd)
 {
-in->offset = cvtnum(arg);
+in->offset = cvtnum("skip", arg);
 
 if (in->offset < 0) {
-error_report("invalid number: '%s'", arg);
 return 1;
 }
 
@@ -4923,16 +4920,8 @@ static int img_measure(int argc, char **argv)
 {
 int64_t sval;
 
-  

[PATCH v2 3/5] qemu-img: validate image length in img_map

2020-05-06 Thread Eyal Moscovici
The code handles this case correctly we merely skip the loop. However it
is probably best to return an explicit error.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
 qemu-img.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/qemu-img.c b/qemu-img.c
index 4a06ab7fe8..a1b507a0be 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -3086,6 +3086,11 @@ static int img_map(int argc, char **argv)
 }
 
 length = blk_getlength(blk);
+if (length < 0) {
+error_report("Failed to get size for '%s'", filename);
+return 1;
+}
+
 while (curr.start + curr.length < length) {
 int64_t offset = curr.start + curr.length;
 int64_t n;
-- 
2.17.2 (Apple Git-113)




[PATCH v2 4/5] qemu-img: refactor dump_map_entry JSON format output

2020-05-06 Thread Eyal Moscovici
Previously dump_map_entry identified whether we need to start a new JSON
array based on whether start address == 0. In this refactor we remove
this assumption as in following patches we will allow map to start from
an arbitrary position.

Reviewed-by: Eric Blake 
Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
 qemu-img.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index a1b507a0be..0a140fe564 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2896,9 +2896,8 @@ static int dump_map_entry(OutputFormat output_format, 
MapEntry *e,
 }
 break;
 case OFORMAT_JSON:
-printf("%s{ \"start\": %"PRId64", \"length\": %"PRId64","
+printf("{ \"start\": %"PRId64", \"length\": %"PRId64","
" \"depth\": %"PRId64", \"zero\": %s, \"data\": %s",
-   (e->start == 0 ? "[" : ",\n"),
e->start, e->length, e->depth,
e->zero ? "true" : "false",
e->data ? "true" : "false");
@@ -2907,8 +2906,8 @@ static int dump_map_entry(OutputFormat output_format, 
MapEntry *e,
 }
 putchar('}');
 
-if (!next) {
-printf("]\n");
+if (next) {
+puts(",");
 }
 break;
 }
@@ -3083,6 +3082,8 @@ static int img_map(int argc, char **argv)
 
 if (output_format == OFORMAT_HUMAN) {
 printf("%-16s%-16s%-16s%s\n", "Offset", "Length", "Mapped to", "File");
+} else if (output_format == OFORMAT_JSON) {
+putchar('[');
 }
 
 length = blk_getlength(blk);
@@ -3119,6 +3120,9 @@ static int img_map(int argc, char **argv)
 }
 
 ret = dump_map_entry(output_format, , NULL);
+if (output_format == OFORMAT_JSON) {
+puts("]");
+}
 
 out:
 blk_unref(blk);
-- 
2.17.2 (Apple Git-113)




[PATCH v2 0/5] Additional parameters for qemu_img map

2020-05-06 Thread Eyal Moscovici
Hi,

The following series adds two parameters to qemu-img map:
1. start-offset: mapping starting offset.
2. max-length: the length of the mapping.

These parameters proved useful when mapping large disk spread across
long store file chains. It allows us to bound the execution time of each
qemu-img map execution as well as recover from failed mapping
operations. In addition the map operation can divided to
multiple independent tasks.

V2 changes:
1. add error reporting to cvtnum.
2. add image length validation in img_map.
3. rebase over QEMU 5.0.

Eyal Moscovici (5):
  qemu-img: remove check that cvtnum value > MAX_INT
  qemu_img: add error report to cvtnum
  qemu-img: validate image length in img_map
  qemu-img: refactor dump_map_entry JSON format output
  qemu-img: Add --start-offset and --max-length to map

 docs/tools/qemu-img.rst|   2 +-
 qemu-img-cmds.hx   |   4 +-
 qemu-img.c | 106 ++---
 tests/qemu-iotests/049.out |   4 +-
 4 files changed, 67 insertions(+), 49 deletions(-)

-- 
2.17.2 (Apple Git-113)




[PATCH v2 1/5] qemu-img: remove check that cvtnum value > MAX_INT

2020-05-06 Thread Eyal Moscovici
Following commit f46bfdbfc8f95cf65d7818ef68a801e063c40332 (util/cutils: Change
qemu_strtosz*() from int64_t to uint64_t) which added a similar check to
cvtnum. As a result there is no need to check it separately outside of cvtnum.

Acked-by: Mark Kanda 
Signed-off-by: Eyal Moscovici 
---
 qemu-img.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 6a4327aaba..116a9c6349 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4307,7 +4307,7 @@ static int img_bench(int argc, char **argv)
 int64_t sval;
 
 sval = cvtnum(optarg);
-if (sval < 0 || sval > INT_MAX) {
+if (sval < 0) {
 error_report("Invalid buffer size specified");
 return 1;
 }
@@ -4320,7 +4320,7 @@ static int img_bench(int argc, char **argv)
 int64_t sval;
 
 sval = cvtnum(optarg);
-if (sval < 0 || sval > INT_MAX) {
+if (sval < 0) {
 error_report("Invalid step size specified");
 return 1;
 }
@@ -4493,7 +4493,7 @@ static int img_dd_bs(const char *arg,
 
 res = cvtnum(arg);
 
-if (res <= 0 || res > INT_MAX) {
+if (res <= 0) {
 error_report("invalid number: '%s'", arg);
 return 1;
 }
-- 
2.17.2 (Apple Git-113)




Re: [PATCH 4/8] block/iscsi: drop unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

We set bdi->unallocated_blocks_are_zero = iscsilun->lbprz, but
iscsi_co_block_status doesn't return 0 in case of iscsilun->lbprz, it
returns ZERO when appropriate. So actually unallocated_blocks_are_zero
is useless. Drop it now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/iscsi.c | 1 -
  1 file changed, 1 deletion(-)


This one is easier to justify after removing the 2 clients.  But it's 
simpler than patch 1 in that because block_status never returned 0, this 
has no visible impact to 'qemu-io -c map' or similar, so it doesn't need 
the commit message justification about any change in behavior like patch 
1 needed.


Reviewed-by: Eric Blake 



diff --git a/block/iscsi.c b/block/iscsi.c
index a8b76979d8..767e3e75fd 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -2163,7 +2163,6 @@ static int coroutine_fn 
iscsi_co_truncate(BlockDriverState *bs, int64_t offset,
  static int iscsi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
  {
  IscsiLun *iscsilun = bs->opaque;
-bdi->unallocated_blocks_are_zero = iscsilun->lbprz;
  bdi->cluster_size = iscsilun->cluster_size;
  return 0;
  }



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 3/8] block/crypto: drop unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

It's false by default, no needs to set it. We are going to drop this
variable at all, so drop it now here, it doesn't hurt.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/crypto.c | 1 -
  1 file changed, 1 deletion(-)


Trivially correct, regardless of clients.

Reviewed-by: Eric Blake 



diff --git a/block/crypto.c b/block/crypto.c
index e02f343590..7685e61844 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -694,7 +694,6 @@ static int block_crypto_get_info_luks(BlockDriverState *bs,
  return ret;
  }
  
-bdi->unallocated_blocks_are_zero = false;

  bdi->cluster_size = subbdi.cluster_size;
  
  return 0;




--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 2/8] block/vpc: return ZERO block-status when appropriate

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

In case when get_image_offset() returns -1, we do zero out the
corresponding chunk of qiov. So, this should be reported as ZERO.

After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense. Drop it.


Same analysis as in patch 1 as to the lone two clients that cared, and 
the fact that we are changing 'qemu-io -c map' output by reporting data 
as allocated now.  But I concur that as there is never a backing file, 
the change is not a regression, but rather a bug fix.




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/vpc.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)


While the commit message could be improved, the code change itself looks 
correct.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 30/31] qcow2: Add subcluster support to qcow2_measure()

2020-05-06 Thread Eric Blake

On 5/5/20 12:38 PM, Alberto Garcia wrote:

Extended L2 entries are bigger than normal L2 entries so this has an
impact on the amount of metadata needed for a qcow2 file.

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---
  block/qcow2.c | 19 ---
  1 file changed, 12 insertions(+), 7 deletions(-)


Should this be hoisted earlier in the series, before 28/31?

Should there be iotest coverage?

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 29/31] qcow2: Assert that expand_zero_clusters_in_l1() does not support subclusters

2020-05-06 Thread Eric Blake

On 5/5/20 12:38 PM, Alberto Garcia wrote:

This function is only used by qcow2_expand_zero_clusters() to
downgrade a qcow2 image to a previous version. It is however not
possible to downgrade an image with extended L2 entries because older
versions of qcow2 do not have this feature.

Signed-off-by: Alberto Garcia 
---
  block/qcow2-cluster.c  | 8 +++-
  tests/qemu-iotests/061 | 6 ++
  tests/qemu-iotests/061.out | 5 +
  3 files changed, 18 insertions(+), 1 deletion(-)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 28/31] qcow2: Add the 'extended_l2' option and the QCOW2_INCOMPAT_EXTL2 bit

2020-05-06 Thread Eric Blake

On 5/5/20 12:38 PM, Alberto Garcia wrote:

Now that the implementation of subclusters is complete we can finally
add the necessary options to create and read images with this feature,
which we call "extended L2 entries".

Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
---


What you have looks good, but I didn't notice anything affecting amend. 
The simplest option: amend can reject attempts to toggle the extended L2 
option (the zstd compression patches take that path).   More complicated 
is actually supporting it (in either direction, turning it on or off), 
which requires rewriting ALL L2 tables in the entry (including any in 
internal snapshots), which could be quite time-intensive, and where you 
must be careful to stage things so that failures during partial 
conversion merely leave leaked clusters rather than a header pointing to 
a half-converted state.  Either way, one of the iotests should probably 
add coverage on what happens when you attempt to amend the bit on or off.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 23/31] qcow2: Add subcluster support to check_refcounts_l2()

2020-05-06 Thread Eric Blake

On 5/5/20 12:38 PM, Alberto Garcia wrote:

Setting the QCOW_OFLAG_ZERO bit of the L2 entry is forbidden if an
image has subclusters. Instead, the individual 'all zeroes' bits must
be used.


Should we s/is forbidden/is ignored/ based on your spec changes?

But the patch itself is right.
Reviewed-by: Eric Blake 



Signed-off-by: Alberto Garcia 
Reviewed-by: Max Reitz 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
  block/qcow2-refcount.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index dfdcdd3c25..9bb161481e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1686,8 +1686,13 @@ static int check_refcounts_l2(BlockDriverState *bs, 
BdrvCheckResult *res,
  int ign = active ? QCOW2_OL_ACTIVE_L2 :
 QCOW2_OL_INACTIVE_L2;
  
-l2_entry = QCOW_OFLAG_ZERO;

-set_l2_entry(s, l2_table, i, l2_entry);
+if (has_subclusters(s)) {
+set_l2_entry(s, l2_table, i, 0);
+set_l2_bitmap(s, l2_table, i,
+  QCOW_L2_BITMAP_ALL_ZEROES);
+} else {
+set_l2_entry(s, l2_table, i, QCOW_OFLAG_ZERO);
+}
  ret = qcow2_pre_write_overlap_check(bs, ign,
  l2e_offset, l2_entry_size(s), false);
  if (ret < 0) {



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 20/31] qcow2: Add subcluster support to qcow2_get_host_offset()

2020-05-06 Thread Eric Blake

On 5/5/20 12:38 PM, Alberto Garcia wrote:

The logic of this function remains pretty much the same, except that
it uses count_contiguous_subclusters(), which combines the logic of
count_contiguous_clusters() / count_contiguous_clusters_unallocated()
and checks individual subclusters.



Maybe mention that qcow2_cluster_to_subcluster_type() is now inlined 
into its lone remaining caller.



Signed-off-by: Alberto Garcia 
---
  block/qcow2.h |  38 +---
  block/qcow2-cluster.c | 141 --
  2 files changed, 82 insertions(+), 97 deletions(-)




+++ b/block/qcow2-cluster.c
@@ -376,66 +376,58 @@ fail:



+static int count_contiguous_subclusters(BlockDriverState *bs, int nb_clusters,
+unsigned sc_index, uint64_t *l2_slice,
+int l2_index)
  {



+
+assert(type != QCOW2_SUBCLUSTER_INVALID); /* The caller should check this 
*/
+assert(l2_index + nb_clusters <= s->l2_size);
+
+if (type == QCOW2_SUBCLUSTER_COMPRESSED) {
+/* Compressed clusters are always processed one by one */
+return s->subclusters_per_cluster - sc_index;


Should this assert(sc_index == 0)?


  for (i = 0; i < nb_clusters; i++) {
-uint64_t entry = get_l2_entry(s, l2_slice, l2_index + i);
-QCow2ClusterType type = qcow2_get_cluster_type(bs, entry);
-
-if (type != wanted_type) {
-break;
+l2_entry = get_l2_entry(s, l2_slice, l2_index + i);
+l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + i);
+if (check_offset && expected_offset != (l2_entry & L2E_OFFSET_MASK)) {
+return count;
+}
+for (j = (i == 0) ? sc_index : 0; j < s->subclusters_per_cluster; j++) 
{
+if (qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, j) != type) 
{
+return count;
+}


This really is checking that sub-clusters have the exact same type.


@@ -604,24 +604,17 @@ int qcow2_get_host_offset(BlockDriverState *bs, uint64_t 
offset,
  ret = -EIO;
  goto fail;
  }
-/* Compressed clusters can only be processed one by one */
-c = 1;
  *host_offset = l2_entry & L2E_COMPRESSED_OFFSET_SIZE_MASK;
  break;
-case QCOW2_CLUSTER_ZERO_PLAIN:
-case QCOW2_CLUSTER_UNALLOCATED:
-/* how many empty clusters ? */
-c = count_contiguous_clusters_unallocated(bs, nb_clusters,
-  l2_slice, l2_index, type);


The old code was counting how many contiguous clusters has similar 
types, coalescing ranges of two different cluster types into one 
nb_clusters result.



+case QCOW2_SUBCLUSTER_ZERO_PLAIN:
+case QCOW2_SUBCLUSTER_UNALLOCATED_PLAIN:
  *host_offset = 0;
  break;
-case QCOW2_CLUSTER_ZERO_ALLOC:
-case QCOW2_CLUSTER_NORMAL: {
+case QCOW2_SUBCLUSTER_ZERO_ALLOC:
+case QCOW2_SUBCLUSTER_NORMAL:
+case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC: {
  uint64_t host_cluster_offset = l2_entry & L2E_OFFSET_MASK;
  *host_offset = host_cluster_offset + offset_in_cluster;
-/* how many allocated clusters ? */
-c = count_contiguous_clusters(bs, nb_clusters, s->cluster_size,
-  l2_slice, l2_index, QCOW_OFLAG_ZERO);


and here coalescing three different cluster types into one nb_clusters 
result.



  if (offset_into_cluster(s, host_cluster_offset)) {
  qcow2_signal_corruption(bs, true, -1, -1,
  "Cluster allocation offset %#"
@@ -647,9 +640,11 @@ int qcow2_get_host_offset(BlockDriverState *bs, uint64_t 
offset,
  abort();
  }
  
+sc = count_contiguous_subclusters(bs, nb_clusters, sc_index,

+  l2_slice, l2_index);


But the new code is stopping at the first different subcluster type, 
rather than trying to coalesce as many compatible types into one larger 
nb_clusters.  When coupled with patch 19, that factors into my concern 
over whether patch 19 needed to check for INVALID clusters in the 
middle, or whether its fail: label was unreachable.  But it also means 
that you are potentially fragmenting the write in more places (every 
time a subcluster status changes) rather than coalescing similar status, 
the way the old code used to operate.


I don't think the extra fragmentation causes any correctness issues, but 
it may cause performance issues.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 19/31] qcow2: Add subcluster support to calculate_l2_meta()

2020-05-06 Thread Eric Blake

On 5/6/20 12:14 PM, Alberto Garcia wrote:


Note that you are only skipping until the first normal subcluster, even
if other zero/unallocated clusters occur between the first normal
cluster and the start of the action.


That's correct.


Or visually, suppose we have:

--0-NN-0___

as our 32 subclusters, with sc_index of 8.  You will end up skipping
subclusters 0-3, but NOT 6 and 7.


That's correct. This function works with the assumption that the initial
COW region is located immediately before the data region, which is in
turn contiguous to the tail COW region.


...

Still, even though we spend time copying the allocated contents of
those two subclusters, we also copy the subcluster status, and the
guest still ends up reading the same data as before.


No, the subcluster status is updated and those subclusters are marked
now as allocated. That's actually why we can use the _RANGE masks that
you proposed here:

https://lists.gnu.org/archive/html/qemu-block/2020-04/msg01155.html

In other words, we have this bitmap:

--0-NN-0___

If we write to subcluster 8 (which was already allocated), the resulting
bitmap is this one:

--0-___

The last block in iotest 271 deals exactly with this kind of scenarios.


Ah, that's what I was missing. So we indeed do more I/O than strictly 
necessary (by reading from the backing file or writing explicit zeroes 
rather than leaving unallocated or zero subclusters), but we are careful 
that the COW range of the operation puts the correct data into the head 
and tail (either from the backing file or explicitly zero) so that even 
though the status changes to N, the guest still sees the same contents.


I also agree that it is not worth trying to optimize to those later 
subclusters - it adds a lot of complexity for something that does not 
occur as frequently.  It is more important to optimize to the initial 
sequence of unalloc/zero clusters, but once we hit data, taking the 
penalty of COWing a few extra subclusters isn't going to be much worse.





   /* Get the L2 entry of the last cluster */
-l2_entry = get_l2_entry(s, l2_slice, l2_index + nb_clusters - 1);
-type = qcow2_get_cluster_type(bs, l2_entry);
+l2_index += nb_clusters - 1;
+l2_entry = get_l2_entry(s, l2_slice, l2_index);
+l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
+sc_index = offset_to_sc_index(s, guest_offset + bytes - 1);
+type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);


[1] but here, we are skipping any intermediate clusters, and worrying
only about the state of the final cluster.  Is that always going to do
the correct thing, or will there be cases where we need to update the
L2 entries of intermediate clusters?


Cluster 1 Cluster 2 Cluster 3
|-|-|-|
<---cow_start--><---write request><--cow_end--->


All L2 entries from the beginning of cow_start until the end of cow_end
are always updated. That's again what the loop that I optimized using
the _RANGE masks (and that I liked above) was doing.


I guess the other thing that helps is that even if there are 
intermediate invalid subclusters, ideally the caller of this function is 
setting nb_clusters according to its own scan of how many contiguous 
clusters are even available for us to attempt to write into.  So patch 
20/31 catches the case where we don't have contiguous clusters if any of 
the subclusters are invalid.  And in _this_ patch, it shouldn't really 
matter if we don't detect that we are overwriting what used to be an 
invalid subcluster with something that is now valid data.




The code in calculate_l2_meta() is only concerned with determining the
actual start and end points. Everything between them will be written to
and marked as allocated. It's only the subclusters outside that range
that keep the previous values (unallocated, or zero).


I think you've convinced me that we are safe on what this function does. 
 In fact, if we rely on 20/31 checking for invalid subclusters when 
computing nb_clusters, we could probably assert that the start and end 
cluster in this function are not invalid, instead of adding the fail: 
label.  But even if you don't do that, I can live with:


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v5 19/31] qcow2: Add subcluster support to calculate_l2_meta()

2020-05-06 Thread Alberto Garcia
On Tue 05 May 2020 11:59:18 PM CEST, Eric Blake wrote:
>> +for (i = first_sc; i <= last_sc; i++) {
>> +unsigned c = i / s->subclusters_per_cluster;
>> +unsigned sc = i % s->subclusters_per_cluster;
>> +l2_entry = get_l2_entry(s, l2_slice, l2_index + c);
>> +l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index + c);
>> +type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc);
>> +if (type == QCOW2_SUBCLUSTER_INVALID) {
>> +l2_index += c; /* Point to the invalid entry */
>> +goto fail;
>> +}
>> +if (type != QCOW2_SUBCLUSTER_NORMAL) {
>>   break;
>>   }
>>   }
>
> This loop is now 32 times slower (for extended L2 entries).  Do you
> really need to check for an invalid subcluster here, or can we just
> blindly check that all 32 subclusters are NORMAL, and leave handling
> of invalid clusters for the rest of the function after we failed the
> exit-early test?  For that matter, for all but the first and last
> cluster, checking if 32 clusters are NORMAL is a simple 64-bit
> comparison rather than 32 iterations of a loop; and even for the first
> and last cluster, the _RANGE macros in 14/31 work well to mask out
> which bits must be set/cleared.  My guess is that optimizing this loop
> is worthwhile, since overwriting existing data is probably more common
> than allocating new data.

I think you're right, and now we have the _RANGE macros so it should be
doable. I'll look into it.

>> -if (i == nb_clusters) {
>> -return;
>> +if (i == last_sc + 1) {
>> +return 0;
>>   }
>>   }
>
> If we get here, then i is now the address of the first subcluster that 
> was not NORMAL, even if it is much smaller than the final subcluster 
> learned by nb_clusters for the overall request.  [1]

I'm replying to this part later in [1]

>>   /* Get the L2 entry of the first cluster */
>>   l2_entry = get_l2_entry(s, l2_slice, l2_index);
>> -type = qcow2_get_cluster_type(bs, l2_entry);
>> +l2_bitmap = get_l2_bitmap(s, l2_slice, l2_index);
>> +sc_index = offset_to_sc_index(s, guest_offset);
>> +type = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, sc_index);
>>   
>> -if (type == QCOW2_CLUSTER_NORMAL && keep_old) {
>> -cow_start_from = cow_start_to;
>> +if (type == QCOW2_SUBCLUSTER_INVALID) {
>> +goto fail;
>> +}
>> +
>> +if (!keep_old) {
>> +switch (type) {
>> +case QCOW2_SUBCLUSTER_COMPRESSED:
>> +cow_start_from = 0;
>> +break;
>> +case QCOW2_SUBCLUSTER_NORMAL:
>> +case QCOW2_SUBCLUSTER_ZERO_ALLOC:
>> +case QCOW2_SUBCLUSTER_UNALLOCATED_ALLOC: {
>> +int i;
>> +/* Skip all leading zero and unallocated subclusters */
>> +for (i = 0; i < sc_index; i++) {
>> +QCow2SubclusterType t;
>> +t = qcow2_get_subcluster_type(bs, l2_entry, l2_bitmap, i);
>> +if (t == QCOW2_SUBCLUSTER_INVALID) {
>> +goto fail;
>> +} else if (t == QCOW2_SUBCLUSTER_NORMAL) {
>> +break;
>> +}
>> +}
>> +cow_start_from = i << s->subcluster_bits;
>> +break;
>
> Note that you are only skipping until the first normal subcluster, even 
> if other zero/unallocated clusters occur between the first normal 
> cluster and the start of the action.

That's correct.

> Or visually, suppose we have:
>
> --0-NN-0___
>
> as our 32 subclusters, with sc_index of 8.  You will end up skipping
> subclusters 0-3, but NOT 6 and 7.

That's correct. This function works with the assumption that the initial
COW region is located immediately before the data region, which is in
turn contiguous to the tail COW region.

I'm actually not sure that it necessarily has to be that way, but at
least it seems that functions like handle_alloc_space() rely on
that. Certainly before subclusters I don't see how there would be any
space between the COW regions and the actual data region.

While checking the documentation of QCowL2Meta I also realized that
maybe it also needs to be updated. "The COW Region between the start of
the first allocated cluster and the area the guest actually writes to",
it's not necessarily the start of the cluster anymore, although the word
"between" leaves some room for interpretation.

Anyway, even if we could do COW of subclusters 4-5 only, there's no
general way to do that without touching QCowL2Meta or using more than
one structure (imagine we have -N-N-N-N_ ...). I'm also not sure
that it's worth it.

> Still, even though we spend time copying the allocated contents of
> those two subclusters, we also copy the subcluster status, and the
> guest still ends up reading the same data as before.

No, the subcluster status is 

Re: [PATCH v3 32/33] block: Pass BdrvChildRole in remaining cases

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> These calls have no real use for the child role yet, but it will not
> harm to give one.
> 
> Notably, the bdrv_root_attach_child() call in blockjob.c is left
> unmodified because there is not much the generic BlockJob object wants
> from its children.
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 

> diff --git a/block/vvfat.c b/block/vvfat.c
> index 8f4ff5a97e..d4f4218924 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -3186,7 +3186,7 @@ static int enable_write_target(BlockDriverState *bs, 
> Error **errp)
>  options = qdict_new();
>  qdict_put_str(options, "write-target.driver", "qcow");
>  s->qcow = bdrv_open_child(s->qcow_filename, options, "write-target", bs,
> -  _vvfat_qcow, 0, false, errp);
> +  _vvfat_qcow, BDRV_CHILD_DATA, false, 
> errp);

Doesn't it contain metadata, too?

Kevin




Re: [PATCH v3 26/33] block: Use child_of_bds in remaining places

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> Replace child_file by child_of_bds in all remaining places (excluding
> tests).
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 

> diff --git a/block/blkreplay.c b/block/blkreplay.c
> index f97493f45a..71628f4d56 100644
> --- a/block/blkreplay.c
> +++ b/block/blkreplay.c
> @@ -27,8 +27,9 @@ static int blkreplay_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  int ret;
>  
>  /* Open the image file */
> -bs->file = bdrv_open_child(NULL, options, "image",
> -   bs, _file, 0, false, _err);
> +bs->file = bdrv_open_child(NULL, options, "image", bs, _of_bds,
> +   BDRV_CHILD_DATA | BDRV_CHILD_PRIMARY,
> +   false, _err);

Why isn't blkreplay a filter?

Kevin




Re: [PATCH v3 22/33] block: Make backing files child_of_bds children

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> Make all parents of backing files pass the appropriate BdrvChildRole.
> By doing so, we can switch their BdrvChildClass over to the generic
> child_of_bds, which will do the right thing when given a correct
> BdrvChildRole.
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 
> ---
>  block.c | 26 --
>  block/backup-top.c  |  2 +-
>  block/vvfat.c   |  3 ++-
>  tests/test-bdrv-drain.c | 13 +++--
>  4 files changed, 30 insertions(+), 14 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 43df38ca30..31affcb4ee 100644
> --- a/block.c
> +++ b/block.c
> @@ -2770,6 +2770,20 @@ static bool 
> bdrv_inherits_from_recursive(BlockDriverState *child,
>  return child != NULL;
>  }
>  
> +/*
> + * Return the BdrvChildRole for @bs's backing child.  bs->backing is
> + * mostly used for COW backing children (role = COW), but also for
> + * filtered children (role = FILTERED | PRIMARY).
> + */
> +static BdrvChildRole bdrv_backing_role(BlockDriverState *bs)
> +{
> +if (bs->drv && bs->drv->is_filter) {
> +return BDRV_CHILD_FILTERED | BDRV_CHILD_PRIMARY;
> +} else {
> +return BDRV_CHILD_COW;
> +}
> +}

bdrv_mirror_top and bdrv_commit_top don't set .is_filter, so it will
return the wrong role for them. (They also don't set .supports_backing,
so grepping for that wouldn't help...)

>  /*
>   * Sets the backing file link of a BDS. A new reference is created; callers
>   * which don't need their own reference any more must call bdrv_unref().
> @@ -2797,8 +2811,8 @@ void bdrv_set_backing_hd(BlockDriverState *bs, 
> BlockDriverState *backing_hd,
>  goto out;
>  }
>  
> -bs->backing = bdrv_attach_child(bs, backing_hd, "backing", 
> _backing,
> -0, errp);
> +bs->backing = bdrv_attach_child(bs, backing_hd, "backing", _of_bds,
> +bdrv_backing_role(bs), errp);
>  /* If backing_hd was already part of bs's backing chain, and
>   * inherits_from pointed recursively to bs then let's update it to
>   * point directly to bs (else it will become NULL). */
> @@ -2895,7 +2909,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
> *parent_options,
>  }
>  
>  backing_hd = bdrv_open_inherit(backing_filename, reference, options, 0, 
> bs,
> -   _backing, 0, errp);
> +   _of_bds, BDRV_CHILD_COW, errp);

Wouldn't it be more consistent to use bdrv_backing_role() here, too?

Kevin




Re: [PATCH 8/8] block: drop unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

Currently this field only set by qed and qcow2. But in fact, all
backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share
this semantics: on unallocated blocks, if there is no backing file they
just memset the buffer with zeroes.


In checking the behavior of all 5 .supports_backing drivers, I noticed 
that qed.c:qed_read_backing_file() does a lot of redundant work in 
computing the backing file size itself, when it could just defer to the 
block layer like all the other drivers do.  That would be a separate patch.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 8/8] block: drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

06.05.2020 18:14, Eric Blake wrote:

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

Currently this field only set by qed and qcow2.


Well, only after patches 1-6 (prior to then, it was also set in protocol 
drivers).  I think you might be able to hoist part of this patch earlier in the 
series, to make the changes to the protocol drivers easier to review, by 
rewording this sentence:

Currently, the only format drivers that set this field are qed and qcow2.


But in fact, all
backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share
this semantics: on unallocated blocks, if there is no backing file they


s/this/these/


just memset the buffer with zeroes.

So, document this behavior for .supports_backing and drop
.unallocated_blocks_are_zero

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block.h |  6 --
  include/block/block_int.h | 13 -
  block.c   | 15 ---
  block/io.c    |  8 
  block/qcow2.c |  1 -
  block/qed.c   |  1 -
  6 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 8b62429aa4..db1cb503ec 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -21,11 +21,6 @@ typedef struct BlockDriverInfo {
  /* offset at which the VM state can be saved (0 if not possible) */
  int64_t vm_state_offset;
  bool is_dirty;
-    /*
- * True if unallocated blocks read back as zeroes. This is equivalent
- * to the LBPRZ flag in the SCSI logical block provisioning page.
- */
-    bool unallocated_blocks_are_zero;


You can't delete this field until all protocol drivers are cleaned up, so 
deferring this part of the change to the end of the series makes sense.


  /*
   * True if this block driver only supports compressed writes
   */
@@ -431,7 +426,6 @@ int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, 
int64_t bytes);
  int bdrv_has_zero_init_1(BlockDriverState *bs);
  int bdrv_has_zero_init(BlockDriverState *bs);
  int bdrv_has_zero_init_truncate(BlockDriverState *bs);
-bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);


Doing this cleanup makes sense: there is only one caller of this function 
pre-patch, and 0 callers post-patch - but whether the cleanup should be at the 
same time as you fix the one caller, or deferred to when you also clean up the 
field, is less important.

If I were writing the series:

1 - fix qemu-img.c to not use the field
2 - fix block_status to not use the function
3-n - fix protocol drivers that set the field to also return _ZERO
  during block status (but not delete the field at that time)
n+1 - delete unused function and field (from ALL drivers)


  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
    int64_t bytes, int64_t *pnum, int64_t *map,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 92335f33c7..c156b22c6b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -115,7 +115,18 @@ struct BlockDriver {
   */
  bool bdrv_needs_filename;
-    /* Set if a driver can support backing files */
+    /*
+ * Set if a driver can support backing files. This also implies the
+ * following semantics:
+ *
+ *  - Return status 0 of .bdrv_co_block_status means that corresponding
+ *    blocks are not allocated in this layer of backing-chain
+ *  - For such (unallocated) blocks, read will:
+ *    - read from backing file if there is one and big enough


s/and/and it is/


+ *    - fill buffer with zeroes if there is no backing file
+ *    - space after EOF of the backing file considered as zero
+ *  (corresponding part of read-buffer must be zeroed by driver)


Does the driver actually have to do the zeroing?  Looking at qcow2.c, I see:
static coroutine_fn int qcow2_co_preadv_task(BlockDriverState *bs,
...

     case QCOW2_CLUSTER_UNALLOCATED:
     assert(bs->backing); /* otherwise handled in qcow2_co_preadv_part */

     BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
     return bdrv_co_preadv_part(bs->backing, offset, bytes,
    qiov, qiov_offset, 0);

which just defers to the block layer to handle read-beyond-EOF, rather than an 
explicit memset in the driver.


Hmm, yes.



So maybe you can simplify to:
- For such (unallocated) blocks, read will:
   - fill buffer with zeros if there is no backing file
   - read from the backing file otherwise, where the block layer
     takes care of reading zeros beyond EOF if backing file is short



OK


But the effect is the same.


+++ b/block/io.c
@@ -2385,16 +2385,16 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
  if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
  ret |= BDRV_BLOCK_ALLOCATED;
-    } else if (want_zero) {
-    

Re: [PATCH v5 07/31] qcow2: Document the Extended L2 Entries feature

2020-05-06 Thread Alberto Garcia
On Wed 06 May 2020 05:02:52 PM CEST, Alberto Garcia wrote:
>>> -With version 2, this is always 0.
>>> +With version 2 or with extended L2 entries (see the 
>>> next
>>> +section), this is always 0.
>>
>> In your cover letter, you said you changed things to tolerate this
>> bit being set even with extended L2 entries.  Does this sentence need
>> a tweak?
>
> Not a bad idea.

I just had a look at what happens with v2 images. There the bit is
actually checked and the image is marked corrupt, but that only seems to
happen in qcow2_get_host_offset(). You can write to a cluster that has
that bit set and QEMU won't complain (the bit is however cleared).

'qemu-img check' does not detect any inconsistency either.

As I said in my cover letter maybe it's worth preparing a separate
series that adds a QCOW2_CLUSTER_INVALID type and includes all these
cases.

Berto



Re: [PATCH 8/8] block: drop unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

Currently this field only set by qed and qcow2.


Well, only after patches 1-6 (prior to then, it was also set in protocol 
drivers).  I think you might be able to hoist part of this patch earlier 
in the series, to make the changes to the protocol drivers easier to 
review, by rewording this sentence:


Currently, the only format drivers that set this field are qed and qcow2.


But in fact, all
backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share
this semantics: on unallocated blocks, if there is no backing file they


s/this/these/


just memset the buffer with zeroes.

So, document this behavior for .supports_backing and drop
.unallocated_blocks_are_zero

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/block/block.h |  6 --
  include/block/block_int.h | 13 -
  block.c   | 15 ---
  block/io.c|  8 
  block/qcow2.c |  1 -
  block/qed.c   |  1 -
  6 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 8b62429aa4..db1cb503ec 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -21,11 +21,6 @@ typedef struct BlockDriverInfo {
  /* offset at which the VM state can be saved (0 if not possible) */
  int64_t vm_state_offset;
  bool is_dirty;
-/*
- * True if unallocated blocks read back as zeroes. This is equivalent
- * to the LBPRZ flag in the SCSI logical block provisioning page.
- */
-bool unallocated_blocks_are_zero;


You can't delete this field until all protocol drivers are cleaned up, 
so deferring this part of the change to the end of the series makes sense.



  /*
   * True if this block driver only supports compressed writes
   */
@@ -431,7 +426,6 @@ int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, 
int64_t bytes);
  int bdrv_has_zero_init_1(BlockDriverState *bs);
  int bdrv_has_zero_init(BlockDriverState *bs);
  int bdrv_has_zero_init_truncate(BlockDriverState *bs);
-bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);


Doing this cleanup makes sense: there is only one caller of this 
function pre-patch, and 0 callers post-patch - but whether the cleanup 
should be at the same time as you fix the one caller, or deferred to 
when you also clean up the field, is less important.


If I were writing the series:

1 - fix qemu-img.c to not use the field
2 - fix block_status to not use the function
3-n - fix protocol drivers that set the field to also return _ZERO
 during block status (but not delete the field at that time)
n+1 - delete unused function and field (from ALL drivers)


  bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
  int bdrv_block_status(BlockDriverState *bs, int64_t offset,
int64_t bytes, int64_t *pnum, int64_t *map,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 92335f33c7..c156b22c6b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -115,7 +115,18 @@ struct BlockDriver {
   */
  bool bdrv_needs_filename;
  
-/* Set if a driver can support backing files */

+/*
+ * Set if a driver can support backing files. This also implies the
+ * following semantics:
+ *
+ *  - Return status 0 of .bdrv_co_block_status means that corresponding
+ *blocks are not allocated in this layer of backing-chain
+ *  - For such (unallocated) blocks, read will:
+ *- read from backing file if there is one and big enough


s/and/and it is/


+ *- fill buffer with zeroes if there is no backing file
+ *- space after EOF of the backing file considered as zero
+ *  (corresponding part of read-buffer must be zeroed by driver)


Does the driver actually have to do the zeroing?  Looking at qcow2.c, I see:
static coroutine_fn int qcow2_co_preadv_task(BlockDriverState *bs,
...

case QCOW2_CLUSTER_UNALLOCATED:
assert(bs->backing); /* otherwise handled in 
qcow2_co_preadv_part */


BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
return bdrv_co_preadv_part(bs->backing, offset, bytes,
   qiov, qiov_offset, 0);

which just defers to the block layer to handle read-beyond-EOF, rather 
than an explicit memset in the driver.


So maybe you can simplify to:
- For such (unallocated) blocks, read will:
  - fill buffer with zeros if there is no backing file
  - read from the backing file otherwise, where the block layer
takes care of reading zeros beyond EOF if backing file is short

But the effect is the same.


+++ b/block/io.c
@@ -2385,16 +2385,16 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
  
  if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {

  ret |= BDRV_BLOCK_ALLOCATED;
-} else if (want_zero) {
-if (bdrv_unallocated_blocks_are_zero(bs)) {
- 

Re: [PATCH v5 07/31] qcow2: Document the Extended L2 Entries feature

2020-05-06 Thread Alberto Garcia
On Tue 05 May 2020 09:35:06 PM CEST, Eric Blake wrote:
> On 5/5/20 12:38 PM, Alberto Garcia wrote:
>> Subcluster allocation in qcow2 is implemented by extending the
>> existing L2 table entries and adding additional information to
>> indicate the allocation status of each subcluster.
>> 
>> This patch documents the changes to the qcow2 format and how they
>> affect the calculation of the L2 cache size.
>> 
>> Signed-off-by: Alberto Garcia 
>> Reviewed-by: Max Reitz 
>> ---
>>   docs/interop/qcow2.txt | 68 --
>>   docs/qcow2-cache.txt   | 19 +++-
>>   2 files changed, 83 insertions(+), 4 deletions(-)
>> 
>
>> @@ -547,7 +557,8 @@ Standard Cluster Descriptor:
>>   nor is data read from the backing file if the cluster 
>> is
>>   unallocated.
>>   
>> -With version 2, this is always 0.
>> +With version 2 or with extended L2 entries (see the next
>> +section), this is always 0.
>
> In your cover letter, you said you changed things to tolerate this bit
> being set even with extended L2 entries.  Does this sentence need a
> tweak?

Not a bad idea.


>> +The last 64 bits contain a subcluster allocation bitmap with this format:
>> +
>> +Subcluster Allocation Bitmap (for standard clusters):
>> +
>> +Bit  0 -  31:   Allocation status (one bit per subcluster)
>
> Why two spaces after '-'?

No reason, I guess I just didn't notice. I'll fix it.

Berto



Re: [PATCH 7/8] qemu-img: convert: don't use unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

06.05.2020 17:49, Eric Blake wrote:

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

qemu-img convert wants to distinguish ZERO which comes from short
backing files. unallocated_blocks_are_zero field of bdi is unrelated:
space after EOF is always considered to be zero anyway. So, just make
post_backing_zero true in case of short backing file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qemu-img.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)


This patch should come first in the series.  It would have saved me a lot of 
time in reviewing 1/8.


I'm sorry :(





diff --git a/qemu-img.c b/qemu-img.c
index 6a4327aaba..4fe751878b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1632,7 +1632,6 @@ typedef struct ImgConvertState {
  BlockBackend *target;
  bool has_zero_init;
  bool compressed;
-    bool unallocated_blocks_are_zero;
  bool target_is_new;
  bool target_has_backing;
  int64_t target_backing_sectors; /* negative if unknown */
@@ -1677,7 +1676,7 @@ static int convert_iteration_sectors(ImgConvertState *s, 
int64_t sector_num)
  if (s->target_backing_sectors >= 0) {
  if (sector_num >= s->target_backing_sectors) {
-    post_backing_zero = s->unallocated_blocks_are_zero;
+    post_backing_zero = true;
  } else if (sector_num + n > s->target_backing_sectors) {
  /* Split requests around target_backing_sectors (because
   * starting from there, zeros are handled differently) */
@@ -2580,7 +2579,6 @@ static int img_convert(int argc, char **argv)
  } else {
  s.compressed = s.compressed || bdi.needs_compressed_writes;
  s.cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
-    s.unallocated_blocks_are_zero = bdi.unallocated_blocks_are_zero;
  }


My question in 1/8 about whether we have iotest coverage of this optimization 
remains, but I concur that the block layer takes care of reading zero when 
encountering unallocated blocks beyond EOF of a short backing file, and 
therefore performing this optimization always rather than just when the driver 
claims that unallocated blocks read as zero should be safe.

Reviewed-by: Eric Blake 



Thanks!

--
Best regards,
Vladimir



Re: [PATCH 1/8] block/vdi: return ZERO block-status when appropriate

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

06.05.2020 16:57, Eric Blake wrote:

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

In case of !VDI_IS_ALLOCATED[], we do zero out the corresponding chunk
of qiov. So, this should be reported as ZERO.


This part makes sense outright, since vdi does not allow backing files, so all 
reads of a VDI disk result in data rather than deferral to another layer, and 
we indeed read zero in this case.  However, it _is_ a behavior change: 
previously, 'qemu-io -c map' on a vdi image will show unallocated portions, but 
with this change, the entire image now shows as allocated.  You need to call 
out this fact in the commit message, documenting that it is intentional (it 
matches what we do for posix files: since neither files nor vdi support backing 
images, we guarantee that all read attempts will be satisfied by this layer 
rather than deferring to a backing layer, and thus can treat all data as 
allocated in this layer, regardless of whether it is sparse).

Do any existing iotests need an update to reflect change in output?  If not, 
should we have an iotest covering it?


all passed for me on tmpfs (with some skips)..





After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense. Drop it.


This is a harder claim. To prove it, we need to inspect all callers that use 
unallocated_blocks_are_zero, to see their intent.  Fortunately, the list is 
small - a mere 2 clients!

block/io.c:bdrv_co_block_status(): if .bdrv_co_block_status sets either _DATA 
or _ZERO, block status adds _ALLOCATED; but if the driver did not set either, 
we use bdrv_unallocated_blocks_are_zero() in order to set _ZERO but not 
_ALLOCATED.  This is the code that explains the behavior change mentioned above 
(now that vdi is reporting _ZERO instead of 0, block_status is appending 
_ALLOCATED instead of querying bdrv_unallocated_blocks_are_zero).  But you are 
correct that it does not change where _ZERO is reported.

qemu-img.c:img_convert(): calls bdrv_get_info() up front and caches what it 
learned from unallocated_blocks_are_zero about the target; later on, in 
convert_iteration_sectors(), it uses this information to optimize the case 
where the source reads as zero, but the target has a backing file and the range 
being written lies beyond the end of the backing file. That is, given the 
following scenario:

qemu-img convert input -B backing output
input:   ABCD-0E
backing: ACBD

the optimization lets us produce:
output:  -BC---E

instead of the larger:
output:  -BC--0E

Do we have any iotests that cover this particular scenario, to prove that our 
optimization does indeed result in a smaller image file without explicit zeros 
written in the tail?  Or put another way, if we were to disable the 
post_backing_zero optimization in convert_iteration_sectors(), would any 
iotests fail?  If not, we should really enhance our testsuite.

With that said, we could just as easily achieve the optimization of the 
conversion to the tail of a destination with short backing file by checking 
block_status to see whether the tail already reads as all zeroes, rather than 
relying on unallocated_blocks_are_zero.  Even if querying block_status is a 
slight pessimization in time, it would avoid regressing in the more important 
factor of artificially bloating the destination.  I would _really_ love to see 
a patch to qemu-img.c reworking the logic to avoid the use of 
unallocated_blocks_are_zero first, prior to removing the setting of that field 
in the various drivers.  Once bdrv_co_block_status() remains as the only client 
of the field, then I could accept this patch with a better commit message about 
the intentional change in block_status _ALLOCATION behavior.


Actually unallocated_blocks_are_zero doesn't influence on this thing, you should not 
check unallocated_blocks_are_zero to understand how to read unallocated area above short 
backing (after backing EOF). It's always reads as zeroes. So, patch 7 just use 
"true" instead. But yes, I'd better move patch 7 to be the first one.





Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/vdi.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 0c7835ae70..83471528d2 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -334,7 +334,6 @@ static int vdi_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
  logout("\n");
  bdi->cluster_size = s->block_size;
  bdi->vm_state_offset = 0;
-    bdi->unallocated_blocks_are_zero = true;
  return 0;
  }
@@ -536,7 +535,7 @@ static int coroutine_fn 
vdi_co_block_status(BlockDriverState *bs,
  *pnum = MIN(s->block_size - index_in_block, bytes);
  result = VDI_IS_ALLOCATED(bmap_entry);
  if (!result) {
-    return 0;
+    return BDRV_BLOCK_ZERO;
  }
  *map = s->header.offset_data + (uint64_t)bmap_entry * s->block_size +






--
Best regards,
Vladimir



Re: [PATCH 7/8] qemu-img: convert: don't use unallocated_blocks_are_zero

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

qemu-img convert wants to distinguish ZERO which comes from short
backing files. unallocated_blocks_are_zero field of bdi is unrelated:
space after EOF is always considered to be zero anyway. So, just make
post_backing_zero true in case of short backing file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  qemu-img.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)


This patch should come first in the series.  It would have saved me a 
lot of time in reviewing 1/8.




diff --git a/qemu-img.c b/qemu-img.c
index 6a4327aaba..4fe751878b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1632,7 +1632,6 @@ typedef struct ImgConvertState {
  BlockBackend *target;
  bool has_zero_init;
  bool compressed;
-bool unallocated_blocks_are_zero;
  bool target_is_new;
  bool target_has_backing;
  int64_t target_backing_sectors; /* negative if unknown */
@@ -1677,7 +1676,7 @@ static int convert_iteration_sectors(ImgConvertState *s, 
int64_t sector_num)
  
  if (s->target_backing_sectors >= 0) {

  if (sector_num >= s->target_backing_sectors) {
-post_backing_zero = s->unallocated_blocks_are_zero;
+post_backing_zero = true;
  } else if (sector_num + n > s->target_backing_sectors) {
  /* Split requests around target_backing_sectors (because
   * starting from there, zeros are handled differently) */
@@ -2580,7 +2579,6 @@ static int img_convert(int argc, char **argv)
  } else {
  s.compressed = s.compressed || bdi.needs_compressed_writes;
  s.cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
-s.unallocated_blocks_are_zero = bdi.unallocated_blocks_are_zero;
  }


My question in 1/8 about whether we have iotest coverage of this 
optimization remains, but I concur that the block layer takes care of 
reading zero when encountering unallocated blocks beyond EOF of a short 
backing file, and therefore performing this optimization always rather 
than just when the driver claims that unallocated blocks read as zero 
should be safe.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 1/8] block/vdi: return ZERO block-status when appropriate

2020-05-06 Thread Eric Blake

On 5/6/20 4:25 AM, Vladimir Sementsov-Ogievskiy wrote:

In case of !VDI_IS_ALLOCATED[], we do zero out the corresponding chunk
of qiov. So, this should be reported as ZERO.


This part makes sense outright, since vdi does not allow backing files, 
so all reads of a VDI disk result in data rather than deferral to 
another layer, and we indeed read zero in this case.  However, it _is_ a 
behavior change: previously, 'qemu-io -c map' on a vdi image will show 
unallocated portions, but with this change, the entire image now shows 
as allocated.  You need to call out this fact in the commit message, 
documenting that it is intentional (it matches what we do for posix 
files: since neither files nor vdi support backing images, we guarantee 
that all read attempts will be satisfied by this layer rather than 
deferring to a backing layer, and thus can treat all data as allocated 
in this layer, regardless of whether it is sparse).


Do any existing iotests need an update to reflect change in output?  If 
not, should we have an iotest covering it?




After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense. Drop it.


This is a harder claim. To prove it, we need to inspect all callers that 
use unallocated_blocks_are_zero, to see their intent.  Fortunately, the 
list is small - a mere 2 clients!


block/io.c:bdrv_co_block_status(): if .bdrv_co_block_status sets either 
_DATA or _ZERO, block status adds _ALLOCATED; but if the driver did not 
set either, we use bdrv_unallocated_blocks_are_zero() in order to set 
_ZERO but not _ALLOCATED.  This is the code that explains the behavior 
change mentioned above (now that vdi is reporting _ZERO instead of 0, 
block_status is appending _ALLOCATED instead of querying 
bdrv_unallocated_blocks_are_zero).  But you are correct that it does not 
change where _ZERO is reported.


qemu-img.c:img_convert(): calls bdrv_get_info() up front and caches what 
it learned from unallocated_blocks_are_zero about the target; later on, 
in convert_iteration_sectors(), it uses this information to optimize the 
case where the source reads as zero, but the target has a backing file 
and the range being written lies beyond the end of the backing file. 
That is, given the following scenario:


qemu-img convert input -B backing output
input:   ABCD-0E
backing: ACBD

the optimization lets us produce:
output:  -BC---E

instead of the larger:
output:  -BC--0E

Do we have any iotests that cover this particular scenario, to prove 
that our optimization does indeed result in a smaller image file without 
explicit zeros written in the tail?  Or put another way, if we were to 
disable the post_backing_zero optimization in 
convert_iteration_sectors(), would any iotests fail?  If not, we should 
really enhance our testsuite.


With that said, we could just as easily achieve the optimization of the 
conversion to the tail of a destination with short backing file by 
checking block_status to see whether the tail already reads as all 
zeroes, rather than relying on unallocated_blocks_are_zero.  Even if 
querying block_status is a slight pessimization in time, it would avoid 
regressing in the more important factor of artificially bloating the 
destination.  I would _really_ love to see a patch to qemu-img.c 
reworking the logic to avoid the use of unallocated_blocks_are_zero 
first, prior to removing the setting of that field in the various 
drivers.  Once bdrv_co_block_status() remains as the only client of the 
field, then I could accept this patch with a better commit message about 
the intentional change in block_status _ALLOCATION behavior.




Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/vdi.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 0c7835ae70..83471528d2 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -334,7 +334,6 @@ static int vdi_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
  logout("\n");
  bdi->cluster_size = s->block_size;
  bdi->vm_state_offset = 0;
-bdi->unallocated_blocks_are_zero = true;
  return 0;
  }
  
@@ -536,7 +535,7 @@ static int coroutine_fn vdi_co_block_status(BlockDriverState *bs,

  *pnum = MIN(s->block_size - index_in_block, bytes);
  result = VDI_IS_ALLOCATED(bmap_entry);
  if (!result) {
-return 0;
+return BDRV_BLOCK_ZERO;
  }
  
  *map = s->header.offset_data + (uint64_t)bmap_entry * s->block_size +




--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 18/33] block: Add bdrv_default_perms()

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> This callback can be used by BDSs that use child_of_bds with the
> appropriate BdrvChildRole for their children.
> 
> Also, make bdrv_format_default_perms() use it for child_of_bds children
> (just a temporary solution until we can drop bdrv_format_default_perms()
> altogether).
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 
> ---
>  block.c   | 46 ---
>  include/block/block_int.h | 11 ++
>  2 files changed, 49 insertions(+), 8 deletions(-)
> 
> diff --git a/block.c b/block.c
> index c0ba274743..3e5b0bc345 100644
> --- a/block.c
> +++ b/block.c
> @@ -2383,14 +2383,12 @@ static void 
> bdrv_default_perms_for_metadata(BlockDriverState *bs, BdrvChild *c,
>  *nshared = shared;
>  }
>  
> -/* TODO: Use */
> -static void __attribute__((unused))
> -bdrv_default_perms_for_data(BlockDriverState *bs, BdrvChild *c,
> -const BdrvChildClass *child_class,
> -BdrvChildRole role,
> -BlockReopenQueue *reopen_queue,
> -uint64_t perm, uint64_t shared,
> -uint64_t *nperm, uint64_t *nshared)
> +static void bdrv_default_perms_for_data(BlockDriverState *bs, BdrvChild *c,
> +const BdrvChildClass *child_class,
> +BdrvChildRole role,
> +BlockReopenQueue *reopen_queue,
> +uint64_t perm, uint64_t shared,
> +uint64_t *nperm, uint64_t *nshared)
>  {
>  assert(child_class == _of_bds && (role & BDRV_CHILD_DATA));
>  
> @@ -2425,6 +2423,13 @@ void bdrv_format_default_perms(BlockDriverState *bs, 
> BdrvChild *c,
> uint64_t *nperm, uint64_t *nshared)
>  {
>  bool backing = (child_class == _backing);
> +
> +if (child_class == _of_bds) {
> +bdrv_default_perms(bs, c, child_class, role, reopen_queue,
> +   perm, shared, nperm, nshared);
> +return;
> +}
> +
>  assert(child_class == _backing || child_class == _file);
>  
>  if (!backing) {
> @@ -2436,6 +2441,31 @@ void bdrv_format_default_perms(BlockDriverState *bs, 
> BdrvChild *c,
>  }
>  }
>  
> +void bdrv_default_perms(BlockDriverState *bs, BdrvChild *c,
> +const BdrvChildClass *child_class, BdrvChildRole 
> role,
> +BlockReopenQueue *reopen_queue,
> +uint64_t perm, uint64_t shared,
> +uint64_t *nperm, uint64_t *nshared)
> +{
> +assert(child_class == _of_bds);
> +
> +if (role & BDRV_CHILD_FILTERED) {
> +bdrv_filter_default_perms(bs, c, child_class, role, reopen_queue,
> +  perm, shared, nperm, nshared);
> +} else if (role & BDRV_CHILD_COW) {
> +bdrv_default_perms_for_backing(bs, c, child_class, role, 
> reopen_queue,
> +   perm, shared, nperm, nshared);
> +} else if (role & BDRV_CHILD_METADATA) {
> +bdrv_default_perms_for_metadata(bs, c, child_class, role, 
> reopen_queue,
> +perm, shared, nperm, nshared);
> +} else if (role & BDRV_CHILD_DATA) {
> +bdrv_default_perms_for_data(bs, c, child_class, role, reopen_queue,
> +perm, shared, nperm, nshared);
> +} else {
> +g_assert_not_reached();
> +}
> +}

Here the discussion from the start of the series becomes relevant: Which
flags can be combined with which other flags, and if multiple flags are
set, which one takes precedence?

First undocumented requirement: You must set at least one of FILTERED,
COW, METADATA and DATA.

Then, for FILTERED we decided to document that DATA shouldn't be set at
the same time. I guess neither should COW and METADATA. Apart for
documentation, worth an assertion?

COW seems to be exclusive in practice, too. I guess you could construct
a driver that somehow keeps its own (meta)data in its backing file,
maybe in the VM state area. It also sounds like a really bad idea. So
forbid it, document it and assert it, too?

METADATA and DATA can be set at the same time. As your previous patch
shows, the function for DATA is a laxer version of the one for METADATA,
requesting a subset of the METADATA permissions and sharing a superset.
So the order in the code makes sense.

But can we make sure that this condition will be true in the future?
Imagine we find a need for a new permission that is used for data files,
but not for metadata. I think at the very least, this deserves a
comment. But probably it means that both should stay a single function
that can check each flag for the exact permission bits it influences.

Kevin




Re: [PATCH v3 15/33] block: Pull out bdrv_default_perms_for_backing()

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> Right now, bdrv_format_default_perms() is used by format parents
> (generally). We want to switch to a model where most parents use a
> single BdrvChildClass, which then decides the permissions based on the
> child role. To do so, we have to split bdrv_format_default_perms() into
> separate functions for each such role.
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 

As you want to call this based on the child role, would
bdrv_default_perms_for_cow() be a more obvious name?

Kevin




Re: [PATCH v3 09/33] block: Add generic bdrv_inherited_options()

2020-05-06 Thread Kevin Wolf
Am 06.05.2020 um 12:37 hat Kevin Wolf geschrieben:
> Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> > After the series this patch belongs to, we want to have a common
> > BdrvChildClass that encompasses all of child_file, child_format, and
> > child_backing.  Such a single class needs a single .inherit_options()
> > implementation, and this patch introduces it.
> > 
> > The next patch will show how the existing implementations can fall back
> > to it just by passing appropriate BdrvChildRole and parent_is_format
> > values.
> > 
> > Signed-off-by: Max Reitz 
> > Reviewed-by: Eric Blake 
> > ---
> >  block.c | 84 +
> >  1 file changed, 84 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index c33f0e9b42..9179b9b604 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -998,6 +998,90 @@ static void bdrv_temp_snapshot_options(int 
> > *child_flags, QDict *child_options,
> >  *child_flags &= ~BDRV_O_NATIVE_AIO;
> >  }
> >  
> > +/*
> > + * Returns the options and flags that a generic child of a BDS should
> > + * get, based on the given options and flags for the parent BDS.
> > + */
> > +static void __attribute__((unused))
> > +bdrv_inherited_options(BdrvChildRole role, bool parent_is_format,
> > +   int *child_flags, QDict *child_options,
> > +   int parent_flags, QDict *parent_options)
> > +{
> > +int flags = parent_flags;
> > +
> > +/*
> > + * First, decide whether to set, clear, or leave BDRV_O_PROTOCOL.
> > + * Generally, the question to answer is: Should this child be
> > + * format-probed by default?
> > + */

Just for clarity: Do you know a good reason to ever leave it (i.e.
inherit it from the parent), except that that's what we have always been
doing for backing files? Though of course, only formats have backing
files, so the flag would never be set in practice in this case.

> > +/*
> > + * Pure and non-filtered data children of non-format nodes should
> > + * be probed by default (even when the node itself has BDRV_O_PROTOCOL
> > + * set).  This only affects a very limited set of drivers (namely
> > + * quorum and blkverify when this comment was written).
> > + * Force-clear BDRV_O_PROTOCOL then.
> > + */
> > +if (!parent_is_format &&
> > +(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
> > + BDRV_CHILD_FILTERED)) ==
> > +BDRV_CHILD_DATA)
> 
> You could avoid the odd indentation (I can't decide whether or not it
> should be one space more to align correctly) and probably also make the
> expression more readable if you split it into:
> 
> (role & BDRV_CHILD_DATA) &&
> !(role & (BDRV_CHILD_METADATA | BDRV_CHILD_FILTERED))
> 
> > +{
> > +flags &= ~BDRV_O_PROTOCOL;
> > +}
> > +
> > +/*
> > + * All children of format nodes (except for COW children) and all
> > + * metadata children in general should never be format-probed.
> > + * Force-set BDRV_O_PROTOCOL then.
> > + */
> > +if ((parent_is_format && !(role & BDRV_CHILD_COW)) ||
> > +(role & BDRV_CHILD_METADATA))
> > +{
> > +flags |= BDRV_O_PROTOCOL;
> > +}
> > +
> > +/*
> > + * If the cache mode isn't explicitly set, inherit direct and no-flush 
> > from
> > + * the parent.
> > + */
> > +qdict_copy_default(child_options, parent_options, 
> > BDRV_OPT_CACHE_DIRECT);
> > +qdict_copy_default(child_options, parent_options, 
> > BDRV_OPT_CACHE_NO_FLUSH);
> > +qdict_copy_default(child_options, parent_options, 
> > BDRV_OPT_FORCE_SHARE);
> > +
> > +if (role & BDRV_CHILD_COW) {
> > +/* backing files are always opened read-only */
> 
> Not "always", just by default.
> 
> > +qdict_set_default_str(child_options, BDRV_OPT_READ_ONLY, "on");
> > +qdict_set_default_str(child_options, BDRV_OPT_AUTO_READ_ONLY, 
> > "off");
> > +} else {
> > +/* Inherit the read-only option from the parent if it's not set */
> > +qdict_copy_default(child_options, parent_options, 
> > BDRV_OPT_READ_ONLY);
> > +qdict_copy_default(child_options, parent_options,
> > +   BDRV_OPT_AUTO_READ_ONLY);
> > +}
> > +
> > +if (parent_is_format && !(role & BDRV_CHILD_COW)) {
> > +/*
> > + * Our format drivers take care to send flushes and respect
> > + * unmap policy, so we can default to enable both on lower
> > + * layers regardless of the corresponding parent options.
> > + */
> > +qdict_set_default_str(child_options, BDRV_OPT_DISCARD, "unmap");
> > +}
> 
> Why the restriction to format here? Don't we break "unmap" propagation
> through filters with this?
> 
> It would probably also be a good question why we don't propagate it to
> the backing file, but this is preexisting.

Some patches later, I think the fix is an else branch that copies the
flag 

Re: [PULL 00/24] Block patches

2020-05-06 Thread Peter Maydell
On Tue, 5 May 2020 at 13:58, Max Reitz  wrote:
>
> The following changes since commit 5375af3cd7b8adcc10c18d8083b7be63976c9645:
>
>   Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging 
> (2020-05-04 15:51:09 +0100)
>
> are available in the Git repository at:
>
>   https://github.com/XanClic/qemu.git tags/pull-block-2020-05-05
>
> for you to fetch changes up to 4ce5dd3e9b5ee0fac18625860eb3727399ee965e:
>
>   block/block-copy: use aio-task-pool API (2020-05-05 14:03:28 +0200)
>
> 
> Block patches:
> - Asynchronous copying for block-copy (i.e., the backup job)
> - Allow resizing of qcow2 images when they have internal snapshots
> - iotests: Logging improvements for Python tests
> - iotest 153 fix, and block comment cleanups
>


Applied, thanks.

Please update the changelog at https://wiki.qemu.org/ChangeLog/5.1
for any user-visible changes.

-- PMM



Re: [PATCH 0/9] More truncate improvements

2020-05-06 Thread Eric Blake

On 4/28/20 3:28 PM, Eric Blake wrote:

Based-on: <20200424125448.63318-1-kw...@redhat.com>
[PATCH v7 00/10] block: Fix resize (extending) of short overlays

After reviewing Kevin's work, I questioned if we had a redundancy with
bdrv_has_zero_init_truncate.  It turns out we do, and this is the result.

Patch 1 has been previously posted [1] and reviewed, the rest is new.
I did not address Neils' comment that modern gluster also always
0-initializes [2], as I am not set up to verify it (my changes to the
other drivers are semantic no-ops, so I don't feel as bad about
posting them with less rigourous testing).

[1] https://lists.gnu.org/archive/html/qemu-devel/2020-01/msg08070.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2020-02/msg04266.html

Eric Blake (9):
   gluster: Drop useless has_zero_init callback
   file-win32: Support BDRV_REQ_ZERO_WRITE for truncate
   nfs: Support BDRV_REQ_ZERO_WRITE for truncate
   rbd: Support BDRV_REQ_ZERO_WRITE for truncate
   sheepdog: Support BDRV_REQ_ZERO_WRITE for truncate
   ssh: Support BDRV_REQ_ZERO_WRITE for truncate
   parallels: Rework truncation logic
   vhdx: Rework truncation logic
   block: Drop unused .bdrv_has_zero_init_truncate



Ping.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH v3 13/33] block: Add child_of_bds

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> Any current user of child_file, child_format, and child_backing can and
> should use this generic BdrvChildClass instead, as it can handle all of
> these cases.  However, to be able to do so, the users must pass the
> appropriate BdrvChildRole when the child is created/attached.  (The
> following commits will take care of that.)
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 

>  block.c   | 27 +++
>  include/block/block_int.h |  1 +
>  2 files changed, 28 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 0f24546863..1d33f58ff8 100644
> --- a/block.c
> +++ b/block.c
> @@ -1094,6 +1094,33 @@ static void bdrv_inherited_options(BdrvChildRole role, 
> bool parent_is_format,
>  *child_flags = flags;
>  }
>  
> +static int bdrv_backing_update_filename(BdrvChild *c, BlockDriverState *base,
> +const char *filename, Error **errp);

Actually, I should have saved the comment on the previous patch for this
one. The forward declaration could easily be avoided by moving
child_of_bds down to after bdrv_backing_update_filename().

Kevin




Re: [PATCH v3 12/33] block: Unify bdrv_child_cb_detach()

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> Make bdrv_child_cb_detach() call bdrv_backing_detach() for children with
> a COW role (and drop the reverse call from bdrv_backing_detach()), so it
> can be used for any child (with a proper role set).
> 
> Because so far no child has a proper role set, we need a temporary new
> callback for child_backing.detach that ensures bdrv_backing_detach() is
> called for all COW children that do not have their role set yet.
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 
> ---
>  block.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/block.c b/block.c
> index 3cf1293a7b..0f24546863 100644
> --- a/block.c
> +++ b/block.c
> @@ -943,6 +943,7 @@ static void bdrv_child_cb_drained_end(BdrvChild *child,
>  }
>  
>  static void bdrv_backing_attach(BdrvChild *c);
> +static void bdrv_backing_detach(BdrvChild *c);

This series leaves a few static forward declarations behind, and even
in the middle of the code rather than at the top.

Does anything stop us from adding bdrv_inherited_options() after all the
old functions instead? This will require a temporary forward
declaration, too, but it can go away at the end of the series when there
is only child_of_bds left.

Kevin




Re: [PATCH v4 03/13] acpi: rtc: use a single crs range

2020-05-06 Thread Igor Mammedov
On Wed, 6 May 2020 10:39:02 +0200
Gerd Hoffmann  wrote:

>   Hi,
> 
> > >  crs = aml_resource_template();
> > >  aml_append(crs, aml_io(AML_DECODE16, RTC_ISA_BASE, RTC_ISA_BASE,
> > > -   0x10, 0x02));
> > > +   0x10, 0x08));
> > >  aml_append(crs, aml_irq_no_flags(RTC_ISA_IRQ));
> > > -aml_append(crs, aml_io(AML_DECODE16, RTC_ISA_BASE + 2, RTC_ISA_BASE 
> > > + 2,
> > > -   0x02, 0x06));  
> > can we just drop the later range as unused? (I don't see where it's 
> > actually initialized)  
> 
> I'd rather follow what physical hardware is doing here
> for better compatibility ...

maybe add comment here why it doesn't match IO range that RTC actualy provides,
otherwise it's looks very confusing

> 
> take care,
>   Gerd
> 




Re: [PATCH v5 13/31] qcow2: Update get/set_l2_entry() and add get/set_l2_bitmap()

2020-05-06 Thread Alberto Garcia
On Tue 05 May 2020 10:04:32 PM CEST, Eric Blake wrote:
> What happens for an image whose size is not cluster-aligned?  Must the
> bits corresponding to subclusters not present in the final cluster
> always be zero, or are they instead ignored regardless of value?

Attempting to read or write beyond the end of the image returns an error
(see blk_check_byte_request()).

But let's say we have a 32k image (only 1/2 of the first cluster is
used) and we manipulate the bitmap to mark all subclusters as allocated.

In this case if you try 'read 0 32k' then count_contiguous_subclusters()
would indeed say that there are 64k of data available. However the
caller knows that we only want 32k and

if (bytes_available > bytes_needed) {
bytes_available = bytes_needed;
}

so in the end it doesn't really matter.

I think in practice it's the same as with traditional qcow2 images: once
we reach qcow2_get_host_offset() the code does not really know or care
about the size of the image or how much of the last cluster contains
actual data. It only cares about how much data the caller needs. The
limits have already been checked before.

But we could document it: "if the image size is not a multiple of the
cluster size then the bits corresponding to the subclusters beyond the
end of the image are ignored and should be set to zero", or something
like that.

Berto



Re: [PATCH v3 09/33] block: Add generic bdrv_inherited_options()

2020-05-06 Thread Kevin Wolf
Am 18.02.2020 um 13:42 hat Max Reitz geschrieben:
> After the series this patch belongs to, we want to have a common
> BdrvChildClass that encompasses all of child_file, child_format, and
> child_backing.  Such a single class needs a single .inherit_options()
> implementation, and this patch introduces it.
> 
> The next patch will show how the existing implementations can fall back
> to it just by passing appropriate BdrvChildRole and parent_is_format
> values.
> 
> Signed-off-by: Max Reitz 
> Reviewed-by: Eric Blake 
> ---
>  block.c | 84 +
>  1 file changed, 84 insertions(+)
> 
> diff --git a/block.c b/block.c
> index c33f0e9b42..9179b9b604 100644
> --- a/block.c
> +++ b/block.c
> @@ -998,6 +998,90 @@ static void bdrv_temp_snapshot_options(int *child_flags, 
> QDict *child_options,
>  *child_flags &= ~BDRV_O_NATIVE_AIO;
>  }
>  
> +/*
> + * Returns the options and flags that a generic child of a BDS should
> + * get, based on the given options and flags for the parent BDS.
> + */
> +static void __attribute__((unused))
> +bdrv_inherited_options(BdrvChildRole role, bool parent_is_format,
> +   int *child_flags, QDict *child_options,
> +   int parent_flags, QDict *parent_options)
> +{
> +int flags = parent_flags;
> +
> +/*
> + * First, decide whether to set, clear, or leave BDRV_O_PROTOCOL.
> + * Generally, the question to answer is: Should this child be
> + * format-probed by default?
> + */
> +
> +/*
> + * Pure and non-filtered data children of non-format nodes should
> + * be probed by default (even when the node itself has BDRV_O_PROTOCOL
> + * set).  This only affects a very limited set of drivers (namely
> + * quorum and blkverify when this comment was written).
> + * Force-clear BDRV_O_PROTOCOL then.
> + */
> +if (!parent_is_format &&
> +(role & (BDRV_CHILD_DATA | BDRV_CHILD_METADATA |
> + BDRV_CHILD_FILTERED)) ==
> +BDRV_CHILD_DATA)

You could avoid the odd indentation (I can't decide whether or not it
should be one space more to align correctly) and probably also make the
expression more readable if you split it into:

(role & BDRV_CHILD_DATA) &&
!(role & (BDRV_CHILD_METADATA | BDRV_CHILD_FILTERED))

> +{
> +flags &= ~BDRV_O_PROTOCOL;
> +}
> +
> +/*
> + * All children of format nodes (except for COW children) and all
> + * metadata children in general should never be format-probed.
> + * Force-set BDRV_O_PROTOCOL then.
> + */
> +if ((parent_is_format && !(role & BDRV_CHILD_COW)) ||
> +(role & BDRV_CHILD_METADATA))
> +{
> +flags |= BDRV_O_PROTOCOL;
> +}
> +
> +/*
> + * If the cache mode isn't explicitly set, inherit direct and no-flush 
> from
> + * the parent.
> + */
> +qdict_copy_default(child_options, parent_options, BDRV_OPT_CACHE_DIRECT);
> +qdict_copy_default(child_options, parent_options, 
> BDRV_OPT_CACHE_NO_FLUSH);
> +qdict_copy_default(child_options, parent_options, BDRV_OPT_FORCE_SHARE);
> +
> +if (role & BDRV_CHILD_COW) {
> +/* backing files are always opened read-only */

Not "always", just by default.

> +qdict_set_default_str(child_options, BDRV_OPT_READ_ONLY, "on");
> +qdict_set_default_str(child_options, BDRV_OPT_AUTO_READ_ONLY, "off");
> +} else {
> +/* Inherit the read-only option from the parent if it's not set */
> +qdict_copy_default(child_options, parent_options, 
> BDRV_OPT_READ_ONLY);
> +qdict_copy_default(child_options, parent_options,
> +   BDRV_OPT_AUTO_READ_ONLY);
> +}
> +
> +if (parent_is_format && !(role & BDRV_CHILD_COW)) {
> +/*
> + * Our format drivers take care to send flushes and respect
> + * unmap policy, so we can default to enable both on lower
> + * layers regardless of the corresponding parent options.
> + */
> +qdict_set_default_str(child_options, BDRV_OPT_DISCARD, "unmap");
> +}

Why the restriction to format here? Don't we break "unmap" propagation
through filters with this?

It would probably also be a good question why we don't propagate it to
the backing file, but this is preexisting.

> +/* Clear flags that only apply to the top layer */
> +flags &= ~(BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING | BDRV_O_COPY_ON_READ);
> +
> +if (role & BDRV_CHILD_METADATA) {
> +flags &= ~BDRV_O_NO_IO;
> +}

Hm... This is interesting, but I guess it makes sense. It just never was
a hard rule that a format driver must not access data even internally
with NO_IO, but I think it holds true.

> +if (role & BDRV_CHILD_COW) {
> +flags &= ~BDRV_O_TEMPORARY;
> +}

We could probably have a long discussion about whether this is right in
theory, but in practice BDRV_O_TEMPORARY only exists for snapshot=on,
for 

Re: [PATCH v5 11/31] qcow2: Add offset_into_subcluster() and size_to_subclusters()

2020-05-06 Thread Alberto Garcia
On Tue 05 May 2020 09:42:08 PM CEST, Eric Blake wrote:
> On 5/5/20 12:38 PM, Alberto Garcia wrote:
>> Like offset_into_cluster() and size_to_clusters(), but for
>> subclusters.
>> 
>> Signed-off-by: Alberto Garcia 
>> ---
>>   block/qcow2.h | 10 ++
>>   1 file changed, 10 insertions(+)
>> 
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index e68febb15b..8b1ed1cbcf 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -537,11 +537,21 @@ static inline int64_t 
>> offset_into_cluster(BDRVQcow2State *s, int64_t offset)
>>   return offset & (s->cluster_size - 1);
>>   }
>>   
>> +static inline int64_t offset_into_subcluster(BDRVQcow2State *s, int64_t 
>> offset)
>> +{
>> +return offset & (s->subcluster_size - 1);
>> +}
>> +
>>   static inline uint64_t size_to_clusters(BDRVQcow2State *s, uint64_t size)
>>   {
>>   return (size + (s->cluster_size - 1)) >> s->cluster_bits;
>>   }
>
> Pre-existing, but this could use DIV_ROUND_UP.

Yeah but it would be nicer to have a version of the macro optimized for
powers of two.

Berto



Re: [PATCH 1/2] qemu-img: refactor dump_map_entry JSON format output

2020-05-06 Thread Eyal Moscovici



On 29/04/2020, 17:58, "Eric Blake"  wrote:

On 3/22/20 4:11 AM, Eyal Moscovici wrote:
> Previously dump_map_entry identified whether we need to start a new JSON
> array based on whether start address == 0. In this refactor we remove
> this assumption as in following patches we will allow map to start from
> an arbitrary position.
> 
> Acked-by: Mark Kanda 
> Signed-off-by: Eyal Moscovici 
> ---
>   qemu-img.c | 12 
>   1 file changed, 8 insertions(+), 4 deletions(-)
> 

> @@ -2871,8 +2870,8 @@ static int dump_map_entry(OutputFormat 
output_format, MapEntry *e,
>   }
>   putchar('}');
>   
> -if (!next) {
> -printf("]\n");
> +if (next) {
> +printf(",\n");

As long as you're touching this, puts(",") is slightly more efficient 
than printf().  But what you have is not wrong.

Thanks, will fix.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org







Re: [PATCH 2/2] qemu-img: Add --start-offset and --max-length to map

2020-05-06 Thread Eyal Moscovici
Thanks for the review. I will send V2 based on QEMU version 5.0.

On 29/04/2020, 18:06, "Eric Blake"  wrote:

On 3/22/20 4:11 AM, Eyal Moscovici wrote:
> The mapping operation of large disks especially ones stored over a
> long chain of QCOW2 files can take a long time to finish.
> Additionally when mapping fails there was no way recover by
> restarting the mapping from the failed location.
> 
> The new options, --start-offset and --max-length allows the user to
> divide these type of map operations into shorter independent tasks.
> 
> Acked-by: Mark Kanda 
> Co-developed-by: Yoav Elnekave 
> Signed-off-by: Yoav Elnekave 
> Signed-off-by: Eyal Moscovici 
> ---
>   docs/tools/qemu-img.rst |  2 +-
>   qemu-img-cmds.hx|  4 ++--
>   qemu-img.c  | 30 +-
>   3 files changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
> index 0080f83a76..924e89f679 100644
> --- a/docs/tools/qemu-img.rst
> +++ b/docs/tools/qemu-img.rst
> @@ -519,7 +519,7 @@ Command description:
>   ``ImageInfoSpecific*`` QAPI object (e.g. ``ImageInfoSpecificQCow2``
>   for qcow2 images).
>   
> -.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] 
[--output=OFMT] [-U] FILENAME
> +.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] 
[--start-offset=offset] [--max-length=len] [--output=OFMT] [-U] FILENAME

Consistency with the rest of the line says this should be 
[--start-offset=OFFSET] [--max-length=LEN]

Will fix.

>   
> Dump the metadata of image *FILENAME* and its backing file chain.
> In particular, this commands dumps the allocation state of every 
sector
> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
> index c9c54de1df..35f832816f 100644
> --- a/qemu-img-cmds.hx
> +++ b/qemu-img-cmds.hx
> @@ -63,9 +63,9 @@ SRST
>   ERST
>   
>   DEF("map", img_map,
> -"map [--object objectdef] [--image-opts] [-f fmt] [--output=ofmt] 
[-U] filename")
> +"map [--object objectdef] [--image-opts] [-f fmt] 
[--start-offset=offset] [--max-length=len] [--output=ofmt] [-U] filename")

this one is fine,

>   SRST
> -.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] 
[--output=OFMT] [-U] FILENAME
> +.. option:: map [--object OBJECTDEF] [--image-opts] [-f FMT] 
[--start-offset=OFFSET] [--max-length=LEN] [--output=OFMT] [-U] FILENAME

this one has the same problem as the .rst.

> @@ -3005,6 +3009,26 @@ static int img_map(int argc, char **argv)
>   case OPTION_OUTPUT:
>   output = optarg;
>   break;
> +case 's':
> +start_offset = cvtnum(optarg);
> +if (start_offset < 0) {
> +error_report("Invalid start offset specified! You may 
use "
> + "k, M, G, T, P or E suffixes for ");
> +error_report("kilobytes, megabytes, gigabytes, 
terabytes, "
> + "petabytes and exabytes.");

Pre-existing elsewhere in the file, but this seems rather verbose - 
shouldn't we have cvtnum() (or another wrapper function) give this extra 
information about what is valid, rather than open-coding it at every 
client of cvtnum()?

You are probably right I will send a patch that adds the error message  about 
the units to cvtnum().

> +return 1;
> +}
> +break;
> +case 'l':
> +max_length = cvtnum(optarg);
> +if (max_length < 0) {
> +error_report("Invalid max length specified! You may use "
> + "k, M, G, T, P or E suffixes for ");
> +error_report("kilobytes, megabytes, gigabytes, 
terabytes, "
> + "petabytes and exabytes.");
> +return 1;
> +}
> +break;
>   case OPTION_OBJECT: {
>   QemuOpts *opts;
>   opts = qemu_opts_parse_noisily(_object_opts,
> @@ -3050,7 +3074,11 @@ static int img_map(int argc, char **argv)
>   printf("[");
>   }
>   
> +curr.start = start_offset;
>   length = blk_getlength(blk);
> +if (max_length != -1) {
> +length = MIN(start_offset + max_length, length);
> +}

Pre-existing, but where does this code check for length == -1?  But your 
MIN() doesn't make it any worse (if we fail to get length, we merely 
skip the loop).

I don't think there is a check. I will add a check in a different patch.

>   while (curr.start + curr.length < length) {
>   int64_t offset = curr.start + curr.length;

Re: [PATCH v2 1/5] char-socket: return -1 in case of disconnect during tcp_chr_write

2020-05-06 Thread Marc-André Lureau
On Thu, Apr 30, 2020 at 3:37 PM Dima Stepanov  wrote:
>
> During testing of the vhost-user-blk reconnect functionality the qemu
> SIGSEGV was triggered:
>  start qemu as:
>  x86_64-softmmu/qemu-system-x86_64 -m 1024M -M q35 \
>-object 
> memory-backend-file,id=ram-node0,size=1024M,mem-path=/dev/shm/qemu,share=on \
>-numa node,cpus=0,memdev=ram-node0 \
>-chardev socket,id=chardev0,path=./vhost.sock,noserver,reconnect=1 \
>-device vhost-user-blk-pci,chardev=chardev0,num-queues=4 --enable-kvm
>  start vhost-user-blk daemon:
>  ./vhost-user-blk -s ./vhost.sock -b test-img.raw
>
> If vhost-user-blk will be killed during the vhost initialization
> process, for instance after getting VHOST_SET_VRING_CALL command, then
> QEMU will fail with the following backtrace:
>
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> 0x559272bb in vhost_user_read (dev=0x7fffef2d53e0, msg=0x7fffd5b0)
> at ./hw/virtio/vhost-user.c:260
> 260 CharBackend *chr = u->user->chr;
>
>  #0  0x559272bb in vhost_user_read (dev=0x7fffef2d53e0, 
> msg=0x7fffd5b0)
> at ./hw/virtio/vhost-user.c:260
>  #1  0x5592acb8 in vhost_user_get_config (dev=0x7fffef2d53e0, 
> config=0x7fffef2d5394 "", config_len=60)
> at ./hw/virtio/vhost-user.c:1645
>  #2  0x55925525 in vhost_dev_get_config (hdev=0x7fffef2d53e0, 
> config=0x7fffef2d5394 "", config_len=60)
> at ./hw/virtio/vhost.c:1490
>  #3  0x558cc46b in vhost_user_blk_device_realize (dev=0x7fffef2d51a0, 
> errp=0x7fffd8f0)
> at ./hw/block/vhost-user-blk.c:429
>  #4  0x55920090 in virtio_device_realize (dev=0x7fffef2d51a0, 
> errp=0x7fffd948)
> at ./hw/virtio/virtio.c:3615
>  #5  0x55a9779c in device_set_realized (obj=0x7fffef2d51a0, 
> value=true, errp=0x7fffdb88)
> at ./hw/core/qdev.c:891
>  ...
>
> The problem is that vhost_user_write doesn't get an error after
> disconnect and try to call vhost_user_read(). The tcp_chr_write()
> routine should return -1 in case of disconnect. Indicate the EIO error
> if this routine is called in the disconnected state.
>
> Signed-off-by: Dima Stepanov 


Reviewed-by: Marc-André Lureau 

> ---
>  chardev/char-socket.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index 185fe38..c128cca 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -175,14 +175,16 @@ static int tcp_chr_write(Chardev *chr, const uint8_t 
> *buf, int len)
>  if (ret < 0 && errno != EAGAIN) {
>  if (tcp_chr_read_poll(chr) <= 0) {
>  tcp_chr_disconnect_locked(chr);
> -return len;
> +/* Return an error since we made a disconnect. */
> +return ret;
>  } /* else let the read handler finish it properly */
>  }
>
>  return ret;
>  } else {
> -/* XXX: indicate an error ? */
> -return len;
> +/* Indicate an error. */
> +errno = EIO;
> +return -1;
>  }
>  }
>
> --
> 2.7.4
>




[PATCH 5/8] block/file-posix: drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
raw_co_block_status() in block/file-posix.c never returns 0, so
unallocated_blocks_are_zero is useless. Drop it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/file-posix.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 05e094be29..5c01735108 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2878,9 +2878,6 @@ static int coroutine_fn raw_co_pwrite_zeroes(
 
 static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
-BDRVRawState *s = bs->opaque;
-
-bdi->unallocated_blocks_are_zero = s->discard_zeroes;
 return 0;
 }
 
-- 
2.21.0




[PATCH 7/8] qemu-img: convert: don't use unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
qemu-img convert wants to distinguish ZERO which comes from short
backing files. unallocated_blocks_are_zero field of bdi is unrelated:
space after EOF is always considered to be zero anyway. So, just make
post_backing_zero true in case of short backing file.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qemu-img.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 6a4327aaba..4fe751878b 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1632,7 +1632,6 @@ typedef struct ImgConvertState {
 BlockBackend *target;
 bool has_zero_init;
 bool compressed;
-bool unallocated_blocks_are_zero;
 bool target_is_new;
 bool target_has_backing;
 int64_t target_backing_sectors; /* negative if unknown */
@@ -1677,7 +1676,7 @@ static int convert_iteration_sectors(ImgConvertState *s, 
int64_t sector_num)
 
 if (s->target_backing_sectors >= 0) {
 if (sector_num >= s->target_backing_sectors) {
-post_backing_zero = s->unallocated_blocks_are_zero;
+post_backing_zero = true;
 } else if (sector_num + n > s->target_backing_sectors) {
 /* Split requests around target_backing_sectors (because
  * starting from there, zeros are handled differently) */
@@ -2580,7 +2579,6 @@ static int img_convert(int argc, char **argv)
 } else {
 s.compressed = s.compressed || bdi.needs_compressed_writes;
 s.cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
-s.unallocated_blocks_are_zero = bdi.unallocated_blocks_are_zero;
 }
 
 ret = convert_do_copy();
-- 
2.21.0




[PATCH 3/8] block/crypto: drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
It's false by default, no needs to set it. We are going to drop this
variable at all, so drop it now here, it doesn't hurt.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/crypto.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/crypto.c b/block/crypto.c
index e02f343590..7685e61844 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -694,7 +694,6 @@ static int block_crypto_get_info_luks(BlockDriverState *bs,
 return ret;
 }
 
-bdi->unallocated_blocks_are_zero = false;
 bdi->cluster_size = subbdi.cluster_size;
 
 return 0;
-- 
2.21.0




[PATCH 8/8] block: drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
Currently this field only set by qed and qcow2. But in fact, all
backing-supporting formats (parallels, qcow, qcow2, qed, vmdk) share
this semantics: on unallocated blocks, if there is no backing file they
just memset the buffer with zeroes.

So, document this behavior for .supports_backing and drop
.unallocated_blocks_are_zero

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/block.h |  6 --
 include/block/block_int.h | 13 -
 block.c   | 15 ---
 block/io.c|  8 
 block/qcow2.c |  1 -
 block/qed.c   |  1 -
 6 files changed, 16 insertions(+), 28 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 8b62429aa4..db1cb503ec 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -21,11 +21,6 @@ typedef struct BlockDriverInfo {
 /* offset at which the VM state can be saved (0 if not possible) */
 int64_t vm_state_offset;
 bool is_dirty;
-/*
- * True if unallocated blocks read back as zeroes. This is equivalent
- * to the LBPRZ flag in the SCSI logical block provisioning page.
- */
-bool unallocated_blocks_are_zero;
 /*
  * True if this block driver only supports compressed writes
  */
@@ -431,7 +426,6 @@ int bdrv_co_pdiscard(BdrvChild *child, int64_t offset, 
int64_t bytes);
 int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_has_zero_init_truncate(BlockDriverState *bs);
-bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs);
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs);
 int bdrv_block_status(BlockDriverState *bs, int64_t offset,
   int64_t bytes, int64_t *pnum, int64_t *map,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 92335f33c7..c156b22c6b 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -115,7 +115,18 @@ struct BlockDriver {
  */
 bool bdrv_needs_filename;
 
-/* Set if a driver can support backing files */
+/*
+ * Set if a driver can support backing files. This also implies the
+ * following semantics:
+ *
+ *  - Return status 0 of .bdrv_co_block_status means that corresponding
+ *blocks are not allocated in this layer of backing-chain
+ *  - For such (unallocated) blocks, read will:
+ *- read from backing file if there is one and big enough
+ *- fill buffer with zeroes if there is no backing file
+ *- space after EOF of the backing file considered as zero
+ *  (corresponding part of read-buffer must be zeroed by driver)
+ */
 bool supports_backing;
 
 /* For handling image reopen for split or non-split files */
diff --git a/block.c b/block.c
index cf5c19b1db..0283fdecbc 100644
--- a/block.c
+++ b/block.c
@@ -5305,21 +5305,6 @@ int bdrv_has_zero_init_truncate(BlockDriverState *bs)
 return 0;
 }
 
-bool bdrv_unallocated_blocks_are_zero(BlockDriverState *bs)
-{
-BlockDriverInfo bdi;
-
-if (bs->backing) {
-return false;
-}
-
-if (bdrv_get_info(bs, ) == 0) {
-return bdi.unallocated_blocks_are_zero;
-}
-
-return false;
-}
-
 bool bdrv_can_write_zeroes_with_unmap(BlockDriverState *bs)
 {
 if (!(bs->open_flags & BDRV_O_UNMAP)) {
diff --git a/block/io.c b/block/io.c
index a4f9714230..484adec5a1 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2385,16 +2385,16 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,
 
 if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
 ret |= BDRV_BLOCK_ALLOCATED;
-} else if (want_zero) {
-if (bdrv_unallocated_blocks_are_zero(bs)) {
-ret |= BDRV_BLOCK_ZERO;
-} else if (bs->backing) {
+} else if (want_zero && bs->drv->supports_backing) {
+if (bs->backing) {
 BlockDriverState *bs2 = bs->backing->bs;
 int64_t size2 = bdrv_getlength(bs2);
 
 if (size2 >= 0 && offset >= size2) {
 ret |= BDRV_BLOCK_ZERO;
 }
+} else {
+ret |= BDRV_BLOCK_ZERO;
 }
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 2ba0b17c39..dc3c0aac2b 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4858,7 +4858,6 @@ err:
 static int qcow2_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVQcow2State *s = bs->opaque;
-bdi->unallocated_blocks_are_zero = true;
 bdi->cluster_size = s->cluster_size;
 bdi->vm_state_offset = qcow2_vm_state_offset(s);
 return 0;
diff --git a/block/qed.c b/block/qed.c
index b0fdb8f565..fb7833dc8b 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1514,7 +1514,6 @@ static int bdrv_qed_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 memset(bdi, 0, sizeof(*bdi));
 bdi->cluster_size = s->header.cluster_size;
 bdi->is_dirty = s->header.features & QED_F_NEED_CHECK;
-bdi->unallocated_blocks_are_zero = true;
 return 0;
 }
 

[PATCH 6/8] block/vhdx: drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
vhdx doesn't have .bdrv_co_block_status handler, so DATA|ALLOCATED is
always assumed for it. unallocated_blocks_are_zero is useless, drop it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/vhdx.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/vhdx.c b/block/vhdx.c
index aedd782604..45963a3166 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1164,9 +1164,6 @@ static int vhdx_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 
 bdi->cluster_size = s->block_size;
 
-bdi->unallocated_blocks_are_zero =
-(s->params.data_bits & VHDX_PARAMS_HAS_PARENT) == 0;
-
 return 0;
 }
 
-- 
2.21.0




[PATCH 2/8] block/vpc: return ZERO block-status when appropriate

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
In case when get_image_offset() returns -1, we do zero out the
corresponding chunk of qiov. So, this should be reported as ZERO.

After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense. Drop it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/vpc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index 2d1eade146..555f9d8587 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -606,7 +606,6 @@ static int vpc_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 bdi->cluster_size = s->block_size;
 }
 
-bdi->unallocated_blocks_are_zero = true;
 return 0;
 }
 
@@ -745,7 +744,7 @@ static int coroutine_fn 
vpc_co_block_status(BlockDriverState *bs,
 image_offset = get_image_offset(bs, offset, false, NULL);
 allocated = (image_offset != -1);
 *pnum = 0;
-ret = 0;
+ret = BDRV_BLOCK_ZERO;
 
 do {
 /* All sectors in a block are contiguous (without using the bitmap) */
-- 
2.21.0




[PATCH 4/8] block/iscsi: drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
We set bdi->unallocated_blocks_are_zero = iscsilun->lbprz, but
iscsi_co_block_status doesn't return 0 in case of iscsilun->lbprz, it
returns ZERO when appropriate. So actually unallocated_blocks_are_zero
is useless. Drop it now.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/iscsi.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index a8b76979d8..767e3e75fd 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -2163,7 +2163,6 @@ static int coroutine_fn 
iscsi_co_truncate(BlockDriverState *bs, int64_t offset,
 static int iscsi_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 IscsiLun *iscsilun = bs->opaque;
-bdi->unallocated_blocks_are_zero = iscsilun->lbprz;
 bdi->cluster_size = iscsilun->cluster_size;
 return 0;
 }
-- 
2.21.0




[PATCH 1/8] block/vdi: return ZERO block-status when appropriate

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
In case of !VDI_IS_ALLOCATED[], we do zero out the corresponding chunk
of qiov. So, this should be reported as ZERO.

After block-status update, it never reports 0, so setting
unallocated_blocks_are_zero doesn't make sense. Drop it.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/vdi.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 0c7835ae70..83471528d2 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -334,7 +334,6 @@ static int vdi_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 logout("\n");
 bdi->cluster_size = s->block_size;
 bdi->vm_state_offset = 0;
-bdi->unallocated_blocks_are_zero = true;
 return 0;
 }
 
@@ -536,7 +535,7 @@ static int coroutine_fn 
vdi_co_block_status(BlockDriverState *bs,
 *pnum = MIN(s->block_size - index_in_block, bytes);
 result = VDI_IS_ALLOCATED(bmap_entry);
 if (!result) {
-return 0;
+return BDRV_BLOCK_ZERO;
 }
 
 *map = s->header.offset_data + (uint64_t)bmap_entry * s->block_size +
-- 
2.21.0




[PATCH 0/8] drop unallocated_blocks_are_zero

2020-05-06 Thread Vladimir Sementsov-Ogievskiy
Hi all!

This is first step to block-status refactoring, and solves most simple
problem mentioned in my investigation of block-status described in
the thread "backing chain & block status & filters":
  https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg04706.html


unallocated_blocks_are_zero doesn't simplify all the logic about
block-status, and happily it's not needed, as shown by the following
patches. So, let's get rid of it.

Vladimir Sementsov-Ogievskiy (8):
  block/vdi: return ZERO block-status when appropriate
  block/vpc: return ZERO block-status when appropriate
  block/crypto: drop unallocated_blocks_are_zero
  block/iscsi: drop unallocated_blocks_are_zero
  block/file-posix: drop unallocated_blocks_are_zero
  block/vhdx: drop unallocated_blocks_are_zero
  qemu-img: convert: don't use unallocated_blocks_are_zero
  block: drop unallocated_blocks_are_zero

 include/block/block.h |  6 --
 include/block/block_int.h | 13 -
 block.c   | 15 ---
 block/crypto.c|  1 -
 block/file-posix.c|  3 ---
 block/io.c|  8 
 block/iscsi.c |  1 -
 block/qcow2.c |  1 -
 block/qed.c   |  1 -
 block/vdi.c   |  3 +--
 block/vhdx.c  |  3 ---
 block/vpc.c   |  3 +--
 qemu-img.c|  4 +---
 13 files changed, 19 insertions(+), 43 deletions(-)

-- 
2.21.0




Re: [PATCH v2 1/5] char-socket: return -1 in case of disconnect during tcp_chr_write

2020-05-06 Thread Li Feng
Thanks,

Feng Li

Dima Stepanov  于2020年4月30日周四 下午9:36写道:
>
> During testing of the vhost-user-blk reconnect functionality the qemu
> SIGSEGV was triggered:
>  start qemu as:
>  x86_64-softmmu/qemu-system-x86_64 -m 1024M -M q35 \
>-object 
> memory-backend-file,id=ram-node0,size=1024M,mem-path=/dev/shm/qemu,share=on \
>-numa node,cpus=0,memdev=ram-node0 \
>-chardev socket,id=chardev0,path=./vhost.sock,noserver,reconnect=1 \
>-device vhost-user-blk-pci,chardev=chardev0,num-queues=4 --enable-kvm
>  start vhost-user-blk daemon:
>  ./vhost-user-blk -s ./vhost.sock -b test-img.raw
>
> If vhost-user-blk will be killed during the vhost initialization
> process, for instance after getting VHOST_SET_VRING_CALL command, then
> QEMU will fail with the following backtrace:
>
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> 0x559272bb in vhost_user_read (dev=0x7fffef2d53e0, msg=0x7fffd5b0)
> at ./hw/virtio/vhost-user.c:260
> 260 CharBackend *chr = u->user->chr;
>
>  #0  0x559272bb in vhost_user_read (dev=0x7fffef2d53e0, 
> msg=0x7fffd5b0)
> at ./hw/virtio/vhost-user.c:260
>  #1  0x5592acb8 in vhost_user_get_config (dev=0x7fffef2d53e0, 
> config=0x7fffef2d5394 "", config_len=60)
> at ./hw/virtio/vhost-user.c:1645
>  #2  0x55925525 in vhost_dev_get_config (hdev=0x7fffef2d53e0, 
> config=0x7fffef2d5394 "", config_len=60)
> at ./hw/virtio/vhost.c:1490
>  #3  0x558cc46b in vhost_user_blk_device_realize (dev=0x7fffef2d51a0, 
> errp=0x7fffd8f0)
> at ./hw/block/vhost-user-blk.c:429
>  #4  0x55920090 in virtio_device_realize (dev=0x7fffef2d51a0, 
> errp=0x7fffd948)
> at ./hw/virtio/virtio.c:3615
>  #5  0x55a9779c in device_set_realized (obj=0x7fffef2d51a0, 
> value=true, errp=0x7fffdb88)
> at ./hw/core/qdev.c:891
>  ...
>
> The problem is that vhost_user_write doesn't get an error after
> disconnect and try to call vhost_user_read(). The tcp_chr_write()
> routine should return -1 in case of disconnect. Indicate the EIO error
> if this routine is called in the disconnected state.
>
> Signed-off-by: Dima Stepanov 
> ---
>  chardev/char-socket.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index 185fe38..c128cca 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -175,14 +175,16 @@ static int tcp_chr_write(Chardev *chr, const uint8_t 
> *buf, int len)
>  if (ret < 0 && errno != EAGAIN) {
>  if (tcp_chr_read_poll(chr) <= 0) {
>  tcp_chr_disconnect_locked(chr);
> -return len;
> +/* Return an error since we made a disconnect. */
> +return ret;
The `return` statement could be deleted.
The outside has a return statement.

>  } /* else let the read handler finish it properly */
>  }
>
>  return ret;
>  } else {
> -/* XXX: indicate an error ? */
> -return len;
> +/* Indicate an error. */
> +errno = EIO;
> +return -1;
>  }
>  }
>
> --
> 2.7.4
>



Re: [PATCH v2 2/5] vhost: introduce wrappers to set guest notifiers for virtio device

2020-05-06 Thread Dima Stepanov
On Sun, May 03, 2020 at 08:36:45PM -0400, Raphael Norwitz wrote:
> I’m happy from the vhost, vhost-user-blk and vhost-user-scsi side. For
> other device types it looks pretty straightforward, but their maintainers
> should probably confirm.
> 
> Since you plan to change the behavior of these helpers in subsequent
> patches, maybe consider sending the other device types separately
> after the rest of the series has been merged? That way the changes to
> individual devices will be much easier to review.

Thanks for comments.
Agree, will make a more straightforward fix only for vhost-user-blk.
After it we can figure out how to propogate this change to other
devices.

> 
> On Thu, Apr 30, 2020 at 9:48 AM Dima Stepanov  wrote:
> >
> > Introduce new wrappers to set/reset guest notifiers for the virtio
> > device in the vhost device module:
> >   vhost_dev_assign_guest_notifiers
> > ->set_guest_notifiers(..., ..., true);
> >   vhost_dev_drop_guest_notifiers
> > ->set_guest_notifiers(..., ..., false);
> > This is a preliminary step to refactor code, so the set_guest_notifiers
> > methods could be called based on the vhost device state.
> > Update all vhost used devices to use these wrappers instead of direct
> > method call.
> >
> > Signed-off-by: Dima Stepanov 
> > ---
> >  backends/cryptodev-vhost.c  | 26 +++---
> >  backends/vhost-user.c   | 16 +---
> >  hw/block/vhost-user-blk.c   | 15 +--
> >  hw/net/vhost_net.c  | 30 +-
> >  hw/scsi/vhost-scsi-common.c | 15 +--
> >  hw/virtio/vhost-user-fs.c   | 17 +++--
> >  hw/virtio/vhost-vsock.c | 18 --
> >  hw/virtio/vhost.c   | 38 ++
> >  hw/virtio/virtio.c  | 13 +
> >  include/hw/virtio/vhost.h   |  4 
> >  include/hw/virtio/virtio.h  |  1 +
> >  11 files changed, 118 insertions(+), 75 deletions(-)
> >



Re: [PATCH v2 3/5] vhost-user-blk: add mechanism to track the guest notifiers init state

2020-05-06 Thread Dima Stepanov
On Sun, May 03, 2020 at 09:06:38PM -0400, Raphael Norwitz wrote:
> Apologies for mixing up patches last time. This looks good from a
> vhost-user-blk perspective, but I worry that some of these changes
> could impact other vhost device types.
> 
> I agree with adding notifiers_set to struct vhost_dev, and setting it in
> vhost_dev_enable/disable notifiers, but is there any reason notifiers_set
> can’t be checked inside vhost-user-blk?
Thanks for your review. I also have some concerns about changing current
API, but my idea was that these issues will be triggered for all
vhost-user/reconnect devices. But maybe you are right and first we
should fix vhost-user-blk issues.
I'll try to modify patch 2 and 3 in my patchset, so new notifiers_set
field will be added, but no API change will be made. Will see how it
looks.



Re: [PATCH v4 07/13] acpi: move aml builder code for parallel device

2020-05-06 Thread Gerd Hoffmann
> > +static void parallel_isa_build_aml(ISADevice *isadev, Aml *scope)
> > +{
> > +ISAParallelState *isa = ISA_PARALLEL(isadev);
> > +int i, uid = 0;
> > +Aml *dev;
> > +Aml *crs;
> > +
> > +for (i = 0; i < ARRAY_SIZE(isa_parallel_io); i++) {
> > +if (isa->iobase == isa_parallel_io[i]) {
> > +uid = i + 1;
> 
> I'm not sure about this check, as we can create a ISA device setting
> manually index & iobase. What about using simply "uid = isa->index + 1"
> instead?

Looking at the code I see isa->index is assigned unconditionally.  I
misremembered that detail.  So, yes, simply using isa->index should work
fine even with '-device isa-serial,iobase=".  I'll fix it for
both serial and parallel.

cheers,
  Gerd




Re: [PATCH v4 03/13] acpi: rtc: use a single crs range

2020-05-06 Thread Gerd Hoffmann
  Hi,

> >  crs = aml_resource_template();
> >  aml_append(crs, aml_io(AML_DECODE16, RTC_ISA_BASE, RTC_ISA_BASE,
> > -   0x10, 0x02));
> > +   0x10, 0x08));
> >  aml_append(crs, aml_irq_no_flags(RTC_ISA_IRQ));
> > -aml_append(crs, aml_io(AML_DECODE16, RTC_ISA_BASE + 2, RTC_ISA_BASE + 
> > 2,
> > -   0x02, 0x06));
> can we just drop the later range as unused? (I don't see where it's actually 
> initialized)

I'd rather follow what physical hardware is doing here
for better compatibility ...

take care,
  Gerd




Re: [PATCH v23 0/4] implement zstd cluster compression method

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

06.05.2020 11:01, Denis Plotnikov wrote:



On 05.05.2020 15:03, Max Reitz wrote:

On 05.05.20 12:26, Max Reitz wrote:

On 30.04.20 12:19, Denis Plotnikov wrote:

v23:
    Undecided: whether to add zstd(zlib) compression
   details to the qcow2 spec
    03: tighten assertion on zstd decompression [Eric]
    04: use _rm_test_img appropriately [Max]

Thanks, applied to my block branch:

I’m afraid I have to unqueue this series again, because it makes many
iotests fail due to an additional “compression_type=zlib” output when
images are created, an additional “compression type” line in
qemu-img info output where format-specific information is not
suppressed, and an additional line in qemu-img create -f qcow2 -o help.

Max



Hmm, this is strange. I made some modifications for the tests
in 0001 of the series (qcow2: introduce compression type feature).

Among the other test related changes, the patch contains the hunk:

+++ b/tests/qemu-iotests/common.filter
@@ -152,7 +152,8 @@ _filter_img_create()
  -e "s# refcount_bits=[0-9]\\+##g" \
  -e "s# key-secret=[a-zA-Z0-9]\\+##g" \
  -e "s# iter-time=[0-9]\\+##g" \
-    -e "s# force_size=\\(on\\|off\\)##g"
+    -e "s# force_size=\\(on\\|off\\)##g" \
+    -e "s# compression_type=[a-zA-Z0-9]\\+##g"
  }

which has to filter out "compression_type" on image creation.

But you say that you can see the "compression_type" on the image creation.
May be the patch wasn't fully applied? Or the test related modification were 
omitted?

I've just re-based the series on top of:

681b07f4e76dbb700c16918d (vanilla/master, mainstream)
Merge: a2261b2754 714eb0dbc5
Author: Peter Maydell 
Date:   Tue May 5 15:47:44 2020 +0100

and run the tests with 'make check-block'

and got the following:

Not run: 071 099 184 220 259 267
Some cases not run in: 030 040 041
Passed all 113 iotests



Hmm, we definitely have a lot more iotests. So, I assume make check-block 
doesn't run all of them.

To run all:

cd tests/qemu-iotests
./check -qcow2
./check -raw
... and so for any format you want to test

I also recommend to do
  export TEST_DIR=/something
before running tests, where something is tmpfs or ssd, which will decrease 
testing time a lot. Still, some (very small subset) of tests doesn't run on 
tmpfs, you can rerun them with TEST_DIR=/normal/hdd/directory

--
Best regards,
Vladimir



Re: [PATCH v2 3/4] backup: Make sure that source and target size match

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

06.05.2020 11:02, Kevin Wolf wrote:

Am 06.05.2020 um 08:07 hat Vladimir Sementsov-Ogievskiy geschrieben:

05.05.2020 13:03, Kevin Wolf wrote:

Am 30.04.2020 um 20:21 hat Vladimir Sementsov-Ogievskiy geschrieben:

30.04.2020 17:27, Kevin Wolf wrote:

Since the introduction of a backup filter node in commit 00e30f05d, the
backup block job crashes when the target image is smaller than the
source image because it will try to write after the end of the target
node without having BLK_PERM_RESIZE. (Previously, the BlockBackend layer
would have caught this and errored out gracefully.)

We can fix this and even do better than the old behaviour: Check that
source and target have the same image size at the start of the block job
and unshare BLK_PERM_RESIZE. (This permission was already unshared
before the same commit 00e30f05d, but the BlockBackend that was used to
make the restriction was removed without a replacement.) This will
immediately error out when starting the job instead of only when writing
to a block that doesn't exist in the target.

Longer target than source would technically work because we would never
write to blocks that don't exist, but semantically these are invalid,
too, because a backup is supposed to create a copy, not just an image
that starts with a copy.

Fixes: 00e30f05de1d19586345ec373970ef4c192c6270
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1778593
Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 


I'm OK with it as is, as it fixes bug:

Reviewed-by: Vladimir Sementsov-Ogievskiy 

still, some notes below



---
block/backup-top.c | 14 +-
block/backup.c | 14 +-
2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/block/backup-top.c b/block/backup-top.c
index 3b50c06e2c..79b268e6dc 100644
--- a/block/backup-top.c
+++ b/block/backup-top.c
@@ -148,8 +148,10 @@ static void backup_top_child_perm(BlockDriverState *bs, 
BdrvChild *c,
 *
 * Share write to target (child_file), to not interfere
 * with guest writes to its disk which may be in target backing 
chain.
+ * Can't resize during a backup block job because we check the size
+ * only upfront.
 */
-*nshared = BLK_PERM_ALL;
+*nshared = BLK_PERM_ALL & ~BLK_PERM_RESIZE;
*nperm = BLK_PERM_WRITE;
} else {
/* Source child */
@@ -159,7 +161,7 @@ static void backup_top_child_perm(BlockDriverState *bs, 
BdrvChild *c,
if (perm & BLK_PERM_WRITE) {
*nperm = *nperm | BLK_PERM_CONSISTENT_READ;
}
-*nshared &= ~BLK_PERM_WRITE;
+*nshared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
}
}
@@ -192,11 +194,13 @@ BlockDriverState *bdrv_backup_top_append(BlockDriverState 
*source,
{
Error *local_err = NULL;
BDRVBackupTopState *state;
-BlockDriverState *top = bdrv_new_open_driver(_backup_top_filter,
- filter_node_name,
- BDRV_O_RDWR, errp);
+BlockDriverState *top;
bool appended = false;
+assert(source->total_sectors == target->total_sectors);


May be better to error-out, just to keep backup-top independent. Still, now 
it's not
really needed, as we have only one caller. And this function have to be 
refactored
anyway, when publishing this filter (open() and close() should appear, so this 
code
will be rewritten anyway.)


Yes, the whole function only works because it's used in this restricted
context today. For example, we only know that total_sectors is up to
date because the caller has called bdrv_getlength() just a moment ago.

I think fixing this would be beyond the scope of this patch, but
certainly a good idea anyway.


And the other thought: the permissions we declared above, will be activated 
only after
successful bdrv_child_refresh_perms(). I think some kind of race is possible, 
so that
size is changed actual permission activation. So, may be good to double check 
sizes after
bdrv_child_refresh_perms().. But it's a kind of paranoia.


We're not in coroutine context, so we can't yield. I don't see who could
change the size in parallel (apart from an external process, but an
external process can mess up anything).

When we make backup-top an independent driver, instead of double
checking (what would you do on error?), maybe we could move the size
initialisation (then with bdrv_getlength()) to after
bdrv_child_refresh_perms().


Also, third thought: the restricted permissions doesn't save us from resizing
of the source through exactly this node, does it? Hmm, but your test works 
somehow. But
(I assume) it worked in a previous patch version without unsharing on source..

Ha, but bdrv_co_truncate just can't work on backup-top, because it doesn't have 
file child.
But, if we fix bdrv_co_truncate to skip filters, we'll need to define 
.bdrv_co_truncate in
backup_top, which will return 

Re: [PATCH v2 3/4] backup: Make sure that source and target size match

2020-05-06 Thread Kevin Wolf
Am 06.05.2020 um 08:07 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 05.05.2020 13:03, Kevin Wolf wrote:
> > Am 30.04.2020 um 20:21 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 30.04.2020 17:27, Kevin Wolf wrote:
> > > > Since the introduction of a backup filter node in commit 00e30f05d, the
> > > > backup block job crashes when the target image is smaller than the
> > > > source image because it will try to write after the end of the target
> > > > node without having BLK_PERM_RESIZE. (Previously, the BlockBackend layer
> > > > would have caught this and errored out gracefully.)
> > > > 
> > > > We can fix this and even do better than the old behaviour: Check that
> > > > source and target have the same image size at the start of the block job
> > > > and unshare BLK_PERM_RESIZE. (This permission was already unshared
> > > > before the same commit 00e30f05d, but the BlockBackend that was used to
> > > > make the restriction was removed without a replacement.) This will
> > > > immediately error out when starting the job instead of only when writing
> > > > to a block that doesn't exist in the target.
> > > > 
> > > > Longer target than source would technically work because we would never
> > > > write to blocks that don't exist, but semantically these are invalid,
> > > > too, because a backup is supposed to create a copy, not just an image
> > > > that starts with a copy.
> > > > 
> > > > Fixes: 00e30f05de1d19586345ec373970ef4c192c6270
> > > > Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1778593
> > > > Cc: qemu-sta...@nongnu.org
> > > > Signed-off-by: Kevin Wolf 
> > > 
> > > I'm OK with it as is, as it fixes bug:
> > > 
> > > Reviewed-by: Vladimir Sementsov-Ogievskiy 
> > > 
> > > still, some notes below
> > > 
> > > 
> > > > ---
> > > >block/backup-top.c | 14 +-
> > > >block/backup.c | 14 +-
> > > >2 files changed, 22 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/block/backup-top.c b/block/backup-top.c
> > > > index 3b50c06e2c..79b268e6dc 100644
> > > > --- a/block/backup-top.c
> > > > +++ b/block/backup-top.c
> > > > @@ -148,8 +148,10 @@ static void backup_top_child_perm(BlockDriverState 
> > > > *bs, BdrvChild *c,
> > > > *
> > > > * Share write to target (child_file), to not interfere
> > > > * with guest writes to its disk which may be in target 
> > > > backing chain.
> > > > + * Can't resize during a backup block job because we check the 
> > > > size
> > > > + * only upfront.
> > > > */
> > > > -*nshared = BLK_PERM_ALL;
> > > > +*nshared = BLK_PERM_ALL & ~BLK_PERM_RESIZE;
> > > >*nperm = BLK_PERM_WRITE;
> > > >} else {
> > > >/* Source child */
> > > > @@ -159,7 +161,7 @@ static void backup_top_child_perm(BlockDriverState 
> > > > *bs, BdrvChild *c,
> > > >if (perm & BLK_PERM_WRITE) {
> > > >*nperm = *nperm | BLK_PERM_CONSISTENT_READ;
> > > >}
> > > > -*nshared &= ~BLK_PERM_WRITE;
> > > > +*nshared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
> > > >}
> > > >}
> > > > @@ -192,11 +194,13 @@ BlockDriverState 
> > > > *bdrv_backup_top_append(BlockDriverState *source,
> > > >{
> > > >Error *local_err = NULL;
> > > >BDRVBackupTopState *state;
> > > > -BlockDriverState *top = 
> > > > bdrv_new_open_driver(_backup_top_filter,
> > > > - filter_node_name,
> > > > - BDRV_O_RDWR, errp);
> > > > +BlockDriverState *top;
> > > >bool appended = false;
> > > > +assert(source->total_sectors == target->total_sectors);
> > > 
> > > May be better to error-out, just to keep backup-top independent. Still, 
> > > now it's not
> > > really needed, as we have only one caller. And this function have to be 
> > > refactored
> > > anyway, when publishing this filter (open() and close() should appear, so 
> > > this code
> > > will be rewritten anyway.)
> > 
> > Yes, the whole function only works because it's used in this restricted
> > context today. For example, we only know that total_sectors is up to
> > date because the caller has called bdrv_getlength() just a moment ago.
> > 
> > I think fixing this would be beyond the scope of this patch, but
> > certainly a good idea anyway.
> > 
> > > And the other thought: the permissions we declared above, will be 
> > > activated only after
> > > successful bdrv_child_refresh_perms(). I think some kind of race is 
> > > possible, so that
> > > size is changed actual permission activation. So, may be good to double 
> > > check sizes after
> > > bdrv_child_refresh_perms().. But it's a kind of paranoia.
> > 
> > We're not in coroutine context, so we can't yield. I don't see who could
> > change the size in parallel (apart from an external process, but an
> > external process can mess up anything).
> > 
> 

Re: [PATCH v23 0/4] implement zstd cluster compression method

2020-05-06 Thread Denis Plotnikov




On 05.05.2020 15:03, Max Reitz wrote:

On 05.05.20 12:26, Max Reitz wrote:

On 30.04.20 12:19, Denis Plotnikov wrote:

v23:
Undecided: whether to add zstd(zlib) compression
   details to the qcow2 spec
03: tighten assertion on zstd decompression [Eric]
04: use _rm_test_img appropriately [Max]

Thanks, applied to my block branch:

I’m afraid I have to unqueue this series again, because it makes many
iotests fail due to an additional “compression_type=zlib” output when
images are created, an additional “compression type” line in
qemu-img info output where format-specific information is not
suppressed, and an additional line in qemu-img create -f qcow2 -o help.

Max



Hmm, this is strange. I made some modifications for the tests
in 0001 of the series (qcow2: introduce compression type feature).

Among the other test related changes, the patch contains the hunk:

+++ b/tests/qemu-iotests/common.filter
@@ -152,7 +152,8 @@ _filter_img_create()
 -e "s# refcount_bits=[0-9]\\+##g" \
 -e "s# key-secret=[a-zA-Z0-9]\\+##g" \
 -e "s# iter-time=[0-9]\\+##g" \
-    -e "s# force_size=\\(on\\|off\\)##g"
+    -e "s# force_size=\\(on\\|off\\)##g" \
+    -e "s# compression_type=[a-zA-Z0-9]\\+##g"
 }

which has to filter out "compression_type" on image creation.

But you say that you can see the "compression_type" on the image creation.
May be the patch wasn't fully applied? Or the test related modification 
were omitted?


I've just re-based the series on top of:

681b07f4e76dbb700c16918d (vanilla/master, mainstream)
Merge: a2261b2754 714eb0dbc5
Author: Peter Maydell 
Date:   Tue May 5 15:47:44 2020 +0100

and run the tests with 'make check-block'

and got the following:

Not run: 071 099 184 220 259 267
Some cases not run in: 030 040 041
Passed all 113 iotests

May be I do something wrong?

Denis



Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:

It's safer to expand in_flight request to start before enter to
coroutine in synchronous wrappers, due to the following (theoretical)
problem:

Consider write.
It's possible, that qemu_coroutine_enter only schedules execution,
assume such case.

Then we may possibly have the following:

1. Somehow check that we are not in drained section in outer code.

2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
will protect us from starting drained section.

3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
in_flight).

4. Assume coroutine not yet actually entered, only scheduled, and we go
to some code, which starts drained section (as in_flight is zero).

5. Scheduled coroutine starts, and blindly increases in_flight, and we
are in drained section with in_flight request.

Signed-off-by: Vladimir Sementsov-Ogievskiy 


Very interesting: this patch breaks test-replication. It hangs:

(gdb) thr a a bt

Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
#0  0x7eff2f5fd1fd in syscall () from /lib64/libc.so.6
#1  0x55af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 
, val=4294967295) at 
/work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
#2  0x55af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 
) at util/qemu-thread-posix.c:459
#3  0x55af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
#4  0x55af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at 
util/qemu-thread-posix.c:519
#5  0x7eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
#6  0x7eff2f602553 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
#0  0x7eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
#1  0x55af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) 
at util/qemu-timer.c:335
#2  0x55af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, 
ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
#3  0x55af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at 
util/aio-posix.c:600
#4  0x55af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, 
recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at 
block/io.c:429
#5  0x55af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at 
block/io.c:435
#6  0x55af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at 
block/block-backend.c:1681
#7  0x55af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at 
block/block-backend.c:473
#8  0x55af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at 
block/mirror.c:667
#9  0x55af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at 
block/mirror.c:765
#10 0x55af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
#11 0x55af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 
) at job.c:158
#12 0x55af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
#13 0x55af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at 
job.c:852
#14 0x55af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
#15 0x55af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
#16 0x55af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
#17 0x55af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
#18 0x55af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at 
util/aio-posix.c:650
#19 0x55af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
#20 0x55af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
#21 0x55af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
#22 0x55af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
#23 0x55af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at 
block.c:2684
#24 0x55af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at 
block/block-backend.c:803
#25 0x55af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at 
block/block-backend.c:422
#26 0x55af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at 
block/block-backend.c:477
#27 0x55af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
#28 0x55af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
#29 0x7eff2fd7df7e in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#30 0x7eff2fd7dd24 in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#31 0x7eff2fd7dd24 in g_test_run_suite_internal () from 
/lib64/libglib-2.0.so.0
#32 0x7eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
#33 0x7eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
#34 0x55af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at 
tests/test-replication.c:645


(gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
$5 = 0
(gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
$6 = 0
(gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
$7 = 1
(gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
$8 = 0
(gdb) fr 20
#20 0x55af9a874351 in 

Re: [PATCH v3 17/17] block: use int64_t instead of int in driver discard handlers

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

30.04.2020 14:10, Vladimir Sementsov-Ogievskiy wrote:

We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver discard handlers bytes parameter to int64_t.

This patch just converts handlers where it is obvious that bytes
parameter is passed further to 64bit interfaces, and add simple
wrappers where it is not obvious.

Series: 64bit-block-status
Signed-off-by: Vladimir Sementsov-Ogievskiy



squash in:

--- a/tests/test-block-iothread.c
+++ b/tests/test-block-iothread.c
@@ -48,7 +48,7 @@ static int coroutine_fn bdrv_test_co_pwritev(BlockDriverState 
*bs,
 }
 
 static int coroutine_fn bdrv_test_co_pdiscard(BlockDriverState *bs,

-  int64_t offset, int bytes)
+  int64_t offset, int64_t bytes)
 {
 return 0;
 }



Re: [PATCH v3 00/17] 64bit block-layer

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

30.04.2020 23:51, no-re...@patchew.org wrote:

Patchew URL: 
https://patchew.org/QEMU/20200430111033.29980-1-vsement...@virtuozzo.com/



Hi,

This series failed the docker-quick@centos7 build test. Please find the testing 
commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
make docker-image-centos7 V=1 NETWORK=1
time make docker-test-quick@centos7 SHOW_ENV=1 J=14 NETWORK=1
=== TEST SCRIPT END ===

   CC  tests/test-timed-average.o
   CC  tests/test-util-filemonitor.o
   CC  tests/test-util-sockets.o
/tmp/qemu-test/src/tests/test-block-iothread.c:79:5: error: initialization from 
incompatible pointer type [-Werror]
  .bdrv_co_pdiscard   = bdrv_test_co_pdiscard,
  ^


Oh yes. Shows me didn't run make check :(


/tmp/qemu-test/src/tests/test-block-iothread.c:79:5: error: (near 
initialization for 'bdrv_test.bdrv_co_pdiscard') [-Werror]
cc1: all warnings being treated as errors
   CC  tests/test-authz-simple.o
make: *** [tests/test-block-iothread.o] Error 1
make: *** Waiting for unfinished jobs
Traceback (most recent call last):
   File "./tests/docker/docker.py", line 664, in 
---
 raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', 
'--label', 'com.qemu.instance.uuid=42df3799b5074b82a8a4b696cb76547f', '-u', 
'1001', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', 
'-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 
'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', 
'/home/patchew/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', 
'/var/tmp/patchew-tester-tmp-t9q8cenf/src/docker-src.2020-04-30-16.47.42.6034:/var/tmp/qemu:z,ro',
 'qemu:centos7', '/var/tmp/qemu/run', 'test-quick']' returned non-zero exit 
status 2.
filter=--filter=label=com.qemu.instance.uuid=42df3799b5074b82a8a4b696cb76547f
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-t9q8cenf/src'
make: *** [docker-run-test-quick@centos7] Error 2

real3m38.963s
user0m9.441s


The full log is available at
http://patchew.org/logs/20200430111033.29980-1-vsement...@virtuozzo.com/testing.docker-quick@centos7/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-de...@redhat.com




--
Best regards,
Vladimir



Re: [PATCH 0/8] iotests skipping

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

04.05.2020 19:32, Kevin Wolf wrote:

Am 30.04.2020 um 14:47 hat Vladimir Sementsov-Ogievskiy geschrieben:

Hi all!

This series adds a bit more support for iotests skipping due to format
whitelisting. Not pretend to be something complete. It just lay in its
folder I don't know how much time, I forgot to send it.

Still, now I've rebased it on master, let's take them, they are useful.


I agree. They are certainly not complete by any means, but let's just
take what we already have.

Thanks, applied to the block branch.



Great! Thank you!


--
Best regards,
Vladimir



Re: [PATCH v2 2/6] block/nbd-client: drop max_block restriction from discard

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

04.05.2020 22:59, Eric Blake wrote:

On 4/21/20 4:56 PM, Eric Blake wrote:

On 4/1/20 10:01 AM, Vladimir Sementsov-Ogievskiy wrote:

NBD spec is updated, so that max_block doesn't relate to
NBD_CMD_TRIM. So, drop the restriction.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  block/nbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Eric Blake 

I might tweak the commit message of 1/6 and here to call out the NBD spec 
commit id (nbd.git 9f30fedb), but that doesn't change the patch proper.


I'm queuing 1 and 2 through my NBD tree now; the rest involve more of the block 
layer and go in tandem with your other work on cleaning up 64-bit operations 
throughout, and I still need to give that a better review.



Thank you!

--
Best regards,
Vladimir



Re: [PATCH v2 3/4] backup: Make sure that source and target size match

2020-05-06 Thread Vladimir Sementsov-Ogievskiy

05.05.2020 13:03, Kevin Wolf wrote:

Am 30.04.2020 um 20:21 hat Vladimir Sementsov-Ogievskiy geschrieben:

30.04.2020 17:27, Kevin Wolf wrote:

Since the introduction of a backup filter node in commit 00e30f05d, the
backup block job crashes when the target image is smaller than the
source image because it will try to write after the end of the target
node without having BLK_PERM_RESIZE. (Previously, the BlockBackend layer
would have caught this and errored out gracefully.)

We can fix this and even do better than the old behaviour: Check that
source and target have the same image size at the start of the block job
and unshare BLK_PERM_RESIZE. (This permission was already unshared
before the same commit 00e30f05d, but the BlockBackend that was used to
make the restriction was removed without a replacement.) This will
immediately error out when starting the job instead of only when writing
to a block that doesn't exist in the target.

Longer target than source would technically work because we would never
write to blocks that don't exist, but semantically these are invalid,
too, because a backup is supposed to create a copy, not just an image
that starts with a copy.

Fixes: 00e30f05de1d19586345ec373970ef4c192c6270
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1778593
Cc: qemu-sta...@nongnu.org
Signed-off-by: Kevin Wolf 


I'm OK with it as is, as it fixes bug:

Reviewed-by: Vladimir Sementsov-Ogievskiy 

still, some notes below



---
   block/backup-top.c | 14 +-
   block/backup.c | 14 +-
   2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/block/backup-top.c b/block/backup-top.c
index 3b50c06e2c..79b268e6dc 100644
--- a/block/backup-top.c
+++ b/block/backup-top.c
@@ -148,8 +148,10 @@ static void backup_top_child_perm(BlockDriverState *bs, 
BdrvChild *c,
*
* Share write to target (child_file), to not interfere
* with guest writes to its disk which may be in target backing 
chain.
+ * Can't resize during a backup block job because we check the size
+ * only upfront.
*/
-*nshared = BLK_PERM_ALL;
+*nshared = BLK_PERM_ALL & ~BLK_PERM_RESIZE;
   *nperm = BLK_PERM_WRITE;
   } else {
   /* Source child */
@@ -159,7 +161,7 @@ static void backup_top_child_perm(BlockDriverState *bs, 
BdrvChild *c,
   if (perm & BLK_PERM_WRITE) {
   *nperm = *nperm | BLK_PERM_CONSISTENT_READ;
   }
-*nshared &= ~BLK_PERM_WRITE;
+*nshared &= ~(BLK_PERM_WRITE | BLK_PERM_RESIZE);
   }
   }
@@ -192,11 +194,13 @@ BlockDriverState *bdrv_backup_top_append(BlockDriverState 
*source,
   {
   Error *local_err = NULL;
   BDRVBackupTopState *state;
-BlockDriverState *top = bdrv_new_open_driver(_backup_top_filter,
- filter_node_name,
- BDRV_O_RDWR, errp);
+BlockDriverState *top;
   bool appended = false;
+assert(source->total_sectors == target->total_sectors);


May be better to error-out, just to keep backup-top independent. Still, now 
it's not
really needed, as we have only one caller. And this function have to be 
refactored
anyway, when publishing this filter (open() and close() should appear, so this 
code
will be rewritten anyway.)


Yes, the whole function only works because it's used in this restricted
context today. For example, we only know that total_sectors is up to
date because the caller has called bdrv_getlength() just a moment ago.

I think fixing this would be beyond the scope of this patch, but
certainly a good idea anyway.


And the other thought: the permissions we declared above, will be activated 
only after
successful bdrv_child_refresh_perms(). I think some kind of race is possible, 
so that
size is changed actual permission activation. So, may be good to double check 
sizes after
bdrv_child_refresh_perms().. But it's a kind of paranoia.


We're not in coroutine context, so we can't yield. I don't see who could
change the size in parallel (apart from an external process, but an
external process can mess up anything).

When we make backup-top an independent driver, instead of double
checking (what would you do on error?), maybe we could move the size
initialisation (then with bdrv_getlength()) to after
bdrv_child_refresh_perms().


Also, third thought: the restricted permissions doesn't save us from resizing
of the source through exactly this node, does it? Hmm, but your test works 
somehow. But
(I assume) it worked in a previous patch version without unsharing on source..

Ha, but bdrv_co_truncate just can't work on backup-top, because it doesn't have 
file child.
But, if we fix bdrv_co_truncate to skip filters, we'll need to define 
.bdrv_co_truncate in
backup_top, which will return something like -EBUSY.. Or just -ENOTSUP, doesn't 
matter.


Maybe this is a sign that bdrv_co_truncate shouldn't automatically