Re: [Qemu-devel] [PATCH v5 01/20] block: Add .bdrv_co_block_status() callback

2017-12-06 Thread Eric Blake
On 12/01/2017 09:24 AM, Kevin Wolf wrote:
> Am 01.12.2017 um 16:03 hat Eric Blake geschrieben:
>> On 12/01/2017 08:40 AM, Kevin Wolf wrote:
 Note that most drivers give sector-aligned answers, except at
 end-of-file, even when request_alignment is smaller than a sector.
 However, bdrv_getlength() is sector-aligned (even though it gives a
 byte answer), often by exceeding the actual file size.  If we were to
 give back strict results, at least file-posix.c would report a
 transition from DATA to HOLE at the end of a file even in the middle
 of a sector, which can throw off callers; so we intentionally lie and
 state that any partial sector at the end of a file has the same
 status for the entire sector.  Maybe at some future day we can
 report actual file size instead of rounding up, but not for this
 series.
>>>
>>> In what way does this throw off callers in practice?
>>
>> Several iotests failed if I didn't do that (it's been a few months, so the
>> details are a bit fuzzy).  I think the biggest problem is that because we
>> round the size up in bdrv_getlength(), but CANNOT access those rounded
>> bytes, then reporting the status of those bytes as a hole (which is the only
>> sane thing that file-posix can do) can cause grief when the rest of the
>> sector (which we CAN access) is data.
> 
> Does this indicate caller bugs? If they call byte-based
> bdrv_co_block_status(), they should surely be able to handle this
> situation in theory?

Hmm, the more I try to reproduce the failure, the more I'm coming to the
conclusion that I needed this back when we returned a rounded offset via
the return value (my tag nbd-byte-callback-v2), but now that we return
the actual offset rather than a rounded offset via a secondary parameter
(my tag nbd-byte-callback-v4), we don't need to compensate for
end-of-file at all!  That means even less code in my upcoming v5 submission.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH v5 01/20] block: Add .bdrv_co_block_status() callback

2017-12-01 Thread Kevin Wolf
Am 01.12.2017 um 02:42 hat Eric Blake geschrieben:
> We are gradually moving away from sector-based interfaces, towards
> byte-based. Now that the block layer exposes byte-based allocation,
> it's time to tackle the drivers.  Add a new callback that operates
> on as small as byte boundaries. Subsequent patches will then update
> individual drivers, then finally remove .bdrv_co_get_block_status().
> The old code now uses a goto in order to minimize churn at that later
> removal.  Update documentation in this patch so that the later
> removal can be a straight delete.
> 
> The new code also passes through the 'want_zero' hint, which will
> allow subsequent patches to further optimize callers that only care
> about how much of the image is allocated (want_zero is false),
> rather than full details about runs of zeroes and which offsets the
> allocation actually maps to (want_zero is true).
> 
> Note that most drivers give sector-aligned answers, except at
> end-of-file, even when request_alignment is smaller than a sector.
> However, bdrv_getlength() is sector-aligned (even though it gives a
> byte answer), often by exceeding the actual file size.  If we were to
> give back strict results, at least file-posix.c would report a
> transition from DATA to HOLE at the end of a file even in the middle
> of a sector, which can throw off callers; so we intentionally lie and
> state that any partial sector at the end of a file has the same
> status for the entire sector.  Maybe at some future day we can
> report actual file size instead of rounding up, but not for this
> series.

In what way does this throw off callers in practice?

The rounding will lead to strange effects, and I'm not sure that dealing
with them is any easier than fixing the callers. Imagine an image file
like this (very small example, file size 384 bytes):

0128  384  512
||||
++++
|Hole|  Data  ||
++++
  |
  EOF

bdrv_co_block_status(offset=0, bytes=512) returns 512 bytes of HOLE.
bdrv_co_block_status(offset=128, bytes=512) returns 384 bytes of DATA.
bdrv_co_block_status(offset=384, bytes=512) returns 128 bytes of HOLE.

This is not only contradictory, but the first one is almost begging for
data corruption because it returns HOLE for a region that actually
contains data.

The only excuse I can imagine is that we say that this can never happen
because drivers use 512 byte granularity anyway. But why are we
introducing the new interface then? I don't think this semantics is
compatible with a bytes-based driver interface.

> We also add an assertion that any driver using the new callback will
> make progress (the only time pnum will be 0 is if the block layer
> already handled an out-of-bounds request, or if there is an error).
> 
> Signed-off-by: Eric Blake 
> 
> ---
> v5: rebase to master, typo fix, document more block layer guarantees
> v4: rebase to master
> v3: no change
> v2: improve alignment handling, ensure all iotests still pass
> ---
>  include/block/block.h |  9 -
>  include/block/block_int.h | 14 +++---
>  block/io.c| 31 ++-
>  3 files changed, 41 insertions(+), 13 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index c05cac57e5..798e98f783 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -136,11 +136,10 @@ typedef struct HDGeometry {
>   * that the block layer recompute the answer from the 
> returned
>   * BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
>   *
> - * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK) of
> - * the return value (old interface) or the entire map parameter (new
> - * interface) represent the offset in the returned BDS that is allocated for
> - * the corresponding raw data.  However, whether that offset actually
> - * contains data also depends on BDRV_BLOCK_DATA, as follows:
> + * If BDRV_BLOCK_OFFSET_VALID is set, the map parameter represents the
> + * host offset within the returned BDS that is allocated for the
> + * corresponding raw guest data.  However, whether that offset
> + * actually contains data also depends on BDRV_BLOCK_DATA, as follows:
>   *
>   * DATA ZERO OFFSET_VALID
>   *  ttt   sectors read as zero, returned file is zero at 
> offset
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index a5482775ec..071263c40f 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -206,13 +206,21 @@ struct BlockDriver {
>   * bdrv_is_allocated[_above].  The driver should answer only
>   * according to the current layer, and should not set
>   * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
> - * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
> - * layer guarantees input aligned to 

Re: [Qemu-devel] [PATCH v5 01/20] block: Add .bdrv_co_block_status() callback

2017-12-01 Thread Kevin Wolf
Am 01.12.2017 um 16:03 hat Eric Blake geschrieben:
> On 12/01/2017 08:40 AM, Kevin Wolf wrote:
> > > Note that most drivers give sector-aligned answers, except at
> > > end-of-file, even when request_alignment is smaller than a sector.
> > > However, bdrv_getlength() is sector-aligned (even though it gives a
> > > byte answer), often by exceeding the actual file size.  If we were to
> > > give back strict results, at least file-posix.c would report a
> > > transition from DATA to HOLE at the end of a file even in the middle
> > > of a sector, which can throw off callers; so we intentionally lie and
> > > state that any partial sector at the end of a file has the same
> > > status for the entire sector.  Maybe at some future day we can
> > > report actual file size instead of rounding up, but not for this
> > > series.
> > 
> > In what way does this throw off callers in practice?
> 
> Several iotests failed if I didn't do that (it's been a few months, so the
> details are a bit fuzzy).  I think the biggest problem is that because we
> round the size up in bdrv_getlength(), but CANNOT access those rounded
> bytes, then reporting the status of those bytes as a hole (which is the only
> sane thing that file-posix can do) can cause grief when the rest of the
> sector (which we CAN access) is data.

Does this indicate caller bugs? If they call byte-based
bdrv_co_block_status(), they should surely be able to handle this
situation in theory?

> > The rounding will lead to strange effects, and I'm not sure that dealing
> > with them is any easier than fixing the callers. Imagine an image file
> > like this (very small example, file size 384 bytes):
> > 
> >  0128  384  512
> >  ||||
> >  ++++
> >  |Hole|  Data  ||
> >  ++++
> >|
> >EOF
> 
> Unlikely.  Holes are at least a sector in size on all known filesystems that
> have holes; that's also true for qcow2 format.  The only non-sector-aligned
> hole that you can encounter in practice is at EOF.

Yes, probably not going to happen on file-posix. But I wouldn't bet
money that it's the same for all of the other protocols.

Wasn't one of the reasons for this series that NBD actually allows byte
granularity block status in its protocol? Which means that an NBD server
could expose something like this. Of course, in practice the server is
backed by something, too, so this situation is very unlikely, but
strictly speaking we wouldn't be able to work with all compliant
servers.

> > bdrv_co_block_status(offset=0, bytes=512) returns 512 bytes of HOLE.
> > bdrv_co_block_status(offset=128, bytes=512) returns 384 bytes of DATA.
> > bdrv_co_block_status(offset=384, bytes=512) returns 128 bytes of HOLE.
> > 
> > This is not only contradictory, but the first one is almost begging for
> > data corruption because it returns HOLE for a region that actually
> > contains data.
> 
> I agree that it would be confusing, if it were possible.  But in practice it
> is not possible.
> 
> > 
> > The only excuse I can imagine is that we say that this can never happen
> > because drivers use 512 byte granularity anyway. But why are we
> > introducing the new interface then? I don't think this semantics is
> > compatible with a bytes-based driver interface.
> 
> Really, the ONLY boundary that is unlikely to ever be 512-byte aligned is at
> EOF - and we wouldn't even need to do any rounding if bdrv_getlength()
> didn't round.

Yes, bdrv_getlength() needs to be converted sooner or later, too.

> One thing I _can_ do: it is ALWAYS valid to report a partial sector as data.
> It may pessimize the code slightly, but while rounding the size of a hole up
> can be wrong if the rounding covers data but the caller getting the rounded
> data treats that data as 0, rounding the size of data up will never
> misbehave because the caller will just read the literal zeroes.  So I will
> tweak the code to make sure that if any rounding takes place, that either
> the driver already set BDRV_BLOCK_EOF (all further bytes also read as zero),
> or else report the rounded region as data.  (I thought I could get away with
> only io.c setting BDRV_BLOCK_EOF; but it sounds like setting it in the
> drivers will be helpful).

This looks safer and definitely gives me a better feeling about the
rounding.

Kevin



Re: [Qemu-devel] [PATCH v5 01/20] block: Add .bdrv_co_block_status() callback

2017-12-01 Thread Eric Blake

On 12/01/2017 08:40 AM, Kevin Wolf wrote:

Note that most drivers give sector-aligned answers, except at
end-of-file, even when request_alignment is smaller than a sector.
However, bdrv_getlength() is sector-aligned (even though it gives a
byte answer), often by exceeding the actual file size.  If we were to
give back strict results, at least file-posix.c would report a
transition from DATA to HOLE at the end of a file even in the middle
of a sector, which can throw off callers; so we intentionally lie and
state that any partial sector at the end of a file has the same
status for the entire sector.  Maybe at some future day we can
report actual file size instead of rounding up, but not for this
series.


In what way does this throw off callers in practice?



Several iotests failed if I didn't do that (it's been a few months, so 
the details are a bit fuzzy).  I think the biggest problem is that 
because we round the size up in bdrv_getlength(), but CANNOT access 
those rounded bytes, then reporting the status of those bytes as a hole 
(which is the only sane thing that file-posix can do) can cause grief 
when the rest of the sector (which we CAN access) is data.



The rounding will lead to strange effects, and I'm not sure that dealing
with them is any easier than fixing the callers. Imagine an image file
like this (very small example, file size 384 bytes):

 0128  384  512
 ||||
 ++++
 |Hole|  Data  ||
 ++++
   |
   EOF


Unlikely.  Holes are at least a sector in size on all known filesystems 
that have holes; that's also true for qcow2 format.  The only 
non-sector-aligned hole that you can encounter in practice is at EOF.




bdrv_co_block_status(offset=0, bytes=512) returns 512 bytes of HOLE.
bdrv_co_block_status(offset=128, bytes=512) returns 384 bytes of DATA.
bdrv_co_block_status(offset=384, bytes=512) returns 128 bytes of HOLE.

This is not only contradictory, but the first one is almost begging for
data corruption because it returns HOLE for a region that actually
contains data.


I agree that it would be confusing, if it were possible.  But in 
practice it is not possible.




The only excuse I can imagine is that we say that this can never happen
because drivers use 512 byte granularity anyway. But why are we
introducing the new interface then? I don't think this semantics is
compatible with a bytes-based driver interface.


Really, the ONLY boundary that is unlikely to ever be 512-byte aligned 
is at EOF - and we wouldn't even need to do any rounding if 
bdrv_getlength() didn't round.


One thing I _can_ do: it is ALWAYS valid to report a partial sector as 
data. It may pessimize the code slightly, but while rounding the size of 
a hole up can be wrong if the rounding covers data but the caller 
getting the rounded data treats that data as 0, rounding the size of 
data up will never misbehave because the caller will just read the 
literal zeroes.  So I will tweak the code to make sure that if any 
rounding takes place, that either the driver already set BDRV_BLOCK_EOF 
(all further bytes also read as zero), or else report the rounded region 
as data.  (I thought I could get away with only io.c setting 
BDRV_BLOCK_EOF; but it sounds like setting it in the drivers will be 
helpful).





We also add an assertion that any driver using the new callback will
make progress (the only time pnum will be 0 is if the block layer
already handled an out-of-bounds request, or if there is an error).

Signed-off-by: Eric Blake 

+ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+aligned_bytes, pnum, _map,
+_file);
+if (ret < 0) {
+*pnum = 0;
+goto out;
+}
+assert(*pnum); /* The block driver must make progress */


I debated about combining this assert...


+
+/*
+ * total_size is always sector-aligned, by sometimes exceeding actual
+ * file size. Expand pnum if it lands mid-sector due to end-of-file.
+ */
+if (QEMU_ALIGN_UP(*pnum + aligned_offset,


aligned_offset + *pnum wouldn be the more conventional order.


Sure.




+  BDRV_SECTOR_SIZE) == total_size) {
+*pnum = total_size - aligned_offset;
+}
+refine:
  /*
   * The driver's result must be a multiple of request_alignment.
   * Clamp pnum and adjust map to original request.


...with these, but it would trigger on at least vmdk (see patch 17/20)

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org