Re: [Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

2017-07-07 Thread Kevin Wolf
Am 07.07.2017 um 04:55 hat Eric Blake geschrieben:
> On 07/06/2017 11:02 AM, Kevin Wolf wrote:
> 
> >> +++ b/qemu-img.c
> >> @@ -3229,6 +3229,7 @@ static int img_rebase(int argc, char **argv)
> >>  int64_t new_backing_num_sectors = 0;
> >>  uint64_t sector;
> >>  int n;
> >> +int64_t count;
> >>  float local_progress = 0;
> >>
> >>  buf_old = blk_blockalign(blk, IO_BUF_SIZE);
> >> @@ -3276,12 +3277,14 @@ static int img_rebase(int argc, char **argv)
> >>  }
> >>
> >>  /* If the cluster is allocated, we don't need to take action 
> >> */
> >> -ret = bdrv_is_allocated(bs, sector, n, );
> >> +ret = bdrv_is_allocated(bs, sector << BDRV_SECTOR_BITS,
> >> +n << BDRV_SECTOR_BITS, );
> >>  if (ret < 0) {
> >>  error_report("error while reading image metadata: %s",
> >>   strerror(-ret));
> >>  goto out;
> >>  }
> >> +n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
> >>  if (ret) {
> >>  continue;
> >>  }
> > 
> > Same thing. Sectors that are half allocated and half free must be
> > considered completely allocated if the caller can't do byte alignment.
> > Just rounding up without looking at ret isn't correct.
> 
> In reality, qemu-img rebase should probably be rewritten to exploit
> bdrv_get_block_status() instead of the weaker bdrv_is_allocated(). Right
> now, it is manually calling buffer_is_zero() on every sector to learn
> what sectors are candidates for optimizing, rather than being able to
> exploit BDRV_BLOCK_ZERO where possible

Yes, that's true, but also unrelated. As soon as you convert
bdrv_get_block_status() to byte granularity, you get the same thing
again.

Actually, having another look at this code, the solution for this
specific case is to convert the whole loop (or even function) to byte
granularity. The bdrv_is_allocated() call was the only thing that
required sectors.

> (yes, we may STILL want to call buffer_is_zero() on BDRV_BLOCK_DATA
> when we are trying to squeeze out all parts of an image that can be
> compressed, but maybe that should be an option, just as GNU dd has
> options for creating as much sparseness as possible (slow, because it
> reads data) vs. preserving existing sparseness (faster, because it
> relies on hole detection but leaves explicit written zeroes as
> non-sparse).

I'd expect that typically buffer_is_zero() is only slow if it's also
effective. Most data blocks start with non-zero data in the first few
bytes, so that's really quick. And I'm sure that checking whether a
buffer is zero is faster than doing disk I/O for the same buffer, so the
case of actual zero buffers should benefit, too. The only problem is
with cases where the buffer starts with many zeros and then ends in some
real data. I'm not sure how common this is.

> I guess there's still lots of ground for improving qemu-img, so for
> THIS series, I'll just leave it at an assertion that things are
> sector-aligned.

Sounds okay. If you want, you can add another patch to convert the
function to bytes, or we can do it later on top.

Kevin


pgpFDsy2F704N.pgp
Description: PGP signature


Re: [Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

2017-07-06 Thread Eric Blake
On 07/06/2017 11:02 AM, Kevin Wolf wrote:

>> +++ b/qemu-img.c
>> @@ -3229,6 +3229,7 @@ static int img_rebase(int argc, char **argv)
>>  int64_t new_backing_num_sectors = 0;
>>  uint64_t sector;
>>  int n;
>> +int64_t count;
>>  float local_progress = 0;
>>
>>  buf_old = blk_blockalign(blk, IO_BUF_SIZE);
>> @@ -3276,12 +3277,14 @@ static int img_rebase(int argc, char **argv)
>>  }
>>
>>  /* If the cluster is allocated, we don't need to take action */
>> -ret = bdrv_is_allocated(bs, sector, n, );
>> +ret = bdrv_is_allocated(bs, sector << BDRV_SECTOR_BITS,
>> +n << BDRV_SECTOR_BITS, );
>>  if (ret < 0) {
>>  error_report("error while reading image metadata: %s",
>>   strerror(-ret));
>>  goto out;
>>  }
>> +n = DIV_ROUND_UP(count, BDRV_SECTOR_SIZE);
>>  if (ret) {
>>  continue;
>>  }
> 
> Same thing. Sectors that are half allocated and half free must be
> considered completely allocated if the caller can't do byte alignment.
> Just rounding up without looking at ret isn't correct.

In reality, qemu-img rebase should probably be rewritten to exploit
bdrv_get_block_status() instead of the weaker bdrv_is_allocated(). Right
now, it is manually calling buffer_is_zero() on every sector to learn
what sectors are candidates for optimizing, rather than being able to
exploit BDRV_BLOCK_ZERO where possible (yes, we may STILL want to call
buffer_is_zero() on BDRV_BLOCK_DATA when we are trying to squeeze out
all parts of an image that can be compressed, but maybe that should be
an option, just as GNU dd has options for creating as much sparseness as
possible (slow, because it reads data) vs. preserving existing
sparseness (faster, because it relies on hole detection but leaves
explicit written zeroes as non-sparse).  I guess there's still lots of
ground for improving qemu-img, so for THIS series, I'll just leave it at
an assertion that things are sector-aligned.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

2017-07-06 Thread Eric Blake
On 07/06/2017 11:02 AM, Kevin Wolf wrote:
> Am 05.07.2017 um 23:08 hat Eric Blake geschrieben:
>> We are gradually moving away from sector-based interfaces, towards
>> byte-based.  In the common case, allocation is unlikely to ever use
>> values that are not naturally sector-aligned, but it is possible
>> that byte-based values will let us be more precise about allocation
>> at the end of an unaligned file that can do byte-based access.
>>
>> Changing the signature of the function to use int64_t *pnum ensures
>> that the compiler enforces that all callers are updated.  For now,
>> the io.c layer still assert()s that all callers are sector-aligned
>> on input and that *pnum is sector-aligned on return to the caller,
>> but that can be relaxed when a later patch implements byte-based
>> block status.  Therefore, this code adds usages like
>> DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned
>> values, where the call might reasonbly give non-aligned results
>> in the future; on the other hand, no rounding is needed for callers
>> that should just continue to work with byte alignment.
>>
>> For the most part this patch is just the addition of scaling at the
>> callers followed by inverse scaling at bdrv_is_allocated().  But
>> some code, particularly bdrv_commit(), gets a lot simpler because it
>> no longer has to mess with sectors; also, it is now possible to pass
>> NULL if the caller does not care how much of the image is allocated
>> beyond the initial offset.
>>
>> For ease of review, bdrv_is_allocated_above() will be tackled
>> separately.
>>

>> @@ -1036,14 +1036,15 @@ static int coroutine_fn 
>> bdrv_aligned_preadv(BdrvChild *child,

>>
>> -if (!ret || pnum != nb_sectors) {
>> +if (!ret || pnum != nb_sectors << BDRV_SECTOR_BITS) {
>>  ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
>>  goto out;
>>  }
> 
> This code would become a lot simpler if you just removed the local
> variables start_sector/end_sector and used the byte offsets instead that
> are available in this function. (And even simpler once sector alignment
> isn't required any more for bdrv_is_allocated(). Should we leave a
> comment here so we won't forget the conversion?)

Hmm - that cleanup is not (yet) done in my bdrv_get_block_status series
(where I finally drop the need for sector alignment in
bdrv_is_allocated).  Yes, I can add a temporary comment here (that aids
with the understanding of intermediate states), as well as a new
followup patch that actually does the cleanup.

>> -int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num,
>> -   int nb_sectors, int *pnum)
>> +int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
>> +   int64_t bytes, int64_t *pnum)
>>  {
>>  BlockDriverState *file;
>> -int64_t ret = bdrv_get_block_status(bs, sector_num, nb_sectors, pnum,
>> -);
>> +int64_t sector_num = offset >> BDRV_SECTOR_BITS;
>> +int nb_sectors = bytes >> BDRV_SECTOR_BITS;
>> +int64_t ret;
>> +int psectors;
>> +
>> +assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
>> +assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) &&
>> +   bytes < INT_MAX * BDRV_SECTOR_SIZE);
> 
> I think we can actually assert bytes <= INT_MAX. Both ways to write the
> assertion are only true because of BDRV_REQUEST_MAX_SECTORS, which
> ensures that the byte counts fit in an int.

That's a tighter assertion, but if it passes, I'm fine using it.

> 
> Some of the places that call bdrv_is_allocated() actually depend on this
> because your patches tend to use nb_sectors << BDRV_SECTOR_BITS (which
> truncates) rather than nb_sectors * BDRV_SECTOR_BYTES (which is always a
> 64 bit calculation).

Actually, for this series of cleanups, I tried really hard to use *
_BYTES everywhere that I widen sectors to bytes, and limit myself to >>
_BITS for going from bytes back to sectors (after also checking if I
needed DIV_ROUND_UP instead of straight division).  If you spot places
where I'm still doing a 32-bit << _BITS, please call it out (as I do
remember that I caused bugs that way back when I converted pdiscard to
byte-based, back near commit d2cb36af).


>> +++ b/block/stream.c
>> @@ -137,6 +137,7 @@ static void coroutine_fn stream_run(void *opaque)
>>
>>  for ( ; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
>>  bool copy;
>> +int64_t count = 0;
>>
>>  /* Note that even when no rate limit is applied we need to yield
>>   * with no pending I/O here so that bdrv_drain_all() returns.
>> @@ -148,8 +149,8 @@ static void coroutine_fn stream_run(void *opaque)
>>
>>  copy = false;
>>
>> -ret = bdrv_is_allocated(bs, offset / BDRV_SECTOR_SIZE,
>> -STREAM_BUFFER_SIZE / BDRV_SECTOR_SIZE, );
>> +ret = bdrv_is_allocated(bs, 

Re: [Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

2017-07-06 Thread Kevin Wolf
Am 05.07.2017 um 23:08 hat Eric Blake geschrieben:
> We are gradually moving away from sector-based interfaces, towards
> byte-based.  In the common case, allocation is unlikely to ever use
> values that are not naturally sector-aligned, but it is possible
> that byte-based values will let us be more precise about allocation
> at the end of an unaligned file that can do byte-based access.
> 
> Changing the signature of the function to use int64_t *pnum ensures
> that the compiler enforces that all callers are updated.  For now,
> the io.c layer still assert()s that all callers are sector-aligned
> on input and that *pnum is sector-aligned on return to the caller,
> but that can be relaxed when a later patch implements byte-based
> block status.  Therefore, this code adds usages like
> DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned
> values, where the call might reasonbly give non-aligned results
> in the future; on the other hand, no rounding is needed for callers
> that should just continue to work with byte alignment.
> 
> For the most part this patch is just the addition of scaling at the
> callers followed by inverse scaling at bdrv_is_allocated().  But
> some code, particularly bdrv_commit(), gets a lot simpler because it
> no longer has to mess with sectors; also, it is now possible to pass
> NULL if the caller does not care how much of the image is allocated
> beyond the initial offset.
> 
> For ease of review, bdrv_is_allocated_above() will be tackled
> separately.
> 
> Signed-off-by: Eric Blake 
> Reviewed-by: John Snow 
> Reviewed-by: Juan Quintela 
> Reviewed-by: Jeff Cody 

> diff --git a/block/io.c b/block/io.c
> index f662c97..cb40069 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1036,14 +1036,15 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild 
> *child,
>  int64_t start_sector = offset >> BDRV_SECTOR_BITS;
>  int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
>  unsigned int nb_sectors = end_sector - start_sector;
> -int pnum;
> +int64_t pnum;
> 
> -ret = bdrv_is_allocated(bs, start_sector, nb_sectors, );
> +ret = bdrv_is_allocated(bs, start_sector << BDRV_SECTOR_BITS,
> +nb_sectors << BDRV_SECTOR_BITS, );
>  if (ret < 0) {
>  goto out;
>  }
> 
> -if (!ret || pnum != nb_sectors) {
> +if (!ret || pnum != nb_sectors << BDRV_SECTOR_BITS) {
>  ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
>  goto out;
>  }

This code would become a lot simpler if you just removed the local
variables start_sector/end_sector and used the byte offsets instead that
are available in this function. (And even simpler once sector alignment
isn't required any more for bdrv_is_allocated(). Should we leave a
comment here so we won't forget the conversion?)

> @@ -1904,15 +1905,26 @@ int64_t bdrv_get_block_status(BlockDriverState *bs,
> sector_num, nb_sectors, pnum, file);
>  }
> 
> -int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num,
> -   int nb_sectors, int *pnum)
> +int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
> +   int64_t bytes, int64_t *pnum)
>  {
>  BlockDriverState *file;
> -int64_t ret = bdrv_get_block_status(bs, sector_num, nb_sectors, pnum,
> -);
> +int64_t sector_num = offset >> BDRV_SECTOR_BITS;
> +int nb_sectors = bytes >> BDRV_SECTOR_BITS;
> +int64_t ret;
> +int psectors;
> +
> +assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
> +assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) &&
> +   bytes < INT_MAX * BDRV_SECTOR_SIZE);

I think we can actually assert bytes <= INT_MAX. Both ways to write the
assertion are only true because of BDRV_REQUEST_MAX_SECTORS, which
ensures that the byte counts fit in an int.

Some of the places that call bdrv_is_allocated() actually depend on this
because your patches tend to use nb_sectors << BDRV_SECTOR_BITS (which
truncates) rather than nb_sectors * BDRV_SECTOR_BYTES (which is always a
64 bit calculation).

> +ret = bdrv_get_block_status(bs, sector_num, nb_sectors, ,
> +);
>  if (ret < 0) {
>  return ret;
>  }
> +if (pnum) {
> +*pnum = psectors * BDRV_SECTOR_SIZE;
> +}
>  return !!(ret & BDRV_BLOCK_ALLOCATED);
>  }

> diff --git a/block/stream.c b/block/stream.c
> index e3dd2ac..e5f2a08 100644
> --- a/block/stream.c
> +++ b/block/stream.c
> @@ -137,6 +137,7 @@ static void coroutine_fn stream_run(void *opaque)
> 
>  for ( ; offset < s->common.len; offset += n * BDRV_SECTOR_SIZE) {
>  bool copy;
> +int64_t count = 0;
> 
>  /* Note that even when 

[Qemu-devel] [PATCH v4 19/21] block: Make bdrv_is_allocated() byte-based

2017-07-05 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  In the common case, allocation is unlikely to ever use
values that are not naturally sector-aligned, but it is possible
that byte-based values will let us be more precise about allocation
at the end of an unaligned file that can do byte-based access.

Changing the signature of the function to use int64_t *pnum ensures
that the compiler enforces that all callers are updated.  For now,
the io.c layer still assert()s that all callers are sector-aligned
on input and that *pnum is sector-aligned on return to the caller,
but that can be relaxed when a later patch implements byte-based
block status.  Therefore, this code adds usages like
DIV_ROUND_UP(,BDRV_SECTOR_SIZE) to callers that still want aligned
values, where the call might reasonbly give non-aligned results
in the future; on the other hand, no rounding is needed for callers
that should just continue to work with byte alignment.

For the most part this patch is just the addition of scaling at the
callers followed by inverse scaling at bdrv_is_allocated().  But
some code, particularly bdrv_commit(), gets a lot simpler because it
no longer has to mess with sectors; also, it is now possible to pass
NULL if the caller does not care how much of the image is allocated
beyond the initial offset.

For ease of review, bdrv_is_allocated_above() will be tackled
separately.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Juan Quintela 
Reviewed-by: Jeff Cody 

---
v4: rebase to vvfat TAB cleanup, R-b kept
v3: no change
v2: rebase to earlier changes, tweak commit message
---
 include/block/block.h |  4 +--
 block/backup.c| 17 -
 block/commit.c| 21 +++-
 block/io.c| 49 +---
 block/stream.c|  5 ++--
 block/vvfat.c | 34 ++---
 migration/block.c |  9 ---
 qemu-img.c|  5 +++-
 qemu-io-cmds.c| 70 +++
 9 files changed, 114 insertions(+), 100 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index a9dc753..d3e01fb 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -427,8 +427,8 @@ int64_t bdrv_get_block_status_above(BlockDriverState *bs,
 int64_t sector_num,
 int nb_sectors, int *pnum,
 BlockDriverState **file);
-int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
-  int *pnum);
+int bdrv_is_allocated(BlockDriverState *bs, int64_t offset, int64_t bytes,
+  int64_t *pnum);
 int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
 int64_t sector_num, int nb_sectors, int *pnum);

diff --git a/block/backup.c b/block/backup.c
index 04def91..b2048bf 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -47,12 +47,6 @@ typedef struct BackupBlockJob {
 QLIST_HEAD(, CowRequest) inflight_reqs;
 } BackupBlockJob;

-/* Size of a cluster in sectors, instead of bytes. */
-static inline int64_t cluster_size_sectors(BackupBlockJob *job)
-{
-  return job->cluster_size / BDRV_SECTOR_SIZE;
-}
-
 /* See if in-flight requests overlap and wait for them to complete */
 static void coroutine_fn wait_for_overlapping_requests(BackupBlockJob *job,
int64_t offset,
@@ -433,7 +427,6 @@ static void coroutine_fn backup_run(void *opaque)
 BackupCompleteData *data;
 BlockDriverState *bs = blk_bs(job->common.blk);
 int64_t offset;
-int64_t sectors_per_cluster = cluster_size_sectors(job);
 int ret = 0;

 QLIST_INIT(>inflight_reqs);
@@ -465,12 +458,13 @@ static void coroutine_fn backup_run(void *opaque)
 }

 if (job->sync_mode == MIRROR_SYNC_MODE_TOP) {
-int i, n;
+int i;
+int64_t n;

 /* Check to see if these blocks are already in the
  * backing file. */

-for (i = 0; i < sectors_per_cluster;) {
+for (i = 0; i < job->cluster_size;) {
 /* bdrv_is_allocated() only returns true/false based
  * on the first set of sectors it comes across that
  * are are all in the same state.
@@ -478,9 +472,8 @@ static void coroutine_fn backup_run(void *opaque)
  * backup cluster length.  We end up copying more than
  * needed but at some point that is always the case. */
 alloced =
-bdrv_is_allocated(bs,
-  (offset >> BDRV_SECTOR_BITS) + i,
-  sectors_per_cluster - i, );
+