Re: [Qemu-block] [Qemu-trivial] [PATCH] nbd-client: Use correct macro parenthesization

2017-09-18 Thread Eric Blake
On 09/18/2017 05:13 PM, Philippe Mathieu-Daudé wrote:
> Hi Eric,
> 
> On 09/18/2017 06:46 PM, Eric Blake wrote:
>> If 'bs' is a complex expression, we were only casting the front half
>> rather than the full expression.  Luckily, none of the callers were
>> passing bad arguments, but it's better to be robust up front.
>>
>> Signed-off-by: Eric Blake 
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> 
> Any magic cocci script to verify there aren't any more?

I don't know if cocci can do it; checkpatch tries to check whether macro
arguments are parenthesized, but even that's prone to missing things.  I
just spotted this particular one while reviewing a related patch.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v2 7/7] block/nbd-client: do not yield from nbd_read_reply_entry

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> Refactor nbd client to not yield from nbd_read_reply_entry. It's
> possible now as all reading is done in nbd_read_reply_entry and
> all request-related data is stored in per-request s->requests[i].
> 
> We need some additional work here with s->requests[i].ret and
> s->quit so that we neither fail requests on a normal disconnect nor
> report reading errors on a normal disconnect.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.c | 29 ++---
>  1 file changed, 10 insertions(+), 19 deletions(-)

The diffstat may have merit, but I'm wondering:

> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index f80a4c5564..bf20d5d5e6 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -79,7 +79,11 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>  while (!s->quit) {
>  ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
>  if (ret < 0) {
> -error_report_err(local_err);
> +if (s->quit) {
> +error_free(local_err);

This discards errors on all remaining coroutines after we detect an
early exit. Are we sure the coroutine that set s->quit will have
reported an appropriate error, explaining why we can discard all
secondary errors?  A comment in the code would be helpful.

> +} else {
> +error_report_err(local_err);
> +}
>  }

> @@ -210,7 +204,7 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>  s->requests[i].receiving = true;
>  qemu_coroutine_yield();
>  s->requests[i].receiving = false;
> -if (!s->ioc || s->quit) {
> +if (!s->ioc) {
>  ret = -EIO;

Why don't we need to check s->quit any more?  That was just added
earlier in the series; why the churn?

>  } else {
>  ret = s->requests[i].ret;
> @@ -218,11 +212,6 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>  
>  s->requests[i].coroutine = NULL;
>  
> -/* Kick the read_reply_co to get the next reply.  */
> -if (s->read_reply_co) {
> -aio_co_wake(s->read_reply_co);
> -}
> -
>  qemu_co_mutex_lock(&s->send_mutex);
>  s->in_flight--;
>  qemu_co_queue_next(&s->free_sema);
> @@ -364,6 +353,8 @@ void nbd_client_close(BlockDriverState *bs)
>  
>  nbd_send_request(client->ioc, &request);
>  
> +client->quit = true;

Previously, client->quit was only set when detecting a broken server,
now it is also set for a clean exit.  Do we need to change any
documentation of the field?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> Do not continue any operation if s->quit is set in parallel.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 280147e6a7..f80a4c5564 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -81,7 +81,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>  if (ret < 0) {
>  error_report_err(local_err);
>  }
> -if (ret <= 0) {
> +if (ret <= 0 || s->quit) {
>  break;
>  }

I'm still not convinced this helps: either nbd_receive_reply() already
returned an error, or we got a successful reply header, at which point
either the command is done (no need to abort the command early - it
succeeded) or it is a read command (and we should detect at the point
where we try to populate the qiov that we don't want to read any more).
It depends on your 3/7 patch for the fact that we even try to read the
qiov directly in this while loop rather than in the coroutine handler,
where Paolo questioned whether we need that change.

> @@ -105,9 +105,8 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>  assert(qiov != NULL);
>  assert(s->requests[i].request->len ==
> iov_size(qiov->iov, qiov->niov));
> -if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
> -  NULL) < 0)
> -{
> +ret = qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov, NULL);
> +if (ret < 0 || s->quit) {
>  s->requests[i].ret = -EIO;
>  break;
>  }

The placement here looks odd. Either we should not attempt the read
because s->quit was already set (okay, your first addition makes sense
in that light), or we succeeded at the read (at which point, why do we
need to claim it failed?).

I'm trying to look forward to structured reads, where we will have to
deal with more than one server message in reply to a single client
request.  For read, we just piece together portions of the qiov until
the server has sent us all the pieces.  But for block query, I really do
think we'll want to defer to specialized handlers for doing further
reads (the amount of data to be read from the server is not known by the
client a priori, so it is a two-part read, one to get the length, and
another to read that remaining length); if we need some sort of callback
function rather than cramming all the logic here in the main read loop,
then the qio_channel_readv_all for read commands would belong in the
callback, and it is more a question of the main read loop checking
whether there is an early quit condition before calling into the callback.
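
A minimal sketch of that shape (all names are hypothetical — this is not the qemu API, just the dispatch structure described above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-command payload handler: reads whatever the server still owes
 * for this reply (possibly nothing, possibly several chunks). */
typedef int (*ReplyPayloadCb)(void *opaque);

typedef struct Request {
    ReplyPayloadCb read_payload;   /* NULL if the reply carries no payload */
    void *opaque;
    int ret;
} Request;

static bool quit;

/* Main reply loop body: check the early-quit condition first, then
 * defer any further reads to the command-specific callback. */
static int handle_one_reply(Request *req)
{
    if (quit) {
        return -1;
    }
    if (req->read_payload) {
        req->ret = req->read_payload(req->opaque);
    }
    return req->ret;
}
```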

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-trivial] [PATCH] nbd-client: Use correct macro parenthesization

2017-09-18 Thread Philippe Mathieu-Daudé

Hi Eric,

On 09/18/2017 06:46 PM, Eric Blake wrote:

If 'bs' is a complex expression, we were only casting the front half
rather than the full expression.  Luckily, none of the callers were
passing bad arguments, but it's better to be robust up front.

Signed-off-by: Eric Blake 


Reviewed-by: Philippe Mathieu-Daudé 

Any magic cocci script to verify there aren't any more?


---

I plan to take this through my NBD queue, unless it is picked up
by qemu-trivial first...

  block/nbd-client.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 5f241ecc22..2a528dd217 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -31,8 +31,8 @@
  #include "qapi/error.h"
  #include "nbd-client.h"

-#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ ((uint64_t)(intptr_t)bs))
-#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ ((uint64_t)(intptr_t)bs))
+#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
+#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))

  static void nbd_recv_coroutines_wake_all(NBDClientSession *s)
  {





Re: [Qemu-block] [PATCH v2 4/7] block/nbd-client: drop reply field from NBDClientSession

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> Drop 'reply' from NBDClientSession. It is used only to deliver the
> request return code to the receiving coroutine. Instead, introduce a
> per-request ret variable to simplify the scheme and make further
> refactoring possible.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.h |  2 +-
>  block/nbd-client.c | 22 +-
>  2 files changed, 10 insertions(+), 14 deletions(-)
> 
> diff --git a/block/nbd-client.h b/block/nbd-client.h
> index 3f97d31013..4bc8fe3993 100644
> --- a/block/nbd-client.h
> +++ b/block/nbd-client.h
> @@ -22,6 +22,7 @@ typedef struct {
>  bool receiving; /* waiting for read_reply_co? */
>  NBDRequest *request;
>  QEMUIOVector *qiov;
> +int ret;
>  } NBDClientRequest;

I like this idea.  However, I note that:

> @@ -211,11 +211,7 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>  if (!s->ioc || s->quit) {
>  ret = -EIO;
>  } else {
> -assert(s->reply.handle == request->handle);
> -ret = -s->reply.error;
> -
> -/* Tell the read handler to read another header.  */
> -s->reply.handle = 0;
> +ret = s->requests[i].ret;

you are removing the assert you added in 2/7, where I questioned whether
we needed NBDClientRequest.request in the first place.  So there may be
some rebase churn, depending on how that conversation pans out.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH] nbd-client: Use correct macro parenthesization

2017-09-18 Thread Eric Blake
If 'bs' is a complex expression, we were only casting the front half
rather than the full expression.  Luckily, none of the callers were
passing bad arguments, but it's better to be robust up front.

Signed-off-by: Eric Blake 
---

I plan to take this through my NBD queue, unless it is picked up
by qemu-trivial first...

 block/nbd-client.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 5f241ecc22..2a528dd217 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -31,8 +31,8 @@
 #include "qapi/error.h"
 #include "nbd-client.h"

-#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ ((uint64_t)(intptr_t)bs))
-#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ ((uint64_t)(intptr_t)bs))
+#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
+#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))

 static void nbd_recv_coroutines_wake_all(NBDClientSession *s)
 {
-- 
2.13.5




Re: [Qemu-block] [Qemu-devel] [PATCH] block/throttle-groups.c: allocate RestartData on the heap

2017-09-18 Thread Eric Blake
On 09/18/2017 03:25 PM, Manos Pitsidianakis wrote:
> RestartData is the opaque data of the throttle_group_restart_queue_entry
> coroutine. By being stack allocated, it isn't available anymore if
> aio_co_enter schedules the coroutine with a bottom halve and runs after

s/halve/half/

> throttle_group_restart_queue returns.
> 
> Signed-off-by: Manos Pitsidianakis 
> ---
>  block/throttle-groups.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH] block/throttle-groups.c: allocate RestartData on the heap

2017-09-18 Thread Manos Pitsidianakis
RestartData is the opaque data of the throttle_group_restart_queue_entry
coroutine. By being stack allocated, it isn't available anymore if
aio_co_enter schedules the coroutine with a bottom halve and runs after
throttle_group_restart_queue returns.

Signed-off-by: Manos Pitsidianakis 
---
 block/throttle-groups.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 6ba992c8d7..b291a88481 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -403,17 +403,19 @@ static void coroutine_fn throttle_group_restart_queue_entry(void *opaque)
 schedule_next_request(tgm, is_write);
 qemu_mutex_unlock(&tg->lock);
 }
+
+g_free(data);
 }
 
static void throttle_group_restart_queue(ThrottleGroupMember *tgm, bool is_write)
 {
 Coroutine *co;
-RestartData rd = {
-.tgm = tgm,
-.is_write = is_write
-};
+RestartData *rd = g_new0(RestartData, 1);
 
-co = qemu_coroutine_create(throttle_group_restart_queue_entry, &rd);
+rd->tgm = tgm;
+rd->is_write = is_write;
+
+co = qemu_coroutine_create(throttle_group_restart_queue_entry, rd);
 aio_co_enter(tgm->aio_context, co);
 }
 
-- 
2.11.0
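
The lifetime bug being fixed can be reduced to this sketch (the toy one-slot scheduler stands in for aio_co_enter deferring to a bottom half; all names are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct RestartData {
    int is_write;
} RestartData;

typedef void (*CoroutineFn)(void *opaque);

/* Toy one-slot scheduler: the "coroutine" runs some time after the
 * scheduling call has returned, so its opaque pointer must not live
 * on the scheduler caller's stack. */
static CoroutineFn pending_co;
static void *pending_opaque;

static void schedule_co(CoroutineFn co, void *opaque)
{
    pending_co = co;
    pending_opaque = opaque;
}

static int last_is_write = -1;

static void restart_queue_entry(void *opaque)
{
    RestartData *data = opaque;

    last_is_write = data->is_write;
    free(data);                          /* the entry point owns the data */
}

static void restart_queue(int is_write)
{
    RestartData *rd = malloc(sizeof(*rd));   /* heap, as in the patch */

    rd->is_write = is_write;
    schedule_co(restart_queue_entry, rd);
}   /* returns before the entry runs; rd stays valid on the heap */
```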




Re: [Qemu-block] [PATCH v2 2/7] block/nbd-client: exit reply-reading coroutine on incorrect handle

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> Check reply->handle == request->handle in the same place where the
> recv coroutine number is calculated from reply->handle and its
> correctness is checked - in nbd_read_reply_entry.
> 
> Also, finish nbd_read_reply_entry on reply->handle !=
> request->handle in the same way as for an incorrect reply->handle.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.h | 1 +
>  block/nbd-client.c | 8 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)

On second thought:

> +++ b/block/nbd-client.c
> @@ -92,7 +92,9 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>  i = HANDLE_TO_INDEX(s, s->reply.handle);
>  if (i >= MAX_NBD_REQUESTS ||
>  !s->requests[i].coroutine ||
> -!s->requests[i].receiving) {
> +!s->requests[i].receiving ||
> +s->reply.handle != s->requests[i].request->handle)

How can this condition ever be false?  s->reply.handle only ever
contains two values: 0 when it is being reused for the next iteration of
the loop, or the (munged) address of the request pointer.  But we are
careful that we never write s->reply.handle = 0 until after
s->requests[i].receiving is false.  So we will never see s->reply.handle
equal to 0 here.  (It may have been possible prior to the introduction
of reply.receiving, in commit 40f4a218, but I'm not seeing it now)

If I'm right, then this patch can be simplified - we don't need to track
s->requests[i].request, and can merely:

> @@ -189,9 +192,10 @@ static int nbd_co_receive_reply(NBDClientSession *s,
>  s->requests[i].receiving = true;
>  qemu_coroutine_yield();
>  s->requests[i].receiving = false;
> -if (s->reply.handle != request->handle || !s->ioc || s->quit) {

shorten this conditional to remove the now-dead condition.

> +if (!s->ioc || s->quit) {
>  ret = -EIO;
>  } else {
> +assert(s->reply.handle == request->handle);
>  ret = -s->reply.error;
>  if (qiov && s->reply.error == 0) {
>  assert(request->len == iov_size(qiov->iov, qiov->niov));
> 

[Oh, and I just noticed HANDLE_TO_INDEX() and INDEX_TO_HANDLE() are
improperly parenthesized, although to no ill effect with current clients]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] Disk Image Read Location

2017-09-18 Thread Max Reitz
On 2017-09-18 20:47, Derrick McKee wrote:
> Hi,
> 
> I am trying to figure out if a particular disk operation is reading from
> the same location in an image file.  I am simulating a USB disk, and during
> an operation the guest OS performs, I would like to know how many times the
> backing image file is read.  I have identified the function
> address_space_write as the function that writes the data to the guest OS
> from the image, but I am not sure what to compare with past writes.  There
> are AddressSpace*, MemoryRegion*, and hwaddr types in that function.  Can
> any of those be used to compare image read locations?  I am hoping that if,
> for example, the AddressSpace* passed into address_space_write during one
> invocation equals the AddressSpace* of another invocation means that the
> same location within the image file was read twice.  Any help would be
> appreciated.

If you're referring to images as in images of block devices, then the
block layer contains the functionality you're looking for.

bdrv_driver_preadv() from block/io.c is a function that every read
request should go through.  However, at this point it's already aligned
to (host) sectors, so maybe bdrv_co_preadv() is better suited.

But then again, because of how the block layer works you'll likely see
every read request twice there (once for the disk image's format (e.g.
qcow2 or raw) and once for the storage backend used for the image (e.g.
file for normal files or http)).  So you either correct for that or you
instead go one layer above into block/block-backend.c.

All read requests from the guest go through either blk_co_preadv() or
blk_aio_preadv(), and they should go through there only once.

Max
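
One way to do the comparison Derrick asks about is to tally the file offsets seen by a probe in one of those functions - sketched here with a plain array (the probe location and the 512-byte bucketing are assumptions, not existing qemu code):

```c
#include <assert.h>
#include <stdint.h>

#define TRACKED_SECTORS 64

static unsigned read_count[TRACKED_SECTORS];

/* Call this from a probe in e.g. blk_co_preadv() with the request's
 * byte offset; equal offsets mean the same image location was read. */
static void note_read(int64_t offset)
{
    int64_t slot = offset / 512;

    if (slot >= 0 && slot < TRACKED_SECTORS) {
        read_count[slot]++;
    }
}
```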





Re: [Qemu-block] [Qemu-devel] [PATCH v2 5/7] block/nbd-client: nbd_co_send_request: return -EIO if s->quit was set in parallel

2017-09-18 Thread Eric Blake
On 09/18/2017 11:01 AM, Eric Blake wrote:
> On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
>> It's incorrect to return success rc >= 0 if we skip qio_channel_writev_all()
>> call due to s->quit.
> 
> Does this need to cc: qemu-stable for 2.10.1 (or put another way, can we
> come up with some scenario of EAGAIN or other handling that would
> actually set s->quit in a parallel coroutine when a client sends out
> multiple requests at once)?

Also, long subject line (longer than 80 columns!).  I'd suggest
something like:

nbd-client: fail with -EIO on early exit of nbd_co_send_request

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v8 19/20] dirty-bitmap: Switch bdrv_set_dirty() to bytes

2017-09-18 Thread Eric Blake
Both callers already had bytes available, but were scaling to
sectors.  Move the scaling to internal code.  In the case of
bdrv_aligned_pwritev(), we are now passing the exact offset
rather than a rounded sector-aligned value, but that's okay
as long as dirty bitmap widens start/bytes to granularity
boundaries.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v4: only context changes
v3: rebase to lock context changes, R-b kept
v2: no change
---
 include/block/block_int.h | 2 +-
 block/io.c| 6 ++
 block/dirty-bitmap.c  | 7 ---
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index ba4c383393..55c5d573d4 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1021,7 +1021,7 @@ void blk_dev_eject_request(BlockBackend *blk, bool force);
 bool blk_dev_is_tray_open(BlockBackend *blk);
 bool blk_dev_is_medium_locked(BlockBackend *blk);

-void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector, int64_t nr_sect);
+void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes);
 bool bdrv_requests_pending(BlockDriverState *bs);

 void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out);
diff --git a/block/io.c b/block/io.c
index 4378ae4c7d..8a0cd8835a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1334,7 +1334,6 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 bool waited;
 int ret;

-int64_t start_sector = offset >> BDRV_SECTOR_BITS;
 int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 uint64_t bytes_remaining = bytes;
 int max_transfer;
@@ -1409,7 +1408,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
 bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE);

 atomic_inc(&bs->write_gen);
-bdrv_set_dirty(bs, start_sector, end_sector - start_sector);
+bdrv_set_dirty(bs, offset, bytes);

 stat64_max(&bs->wr_highest_offset, offset + bytes);

@@ -2438,8 +2437,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 ret = 0;
 out:
 atomic_inc(&bs->write_gen);
-bdrv_set_dirty(bs, req.offset >> BDRV_SECTOR_BITS,
-   req.bytes >> BDRV_SECTOR_BITS);
+bdrv_set_dirty(bs, req.offset, req.bytes);
 tracked_request_end(&req);
 bdrv_dec_in_flight(bs);
 return ret;
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 117837b3cc..58a3f330a9 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -628,10 +628,10 @@ void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap)
 hbitmap_deserialize_finish(bitmap->bitmap);
 }

-void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
-int64_t nr_sectors)
+void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
 BdrvDirtyBitmap *bitmap;
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);

 if (QLIST_EMPTY(&bs->dirty_bitmaps)) {
 return;
@@ -643,7 +643,8 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
 continue;
 }
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
+hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
+end_sector - (offset >> BDRV_SECTOR_BITS));
 }
 bdrv_dirty_bitmaps_unlock(bs);
 }
-- 
2.13.5
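
The scaling that moves into bdrv_set_dirty() amounts to this (a standalone sketch of the byte-to-sector conversion, widening the end of the range up to a sector boundary):

```c
#include <assert.h>
#include <stdint.h>

#define BDRV_SECTOR_BITS 9
#define BDRV_SECTOR_SIZE (1LL << BDRV_SECTOR_BITS)
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Widen the byte range [offset, offset + bytes) to the sector range
 * covering it, as the internal (still sector-based) hbitmap needs. */
static void byte_range_to_sectors(int64_t offset, int64_t bytes,
                                  int64_t *first_sector, int64_t *nr_sectors)
{
    int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);

    *first_sector = offset >> BDRV_SECTOR_BITS;
    *nr_sectors = end_sector - *first_sector;
}
```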




[Qemu-block] [PATCH v8 17/20] qcow2: Switch load_bitmap_data() to byte-based iteration

2017-09-18 Thread Eric Blake
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: no change
v4: new patch
---
 block/qcow2-bitmap.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index b09010b1d3..692ce0de88 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -291,9 +291,8 @@ static int load_bitmap_data(BlockDriverState *bs,
 {
 int ret = 0;
 BDRVQcow2State *s = bs->opaque;
-uint64_t sector, limit, sbc;
+uint64_t offset, limit;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
-uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 uint8_t *buf = NULL;
 uint64_t i, tab_size =
 size_to_clusters(s,
@@ -305,32 +304,27 @@ static int load_bitmap_data(BlockDriverState *bs,

 buf = g_malloc(s->cluster_size);
 limit = bytes_covered_by_bitmap_cluster(s, bitmap);
-sbc = limit >> BDRV_SECTOR_BITS;
-for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
-uint64_t count = MIN(bm_sectors - sector, sbc);
+for (i = 0, offset = 0; i < tab_size; ++i, offset += limit) {
+uint64_t count = MIN(bm_size - offset, limit);
 uint64_t entry = bitmap_table[i];
-uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;
+uint64_t data_offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;

 assert(check_table_entry(entry, s->cluster_size) == 0);

-if (offset == 0) {
+if (data_offset == 0) {
 if (entry & BME_TABLE_ENTRY_FLAG_ALL_ONES) {
-bdrv_dirty_bitmap_deserialize_ones(bitmap,
-   sector * BDRV_SECTOR_SIZE,
-   count * BDRV_SECTOR_SIZE,
+bdrv_dirty_bitmap_deserialize_ones(bitmap, offset, count,
false);
 } else {
 /* No need to deserialize zeros because the dirty bitmap is
  * already cleared */
 }
 } else {
-ret = bdrv_pread(bs->file, offset, buf, s->cluster_size);
+ret = bdrv_pread(bs->file, data_offset, buf, s->cluster_size);
 if (ret < 0) {
 goto finish;
 }
-bdrv_dirty_bitmap_deserialize_part(bitmap, buf,
-   sector * BDRV_SECTOR_SIZE,
-   count * BDRV_SECTOR_SIZE,
+bdrv_dirty_bitmap_deserialize_part(bitmap, buf, offset, count,
false);
 }
 }
-- 
2.13.5
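
The new loop's arithmetic, reduced to a helper (the concrete sizes in the test are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Byte count covered by bitmap table entry i when each bitmap cluster
 * covers 'limit' bytes and the whole bitmap covers 'bm_size' bytes:
 * full clusters except for a clamped final chunk. */
static uint64_t chunk_bytes(uint64_t bm_size, uint64_t limit, uint64_t i)
{
    uint64_t offset = i * limit;

    return MIN(bm_size - offset, limit);
}
```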




[Qemu-block] [PATCH v8 13/20] dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes

2017-09-18 Thread Eric Blake
Half the callers were already scaling bytes to sectors; the other
half can eventually be simplified to use byte iteration.  Both
callers were already using the result as a bool, so make that
explicit.  Making the change also makes it easier for a future
dirty-bitmap patch to offload scaling over to the internal hbitmap.

Remember, asking whether a byte is dirty is effectively asking
whether the entire granularity containing the byte is dirty, since
we only track dirtiness by granularity.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Juan Quintela 

---
v4: only context change
v3: rebase to _locked rename was straightforward enough that R-b kept
v2: tweak commit message, no code change
---
 include/block/dirty-bitmap.h | 4 ++--
 block/dirty-bitmap.c | 8 
 block/mirror.c   | 3 +--
 migration/block.c| 3 ++-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 757fc4c5b8..9e39537e4b 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -72,8 +72,8 @@ void bdrv_dirty_bitmap_set_persistance(BdrvDirtyBitmap *bitmap,
 /* Functions that require manual locking.  */
 void bdrv_dirty_bitmap_lock(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap);
-int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-  int64_t sector);
+bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
+   int64_t offset);
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   int64_t cur_sector, int64_t nr_sectors);
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 8322e23f0d..ad559c62b1 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -438,13 +438,13 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
-int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-  int64_t sector)
+bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
+   int64_t offset)
 {
 if (bitmap) {
-return hbitmap_get(bitmap->bitmap, sector);
+return hbitmap_get(bitmap->bitmap, offset >> BDRV_SECTOR_BITS);
 } else {
-return 0;
+return false;
 }
 }

diff --git a/block/mirror.c b/block/mirror.c
index 7113d47db4..e36fc81df3 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -361,8 +361,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
 int64_t next_offset = offset + nb_chunks * s->granularity;
 int64_t next_chunk = next_offset / s->granularity;
 if (next_offset >= s->bdev_length ||
-!bdrv_get_dirty_locked(source, s->dirty_bitmap,
-   next_offset >> BDRV_SECTOR_BITS)) {
+!bdrv_get_dirty_locked(source, s->dirty_bitmap, next_offset)) {
 break;
 }
 if (test_bit(next_chunk, s->in_flight_bitmap)) {
diff --git a/migration/block.c b/migration/block.c
index a3512945da..b618869661 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -530,7 +530,8 @@ static int mig_save_device_dirty(QEMUFile *f, BlkMigDevState *bmds,
 blk_mig_unlock();
 }
 bdrv_dirty_bitmap_lock(bmds->dirty_bitmap);
-if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap, sector)) {
+if (bdrv_get_dirty_locked(bs, bmds->dirty_bitmap,
+  sector * BDRV_SECTOR_SIZE)) {
 if (total_sectors - sector < BDRV_SECTORS_PER_DIRTY_CHUNK) {
 nr_sectors = total_sectors - sector;
 } else {
-- 
2.13.5
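
The "dirtiness by granularity" point from the commit message can be illustrated with a toy bitmap (the chunk size and table length are arbitrary):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GRANULARITY 65536   /* illustrative granularity */

static bool chunk_dirty[16];

/* Mark every granularity chunk touched by [offset, offset + bytes). */
static void set_dirty(int64_t offset, int64_t bytes)
{
    int64_t first = offset / GRANULARITY;
    int64_t last = (offset + bytes - 1) / GRANULARITY;

    for (int64_t i = first; i <= last; i++) {
        chunk_dirty[i] = true;
    }
}

/* Asking about one byte reports the state of its whole chunk. */
static bool get_dirty(int64_t offset)
{
    return chunk_dirty[offset / GRANULARITY];
}
```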




[Qemu-block] [PATCH v8 15/20] mirror: Switch mirror_dirty_init() to byte-based iteration

2017-09-18 Thread Eric Blake
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v6: no change
v5: rebase to earlier changes
v2-v4: no change
---
 block/mirror.c | 38 ++
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index c2f73c91c5..5cdaaed7be 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -612,15 +612,13 @@ static void mirror_throttle(MirrorBlockJob *s)

 static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 {
-int64_t sector_num, end;
+int64_t offset;
 BlockDriverState *base = s->base;
 BlockDriverState *bs = s->source;
 BlockDriverState *target_bs = blk_bs(s->target);
-int ret, n;
+int ret;
 int64_t count;

-end = s->bdev_length / BDRV_SECTOR_SIZE;
-
 if (base == NULL && !bdrv_has_zero_init(target_bs)) {
 if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
 bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, s->bdev_length);
@@ -628,9 +626,9 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 }

 s->initial_zeroing_ongoing = true;
-for (sector_num = 0; sector_num < end; ) {
-int nb_sectors = MIN(end - sector_num,
-QEMU_ALIGN_DOWN(INT_MAX, s->granularity) >> BDRV_SECTOR_BITS);
+for (offset = 0; offset < s->bdev_length; ) {
+int bytes = MIN(s->bdev_length - offset,
+QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

 mirror_throttle(s);

@@ -646,9 +644,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 continue;
 }

-mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE, false);
-sector_num += nb_sectors;
+mirror_do_zero_or_discard(s, offset, bytes, false);
+offset += bytes;
 }

 mirror_wait_for_all_io(s);
@@ -656,10 +653,10 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 }

 /* First part, loop on the sectors and initialize the dirty bitmap.  */
-for (sector_num = 0; sector_num < end; ) {
+for (offset = 0; offset < s->bdev_length; ) {
 /* Just to make sure we are not exceeding int limit. */
-int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
- end - sector_num);
+int bytes = MIN(s->bdev_length - offset,
+QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

 mirror_throttle(s);

@@ -667,23 +664,16 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 return 0;
 }

-ret = bdrv_is_allocated_above(bs, base, sector_num * BDRV_SECTOR_SIZE,
-  nb_sectors * BDRV_SECTOR_SIZE, &n);
+ret = bdrv_is_allocated_above(bs, base, offset, bytes, &count);
 if (ret < 0) {
 return ret;
 }

-/* TODO: Relax this once bdrv_is_allocated_above and dirty
- * bitmaps no longer require sector alignment. */
-assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-n = count >> BDRV_SECTOR_BITS;
-assert(n > 0);
+assert(count);
 if (ret == 1) {
-bdrv_set_dirty_bitmap(s->dirty_bitmap,
-  sector_num * BDRV_SECTOR_SIZE,
-  n * BDRV_SECTOR_SIZE);
+bdrv_set_dirty_bitmap(s->dirty_bitmap, offset, count);
 }
-sector_num += n;
+offset += count;
 }
 return 0;
 }
-- 
2.13.5
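
Both loops now pick their chunk size the same way; reduced to a helper (the 64k granularity in the test is illustrative):

```c
#include <assert.h>
#include <limits.h>

#define QEMU_ALIGN_DOWN(n, m) (((n) / (m)) * (m))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* At most INT_MAX bytes per iteration, rounded down so that full-sized
 * chunks stay aligned to the dirty-bitmap granularity. */
static long long next_chunk(long long bdev_length, long long offset,
                            int granularity)
{
    return MIN(bdev_length - offset,
               (long long)QEMU_ALIGN_DOWN(INT_MAX, granularity));
}
```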




[Qemu-block] [PATCH v8 08/20] dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes

2017-09-18 Thread Eric Blake
Right now, the dirty-bitmap code exposes the fact that we use
a scale of sector granularity in the underlying hbitmap to anything
that wants to serialize a dirty bitmap.  It's nicer to uniformly
expose bytes as our dirty-bitmap interface, matching the previous
change to bitmap size.  The only caller to serialization is currently
qcow2-cluster.c, which becomes a bit more verbose because it is still
tracking sectors for other reasons, but a later patch will fix that
to more uniformly use byte offsets everywhere.  Likewise, within
dirty-bitmap, we have to add more assertions that we are not
truncating incorrectly, which can go away once the internal hbitmap
is byte-based rather than sector-based.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: no change
v4: new patch
---
 include/block/dirty-bitmap.h | 14 +++---
 block/dirty-bitmap.c | 37 -
 block/qcow2-bitmap.c | 22 ++
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 7a27590047..5f34a1a3c7 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -49,19 +49,19 @@ BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
-  uint64_t start, uint64_t count);
+  uint64_t offset, uint64_t bytes);
 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
-  uint8_t *buf, uint64_t start,
-  uint64_t count);
+  uint8_t *buf, uint64_t offset,
+  uint64_t bytes);
 void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
-uint8_t *buf, uint64_t start,
-uint64_t count, bool finish);
+uint8_t *buf, uint64_t offset,
+uint64_t bytes, bool finish);
 void bdrv_dirty_bitmap_deserialize_zeroes(BdrvDirtyBitmap *bitmap,
-  uint64_t start, uint64_t count,
+  uint64_t offset, uint64_t bytes,
   bool finish);
 void bdrv_dirty_bitmap_deserialize_ones(BdrvDirtyBitmap *bitmap,
-uint64_t start, uint64_t count,
+uint64_t offset, uint64_t bytes,
 bool finish);
 void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap);

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 6d32554b4e..555c736024 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -568,42 +568,53 @@ void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap 
*bitmap, HBitmap *in)
 }

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
-  uint64_t start, uint64_t count)
+  uint64_t offset, uint64_t bytes)
 {
-return hbitmap_serialization_size(bitmap->bitmap, start, count);
+assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
+return hbitmap_serialization_size(bitmap->bitmap,
+  offset >> BDRV_SECTOR_BITS,
+  bytes >> BDRV_SECTOR_BITS);
 }

 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
 {
-return hbitmap_serialization_align(bitmap->bitmap);
+return hbitmap_serialization_align(bitmap->bitmap) * BDRV_SECTOR_SIZE;
 }

 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
-  uint8_t *buf, uint64_t start,
-  uint64_t count)
+  uint8_t *buf, uint64_t offset,
+  uint64_t bytes)
 {
-hbitmap_serialize_part(bitmap->bitmap, buf, start, count);
+assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
+hbitmap_serialize_part(bitmap->bitmap, buf, offset >> BDRV_SECTOR_BITS,
+   bytes >> BDRV_SECTOR_BITS);
 }

 void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
-uint8_t *buf, uint64_t start,
-uint64_t count, bool finish)
+uint8_t *buf, uint64_t offset,
+uint64_t bytes, bool finish)
 {
-

[Qemu-block] [PATCH v8 12/20] dirty-bitmap: Change bdrv_get_dirty_count() to report bytes

2017-09-18 Thread Eric Blake
Thanks to recent cleanups, all callers were scaling a return value
of sectors into bytes; do the scaling internally instead.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v8: no change, add R-b
v7: fix one more trace caller [Kevin]
v4-v6: no change
v3: no change, add R-b
v2: no change
---
 block/dirty-bitmap.c |  4 ++--
 block/mirror.c   | 16 ++--
 migration/block.c|  2 +-
 3 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index e451916187..8322e23f0d 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -423,7 +423,7 @@ BlockDirtyInfoList 
*bdrv_query_dirty_bitmaps(BlockDriverState *bs)
    QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) {
 BlockDirtyInfo *info = g_new0(BlockDirtyInfo, 1);
 BlockDirtyInfoList *entry = g_new0(BlockDirtyInfoList, 1);
-info->count = bdrv_get_dirty_count(bm) << BDRV_SECTOR_BITS;
+info->count = bdrv_get_dirty_count(bm);
 info->granularity = bdrv_dirty_bitmap_granularity(bm);
 info->has_name = !!bm->name;
 info->name = g_strdup(bm->name);
@@ -652,7 +652,7 @@ void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t 
offset)

 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
 {
-return hbitmap_count(bitmap->bitmap);
+return hbitmap_count(bitmap->bitmap) << BDRV_SECTOR_BITS;
 }

 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap)
diff --git a/block/mirror.c b/block/mirror.c
index 77bf5aa3a4..7113d47db4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -340,8 +340,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 if (offset < 0) {
 bdrv_set_dirty_iter(s->dbi, 0);
 offset = bdrv_dirty_iter_next(s->dbi);
-trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) *
-  BDRV_SECTOR_SIZE);
+trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
 assert(offset >= 0);
 }
 bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
@@ -811,11 +810,10 @@ static void coroutine_fn mirror_run(void *opaque)

 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
 /* s->common.offset contains the number of bytes already processed so
- * far, cnt is the number of dirty sectors remaining and
+ * far, cnt is the number of dirty bytes remaining and
  * s->bytes_in_flight is the number of bytes currently being
  * processed; together those are the current total operation length */
-s->common.len = s->common.offset + s->bytes_in_flight +
-cnt * BDRV_SECTOR_SIZE;
+s->common.len = s->common.offset + s->bytes_in_flight + cnt;

 /* Note that even when no rate limit is applied we need to yield
  * periodically with no pending I/O so that bdrv_drain_all() returns.
@@ -827,8 +825,7 @@ static void coroutine_fn mirror_run(void *opaque)
 s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
 if (s->in_flight >= MAX_IN_FLIGHT || s->buf_free_count == 0 ||
 (cnt == 0 && s->in_flight > 0)) {
-trace_mirror_yield(s, cnt * BDRV_SECTOR_SIZE,
-   s->buf_free_count, s->in_flight);
+trace_mirror_yield(s, cnt, s->buf_free_count, s->in_flight);
 mirror_wait_for_io(s);
 continue;
 } else if (cnt != 0) {
@@ -869,7 +866,7 @@ static void coroutine_fn mirror_run(void *opaque)
  * whether to switch to target check one last time if I/O has
  * come in the meanwhile, and if not flush the data to disk.
  */
-trace_mirror_before_drain(s, cnt * BDRV_SECTOR_SIZE);
+trace_mirror_before_drain(s, cnt);

 bdrv_drained_begin(bs);
 cnt = bdrv_get_dirty_count(s->dirty_bitmap);
@@ -888,8 +885,7 @@ static void coroutine_fn mirror_run(void *opaque)
 }

 ret = 0;
-trace_mirror_before_sleep(s, cnt * BDRV_SECTOR_SIZE,
-  s->synced, delay_ns);
+trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
 if (!s->synced) {
        block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
        if (block_job_is_cancelled(&s->common)) {
diff --git a/migration/block.c b/migration/block.c
index 9171f60028..a3512945da 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -667,7 +667,7 @@ static int64_t get_remaining_dirty(void)
 aio_context_release(blk_get_aio_context(bmds->blk));
 }

-return dirty << BDRV_SECTOR_BITS;
+return dirty;
 }


-- 
2.13.5




[Qemu-block] [PATCH v8 20/20] dirty-bitmap: Convert internal hbitmap size/granularity

2017-09-18 Thread Eric Blake
Now that all callers are using byte-based interfaces, there's no
reason for our internal hbitmap to remain with sector-based
granularity.  It also simplifies our internal scaling, since we
already know that hbitmap widens requests out to granularity
boundaries.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v8: rebase to earlier truncate changes (R-b kept)
v7: rebase to dirty_iter_next cleanup (no semantic change, R-b kept)
v6: no change
v5: fix bdrv_dirty_bitmap_truncate [John]
v4: rebase to earlier changes, include serialization, R-b dropped
v3: no change
v2: no change
---
 block/dirty-bitmap.c | 62 +++-
 1 file changed, 18 insertions(+), 44 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 58a3f330a9..bd04e991b1 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -38,7 +38,7 @@
  */
 struct BdrvDirtyBitmap {
 QemuMutex *mutex;
-HBitmap *bitmap;/* Dirty sector bitmap implementation */
+HBitmap *bitmap;/* Dirty bitmap implementation */
 HBitmap *meta;  /* Meta dirty bitmap */
 BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */
 char *name; /* Optional non-empty unique ID */
@@ -130,12 +130,7 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState 
*bs,
 }
 bitmap = g_new0(BdrvDirtyBitmap, 1);
    bitmap->mutex = &bs->dirty_bitmap_mutex;
-/*
- * TODO - let hbitmap track full granularity. For now, it is tracking
- * only sector granularity, as a shortcut for our iterators.
- */
-bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE),
-   ctz32(granularity) - BDRV_SECTOR_BITS);
+bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(granularity));
 bitmap->size = bitmap_size;
 bitmap->name = g_strdup(name);
 bitmap->disabled = false;
@@ -312,7 +307,7 @@ void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, 
int64_t bytes)
    QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
 assert(!bdrv_dirty_bitmap_frozen(bitmap));
 assert(!bitmap->active_iterators);
-hbitmap_truncate(bitmap->bitmap, DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE));
+hbitmap_truncate(bitmap->bitmap, bytes);
 bitmap->size = bytes;
 }
 bdrv_dirty_bitmaps_unlock(bs);
@@ -442,7 +437,7 @@ bool bdrv_get_dirty_locked(BlockDriverState *bs, 
BdrvDirtyBitmap *bitmap,
int64_t offset)
 {
 if (bitmap) {
-return hbitmap_get(bitmap->bitmap, offset >> BDRV_SECTOR_BITS);
+return hbitmap_get(bitmap->bitmap, offset);
 } else {
 return false;
 }
@@ -470,7 +465,7 @@ uint32_t 
bdrv_get_default_bitmap_granularity(BlockDriverState *bs)

 uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap)
 {
-return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
+return 1U << hbitmap_granularity(bitmap->bitmap);
 }

 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap)
@@ -503,20 +498,16 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)

 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-int64_t ret = hbitmap_iter_next(&iter->hbi);
-return ret < 0 ? -1 : ret * BDRV_SECTOR_SIZE;
+return hbitmap_iter_next(&iter->hbi);
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
   int64_t offset, int64_t bytes)
 {
-int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
-
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
-end_sector - (offset >> BDRV_SECTOR_BITS));
+hbitmap_set(bitmap->bitmap, offset, bytes);
 }

 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
@@ -531,12 +522,9 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 int64_t offset, int64_t bytes)
 {
-int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
-
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_reset(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
-  end_sector - (offset >> BDRV_SECTOR_BITS));
+hbitmap_reset(bitmap->bitmap, offset, bytes);
 }

 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
@@ -556,8 +544,7 @@ void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, 
HBitmap **out)
 hbitmap_reset_all(bitmap->bitmap);
 } else {
 HBitmap *backup = bitmap->bitmap;
-bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap->size,
-BDRV_SECTOR_SIZE),
+bitmap->bitmap = hbitmap_alloc(bitmap->size,
 

[Qemu-block] [PATCH v8 16/20] qcow2: Switch qcow2_measure() to byte-based iteration

2017-09-18 Thread Eric Blake
This is new code, but it is easier to read if it makes passes over
the image using bytes rather than sectors (and will get easier in
the future when bdrv_get_block_status is converted to byte-based).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v7: tweak constant given to MIN (no semantic change, R-b kept) [Kevin]
v6: separate bug fix to earlier patch
v5: new patch
---
 block/qcow2.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index bae5893327..64dcd98a91 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3647,20 +3647,19 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, 
BlockDriverState *in_bs,
  */
 required = virtual_size;
 } else {
-int cluster_sectors = cluster_size / BDRV_SECTOR_SIZE;
-int64_t sector_num;
+int64_t offset;
 int pnum = 0;

-for (sector_num = 0;
- sector_num < ssize / BDRV_SECTOR_SIZE;
- sector_num += pnum) {
-int nb_sectors = MIN(ssize / BDRV_SECTOR_SIZE - sector_num,
- BDRV_REQUEST_MAX_SECTORS);
+for (offset = 0; offset < ssize;
+ offset += pnum * BDRV_SECTOR_SIZE) {
+int nb_sectors = MIN(ssize - offset,
+ BDRV_REQUEST_MAX_BYTES) / BDRV_SECTOR_SIZE;
 BlockDriverState *file;
 int64_t ret;

 ret = bdrv_get_block_status_above(in_bs, NULL,
-  sector_num, nb_sectors,
+  offset >> BDRV_SECTOR_BITS,
+  nb_sectors,
  &pnum, &file);
 if (ret < 0) {
                error_setg_errno(&local_err, -ret,
@@ -3673,12 +3672,11 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, 
BlockDriverState *in_bs,
 } else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) ==
(BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) {
 /* Extend pnum to end of cluster for next iteration */
-pnum = ROUND_UP(sector_num + pnum, cluster_sectors) -
-   sector_num;
+pnum = (ROUND_UP(offset + pnum * BDRV_SECTOR_SIZE,
+ cluster_size) - offset) >> BDRV_SECTOR_BITS;

 /* Count clusters we've seen */
-required += (sector_num % cluster_sectors + pnum) *
-BDRV_SECTOR_SIZE;
+required += offset % cluster_size + pnum * BDRV_SECTOR_SIZE;
 }
 }
 }
-- 
2.13.5




[Qemu-block] [PATCH v8 18/20] qcow2: Switch store_bitmap_data() to byte-based iteration

2017-09-18 Thread Eric Blake
Now that we have adjusted the majority of the calls this function
makes to be byte-based, it is easier to read the code if it makes
passes over the image using bytes rather than sectors.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v7: rebase to earlier change, make rounding of offset obvious (no semantic
change, so R-b kept) [Kevin]
v5-v6: no change
v4: new patch
---
 block/qcow2-bitmap.c | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 692ce0de88..302fffd6e1 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1072,10 +1072,9 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 {
 int ret;
 BDRVQcow2State *s = bs->opaque;
-int64_t sector;
-uint64_t limit, sbc;
+int64_t offset;
+uint64_t limit;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
-uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
 uint8_t *buf = NULL;
 BdrvDirtyBitmapIter *dbi;
@@ -1100,18 +1099,17 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 dbi = bdrv_dirty_iter_new(bitmap);
 buf = g_malloc(s->cluster_size);
 limit = bytes_covered_by_bitmap_cluster(s, bitmap);
-sbc = limit >> BDRV_SECTOR_BITS;
 assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

-while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
-uint64_t cluster = sector / sbc;
+while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
+uint64_t cluster = offset / limit;
 uint64_t end, write_size;
 int64_t off;

-sector = cluster * sbc;
-end = MIN(bm_sectors, sector + sbc);
-write_size = bdrv_dirty_bitmap_serialization_size(bitmap,
-sector * BDRV_SECTOR_SIZE, (end - sector) * BDRV_SECTOR_SIZE);
+offset = QEMU_ALIGN_DOWN(offset, limit);
+end = MIN(bm_size, offset + limit);
+write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset,
+  end - offset);
 assert(write_size <= s->cluster_size);

 off = qcow2_alloc_clusters(bs, s->cluster_size);
@@ -1123,9 +1121,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 }
 tb[cluster] = off;

-bdrv_dirty_bitmap_serialize_part(bitmap, buf,
- sector * BDRV_SECTOR_SIZE,
- (end - sector) * BDRV_SECTOR_SIZE);
+bdrv_dirty_bitmap_serialize_part(bitmap, buf, offset, end - offset);
 if (write_size < s->cluster_size) {
 memset(buf + write_size, 0, s->cluster_size - write_size);
 }
@@ -1143,11 +1139,11 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 goto fail;
 }

-if (end >= bm_sectors) {
+if (end >= bm_size) {
 break;
 }

-bdrv_set_dirty_iter(dbi, end * BDRV_SECTOR_SIZE);
+bdrv_set_dirty_iter(dbi, end);
 }

 *bitmap_table_size = tb_size;
-- 
2.13.5




[Qemu-block] [PATCH v8 11/20] dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset

2017-09-18 Thread Eric Blake
Thanks to recent cleanups, most callers were scaling a return value
of sectors into bytes (the exception, in qcow2-bitmap, will be
converted to byte-based iteration later).  Update the interface to
do the scaling internally instead.

In qcow2-bitmap, the code was specifically checking for an error
return of -1.  To avoid a regression, we either have to make sure
we continue to return -1 (rather than a scaled -512) on error, or
we have to fix the caller to treat all negative values as error
rather than just one magic value.  It's easy enough to make both
changes at the same time, even though either one in isolation
would work.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v8: tweak commit message, add R-b
v7: return -1, not -512; and fix qcow2-bitmap to check all negatives [Kevin]
v5-v6: no change
v4: rebase to persistent bitmap
v3: no change
v2: no change
---
 block/backup.c   | 2 +-
 block/dirty-bitmap.c | 3 ++-
 block/mirror.c   | 8 
 block/qcow2-bitmap.c | 2 +-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/block/backup.c b/block/backup.c
index ac9c018717..06ddbfd03d 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -375,7 +375,7 @@ static int coroutine_fn 
backup_run_incremental(BackupBlockJob *job)
 dbi = bdrv_dirty_iter_new(job->sync_bitmap);

 /* Find the next dirty sector(s) */
-while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) {
+while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
 cluster = offset / job->cluster_size;

 /* Fake progress updates for any clusters we skipped */
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 84509476ba..e451916187 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -503,7 +503,8 @@ void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter)

 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
 {
-return hbitmap_iter_next(&iter->hbi);
+int64_t ret = hbitmap_iter_next(&iter->hbi);
+return ret < 0 ? -1 : ret * BDRV_SECTOR_SIZE;
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
diff --git a/block/mirror.c b/block/mirror.c
index 0b063b3c20..77bf5aa3a4 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -336,10 +336,10 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);

 bdrv_dirty_bitmap_lock(s->dirty_bitmap);
-offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+offset = bdrv_dirty_iter_next(s->dbi);
 if (offset < 0) {
 bdrv_set_dirty_iter(s->dbi, 0);
-offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+offset = bdrv_dirty_iter_next(s->dbi);
 trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) *
   BDRV_SECTOR_SIZE);
 assert(offset >= 0);
@@ -370,11 +370,11 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 break;
 }

-next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+next_dirty = bdrv_dirty_iter_next(s->dbi);
 if (next_dirty > next_offset || next_dirty < 0) {
 /* The bitmap iterator's cache is stale, refresh it */
 bdrv_set_dirty_iter(s->dbi, next_offset);
-next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+next_dirty = bdrv_dirty_iter_next(s->dbi);
 }
 assert(next_dirty == next_offset);
 nb_chunks++;
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 44329fc74f..b09010b1d3 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -1109,7 +1109,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 sbc = limit >> BDRV_SECTOR_BITS;
 assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

-while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
+while ((sector = bdrv_dirty_iter_next(dbi) >> BDRV_SECTOR_BITS) >= 0) {
 uint64_t cluster = sector / sbc;
 uint64_t end, write_size;
 int64_t off;
-- 
2.13.5




[Qemu-block] [PATCH v8 14/20] dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes

2017-09-18 Thread Eric Blake
Some of the callers were already scaling bytes to sectors; others
can be easily converted to pass byte offsets, all in our shift
towards a consistent byte interface everywhere.  Making the change
will also make it easier to write the hold-out callers to use byte
rather than sectors for their iterations; it also makes it easier
for a future dirty-bitmap patch to offload scaling over to the
internal hbitmap.  Although all callers happen to pass
sector-aligned values, make the internal scaling robust to any
sub-sector requests.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: only context change
v4: only context change, due to rebasing to persistent bitmaps
v3: rebase to addition of _locked interfaces; complex enough that I
dropped R-b
v2: no change
---
 include/block/dirty-bitmap.h |  8 
 block/dirty-bitmap.c | 22 ++
 block/mirror.c   | 16 
 migration/block.c|  7 +--
 4 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 9e39537e4b..3579a7597c 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -40,9 +40,9 @@ const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap 
*bitmap);
 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap);
 DirtyBitmapStatus bdrv_dirty_bitmap_status(BdrvDirtyBitmap *bitmap);
 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
-   int64_t cur_sector, int64_t nr_sectors);
+   int64_t offset, int64_t bytes);
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
- int64_t cur_sector, int64_t nr_sectors);
+ int64_t offset, int64_t bytes);
 BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);
@@ -75,9 +75,9 @@ void bdrv_dirty_bitmap_unlock(BdrvDirtyBitmap *bitmap);
 bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
int64_t offset);
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-  int64_t cur_sector, int64_t nr_sectors);
+  int64_t offset, int64_t bytes);
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-int64_t cur_sector, int64_t nr_sectors);
+int64_t offset, int64_t bytes);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index ad559c62b1..117837b3cc 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -509,35 +509,41 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-  int64_t cur_sector, int64_t nr_sectors)
+  int64_t offset, int64_t bytes)
 {
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
+
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
+hbitmap_set(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
+end_sector - (offset >> BDRV_SECTOR_BITS));
 }

 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
-   int64_t cur_sector, int64_t nr_sectors)
+   int64_t offset, int64_t bytes)
 {
 bdrv_dirty_bitmap_lock(bitmap);
-bdrv_set_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors);
+bdrv_set_dirty_bitmap_locked(bitmap, offset, bytes);
 bdrv_dirty_bitmap_unlock(bitmap);
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-int64_t cur_sector, int64_t nr_sectors)
+int64_t offset, int64_t bytes)
 {
+int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
+
 assert(bdrv_dirty_bitmap_enabled(bitmap));
 assert(!bdrv_dirty_bitmap_readonly(bitmap));
-hbitmap_reset(bitmap->bitmap, cur_sector, nr_sectors);
+hbitmap_reset(bitmap->bitmap, offset >> BDRV_SECTOR_BITS,
+  end_sector - (offset >> BDRV_SECTOR_BITS));
 }

 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
- int64_t cur_sector, int64_t nr_sectors)
+ int64_t offset, int64_t bytes)
 {
 bdrv_dirty_bitmap_lock(bitmap);
-bdrv_reset_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors);
+

[Qemu-block] [PATCH v8 10/20] dirty-bitmap: Set iterator start by offset, not sector

2017-09-18 Thread Eric Blake
All callers to bdrv_dirty_iter_new() passed 0 for their initial
starting point, drop that parameter.

Most callers to bdrv_set_dirty_iter() were scaling a byte offset to
a sector number; the exception qcow2-bitmap will be converted later
to use byte rather than sector iteration.  Move the scaling to occur
internally to dirty bitmap code instead, so that callers now pass
in bytes.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: no change
v4: rebase to persistent bitmaps
v3: no change
v2: no change
---
 include/block/dirty-bitmap.h | 5 ++---
 block/backup.c   | 5 ++---
 block/dirty-bitmap.c | 9 -
 block/mirror.c   | 4 ++--
 block/qcow2-bitmap.c | 4 ++--
 5 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 5f34a1a3c7..757fc4c5b8 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -44,8 +44,7 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
  int64_t cur_sector, int64_t nr_sectors);
 BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
-BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
- uint64_t first_sector);
+BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_iter_free(BdrvDirtyBitmapIter *iter);

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
@@ -80,7 +79,7 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
 int64_t cur_sector, int64_t nr_sectors);
 int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
-void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
+void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
 void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
diff --git a/block/backup.c b/block/backup.c
index 517c300a49..ac9c018717 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -372,7 +372,7 @@ static int coroutine_fn 
backup_run_incremental(BackupBlockJob *job)

 granularity = bdrv_dirty_bitmap_granularity(job->sync_bitmap);
 clusters_per_iter = MAX((granularity / job->cluster_size), 1);
-dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
+dbi = bdrv_dirty_iter_new(job->sync_bitmap);

 /* Find the next dirty sector(s) */
 while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) {
@@ -403,8 +403,7 @@ static int coroutine_fn 
backup_run_incremental(BackupBlockJob *job)
 /* If the bitmap granularity is smaller than the backup granularity,
  * we need to advance the iterator pointer to the next cluster. */
 if (granularity < job->cluster_size) {
-bdrv_set_dirty_iter(dbi,
-cluster * job->cluster_size / BDRV_SECTOR_SIZE);
+bdrv_set_dirty_iter(dbi, cluster * job->cluster_size);
 }

 last_cluster = cluster - 1;
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 555c736024..84509476ba 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -473,11 +473,10 @@ uint32_t bdrv_dirty_bitmap_granularity(const 
BdrvDirtyBitmap *bitmap)
 return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
 }

-BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
- uint64_t first_sector)
+BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap)
 {
 BdrvDirtyBitmapIter *iter = g_new(BdrvDirtyBitmapIter, 1);
-hbitmap_iter_init(&iter->hbi, bitmap->bitmap, first_sector);
+hbitmap_iter_init(&iter->hbi, bitmap->bitmap, 0);
 iter->bitmap = bitmap;
 bitmap->active_iterators++;
 return iter;
@@ -645,9 +644,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t 
cur_sector,
 /**
  * Advance a BdrvDirtyBitmapIter to an arbitrary offset.
  */
-void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t sector_num)
+void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t offset)
 {
-hbitmap_iter_init(&iter->hbi, iter->hbi.hb, sector_num);
+hbitmap_iter_init(&iter->hbi, iter->hbi.hb, offset >> BDRV_SECTOR_BITS);
 }

 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
diff --git a/block/mirror.c b/block/mirror.c
index 6531652d73..0b063b3c20 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -373,7 +373,7 @@ static uint64_t coroutine_fn 
mirror_iteration(MirrorBlockJob *s)
 next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
 if (next_dirty > next_offset || next_dirty < 0) {
 /* The bitmap iterator's cache is stale, refresh it */
-bdrv_set_dirty_iter(s->dbi, 

[Qemu-block] [PATCH v8 07/20] dirty-bitmap: Track bitmap size by bytes

2017-09-18 Thread Eric Blake
We are still using an internal hbitmap that tracks a size in sectors,
with the granularity scaled down accordingly, because it lets us
use a shortcut for our iterators which are currently sector-based.
But there's no reason we can't track the dirty bitmap size in bytes,
since it is (mostly) an internal-only variable (remember, the size
is how many bytes are covered by the bitmap, not how many bytes the
bitmap occupies).  A later cleanup will convert dirty bitmap
internals to be entirely byte-based, eliminating the intermediate
sector rounding added here; and technically, since bdrv_getlength()
already rounds up to sectors, our use of DIV_ROUND_UP is more for
theoretical completeness than for any actual rounding.

Use is_power_of_2() while at it, instead of open-coding that.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v8: rebase to earlier truncate changes, R-b kept
v7: split external from internal [Kevin], drop R-b
v6: no change
v5: fix bdrv_dirty_bitmap_truncate [John], drop R-b
v4: retitle from "Track size in bytes", rebase to persistent bitmaps,
round up when converting bytes to sectors
v3: no change
v2: tweak commit message, no code change
---
 block/dirty-bitmap.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 75af32..6d32554b4e 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -42,7 +42,7 @@ struct BdrvDirtyBitmap {
 HBitmap *meta;  /* Meta dirty bitmap */
 BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */
 char *name; /* Optional non-empty unique ID */
-int64_t size;   /* Size of the bitmap (Number of sectors) */
+int64_t size;   /* Size of the bitmap, in bytes */
 bool disabled;  /* Bitmap is disabled. It ignores all writes to
the device */
 int active_iterators;   /* How many iterators are active */
@@ -115,17 +115,14 @@ BdrvDirtyBitmap 
*bdrv_create_dirty_bitmap(BlockDriverState *bs,
 {
 int64_t bitmap_size;
 BdrvDirtyBitmap *bitmap;
-uint32_t sector_granularity;

-assert((granularity & (granularity - 1)) == 0);
+assert(is_power_of_2(granularity) && granularity >= BDRV_SECTOR_SIZE);

 if (name && bdrv_find_dirty_bitmap(bs, name)) {
 error_setg(errp, "Bitmap already exists: %s", name);
 return NULL;
 }
-sector_granularity = granularity >> BDRV_SECTOR_BITS;
-assert(sector_granularity);
-bitmap_size = bdrv_nb_sectors(bs);
+bitmap_size = bdrv_getlength(bs);
 if (bitmap_size < 0) {
 error_setg_errno(errp, -bitmap_size, "could not get length of device");
 errno = -bitmap_size;
@@ -133,7 +130,12 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState 
*bs,
 }
 bitmap = g_new0(BdrvDirtyBitmap, 1);
    bitmap->mutex = &bs->dirty_bitmap_mutex;
-bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(sector_granularity));
+/*
+ * TODO - let hbitmap track full granularity. For now, it is tracking
+ * only sector granularity, as a shortcut for our iterators.
+ */
+bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap_size, BDRV_SECTOR_SIZE),
+   ctz32(granularity) - BDRV_SECTOR_BITS);
 bitmap->size = bitmap_size;
 bitmap->name = g_strdup(name);
 bitmap->disabled = false;
@@ -175,7 +177,7 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)

 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
-return bitmap->size * BDRV_SECTOR_SIZE;
+return bitmap->size;
 }

 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap)
@@ -305,14 +307,13 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs,
 void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
 {
 BdrvDirtyBitmap *bitmap;
-int64_t size = DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE);

 bdrv_dirty_bitmaps_lock(bs);
 QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
 assert(!bdrv_dirty_bitmap_frozen(bitmap));
 assert(!bitmap->active_iterators);
-hbitmap_truncate(bitmap->bitmap, size);
-bitmap->size = size;
+hbitmap_truncate(bitmap->bitmap, DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE));
+bitmap->size = bytes;
 }
 bdrv_dirty_bitmaps_unlock(bs);
 }
@@ -549,7 +550,8 @@ void bdrv_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap **out)
 hbitmap_reset_all(bitmap->bitmap);
 } else {
 HBitmap *backup = bitmap->bitmap;
-bitmap->bitmap = hbitmap_alloc(bitmap->size,
+bitmap->bitmap = hbitmap_alloc(DIV_ROUND_UP(bitmap->size,
+BDRV_SECTOR_SIZE),
hbitmap_granularity(backup));
 *out = backup;
 }
-- 
2.13.5




[Qemu-block] [PATCH v8 04/20] dirty-bitmap: Drop unused functions

2017-09-18 Thread Eric Blake
We had several functions that no one is currently using, and which
use sector-based interfaces.  I'm trying to convert towards byte-based
interfaces, so it's easier to just drop the unused functions:

bdrv_dirty_bitmap_get_meta
bdrv_dirty_bitmap_get_meta_locked
bdrv_dirty_bitmap_reset_meta
bdrv_dirty_bitmap_meta_granularity

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: no change
v4: rebase to Vladimir's persistent bitmaps (bdrv_dirty_bitmap_size now
in use), dropped R-b
v3: rebase to upstream changes (bdrv_dirty_bitmap_get_meta_locked was
added in b64bd51e with no clients), kept R-b
v2: tweak commit message based on review, no code change
---
 include/block/dirty-bitmap.h | 10 --
 block/dirty-bitmap.c | 44 
 2 files changed, 54 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index a79a58d2c3..8fd842eac9 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -34,7 +34,6 @@ void bdrv_enable_dirty_bitmap(BdrvDirtyBitmap *bitmap);
 BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs);
 uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs);
 uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap);
-uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_enabled(BdrvDirtyBitmap *bitmap);
 bool bdrv_dirty_bitmap_frozen(BdrvDirtyBitmap *bitmap);
 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap);
@@ -44,15 +43,6 @@ void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
int64_t cur_sector, int64_t nr_sectors);
 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
  int64_t cur_sector, int64_t nr_sectors);
-int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
-   BdrvDirtyBitmap *bitmap, int64_t sector,
-   int nb_sectors);
-int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors);
-void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors);
 BdrvDirtyBitmapIter *bdrv_dirty_meta_iter_new(BdrvDirtyBitmap *bitmap);
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector);
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 0490ca3aff..42a55e4a4b 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -173,45 +173,6 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)
 qemu_mutex_unlock(bitmap->mutex);
 }

-int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors)
-{
-uint64_t i;
-int sectors_per_bit = 1 << hbitmap_granularity(bitmap->meta);
-
-/* To optimize: we can make hbitmap to internally check the range in a
- * coarse level, or at least do it word by word. */
-for (i = sector; i < sector + nb_sectors; i += sectors_per_bit) {
-if (hbitmap_get(bitmap->meta, i)) {
-return true;
-}
-}
-return false;
-}
-
-int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
-   BdrvDirtyBitmap *bitmap, int64_t sector,
-   int nb_sectors)
-{
-bool dirty;
-
-qemu_mutex_lock(bitmap->mutex);
-dirty = bdrv_dirty_bitmap_get_meta_locked(bs, bitmap, sector, nb_sectors);
-qemu_mutex_unlock(bitmap->mutex);
-
-return dirty;
-}
-
-void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
-  BdrvDirtyBitmap *bitmap, int64_t sector,
-  int nb_sectors)
-{
-qemu_mutex_lock(bitmap->mutex);
-hbitmap_reset(bitmap->meta, sector, nb_sectors);
-qemu_mutex_unlock(bitmap->mutex);
-}
-
 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
 return bitmap->size;
@@ -511,11 +472,6 @@ uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap)
 return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
 }

-uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap)
-{
-return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->meta);
-}
-
 BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
  uint64_t first_sector)
 {
-- 
2.13.5




[Qemu-block] [PATCH v8 06/20] dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes

2017-09-18 Thread Eric Blake
We're already reporting bytes for bdrv_dirty_bitmap_granularity();
mixing bytes and sectors in our return values is a recipe for
confusion.  A later cleanup will convert dirty bitmap internals
to be entirely byte-based, but in the meantime, we should report
the bitmap size in bytes.

The only external caller in qcow2-bitmap.c is temporarily more verbose
(because it is still using sector-based math), but will later be
switched to track progress by bytes instead of sectors.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v8: no change
v7: split external from internal change [Kevin], drop R-b
v6: no change
v5: fix bdrv_dirty_bitmap_truncate [John], drop R-b
v4: retitle from "Track size in bytes", rebase to persistent bitmaps,
round up when converting bytes to sectors
v3: no change
v2: tweak commit message, no code change
---
 block/dirty-bitmap.c |  2 +-
 block/qcow2-bitmap.c | 14 --
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index ee164fb518..75af32 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -175,7 +175,7 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)

 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
-return bitmap->size;
+return bitmap->size * BDRV_SECTOR_SIZE;
 }

 const char *bdrv_dirty_bitmap_name(const BdrvDirtyBitmap *bitmap)
diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index b3ee4c794a..65122e9ae1 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -295,10 +295,11 @@ static int load_bitmap_data(BlockDriverState *bs,
 BDRVQcow2State *s = bs->opaque;
 uint64_t sector, sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
+uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 uint8_t *buf = NULL;
 uint64_t i, tab_size =
 size_to_clusters(s,
-bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size));
+bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_sectors));

 if (tab_size != bitmap_table_size || tab_size > BME_MAX_TABLE_SIZE) {
 return -EINVAL;
@@ -307,7 +308,7 @@ static int load_bitmap_data(BlockDriverState *bs,
 buf = g_malloc(s->cluster_size);
 sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
 for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
-uint64_t count = MIN(bm_size - sector, sbc);
+uint64_t count = MIN(bm_sectors - sector, sbc);
 uint64_t entry = bitmap_table[i];
 uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;

@@ -1077,13 +1078,14 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 int64_t sector;
 uint64_t sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
+uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
 uint8_t *buf = NULL;
 BdrvDirtyBitmapIter *dbi;
 uint64_t *tb;
 uint64_t tb_size =
 size_to_clusters(s,
-bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_size));
+bdrv_dirty_bitmap_serialization_size(bitmap, 0, bm_sectors));

 if (tb_size > BME_MAX_TABLE_SIZE ||
 tb_size * s->cluster_size > BME_MAX_PHYS_SIZE)
@@ -1101,7 +1103,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 dbi = bdrv_dirty_iter_new(bitmap, 0);
 buf = g_malloc(s->cluster_size);
 sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
-assert(DIV_ROUND_UP(bm_size, sbc) == tb_size);
+assert(DIV_ROUND_UP(bm_sectors, sbc) == tb_size);

 while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
 uint64_t cluster = sector / sbc;
@@ -1109,7 +1111,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 int64_t off;

 sector = cluster * sbc;
-end = MIN(bm_size, sector + sbc);
+end = MIN(bm_sectors, sector + sbc);
 write_size =
 bdrv_dirty_bitmap_serialization_size(bitmap, sector, end - sector);
 assert(write_size <= s->cluster_size);
@@ -1141,7 +1143,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 goto fail;
 }

-if (end >= bm_size) {
+if (end >= bm_sectors) {
 break;
 }

-- 
2.13.5




[Qemu-block] [PATCH v8 05/20] dirty-bitmap: Avoid size query failure during truncate

2017-09-18 Thread Eric Blake
We've previously fixed several places where we failed to account
for possible errors from bdrv_nb_sectors().  Fix another one by
making bdrv_dirty_bitmap_truncate() take the new size from the
caller instead of querying itself; then adjust the sole caller
bdrv_truncate() to pass the size just determined by a successful
resize, or to skip the bitmap resize on failure, thus avoiding
sizing the bitmaps to -1.

Signed-off-by: Eric Blake 

---
v8: retitle and rework to avoid possibility of secondary failure [John]
v7: new patch [Kevin]
---
 include/block/dirty-bitmap.h |  2 +-
 block.c  | 15 ++-
 block/dirty-bitmap.c |  6 +++---
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
index 8fd842eac9..7a27590047 100644
--- a/include/block/dirty-bitmap.h
+++ b/include/block/dirty-bitmap.h
@@ -83,7 +83,7 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
 void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t sector_num);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
-void bdrv_dirty_bitmap_truncate(BlockDriverState *bs);
+void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes);
 bool bdrv_dirty_bitmap_readonly(const BdrvDirtyBitmap *bitmap);
 bool bdrv_has_readonly_bitmaps(BlockDriverState *bs);
 bool bdrv_dirty_bitmap_get_autoload(const BdrvDirtyBitmap *bitmap);
diff --git a/block.c b/block.c
index ee6a48976e..61ee9d4b83 100644
--- a/block.c
+++ b/block.c
@@ -3450,12 +3450,17 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
 assert(!(bs->open_flags & BDRV_O_INACTIVE));

 ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
-if (ret == 0) {
-ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
-bdrv_dirty_bitmap_truncate(bs);
-bdrv_parent_cb_resize(bs);
-atomic_inc(&bs->write_gen);
+if (ret < 0) {
+return ret;
 }
+ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
+if (ret < 0) {
+error_setg_errno(errp, -ret, "Could not refresh total sector count");
+return ret;
+}
+bdrv_dirty_bitmap_truncate(bs, bs->total_sectors * BDRV_SECTOR_SIZE);
+bdrv_parent_cb_resize(bs);
+atomic_inc(&bs->write_gen);
 return ret;
 }

diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 42a55e4a4b..ee164fb518 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -1,7 +1,7 @@
 /*
  * Block Dirty Bitmap
  *
- * Copyright (c) 2016 Red Hat. Inc
+ * Copyright (c) 2016-2017 Red Hat. Inc
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
  * of this software and associated documentation files (the "Software"), to deal
@@ -302,10 +302,10 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs,
  * Truncates _all_ bitmaps attached to a BDS.
  * Called with BQL taken.
  */
-void bdrv_dirty_bitmap_truncate(BlockDriverState *bs)
+void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
 {
 BdrvDirtyBitmap *bitmap;
-uint64_t size = bdrv_nb_sectors(bs);
+int64_t size = DIV_ROUND_UP(bytes, BDRV_SECTOR_SIZE);

 bdrv_dirty_bitmaps_lock(bs);
 QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
-- 
2.13.5




[Qemu-block] [PATCH v8 09/20] qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based

2017-09-18 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Change the qcow2 bitmap
helper function sectors_covered_by_bitmap_cluster(), renaming it
to bytes_covered_by_bitmap_cluster() in the process.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: no change
v4: new patch
---
 block/qcow2-bitmap.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index 92098bfa49..4475273d8c 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -269,18 +269,16 @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb)
 return 0;
 }

-/* This function returns the number of disk sectors covered by a single qcow2
- * cluster of bitmap data. */
-static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s,
-  const BdrvDirtyBitmap *bitmap)
+/* Return the disk size covered by a single qcow2 cluster of bitmap data. */
+static uint64_t bytes_covered_by_bitmap_cluster(const BDRVQcow2State *s,
+const BdrvDirtyBitmap *bitmap)
 {
-uint64_t sector_granularity =
-bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
-uint64_t sbc = sector_granularity * (s->cluster_size << 3);
+uint64_t granularity = bdrv_dirty_bitmap_granularity(bitmap);
+uint64_t limit = granularity * (s->cluster_size << 3);

-assert(QEMU_IS_ALIGNED(sbc << BDRV_SECTOR_BITS,
+assert(QEMU_IS_ALIGNED(limit,
bdrv_dirty_bitmap_serialization_align(bitmap)));
-return sbc;
+return limit;
 }

 /* load_bitmap_data
@@ -293,7 +291,7 @@ static int load_bitmap_data(BlockDriverState *bs,
 {
 int ret = 0;
 BDRVQcow2State *s = bs->opaque;
-uint64_t sector, sbc;
+uint64_t sector, limit, sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
 uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 uint8_t *buf = NULL;
@@ -306,7 +304,8 @@ static int load_bitmap_data(BlockDriverState *bs,
 }

 buf = g_malloc(s->cluster_size);
-sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
+limit = bytes_covered_by_bitmap_cluster(s, bitmap);
+sbc = limit >> BDRV_SECTOR_BITS;
 for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
 uint64_t count = MIN(bm_sectors - sector, sbc);
 uint64_t entry = bitmap_table[i];
@@ -1080,7 +1079,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 int ret;
 BDRVQcow2State *s = bs->opaque;
 int64_t sector;
-uint64_t sbc;
+uint64_t limit, sbc;
 uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
 uint64_t bm_sectors = DIV_ROUND_UP(bm_size, BDRV_SECTOR_SIZE);
 const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
@@ -1106,8 +1105,9 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,

 dbi = bdrv_dirty_iter_new(bitmap, 0);
 buf = g_malloc(s->cluster_size);
-sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
-assert(DIV_ROUND_UP(bm_sectors, sbc) == tb_size);
+limit = bytes_covered_by_bitmap_cluster(s, bitmap);
+sbc = limit >> BDRV_SECTOR_BITS;
+assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

 while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
 uint64_t cluster = sector / sbc;
-- 
2.13.5




[Qemu-block] [PATCH v8 02/20] hbitmap: Rename serialization_granularity to serialization_align

2017-09-18 Thread Eric Blake
The only client of hbitmap_serialization_granularity() is dirty-bitmap's
bdrv_dirty_bitmap_serialization_align().  Keeping the two names consistent
is worthwhile, and the shorter name is more representative of what the
function returns (the required alignment to be used for start/count of
other serialization functions, where violating the alignment causes
assertion failures).

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: no change
v4: new patch
---
 include/qemu/hbitmap.h |  8 
 block/dirty-bitmap.c   |  2 +-
 tests/test-hbitmap.c   | 10 +-
 util/hbitmap.c |  8 
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
index d3a74a21fc..81e78043d1 100644
--- a/include/qemu/hbitmap.h
+++ b/include/qemu/hbitmap.h
@@ -159,16 +159,16 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item);
 bool hbitmap_is_serializable(const HBitmap *hb);

 /**
- * hbitmap_serialization_granularity:
+ * hbitmap_serialization_align:
  * @hb: HBitmap to operate on.
  *
- * Granularity of serialization chunks, used by other serialization functions.
- * For every chunk:
+ * Required alignment of serialization chunks, used by other serialization
+ * functions. For every chunk:
  * 1. Chunk start should be aligned to this granularity.
  * 2. Chunk size should be aligned too, except for last chunk (for which
  *  start + count == hb->size)
  */
-uint64_t hbitmap_serialization_granularity(const HBitmap *hb);
+uint64_t hbitmap_serialization_align(const HBitmap *hb);

 /**
  * hbitmap_serialization_size:
diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
index 30462d4f9a..0490ca3aff 100644
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -617,7 +617,7 @@ uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,

 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
 {
-return hbitmap_serialization_granularity(bitmap->bitmap);
+return hbitmap_serialization_align(bitmap->bitmap);
 }

 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
index 1acb353889..af41642346 100644
--- a/tests/test-hbitmap.c
+++ b/tests/test-hbitmap.c
@@ -738,15 +738,15 @@ static void test_hbitmap_meta_one(TestHBitmapData *data, 
const void *unused)
 }
 }

-static void test_hbitmap_serialize_granularity(TestHBitmapData *data,
-   const void *unused)
+static void test_hbitmap_serialize_align(TestHBitmapData *data,
+ const void *unused)
 {
 int r;

 hbitmap_test_init(data, L3 * 2, 3);
 g_assert(hbitmap_is_serializable(data->hb));

-r = hbitmap_serialization_granularity(data->hb);
+r = hbitmap_serialization_align(data->hb);
 g_assert_cmpint(r, ==, 64 << 3);
 }

@@ -974,8 +974,8 @@ int main(int argc, char **argv)
 hbitmap_test_add("/hbitmap/meta/word", test_hbitmap_meta_word);
 hbitmap_test_add("/hbitmap/meta/sector", test_hbitmap_meta_sector);

-hbitmap_test_add("/hbitmap/serialize/granularity",
- test_hbitmap_serialize_granularity);
+hbitmap_test_add("/hbitmap/serialize/align",
+ test_hbitmap_serialize_align);
 hbitmap_test_add("/hbitmap/serialize/basic",
  test_hbitmap_serialize_basic);
 hbitmap_test_add("/hbitmap/serialize/part",
diff --git a/util/hbitmap.c b/util/hbitmap.c
index 21535cc90b..2f9d0fdbd0 100644
--- a/util/hbitmap.c
+++ b/util/hbitmap.c
@@ -413,14 +413,14 @@ bool hbitmap_is_serializable(const HBitmap *hb)
 {
 /* Every serialized chunk must be aligned to 64 bits so that endianness
  * requirements can be fulfilled on both 64 bit and 32 bit hosts.
- * We have hbitmap_serialization_granularity() which converts this
+ * We have hbitmap_serialization_align() which converts this
  * alignment requirement from bitmap bits to items covered (e.g. sectors).
  * That value is:
  *64 << hb->granularity
  * Since this value must not exceed UINT64_MAX, hb->granularity must be
  * less than 58 (== 64 - 6, where 6 is ld(64), i.e. 1 << 6 == 64).
  *
- * In order for hbitmap_serialization_granularity() to always return a
+ * In order for hbitmap_serialization_align() to always return a
  * meaningful value, bitmaps that are to be serialized must have a
  * granularity of less than 58. */

@@ -437,7 +437,7 @@ bool hbitmap_get(const HBitmap *hb, uint64_t item)
 return (hb->levels[HBITMAP_LEVELS - 1][pos >> BITS_PER_LEVEL] & bit) != 0;
 }

-uint64_t hbitmap_serialization_granularity(const HBitmap *hb)
+uint64_t hbitmap_serialization_align(const HBitmap *hb)
 {
 assert(hbitmap_is_serializable(hb));

@@ -454,7 +454,7 @@ static void serialization_chunk(const HBitmap *hb,
 unsigned long **first_el, uint64_t 

[Qemu-block] [PATCH v8 01/20] block: Make bdrv_img_create() size selection easier to read

2017-09-18 Thread Eric Blake
All callers of bdrv_img_create() pass in a size, or -1 to read the
size from the backing file.  We then set that size as the QemuOpt
default, which means we will reuse that default rather than the
final parameter to qemu_opt_get_size() several lines later.  But
it is rather confusing to read subsequent checks of 'size == -1'
when it looks (without seeing the full context) like size defaults
to 0; it also doesn't help that a size of 0 is valid (for some
formats).

Rework the logic to make things more legible.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v6: Combine into a series rather than being a standalone patch (more for
ease of tracking than for being on topic)
---
 block.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block.c b/block.c
index 6dd47e414e..ee6a48976e 100644
--- a/block.c
+++ b/block.c
@@ -4393,7 +4393,7 @@ void bdrv_img_create(const char *filename, const char *fmt,

 /* The size for the image must always be specified, unless we have a backing
  * file and we have not been forbidden from opening it. */
-size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
+size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, img_size);
 if (backing_file && !(flags & BDRV_O_NO_BACKING)) {
 BlockDriverState *bs;
 char *full_backing = g_new0(char, PATH_MAX);
-- 
2.13.5




[Qemu-block] [PATCH v8 00/20] make dirty-bitmap byte-based

2017-09-18 Thread Eric Blake
There are patches floating around to add NBD_CMD_BLOCK_STATUS,
but NBD wants to report status on byte granularity (even if the
reporting will probably be naturally aligned to sectors or even
much higher levels).  I've therefore started the task of
converting our block status code to report at a byte granularity
rather than sectors.

Now that 2.11 is open, I'm rebasing/reposting the remaining patches.

The overall conversion currently looks like:
part 1: bdrv_is_allocated (merged in 2.10, commit 51b0a488)
part 2: dirty-bitmap (this series, v7 was here [1])
part 3: bdrv_get_block_status (v4 is posted [2] and is mostly reviewed)
part 4: .bdrv_co_block_status (v3 is posted [3], but needs review)

Available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-dirty-v8

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03160.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03543.html
[3] https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg03812.html

Since v7:
- add R-b where appropriate
- patch 5 rewritten to be simpler [John]
- patch 7 and 20 have fallout from patch 5, but I kept R-b

001/20:[] [--] 'block: Make bdrv_img_create() size selection easier to read'
002/20:[] [--] 'hbitmap: Rename serialization_granularity to serialization_align'
003/20:[] [--] 'qcow2: Ensure bitmap serialization is aligned'
004/20:[] [--] 'dirty-bitmap: Drop unused functions'
005/20:[down] 'dirty-bitmap: Avoid size query failure during truncate'
006/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes'
007/20:[0007] [FC] 'dirty-bitmap: Track bitmap size by bytes'
008/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes'
009/20:[] [--] 'qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based'
010/20:[] [-C] 'dirty-bitmap: Set iterator start by offset, not sector'
011/20:[] [--] 'dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset'
012/20:[] [--] 'dirty-bitmap: Change bdrv_get_dirty_count() to report bytes'
013/20:[] [--] 'dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes'
014/20:[] [--] 'dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes'
015/20:[] [--] 'mirror: Switch mirror_dirty_init() to byte-based iteration'
016/20:[] [--] 'qcow2: Switch qcow2_measure() to byte-based iteration'
017/20:[] [--] 'qcow2: Switch load_bitmap_data() to byte-based iteration'
018/20:[] [--] 'qcow2: Switch store_bitmap_data() to byte-based iteration'
019/20:[] [--] 'dirty-bitmap: Switch bdrv_set_dirty() to bytes'
020/20:[0004] [FC] 'dirty-bitmap: Convert internal hbitmap size/granularity'

Eric Blake (20):
  block: Make bdrv_img_create() size selection easier to read
  hbitmap: Rename serialization_granularity to serialization_align
  qcow2: Ensure bitmap serialization is aligned
  dirty-bitmap: Drop unused functions
  dirty-bitmap: Avoid size query failure during truncate
  dirty-bitmap: Change bdrv_dirty_bitmap_size() to report bytes
  dirty-bitmap: Track bitmap size by bytes
  dirty-bitmap: Change bdrv_dirty_bitmap_*serialize*() to take bytes
  qcow2: Switch sectors_covered_by_bitmap_cluster() to byte-based
  dirty-bitmap: Set iterator start by offset, not sector
  dirty-bitmap: Change bdrv_dirty_iter_next() to report byte offset
  dirty-bitmap: Change bdrv_get_dirty_count() to report bytes
  dirty-bitmap: Change bdrv_get_dirty_locked() to take bytes
  dirty-bitmap: Change bdrv_[re]set_dirty_bitmap() to use bytes
  mirror: Switch mirror_dirty_init() to byte-based iteration
  qcow2: Switch qcow2_measure() to byte-based iteration
  qcow2: Switch load_bitmap_data() to byte-based iteration
  qcow2: Switch store_bitmap_data() to byte-based iteration
  dirty-bitmap: Switch bdrv_set_dirty() to bytes
  dirty-bitmap: Convert internal hbitmap size/granularity

 include/block/block_int.h|   2 +-
 include/block/dirty-bitmap.h |  43 ++
 include/qemu/hbitmap.h   |   8 +--
 block/io.c   |   6 +-
 block.c  |  17 --
 block/backup.c   |   7 +--
 block/dirty-bitmap.c | 134 ++-
 block/mirror.c   |  79 +++--
 block/qcow2-bitmap.c |  57 +-
 block/qcow2.c|  22 ---
 migration/block.c|  12 ++--
 tests/test-hbitmap.c |  10 ++--
 util/hbitmap.c   |   8 +--
 13 files changed, 167 insertions(+), 238 deletions(-)

-- 
2.13.5




[Qemu-block] [PATCH v8 03/20] qcow2: Ensure bitmap serialization is aligned

2017-09-18 Thread Eric Blake
When subdividing a bitmap serialization, the code in hbitmap.c
enforces that start/count parameters are aligned (except that
count can end early at end-of-bitmap).  We exposed this required
alignment through bdrv_dirty_bitmap_serialization_align(), but
forgot to actually check that we comply with it.

Fortunately, qcow2 is never dividing bitmap serialization smaller
than one cluster (which is a minimum of 512 bytes); so we are
always compliant with the serialization alignment (which insists
that we partition at least 64 bits per chunk) because we are doing
at least 4k bits per chunk.

Still, it's safer to add an assertion (for the unlikely case that
we'd ever support a cluster smaller than 512 bytes, or if the
hbitmap implementation changes what it considers to be aligned),
rather than leaving bdrv_dirty_bitmap_serialization_align()
without a caller.

Signed-off-by: Eric Blake 
Reviewed-by: John Snow 

---
v5: no change
v4: new patch
---
 block/qcow2-bitmap.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-bitmap.c b/block/qcow2-bitmap.c
index e8d3bdbd6e..b3ee4c794a 100644
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -274,10 +274,13 @@ static int free_bitmap_clusters(BlockDriverState *bs, 
Qcow2BitmapTable *tb)
 static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s,
   const BdrvDirtyBitmap *bitmap)
 {
-uint32_t sector_granularity =
+uint64_t sector_granularity =
 bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
+uint64_t sbc = sector_granularity * (s->cluster_size << 3);

-return (uint64_t)sector_granularity * (s->cluster_size << 3);
+assert(QEMU_IS_ALIGNED(sbc,
+   bdrv_dirty_bitmap_serialization_align(bitmap)));
+return sbc;
 }

 /* load_bitmap_data
-- 
2.13.5




Re: [Qemu-block] [PATCH v8 0/4] Add shrink image for qcow2

2017-09-18 Thread Max Reitz
On 2017-09-18 14:42, Pavel Butsykin wrote:
> This patch add shrinking of the image file for qcow2. As a result, this allows
> us to reduce the virtual image size and free up space on the disk without
> copying the image. Image can be fragmented and shrink is done by punching holes
> in the image file.
> 
> # ./qemu-img create -f qcow2 image.qcow2 4G
> Formatting 'image.qcow2', fmt=qcow2 size=4294967296 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
> 
> # ./qemu-io -c "write -P 0x22 0 1G" image.qcow2
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1 ops; 0:00:04.59 (222.886 MiB/sec and 0.2177 ops/sec)
> 
> # ./qemu-img resize image.qcow2 512M
> warning: qemu-img: Shrinking an image will delete all data beyond the 
> shrunken image's end. Before performing such an operation, make sure there is 
> no important data there.
> error: qemu-img: Use the --shrink option to perform a shrink operation.
> 
> # ./qemu-img resize --shrink image.qcow2 128M
> Image resized.
> 
> # ./qemu-img info image.qcow2
> image: image.qcow2
> file format: qcow2
> virtual size: 128M (134217728 bytes)
> disk size: 128M
> cluster_size: 65536
> Format specific information:
> compat: 1.1
> lazy refcounts: false
> refcount bits: 16
> corrupt: false
> 
> # du -h image.qcow2
129M	image.qcow2

Thanks, I've added the missing space in patch 1 and applied the series
to my block branch:

https://github.com/XanClic/qemu/commits/block

Max





Re: [Qemu-block] [PATCH v8 1/4] qemu-img: add --shrink flag for resize

2017-09-18 Thread Max Reitz
On 2017-09-18 14:42, Pavel Butsykin wrote:
> The flag is additional precaution against data loss. Perhaps in the future the
> operation shrink without this flag will be blocked for all formats, but for 
> now
> we need to maintain compatibility with raw.
> 
> Signed-off-by: Pavel Butsykin 
> Reviewed-by: Max Reitz 
> Reviewed-by: John Snow 
> ---
>  qemu-img-cmds.hx   |  4 ++--
>  qemu-img.c | 23 +++
>  qemu-img.texi  |  6 +-
>  tests/qemu-iotests/102 |  4 ++--
>  tests/qemu-iotests/106 |  2 +-
>  5 files changed, 33 insertions(+), 6 deletions(-)

[...]

> diff --git a/qemu-img.c b/qemu-img.c
> index 56ef49e214..b7b2386cbd 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c

[...]

> @@ -3571,6 +3577,23 @@ static int img_resize(int argc, char **argv)
>  goto out;
>  }
>  
> +if (total_size < current_size && !shrink) {
> +warn_report("Shrinking an image will delete all data beyond the "
> +"shrunken image's end. Before performing such an "
> +"operation, make sure there is no important data there.");
> +
> +if (g_strcmp0(bdrv_get_format_name(blk_bs(blk)), "raw") != 0) {
> +error_report(
> +  "Use the --shrink option to perform a shrink operation.");
> +ret = -1;
> +goto out;
> +} else {
> +warn_report("Using the --shrink option will suppress this message."

Still missing a space here.

Max

> +"Note that future versions of qemu-img may refuse to "
> +"shrink images without this option.");
> +}
> +}
> +
>  ret = blk_truncate(blk, total_size, prealloc, );
>  if (!ret) {
>  qprintf(quiet, "Image resized.\n");





Re: [Qemu-block] [PATCH 17/18] qemu-io: Add background write

2017-09-18 Thread Max Reitz
On 2017-09-18 08:46, Fam Zheng wrote:
> On Wed, 09/13 20:19, Max Reitz wrote:
>> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
>> will not wait for the result of the operation and instead execute it in
>> the background.
> 
> Cannot aio_write be used for this purpose?

Depends.  I have been trained to dislike *_aio_*, so that's probably the
initial reason why I didn't use it.

Second, I'd have to fix aio_write before it can be used.  Currently,
this aborts:

echo 'qemu-io drv0 "aio_write -P 0x11 0 64M"' \
| x86_64-softmmu/qemu-system-x86_64 -monitor stdio \
  -blockdev node-name=drv0,driver=null-co

because aio_write_done thinks it's a good idea to use qemu-io's
BlockBackend -- but when qemu-io is executed through the HMP, the
BlockBackend is only created for the duration of the qemu-io command
(unless there already is a BB).  So what I'd have to do is add a
blk_ref()/blk_unref() there, but for some reason I really don't like that.

So I'd probably have to give up on using -blockdev in the new iotest and
would have to use -drive again.
(Note: With if=none, it still aborts while doing the block accounting,
and I have looked long enough into it to just decide I'd go with
if=virtio instead.)

So, yes, it appears I can use aio_write, together with -drive if=virtio
instead of -blockdev.

The remaining difference is the following: With aio_write, all writes
come from the same BlockBackend, and they are really asynchronous.
That's nice because it's like a guest behaves.

With write -B, they come from different BBs and the BB is usually
already gone when the write is completed -- or maybe destroying the BB
means that everything is flushed and thus the writes are not necessarily
asynchronous.  That doesn't seem so nice, but this behavior made me
write patch 13, so maybe it actually is a good idea to test this.

So I'm a bit torn.  On one hand it seems to be a good idea to use
aio_write because that's already there and it's good enough to simulate
a guest.  But on the other hand, write -B gives a bit more funny
behavior which in my opinion is always good for a test...

Max
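The blk_ref()/blk_unref() fix Max alludes to follows a general pattern: the in-flight async operation pins the object so the completion callback can still use it after the issuer dropped its own reference. Below is a minimal refcounting model of that idea; all names are illustrative and not the real QEMU BlockBackend API:

```c
#include <stdlib.h>

typedef struct Backend {
    int refcnt;
    int completed_writes;
} Backend;

static Backend *backend_new(void)
{
    Backend *b = calloc(1, sizeof(*b));
    b->refcnt = 1;
    return b;
}

static void backend_ref(Backend *b)
{
    b->refcnt++;
}

static int backend_unref(Backend *b)    /* returns 1 if this freed it */
{
    if (--b->refcnt == 0) {
        free(b);
        return 1;
    }
    return 0;
}

/* Starting an async write takes a reference for the request's lifetime. */
static void aio_write_start(Backend *b)
{
    backend_ref(b);
}

/* Completion may run after the issuer is gone; the request's own
 * reference keeps the backend valid until here. */
static int aio_write_done(Backend *b)
{
    int writes = ++b->completed_writes;
    backend_unref(b);
    return writes;
}
```

This is exactly the window in the HMP qemu-io case: the command's BlockBackend goes away before aio_write_done runs, unless the request holds a reference.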





Re: [Qemu-block] [PATCH 18/18] iotests: Add test for active mirroring

2017-09-18 Thread Max Reitz
On 2017-09-18 08:45, Fam Zheng wrote:
> On Wed, 09/13 20:19, Max Reitz wrote:
>> Signed-off-by: Max Reitz 
>> ---
>>  tests/qemu-iotests/151 | 111 +
>>  tests/qemu-iotests/151.out |   5 ++
>>  tests/qemu-iotests/group   |   1 +
>>  3 files changed, 117 insertions(+)
>>  create mode 100755 tests/qemu-iotests/151
>>  create mode 100644 tests/qemu-iotests/151.out
>>
>> diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
>> new file mode 100755
>> index 00..49a60773f9
>> --- /dev/null
>> +++ b/tests/qemu-iotests/151
>> @@ -0,0 +1,111 @@
>> +#!/usr/bin/env python
>> +#
>> +# Tests for active mirroring
>> +#
>> +# Copyright (C) 2017 Red Hat, Inc.
>> +#
>> +# This program is free software; you can redistribute it and/or modify
>> +# it under the terms of the GNU General Public License as published by
>> +# the Free Software Foundation; either version 2 of the License, or
>> +# (at your option) any later version.
>> +#
>> +# This program is distributed in the hope that it will be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program.  If not, see .
>> +#
>> +
>> +import os
>> +import iotests
>> +from iotests import qemu_img
>> +
>> +source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
>> +target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
>> +
>> +class TestActiveMirror(iotests.QMPTestCase):
>> +image_len = 128 * 1024 * 1024 # MB
>> +potential_writes_in_flight = True
>> +
>> +def setUp(self):
>> +qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
>> +qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
>> +
>> +blk_source = {'node-name': 'source',
>> +  'driver': iotests.imgfmt,
>> +  'file': {'driver': 'file',
>> +   'filename': source_img}}
>> +
>> +blk_target = {'node-name': 'target',
>> +  'driver': iotests.imgfmt,
>> +  'file': {'driver': 'file',
>> +   'filename': target_img}}
>> +
>> +self.vm = iotests.VM()
>> +self.vm.add_blockdev(self.qmp_to_opts(blk_source))
>> +self.vm.add_blockdev(self.qmp_to_opts(blk_target))
>> +self.vm.launch()
>> +
>> +def tearDown(self):
>> +self.vm.shutdown()
>> +
>> +if not self.potential_writes_in_flight:
>> +self.assertTrue(iotests.compare_images(source_img, target_img),
>> +'mirror target does not match source')
>> +
>> +os.remove(source_img)
>> +os.remove(target_img)
>> +
>> +def doActiveIO(self, sync_source_and_target):
>> +# Fill the source image
>> +self.vm.hmp_qemu_io('source',
>> +'write -P 1 0 %i' % self.image_len);
>> +
>> +# Start some background requests
>> +for offset in range(0, self.image_len, 1024 * 1024):
>> +self.vm.hmp_qemu_io('source', 'write -B -P 2 %i 1M' % offset)
>> +
>> +# Start the block job
>> +result = self.vm.qmp('blockdev-mirror',
>> + job_id='mirror',
>> + filter_node_name='mirror-node',
>> + device='source',
>> + target='target',
>> + sync='full',
>> + copy_mode='active-write')
>> +self.assert_qmp(result, 'return', {})
>> +
>> +# Start some more requests
>> +for offset in range(0, self.image_len, 1024 * 1024):
>> +self.vm.hmp_qemu_io('mirror-node', 'write -B -P 3 %i 1M' % offset)
>> +
>> +# Wait for the READY event
>> +self.wait_ready(drive='mirror')
>> +
>> +# Now start some final requests; all of these (which land on
>> +# the source) should be settled using the active mechanism.
>> +# The mirror code itself asserts that the source BDS's dirty
>> +# bitmap will stay clean between READY and COMPLETED.
>> +for offset in range(0, self.image_len, 1024 * 1024):
>> +self.vm.hmp_qemu_io('mirror-node', 'write -B -P 4 %i 1M' % offset)
>> +
>> +if sync_source_and_target:
>> +# If source and target should be in sync after the mirror,
>> +# we have to flush before completion
> 
> Not sure I understand this requirement, does it apply to libvirt and user too?
> I.e. it's a part of the interface ? Why cannot mirror_complete do it
> automatically?

Well, it seems to pass without this flush, but the original intention
was this: When mirror is completed, the 

Re: [Qemu-block] [PATCH 05/18] block/mirror: Convert to coroutines

2017-09-18 Thread Max Reitz
On 2017-09-18 08:02, Fam Zheng wrote:
> On Wed, 09/13 20:18, Max Reitz wrote:
>> In order to talk to the source BDS (and maybe in the future to the
>> target BDS as well) directly, we need to convert our existing AIO
>> requests into coroutine I/O requests.
>>
>> Signed-off-by: Max Reitz 
>> ---
>>  block/mirror.c | 134 +
>>  1 file changed, 78 insertions(+), 56 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 4664b0516f..2b3297aa61 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -80,6 +80,9 @@ typedef struct MirrorOp {
>>  QEMUIOVector qiov;
>>  int64_t offset;
>>  uint64_t bytes;
>> +
>> +/* Set by mirror_co_read() before yielding for the first time */
>> +uint64_t bytes_copied;
>>  } MirrorOp;
>>  
>>  typedef enum MirrorMethod {
>> @@ -101,7 +104,7 @@ static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>>  }
>>  }
>>  
>> -static void mirror_iteration_done(MirrorOp *op, int ret)
>> +static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
>>  {
>>  MirrorBlockJob *s = op->s;
>>  struct iovec *iov;
>> @@ -138,9 +141,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>>  }
>>  }
>>  
>> -static void mirror_write_complete(void *opaque, int ret)
>> +static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
>>  {
>> -MirrorOp *op = opaque;
>>  MirrorBlockJob *s = op->s;
>>  
>>  aio_context_acquire(blk_get_aio_context(s->common.blk));
>> @@ -158,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
>>  aio_context_release(blk_get_aio_context(s->common.blk));
>>  }
>>  
>> -static void mirror_read_complete(void *opaque, int ret)
>> +static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
>>  {
>> -MirrorOp *op = opaque;
>>  MirrorBlockJob *s = op->s;
>>  
>>  aio_context_acquire(blk_get_aio_context(s->common.blk));
>> @@ -176,8 +177,11 @@ static void mirror_read_complete(void *opaque, int ret)
>>  
>>  mirror_iteration_done(op, ret);
>>  } else {
>> -blk_aio_pwritev(s->target, op->offset, &op->qiov,
>> -0, mirror_write_complete, op);
>> +int ret;
>> +
>> +ret = blk_co_pwritev(s->target, op->offset,
>> + op->qiov.size, &op->qiov, 0);
>> +mirror_write_complete(op, ret);
>>  }
>>  aio_context_release(blk_get_aio_context(s->common.blk));
>>  }
>> @@ -242,53 +246,49 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
>>   *  (new_end - offset) if tail is rounded up or down due to
>>   *  alignment or buffer limit.
>>   */
>> -static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
>> -   uint64_t bytes)
>> +static void coroutine_fn mirror_co_read(void *opaque)
>>  {
>> +MirrorOp *op = opaque;
>> +MirrorBlockJob *s = op->s;
>>  BlockBackend *source = s->common.blk;
>>  int nb_chunks;
>>  uint64_t ret;
>> -MirrorOp *op;
>>  uint64_t max_bytes;
>>  
>>  max_bytes = s->granularity * s->max_iov;
>>  
>>  /* We can only handle as much as buf_size at a time. */
>> -bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
>> -assert(bytes);
>> -assert(bytes < BDRV_REQUEST_MAX_BYTES);
>> -ret = bytes;
>> +op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
>> +assert(op->bytes);
>> +assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
>> +op->bytes_copied = op->bytes;
>>  
>>  if (s->cow_bitmap) {
>> -ret += mirror_cow_align(s, &offset, &bytes);
>> +op->bytes_copied += mirror_cow_align(s, &op->offset, &op->bytes);
>>  }
>> -assert(bytes <= s->buf_size);
>> +/* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
>> +assert(op->bytes_copied <= UINT_MAX);
>> +assert(op->bytes <= s->buf_size);
>>  /* The offset is granularity-aligned because:
>>   * 1) Caller passes in aligned values;
>>   * 2) mirror_cow_align is used only when target cluster is larger. */
>> -assert(QEMU_IS_ALIGNED(offset, s->granularity));
>> +assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
>>  /* The range is sector-aligned, since bdrv_getlength() rounds up. */
>> -assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
>> -nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
>> +assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
>> +nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
>>  
>>  while (s->buf_free_count < nb_chunks) {
>> -trace_mirror_yield_in_flight(s, offset, s->in_flight);
>> +trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
>>  mirror_wait_for_io(s);
>>  }
>>  
>> -/* Allocate a MirrorOp that is used as an AIO callback.  */
>> -op = g_new(MirrorOp, 1);
>> -op->s = s;
>> -op->offset = offset;
>> -op->bytes = bytes;
>> -
>>  /* Now make a 

Re: [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring

2017-09-18 Thread Max Reitz
On 2017-09-18 12:06, Stefan Hajnoczi wrote:
> On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
>> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
>>> On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
 This patch implements active synchronous mirroring.  In active mode, the
 passive mechanism will still be in place and is used to copy all
 initially dirty clusters off the source disk; but every write request
 will write data both to the source and the target disk, so the source
 cannot be dirtied faster than data is mirrored to the target.  Also,
 once the block job has converged (BLOCK_JOB_READY sent), source and
 target are guaranteed to stay in sync (unless an error occurs).

 Optionally, dirty data can be copied to the target disk on read
 operations, too.

 Active mode is completely optional and currently disabled at runtime.  A
 later patch will add a way for users to enable it.

 Signed-off-by: Max Reitz 
 ---
  qapi/block-core.json |  23 +++
  block/mirror.c   | 187 +--
  2 files changed, 205 insertions(+), 5 deletions(-)

 diff --git a/qapi/block-core.json b/qapi/block-core.json
 index bb11815608..e072cfa67c 100644
 --- a/qapi/block-core.json
 +++ b/qapi/block-core.json
 @@ -938,6 +938,29 @@
'data': ['top', 'full', 'none', 'incremental'] }
  
  ##
 +# @MirrorCopyMode:
 +#
 +# An enumeration whose values tell the mirror block job when to
 +# trigger writes to the target.
 +#
 +# @passive: copy data in background only.
 +#
 +# @active-write: when data is written to the source, write it
 +#(synchronously) to the target as well.  In addition,
 +#data is copied in background just like in @passive
 +#mode.
 +#
 +# @active-read-write: write data to the target (synchronously) both
 +# when it is read from and written to the source.
 +# In addition, data is copied in background just
 +# like in @passive mode.
>>>
>>> I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
>>> means committing the top-most BDS while the guest is accessing it.  The
>>> "passive" mirror block still works on the top-most BDS while the guest
>>> is accessing it.
>>>
>>> Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
>>> the terminology used in disk replication (e.g. DRBD).
>>
>> I'd be OK with that, too, but I think I remember that in the past at
>> least Kevin made a clear distinction between active/passive and
>> sync/async when it comes to mirroring.
>>
>>> Ideally the user wouldn't have to worry about async vs sync because QEMU
>>> would switch modes as appropriate in order to converge.  That way
>>> libvirt also doesn't have to worry about this.
>>
>> So here you mean async/sync in the way I meant it, i.e., whether the
>> mirror operations themselves are async/sync?
> 
> The meaning I had in mind is:
> 
> Sync mirroring means a guest write waits until the target write
> completes.

I.e. active-sync, ...

> Async mirroring means guest writes completes independently of target
> writes.

... i.e. passive or active-async in the future.

So you really want qemu to decide whether to use active or passive mode
depending on what's enough to let the block job converge and not
introduce any switch for the user?

I'm not sure whether I like this too much, mostly because "libvirt
doesn't have to worry" doesn't feel quite right to me.  If we don't make
libvirt worry about this, then qemu has to worry.  I'm not sure whether
that's any better.

I think this really does get into policy territory.  Just switching to
active mode the instant target writes are slower than source writes may
not be what the user wants: Maybe it's OK for a short duration because
they don't care about hard convergence too much.  Maybe they want to
switch to active mode already when "only" twice as much is written to
the target as to the source.

I think this is a decision the management layer (or the user) has to make.

Max
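The three proposed MirrorCopyMode values boil down to a small decision table over the two I/O paths. The sketch below is illustrative only; the real dispatch would live in block/mirror.c and the enum spelling follows the QAPI schema quoted above:

```c
typedef enum {
    MIRROR_COPY_MODE_PASSIVE,
    MIRROR_COPY_MODE_ACTIVE_WRITE,
    MIRROR_COPY_MODE_ACTIVE_READ_WRITE,
} MirrorCopyMode;

/* Does a guest write also go synchronously to the target? */
static int mirror_copy_on_write(MirrorCopyMode m)
{
    return m != MIRROR_COPY_MODE_PASSIVE;
}

/* Does a guest read push the dirty data it touches to the target? */
static int mirror_copy_on_read(MirrorCopyMode m)
{
    return m == MIRROR_COPY_MODE_ACTIVE_READ_WRITE;
}
```

Whatever naming wins (active/passive vs. sync/async), these two predicates are the behavioral difference being debated.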





Re: [Qemu-block] [PATCH 02/18] block: BDS deletion during bdrv_drain_recurse

2017-09-18 Thread Max Reitz
On 2017-09-18 05:44, Fam Zheng wrote:
> On Wed, 09/13 20:18, Max Reitz wrote:
>> Draining a BDS child may lead to both the original BDS and/or its other
>> children being deleted (e.g. if the original BDS represents a block
>> job).  We should prepare for this in both bdrv_drain_recurse() and
>> bdrv_drained_begin() by monitoring whether the BDS we are about to drain
>> still exists at all.
> 
> Can the deletion happen when IOThread calls
> bdrv_drain_recurse/bdrv_drained_begin?

I don't think so, because (1) my issue was draining a block job and that
can only be completed in the main loop, and (2) I would like to think
it's always impossible, considering that bdrv_unref() may only be called
with the BQL.

> If not, is it enough to do
> 
> ...
> if (in_main_loop) {
> bdrv_ref(bs);
> }
> ...
> if (in_main_loop) {
> bdrv_unref(bs);
> }
> 
> to protect the main loop case? So the BdrvDeletedStatus state is not needed.

We already have that in bdrv_drain_recurse(), don't we?

The issue here is, though, that QLIST_FOREACH_SAFE() stores the next
child pointer to @tmp.  However, once the current child @child is
drained, @tmp may no longer be valid -- it may have been detached from
@bs, and it may even have been deleted.

We could work around the latter by increasing the next child's reference
somehow (but BdrvChild doesn't really have a refcount, and in order to
do so, we would probably have to emulate being a parent or
something...), but then you'd still have the issue of @tmp being
detached from the children list we're trying to iterate over.  So
tmp->next is no longer valid.

Anyway, so the latter is the reason why I decided to introduce the bs_list.

But maybe that actually saves us from having to fiddle with BdrvChild...
Since it's just a list of BDSs now, it may be enough to simply
bdrv_ref() all of the BDSs in that list before draining any of them.  So
we'd keep creating the bs_list and then we'd move the existing
bdrv_ref() from the drain loop into the loop filling bs_list.

And adding a bdrv_ref()/bdrv_unref() pair to bdrv_drained_begin() should
hopefully work there, too.

Max
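The snapshot-and-reference approach sketched above can be modeled in miniature: instead of a FOREACH_SAFE walk whose cached next pointer can be invalidated, collect the nodes into a stable array and take a reference on each one before touching any of them. This is a toy refcounted list, not the real BDS/BdrvChild machinery, and "drain" here just models a node deleting itself mid-traversal:

```c
#include <stdlib.h>

typedef struct Node {
    int id;
    int refcnt;
    struct Node *next;
} Node;

static int frees;    /* how many nodes were actually freed (for testing) */
static int visits;   /* how many nodes were drained */

static Node *node_new(int id, Node *next)
{
    Node *n = malloc(sizeof(*n));
    n->id = id;
    n->refcnt = 1;   /* the list's reference */
    n->next = next;
    return n;
}

static void node_ref(Node *n)   { n->refcnt++; }
static void node_unref(Node *n) { if (--n->refcnt == 0) { frees++; free(n); } }

static void list_remove(Node **head, Node *n)
{
    while (*head && *head != n) {
        head = &(*head)->next;
    }
    if (*head) {
        *head = n->next;
        node_unref(n);   /* the list drops its reference */
    }
}

/* "Draining" a node may delete it (think: a block job BDS completing). */
static void drain(Node *n, Node **head)
{
    visits++;
    if (n->id % 2) {
        list_remove(head, n);
    }
}

/* Safe traversal: snapshot the list and reference every node up front,
 * so cached pointers stay valid even if draining mutates the list. */
static void drain_all(Node **head)
{
    Node *snap[16];
    int count = 0;

    for (Node *n = *head; n; n = n->next) {
        node_ref(n);
        snap[count++] = n;
    }
    for (int i = 0; i < count; i++) {
        drain(snap[i], head);
        node_unref(snap[i]);
    }
}
```

Deleted nodes survive until their snapshot reference is dropped, which is exactly what bdrv_ref()-ing everything in bs_list before the drain loop would buy.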



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v2 5/7] block/nbd-client: nbd_co_send_request: return -EIO if s->quit was set in parallel

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> It's incorrect to return success rc >= 0 if we skip qio_channel_writev_all()
> call due to s->quit.

Does this need to cc: qemu-stable for 2.10.1 (or put another way, can we
come up with some scenario of EAGAIN or other handling that would
actually set s->quit in a parallel coroutine when a client sends out
multiple requests at once)?

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index f331f08a9a..280147e6a7 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -189,6 +189,9 @@ static int nbd_co_send_request(BlockDriverState *bs,
>  }
>  
>  err:
> +if (rc >= 0 && s->quit) {
> +rc = -EIO;
> +}

I'm still not convinced this is in the right place.  This fails the
send_request regardless of whether we skipped qio_channel_writev_all();
shouldn't the rc be set only in the case that we actually skipped
writing the full command because s->quit was detected at that point in time?

>  if (rc < 0) {
>  s->quit = true;
>  s->requests[i].coroutine = NULL;
> 
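The ordering question Eric raises can be made concrete with a toy model: the -EIO override should apply only when the request never hit the wire because the quit flag was already set, not retroactively to a request that was written in full. This is illustrative only, not the real nbd_co_send_request():

```c
typedef struct {
    int quit;   /* set by any coroutine that hits a fatal error */
} Session;

#define MODEL_EIO 5   /* stand-in for errno EIO */

/* Returns 0 on success, -MODEL_EIO if the send was skipped. */
static int co_send_request(Session *s, int *wire_bytes, int request_len)
{
    int rc = 0;

    if (s->quit) {                  /* checked before writing anything */
        *wire_bytes = 0;            /* nothing written: fail this request */
        rc = -MODEL_EIO;
    } else {
        *wire_bytes = request_len;  /* whole request went out: rc stays 0 */
    }
    if (rc < 0) {
        s->quit = 1;                /* propagate failure to siblings */
    }
    return rc;
}
```

A blanket "rc >= 0 && s->quit" check after the send would instead fail the second case too, even though its bytes are already on the wire.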

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-18 Thread Eric Blake
On 09/18/2017 10:43 AM, Paolo Bonzini wrote:
> On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:
>> Read the whole reply in one place - in nbd_read_reply_entry.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>> ---
>>  block/nbd-client.h |  1 +
>>  block/nbd-client.c | 42 --
>>  2 files changed, 25 insertions(+), 18 deletions(-)
>>

> 
> I am not sure this is an improvement.  In principle you could have
> commands that read replies a bit at a time without using a QEMUIOVector.

Right now we don't, but the most likely point where this would be an
issue is the fact that we want to implement structured replies (the
server can send more than one response to a single request from the
client) in order to then implement block status queries (where the
server can send piecemeal information in response to a query, and the
client could very easily want to handle information as it comes in
rather than waiting for the entire server response, especially if the
amount of information returned by the server is not known a priori by
the client, the way the length is known in advance for NBD_CMD_READ, but
instead learned partway through the reply).  I guess the question
becomes a matter of whether we are over-constraining future additions by
making this refactoring, or whether we can still implement block status
queries using a single QEMUIOVector.
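Underlying this whole refactoring is the handle scheme that lets the reply-reading coroutine route a server reply back to the request slot that issued it. The sketch below captures the idea: a 64-bit handle encodes a small slot index XORed with a per-session cookie, so handles we never issued are detected. Constants and names are illustrative, not the exact macros from block/nbd-client.c:

```c
#include <stdint.h>

#define MAX_NBD_REQUESTS 16

static uint64_t index_to_handle(uint64_t session_cookie, unsigned index)
{
    return session_cookie ^ index;
}

/* Returns the slot index, or -1 for a bogus handle. */
static int handle_to_index(uint64_t session_cookie, uint64_t handle)
{
    uint64_t index = handle ^ session_cookie;

    return index < MAX_NBD_REQUESTS ? (int)index : -1;
}
```

A structured reply extension would reuse the same routing: several reply chunks carrying the same handle all land in the same slot, which is where the single-QEMUIOVector question bites.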

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v2 2/7] block/nbd-client: exit reply-reading coroutine on incorrect handle

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> Check reply-handle == request-handle in the same place, where

s/,//

> recv coroutine number is calculated from reply->handle and it's
> correctness checked - in nbd_read_reply_entry.
> 
> Also finish nbd_read_reply_entry in case of reply-handle !=
> request-handle in the same way as in case of incorrect reply-handle.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.h | 1 +
>  block/nbd-client.c | 8 ++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-18 Thread Paolo Bonzini
On 18/09/2017 15:59, Vladimir Sementsov-Ogievskiy wrote:
> Read the whole reply in one place - in nbd_read_reply_entry.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.h |  1 +
>  block/nbd-client.c | 42 --
>  2 files changed, 25 insertions(+), 18 deletions(-)
> 
> diff --git a/block/nbd-client.h b/block/nbd-client.h
> index b1900e6a6d..3f97d31013 100644
> --- a/block/nbd-client.h
> +++ b/block/nbd-client.h
> @@ -21,6 +21,7 @@ typedef struct {
>  Coroutine *coroutine;
>  bool receiving; /* waiting for read_reply_co? */
>  NBDRequest *request;
> +QEMUIOVector *qiov;
>  } NBDClientRequest;
>  
>  typedef struct NBDClientSession {
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 5f241ecc22..f7b642f079 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>  break;
>  }
>  
> +if (s->reply.error == 0 &&
> +s->requests[i].request->type == NBD_CMD_READ)
> +{
> +QEMUIOVector *qiov = s->requests[i].qiov;
> +assert(qiov != NULL);
> +assert(s->requests[i].request->len ==
> +   iov_size(qiov->iov, qiov->niov));
> +if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
> +  NULL) < 0)
> +{
> +s->reply.error = EIO;
> +break;
> +}
> +}

I am not sure this is an improvement.  In principle you could have
commands that read replies a bit at a time without using a QEMUIOVector.

Paolo

>  /* We're woken up again by the request itself.  Note that there
>   * is no race between yielding and reentering read_reply_co.  This
>   * is because:
> @@ -118,6 +133,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
>  s->read_reply_co = NULL;
>  }
>  
> +/* qiov should be provided for READ and WRITE */
>  static int nbd_co_send_request(BlockDriverState *bs,
> NBDRequest *request,
> QEMUIOVector *qiov)
> @@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
>  
>  request->handle = INDEX_TO_HANDLE(s, i);
>  s->requests[i].request = request;
> +s->requests[i].qiov = qiov;
>  
>  if (s->quit) {
>  rc = -EIO;
> @@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
>  goto err;
>  }
>  
> -if (qiov) {
> +if (s->requests[i].request->type == NBD_CMD_WRITE) {
> +assert(qiov);
>  qio_channel_set_cork(s->ioc, true);
>  rc = nbd_send_request(s->ioc, request);
>  if (rc >= 0 && !s->quit) {
> @@ -181,9 +199,7 @@ err:
>  return rc;
>  }
>  
> -static int nbd_co_receive_reply(NBDClientSession *s,
> -NBDRequest *request,
> -QEMUIOVector *qiov)
> +static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
>  {
>  int ret;
>  int i = HANDLE_TO_INDEX(s, request->handle);
> @@ -197,14 +213,6 @@ static int nbd_co_receive_reply(NBDClientSession *s,
>  } else {
>  assert(s->reply.handle == request->handle);
>  ret = -s->reply.error;
> -if (qiov && s->reply.error == 0) {
> -assert(request->len == iov_size(qiov->iov, qiov->niov));
> -if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
> -  NULL) < 0) {
> -ret = -EIO;
> -s->quit = true;
> -}
> -}
>  
>  /* Tell the read handler to read another header.  */
>  s->reply.handle = 0;
> @@ -232,16 +240,14 @@ static int nbd_co_request(BlockDriverState *bs,
>  NBDClientSession *client = nbd_get_client_session(bs);
>  int ret;
>  
> -assert(!qiov || request->type == NBD_CMD_WRITE ||
> -   request->type == NBD_CMD_READ);
> -ret = nbd_co_send_request(bs, request,
> -  request->type == NBD_CMD_WRITE ? qiov : NULL);
> +assert((qiov == NULL) !=
> +   (request->type == NBD_CMD_WRITE || request->type == NBD_CMD_READ));
> +ret = nbd_co_send_request(bs, request, qiov);
>  if (ret < 0) {
>  return ret;
>  }
>  
> -return nbd_co_receive_reply(client, request,
> -request->type == NBD_CMD_READ ? qiov : NULL);
> +return nbd_co_receive_reply(client, request);
>  }
>  
>  int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
> 




Re: [Qemu-block] [PATCH 00/18] block/mirror: Add active-sync mirroring

2017-09-18 Thread Max Reitz
On 2017-09-18 12:02, Stefan Hajnoczi wrote:
> On Sat, Sep 16, 2017 at 04:02:45PM +0200, Max Reitz wrote:
>> On 2017-09-14 17:42, Stefan Hajnoczi wrote:
>>> On Wed, Sep 13, 2017 at 08:18:52PM +0200, Max Reitz wrote:
 There may be a couple of things to do on top of this series:
 - Allow switching between active and passive mode at runtime: This
   should not be too difficult to implement, the main question is how to
   expose it to the user.
   (I seem to recall we wanted some form of block-job-set-option
   command...?)

 - Implement an asynchronous active mode: May be detrimental when it
   comes to convergence, but it might be nice to have anyway.  May or may
   not be complicated to implement.
>>>
>>> Ideally the user doesn't have to know about async vs sync.  It's an
>>> implementation detail.
>>>
>>> Async makes sense during the bulk copy phase (e.g. sync=full) because
>>> guest read/write latencies are mostly unaffected.  Once the entire
>>> device has been copied there are probably still dirty blocks left
>>> because the guest touched them while the mirror job was running.  At
>>> that point it definitely makes sense to switch to synchronous mirroring
>>> in order to converge.
>>
>> Makes sense, but I'm not sure whether it really is just an
>> implementation detail.  If you're in the bulk copy phase in active/async
>> mode and you have enough write requests with the target being slow
>> enough, I suspect you might still not get convergence then (because the
>> writes to the target yield for a long time while ever more write
>> requests pile up) -- so then you'd just shift the dirty tracking from
>> the bitmap to a list of requests in progress.
>>
>> And I think we do want the bulk copy phase to guarantee convergence,
>> too, usually (when active/foreground/synchronous mode is selected).  If
>> we don't, then that's a policy decision and would be up to libvirt, as I
>> see it.
> 
> This is a good point.  Bulk copy should converge too.
> 
> Can we measure the target write rate and guest write rate?  A heuristic
> can choose between async vs sync based on the write rates.
> 
> For example, if the guest write rate has been larger than the target
> write rate for the past 10 seconds during the bulk phase, switch to
> synchronous mirroring.

I guess we can just count how many unfinished target write requests are
piling up.

...or libvirt can simply see that the block job is not progressing and
switch the mode. :-)

Max
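Stefan's rate-based heuristic can be sketched as a tiny policy object: track how long the guest write rate has exceeded the target write rate, and flip the mirror to synchronous (active) mode once it has lagged for a whole window. Names and thresholds here are made up for illustration:

```c
#define CONVERGE_WINDOW 10   /* seconds */

typedef struct {
    int behind_for;    /* consecutive seconds with guest rate > target rate */
    int active_mode;   /* 0 = asynchronous/passive, 1 = synchronous/active */
} MirrorPolicy;

/* Call once per second with the bytes written during that second. */
static void policy_tick(MirrorPolicy *p, long guest_bytes, long target_bytes)
{
    if (guest_bytes > target_bytes) {
        if (++p->behind_for >= CONVERGE_WINDOW) {
            p->active_mode = 1;   /* cannot converge: go synchronous */
        }
    } else {
        p->behind_for = 0;        /* target keeping up again */
    }
}
```

Whether this lives in qemu or the management layer is exactly the policy question being debated; counting in-flight target writes instead of bytes would be an equivalent signal.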





Re: [Qemu-block] [PATCH v2 1/7] block/nbd-client: refactor nbd_co_receive_reply

2017-09-18 Thread Eric Blake
On 09/18/2017 08:59 AM, Vladimir Sementsov-Ogievskiy wrote:
> "NBDReply *reply" parameter of nbd_co_receive_reply is used only
> to pass return value for nbd_co_request (reply.error). Remove it
> and use function return value instead.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/nbd-client.c | 30 +++---
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v3 4/7] block: don't use constant 512 as sector size in crypto driver

2017-09-18 Thread Daniel P. Berrange
On Sat, Sep 16, 2017 at 06:24:56PM +0200, Max Reitz wrote:
> On 2017-09-12 13:28, Daniel P. Berrange wrote:
> > Use the qcrypto_block_get_sector_size() value in the block
> > crypto driver instead of hardcoding 512 as the sector size.
> > 
> > Signed-off-by: Daniel P. Berrange 
> > ---
> >  block/crypto.c | 34 ++
> >  1 file changed, 18 insertions(+), 16 deletions(-)
> > 
> > diff --git a/block/crypto.c b/block/crypto.c
> > index d68cbac2ac..49d6d4c058 100644
> > --- a/block/crypto.c
> > +++ b/block/crypto.c
> > @@ -392,8 +392,9 @@ block_crypto_co_readv(BlockDriverState *bs, int64_t sector_num,
> >  uint8_t *cipher_data = NULL;
> >  QEMUIOVector hd_qiov;
> >  int ret = 0;
> > +uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
> >  uint64_t payload_offset =
> > -qcrypto_block_get_payload_offset(crypto->block) / 512;
> > +qcrypto_block_get_payload_offset(crypto->block) / sector_size;
> >  assert(payload_offset < (INT64_MAX / 512));
> >  
> >  qemu_iovec_init(&hd_qiov, qiov->niov);
> > @@ -401,9 +402,9 @@ block_crypto_co_readv(BlockDriverState *bs, int64_t sector_num,
> >  /* Bounce buffer because we don't wish to expose cipher text
> >   * in qiov which points to guest memory.
> >   */
> > -cipher_data =
> > -qemu_try_blockalign(bs->file->bs, MIN(BLOCK_CRYPTO_MAX_SECTORS * 512,
> > -  qiov->size));
> > +cipher_data = qemu_try_blockalign(
> > +bs->file->bs, MIN(BLOCK_CRYPTO_MAX_SECTORS * sector_size,
> > +  qiov->size));
> >  if (cipher_data == NULL) {
> >  ret = -ENOMEM;
> >  goto cleanup;
> > @@ -417,7 +418,7 @@ block_crypto_co_readv(BlockDriverState *bs, int64_t sector_num,
> >  }
> >  
> >  qemu_iovec_reset(&hd_qiov);
> > -qemu_iovec_add(&hd_qiov, cipher_data, cur_nr_sectors * 512);
> > +qemu_iovec_add(&hd_qiov, cipher_data, cur_nr_sectors * sector_size);
> 
> cur_nr_sectors is based on remaining_sectors; which in turn is a
> parameter to this function and comes from the block layer.  Therefore
> its unit is BDRV_SECTOR_SIZE and not the crypto driver's sector size.
> 
> Same in the hunk below, and in block_crypto_co_writev().
> 
> >  
> >  ret = bdrv_co_readv(bs->file,
> >  payload_offset + sector_num,
> 
> Same thing here, albeit in a different variation: The unit of this
> parameter of bdrv_co_readv() (start sector index) is a block layer
> sector, whose size is always BDRV_SECTOR_SIZE.
> 
> Therefore you cannot divide the result from
> qcrypto_block_get_payload_offset() by the crypto driver's sector size
> and then use it as a sector index here.
> 
> Same in block_crypto_co_writev().
> 
> 
> I assume that you fix this in the next patch, but for now it's just wrong.

Yeah, in retrospect I should have kept this patch using BDRV_SECTOR_SIZE
throughout, as previous versions had it, so I'm going to go back to doing
that and only use the encryption sector size in a later patch, once we
have already switched to doing byte-based I/O.
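The unit mix-up Max points out is easy to state in isolation: a byte offset used as a bdrv_co_readv() sector index must be divided by the block layer's fixed 512-byte sector, not by the crypto layer's own (possibly larger) sector size. A small self-contained illustration, with made-up helper names:

```c
#include <stdint.h>

#define BDRV_SECTOR_SIZE 512ULL

static int64_t bytes_to_blk_sectors(uint64_t offset_bytes)
{
    return (int64_t)(offset_bytes / BDRV_SECTOR_SIZE);
}

/* What the buggy version computed: only equal to the above when the
 * crypto sector size happens to be 512. */
static int64_t bytes_to_crypto_sectors(uint64_t offset_bytes,
                                       uint64_t crypto_sector_size)
{
    return (int64_t)(offset_bytes / crypto_sector_size);
}
```

For a 2 MiB payload offset and a 4096-byte crypto sector, the two results differ by a factor of eight, so the read would land at the wrong place in the image.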


Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|



Re: [Qemu-block] [PATCH v5 6/6] tests: Add check-qobject for equality tests

2017-09-18 Thread Eric Blake
On 09/16/2017 01:51 PM, Max Reitz wrote:
> Add a new test file (check-qobject.c) for unit tests that concern
> QObjects as a whole.
> 
> Its only purpose for now is to test the qobject_is_equal() function.
> 
> Signed-off-by: Max Reitz 
> ---
>  tests/Makefile.include |   4 +-
>  tests/check-qobject.c  | 315 
> +

We're not entirely consistent on our testsuite naming, although I'm fine
with your choice [and there's still the idea floating around in another
thread to split tests into subdirectories by whether they are part of
check-unit or check-qtest].  However, the check-* notation cannot be
automatically covered by .gitignore the way a -test suffix is, so you're
missing a change to tests/.gitignore covering the new test in an in-tree
build.

We have the idea floating around that we want to outlaw in-tree builds
altogether [although I'm not sold on that yet], and/or to generate
.gitignore, but until those land, it doesn't hurt to keep .gitignore
up-to-date.  So with that addition,

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v5 5/6] iotests: Add test for non-string option reopening

2017-09-18 Thread Eric Blake
On 09/16/2017 01:51 PM, Max Reitz wrote:
> Reviewed-by: Kevin Wolf 
> Signed-off-by: Max Reitz 
> ---
>  tests/qemu-iotests/133 | 9 +
>  tests/qemu-iotests/133.out | 5 +
>  2 files changed, 14 insertions(+)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v5 3/6] qapi: Add qobject_is_equal()

2017-09-18 Thread Eric Blake
On 09/16/2017 01:51 PM, Max Reitz wrote:
> This generic function (along with its implementations for different
> types) determines whether two QObjects are equal.
> 
> Signed-off-by: Max Reitz 
> ---

> +case QNUM_DOUBLE:
> +switch (num_y->kind) {
> +case QNUM_I64:
> +case QNUM_U64:
> +return false;
> +case QNUM_DOUBLE:
> +/* Comparison in native double type */
> +return num_x->u.dbl == num_y->u.dbl;

I saw comments mentioning the odd behavior of NaN; one other odd
behavior is that this treats -0.0 as equal to 0.0.  But that's fine by
me (it matches C semantics of double == double).
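Python's float `==` follows the same IEEE 754 rules as C's `double ==`, so both behaviors can be checked directly with a one-line model:

```python
def qnum_double_equal(x, y):
    # Mirrors the C code's num_x->u.dbl == num_y->u.dbl: IEEE 754
    # equality, under which NaN != NaN while -0.0 == 0.0.
    return x == y
```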

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v5 2/6] qapi/qlist: Add qlist_append_null() macro

2017-09-18 Thread Eric Blake
On 09/16/2017 01:51 PM, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  include/qapi/qmp/qlist.h | 3 +++
>  1 file changed, 3 insertions(+)

Please update scripts/coccinelle/qobject.cocci to also cover this (since
we updated it for qdict_put_null() in 0f9afc2).

With that addition,
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v7 12/16] migration: add postcopy migration of dirty bitmaps

2017-09-18 Thread Vladimir Sementsov-Ogievskiy

10.07.2017 19:30, Vladimir Sementsov-Ogievskiy wrote:

Postcopy migration of dirty bitmaps. Only named dirty bitmaps
associated with root nodes and non-root named nodes are migrated.

If the destination QEMU already contains a dirty bitmap with the same name
as a migrated bitmap (for the same node), the migration proceeds if their
granularities match; otherwise an error is generated.

If the destination QEMU doesn't contain such a bitmap, it will be created.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
  include/migration/misc.h   |   3 +
  migration/Makefile.objs|   1 +
  migration/block-dirty-bitmap.c | 700 +
  migration/migration.c  |   3 +
  migration/migration.h  |   3 +
  migration/savevm.c |   2 +
  migration/trace-events |  14 +
  vl.c   |   1 +
  8 files changed, 727 insertions(+)
  create mode 100644 migration/block-dirty-bitmap.c

diff --git a/include/migration/misc.h b/include/migration/misc.h
index 22551216bb..c53689eb0f 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -56,4 +56,7 @@ bool migration_in_postcopy_after_devices(MigrationState *);
  void migration_only_migratable_set(void);
  void migration_global_dump(Monitor *mon);
  
+/* migration/block-dirty-bitmap.c */

+void dirty_bitmap_mig_init(void);
+
  #endif
diff --git a/migration/Makefile.objs b/migration/Makefile.objs
index 1c7770da46..525cc8293e 100644
--- a/migration/Makefile.objs
+++ b/migration/Makefile.objs
@@ -6,6 +6,7 @@ common-obj-y += qemu-file.o global_state.o
  common-obj-y += qemu-file-channel.o
  common-obj-y += xbzrle.o postcopy-ram.o
  common-obj-y += qjson.o
+common-obj-y += block-dirty-bitmap.o
  
  common-obj-$(CONFIG_RDMA) += rdma.o
  
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c

new file mode 100644
index 00..e4cd17bcc8
--- /dev/null
+++ b/migration/block-dirty-bitmap.c
@@ -0,0 +1,700 @@
+/*
+ * Block dirty bitmap postcopy migration
+ *
+ * Copyright IBM, Corp. 2009
+ * Copyright (C) 2016 Parallels IP Holdings GmbH. All rights reserved.
+ *
+ * Authors:
+ *  Liran Schour   
+ *  Vladimir Sementsov-Ogievskiy 
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ * This file is derived from migration/block.c, so its author and IBM
+ * copyright are here, although the content is quite different.
+ *
+ * Contributions after 2012-01-13 are licensed under the terms of the
+ * GNU GPL, version 2 or (at your option) any later version.
+ *
+ ****
+ *
+ * Here postcopy migration of dirty bitmaps is implemented. Only named dirty
+ * bitmaps associated with root nodes and non-root named nodes are migrated.
+ *
+ * If the destination QEMU already contains a dirty bitmap with the same name
+ * as a migrated bitmap (for the same node), the migration proceeds if their
+ * granularities match; otherwise an error is generated.
+ *
+ * If the destination QEMU doesn't contain such a bitmap, it will be created.
+ *
+ * format of migration:
+ *
+ * # Header (shared for different chunk types)
+ * 1, 2 or 4 bytes: flags (see qemu_{get,put}_flags)
+ * [ 1 byte: node name size ] \  flags & DEVICE_NAME
+ * [ n bytes: node name ] /
+ * [ 1 byte: bitmap name size ] \  flags & BITMAP_NAME
+ * [ n bytes: bitmap name ] /
+ *
+ * # Start of bitmap migration (flags & START)
+ * header
+ * be64: granularity
+ * 1 byte: bitmap enabled flag


note: this will be changed to 1 byte of flags, to also transfer bitmap
persistence.



+ *
+ * # Complete of bitmap migration (flags & COMPLETE)
+ * header
+ *
+ * # Data chunk of bitmap migration
+ * header
+ * be64: start sector
+ * be32: number of sectors
+ * [ be64: buffer size  ] \ ! (flags & ZEROES)
+ * [ n bytes: buffer] /
+ *
+ * The last chunk in the stream should contain flags & EOS. The chunk may skip
+ * device and/or bitmap names, assuming them to be the same as in the previous
+ * chunk.
+ */
+
+#include "qemu/osdep.h"
+#include "block/block.h"
+#include "block/block_int.h"
+#include "sysemu/block-backend.h"
+#include "qemu/main-loop.h"
+#include "qemu/error-report.h"
+#include "migration/misc.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "migration/vmstate.h"
+#include "migration/register.h"
+#include "qemu/hbitmap.h"
+#include "sysemu/sysemu.h"
+#include "qemu/cutils.h"
+#include "qapi/error.h"
+#include "trace.h"
+
+#define CHUNK_SIZE (1 << 10)
+
+/* Flags occupy one, two or four bytes (Big Endian). The size is determined as
+ * follows:
+ * in first (most significant) byte bit 8 is clear  -->  one byte
+ * in first byte bit 8 is set  -->  two or four bytes, depending on second
+ *                                  byte:
+ *     | in second byte bit 8 is clear 
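The variable-width flags field described in this comment (the quoted text is truncated by the archive) can be sketched as follows; this model assumes the top bit of each of the first two bytes acts as a "more bytes follow" marker and stays part of the returned value:

```python
def decode_flags(buf):
    """Decode a 1-, 2- or 4-byte big-endian flags field.

    Returns (flags, width): bit 8 of the first byte clear means one
    byte; set means look at bit 8 of the second byte to pick two or
    four bytes in total.
    """
    if not buf[0] & 0x80:
        return buf[0], 1
    if not buf[1] & 0x80:
        return (buf[0] << 8) | buf[1], 2
    return (buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3], 4
```

Defined flag bits would simply avoid the marker positions, so the markers can be carried in the value itself.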

[Qemu-block] ping Re: [Qemu-devel] [PATCH v7 03/16] migration: split common postcopy out of ram postcopy

2017-09-18 Thread Vladimir Sementsov-Ogievskiy

ping for 1-3
Can we merge them?

22.08.2017 02:34, John Snow wrote:


On 07/11/2017 09:38 AM, Vladimir Sementsov-Ogievskiy wrote:

11.07.2017 16:06, Dr. David Alan Gilbert wrote:

* Vladimir Sementsov-Ogievskiy (vsement...@virtuozzo.com) wrote:

Split common postcopy stuff from RAM postcopy stuff.

Signed-off-by: Vladimir Sementsov-Ogievskiy 

I'm OK with this; I'm not sure I'd have bothered making the pings
dependent on it being RAM.

(These first few are pretty much a separable series).

It would be great if you (or Juan?) could take them separately.


Yes please. I don't think this ever happened, did it? Can we split off
1-3 and re-roll?




--
Best regards,
Vladimir




Re: [Qemu-block] [PATCH v3 7/7] block: support passthrough of BDRV_REQ_FUA in crypto driver

2017-09-18 Thread Eric Blake
On 09/12/2017 06:28 AM, Daniel P. Berrange wrote:
> The BDRV_REQ_FUA flag can trivially be allowed in the crypt driver
> as a passthrough to the underlying block driver.
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  block/crypto.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH v2 7/7] block/nbd-client: do not yield from nbd_read_reply_entry

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
Refactor the nbd client to not yield from nbd_read_reply_entry. This is
now possible, as all reading is done in nbd_read_reply_entry and all
request-related data is stored in the per-request s->requests[i].

Some additional work is needed on s->requests[i].ret and s->quit so
that requests are not failed and reading errors are not reported on a
normal disconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.c | 29 ++---
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index f80a4c5564..bf20d5d5e6 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -79,7 +79,11 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 while (!s->quit) {
 ret = nbd_receive_reply(s->ioc, &reply, &local_err);
 if (ret < 0) {
-error_report_err(local_err);
+if (s->quit) {
+error_free(local_err);
+} else {
+error_report_err(local_err);
+}
 }
 if (ret <= 0 || s->quit) {
 break;
@@ -112,19 +116,8 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 }
 }
 
-/* We're woken up again by the request itself.  Note that there
- * is no race between yielding and reentering read_reply_co.  This
- * is because:
- *
- * - if the request runs on the same AioContext, it is only
- *   entered after we yield
- *
- * - if the request runs on a different AioContext, reentering
- *   read_reply_co happens through a bottom half, which can only
- *   run after we yield.
- */
+s->requests[i].receiving = false;
 aio_co_wake(s->requests[i].coroutine);
-qemu_coroutine_yield();
 }
 
 s->quit = true;
@@ -157,6 +150,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 
 s->requests[i].coroutine = qemu_coroutine_self();
 s->requests[i].receiving = false;
+s->requests[i].ret = -EIO;
 
 request->handle = INDEX_TO_HANDLE(s, i);
 s->requests[i].request = request;
@@ -210,7 +204,7 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
 s->requests[i].receiving = true;
 qemu_coroutine_yield();
 s->requests[i].receiving = false;
-if (!s->ioc || s->quit) {
+if (!s->ioc) {
 ret = -EIO;
 } else {
 ret = s->requests[i].ret;
@@ -218,11 +212,6 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
 
 s->requests[i].coroutine = NULL;
 
-/* Kick the read_reply_co to get the next reply.  */
-if (s->read_reply_co) {
-aio_co_wake(s->read_reply_co);
-}
-
qemu_co_mutex_lock(&s->send_mutex);
 s->in_flight--;
qemu_co_queue_next(&s->free_sema);
@@ -364,6 +353,8 @@ void nbd_client_close(BlockDriverState *bs)
 
nbd_send_request(client->ioc, &request);
 
+client->quit = true;
+
 nbd_teardown_connection(bs);
 }
 
-- 
2.11.1




[Qemu-block] [PATCH v2 3/7] block/nbd-client: refactor reading reply

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
Read the whole reply in one place - in nbd_read_reply_entry.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.h |  1 +
 block/nbd-client.c | 42 --
 2 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index b1900e6a6d..3f97d31013 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -21,6 +21,7 @@ typedef struct {
 Coroutine *coroutine;
 bool receiving; /* waiting for read_reply_co? */
 NBDRequest *request;
+QEMUIOVector *qiov;
 } NBDClientRequest;
 
 typedef struct NBDClientSession {
diff --git a/block/nbd-client.c b/block/nbd-client.c
index 5f241ecc22..f7b642f079 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -98,6 +98,21 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 break;
 }
 
+if (s->reply.error == 0 &&
+s->requests[i].request->type == NBD_CMD_READ)
+{
+QEMUIOVector *qiov = s->requests[i].qiov;
+assert(qiov != NULL);
+assert(s->requests[i].request->len ==
+   iov_size(qiov->iov, qiov->niov));
+if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
+  NULL) < 0)
+{
+s->reply.error = EIO;
+break;
+}
+}
+
 /* We're woken up again by the request itself.  Note that there
  * is no race between yielding and reentering read_reply_co.  This
  * is because:
@@ -118,6 +133,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 s->read_reply_co = NULL;
 }
 
+/* qiov should be provided for READ and WRITE */
 static int nbd_co_send_request(BlockDriverState *bs,
NBDRequest *request,
QEMUIOVector *qiov)
@@ -145,6 +161,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 
 request->handle = INDEX_TO_HANDLE(s, i);
 s->requests[i].request = request;
+s->requests[i].qiov = qiov;
 
 if (s->quit) {
 rc = -EIO;
@@ -155,7 +172,8 @@ static int nbd_co_send_request(BlockDriverState *bs,
 goto err;
 }
 
-if (qiov) {
+if (s->requests[i].request->type == NBD_CMD_WRITE) {
+assert(qiov);
 qio_channel_set_cork(s->ioc, true);
 rc = nbd_send_request(s->ioc, request);
 if (rc >= 0 && !s->quit) {
@@ -181,9 +199,7 @@ err:
 return rc;
 }
 
-static int nbd_co_receive_reply(NBDClientSession *s,
-NBDRequest *request,
-QEMUIOVector *qiov)
+static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
 {
 int ret;
 int i = HANDLE_TO_INDEX(s, request->handle);
@@ -197,14 +213,6 @@ static int nbd_co_receive_reply(NBDClientSession *s,
 } else {
 assert(s->reply.handle == request->handle);
 ret = -s->reply.error;
-if (qiov && s->reply.error == 0) {
-assert(request->len == iov_size(qiov->iov, qiov->niov));
-if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
-  NULL) < 0) {
-ret = -EIO;
-s->quit = true;
-}
-}
 
 /* Tell the read handler to read another header.  */
 s->reply.handle = 0;
@@ -232,16 +240,14 @@ static int nbd_co_request(BlockDriverState *bs,
 NBDClientSession *client = nbd_get_client_session(bs);
 int ret;
 
-assert(!qiov || request->type == NBD_CMD_WRITE ||
-   request->type == NBD_CMD_READ);
-ret = nbd_co_send_request(bs, request,
-  request->type == NBD_CMD_WRITE ? qiov : NULL);
+assert((qiov == NULL) !=
+   (request->type == NBD_CMD_WRITE || request->type == NBD_CMD_READ));
+ret = nbd_co_send_request(bs, request, qiov);
 if (ret < 0) {
 return ret;
 }
 
-return nbd_co_receive_reply(client, request,
-request->type == NBD_CMD_READ ? qiov : NULL);
+return nbd_co_receive_reply(client, request);
 }
 
 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
-- 
2.11.1




[Qemu-block] [PATCH v2 5/7] block/nbd-client: nbd_co_send_request: return -EIO if s->quit was set in parallel

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
It's incorrect to return success (rc >= 0) if we skip the
qio_channel_writev_all() call due to s->quit.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index f331f08a9a..280147e6a7 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -189,6 +189,9 @@ static int nbd_co_send_request(BlockDriverState *bs,
 }
 
 err:
+if (rc >= 0 && s->quit) {
+rc = -EIO;
+}
 if (rc < 0) {
 s->quit = true;
 s->requests[i].coroutine = NULL;
-- 
2.11.1




[Qemu-block] [PATCH v2 4/7] block/nbd-client: drop reply field from NBDClientSession

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
Drop 'reply' from NBDClientSession. It is used only to deliver the request
return code to the receiving coroutine. Instead, introduce a per-request ret
variable to simplify the scheme and make further refactoring possible.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.h |  2 +-
 block/nbd-client.c | 22 +-
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index 3f97d31013..4bc8fe3993 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -22,6 +22,7 @@ typedef struct {
 bool receiving; /* waiting for read_reply_co? */
 NBDRequest *request;
 QEMUIOVector *qiov;
+int ret;
 } NBDClientRequest;
 
 typedef struct NBDClientSession {
@@ -35,7 +36,6 @@ typedef struct NBDClientSession {
 int in_flight;
 
 NBDClientRequest requests[MAX_NBD_REQUESTS];
-NBDReply reply;
 bool quit;
 } NBDClientSession;
 
diff --git a/block/nbd-client.c b/block/nbd-client.c
index f7b642f079..f331f08a9a 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -74,10 +74,10 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 uint64_t i;
 int ret = 0;
 Error *local_err = NULL;
+NBDReply reply;
 
 while (!s->quit) {
-assert(s->reply.handle == 0);
-ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
+ret = nbd_receive_reply(s->ioc, &reply, &local_err);
 if (ret < 0) {
 error_report_err(local_err);
 }
@@ -89,18 +89,18 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
  * handler acts as a synchronization point and ensures that only
  * one coroutine is called until the reply finishes.
  */
-i = HANDLE_TO_INDEX(s, s->reply.handle);
+i = HANDLE_TO_INDEX(s, reply.handle);
 if (i >= MAX_NBD_REQUESTS ||
 !s->requests[i].coroutine ||
 !s->requests[i].receiving ||
-s->reply.handle != s->requests[i].request->handle)
+reply.handle != s->requests[i].request->handle)
 {
 break;
 }
 
-if (s->reply.error == 0 &&
-s->requests[i].request->type == NBD_CMD_READ)
-{
+s->requests[i].ret = -reply.error;
+
+if (reply.error == 0 && s->requests[i].request->type == NBD_CMD_READ) {
 QEMUIOVector *qiov = s->requests[i].qiov;
 assert(qiov != NULL);
 assert(s->requests[i].request->len ==
@@ -108,7 +108,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
   NULL) < 0)
 {
-s->reply.error = EIO;
+s->requests[i].ret = -EIO;
 break;
 }
 }
@@ -211,11 +211,7 @@ static int nbd_co_receive_reply(NBDClientSession *s, NBDRequest *request)
 if (!s->ioc || s->quit) {
 ret = -EIO;
 } else {
-assert(s->reply.handle == request->handle);
-ret = -s->reply.error;
-
-/* Tell the read handler to read another header.  */
-s->reply.handle = 0;
+ret = s->requests[i].ret;
 }
 
 s->requests[i].coroutine = NULL;
-- 
2.11.1




[Qemu-block] [PATCH v2 1/7] block/nbd-client: refactor nbd_co_receive_reply

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
"NBDReply *reply" parameter of nbd_co_receive_reply is used only
to pass return value for nbd_co_request (reply.error). Remove it
and use function return value instead.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index ee7f758e68..acd8e5d007 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -178,26 +178,26 @@ err:
 return rc;
 }
 
-static void nbd_co_receive_reply(NBDClientSession *s,
- NBDRequest *request,
- NBDReply *reply,
- QEMUIOVector *qiov)
+static int nbd_co_receive_reply(NBDClientSession *s,
+NBDRequest *request,
+QEMUIOVector *qiov)
 {
+int ret;
 int i = HANDLE_TO_INDEX(s, request->handle);
 
 /* Wait until we're woken up by nbd_read_reply_entry.  */
 s->requests[i].receiving = true;
 qemu_coroutine_yield();
 s->requests[i].receiving = false;
-*reply = s->reply;
-if (reply->handle != request->handle || !s->ioc || s->quit) {
-reply->error = EIO;
+if (s->reply.handle != request->handle || !s->ioc || s->quit) {
+ret = -EIO;
 } else {
-if (qiov && reply->error == 0) {
+ret = -s->reply.error;
+if (qiov && s->reply.error == 0) {
 assert(request->len == iov_size(qiov->iov, qiov->niov));
 if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
   NULL) < 0) {
-reply->error = EIO;
+ret = -EIO;
 s->quit = true;
 }
 }
@@ -217,6 +217,8 @@ static void nbd_co_receive_reply(NBDClientSession *s,
 s->in_flight--;
qemu_co_queue_next(&s->free_sema);
qemu_co_mutex_unlock(&s->send_mutex);
+
+return ret;
 }
 
 static int nbd_co_request(BlockDriverState *bs,
@@ -224,7 +226,6 @@ static int nbd_co_request(BlockDriverState *bs,
   QEMUIOVector *qiov)
 {
 NBDClientSession *client = nbd_get_client_session(bs);
-NBDReply reply;
 int ret;
 
 assert(!qiov || request->type == NBD_CMD_WRITE ||
@@ -232,12 +233,11 @@ static int nbd_co_request(BlockDriverState *bs,
 ret = nbd_co_send_request(bs, request,
   request->type == NBD_CMD_WRITE ? qiov : NULL);
 if (ret < 0) {
-reply.error = -ret;
-} else {
-nbd_co_receive_reply(client, request, &reply,
- request->type == NBD_CMD_READ ? qiov : NULL);
+return ret;
 }
-return -reply.error;
+
+return nbd_co_receive_reply(client, request,
+request->type == NBD_CMD_READ ? qiov : NULL);
 }
 
 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
-- 
2.11.1




[Qemu-block] [PATCH v2 2/7] block/nbd-client: exit reply-reading coroutine on incorrect handle

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
Check reply.handle == request.handle in the same place where the
receiving coroutine number is calculated from reply->handle and its
correctness is checked: in nbd_read_reply_entry.

Also, finish nbd_read_reply_entry in the case of reply.handle !=
request.handle in the same way as in the case of an otherwise incorrect
reply handle.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.h | 1 +
 block/nbd-client.c | 8 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/nbd-client.h b/block/nbd-client.h
index b435754b82..b1900e6a6d 100644
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -20,6 +20,7 @@
 typedef struct {
 Coroutine *coroutine;
 bool receiving; /* waiting for read_reply_co? */
+NBDRequest *request;
 } NBDClientRequest;
 
 typedef struct NBDClientSession {
diff --git a/block/nbd-client.c b/block/nbd-client.c
index acd8e5d007..5f241ecc22 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -92,7 +92,9 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 i = HANDLE_TO_INDEX(s, s->reply.handle);
 if (i >= MAX_NBD_REQUESTS ||
 !s->requests[i].coroutine ||
-!s->requests[i].receiving) {
+!s->requests[i].receiving ||
+s->reply.handle != s->requests[i].request->handle)
+{
 break;
 }
 
@@ -142,6 +144,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
 s->requests[i].receiving = false;
 
 request->handle = INDEX_TO_HANDLE(s, i);
+s->requests[i].request = request;
 
 if (s->quit) {
 rc = -EIO;
@@ -189,9 +192,10 @@ static int nbd_co_receive_reply(NBDClientSession *s,
 s->requests[i].receiving = true;
 qemu_coroutine_yield();
 s->requests[i].receiving = false;
-if (s->reply.handle != request->handle || !s->ioc || s->quit) {
+if (!s->ioc || s->quit) {
 ret = -EIO;
 } else {
+assert(s->reply.handle == request->handle);
 ret = -s->reply.error;
 if (qiov && s->reply.error == 0) {
 assert(request->len == iov_size(qiov->iov, qiov->niov));
-- 
2.11.1
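For context, the handle/index mapping that this patch's check strengthens can be modeled like so (a sketch; in the real HANDLE_TO_INDEX/INDEX_TO_HANDLE macros the per-session token is the BlockDriverState pointer cast to uint64_t):

```python
MAX_NBD_REQUESTS = 16

def index_to_handle(session_token, index):
    # Like INDEX_TO_HANDLE: fold a per-session token into the slot index.
    return index ^ session_token

def handle_to_index(session_token, handle):
    # Like HANDLE_TO_INDEX: XOR is self-inverse, so this recovers the index.
    return handle ^ session_token
```

A reply carrying a handle from another session would usually map to an index >= MAX_NBD_REQUESTS, which is what the bounds check in nbd_read_reply_entry catches; the new handle comparison closes the remaining gap.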




Re: [Qemu-block] [PATCH v5 1/6] qapi/qnull: Add own header

2017-09-18 Thread Eric Blake
On 09/16/2017 01:51 PM, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  include/qapi/qmp/qdict.h|  1 +
>  include/qapi/qmp/qnull.h| 30 ++
>  include/qapi/qmp/qobject.h  | 12 
>  include/qapi/qmp/types.h|  1 +
>  qapi/qapi-clone-visitor.c   |  1 +
>  qapi/string-input-visitor.c |  1 +
>  qobject/qnull.c |  2 +-
>  tests/check-qnull.c |  2 +-
>  8 files changed, 36 insertions(+), 14 deletions(-)
>  create mode 100644 include/qapi/qmp/qnull.h
> 

> +++ b/include/qapi/qmp/qnull.h
> @@ -0,0 +1,30 @@
> +/*
> + * QNull
> + *
> + * Copyright (C) 2015 Red Hat, Inc.

Worth also claiming 2017?

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH v2 6/7] block/nbd-client: early fail nbd_read_reply_entry if s->quit is set

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
Do not continue any operation if s->quit is set in parallel.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 block/nbd-client.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/nbd-client.c b/block/nbd-client.c
index 280147e6a7..f80a4c5564 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -81,7 +81,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 if (ret < 0) {
 error_report_err(local_err);
 }
-if (ret <= 0) {
+if (ret <= 0 || s->quit) {
 break;
 }
 
@@ -105,9 +105,8 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
 assert(qiov != NULL);
 assert(s->requests[i].request->len ==
iov_size(qiov->iov, qiov->niov));
-if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
-  NULL) < 0)
-{
+ret = qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov, NULL);
+if (ret < 0 || s->quit) {
 s->requests[i].ret = -EIO;
 break;
 }
-- 
2.11.1




[Qemu-block] [PATCH v2 0/7] nbd client refactoring and fixing

2017-09-18 Thread Vladimir Sementsov-Ogievskiy
v2: Actually the tail of the v1 series, with changes and additions, rebased
  on recent upstream work (which already includes some patches from v1)

Vladimir Sementsov-Ogievskiy (7):
  block/nbd-client: refactor nbd_co_receive_reply
  block/nbd-client: exit reply-reading coroutine on incorrect handle
  block/nbd-client: refactor reading reply
  block/nbd-client: drop reply field from NBDClientSession
  block/nbd-client: nbd_co_send_request: return -EIO if s->quit was set
in parallel
  block/nbd-client: early fail nbd_read_reply_entry if s->quit is set
  block/nbd-client: do not yield from nbd_read_reply_entry

 block/nbd-client.h |   4 ++-
 block/nbd-client.c | 103 ++---
 2 files changed, 54 insertions(+), 53 deletions(-)

-- 
2.11.1




Re: [Qemu-block] [PATCH v3 4/7] block: don't use constant 512 as sector size in crypto driver

2017-09-18 Thread Eric Blake
On 09/12/2017 06:28 AM, Daniel P. Berrange wrote:
> Use the qcrypto_block_get_sector_size() value in the block
> crypto driver instead of hardcoding 512 as the sector size.
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  block/crypto.c | 34 ++
>  1 file changed, 18 insertions(+), 16 deletions(-)
> 
> diff --git a/block/crypto.c b/block/crypto.c
> index d68cbac2ac..49d6d4c058 100644
> --- a/block/crypto.c
> +++ b/block/crypto.c
> @@ -392,8 +392,9 @@ block_crypto_co_readv(BlockDriverState *bs, int64_t sector_num,
>  uint8_t *cipher_data = NULL;
>  QEMUIOVector hd_qiov;
>  int ret = 0;
> +uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
>  uint64_t payload_offset =
> -qcrypto_block_get_payload_offset(crypto->block) / 512;
> +qcrypto_block_get_payload_offset(crypto->block) / sector_size;
>  assert(payload_offset < (INT64_MAX / 512));

In addition to Max's comments, should the assert be dividing by
sector_size instead of 512?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 3/7] block: fix data type casting for crypto payload offset

2017-09-18 Thread Eric Blake
On 09/12/2017 06:28 AM, Daniel P. Berrange wrote:
> The crypto APIs report the offset of the data payload as an uint64_t
> type, but the block driver is casting to size_t or ssize_t which will
> potentially truncate.
> 
> Most of the block APIs use int64_t for offsets meanwhile, so even if
> using uint64_t in the crypto block driver we are still at risk of
> truncation.
> 
> Change the block crypto driver to use uint64_t, but add asserts that
> the value is less than INT64_MAX.
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  block/crypto.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [PATCH v3 2/7] crypto: expose encryption sector size in APIs

2017-09-18 Thread Eric Blake
On 09/12/2017 06:28 AM, Daniel P. Berrange wrote:
> While current encryption schemes all have a fixed sector size of
> 512 bytes, this is not guaranteed to be the case in future. Expose
> the sector size in the APIs so the block layer can remove assumptions
> about fixed 512 byte sectors.
> 
> Signed-off-by: Daniel P. Berrange 
> ---
>  crypto/block-luks.c|  6 --
>  crypto/block-qcow.c|  1 +
>  crypto/block.c |  6 ++
>  crypto/blockpriv.h |  1 +
>  include/crypto/block.h | 15 +++
>  5 files changed, 27 insertions(+), 2 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


[Qemu-block] [PATCH v8 4/4] qemu-iotests: add shrinking image test

2017-09-18 Thread Pavel Butsykin
Signed-off-by: Pavel Butsykin 
Reviewed-by: Max Reitz 
Reviewed-by: John Snow 
---
 tests/qemu-iotests/163 | 170 +
 tests/qemu-iotests/163.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 176 insertions(+)
 create mode 100644 tests/qemu-iotests/163
 create mode 100644 tests/qemu-iotests/163.out

diff --git a/tests/qemu-iotests/163 b/tests/qemu-iotests/163
new file mode 100644
index 00..403842354e
--- /dev/null
+++ b/tests/qemu-iotests/163
@@ -0,0 +1,170 @@
+#!/usr/bin/env python
+#
+# Tests for shrinking images
+#
+# Copyright (c) 2016-2017 Parallels International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os, random, iotests, struct, qcow2
+from iotests import qemu_img, qemu_io, image_size
+
+test_img = os.path.join(iotests.test_dir, 'test.img')
+check_img = os.path.join(iotests.test_dir, 'check.img')
+
+def size_to_int(str):
+suff = ['B', 'K', 'M', 'G', 'T']
+return int(str[:-1]) * 1024**suff.index(str[-1:])
+
+class ShrinkBaseClass(iotests.QMPTestCase):
+image_len = '128M'
+shrink_size = '10M'
cluster_size = '16M'
+refcount_bits = '16'
+
+def __qcow2_check(self, filename):
+entry_bits = 3
+entry_size = 1 << entry_bits
+l1_mask = 0x00fffe00
+div_roundup = lambda n, d: (n + d - 1) / d
+
+def split_by_n(data, n):
+for x in xrange(0, len(data), n):
+yield struct.unpack('>Q', data[x:x + n])[0] & l1_mask
+
+def check_l1_table(h, l1_data):
+l1_list = list(split_by_n(l1_data, entry_size))
+real_l1_size = div_roundup(h.size,
   1 << (h.cluster_bits*2 - entry_bits))
+used, unused = l1_list[:real_l1_size], l1_list[real_l1_size:]
+
+self.assertTrue(len(used) != 0, "Verifying l1 table content")
+self.assertFalse(any(unused), "Verifying l1 table content")
+
+def check_reftable(fd, h, reftable):
+for offset in split_by_n(reftable, entry_size):
+if offset != 0:
+fd.seek(offset)
+cluster = fd.read(1 << h.cluster_bits)
+self.assertTrue(any(cluster), "Verifying reftable content")
+
+with open(filename, "rb") as fd:
+h = qcow2.QcowHeader(fd)
+
+fd.seek(h.l1_table_offset)
+l1_table = fd.read(h.l1_size << entry_bits)
+
+fd.seek(h.refcount_table_offset)
+reftable = fd.read(h.refcount_table_clusters << h.cluster_bits)
+
+check_l1_table(h, l1_table)
+check_reftable(fd, h, reftable)
+
+def __raw_check(self, filename):
+pass
+
+image_check = {
+'qcow2' : __qcow2_check,
+'raw' : __raw_check
+}
+
+def setUp(self):
+if iotests.imgfmt == 'raw':
+qemu_img('create', '-f', iotests.imgfmt, test_img, self.image_len)
+qemu_img('create', '-f', iotests.imgfmt, check_img,
+ self.shrink_size)
+else:
+qemu_img('create', '-f', iotests.imgfmt,
+ '-o', 'cluster_size=' + self.cluster_size +
+ ',refcount_bits=' + self.refcount_bits,
+ test_img, self.image_len)
+qemu_img('create', '-f', iotests.imgfmt,
+ '-o', 'cluster_size=%s'% self.cluster_size,
+ check_img, self.shrink_size)
+qemu_io('-c', 'write -P 0xff 0 ' + self.shrink_size, check_img)
+
+def tearDown(self):
+os.remove(test_img)
+os.remove(check_img)
+
+def image_verify(self):
+self.assertEqual(image_size(test_img), image_size(check_img),
+ "Verifying image size")
+self.image_check[iotests.imgfmt](self, test_img)
+
+if iotests.imgfmt == 'raw':
+return
+self.assertEqual(qemu_img('check', test_img), 0,
+ "Verifying image corruption")
+
+def test_empty_image(self):
+qemu_img('resize',  '-f', iotests.imgfmt, '--shrink', test_img,
+ self.shrink_size)
+
+self.assertEqual(
+qemu_io('-c', 'read -P 0x00 %s'%self.shrink_size, test_img),
+qemu_io('-c', 'read -P 0x00 

[Qemu-block] [PATCH v8 1/4] qemu-img: add --shrink flag for resize

2017-09-18 Thread Pavel Butsykin
The flag is an additional precaution against data loss. In the future, shrinking
without this flag may be blocked for all formats, but for now we need to maintain
compatibility with raw.

Signed-off-by: Pavel Butsykin 
Reviewed-by: Max Reitz 
Reviewed-by: John Snow 
---
 qemu-img-cmds.hx   |  4 ++--
 qemu-img.c | 23 +++
 qemu-img.texi  |  6 +-
 tests/qemu-iotests/102 |  4 ++--
 tests/qemu-iotests/106 |  2 +-
 5 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index b47d409665..2fe31893cf 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -89,9 +89,9 @@ STEXI
 ETEXI
 
 DEF("resize", img_resize,
-"resize [--object objectdef] [--image-opts] [-q] filename [+ | -]size")
+"resize [--object objectdef] [--image-opts] [-q] [--shrink] filename [+ | -]size")
 STEXI
-@item resize [--object @var{objectdef}] [--image-opts] [-q] @var{filename} [+ | -]@var{size}
+@item resize [--object @var{objectdef}] [--image-opts] [-q] [--shrink] @var{filename} [+ | -]@var{size}
 ETEXI
 
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 56ef49e214..b7b2386cbd 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -65,6 +65,7 @@ enum {
 OPTION_TARGET_IMAGE_OPTS = 263,
 OPTION_SIZE = 264,
 OPTION_PREALLOCATION = 265,
+OPTION_SHRINK = 266,
 };
 
 typedef enum OutputFormat {
@@ -3437,6 +3438,7 @@ static int img_resize(int argc, char **argv)
 },
 };
 bool image_opts = false;
+bool shrink = false;
 
 /* Remove size from argv manually so that negative numbers are not treated
  * as options by getopt. */
@@ -3455,6 +3457,7 @@ static int img_resize(int argc, char **argv)
 {"object", required_argument, 0, OPTION_OBJECT},
 {"image-opts", no_argument, 0, OPTION_IMAGE_OPTS},
 {"preallocation", required_argument, 0, OPTION_PREALLOCATION},
+{"shrink", no_argument, 0, OPTION_SHRINK},
 {0, 0, 0, 0}
 };
 c = getopt_long(argc, argv, ":f:hq",
@@ -3498,6 +3501,9 @@ static int img_resize(int argc, char **argv)
 return 1;
 }
 break;
+case OPTION_SHRINK:
+shrink = true;
+break;
 }
 }
 if (optind != argc - 1) {
@@ -3571,6 +3577,23 @@ static int img_resize(int argc, char **argv)
 goto out;
 }
 
+if (total_size < current_size && !shrink) {
+warn_report("Shrinking an image will delete all data beyond the "
+"shrunken image's end. Before performing such an "
+"operation, make sure there is no important data there.");
+
+if (g_strcmp0(bdrv_get_format_name(blk_bs(blk)), "raw") != 0) {
+error_report(
+  "Use the --shrink option to perform a shrink operation.");
+ret = -1;
+goto out;
+} else {
+warn_report("Using the --shrink option will suppress this message. "
+"Note that future versions of qemu-img may refuse to "
+"shrink images without this option.");
+}
+}
+
 ret = blk_truncate(blk, total_size, prealloc, );
 if (!ret) {
 qprintf(quiet, "Image resized.\n");
diff --git a/qemu-img.texi b/qemu-img.texi
index 72dabd6b3e..ea5d04b873 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -536,7 +536,7 @@ qemu-img rebase -b base.img diff.qcow2
 At this point, @code{modified.img} can be discarded, since
 @code{base.img + diff.qcow2} contains the same information.
 
-@item resize [--preallocation=@var{prealloc}] @var{filename} [+ | -]@var{size}
+@item resize [--shrink] [--preallocation=@var{prealloc}] @var{filename} [+ | -]@var{size}
 
 Change the disk image as if it had been created with @var{size}.
 
@@ -544,6 +544,10 @@ Before using this command to shrink a disk image, you MUST use file system and
 partitioning tools inside the VM to reduce allocated file systems and partition
 sizes accordingly.  Failure to do so will result in data loss!
 
+When shrinking images, the @code{--shrink} option must be given. This informs
+qemu-img that the user acknowledges all loss of data beyond the truncated
+image's end.
+
 After using this command to grow a disk image, you must use file system and
 partitioning tools inside the VM to actually begin using the new space on the
 device.
diff --git a/tests/qemu-iotests/102 b/tests/qemu-iotests/102
index 87db1bb1bf..d7ad8d9840 100755
--- a/tests/qemu-iotests/102
+++ b/tests/qemu-iotests/102
@@ -54,7 +54,7 @@ _make_test_img $IMG_SIZE
 $QEMU_IO -c 'write 0 64k' "$TEST_IMG" | _filter_qemu_io
# Remove data cluster from image (first cluster: image header, second: reftable,
 # third: refblock, fourth: L1 table, fifth: L2 table)
-$QEMU_IMG resize -f raw "$TEST_IMG" $((5 * 64 * 1024))
+$QEMU_IMG resize -f raw --shrink "$TEST_IMG" $((5 * 64 * 1024))
 
 

[Qemu-block] [PATCH v8 2/4] qcow2: add qcow2_cache_discard

2017-09-18 Thread Pavel Butsykin
Whenever l2/refcount table clusters are discarded from the file, we can
automatically drop the now-unnecessary content of the cache tables. This reduces
the chance of evicting useful cache data and prevents the cache from holding
data that is inconsistent with the data in the file.
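The eviction policy this describes can be pictured with a toy model (plain Python, not the actual Qcow2Cache API; the names here are made up for illustration): when a table cluster is freed in the image file, any cached copy of it is dropped immediately instead of lingering until LRU eviction.

```python
class ToyTableCache:
    """Toy stand-in for the qcow2 metadata cache (offset -> table data)."""
    def __init__(self):
        self.entries = {}

    def lookup(self, offset):
        return self.entries.get(offset)

    def discard(self, offset):
        # Counterpart of qcow2_cache_discard(): forget the cached copy
        # of a table cluster that was just freed in the image file.
        self.entries.pop(offset, None)

cache = ToyTableCache()
cache.entries[0x50000] = b'<l2 table data>'

# The cluster at 0x50000 is freed (refcount drops to 0), so the stale
# cached copy is dropped right away:
cache.discard(0x50000)
assert cache.lookup(0x50000) is None
```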

Signed-off-by: Pavel Butsykin 
Reviewed-by: Max Reitz 
Reviewed-by: John Snow 
---
 block/qcow2-cache.c| 26 ++
 block/qcow2-refcount.c | 20 ++--
 block/qcow2.h  |  3 +++
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 1d25147392..75746a7f43 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -411,3 +411,29 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, 
Qcow2Cache *c,
 assert(c->entries[i].offset != 0);
 c->entries[i].dirty = true;
 }
+
+void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
+  uint64_t offset)
+{
+int i;
+
+for (i = 0; i < c->size; i++) {
+if (c->entries[i].offset == offset) {
+return qcow2_cache_get_table_addr(bs, c, i);
+}
+}
+return NULL;
+}
+
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table)
+{
+int i = qcow2_cache_get_table_idx(bs, c, table);
+
+assert(c->entries[i].ref == 0);
+
+c->entries[i].offset = 0;
+c->entries[i].lru_counter = 0;
+c->entries[i].dirty = false;
+
+qcow2_cache_table_release(bs, c, i, 1);
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 168fc32e7b..8c17c0e3aa 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -861,8 +861,24 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
 }
 s->set_refcount(refcount_block, block_index, refcount);
 
-if (refcount == 0 && s->discard_passthrough[type]) {
-update_refcount_discard(bs, cluster_offset, s->cluster_size);
+if (refcount == 0) {
+void *table;
+
+table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+offset);
+if (table != NULL) {
+qcow2_cache_put(bs, s->refcount_block_cache, _block);
+qcow2_cache_discard(bs, s->refcount_block_cache, table);
+}
+
+table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
+if (table != NULL) {
+qcow2_cache_discard(bs, s->l2_table_cache, table);
+}
+
+if (s->discard_passthrough[type]) {
+update_refcount_discard(bs, cluster_offset, s->cluster_size);
+}
 }
 }
 
diff --git a/block/qcow2.h b/block/qcow2.h
index 96a8d43c17..52c374e9ed 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -649,6 +649,9 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, 
uint64_t offset,
 int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 void **table);
 void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
+void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
+  uint64_t offset);
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table);
 
 /* qcow2-bitmap.c functions */
 int qcow2_check_bitmaps_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
-- 
2.14.1




[Qemu-block] [PATCH v8 3/4] qcow2: add shrink image support

2017-09-18 Thread Pavel Butsykin
This patch adds shrinking of the image file for qcow2. As a result, it allows
us to reduce the virtual image size and free up space on the disk without
copying the image. The image can be fragmented, and shrinking is done by punching
holes in the image file.
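For intuition about how much metadata shrinking can free: each L1 entry points to one L2 table, and one L2 table maps (cluster_size / 8) * cluster_size bytes of guest data. A rough sketch of the size calculation (a hypothetical helper for illustration, not part of the patch):

```python
def l1_entries_needed(virtual_size, cluster_bits=16):
    """Number of L1 entries needed to cover a given virtual size.

    Each L1 entry points to one L2 table; an L2 table holds
    cluster_size / 8 eight-byte entries, each mapping one cluster.
    """
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8
    bytes_per_l1_entry = l2_entries * cluster_size
    return -(-virtual_size // bytes_per_l1_entry)  # ceiling division

# With the default 64k clusters, one L1 entry covers 512 MiB:
print(l1_entries_needed(4 * 1024**3))    # 4G image  -> 8 entries
print(l1_entries_needed(128 * 1024**2))  # 128M image -> 1 entry
```

This is why shrinking a 4G image to 128M lets qcow2_shrink_l1_table() zero out the tail of the L1 table and free the corresponding L2 clusters.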

Signed-off-by: Pavel Butsykin 
Reviewed-by: Max Reitz 
Reviewed-by: John Snow 
---
 block/qcow2-cluster.c  |  50 +
 block/qcow2-refcount.c | 120 +
 block/qcow2.c  |  43 ++
 block/qcow2.h  |  14 ++
 qapi/block-core.json   |   8 +++-
 5 files changed, 225 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 0d4824993c..d2518d1893 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -32,6 +32,56 @@
 #include "qemu/bswap.h"
 #include "trace.h"
 
+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
+{
+BDRVQcow2State *s = bs->opaque;
+int new_l1_size, i, ret;
+
+if (exact_size >= s->l1_size) {
+return 0;
+}
+
+new_l1_size = exact_size;
+
+#ifdef DEBUG_ALLOC2
+fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
+#endif
+
+BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
+ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
+   new_l1_size * sizeof(uint64_t),
+ (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
+if (ret < 0) {
+goto fail;
+}
+
+ret = bdrv_flush(bs->file->bs);
+if (ret < 0) {
+goto fail;
+}
+
+BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
+for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
+if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
+continue;
+}
+qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
+s->cluster_size, QCOW2_DISCARD_ALWAYS);
+s->l1_table[i] = 0;
+}
+return 0;
+
+fail:
+/*
+ * If the write in the l1_table failed the image may contain a partially
+ * overwritten l1_table. In this case it would be better to clear the
+ * l1_table in memory to avoid possible image corruption.
+ */
+memset(s->l1_table + new_l1_size, 0,
+   (s->l1_size - new_l1_size) * sizeof(uint64_t));
+return ret;
+}
+
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
 bool exact_size)
 {
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8c17c0e3aa..88d5a3f1ad 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
 #include "block/qcow2.h"
 #include "qemu/range.h"
 #include "qemu/bswap.h"
+#include "qemu/cutils.h"
 
 static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
@@ -3061,3 +3062,122 @@ done:
 qemu_vfree(new_refblock);
 return ret;
 }
+
+static int qcow2_discard_refcount_block(BlockDriverState *bs,
+uint64_t discard_block_offs)
+{
+BDRVQcow2State *s = bs->opaque;
+uint64_t refblock_offs = get_refblock_offset(s, discard_block_offs);
+uint64_t cluster_index = discard_block_offs >> s->cluster_bits;
+uint32_t block_index = cluster_index & (s->refcount_block_size - 1);
+void *refblock;
+int ret;
+
+assert(discard_block_offs != 0);
+
+ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
+  );
+if (ret < 0) {
+return ret;
+}
+
+if (s->get_refcount(refblock, block_index) != 1) {
+qcow2_signal_corruption(bs, true, -1, -1, "Invalid refcount:"
+" refblock offset %#" PRIx64
+", reftable index %u"
+", block offset %#" PRIx64
+", refcount %#" PRIx64,
+refblock_offs,
offset_to_reftable_index(s, discard_block_offs),
+discard_block_offs,
+s->get_refcount(refblock, block_index));
+qcow2_cache_put(bs, s->refcount_block_cache, );
+return -EINVAL;
+}
+s->set_refcount(refblock, block_index, 0);
+
+qcow2_cache_entry_mark_dirty(bs, s->refcount_block_cache, refblock);
+
+qcow2_cache_put(bs, s->refcount_block_cache, );
+
+if (cluster_index < s->free_cluster_index) {
+s->free_cluster_index = cluster_index;
+}
+
+refblock = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+   discard_block_offs);
+if (refblock) {
+/* discard refblock from the cache if refblock is cached */
+qcow2_cache_discard(bs, s->refcount_block_cache, refblock);
+}
+

[Qemu-block] [PATCH v8 0/4] Add shrink image for qcow2

2017-09-18 Thread Pavel Butsykin
This patch adds shrinking of the image file for qcow2. As a result, it allows
us to reduce the virtual image size and free up space on the disk without
copying the image. The image can be fragmented, and shrinking is done by punching
holes in the image file.

# ./qemu-img create -f qcow2 image.qcow2 4G
Formatting 'image.qcow2', fmt=qcow2 size=4294967296 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16

# ./qemu-io -c "write -P 0x22 0 1G" image.qcow2
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:04.59 (222.886 MiB/sec and 0.2177 ops/sec)

# ./qemu-img resize image.qcow2 512M
warning: qemu-img: Shrinking an image will delete all data beyond the shrunken 
image's end. Before performing such an operation, make sure there is no 
important data there.
error: qemu-img: Use the --shrink option to perform a shrink operation.

# ./qemu-img resize --shrink image.qcow2 128M
Image resized.

# ./qemu-img info image.qcow2
image: image.qcow2
file format: qcow2
virtual size: 128M (134217728 bytes)
disk size: 128M
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

# du -h image.qcow2
129M    image.qcow2

Changes from v1:
- add --shrink flag for qemu-img resize
- add qcow2_cache_discard
- simplify qcow2_shrink_l1_table() to reduce the likelihood of image corruption
- add new qemu-iotests for shrinking images

Changes from v2:
- replace qprintf() on error_report() (1)
- rewrite warning messages (1)
- enforce --shrink flag for all formats except raw (1)
- split qcow2_cache_discard() (2)
- minor fixes according to comments (3)
- rewrite the last part of qcow2_shrink_reftable() to avoid
  qcow2_free_clusters() calls inside (3)
- improve test for shrinking image (4)

Changes from v3:
- rebase on "Implement a warning_report function" Alistair's patch-set (1)
- spelling fixes (1)
- the man page fix according to the discussion (1)
- add call qcow2_signal_corruption() in case of image corruption (3)

Changes from v4:
- rebase on https://github.com/XanClic/qemu/commits/block Max's block branch

Changes from v5:
- the condition refcount == 0 should be enough to evict the l2/refcount cluster
  from the cache (2)
- overwrite the l1/refcount table in memory with zeros, even if overwriting the
  l1/refcount table on disk has failed (3)
- replace g_try_malloc() on g_malloc() for allocation reftable_tmp (3)

Changes from v6:
- rebase on master 1f29673387

Changes from v7:
- fix 106 iotest (1)
- minor fixes according to comments (2, 3)
- add documentation of the new enum members (3)
- add r-b's by Max and John

Pavel Butsykin (4):
  qemu-img: add --shrink flag for resize
  qcow2: add qcow2_cache_discard
  qcow2: add shrink image support
  qemu-iotests: add shrinking image test

 block/qcow2-cache.c|  26 +++
 block/qcow2-cluster.c  |  50 +
 block/qcow2-refcount.c | 140 -
 block/qcow2.c  |  43 +---
 block/qcow2.h  |  17 +
 qapi/block-core.json   |   8 ++-
 qemu-img-cmds.hx   |   4 +-
 qemu-img.c |  23 ++
 qemu-img.texi  |   6 +-
 tests/qemu-iotests/102 |   4 +-
 tests/qemu-iotests/106 |   2 +-
 tests/qemu-iotests/163 | 170 +
 tests/qemu-iotests/163.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 14 files changed, 481 insertions(+), 18 deletions(-)
 create mode 100644 tests/qemu-iotests/163
 create mode 100644 tests/qemu-iotests/163.out

-- 
2.14.1




Re: [Qemu-block] [Qemu-devel] [PATCH 0/6] block: Fix permissions after ro/rw reopen

2017-09-18 Thread Fam Zheng
On Mon, 09/18 14:11, Kevin Wolf wrote:
> But how does it determine the desired new perms? Either you duplicate
> the logic of the .bdrv_child_perm implementation into a new set of
> functions that does the same thing, but based on the new state; or you
> extend the existing function with a BlockReopenQueue parameter. The
> latter is basically this series, except with an additional unnecessary
> detour through the driver code instead of doing it in common code.
> 
> Also note that storing it in BDRVReopenState would have to involve a new
> list of an additional data structure because permissions are per
> BdrvChild, not per BlockDriverState.

Indeed, this is not going to remove any complexity. :(

Fam




Re: [Qemu-block] [Qemu-devel] [PATCH 0/6] block: Fix permissions after ro/rw reopen

2017-09-18 Thread Kevin Wolf
On 18.09.2017 at 13:53, Fam Zheng wrote:
> On Mon, 09/18 10:11, Kevin Wolf wrote:
> > > 2) Don't change the prototypes at all, just change .bdrv_reopen_prepare 
> > > contract
> > > so that after it returns, .bdrv_child_perm/.bdrv_check_perm should comply 
> > > to the
> > > new state that would be commited once .bdrv_reopen_commit() is called, or
> > > reverted if .bdrv_reopen_abort().
> > 
> > Hm, .bdrv_reopen_prepare already gets the whole queue passed, so I guess
> > this could technically work. I'm not sure if it is a good idea, though.
> > 
> > Such a change would still make .bdrv_child_perm depend on the reopen
> > queue, just without actually passing it as a parameter. I like such
> > hidden data flows even less than adding an explicit one.
> > 
> > It would also mean that each block driver would have to save the queue
> > in its local bs->opaque structure so that .bdrv_child_perm can access it
> > later. Alternatively, bdrv_reopen_prepare could already store the new
> > cumulative parent permissions, but it would still involve two new fields
> > in bs->opaque for storing something of a rather temporary nature.
> 
> What about this?
> 
> 1) drv->bdrv_reopen_prepare() saves the desired new perms in
>BDRVReopenState.

But how does it determine the desired new perms? Either you duplicate
the logic of the .bdrv_child_perm implementation into a new set of
functions that does the same thing, but based on the new state; or you
extend the existing function with a BlockReopenQueue parameter. The
latter is basically this series, except with an additional unnecessary
detour through the driver code instead of doing it in common code.

Also note that storing it in BDRVReopenState would have to involve a new
list of an additional data structure because permissions are per
BdrvChild, not per BlockDriverState.

> 2) bdrv_reopen_prepare() checks the new perms after drv->bdrv_reopen_prepare()
>returns.
> 3) bdrv_reopen_commit() updates the bs to new perms.

2) and 3) are already implemented in this series like you suggest.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH 0/6] block: Fix permissions after ro/rw reopen

2017-09-18 Thread Fam Zheng
On Mon, 09/18 10:11, Kevin Wolf wrote:
> > 2) Don't change the prototypes at all, just change .bdrv_reopen_prepare 
> > contract
> > so that after it returns, .bdrv_child_perm/.bdrv_check_perm should comply 
> > to the
> > new state that would be commited once .bdrv_reopen_commit() is called, or
> > reverted if .bdrv_reopen_abort().
> 
> Hm, .bdrv_reopen_prepare already gets the whole queue passed, so I guess
> this could technically work. I'm not sure if it is a good idea, though.
> 
> Such a change would still make .bdrv_child_perm depend on the reopen
> queue, just without actually passing it as a parameter. I like such
> hidden data flows even less than adding an explicit one.
> 
> It would also mean that each block driver would have to save the queue
> in its local bs->opaque structure so that .bdrv_child_perm can access it
> later. Alternatively, bdrv_reopen_prepare could already store the new
> cumulative parent permissions, but it would still involve two new fields
> in bs->opaque for storing something of a rather temporary nature.

What about this?

1) drv->bdrv_reopen_prepare() saves the desired new perms in BDRVReopenState.
2) bdrv_reopen_prepare() checks the new perms after drv->bdrv_reopen_prepare()
   returns.
3) bdrv_reopen_commit() updates the bs to new perms.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH] qemu.py: Fix syntax error

2017-09-18 Thread Peter Maydell
On 18 September 2017 at 06:25, Kevin Wolf  wrote:
> Python requires parentheses around multiline expression. This fixes the
> breakage of all Python-based qemu-iotests cases that was introduced in
> commit dab91d9aa0.
>
> Signed-off-by: Kevin Wolf 
> ---
>
> Eduardo, I think I'm going to include this patch in a block layer pull
> request today, just to stop the CI spam I'm getting, so the CC is just
> FYI.

Applied to master as a buildfix; thanks.

-- PMM



Re: [Qemu-block] [PATCH V5 01/10] specs/qcow2: add compress format extension

2017-09-18 Thread Peter Lieven

On 11.09.2017 at 16:22, Kevin Wolf wrote:

On 25.07.2017 at 16:41, Peter Lieven wrote:

Signed-off-by: Peter Lieven 
---
  docs/interop/qcow2.txt | 51 +-
  roms/ipxe  |  2 +-
  2 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index d7fdb1f..d0d2a8f 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -86,7 +86,12 @@ in the description of a field.
  be written to (unless for regaining
  consistency).
  
-Bits 2-63:  Reserved (set to 0)

+Bit 2:  Compress format bit.  If and only if this bit
+is set then the compress format extension
+MUST be present and MUST be parsed and checked
+for compatibility.
+
+Bits 3-63:  Reserved (set to 0)
  
   80 -  87:  compatible_features

  Bitmask of compatible features. An implementation can
@@ -137,6 +142,7 @@ be stored. Each extension has a structure like the following:
  0x6803f857 - Feature name table
  0x23852875 - Bitmaps extension
  0x0537be77 - Full disk encryption header pointer
+0xC03183A3 - Compress format extension
  other  - Unknown header extension, can be safely
   ignored
  
@@ -311,6 +317,49 @@ The algorithms used for encryption vary depending on the method

 in the LUKS header, with the physical disk sector as the
 input tweak.
  
+

+== Compress format extension ==
+
+The compress format extension is an optional header extension. It provides
+the ability to specify the compress algorithm and compress parameters
+that are used for compressed clusters. This new header MUST be present if
+the incompatible-feature bit "compress format bit" is set and MUST be absent
+otherwise.
+
+The fields of the compress format extension are:
+
+Byte  0 - 13:  compress_format_name (padded with zeros, but not
+   necessarily null terminated if it has full length).
+   Valid compression format names currently are:
+
+   deflate: Standard zlib deflate compression without
+compression header
+
+  14:  compress_level (uint8_t)
+
+   0 = default compress level (valid for all formats, default)
+
+   Additional valid compression levels for deflate compression:
+
+   All values between 1 and 9. 1 gives best speed, 9 gives best
+   compression. The default compression level is defined by zlib
+   and currently defaults to 6.
+
+  15:  compress_window_size (uint8_t)
+
+   Window or dictionary size used by the compression format.
+   Currently only used by the deflate compression algorithm.
+
+   Valid window sizes for deflate compression range from 8 to
+   15 inclusively.

So what is the plan with respect to adding new compression algorithms?

If I understand correctly, we'll use the same extension type
(0xC03183A3) and a different compress_format_name. However, the other
algorithm will likely have different options and also a different number
of options. Will the meaning of bytes 14-end then depend on the specific
algorithm?


The idea of the current options is that almost every algorithm will have
a compression level setting and most have a dictionary or window
size. This is why I added them to the common header.



Part of why I'm asking this is because I wonder why compress_format_name
is 14 characters. It looks to me as if that is intended to make the
header a round 16 bytes for the deflate algorithm. But unless we say
"16 bytes ought to be enough for every algorithm", this is just
optimising a special case. (Or actually not optimising, but just moving
the padding from the end of the header extension to padding inside the
compress_format_name field.)

Wouldn't 8 bytes still be plenty of space for a format name, and look
more natural? Then any format that requires options would have a little
more space without immediately going to a 24 byte header extension.


Sure 8 characters will still be enough. I can respin the series with
an updated header if you like.

Thanks,
Peter
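For reference, the difference between the two layouts discussed above can be sketched with struct packing (a toy illustration, not normative; it assumes big-endian fields and the qcow2 rule that header extension data is padded to a multiple of 8 bytes):

```python
import struct

def ext_len(body_len):
    # qcow2 header extension data is padded up to a multiple of 8 bytes
    return (body_len + 7) & ~7

# Proposed layout: 14-byte name + level + window = exactly 16 bytes
body_v1 = struct.pack('>14sBB', b'deflate', 0, 15)
assert len(body_v1) == 16 and ext_len(len(body_v1)) == 16

# 8-byte name variant: 10 bytes of fields, still padded to 16 on disk,
# which is Kevin's point -- the padding just moves from inside the name
# field to the end of the extension.
body_v2 = struct.pack('>8sBB', b'deflate', 0, 15)
assert len(body_v2) == 10 and ext_len(len(body_v2)) == 16
print(ext_len(len(body_v1)), ext_len(len(body_v2)))  # 16 16
```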




Re: [Qemu-block] [PATCH 15/18] block/mirror: Add active mirroring

2017-09-18 Thread Stefan Hajnoczi
On Sat, Sep 16, 2017 at 03:58:01PM +0200, Max Reitz wrote:
> On 2017-09-14 17:57, Stefan Hajnoczi wrote:
> > On Wed, Sep 13, 2017 at 08:19:07PM +0200, Max Reitz wrote:
> >> This patch implements active synchronous mirroring.  In active mode, the
> >> passive mechanism will still be in place and is used to copy all
> >> initially dirty clusters off the source disk; but every write request
> >> will write data both to the source and the target disk, so the source
> >> cannot be dirtied faster than data is mirrored to the target.  Also,
> >> once the block job has converged (BLOCK_JOB_READY sent), source and
> >> target are guaranteed to stay in sync (unless an error occurs).
> >>
> >> Optionally, dirty data can be copied to the target disk on read
> >> operations, too.
> >>
> >> Active mode is completely optional and currently disabled at runtime.  A
> >> later patch will add a way for users to enable it.
> >>
> >> Signed-off-by: Max Reitz 
> >> ---
> >>  qapi/block-core.json |  23 +++
> >>  block/mirror.c   | 187 
> >> +--
> >>  2 files changed, 205 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/qapi/block-core.json b/qapi/block-core.json
> >> index bb11815608..e072cfa67c 100644
> >> --- a/qapi/block-core.json
> >> +++ b/qapi/block-core.json
> >> @@ -938,6 +938,29 @@
> >>'data': ['top', 'full', 'none', 'incremental'] }
> >>  
> >>  ##
> >> +# @MirrorCopyMode:
> >> +#
> >> +# An enumeration whose values tell the mirror block job when to
> >> +# trigger writes to the target.
> >> +#
> >> +# @passive: copy data in background only.
> >> +#
> >> +# @active-write: when data is written to the source, write it
> >> +#(synchronously) to the target as well.  In addition,
> >> +#data is copied in background just like in @passive
> >> +#mode.
> >> +#
> >> +# @active-read-write: write data to the target (synchronously) both
> >> +# when it is read from and written to the source.
> >> +# In addition, data is copied in background just
> >> +# like in @passive mode.
> > 
> > I'm not sure the terms "active"/"passive" are helpful.  "Active commit"
> > means committing the top-most BDS while the guest is accessing it.  The
> > "passive" mirror block still works on the top-most BDS while the guest
> > is accessing it.
> > 
> > Calling it "asynchronous" and "synchronous" is clearer to me.  It's also
> > the terminology used in disk replication (e.g. DRBD).
> 
> I'd be OK with that, too, but I think I remember that in the past at
> least Kevin made a clear distinction between active/passive and
> sync/async when it comes to mirroring.
> 
> > Ideally the user wouldn't have to worry about async vs sync because QEMU
> > would switch modes as appropriate in order to converge.  That way
> > libvirt also doesn't have to worry about this.
> 
> So here you mean async/sync in the way I meant it, i.e., whether the
> mirror operations themselves are async/sync?

The meaning I had in mind is:

Sync mirroring means a guest write waits until the target write
completes.

Async mirroring means guest writes completes independently of target
writes.
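A toy sketch of the two completion models (hypothetical names only, nothing like QEMU's actual coroutine machinery): in sync mode the guest request awaits the target write; in async mode the target write is merely scheduled and the guest request returns immediately.

```python
import asyncio

async def write_target(data, log):
    await asyncio.sleep(0)               # stand-in for slow target I/O
    log.append(('target', data))

async def guest_write_sync(data, log):
    log.append(('source', data))         # write hits the source...
    await write_target(data, log)        # ...and the guest waits for the target

async def guest_write_async(data, log, pending):
    log.append(('source', data))         # guest request completes here;
    # the target write finishes independently of the guest request
    pending.append(asyncio.ensure_future(write_target(data, log)))

async def main():
    log, pending = [], []
    await guest_write_sync(b'a', log)
    await guest_write_async(b'b', log, pending)
    await asyncio.gather(*pending)       # drain in-flight target writes
    return log

print(asyncio.run(main()))
# -> [('source', b'a'), ('target', b'a'), ('source', b'b'), ('target', b'b')]
```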



Re: [Qemu-block] [PATCH 00/18] block/mirror: Add active-sync mirroring

2017-09-18 Thread Stefan Hajnoczi
On Sat, Sep 16, 2017 at 04:02:45PM +0200, Max Reitz wrote:
> On 2017-09-14 17:42, Stefan Hajnoczi wrote:
> > On Wed, Sep 13, 2017 at 08:18:52PM +0200, Max Reitz wrote:
> >> There may be a couple of things to do on top of this series:
> >> - Allow switching between active and passive mode at runtime: This
> >>   should not be too difficult to implement, the main question is how to
> >>   expose it to the user.
> >>   (I seem to recall we wanted some form of block-job-set-option
> >>   command...?)
> >>
> >> - Implement an asynchronous active mode: May be detrimental when it
> >>   comes to convergence, but it might be nice to have anyway.  May or may
> >>   not be complicated to implement.
> > 
> > Ideally the user doesn't have to know about async vs sync.  It's an
> > implementation detail.
> > 
> > Async makes sense during the bulk copy phase (e.g. sync=full) because
> > guest read/write latencies are mostly unaffected.  Once the entire
> > device has been copied there are probably still dirty blocks left
> > because the guest touched them while the mirror job was running.  At
> > that point it definitely makes sense to switch to synchronous mirroring
> > in order to converge.
> 
> Makes sense, but I'm not sure whether it really is just an
> implementation detail.  If you're in the bulk copy phase in active/async
> mode and you have enough write requests with the target being slow
> enough, I suspect you might still not get convergence then (because the
> writes to the target yield for a long time while ever more write
> requests pile up) -- so then you'd just shift the dirty tracking from
> the bitmap to a list of requests in progress.
> 
> And I think we do want the bulk copy phase to guarantee convergence,
> too, usually (when active/foreground/synchronous mode is selected).  If
> we don't, then that's a policy decision and would be up to libvirt, as I
> see it.

This is a good point.  Bulk copy should converge too.

Can we measure the target write rate and guest write rate?  A heuristic
can choose between async vs sync based on the write rates.

For example, if the guest write rate has been larger than the target
write rate for the past 10 seconds during the bulk phase, switch to
synchronous mirroring.
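A sketch of such a heuristic (the 10-second window and the per-second byte-rate samples are assumptions for illustration, not a concrete proposal):

```python
def pick_mirror_mode(guest_rates, target_rates, window=10):
    """Return 'sync' if the guest out-wrote the target for the whole
    window (one bytes/sec sample per second), otherwise 'async'."""
    if len(guest_rates) < window or len(target_rates) < window:
        return 'async'            # not enough history yet
    recent = zip(guest_rates[-window:], target_rates[-window:])
    if all(g > t for g, t in recent):
        return 'sync'             # async mirroring cannot converge
    return 'async'

# Guest consistently faster than target -> force synchronous mirroring
assert pick_mirror_mode([100] * 10, [50] * 10) == 'sync'
# Target keeps up at least occasionally -> stay asynchronous
assert pick_mirror_mode([100] * 10, [50] * 9 + [200]) == 'async'
```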

Stefan



Re: [Qemu-block] [PATCH] throttle: Assert that bkt->max is valid in throttle_compute_wait()

2017-09-18 Thread Kevin Wolf
On 13.09.2017 at 10:28, Alberto Garcia wrote:
> If bkt->max == 0 and bkt->burst_length > 1 then we could have a
> division by 0 in throttle_do_compute_wait(). That configuration is
> however not permitted and is already detected by throttle_is_valid(),
> but let's assert it in throttle_compute_wait() to make it explicit.
> 
> Found by Coverity (CID: 1381016).
> 
> Signed-off-by: Alberto Garcia 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-block] [Qemu-devel] [PATCH v7 0/4] Add shrink image for qcow2

2017-09-18 Thread Pavel Butsykin

On 16.09.2017 17:56, Max Reitz wrote:

On 2017-08-22 01:31, John Snow wrote:



On 08/17/2017 05:15 AM, Pavel Butsykin wrote:

This patch adds shrinking of the image file for qcow2. As a result, this allows
us to reduce the virtual image size and free up space on the disk without
copying the image. The image can be fragmented, and shrinking is done by
punching holes in the image file.

# ./qemu-img create -f qcow2 image.qcow2 4G
Formatting 'image.qcow2', fmt=qcow2 size=4294967296 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16

# ./qemu-io -c "write -P 0x22 0 1G" image.qcow2
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1 ops; 0:00:04.59 (222.886 MiB/sec and 0.2177 ops/sec)

# ./qemu-img resize image.qcow2 512M
warning: qemu-img: Shrinking an image will delete all data beyond the shrunken 
image's end. Before performing such an operation, make sure there is no 
important data there.
error: qemu-img: Use the --shrink option to perform a shrink operation.

# ./qemu-img resize --shrink image.qcow2 128M
Image resized.

# ./qemu-img info image.qcow2
image: image.qcow2
file format: qcow2
virtual size: 128M (134217728 bytes)
disk size: 128M
cluster_size: 65536
Format specific information:
 compat: 1.1
 lazy refcounts: false
 refcount bits: 16
 corrupt: false

# du -h image.qcow2
129Mimage.qcow2



It looks sane to me, and already has a full set of R-Bs from Max. Are we
waiting for Kevin?


Now I found that I was waiting for my own R-b for patch 3.  I hadn't yet
reviewed that patch in its current state (I had reviewed it for v5, but
it has changed quite a bit in v6).


I assumed that small changes wouldn't count :) I'm sorry for the
confusion with the R-b.



Max





Re: [Qemu-block] [Qemu-devel] [PATCH] qemu.py: Fix syntax error

2017-09-18 Thread Alex Bennée

Kevin Wolf  writes:

> Python requires parentheses around a multiline expression. This fixes the
> breakage of all Python-based qemu-iotests cases that was introduced in
> commit dab91d9aa0.
>
> Signed-off-by: Kevin Wolf 

Heh, just sent an identical patch.

Reviewed-by: Alex Bennée 
Tested-by: Alex Bennée 

> ---
>
> Eduardo, I think I'm going to include this patch in a block layer pull
> request today, just to stop the CI spam I'm getting, so the CC is just
> FYI.
>
>  scripts/qemu.py | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/scripts/qemu.py b/scripts/qemu.py
> index 8c67595ec8..5e02dd8e78 100644
> --- a/scripts/qemu.py
> +++ b/scripts/qemu.py
> @@ -193,8 +193,8 @@ class QEMUMachine(object):
>  qemulog = open(self._qemu_log_path, 'wb')
>  try:
>  self._pre_launch()
> -self._qemu_full_args = self._wrapper + [self._binary] +
> -self._base_args() + self._args
> +self._qemu_full_args = (self._wrapper + [self._binary] +
> +self._base_args() + self._args)
>  self._popen = subprocess.Popen(self._qemu_full_args,
> stdin=devnull,
> stdout=qemulog,
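For reference, a minimal standalone illustration of the language rule the commit message relies on (the variable names here are made up):

```python
# A bare expression cannot continue onto the next line; without
# parentheses the following split is a SyntaxError:
#   args = wrapper + binary +
#          ['-nographic']
binary = ['qemu-system-x86_64']
wrapper = ['taskset']
args = (wrapper + binary +
        ['-nographic'])   # parentheses make the line break legal
assert args == ['taskset', 'qemu-system-x86_64', '-nographic']
```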


--
Alex Bennée



Re: [Qemu-block] [Qemu-devel] [PATCH 6/6] block: Fix permissions after bdrv_reopen()

2017-09-18 Thread Kevin Wolf
Am 15.09.2017 um 21:06 hat Eric Blake geschrieben:
> On 09/15/2017 05:10 AM, Kevin Wolf wrote:
> > If we switch between read-only and read-write, the permissions that
> > image format drivers need on bs->file change, too. Make sure to update
> > the permissions during bdrv_reopen().
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block.h |  1 +
> >  block.c   | 64 +++
> >  2 files changed, 65 insertions(+)
> > 
> 
> > +static BlockReopenQueueEntry *find_parent_in_reopen_queue(BlockReopenQueue *q,
> > +  BdrvChild *c)
> > +{
> > +BlockReopenQueueEntry *entry;
> > +
> > +QSIMPLEQ_FOREACH(entry, q, entry) {
> > +BlockDriverState *bs = entry->state.bs;
> > +BdrvChild *child;
> > +
> > +QLIST_FOREACH(child, &bs->children, next) {
> > +if (child == c) {
> > +return entry;
> 
> An O(n^2) loop. Is it going to bite us at any point in the future, or
> are we generally dealing with a small enough queue size and BDS graph to
> not worry about it?

The loops you're quoting aren't O(n^2), they don't loop over the same
thing. This part is O(n) in terms of BdrvChild elements looked at.

The thing that worried me a bit more is the caller:

+QLIST_FOREACH(c, &bs->parents, next_parent) {
+parent = find_parent_in_reopen_queue(q, c);

This is indeed O(n^2) (again with n = number of BdrvChild elements) in
the pathological worst case of a quorum node where all children point to
the same node.

As soon as all parents of the node are distinct - and I don't see any
reason why they wouldn't in practice - we're back to O(n) because each
BdrvChild belongs to only one parent. Even if we ever introduce a driver
where having the same node as a child in a constant number of different
roles makes sense for a parent (i.e. anything that doesn't involve an
(unbounded) array of children), we would still be O(n) with an additional
small constant factor.

So I think in practice we should be okay.

Kevin




Re: [Qemu-block] [PATCH] qemu.py: Fix syntax error

2017-09-18 Thread Stefan Hajnoczi
On Mon, Sep 18, 2017 at 07:25:24AM +0200, Kevin Wolf wrote:
> Python requires parentheses around a multiline expression. This fixes the
> breakage of all Python-based qemu-iotests cases that was introduced in
> commit dab91d9aa0.
> 
> Signed-off-by: Kevin Wolf 
> ---
> 
> Eduardo, I think I'm going to include this patch in a block layer pull
> request today, just to stop the CI spam I'm getting, so the CC is just
> FYI.
> 
>  scripts/qemu.py | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [Qemu-block] [PATCH 0/6] block: Fix permissions after ro/rw reopen

2017-09-18 Thread Kevin Wolf
Am 18.09.2017 um 09:51 hat Fam Zheng geschrieben:
> On Fri, 09/15 12:10, Kevin Wolf wrote:
> > bdrv_reopen() can switch nodes between read-only and read-write modes.
> > This has implications for the required permissions on their child nodes.
> > For example, a qcow2 node requests write permissions on bs->file only if
> > it is writable itself.
> > 
> > This means that during bdrv_reopen(), the permissions need to be
> > recalculated in order to prevent failures where the bs->file
> > permissions don't match its actual read-only state (e.g. bs->file is a
> > read-write node, but the permission still enforces read-only access).
> 
> Passing reopen_queue around makes the interface and implementations
> complicated.

Yes. I don't like it, but I couldn't find any easier way.

> I wonder if any of the alternatives make sense:
> 
> 1) Don't pass reopen_queue as a parameter, just pass the one
> interesting BDRVReopenState pointer. So that callees don't need to
> call bdrv_reopen_get_flags().

There isn't a single interesting BDRVReopenState. The subset that a
single BDS needs to determine its new state is the BDRVReopenState
of all of its parents. This can be an arbitrary number.

> 2) Don't change the prototypes at all, just change .bdrv_reopen_prepare 
> contract
> so that after it returns, .bdrv_child_perm/.bdrv_check_perm should comply with
> the new state that would be committed once .bdrv_reopen_commit() is called, or
> reverted if .bdrv_reopen_abort().

Hm, .bdrv_reopen_prepare already gets the whole queue passed, so I guess
this could technically work. I'm not sure if it is a good idea, though.

Such a change would still make .bdrv_child_perm depend on the reopen
queue, just without actually passing it as a parameter. I like such
hidden data flows even less than adding an explicit one.

It would also mean that each block driver would have to save the queue
in its local bs->opaque structure so that .bdrv_child_perm can access it
later. Alternatively, bdrv_reopen_prepare could already store the new
cumulative parent permissions, but it would still involve two new fields
in bs->opaque for storing something of a rather temporary nature.

Though maybe I'm just missing another way to implement this that you had
in mind?

> 3) Don't change the prototypes at all, track the reopen progress in block.c
> generically (e.g. ignoring conflicts and violations) and update the permissions
> only after bdrv_reopen_commit().

Both permission updates and reopen are transactional. You need to do
both prepare stages first before you can do commits. If you only start
doing the prepare stage of permissions during the commit stage of
reopen, you break the error cases.

Kevin



Re: [Qemu-block] [PATCH 0/6] block: Fix permissions after ro/rw reopen

2017-09-18 Thread Fam Zheng
On Fri, 09/15 12:10, Kevin Wolf wrote:
> bdrv_reopen() can switch nodes between read-only and read-write modes.
> This has implications for the required permissions on their child nodes.
> For example, a qcow2 node requests write permissions on bs->file only if
> it is writable itself.
> 
> This means that during bdrv_reopen(), the permissions need to be
> recalculated in order to prevent failures where the bs->file
> permissions don't match its actual read-only state (e.g. bs->file is a
> read-write node, but the permission still enforces read-only access).

Passing reopen_queue around makes the interface and implementations complicated.
I wonder if any of the alternatives make sense:

1) Don't pass reopen_queue as a parameter, just pass the one interesting
BDRVReopenState pointer. So that callees don't need to call
bdrv_reopen_get_flags().

2) Don't change the prototypes at all, just change .bdrv_reopen_prepare contract
so that after it returns, .bdrv_child_perm/.bdrv_check_perm should comply with the
new state that would be committed once .bdrv_reopen_commit() is called, or
reverted if .bdrv_reopen_abort().

3) Don't change the prototypes at all, track the reopen progress in block.c
generically (e.g. ignoring conflicts and violations) and update the permissions
only after bdrv_reopen_commit().

Fam

> 
> Kevin Wolf (6):
>   qemu-io: Reset qemuio_blk permissions before each command
>   block: Add reopen_queue to bdrv_child_perm()
>   block: Add reopen queue to bdrv_check_perm()
>   block: Base permissions on rw state after reopen
>   block: reopen: Queue children after their parents
>   block: Fix permissions after bdrv_reopen()
> 
>  include/block/block.h  |   2 +-
>  include/block/block_int.h  |   7 ++
>  block.c| 191 +
>  block/commit.c |   1 +
>  block/mirror.c |   1 +
>  block/replication.c|   1 +
>  block/vvfat.c  |   1 +
>  qemu-io.c  |  13 +++
>  tests/qemu-iotests/187.out |   2 +-
>  9 files changed, 169 insertions(+), 50 deletions(-)
> 
> -- 
> 2.13.5
> 



Re: [Qemu-block] [PATCH 6/6] block: Fix permissions after bdrv_reopen()

2017-09-18 Thread Kevin Wolf
Am 18.09.2017 um 09:37 hat Fam Zheng geschrieben:
> On Fri, 09/15 12:10, Kevin Wolf wrote:
> > If we switch between read-only and read-write, the permissions that
> > image format drivers need on bs->file change, too. Make sure to update
> > the permissions during bdrv_reopen().
> > 
> > Signed-off-by: Kevin Wolf 
> > ---
> >  include/block/block.h |  1 +
> >  block.c   | 64 +++
> >  2 files changed, 65 insertions(+)
> > 
> > diff --git a/include/block/block.h b/include/block/block.h
> > index 082eb2cd9c..3c3af462e4 100644
> > --- a/include/block/block.h
> > +++ b/include/block/block.h
> > @@ -166,6 +166,7 @@ typedef QSIMPLEQ_HEAD(BlockReopenQueue, 
> > BlockReopenQueueEntry) BlockReopenQueue;
> >  typedef struct BDRVReopenState {
> >  BlockDriverState *bs;
> >  int flags;
> > +uint64_t perm, shared_perm;
> >  QDict *options;
> >  QDict *explicit_options;
> >  void *opaque;
> > diff --git a/block.c b/block.c
> > index 204cbb46c7..5c65fac672 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2781,6 +2781,10 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
> >  bs_entry->state.explicit_options = explicit_options;
> >  bs_entry->state.flags = flags;
> >  
> > +/* This needs to be overwritten in bdrv_reopen_prepare() */
> > +bs_entry->state.perm = UINT64_MAX;
> 
> Probably doesn't matter because as the comment says it will be overwritten soon,
> but is BLK_PERM_ALL more appropriate?

I had BLK_PERM_ALL at first, but after debugging some assertion failures
in gdb, I came to the conclusion that UINT64_MAX is easier to identify as
uninitialised than BLK_PERM_ALL, which could be a valid result of the
permission calculation.

Kevin



Re: [Qemu-block] [PATCH 6/6] block: Fix permissions after bdrv_reopen()

2017-09-18 Thread Fam Zheng
On Fri, 09/15 12:10, Kevin Wolf wrote:
> If we switch between read-only and read-write, the permissions that
> image format drivers need on bs->file change, too. Make sure to update
> the permissions during bdrv_reopen().
> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/block/block.h |  1 +
> >  block.c   | 64 +++
>  2 files changed, 65 insertions(+)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 082eb2cd9c..3c3af462e4 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -166,6 +166,7 @@ typedef QSIMPLEQ_HEAD(BlockReopenQueue, 
> BlockReopenQueueEntry) BlockReopenQueue;
>  typedef struct BDRVReopenState {
>  BlockDriverState *bs;
>  int flags;
> +uint64_t perm, shared_perm;
>  QDict *options;
>  QDict *explicit_options;
>  void *opaque;
> diff --git a/block.c b/block.c
> index 204cbb46c7..5c65fac672 100644
> --- a/block.c
> +++ b/block.c
> @@ -2781,6 +2781,10 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
>  bs_entry->state.explicit_options = explicit_options;
>  bs_entry->state.flags = flags;
>  
> +/* This needs to be overwritten in bdrv_reopen_prepare() */
> +bs_entry->state.perm = UINT64_MAX;

Probably doesn't matter because as the comment says it will be overwritten soon,
but is BLK_PERM_ALL more appropriate?

> +bs_entry->state.shared_perm = 0;
> +
> >  QLIST_FOREACH(child, &bs->children, next) {
>  QDict *new_child_options;
>  char *child_key_dot;
> @@ -2887,6 +2891,52 @@ int bdrv_reopen(BlockDriverState *bs, int bdrv_flags, 
> Error **errp)
>  return ret;
>  }
>  
> +static BlockReopenQueueEntry *find_parent_in_reopen_queue(BlockReopenQueue *q,
> +  BdrvChild *c)
> +{
> +BlockReopenQueueEntry *entry;
> +
> +QSIMPLEQ_FOREACH(entry, q, entry) {
> +BlockDriverState *bs = entry->state.bs;
> +BdrvChild *child;
> +
> +QLIST_FOREACH(child, &bs->children, next) {
> +if (child == c) {
> +return entry;
> +}
> +}
> +}
> +
> +return NULL;
> +}
> +
> +static void bdrv_reopen_perm(BlockReopenQueue *q, BlockDriverState *bs,
> + uint64_t *perm, uint64_t *shared)
> +{
> +BdrvChild *c;
> +BlockReopenQueueEntry *parent;
> +uint64_t cumulative_perms = 0;
> +uint64_t cumulative_shared_perms = BLK_PERM_ALL;
> +
> +QLIST_FOREACH(c, &bs->parents, next_parent) {
> +parent = find_parent_in_reopen_queue(q, c);
> +if (!parent) {
> +cumulative_perms |= c->perm;
> +cumulative_shared_perms &= c->shared_perm;
> +} else {
> +uint64_t nperm, nshared;
> +
> +bdrv_child_perm(parent->state.bs, bs, c, c->role, q,
> +parent->state.perm, parent->state.shared_perm,
> +&nperm, &nshared);
> +
> +cumulative_perms |= nperm;
> +cumulative_shared_perms &= nshared;
> +}
> +}
> +*perm = cumulative_perms;
> +*shared = cumulative_shared_perms;
> +}
>  
>  /*
>   * Prepares a BlockDriverState for reopen. All changes are staged in the
> @@ -2952,6 +3002,9 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
>  goto error;
>  }
>  
> +/* Calculate required permissions after reopening */
> +bdrv_reopen_perm(queue, reopen_state->bs,
> + &reopen_state->perm, &reopen_state->shared_perm);
>  
>  ret = bdrv_flush(reopen_state->bs);
>  if (ret) {
> @@ -3007,6 +3060,12 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
>  } while ((entry = qdict_next(reopen_state->options, entry)));
>  }
>  
> +ret = bdrv_check_perm(reopen_state->bs, queue, reopen_state->perm,
> +  reopen_state->shared_perm, NULL, errp);
> +if (ret < 0) {
> +goto error;
> +}
> +
>  ret = 0;
>  
>  error:
> @@ -3047,6 +3106,9 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)
>  
>  bdrv_refresh_limits(bs, NULL);
>  
> +bdrv_set_perm(reopen_state->bs, reopen_state->perm,
> +  reopen_state->shared_perm);
> +
>  new_can_write =
>  !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE);
>  if (!old_can_write && new_can_write && drv->bdrv_reopen_bitmaps_rw) {
> @@ -3080,6 +3142,8 @@ void bdrv_reopen_abort(BDRVReopenState *reopen_state)
>  }
>  
>  QDECREF(reopen_state->explicit_options);
> +
> +bdrv_abort_perm_update(reopen_state->bs);
>  }
>  
>  
> -- 
> 2.13.5
> 



Re: [Qemu-block] [PATCH 1/6] qemu-io: Reset qemuio_blk permissions before each command

2017-09-18 Thread Fam Zheng
On Fri, 09/15 12:10, Kevin Wolf wrote:
> qemu-io provides a 'reopen' command that allows switching from writable
> to read-only access. We need to make sure that we don't try to keep
> write permissions to a BlockBackend that becomes read-only, otherwise
> things are going to fail.
> 
> command() already makes sure to request any additional permissions that
> a qemu-io command requires, so just resetting the permissions to values
> that are safe for read-only images is enough to fix this.
> 
> As a side effect, this makes the output of qemu-iotests case 187 more
> consistent.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  qemu-io.c  | 13 +
>  tests/qemu-iotests/187.out |  2 +-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/qemu-io.c b/qemu-io.c
> index 265445ad89..9e031f0f8e 100644
> --- a/qemu-io.c
> +++ b/qemu-io.c
> @@ -366,6 +366,12 @@ static void command_loop(void)
>  char *input;
>  
>  for (i = 0; !done && i < ncmdline; i++) {
> +/* Make sure that we start each command with clean permissions and only
> + * add whatever the specific cmdinfo_t describes */
> +if (qemuio_blk) {
> +blk_set_perm(qemuio_blk, BLK_PERM_CONSISTENT_READ, BLK_PERM_ALL,
> + &error_abort);
> +}
>  done = qemuio_command(qemuio_blk, cmdline[i]);
>  }
>  if (cmdline) {
> @@ -391,6 +397,13 @@ static void command_loop(void)
>  if (input == NULL) {
>  break;
>  }
> +
> +/* Make sure that we start each command with clean permissions and only
> + * add whatever the specific cmdinfo_t describes */
> +if (qemuio_blk) {
> +blk_set_perm(qemuio_blk, BLK_PERM_CONSISTENT_READ, BLK_PERM_ALL,
> + &error_abort);
> +}
>  done = qemuio_command(qemuio_blk, input);
>  g_free(input);
>  
> diff --git a/tests/qemu-iotests/187.out b/tests/qemu-iotests/187.out
> index 68fb944cd5..30b987f71f 100644
> --- a/tests/qemu-iotests/187.out
> +++ b/tests/qemu-iotests/187.out
> @@ -12,7 +12,7 @@ Start from read-write
>  
>  wrote 65536/65536 bytes at offset 0
>  64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> -write failed: Operation not permitted
> +Block node is read-only
>  wrote 65536/65536 bytes at offset 0
>  64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>  *** done
> -- 
> 2.13.5
> 

Reviewed-by: Fam Zheng 



Re: [Qemu-block] [PATCH 17/18] qemu-io: Add background write

2017-09-18 Thread Fam Zheng
On Wed, 09/13 20:19, Max Reitz wrote:
> Add a new parameter -B to qemu-io's write command.  When used, qemu-io
> will not wait for the result of the operation and instead execute it in
> the background.

Cannot aio_write be used for this purpose?

Fam



Re: [Qemu-block] [PATCH 18/18] iotests: Add test for active mirroring

2017-09-18 Thread Fam Zheng
On Wed, 09/13 20:19, Max Reitz wrote:
> Signed-off-by: Max Reitz 
> ---
>  tests/qemu-iotests/151 | 111 +
>  tests/qemu-iotests/151.out |   5 ++
>  tests/qemu-iotests/group   |   1 +
>  3 files changed, 117 insertions(+)
>  create mode 100755 tests/qemu-iotests/151
>  create mode 100644 tests/qemu-iotests/151.out
> 
> diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
> new file mode 100755
> index 00..49a60773f9
> --- /dev/null
> +++ b/tests/qemu-iotests/151
> @@ -0,0 +1,111 @@
> +#!/usr/bin/env python
> +#
> +# Tests for active mirroring
> +#
> +# Copyright (C) 2017 Red Hat, Inc.
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +#
> +
> +import os
> +import iotests
> +from iotests import qemu_img
> +
> +source_img = os.path.join(iotests.test_dir, 'source.' + iotests.imgfmt)
> +target_img = os.path.join(iotests.test_dir, 'target.' + iotests.imgfmt)
> +
> +class TestActiveMirror(iotests.QMPTestCase):
> +image_len = 128 * 1024 * 1024 # MB
> +potential_writes_in_flight = True
> +
> +def setUp(self):
> +qemu_img('create', '-f', iotests.imgfmt, source_img, '128M')
> +qemu_img('create', '-f', iotests.imgfmt, target_img, '128M')
> +
> +blk_source = {'node-name': 'source',
> +  'driver': iotests.imgfmt,
> +  'file': {'driver': 'file',
> +   'filename': source_img}}
> +
> +blk_target = {'node-name': 'target',
> +  'driver': iotests.imgfmt,
> +  'file': {'driver': 'file',
> +   'filename': target_img}}
> +
> +self.vm = iotests.VM()
> +self.vm.add_blockdev(self.qmp_to_opts(blk_source))
> +self.vm.add_blockdev(self.qmp_to_opts(blk_target))
> +self.vm.launch()
> +
> +def tearDown(self):
> +self.vm.shutdown()
> +
> +if not self.potential_writes_in_flight:
> +self.assertTrue(iotests.compare_images(source_img, target_img),
> +'mirror target does not match source')
> +
> +os.remove(source_img)
> +os.remove(target_img)
> +
> +def doActiveIO(self, sync_source_and_target):
> +# Fill the source image
> +self.vm.hmp_qemu_io('source',
> +'write -P 1 0 %i' % self.image_len);
> +
> +# Start some background requests
> +for offset in range(0, self.image_len, 1024 * 1024):
> +self.vm.hmp_qemu_io('source', 'write -B -P 2 %i 1M' % offset)
> +
> +# Start the block job
> +result = self.vm.qmp('blockdev-mirror',
> + job_id='mirror',
> + filter_node_name='mirror-node',
> + device='source',
> + target='target',
> + sync='full',
> + copy_mode='active-write')
> +self.assert_qmp(result, 'return', {})
> +
> +# Start some more requests
> +for offset in range(0, self.image_len, 1024 * 1024):
> +self.vm.hmp_qemu_io('mirror-node', 'write -B -P 3 %i 1M' % offset)
> +
> +# Wait for the READY event
> +self.wait_ready(drive='mirror')
> +
> +# Now start some final requests; all of these (which land on
> +# the source) should be settled using the active mechanism.
> +# The mirror code itself asserts that the source BDS's dirty
> +# bitmap will stay clean between READY and COMPLETED.
> +for offset in range(0, self.image_len, 1024 * 1024):
> +self.vm.hmp_qemu_io('mirror-node', 'write -B -P 4 %i 1M' % offset)
> +
> +if sync_source_and_target:
> +# If source and target should be in sync after the mirror,
> +# we have to flush before completion

Not sure I understand this requirement. Does it apply to libvirt and users too?
I.e., is it part of the interface? Why can't mirror_complete do it
automatically?

Fam

> +self.vm.hmp_qemu_io('mirror-node', 'flush')
> +self.potential_writes_in_flight = False
> +
> +self.complete_and_wait(drive='mirror', wait_ready=False)
> +
> +def testActiveIO(self):
> +self.doActiveIO(False)
> +
> +def 

Re: [Qemu-block] [PATCH 05/18] block/mirror: Convert to coroutines

2017-09-18 Thread Fam Zheng
On Wed, 09/13 20:18, Max Reitz wrote:
> In order to talk to the source BDS (and maybe in the future to the
> target BDS as well) directly, we need to convert our existing AIO
> requests into coroutine I/O requests.
> 
> Signed-off-by: Max Reitz 
> ---
>  block/mirror.c | 134 +
>  1 file changed, 78 insertions(+), 56 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 4664b0516f..2b3297aa61 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -80,6 +80,9 @@ typedef struct MirrorOp {
>  QEMUIOVector qiov;
>  int64_t offset;
>  uint64_t bytes;
> +
> +/* Set by mirror_co_read() before yielding for the first time */
> +uint64_t bytes_copied;
>  } MirrorOp;
>  
>  typedef enum MirrorMethod {
> @@ -101,7 +104,7 @@ static BlockErrorAction 
> mirror_error_action(MirrorBlockJob *s, bool read,
>  }
>  }
>  
> -static void mirror_iteration_done(MirrorOp *op, int ret)
> +static void coroutine_fn mirror_iteration_done(MirrorOp *op, int ret)
>  {
>  MirrorBlockJob *s = op->s;
>  struct iovec *iov;
> @@ -138,9 +141,8 @@ static void mirror_iteration_done(MirrorOp *op, int ret)
>  }
>  }
>  
> -static void mirror_write_complete(void *opaque, int ret)
> +static void coroutine_fn mirror_write_complete(MirrorOp *op, int ret)
>  {
> -MirrorOp *op = opaque;
>  MirrorBlockJob *s = op->s;
>  
>  aio_context_acquire(blk_get_aio_context(s->common.blk));
> @@ -158,9 +160,8 @@ static void mirror_write_complete(void *opaque, int ret)
>  aio_context_release(blk_get_aio_context(s->common.blk));
>  }
>  
> -static void mirror_read_complete(void *opaque, int ret)
> +static void coroutine_fn mirror_read_complete(MirrorOp *op, int ret)
>  {
> -MirrorOp *op = opaque;
>  MirrorBlockJob *s = op->s;
>  
>  aio_context_acquire(blk_get_aio_context(s->common.blk));
> @@ -176,8 +177,11 @@ static void mirror_read_complete(void *opaque, int ret)
>  
>  mirror_iteration_done(op, ret);
>  } else {
> -blk_aio_pwritev(s->target, op->offset, >qiov,
> -0, mirror_write_complete, op);
> +int ret;
> +
> +ret = blk_co_pwritev(s->target, op->offset,
> + op->qiov.size, &op->qiov, 0);
> +mirror_write_complete(op, ret);
>  }
>  aio_context_release(blk_get_aio_context(s->common.blk));
>  }
> @@ -242,53 +246,49 @@ static inline void mirror_wait_for_io(MirrorBlockJob *s)
>   *  (new_end - offset) if tail is rounded up or down due to
>   *  alignment or buffer limit.
>   */
> -static uint64_t mirror_do_read(MirrorBlockJob *s, int64_t offset,
> -   uint64_t bytes)
> +static void coroutine_fn mirror_co_read(void *opaque)
>  {
> +MirrorOp *op = opaque;
> +MirrorBlockJob *s = op->s;
>  BlockBackend *source = s->common.blk;
>  int nb_chunks;
>  uint64_t ret;
> -MirrorOp *op;
>  uint64_t max_bytes;
>  
>  max_bytes = s->granularity * s->max_iov;
>  
>  /* We can only handle as much as buf_size at a time. */
> -bytes = MIN(s->buf_size, MIN(max_bytes, bytes));
> -assert(bytes);
> -assert(bytes < BDRV_REQUEST_MAX_BYTES);
> -ret = bytes;
> +op->bytes = MIN(s->buf_size, MIN(max_bytes, op->bytes));
> +assert(op->bytes);
> +assert(op->bytes < BDRV_REQUEST_MAX_BYTES);
> +op->bytes_copied = op->bytes;
>  
>  if (s->cow_bitmap) {
> -ret += mirror_cow_align(s, &offset, &bytes);
> +op->bytes_copied += mirror_cow_align(s, &op->offset, &op->bytes);
>  }
> -assert(bytes <= s->buf_size);
> +/* Cannot exceed BDRV_REQUEST_MAX_BYTES + INT_MAX */
> +assert(op->bytes_copied <= UINT_MAX);
> +assert(op->bytes <= s->buf_size);
>  /* The offset is granularity-aligned because:
>   * 1) Caller passes in aligned values;
>   * 2) mirror_cow_align is used only when target cluster is larger. */
> -assert(QEMU_IS_ALIGNED(offset, s->granularity));
> +assert(QEMU_IS_ALIGNED(op->offset, s->granularity));
>  /* The range is sector-aligned, since bdrv_getlength() rounds up. */
> -assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE));
> -nb_chunks = DIV_ROUND_UP(bytes, s->granularity);
> +assert(QEMU_IS_ALIGNED(op->bytes, BDRV_SECTOR_SIZE));
> +nb_chunks = DIV_ROUND_UP(op->bytes, s->granularity);
>  
>  while (s->buf_free_count < nb_chunks) {
> -trace_mirror_yield_in_flight(s, offset, s->in_flight);
> +trace_mirror_yield_in_flight(s, op->offset, s->in_flight);
>  mirror_wait_for_io(s);
>  }
>  
> -/* Allocate a MirrorOp that is used as an AIO callback.  */
> -op = g_new(MirrorOp, 1);
> -op->s = s;
> -op->offset = offset;
> -op->bytes = bytes;
> -
>  /* Now make a QEMUIOVector taking enough granularity-sized chunks
>   * from s->buf_free.
>   */
>  qemu_iovec_init(>qiov, nb_chunks);
>  while (nb_chunks-- > 0) {
>